Messages - paul

#16
Rather interestingly, I migrated from four "big name" other NMS solutions to NetXMS.

Today we had a major power spike across the metro area. ORION was asleep at the wheel, overloaded by bloated functions; NetXMS, with my Up/Down dashboard on a 1-minute auto refresh, was loved by all.

NetXMS has its nuances, but for us, today demonstrated that NetXMS can easily hold its own, and more, for standard NMS and SNMP-based monitoring.

#17
These nodes do respond to SNMP, but DNS may not have been updated at the point in time the first trap is received.

Agree - the filtering script on discovery looks to be the best option, but status polling at one-minute intervals may be easier - just have the discovery script wait two minutes and see if the device has been updated by then. I could also possibly use sysDescription in the filter script as a "do not add if match" condition.

Will look again into the MAC matching theory, and once I have identified what PrimaryName is as a $node variable, I will try again.

#18
Went to 2.2.16 as my primary work around.

No filtering scripts but all policies had the default "stop processing" unticked.

I still seem to get a CPU usage increase around 10:00 each day, but now, with my 4000 duplicate nodes deleted plus "stop processing" ticked, we do not seem to be impacted.

I know how I can probably check (now that I have a few minutes to think about it): open the alarms with a create time around when I think the problem starts and compare it to the event time. They are normally the same, so any difference indicates the Event Processor is lagging.

Will look a bit closer and see if anything stands out.




#19
Forgot to include what it actually looks like - on a dedicated 50 inch screen :)

#20
One more additional change.

Hit this last night around 01:30 - a node was down with SYS_NODE_UNREACHABLE. I was thinking... WTF... is that not down?

The answer is actually yes AND no - https://www.netxms.org/forum/configuration/netxms-polling-fails-no-errornotification/

I don't really care, holistically speaking, if it has an upstream root cause - it is still down from a connectivity perspective and from an impact perspective.

I went to add my nodeDown script action to the SYS_NODE_UNREACHABLE event policy - hold on - there is no alarm policy for this!! So instead I added the SYS_NODE_UNREACHABLE event to the SYS_NODE_DOWN Event Processing Policy rule, which already has my nodeDown script.

Whilst I am being creative, I realized that my whole unknown/unmanaged auto-bind and auto-unbind issue could be addressed with scripts hung off the SYS_NODE_UNKNOWN or SYS_NODE_UNMANAGED events. Will go down that path if I need further work, but for now, fingers crossed - this is the last!!
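For reference, the shape of EPP script action I have in mind is roughly this sketch (the container "AllNetworkDown" and the "nodeUpDown" custom attribute are the ones I use elsewhere in this thread; I am assuming BindObject() is available alongside the UnbindObject() I already use, and that $node is the event's source node when run from the Event Processing Policy):

```
// Sketch of an EPP script action fired on SYS_NODE_DOWN / SYS_NODE_UNREACHABLE.
// Assumption: $node is the source node of the event being processed.
container = FindObject("AllNetworkDown");
if ((container != null) && ($node != null))
{
   // Record the state and bind the node into the "down" container
   SetCustomAttribute($node, "nodeUpDown", "Down");
   BindObject(container, $node);
}
```

A matching action on SYS_NODE_UP would do the reverse (set the attribute back and UnbindObject).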
#21
Feature Requests / Just a comment on NX-1642
July 27, 2019, 05:41:41 PM
Both BIG-IP F5s and Citrix NetScalers embed the ASCII character codes of the user-defined object name as the trailing OIDs.

You can see this in the walk data vs. the text version:
ASCII: 47.67.111.109.109.111.110.47
CHAR:  /Common/

Once I had worked that out, monitoring F5s and NetScalers became much, much easier. Having it indexed would be even better.
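The conversion itself is trivial once you see it - each character of the name becomes its ASCII code in the OID suffix. A rough NXSL sketch (assuming substr()/ord() and the "." string concatenation operator behave as per the NXSL docs - check there for exact semantics):

```
// Convert a name such as "/Common/" into the dotted OID suffix
// that F5 / NetScaler tables append to their base OIDs.
name = "/Common/";
suffix = "";
for (i = 1; i <= length(name); i++)   // NXSL substr is 1-based
{
   if (i > 1)
      suffix = suffix . ".";
   suffix = suffix . ord(substr(name, i, 1));  // ASCII code of each character
}
// suffix should now read "47.67.111.109.109.111.110.47"
```

Append that suffix to the table column's base OID and you have the exact instance to poll.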

Lots of other things are like this - Cisco blade centre blades and things like that.

From the archives - ASCII to char conversion:
https://ee.hawaii.edu/~tep/EE160/Book/chap4/subsection2.1.1.1.html

I use Citrix now, but F5 SNMP monitoring is reasonably well documented here: https://techdocs.f5.com/kb/en-us/products/big-ip_ltm/manuals/product/bigip-external-monitoring-implementations-13-1-0/13.html

The only problem is that the MIB with the indexed OIDs is F5-BIGIP-LOCAL-MIB.txt, and they do not explain how to do the conversion - though they do explain everything else.

The challenge here is that we selectively monitored the server / vserver / node tables for specifics, such as whether a member was UP or DOWN and whether membership was at 100%.

Perhaps include an SNMP wizard that walks the MIB, builds the various tables (ltmPoolMember is one example of a table we used), and allows the user to select which items in the table to monitor.

Our way? Walk the whole MIB, copy our virtual server name, paste it into an ASCII converter, then copy the resulting ASCII string and paste it onto the end of the OID for the table we wanted to monitor.

#22
Stumbled across this 2.2.16 fix:
NX-869 (nxevent sometimes hangs after sending event)

I wonder.....

Will do an upgrade to 2.2.16 or 3.0 in a couple of weeks so will see then.
#23
Yeah - the node name goes in column 1 and the DNS name in column 2.

As my node names match my DNS names, I just duplicated the first column, renamed the header to Address, and ran the bulk import - 670 additional nodes - sweet :)

Found this solution in an old example somewhere.

Sometimes the obvious is obvious - sometimes it just bangs you in the head.
#24
Well - it all looked hopeful - except my nodes have an IP as their primary host name and not the node name, which prevents the IP from being updated.

Cannot find anywhere how to set primaryHostName for a node, or even where it is stored. The only reference I found is here:
https://wiki.netxms.org/wiki/NXSL:CreateNode

I want to add to my status or configuration poll hook - or even just as a standalone script - a way to set the primary host name to the node name.

if ($node->primaryHostName != $node->name)
{
   if (($node->name != null) && ($node->name != ""))
   {
      // hypothetical - I cannot find a real NXSL function that does this
      ChangeThePrimaryHostNameOfTheObject($node, $node->name);
   }
}


Any help would be appreciated.

#25
Be patient enough and all gets revealed!!

https://www.netxms.org/forum/configuration/actions-parameter/

$event->message gives me the message based on the event message, whereas $alarmMessage gives me the whole alarm text - which is available as "%m".

This not only solves my Description vs. Detail problem for ticket creation, but also gives me a basis to design the event message as the summary and the alarm message as the detail.

Happy enough with this as a good reason for having both, and for when and where to use each :)

#26
Sometimes doing this at 2 in the morning does not give the best outcome.

Solved the last part very simply.....

Step 1 - Unmanage Node
Step 2 - Manually trigger status poll on that node (I unmanaged it so I am already in the right place).

Step 2 triggers the code in the status poll hook, which picks up status == 6, removes the node from AllNetworkDown, AND sets nodeUpDown to Maintenance.
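The hook addition itself boils down to something like this sketch (status 6 being the unmanaged state, with the container and custom attribute names I use elsewhere in this thread):

```
// Status poll hook sketch: when a node has been unmanaged (status 6),
// flag it as Maintenance and drop it from the "down" container.
if ($node->status == 6)
{
   SetCustomAttribute($node, "nodeUpDown", "Maintenance");
   UnbindObject(FindObject("AllNetworkDown"), $node);
}
```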

Working well for 24 hours so far - auto adding and removing nodes as they go up and down - Dashboard with down only nodes updating flawlessly as well.

You may wonder - how can I track my unmanaged nodes? They are still in their original container - I simply go into the object view, select nodes, and sort by status. Very, very simple.


#27
Turned off discovery from traps, exported all the duplicates and then deleted them.

The long-term solution, when I have time, will be a filter script on discovery: when a node is discovered from a trap, the filter script does a DNS resolution, and if the name comes back as an existing node, the node is not added - status polling, with "resolve DNS names to IP addresses" set to yes, will simply update the existing node that now has a new IP.

It means I lose that first trap, but I can live with that. I just hope the above lets the DNS-to-IP resolution pick up the new IP and then the SYS_NODE_UP fires against the new IP - which I expect it will.
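The filter script I am picturing is roughly this - purely a sketch: ResolveDNSName() is a hypothetical reverse-lookup helper (I have not confirmed the real NXSL call), and I am assuming the filter receives the discovered address as $node->ipAddr and accepts the node by returning true:

```
// Discovery filter sketch: skip trap-discovered nodes whose DNS name
// already exists, and let status polling update the existing node's IP.
name = ResolveDNSName($node->ipAddr);   // hypothetical helper, not confirmed NXSL API
if ((name != null) && (FindObject(name) != null))
{
   // Name already exists as a node - do not add a duplicate
   return false;
}
return true;
```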
#28
My 2 cents as a novice user.

tomaskir is right, but possibly too blunt. I think what he meant to say was that unless you use the proper access mechanisms supplied with NetXMS, you cannot expect to get consistent results.

From a console performance perspective, I expect NetXMS is designed to deliver its performance using caching, and only by using the proper mechanisms can you expect things like the server cache to reflect the updates you apply.

I am not surprised that the console and server have a discrepancy, but there are simply no options available, because nothing is broken or working other than as designed.

Other than trying the web console vs. the management console, there are not a lot of "official" options - in fact, none.

Basically, once the decision was made to alter the database directly, you created your own customized version. Having done so, you now appear to be asking how to make further customizations to get the console working in your customized installation. Your problem is not that the console will not display - your problem is that the customizations that write to the database directly are causing it. The solution is simple: get rid of them and use what is there already.

My suggestion - similar to tomaskir's - is to step back from the customization by rewriting the SQL that updates the database directly into NXSL, or by using the API.

Either of these will update the NetXMS DB using the correct method, allowing the console to work and display properly - as per your original intent.

It is probably about 20-30 lines of NXSL - add it as a scheduled task and run it every 5 minutes.

For each alarm older than 5 minutes, set it to acknowledged.
For each alarm older than 1 hour, set it to terminated.
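Roughly the sort of thing I mean - a sketch only. I am assuming an alarm-enumeration function like GetAlarms() and the acknowledge()/terminate() methods on the alarm object; check the NXSL documentation for the exact names before using this:

```
// Scheduled-task sketch: auto-acknowledge alarms after 5 minutes and
// terminate them after 1 hour, instead of updating the database directly.
now = time();
foreach (alarm : GetAlarms())          // assumed enumeration function
{
   age = now - alarm->creationTime;
   if (age >= 3600)
      alarm->terminate();
   else if (age >= 300)
      alarm->acknowledge();
}
```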

That achieves everything you are attempting to achieve without affecting the automation that emails out the alerts, AND it gives you a console that updates.

#29
No transforming - straight from the Performance tab - NetXMS template standard monitors.
NetXMS server: events processed for last minute

As for capacity to process: when we restart, we rip through the backlog in about a minute, which, for a backlog of 300k, equates to roughly 3,000-5,000 events per second.

I think that, in reality, we quite possibly are only processing about 300 "events" per minute - not very much at all.

We do get a large number of traps that I have set not to write to the event log, and hence no "event processing" occurs for them - I might look to see if anything stands out there.

Disk I/O queue averages 0.5, apart from the 10 minutes when the backup runs, when it spikes out to 15 for a minute or so. It sits on flash storage, so I expect this to be fast.

CPU - plenty of cores, and also plenty of memory - now. I noticed that "physical memory used by process" had a steady climb - I assume it was the ever-growing node count. Deleting my 3000 duplicate nodes got 1 GB back; I had added 4 GB just to be safe, and will watch to see how this goes.

Not in huge hurry to reproduce this - just really strange that it happens.

The really annoying part is that, as the Event Processor queue is not a performance DCI, I cannot track this issue any better.

On a huge plus side - if you are able to process 220 events per second and I am doing about 300 per minute, I clearly have capacity to "lean" on NetXMS more. I have ORION in my sights, and my only concern (apart from the grind) was capacity.

I think that the 4 GB of extra RAM, the 2 extra cores, and the deletion of my duplicate nodes have probably pushed this issue down so it is not triggering like it was. Will keep an eye on this though - and would really like to see Event Processor Queue as a template item. NetXMS critically depends upon this - so instrument it - please!!


#30
Nope - the unmanaged ones show even though unmanaged is not ticked :(

This indicates that, for the Status Map dashboard element, NetXMS is still treating the node-down critical alarms as part of the status value, rather than hard-setting the status to unmanaged.

Whatever this filter does, it does not exclude unmanaged nodes with critical alarms. This is the XML from the dashboard that is not working:
<severityFilter>63</severityFilter>

So I tried an addition to my status poll hook to remove the node if it is unmanaged and already set to Down - but this is not working either :(

I am guessing this is failing because the node is unmanaged - so it is not polled, and so not updated.

if ($node->status == 6)
{
   state = GetCustomAttribute($node, "nodeUpDown");
   if (state != null)
   {
      if (state imatch "Down")
      {
         SetCustomAttribute($node, "nodeUpDown", "Unmanaged");
         UnbindObject(FindObject("AllNetworkDown"), $node);
      }
   }
}



Will try this as a scheduled task - 15 minute intervals probably.

I assume I will need to work out some looping logic - for each $node in the container "AllNetworkDown".

Or can I point the script at the container as the object and have it loop through each node - in which case I can just use the script as above?
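The looping version I am picturing, run as a scheduled task, is something like this sketch (GetObjectChildren() is my assumption for enumerating the container's members - check the NXSL docs for the real enumeration mechanism):

```
// Scheduled-task sketch: walk every node bound under "AllNetworkDown"
// and apply the same unmanaged/Down check as the status poll hook.
container = FindObject("AllNetworkDown");
if (container != null)
{
   foreach (n : GetObjectChildren(container))   // assumed enumeration function
   {
      state = GetCustomAttribute(n, "nodeUpDown");
      if ((n->status == 6) && (state != null) && (state imatch "Down"))
      {
         SetCustomAttribute(n, "nodeUpDown", "Unmanaged");
         UnbindObject(container, n);
      }
   }
}
```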