Additional status code - down - achieved close enough equivalent

Started by paul, June 21, 2019, 01:31:26 PM


paul

When looking at status codes, I am forced to infer that a node is down from a status of critical raised by a Node_Down alarm. A node being down is a state as well as a status, and I would like to know it separately. We have plenty of nodes with critical alarms, but I would like to show down as a unique code / colour / status.

https://wiki.netxms.org/wiki/NXSL:NetObj - add it as Status ID = 9.

I hit this problem because I have nodes that are not down; their only fault is an outstanding Node_Down alarm.

With this option, I could also auto-clear any Node_Down alarm on a node whose status is not down ("Status ID != 9").

I can also have a specific container for Status ID = 9.


paul

After running NetXMS for a while now, I see State as different from Status. Status is an escalating level of severity that aggregates upwards.

State is separate and indicates whether a node is contactable via any of its configured mechanisms.

State should be shown as up / down relating to communication with the node and should show "since"

Status shows criticality of alarms assigned to the node.

It should be possible to set a dependency globally and override it at the node level - suspend DCI if State = down. This prevents DCI alarms and also prevents the template DCIs being disabled / removed for nodes that lose connectivity.

For SNMP DCIs, a node that drops connectivity should drop back to status polling (sysdescription only) and once State = UP (response received), DCI polling resumes.

The existing settings use Status interchangeably for Status Polling (Up/Down) and Status Alarms (Minor/Major/Critical). The result of Status Polling should instead be held in a variable called State or NodeState and displayed separately in the General box on the Overview page.

Status polling should also check for any Alarms for Node=Down and automatically clear them if found.
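The behaviour asked for above can be modelled in a few lines. This is only a sketch of the proposed logic in Python, not NetXMS code: the `Node` class, the `status_poll` function and the alarm key "NODE_DOWN" are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Node:
    """Hypothetical model of a node; not a NetXMS class."""
    name: str
    state: str = "UP"  # reachability only: "UP" or "DOWN"
    since: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    dci_suspended: bool = False
    alarms: set = field(default_factory=set)  # keys of open alarms


def status_poll(node: Node, reachable: bool, now: datetime) -> None:
    """Apply one status poll result to the node."""
    new_state = "UP" if reachable else "DOWN"
    if new_state != node.state:
        # Record the moment of the change so the UI can show "since"
        node.state = new_state
        node.since = now
    # DCI collection runs only while the node is reachable
    node.dci_suspended = not reachable
    if reachable:
        # A reachable node must not keep a stale node-down alarm
        node.alarms.discard("NODE_DOWN")
```

With this model, a node that answers a poll while still carrying a "NODE_DOWN" alarm loses only that alarm; any other open alarms are left alone.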

paul

Was having a discussion with an icinga2 user and they have STATE as a severity which gets confusing - a Node that is down is critical - not because it has a status of critical - but because it has a state of down.

I like the simplistic view where I can look at the General panel of a Node and can see just what and where the problem is. Is the Node up or down, what is OK and what is NOT. This is standard Kepner Tregoe Problem Solving / Situational Analysis, in case anyone was wondering :)

A Node STATE is always UP or DOWN (or unmanaged)

A Node STATUS is OK / Warning / Major / Critical - made up of the following - each displayed in the General panel.

Interface Status (user selectable to affect Node STATUS - has a dependency on STATE being UP)
DCI Exception Status (user selectable to affect Node STATUS - has a dependency on Node STATE being up - show last value if STATE = down)
TRAP Exception Status (user selectable to affect Node STATUS - still relevant whether Node is up or down - shows what was happening on the way down)
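To make the split concrete, here is a small Python model of the list above. It is a sketch of the proposal only, not of how NetXMS calculates status: `Component`, `node_status`, `affects_status` and `requires_up` are hypothetical names, and the severity order is taken from the OK / Warning / Major / Critical line.

```python
from dataclasses import dataclass
from typing import List

# Severity order as listed in the post, lowest first
SEVERITY = ["OK", "Warning", "Major", "Critical"]


@dataclass
class Component:
    """One contributor to Node STATUS (hypothetical model)."""
    name: str
    severity: str
    affects_status: bool = True  # the "user selectable" switch
    requires_up: bool = True     # dependency on Node STATE being UP


def node_status(state: str, components: List[Component]) -> str:
    """Return the worst severity among the components that count."""
    worst = 0
    for c in components:
        if not c.affects_status:
            continue
        if c.requires_up and state != "UP":
            continue  # ignored while the node is unreachable
        worst = max(worst, SEVERITY.index(c.severity))
    return SEVERITY[worst]
```

In this model a node that is DOWN with a critical interface and a trap warning reports a STATUS of Warning, because only the trap component is allowed to count while STATE is DOWN. The fact that the node is down is carried by STATE alone.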

paul

Interim solution - from 2012  https://www.netxms.org/forum/configuration/event-based-on-resolving-an-alarm-timeout/ - thanks again Victor  :) :) :)

Unless there is a newer better solution, this will likely do it

Create script "OnNodeDown"
SetCustomAttribute($node, "nodeUpDown", "Down");

Create script "OnNodeUp"
SetCustomAttribute($node, "nodeUpDown", "Up");

Then create an "execute script" action to execute the "OnNodeDown" script, and add it in the event processing policy to the rule processing the SYS_NODE_DOWN event.
Then create an "execute script" action to execute the "OnNodeUp" script, and add it in the event processing policy to the rule processing the SYS_NODE_UP event.

And then, create a container based on the Custom Attribute "nodeUpDown" - auto bind and auto unbind - based on following
updown = GetCustomAttribute($node, "nodeUpDown");
return (updown == "Down");


Which should leave me with a container with nodes whose only attribute that made them a member is that they are Down.

If I could - I would add this custom value to the General Object view so I could see - easily - when a node I am looking at is down or up.

My "unknown" nodes that respond to both icmp and snmp - but are not recognized - would benefit from this immensely.




paul

OK - the idea was well intended - but failed. I could not get GetCustomAttribute or $node->customattribute working.

Working version is as follows:

Create Container called AllDown.

Create the following - do this in the order listed as each depends on the previous being done.

Script: onNodeDown

SetCustomAttribute($node, "nodeUpDown", "Down");
BindObject(FindObject("AllDown"), $node);


Script: onNodeUp
SetCustomAttribute($node, "nodeUpDown", "Up");
UnbindObject(FindObject("AllDown"), $node);


Actions
NXSL Script
Name: NodeDown
Script name: onNodeDown

NXSL Script
Name: NodeUp
Script name: onNodeUp

Event Processing Policy
Show alarm when node is down
Add under Action ==> Server actions ==> NodeDown

Terminate node down alarms when node is up
Add under Action ==> Server actions ==> NodeUp

Click on the save icon.

All done :)

For auto binding and auto unbind on DCI polling - the following needs to be added to the Hook:ConfigurationPoll

Note: It is based on the node name matching a pattern - in this case, containing the letter a. The container is called "alldown" (change to AllDown if using the above as well).

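// Only consider nodes whose name contains the letter "a" (case-insensitive)
if ($node->name imatch "a") {
   state = GetCustomAttribute($node, "nodeUpDown");
   if (state != null) {
      if (state imatch "Down") {
         // Node is flagged as down: make sure it is in the container
         BindObject(FindObject("alldown"), $node);
      }
      if (state imatch "Up") {
         // Node is flagged as up: make sure it is not in the container
         UnbindObject(FindObject("alldown"), $node);
      }
   }
}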


The final part is to add a hook to Status Polling so that a Node that does not respond to ICMP and SNMP (or whichever protocol is enabled, e.g. isSNMP=yes) has the custom attribute nodeUpDown set to Down.

Doing it this way gets me round the need to rely on Event Policy processing - something which, for us, has its own problems.






Tursiops

I believe you could put that last script as auto-bind rule onto the containers themselves, as these are checked on a configuration poll.
However, depending on how often your configuration poll runs, neither will provide you a very "live" view of what's down.
Adding the BindObject code directly into Hook::StatusPoll might work better for that?

In regards to having the state in a node's Object Overview, you can create a PushDCI (you can check if it exists as part of the status script and if not, create it), then push the current state into it. That DCI could be displayed on the Object Overview page. Here's the downside: I am not sure if it is possible to perform this last step via NXSL, but using the Java API, e.g. via a cronjob to set this field on relevant DCIs once a day or similar should work via https://www.netxms.org/documentation/javadoc/latest/org/netxms/client/datacollection/DataCollectionItem.html#setShowInObjectOverview-boolean-
I'm not a developer and haven't ventured past NXSL, so I can't provide an actual script for this.

paul

Agreed - status polling hook is the better place. My config polling is once per hour - much too slow to be useful for this.

There is the PING subagent - it returns 10000 when there is no response - but if I am only using SNMP from the central NetXMS server, I don't know if the subagent is available.
https://www.netxms.org/forum/configuration/subagent-icmp-ping/ .

But this has already been suggested and advised against anyway.
https://www.netxms.org/forum/feature-requests/icmp-ping-as-internal-parameter/

BUT....the idea of putting them on the containers is good as that gives me a "once-per-hour" refresh for any node that was not picked up otherwise.

It nearly got a look-in here:
https://www.netxms.org/forum/configuration/simple-ping-monitoring/
But it does not explain how NetXMS remembers that a node is down, so that it does not generate a new SYS_NODE_DOWN, or when it generates a SYS_NODE_UP event.

It was even closer here:
https://www.netxms.org/forum/general-support/trying-to-under-stand-polling/
But it went off track by advising the use of $node->status rather than saying which internal variable is used. $node->status has 9 possible values (0 to 8), none of which is down, which is the whole reason users keep asking. This error was repeated here:
https://wiki.netxms.org/wiki/Step_by_step_service_monitoring - again, just because a node is critical, does not mean it is down.

As suggested - I will use the Status Polling hook - runs every minute, and I will just have to persevere trying to identify the internal variable NetXMS uses when ICMP response = none and where NetXMS stores the trigger for SYS_NODE_DOWN and SYS_NODE_UP.

My biggest problem - devices with dynamic IP addresses - I am turning off DiscoveryViaTrap = yes and letting DNStoIP on status polling update the IP - will see if the node comes back up once the IP has changed.

**** Update ****
I exported all my duplicates and just deleted them - will come back to them later.

The status poll hook is working - working fine - but.......
I have put my Nodes that have issues into Maintenance but my Dashboard severity filter for Status Map does not allow me to exclude Maintenance.
I will have to use unmanage which prevents polling - and prevents the Node clearing the nodeUpDown as there is no polling.

For the purpose of pure up/down monitoring - Dashboard view - either the node is unmanaged (no polling) or it is polled and the up/down status shows.

Will see how this goes over the next week - I think this might be close enough.

**** Update 2 ****
Dashboard view was showing the unmanaged devices even though Unmanaged was unticked :(
Thinking back - NetXMS likes specifics - the unmanaged nodes were also in maintenance.
Updated each node - leave maintenance - disappeared from the display (as hoped but not as expected) :)
Finally - working as desired.
************

paul

Nope - the unmanaged ones show even though unmanaged is not ticked :(

This indicates that NetXMS is treating the Node-Down critical alarms as part of the status value still for dashboard element Status Map rather than hard setting the status to be unmanaged.

Whatever this does - it does not exclude Unmanaged Nodes with Critical Alarms. This is the XML from the dashboard that is not working.
<severityFilter>63</severityFilter>
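One way to read that value, assuming the filter is a bitmask in which bit n stands for status code n (an assumption, nothing in this thread confirms it), is to decode it against the status codes 0 to 8 from the NXSL wiki page. The helper name `statuses_in_filter` is made up for this sketch.

```python
# Status codes 0..8 as listed on the NXSL NetObj wiki page
STATUS_NAMES = [
    "Normal", "Warning", "Minor", "Major", "Critical",
    "Unknown", "Unmanaged", "Disabled", "Testing",
]


def statuses_in_filter(mask: int) -> list:
    """List the status names whose bit is set in the mask.

    Assumes bit n of the mask corresponds to status code n.
    """
    return [name for code, name in enumerate(STATUS_NAMES) if mask & (1 << code)]


print(statuses_in_filter(63))
# ['Normal', 'Warning', 'Minor', 'Major', 'Critical', 'Unknown']
```

If that reading is right, 63 already leaves out Unmanaged (code 6), so a node that still appears is presumably being matched on Critical rather than on Unmanaged.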

So I tried an addition to my status poll hook to remove the node if it is unmanaged and it is already set to Down - but this is not working either :(

I am guessing this is failing as the node is unmanaged - so not polled - so not updated.

// Status 6 = Unmanaged
if ($node->status == 6) {
   state = GetCustomAttribute($node, "nodeUpDown");
   if (state != null) {
      if (state imatch "Down") {
         // Unmanaged node still flagged as down: relabel it and remove it from the container
         SetCustomAttribute($node, "nodeUpDown", "Unmanaged");
         UnbindObject(FindObject("AllNetworkDown"), $node);
      }
   }
}



Will try this as a scheduled task - 15 minute intervals probably.

I assume I will need to work out some looping logic - for each $node in container "AllNetworkDown"

Or, can I point the script at the container as the object and it will loop through each node - in which case I can just use the script as per above?
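The looping logic itself is simple, whichever way the script ends up being attached. Below is a Python model of the sweep, not NXSL: the container is modelled as a set of node names and each node as a plain dict, and `sweep_container` is a hypothetical name. How a scheduled task would list the children of "AllNetworkDown" in NetXMS is left open here.

```python
UNMANAGED = 6  # status code used in the status poll hook above


def sweep_container(container: set, nodes: dict) -> list:
    """Remove unmanaged nodes that are still flagged Down.

    container: names of the nodes currently bound to the container
    nodes: name -> {"status": int, "nodeUpDown": str}
    Returns the names that were removed.
    """
    removed = []
    # sorted() makes a copy, so the set can be changed while looping
    for name in sorted(container):
        node = nodes[name]
        if node["status"] == UNMANAGED and node.get("nodeUpDown") == "Down":
            node["nodeUpDown"] = "Unmanaged"
            container.discard(name)
            removed.append(name)
    return removed
```

Nodes that are unmanaged but not flagged Down, and nodes that are Down but still managed, are left where they are.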



paul

Sometimes doing this at 2 in the morning does not give the best outcome.

Solved the last part very simply.....

Step 1 - Unmanage Node
Step 2 - Manually trigger status poll on that node (I unmanaged it so I am already in the right place).

Step 2 triggers the code in the status poll hook to pick up the status = 6, removes the Node from AllNetworkDown AND sets nodeUpDown to Maintenance.

Working well for 24 hours so far - auto adding and removing nodes as they go up and down - Dashboard with down only nodes updating flawlessly as well.

You may wonder - how can I track my unmanaged nodes? - they are still in their original container - I simply go into object view - select nodes - sort by status - very, very simple.



paul

One more additional change.

Hit this last night around 01:30 - a node was down with SYS_NODE_UNREACHABLE. I was thinking ... WTF ... is that not down?

The answer actually is yes AND no - https://www.netxms.org/forum/configuration/netxms-polling-fails-no-errornotification/

I don't really care, holistically speaking, if it has an upstream root cause - it is still down from a connectivity perspective and from an impact perspective.

Added my NodeDown script action to the SYS_NODE_UNREACHABLE event policy --- hold on --- there is no alarm policy for this!! Added the SYS_NODE_UNREACHABLE event to the SYS_NODE_DOWN Event Processing Policy rule - which already has my NodeDown script action.

Whilst I am being creative - realized that my whole unknown / unmanaged auto bind and auto unbind issues could all be addressed with scripts hung off the SYS_NODE_UNKNOWN or SYS_NODE_UNMANAGED events.  Will go down that path if I need further work, but for now, crossed fingers - this is the last!!

paul

Forgot to include what it actually looks like - on a dedicated 50 inch screen :)