Strange "node up" SMS notifications

Started by Andreas E. Mueller, January 22, 2016, 03:28:25 PM


Andreas E. Mueller

Hello again,

I have a problem (and I think it may be a bug):

Every time I restart a specific nxagent proxy node (version 2.0.1 running on FreeBSD 10.2), the NetXMS server sends "Node up" notifications for each node this agent is monitoring, although they never had a node down status (status change is set to 5 polls, which means I should have about 5 minutes for restarting).
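
The behaviour expected from a 5-poll threshold can be sketched roughly as below. This is an illustration in Python, not NetXMS code: the class name, method name, and event handling are hypothetical and only model the idea that a status change needs several consecutive failed polls.

```python
class StatusTracker:
    """Hypothetical debounce: report 'down' only after N consecutive failed polls."""

    def __init__(self, required_polls=5):
        self.required_polls = required_polls
        self.failed = 0        # consecutive failed polls so far
        self.is_down = False   # True once the node has been reported down

    def record_poll(self, reachable):
        """Return an event name if this poll changes the state, else None."""
        if reachable:
            self.failed = 0
            if self.is_down:
                self.is_down = False
                return "SYS_NODE_UP"
            return None  # node was never reported down, so no "up" event
        self.failed += 1
        if not self.is_down and self.failed >= self.required_polls:
            self.is_down = True
            return "SYS_NODE_DOWN"
        return None


tracker = StatusTracker(required_polls=5)
# Four failed polls (for example during a short proxy restart), then success.
short_outage = [tracker.record_poll(False) for _ in range(4)]
recovery = tracker.record_poll(True)
```

Under this model a restart that finishes within four polls produces neither a down nor an up event, which is the behaviour described as missing here.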

Also important: there are no "node up" events matching these SMS notifications in the Alarm Log. Could it be that there is some kind of message queue for SMS notifications that is stuck and needs to be cleared? Restarting nxagentd and the NetXMS server didn't solve the problem.

Maybe someone has an idea.

Here is the nxagentd.conf of the proxy node:


MasterServers = XXX.XXX.XXX.XXX

MaxSessions = 1024
StartupDelay = 60

EnableProxy = yes
EnableSNMPProxy = yes

SubAgent = /usr/local/lib/libnsm_ping.so
SubAgent = /usr/local/lib/libnsm_ecs.so
SubAgent = /usr/local/lib/libnsm_freebsd.so
SubAgent = /usr/local/lib/libnsm_logwatch.so
SubAgent = /usr/local/lib/libnsm_filemgr.so

*PING
Timeout = 1000
PacketRate = 12
Target = XXX.XXX.XXX.XXX:some_node_name
...


Also note, I've set StartupDelay to 60 seconds, but this didn't help.

Greetings from Germany,
Andreas

tomaskir

Events and alarms are 2 different things.

Check the event log to see whether there are indeed node up events for the nodes.
If so, check how your EPP is handling those events.

Andreas E. Mueller

Hello and thanks for your help again tomaskir.

Well, you are right, there are "node up" events in the event log (now I'm starting to understand how things really work). The processing for SYS_NODE_UP events is also right, but... before I upgraded from 2.0-RC1 to 2.0.1, when a proxy agent was offline, the nodes polled through this agent were just marked as "status unknown" until the proxy agent came up again. Even then, they didn't trigger the node up event. So it was fine, because I was only notified about the proxy node going offline and not about all the polled nodes behind it.

Since the update, the polled nodes are no longer marked as "status unknown", and they trigger node_up immediately after the proxy node comes online again.

In the meantime, I haven't changed anything in the EPP since the update (or even before it). It was running fine.

Is there any way to suppress triggering sys_node_up (or even sys_node_down) when the proxy agent is failing? Or is there something I need to further check in the EPP configuration? I mean... It's only the sys_node_down and sys_node_up events I want to get notified about.

tomaskir

What happens is that sys_node_down events are generated when nodes behind a proxy go down, but those events are correlated to the proxy down event.
Since they are correlated, they don't enter EPP, and therefore will not create alarms (since EPP defines alarm creation).
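
To illustrate the idea, here is a minimal Python sketch of correlated events being kept out of policy processing. It is not NetXMS code: the `Event` class, the `root_id` field, and `events_for_epp` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    event_id: int
    name: str
    source: str
    root_id: Optional[int] = None  # id of the root-cause event, if correlated


def events_for_epp(events):
    """Keep only events that are not correlated to a root-cause event."""
    return [e for e in events if e.root_id is None]


events = [
    Event(1, "SYS_NODE_DOWN", "proxy"),
    Event(2, "SYS_NODE_DOWN", "node-a", root_id=1),  # correlated to proxy down
    Event(3, "SYS_NODE_DOWN", "node-b", root_id=1),  # correlated to proxy down
    Event(4, "SYS_NODE_UP", "proxy"),
    Event(5, "SYS_NODE_UP", "node-a"),  # not correlated, so it is processed
]
processed = [(e.name, e.source) for e in events_for_epp(events)]
```

In this model only the proxy's own down event and all of the up events reach the policy, which matches the notifications described in this thread.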

It seems sys_node_up events are not correlated to the proxy up event, which sounds like correct behavior to me.
After all, how long after proxy comes back up should you correlate sys_node_up event to the proxy up event, for which nodes, etc.?
It opens up many weird behaviors and edge cases - it would be really hard to know (if not impossible) which sys_node_up should and should not be correlated.

As for whether there were any changes in behavior from 2.0-RC1 to 2.0.1, I don't know.
I personally don't generate SMS/emails when things return to a "good" state, only when they go down.

And why it behaves differently for you before and after the upgrade, I also don't know :(

Andreas E. Mueller

OK, I think I can live with this for now. I just took out the server actions for when a node is UP again. Maybe it's better to be informed only when a node is down. The reason I used to send node up notifications is just to know whether a colleague has managed to solve the problem (or whether there was a temporary uplink interruption) while I'm not in the office.

But anyway, it would be good if somebody who knows more about this could explain it here. Just out of curiosity. :D

Victor Kirhenshtein

Hi!

Do you have zones with the agent set as proxy for the zone, or is it set as proxy for individual nodes?

Best regards,
Victor

Andreas E. Mueller

Hi Victor,

I have my agents set as proxies per zone. For two reasons:

1. Keeping a clean Zone and network structure.
2. Saving time, because I don't need to set the proxy node in each node individually.

But I think you know the benefits of "zoning" better than I do, because you are the developer. Excellent and easily understandable coding style, by the way. It's not a compliment, just my honest opinion and a true fact. ;) I was a developer too before starting to work at the company I'm with now.

I wish I had some time to work with C++ and Java code occasionally too, but I have enough to do with administering, maintaining and scripting Unix and Linux boxes here. But somehow... it's really fun and I learn a lot of new things every week.

Greets from Germany,
Andreas

Andreas E. Mueller

Hello again,

Well, I let it run without node up notifications over the weekend. I had 3 node_downs although these nodes were not really down (I let it scan 5 polls before status change). In the event log I still have multiple node_ups from nodes that were never down.

Any suggestions where to look first? As mentioned, there is one proxy agent per zone. These issues started after the update to 2.0.1. Has something changed in the event generation? At least according to the information in this thread, it has nothing to do with the event processing policy.

Greets,
Andreas

Victor Kirhenshtein

Hi,

The logic behind node down/node up is as follows:

If a status poll detects that all interfaces and agents are not responding, it marks the node as unreachable. Then it checks the network path; for nodes behind a proxy this involves checking the proxy. If the server considers the proxy node down, it marks the current node as having network path problems and generates a SYS_NODE_UNREACHABLE event; otherwise SYS_NODE_DOWN is generated. When contact with the node is restored, SYS_NODE_UP is generated. To distinguish between SYS_NODE_UP generated for down nodes and for nodes with network path problems, SYS_NODE_UP comes with parameter number 1 set to 0 if the node recovered from the "down" state, or 1 if it recovered from the "network path problem" state.
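
The description above can be sketched in Python as follows. This is a simplified model, not the server's actual implementation: the function names, the state strings, and the SMS rule are assumptions made for illustration. The event names and the meaning of parameter 1 are taken from the description.

```python
def classify_status_poll(node_reachable, proxy_reachable, previous_state):
    """Return (new_state, event, params) for one status poll.

    previous_state is one of "up", "down", "unreachable" (hypothetical labels).
    params[0] stands for "parameter number 1" of the event.
    """
    if node_reachable:
        if previous_state == "down":
            return "up", "SYS_NODE_UP", [0]   # recovered from "down"
        if previous_state == "unreachable":
            return "up", "SYS_NODE_UP", [1]   # recovered from path problem
        return "up", None, []
    if previous_state != "up":
        # Simplification: an already reported problem is not reported again.
        return previous_state, None, []
    if not proxy_reachable:
        return "unreachable", "SYS_NODE_UNREACHABLE", []
    return "down", "SYS_NODE_DOWN", []


def should_send_sms(event, params):
    """Hypothetical rule: notify on real outages and recoveries from them only."""
    if event == "SYS_NODE_DOWN":
        return True
    if event == "SYS_NODE_UP":
        return params[0] == 0
    return False
```

With a rule like `should_send_sms`, a SYS_NODE_UP that follows a proxy outage (parameter 1 set to 1) would not trigger an SMS, while a recovery from a real down state still would.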

Is it possible that your proxy node was down?

Best regards,
Victor

Andreas E. Mueller

Hi Victor,

I just figured out that the NetXMS server is ignoring the setting for how many polls are needed for a status change. In my case it's set to 5 polls, both in the device settings and as the global default (see screenshot).

Is there some kind of server setting I need to set?

Greets,
Andreas

tomaskir


Andreas E. Mueller

OK, I don't know if this helps, but I can definitely say that it was working in version 2.0-M4. :)