"SYS_NODE_DOWN" event not processed by EPP

Started by Dani@M3T, April 25, 2015, 07:17:57 PM

Dani@M3T

We use NetXMS V2.0-M3 and see a very strange but critical problem.

No "SYS_NODE_DOWN" events are processed by the Event Processing Policy (EPP) anymore. I have double- and triple-checked the EPP rules; everything is still configured as it was when it worked. I also tested with the rule for "SYS_NODE_DOWN" placed first. "SYS_NODE_UP" and all other events are processed fine. The "SYS_NODE_DOWN" events themselves are generated, so that part is OK. It occurs on all tested nodes.
I turned on server debug logging (level=9). There I see for every generated event a line like this:
[25-Apr-2015 17:50:04.531] [DEBUG] EPP: processing event 1857868
But there is no such line for the "SYS_NODE_DOWN" events. I have the debug log but would rather not attach it to a forum post.
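
For anyone who wants to run the same check on a saved debug log, here is a minimal Python sketch. It assumes only the "EPP: processing event" line format quoted above; the function name and the sample lines are made up for illustration.

```python
import re

# Pattern taken from the single EPP debug line quoted above (assumed format).
EPP_LINE = re.compile(r"\bEPP: processing event (\d+)")

def epp_processed_ids(lines):
    """Return the set of event IDs for which an EPP processing line exists."""
    ids = set()
    for line in lines:
        m = EPP_LINE.search(line)
        if m:
            ids.add(int(m.group(1)))
    return ids

# Hypothetical sample: one EPP line as quoted above plus one unrelated line.
sample = [
    "[25-Apr-2015 17:50:04.531] [DEBUG] EPP: processing event 1857868",
    "[25-Apr-2015 17:50:04.531] [DEBUG] some unrelated line",
]
processed = epp_processed_ids(sample)
print(1857868 in processed)
print(1857867 in processed)
```

An event ID that shows up in the event log but not in this set did not reach the EPP.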

This may have started with the upgrade to V2.0-M3, but I am not sure.
At the moment I don't know where the problem could be. Any help is appreciated.

Victor Kirhenshtein

Hi,

are you sure those events are generated (you can check it with the event monitor or the event log)? Also, SYS_NODE_UNREACHABLE may be generated instead of SYS_NODE_DOWN if the server considers that there is a network failure in between.

Best regards,
Victor

Dani@M3T

Hi Victor

I definitely get the "SYS_NODE_DOWN" event (double checked).
When I shut down a test printer, for example, I first get the events "SYS_IF_DOWN" and "SYS_NODE_CRITICAL" (in the same second), then 7 seconds later "SYS_NODE_DOWN".

thanks
Dani

Dani@M3T

I just updated to V2.0-M4, but the problem still exists.

Victor Kirhenshtein

Hi,

in the server log (at debug level 5 or higher) there should be a record like

EVENT 28 (ID: ...

for each SYS_NODE_DOWN event. Is it also marked CORRELATED?

Best regards,
Victor

Dani@M3T

Yes, the line in the log looks like this:
[25-Apr-2015 17:50:04.531] [DEBUG] EVENT 28 (ID:1857867 F:0x0001 S:4 TAG:"" CORRELATED) FROM mb460.domain.com: Node down

Victor Kirhenshtein

So that's the problem: the server found another event that it considers the root cause, and so it marks this event as correlated. Correlated events do not pass through the event processing policy. You can see the root cause event ID in the event log and check what the root cause event is (and whether the correlation was correct).
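
To spot such events quickly in a debug log, something like the following Python sketch could be used. The pattern is inferred from the single EVENT line quoted earlier in this thread, so other versions may log a different format; the function name and dictionary keys are hypothetical. It only reads the CORRELATED marker from the log and says nothing about how the server decides it.

```python
import re

# Pattern inferred from one quoted log line; treat the field layout as an assumption.
EVENT_LINE = re.compile(
    r'EVENT (?P<code>\d+) \(ID:(?P<id>\d+) F:(?P<flags>0x[0-9A-Fa-f]+) '
    r'S:(?P<severity>\d+) TAG:"(?P<tag>[^"]*)"(?P<correlated> CORRELATED)?\) '
    r'FROM (?P<source>\S+): (?P<message>.*)'
)

def parse_event(line):
    """Parse an EVENT debug line; return a dict, or None if the line does not match."""
    m = EVENT_LINE.search(line)
    if m is None:
        return None
    return {
        "code": int(m.group("code")),
        "id": int(m.group("id")),
        "correlated": m.group("correlated") is not None,
        "source": m.group("source"),
        "message": m.group("message"),
    }

line = ('[25-Apr-2015 17:50:04.531] [DEBUG] EVENT 28 (ID:1857867 F:0x0001 S:4 '
        'TAG:"" CORRELATED) FROM mb460.domain.com: Node down')
event = parse_event(line)
print(event["id"], event["correlated"])
```

Events for which "correlated" is true are the ones that will not show up in the EPP.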

Best regards,
Victor

Dani@M3T

Where exactly can I see the root cause event ID? Or should I send you the log (but not in the public forum)?

[24-Apr-2015 16:26:31.220] [DEBUG] EVENT 28 (ID:1852235 F:0x0001 S:4 TAG:"" CORRELATED) FROM node.domain.com: Node down
[24-Apr-2015 16:26:31.220] [DEBUG] Event::expandText(event=0x2a57620 sourceObject=392 template='Interface "%2" changed state to DOWN (IP Addr: %3/%4, IfIndex: %5)' alarmMsg='(null)' alarmKey='(null)')
[24-Apr-2015 16:26:31.220] [DEBUG] CorrelateEvent: event SYS_IF_DOWN id 1852234 source node.domain.com [392]
[24-Apr-2015 16:26:31.220] [DEBUG] CorrelateEvent: finished, rootId=1852235

Victor Kirhenshtein

As it turns out, the event ID and root cause event ID are not shown in the event log. I've fixed that for the next release, but for now you can check the server log for a message like

CorrelateEvent: event SYS_NODE_DOWN id ...

followed by

CorrelateEvent: finished, rootId=...

The number after rootId= is the ID of the root cause event. You can then look that event up either in the server log or directly in the database by running this SQL:

SELECT * FROM event_log WHERE event_id=<id>
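
If the log is large, pairing the two CorrelateEvent lines can be scripted. Below is a minimal Python sketch; it assumes the two message formats shown above and that each "finished" line belongs to the most recent "event" line, which may not hold if log lines from different threads are interleaved. The function name is hypothetical.

```python
import re

START = re.compile(r"CorrelateEvent: event (\S+) id (\d+)")
FINISH = re.compile(r"CorrelateEvent: finished, rootId=(\d+)")

def root_causes(lines):
    """Yield (event_name, event_id, root_id) for each start/finish pair in the log."""
    current = None
    for line in lines:
        m = START.search(line)
        if m:
            current = (m.group(1), int(m.group(2)))
            continue
        m = FINISH.search(line)
        if m and current is not None:
            yield current + (int(m.group(1)),)
            current = None

# Sample lines copied from the log excerpt posted earlier in this thread.
sample = [
    "[24-Apr-2015 16:26:31.220] [DEBUG] CorrelateEvent: event SYS_IF_DOWN id 1852234 source node.domain.com [392]",
    "[24-Apr-2015 16:26:31.220] [DEBUG] CorrelateEvent: finished, rootId=1852235",
]
for name, event_id, root_id in root_causes(sample):
    print(name, event_id, root_id)
```

The root_id values can then be used in the SELECT above.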

Best regards,
Victor

Dani@M3T

Thanks, Victor. I see the same root event ID in all test cases. In the DB I found that it is a "SYS_NETWORK_CONN_LOST" event (event code 50) from several days ago.
I don't understand why this very old event is relevant to the event correlation. Of course the NetXMS server has a network connection.

tomaskir

Quote from: Dani@M3T on April 30, 2015, 01:17:25 PM
Thanks, Victor. I see the same root event ID in all test cases. In the DB I found that it is a "SYS_NETWORK_CONN_LOST" event (event code 50) from several days ago.
I don't understand why this very old event is relevant to the event correlation. Of course the NetXMS server has a network connection.
It seems you have Beacon Probing enabled, and your NetXMS server sees the Beacon Probe hosts as down.

Check your server config variables regarding Beacon Probing.

Dani@M3T

Yes, Beacon Probing is enabled:
BeaconHosts: 172.16.100.8
BeaconPollingInterval: 1000
BeaconTimeout: 1000

It has been configured like this for a few months. The beacon host is the core switch.

I tested it: I can ping the core switch IP from the command line on the NetXMS server.

Now I changed the IP address to reversed order: "8.100.16.172". And now I get the alarms for SYS_NODE_DOWN events again! When I switch back to the normal IP address, I get a "SYS_NETWORK_CONN_LOST" event.
Is this a bug similar to the one in the PING subagent of V2.0-M3?
And why are only the "SYS_NODE_DOWN" events correlated with this and not all other events too? I get alarms from the same nodes when a DCI threshold is active but not when the node is down, which seems strange to me.
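
A reversed address like this is what you get when the four bytes of an IPv4 address are read in the wrong byte order. Whether that is really what happens inside the server is only an assumption here, not something confirmed in this thread. The small Python sketch below (function name hypothetical) just shows the effect on the configured beacon host address:

```python
import ipaddress
import struct

def swap_byte_order(addr):
    """Return the IPv4 address obtained by reading its four bytes in reverse order."""
    packed = ipaddress.IPv4Address(addr).packed        # 4 bytes in network (big-endian) order
    (misread,) = struct.unpack("<I", packed)           # deliberately read as little-endian
    return str(ipaddress.IPv4Address(misread))

print(swap_byte_order("172.16.100.8"))  # prints 8.100.16.172
```

Entering the address already reversed would cancel out such a swap, which would match the behavior described above.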




tomaskir

Definitely a bug :)
I registered it as #817 on the bug tracker.

As to why other events are not correlated, Victor will have to answer that.

Dani@M3T

Thanks, Tomas. OK, let's see what Victor says.

Dani

Victor Kirhenshtein

Only topology events are correlated. Data collection events can originate elsewhere (from a script DCI, for example) and may or may not be affected by network connectivity.

Best regards,
Victor