Node down vs Node is unreachable by ICMP

Started by Egert143, February 22, 2022, 08:30:07 AM

Egert143

Hello

Could someone please explain how it is decided which error is displayed when a node is down?

For example, when a node is really down (no network connectivity, etc.), NetXMS sometimes shows the status as Node down, other times as unreachable by ICMP, and sometimes both. What is the deciding factor?

Also, the ICMP check seems very sensitive: maybe 1 packet is lost and the status is already unreachable.

Egert

Victor Kirhenshtein

Hi,

there are two independent polls that produce those two events. One is the status poll, which checks node connectivity using the agent, SNMP, and finally ICMP if neither the SNMP agent nor the NetXMS agent is responding. This poll can generate the events SYS_SNMP_UNREACHABLE, SYS_AGENT_UNREACHABLE, SYS_NODE_UNREACHABLE, and SYS_NODE_DOWN. The difference between SYS_NODE_UNREACHABLE and SYS_NODE_DOWN is that "unreachable" is generated when the server can detect a network failure between itself and the target node.
The other is the ICMP poll, which was added in version 4.0. Its main use is to do regular ICMP pings and collect response time and packet loss statistics. In addition, it generates the event SYS_ICMP_UNREACHABLE when a node is not responding to ICMP. If that happens after SYS_NODE_DOWN, the ICMP unreachable event is correlated to the node down event. However, because the two types of poll run asynchronously, a node can actually become unreachable after the last status poll run and before the next ICMP poll run, which produces an ICMP unreachable event that cannot be correlated to a node down event.
You can effectively hide those ICMP unreachable events by disabling or removing the rules that generate alarms from them, and by unchecking the "write to event log" option in the event template configuration for SYS_ICMP_UNREACHABLE and SYS_ICMP_OK.

Best regards,
Victor
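
As a rough illustration of the status poll described above, the decision can be sketched in Python. This is not NetXMS source code: the function name `status_poll_events`, its boolean parameters, and the assumption that the node has both an agent and SNMP configured (and that their events are raised alongside the node-level event) are hypothetical. Only the event names are taken from the post.

```python
def status_poll_events(agent_ok, snmp_ok, icmp_ok, path_failure_detected):
    """Return the event names a status poll might raise (illustrative only).

    Assumes a node with both a NetXMS agent and SNMP configured.
    """
    events = []
    if not agent_ok:
        events.append("SYS_AGENT_UNREACHABLE")
    if not snmp_ok:
        events.append("SYS_SNMP_UNREACHABLE")
    # ICMP is the last resort: it only decides the outcome when neither
    # the agent nor SNMP is responding.
    if not agent_ok and not snmp_ok and not icmp_ok:
        if path_failure_detected:
            # The server found a network failure between itself and the node.
            events.append("SYS_NODE_UNREACHABLE")
        else:
            events.append("SYS_NODE_DOWN")
    return events


# Node does not answer at all, and no upstream failure is known.
print(status_poll_events(False, False, False, path_failure_detected=False))
# Same node, but the server has detected a failure on the path to it.
print(status_poll_events(False, False, False, path_failure_detected=True))
```

The separate ICMP poll is not modelled here; it runs on its own schedule, which is why its SYS_ICMP_UNREACHABLE event can arrive before the status poll has had a chance to raise SYS_NODE_DOWN.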

Egert143

Thanks for the explanation.

Is it possible to adjust the ICMP polling parameters to accept more ping loss before reporting ICMP unreachable? I changed PingTimeout from 1500 to 3000, but it doesn't change much.

Egert

Storm-Donovan

Hi Egert,

I had the same issue with Nodes Unreachable by ICMP. Victor and I determined that the ICMP poll was ignoring the server setting for PollCountForStatusChange (mine is set to 5 and it alarmed on one missed ping). I believe they are working on fixing this. Until that happens, I just disabled the Event Processing Policy.

Cheers,
Donovan.

Egert143

Thanks for the info! Will be waiting for the fix then. :)

Egert

dreamscape

Can I ask if this is still an issue, please? For most devices, our system regularly reports events in this order, for example:

09:41:32 - SYS_ICMP_UNREACHABLE
09:41:32 - SYS_NODE_MAJOR
09:42:34 - SYS_ICMP_OK
09:42:34 - SYS_NODE_NORMAL

dreamscape


Victor Kirhenshtein

Hi,

the problem with ICMP polls ignoring "poll count for status change" has been fixed. Do you have the poll count set to more than 1 and still get regular ICMP unreachable events?

Best regards,
Victor

dreamscape

Hi Victor,

Sorry I didn't see you had commented.

I've upped my poll count from the default of 1 to 2, but I'm still getting these messages.

See below

[Attachment: ICMP.jpg]

Victor Kirhenshtein

Can you try setting the debug level to 7 for the tag poll.icmp and, when the problem repeats, send me an extract from the server log filtered by object name?

Best regards,
Victor
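
One way to produce such an extract is a small Python filter. This is a sketch, not a NetXMS tool: the helper name `filter_log` is hypothetical, and the line layout (timestamp, `*D*` severity marker, bracketed tag, message) is assumed from the log excerpt posted later in this thread. The second and third sample lines are made up to show non-matching input.

```python
import re

# Layout assumed from the excerpt in this thread:
# "2023.05.31 09:56:11.912 *D* [poll.icmp          ] message..."
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\*(?P<sev>[A-Z])\* "
    r"\[(?P<tag>[^\]]*)\] "
    r"(?P<msg>.*)$"
)


def filter_log(lines, tag, object_name):
    """Yield log lines with the given debug tag whose message mentions object_name."""
    for line in lines:
        text = line.rstrip("\n")
        m = LINE_RE.match(text)
        if m and m.group("tag").strip() == tag and object_name in m.group("msg"):
            yield text


SAMPLE = [
    # Real line from the thread.
    "2023.05.31 09:57:18.292 *D* [poll.icmp          ] Node::icmpPollAddress(GARDENW10-1 [14138], PRI, 192.168.1.85):: ping status=2 RTT=0",
    # Hypothetical line for a different object.
    "2023.05.31 09:57:19.000 *D* [poll.icmp          ] Node::icmpPollAddress(OTHERNODE [1], PRI, 192.0.2.1):: ping status=0 RTT=1",
    # Hypothetical line with a different tag.
    "2023.05.31 09:57:20.000 *D* [poll.status        ] status poll of GARDENW10-1 finished",
]

for hit in filter_log(SAMPLE, "poll.icmp", "GARDENW10-1"):
    print(hit)
```

To run it against a real server log, pass an open file object instead of `SAMPLE`; the log file location depends on the installation.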

dreamscape

Hi Victor,

Please see below for one client. It went unreachable (major) at 09:58:23 and was back to normal at 10:00:42.

Thanks
Nick

2023.05.31 09:56:11.912 *D* [poll.icmp          ] Node::icmpPollAddress(GARDENW10-1 [14138], PRI, 192.168.1.85):: calling IcmpPing(192.168.1.85,3,1500,46)
2023.05.31 09:56:11.912 *D* [poll.icmp          ] Node::icmpPollAddress(GARDENW10-1 [14138], PRI, 192.168.1.85):: ping status=0 RTT=0
2023.05.31 09:57:16.994 *D* [poll.icmp          ] Node::icmpPollAddress(GARDENW10-1 [14138], PRI, 192.168.1.85):: calling IcmpPing(192.168.1.85,3,1500,46)
2023.05.31 09:57:18.292 *D* [poll.icmp          ] Node::icmpPollAddress(GARDENW10-1 [14138], PRI, 192.168.1.85):: ping status=2 RTT=0
2023.05.31 09:58:22.612 *D* [poll.icmp          ] Node::icmpPollAddress(GARDENW10-1 [14138], PRI, 192.168.1.85):: calling IcmpPing(192.168.1.85,3,1500,46)
2023.05.31 09:58:23.800 *D* [poll.icmp          ] Node::icmpPollAddress(GARDENW10-1 [14138], PRI, 192.168.1.85):: ping status=2 RTT=0
2023.05.31 09:59:33.326 *D* [poll.icmp          ] Node::icmpPollAddress(GARDENW10-1 [14138], PRI, 192.168.1.85):: calling IcmpPing(192.168.1.85,3,1500,46)
2023.05.31 09:59:33.326 *D* [poll.icmp          ] Node::icmpPollAddress(GARDENW10-1 [14138], PRI, 192.168.1.85):: ping status=0 RTT=2
2023.05.31 10:00:42.308 *D* [poll.icmp          ] Node::icmpPollAddress(GARDENW10-1 [14138], PRI, 192.168.1.85):: calling IcmpPing(192.168.1.85,3,1500,46)
2023.05.31 10:00:42.308 *D* [poll.icmp          ] Node::icmpPollAddress(GARDENW10-1 [14138], PRI, 192.168.1.85):: ping status=0 RTT=0
2023.05.31 10:01:52.415 *D* [poll.icmp          ] Node::icmpPollAddress(GARDENW10-1 [14138], PRI, 192.168.1.85):: calling IcmpPing(192.168.1.85,3,1500,46)
2023.05.31 10:01:52.415 *D* [poll.icmp          ] Node::icmpPollAddress(GARDENW10-1 [14138], PRI, 192.168.1.85):: ping status=0 RTT=1
2023.05.31 10:02:57.539 *D* [poll.icmp          ] Node::icmpPollAddress(GARDENW10-1 [14138], PRI, 192.168.1.85):: calling IcmpPing(192.168.1.85,3,1500,46)
2023.05.31 10:02:57.539 *D* [poll.icmp          ] Node::icmpPollAddress(GARDENW10-1 [14138], PRI, 192.168.1.85):: ping status=0 RTT=0
2023.05.31 10:04:02.641 *D* [poll.icmp          ] Node::icmpPollAddress(GARDENW10-1 [14138], PRI, 192.168.1.85):: calling IcmpPing(192.168.1.85,3,1500,46)
2023.05.31 10:04:02.641 *D* [poll.icmp          ] Node::icmpPollAddress(GARDENW10-1 [14138], PRI, 192.168.1.85):: ping status=0 RTT=0
2023.05.31 10:05:07.675 *D* [poll.icmp          ] Node::icmpPollAddress(GARDENW10-1 [14138], PRI, 192.168.1.85):: calling IcmpPing(192.168.1.85,3,1500,46)
2023.05.31 10:05:07.675 *D* [poll.icmp          ] Node::icmpPollAddress(GARDENW10-1 [14138], PRI, 192.168.1.85):: ping status=0 RTT=0

Victor Kirhenshtein

According to the log, the server got two timeouts, at 09:57:18 and 09:58:23 (status=2 means timeout). The next ping was successful again. Do you think there could have been network issues at that time that caused loss of ICMP packets? Also, if you change the poll count to 3 or 4, does that fix the situation?

Best regards,
Victor
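
To see how the poll count interacts with the ping results in the log above, the sequence can be replayed through a simple debounce model. This is a sketch under stated assumptions, not the NetXMS implementation: the function `replay` is hypothetical, and it assumes the state changes only after `poll_count` consecutive results that disagree with the current state, in both directions. The timestamps and status values are copied from the posted log.

```python
# (time of result, ping status) taken from the log excerpt; 0 = success, 2 = timeout.
RESULTS = [
    ("09:56:11", 0),
    ("09:57:18", 2),
    ("09:58:23", 2),
    ("09:59:33", 0),
    ("10:00:42", 0),
    ("10:01:52", 0),
    ("10:02:57", 0),
    ("10:04:02", 0),
    ("10:05:07", 0),
]


def replay(results, poll_count):
    """Return (time, event) pairs under an assumed debounce rule.

    The state flips only after poll_count consecutive results that
    contradict the current state (assumption, applied both ways).
    """
    reachable = True
    streak = 0  # consecutive results that disagree with the current state
    events = []
    for ts, status in results:
        ok = status == 0
        if ok == reachable:
            streak = 0
            continue
        streak += 1
        if streak >= poll_count:
            reachable = ok
            streak = 0
            events.append((ts, "SYS_ICMP_OK" if ok else "SYS_ICMP_UNREACHABLE"))
    return events


for count in (1, 2, 3):
    print(count, replay(RESULTS, count))
```

Under this model, a poll count of 2 yields an unreachable event at 09:58:23 and a recovery at 10:00:42, which matches the times Nick reported, while a poll count of 3 yields no events for this sequence because only two consecutive pings timed out.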

dreamscape

Hi Victor,

Thanks for the reply. I'm unsure about network issues; the network is small, around 50 switches.

I will increase the poll count to 3 and see.

Thanks
Nick