Triggerhappy "Node Down" alarm

Started by Borgso, November 06, 2014, 10:48:39 AM


Borgso

We have been running NetXMS 1.2.9 stable for a long time now, but feel the urge to upgrade and get the new features.

Since 1.2.15 we have been running test servers, but we keep getting the same problem: "Node Down" alarms even when the node is not actually having a problem.

The production server does not have this problem.
The production and test servers are on the same VMware host.
The test server is a copy of production, then upgraded.

I removed all custom templates and am only using ICMP ping, and the problem still occurs with both agent and non-agent nodes.
If I decrease the number of nodes, the problem goes away.

Does anyone else have the same problem?
What could I do to debug this?

Borgso

Anyone having same problem?

trinidadrancheria

Quote from: Borgso on November 11, 2014, 03:35:27 PM
Anyone having same problem?

Yep. It's quite random. It seems like if it misses one ping, it calls the node down.

Victor Kirhenshtein

Hi,

Could you please run the server with debug level 6 for some time and then send me the log (pointing out the moments of the false node down events)?
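
For example (assuming a standard installation; the exact invocation may differ on your system), the debug level can be set on the command line when starting the server:

netxmsd -D 6

or in netxmsd.conf:

DebugLevel = 6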

Best regards,
Victor

Borgso

This is what shows up when a node is detected as down but is still pingable.

If you want more of the log, I can mail it to you.

[19-Nov-2014 14:40:22.297] [DEBUG] Node 3404 "NODE-001-UNIT" queued for status poll
[19-Nov-2014 14:40:22.298] [DEBUG] Starting status poll for node NODE-001-UNIT (ID: 3404)
[19-Nov-2014 14:40:32.802] [DEBUG] StatusPoll(NODE-001-UNIT): bAllDown=true, dynFlags=0x00001001
[19-Nov-2014 14:40:32.802] [DEBUG] Node::checkNetworkPath(NODE-001-UNIT [3404]): cannot find interface object for primary IP
[19-Nov-2014 14:40:32.802] [DEBUG] Node::checkNetworkPath(NODE-001-UNIT [3404]): trace available, 0 hops, incomplete
[19-Nov-2014 14:40:32.802] [DEBUG] Node::checkNetworkPath(NODE-001-UNIT [3404]): will do second pass
[19-Nov-2014 14:40:32.802] [DEBUG] Finished status poll for node NODE-001-UNIT (ID: 3404)
[19-Nov-2014 14:40:32.803] [DEBUG] CorrelateEvent: event SYS_NODE_DOWN id 3267062 source NODE-001-UNIT [3404]
[19-Nov-2014 14:40:32.803] [DEBUG] EVENT 28 (ID:3267062 F:0x0001 S:4 TAG:"") FROM NODE-001-UNIT: RSE081: Node down
[19-Nov-2014 14:40:32.803] [DEBUG] CorrelateEvent: event SYS_IF_DOWN id 3267061 source NODE-001-UNIT [3404]
[19-Nov-2014 14:40:32.804] [DEBUG] EVENT 5 (ID:3267061 F:0x0001 S:2 TAG:"" CORRELATED) FROM NODE-001-UNIT: Interface "unknown" changed state to DOWN (IP Addr: 10.120.29.84/255.255.255.192, IfIndex: 1)
[19-Nov-2014 14:40:32.804] [DEBUG] CorrelateEvent: event SYS_NODE_CRITICAL id 3267063 source NODE-001-UNIT [3404]
[19-Nov-2014 14:40:32.804] [DEBUG] EVENT 10 (ID:3267063 F:0x0001 S:4 TAG:"") FROM NODE-001-UNIT: Node status changed to CRITICAL

KjellO

I tried a dirty patch to icmp.cpp, which improved the situation a lot. The idea is to add a random delay before pinging; otherwise a lot of pings are fired almost simultaneously.


--- icmp.cpp_bak        2014-11-28 09:23:44.559231668 +0100
+++ icmp.cpp    2014-11-28 09:28:48.212833890 +0100
@@ -227,8 +227,15 @@

    // Do ping
    nBytes = dwPacketSize - sizeof(IPHDR);
+   UINT32 seed = time(0) * dwAddr; // attempt to create different seeds for each call to each node
+   int iNumRetriesOrig = iNumRetries;
    while(iNumRetries--)
-   {
+   {  // random delay before start pinging
+      int min = 500 * (iNumRetriesOrig - iNumRetries+1); // lower bound grows with each retry (1000 ms on the first pass, then 1500, ...)
+      int max = 1000 + min;  // increased random window between retries
+      int delay = min + (rand_r(&seed) % (int)(max - min + 1));
+      ThreadSleepMs(delay);
+
       dwRTT = 0;  // Round-trip time for current request
       request.m_icmpHdr.m_wId = ICMP_REQUEST_ID;
       request.m_icmpHdr.m_wSeq++;
@@ -364,7 +371,7 @@
#endif
       }

-      ThreadSleepMs(500);     // Wait half a second before sending next packet
+      // ThreadSleepMs(500);     // Wait half a second before sending next packet // We do random delay in beginning of loop instead
    }

stop_ping:


This feels like a workaround rather than a fix for the original problem, but it can at least help in pinpointing the problem. It looks like something outside NetXMS's boundaries can't keep up, like the Linux kernel, the VMware hosts, or the network switches...
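
For reference, here is a minimal standalone sketch of the same jittered-retry idea (illustrative names and numbers only, not the actual NetXMS code; rand_r is POSIX, so assume Linux):

// Illustrative sketch of the jittered-retry delay used in the patch above; not NetXMS code.
// Each attempt sleeps for a random delay whose window starts later on every retry,
// so echo requests to many nodes spread out instead of firing almost simultaneously.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Delay for a given attempt (0-based): attempt 0 -> 0..1000 ms, attempt 1 -> 500..1500 ms, etc.
static int jitteredDelayMs(int attempt, unsigned int *seed)
{
   int minDelay = 500 * attempt;
   return minDelay + rand_r(seed) % 1001;   // add 0..1000 ms of jitter on top of the growing lower bound
}

int main()
{
   // Mix something node-specific into the seed so different nodes get different delays;
   // 0x0A781D54 is 10.120.29.84 from the log above, standing in for the node's address.
   unsigned int seed = (unsigned int)time(NULL) * 0x0A781D54u;
   for (int attempt = 0; attempt < 3; attempt++)
      printf("attempt %d: sleep %d ms before sending the echo request\n",
             attempt, jitteredDelayMs(attempt, &seed));
   return 0;
}

The point is simply that the lower bound of the window grows with each retry while the random 0..1000 ms component spreads the nodes apart in time.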

Worth noting that this installation contains a lot of nodes with neither a NetXMS nor an SNMP agent; I guess this causes a lot more ICMP pinging during status polls than in installations where most nodes run agents?