Rapid Severity Changes Due To Event Flapping

Started by prichardson, December 15, 2022, 12:23:57 AM


prichardson

Overnight, an object suddenly started flooding NetXMS with severity changes caused by what I can only describe as event flapping. The timeline:
 
12/13 21:00 - Device goes unreachable due to network failure.
12/13 21:01 - Device returns to normal.
12/13 22:04 - Device goes unreachable again.
12/13 22:14 - Device comes back up and returns to normal.
12/13 22:22 - Object severity starts bouncing between CRITICAL and NORMAL multiple times per second; the Event column shows flapping between SYS_NODE_CRITICAL and SYS_NODE_UNKNOWN 20+ times per second. A network unreachable alert is logged at this time.
12/13 22:22 to 12/14 08:47 - Flapping continues unabated; two more network unreachable entries are logged at 22:28 and 22:54; none after that.  
12/14 08:47 - Placed in Maintenance Mode.
12/14 09:20 - Taken out of Maintenance Mode.  No further bouncing.

We confirmed through other sources there were no issues with the device or the connection to it.

We've seen this happen a couple of times in the past, but it occurs so infrequently that this is the first time we've had a chance to look into it. Searching the forum turned up nothing useful. The closest we could find was this thread: https://www.netxms.org/forum/general-support/too-much-alarm/ but the cause noted there did not apply here.

So we're at a loss as to what could have caused this flood of Event log entries. Any pointers on where to look for further clues would be appreciated.