Change node down threshold for some (not all) nodes

Started by millerpaint, February 11, 2015, 01:59:37 AM


millerpaint

Greetings,

What is the best way to change the threshold for how long it takes NetXMS to determine that a node is down?  Right now, it seems to be about 60 seconds system-wide, which is perfect for most nodes.  But I have 50+ nodes for which I would like to raise the threshold to 5 minutes before an alert is sent.

Currently running version 1.2.17.  Any info is appreciated!


-Kevin C.

Victor Kirhenshtein

Hi,

You can change the required poll count for node interfaces. For example, if you set it to 5, the server will consider an interface down only after 5 consecutive polls report it as down. The node will not be marked as down as long as its interfaces are not marked as down.

Best regards,
Victor
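
For readers following along, the "consecutive polls" rule described above can be sketched roughly as below. This is only an illustration of the idea, not NetXMS code: the function name `first_confirmed_down` and the list-of-booleans input are hypothetical, and the real server logic may differ in detail.

```python
from typing import Optional, Sequence


def first_confirmed_down(poll_results: Sequence[bool],
                         required_poll_count: int) -> Optional[int]:
    """Return the index of the poll at which an interface would be
    confirmed down, or None if it never is.

    poll_results: True means the poll saw the interface up,
                  False means the poll saw it down (hypothetical input).
    required_poll_count: number of consecutive "down" polls needed.
    """
    consecutive_down = 0
    for index, is_up in enumerate(poll_results):
        if is_up:
            # Any successful poll resets the streak.
            consecutive_down = 0
        else:
            consecutive_down += 1
            if consecutive_down >= required_poll_count:
                return index
    return None


# One "up" poll in the middle resets the count, so with a required
# poll count of 3 the interface is confirmed down only at index 6.
history = [True, False, False, True, False, False, False]
print(first_confirmed_down(history, 3))
```

The point of the sketch is that a single failed poll (or a short outage) does not change the status; only an unbroken run of failed polls does.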

millerpaint

Hi Victor,

I have been testing by setting the node interface "required poll count" parameter to 300.  There is now a question mark on the node interface icon in the NetXMS console. Then I disabled the switch port that the node is plugged into.  But there was no change; I still get a node down alert in approximately 60 seconds.

Is there anything else I need to check?


-Kevin C.

Victor Kirhenshtein

A poll count of 300 means 5 hours to change status - is that what you intend?

Best regards,
Victor
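
The arithmetic behind the "5 hours" figure can be sketched as follows. It assumes a status poll interval of 60 seconds, which is inferred from this thread (a ~60 second default detection time, and 300 polls equalling 5 hours) rather than read from any configuration; the function names are hypothetical. Treat the result as a lower bound, since polls that hit timeouts take longer.

```python
import math

# Assumption: one status poll every 60 seconds (inferred from the thread).
ASSUMED_POLL_INTERVAL_SECONDS = 60


def time_to_down_seconds(required_poll_count: int,
                         poll_interval_seconds: int = ASSUMED_POLL_INTERVAL_SECONDS) -> int:
    """Approximate minimum time before a status change is confirmed."""
    return required_poll_count * poll_interval_seconds


def poll_count_for_delay(target_seconds: int,
                         poll_interval_seconds: int = ASSUMED_POLL_INTERVAL_SECONDS) -> int:
    """Smallest poll count whose minimum delay reaches the target."""
    return max(1, math.ceil(target_seconds / poll_interval_seconds))


# 300 polls at 60 s each is 18000 s, i.e. 5 hours.
print(time_to_down_seconds(300) / 3600)

# A 5 minute (300 s) target needs a poll count of 5.
print(poll_count_for_delay(300))
```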

millerpaint

Hi,

No, that is not my intention for the final configuration.  I just want 5 minutes.

But I had changed it to 5 at first, and that made no difference - it went to "node down" status in ~60 seconds.  So I decided to put in 300 (just a random big number) to see if that would make any difference, and it did not.


-Kevin C.

millerpaint

OK, good news, this is now working in my test environment  :)  A poll count of 5 rendered a node down alarm (and email) in ~8-9 minutes.  I thought it would be closer to 5 minutes, so not sure why it was longer, but I can adjust it down to get to 5 minutes if needed.

While working on this issue, the only thing I noticed is that many nodes had a lot of interfaces named "lost_interface_####".  Not just the nodes I was testing, but others as well.  I went through the console using a view filter of "lost_in", and deleted all of those interfaces.

After I did that, I put the poll count of 5 into my test node interface, and it worked!  So maybe I had some database corruption?


-Kevin C.

Victor Kirhenshtein

Looks like the result of database corruption. If I remember correctly, lost_interface_#### objects were created in some older versions after errors. The current version should not create them. As they likely had unreachable IP addresses, status polls just take longer due to timeouts.

Best regards,
Victor