Fault Tolerant Monitoring

Started by danieljdoughty, October 01, 2008, 08:56:41 PM


danieljdoughty

My environment has a fair number of glitches. They can come about because the network is a bit slow during one NetXMS check, or because a process is being restarted. We really don't want to get notified every time a process is down, only if it's down for an extended period of time.

In OpenView we would traditionally define a rule that says: don't page me until the number of cron processes is less than one for 3 tests in a row. I'm trying to use your threshold methods, like "average value" and "mean deviation", but I'm not having much luck.

I set it to page only if the average value for 3 consecutive samples is less than 1, but I got a page after the first minute. I'm guessing this is due to integer rounding or something along those lines.
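One possible explanation that does not depend on rounding: if the cron process count is normally exactly 1, the average of any 3 samples drops below 1 as soon as a single sample is 0. The sketch below only illustrates that arithmetic. It is not NetXMS code, and the function name `average_breaches` and its parameters are hypothetical.

```python
def average_breaches(samples, window=3, threshold=1):
    """Return True if the mean of the last `window` samples is below `threshold`.

    Hypothetical helper for illustration; not part of NetXMS.
    """
    recent = samples[-window:]
    return sum(recent) / len(recent) < threshold


# Process count is normally 1; one glitch (a single 0) is enough to breach.
print(average_breaches([1, 1, 0]))  # mean is 2/3, which is below 1
print(average_breaches([1, 1, 1]))  # mean is 1, which is not below 1
```

Under that assumption, an "average of 3 samples below 1" rule behaves like "any one of the last 3 samples was 0", which is the opposite of the intended fault tolerance.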

I know I could define this externally in a script, but that makes the solution much harder to deploy. Is there a better way to set up this sort of logic in NetXMS?

Thanks,
Dan

P.S. If need be, I can enter this request in Russian, but would prefer not to. It's been years since I wrote in Russian and I've never been good at typing in it.

Victor Kirhenshtein

Hello!

It seems to be quite a common problem. In the upcoming release, I will change the threshold configuration so that you will be able to specify for how many consecutive polls the value must be above/below the threshold to raise an event.
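The consecutive-poll logic described here can be sketched roughly as follows. This is an illustration of the idea only, not the NetXMS implementation; the class name `ConsecutiveThreshold` and its parameters `limit` and `required` are assumptions made for the example.

```python
class ConsecutiveThreshold:
    """Raise an event only after `required` consecutive polls fall below `limit`.

    Hypothetical sketch; names and behavior are assumptions, not the NetXMS API.
    """

    def __init__(self, limit=1, required=3):
        self.limit = limit
        self.required = required
        self.streak = 0  # number of consecutive polls below the limit so far

    def update(self, value):
        """Feed one poll result; return True when the event should be raised."""
        if value < self.limit:
            self.streak += 1
        else:
            self.streak = 0  # any good poll resets the count
        # Fire exactly once, at the moment the streak reaches the required length.
        return self.streak == self.required


check = ConsecutiveThreshold(limit=1, required=3)
polls = [1, 0, 1, 0, 0, 0, 0]  # cron process count at each poll
events = [check.update(v) for v in polls]
print(events)  # [False, False, False, False, False, True, False]
```

A single missed poll (the lone 0 at the second position) is ignored because the next good poll resets the streak. Only the third failure in a row raises the event, and it is raised once rather than on every subsequent failing poll.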

Best regards,
Victor

danieljdoughty

Great, thanks for the update.  Wasn't sure if I was just missing something about how to configure it.

Thanks,
Dan