Network interface monitoring - Incorrect data on some devices (hundreds of TB)

Started by komj, February 08, 2017, 12:08:38 PM


komj

Hello all,
first of all, I want to say that NetXMS is a great product. Thank you for that :)

Secondly,
At the company where I work, we are in a transition period from other monitoring tools to NetXMS. At the moment we are still testing whether NetXMS fulfills our needs, and we have a problem that I will describe below.
This happens on multiple devices (different vendors, operating systems, etc.); below I will use only one device as an example.

We configured the monitoring of network interfaces via SNMP as you can see in the attached image (nic-mon-conf.jpg). This is working fine most of the time.

The problem arises after a network interface has been monitored for some time: the reported traffic is in the hundreds of terabytes, as you can see in the attached image (nic-mon-incorrect.jpg).

I also attached an image where traffic monitoring looks ok (nic-min-ok.jpg).

I also attached an image of the value history for one of the DCIs (nic-mon-value-history.jpg). That seems to be the moment when the problem happens.

I assume that this is not related to NetXMS in any way, since NetXMS is just getting the value from the device. On the other hand, I don't see this behavior with other network monitoring tools, for example The Dude; check the attached image (nic-mon-dude.jpg). All these graphs from The Dude and NetXMS are from the same device and the same time interval.

I know that this problem can be mitigated with a transformation script, but I would like to know the root cause of the issue.

With all said, my questions are:
1. Does anybody else have this problem?
2. Any ideas why this happens?

Thanks to anybody taking time to read this and to respond.





Victor Kirhenshtein

Hi,

it is a known issue caused by incorrect handling of counter resets. It comes from the fact that NetXMS does not know whether a DCI is a cumulative value or can change both ways. So it assumes the latter, and when the new value is less than the old value you get those extra-large values (not negative, because unsigned data types are usually used for traffic counters). We plan to fix it by introducing new DCI data types, counter 32 bit and counter 64 bit (to match SNMP types); for them the system will know that they are cumulative and handle resets correctly.

Best regards,
Victor
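
The effect described above can be reproduced in a few lines. This is only a sketch in Python (not NXSL and not NetXMS code), and it assumes the delta is computed as a plain unsigned 64-bit subtraction; the function names naive_delta and counter_delta are made up for illustration.

```python
MOD64 = 2 ** 64  # range of an unsigned 64-bit integer


def naive_delta(prev: int, curr: int) -> int:
    """Unsigned 64-bit subtraction with no knowledge that the value is a counter."""
    return (curr - prev) % MOD64


def counter_delta(prev: int, curr: int, bits: int = 32) -> int:
    """Delta of a cumulative counter of the given width.

    Assumes the counter wrapped at most once between the two polls.
    """
    return (curr - prev) % (1 << bits)


# A 32-bit counter that wraps between two polls:
prev = 2 ** 32 - 1000   # 1000 bytes before the wrap
curr = 500              # 500 bytes after the wrap

naive = naive_delta(prev, curr)        # 2**64 - 2**32 + 1500, an enormous value
aware = counter_delta(prev, curr, 32)  # 1500, the traffic actually transferred
```

Modular arithmetic only recovers the traffic for a genuine wrap. If the counter went back to zero because the device rebooted, the same formula would overestimate the delta, so a real implementation would likely need an extra check for that case (for example, device uptime).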

komj

First of all, thank you for your reply.

Secondly,
what do you recommend as a workaround until you implement what you described?

I already mentioned the workaround with a transformation script; is that a good solution?

blazarov

Hi,
we are also suffering quite a lot because of this, so we would appreciate an idea for a workaround using a transformation script, as well as a planned timeline for the bug fix.

I can contribute my "workaround" so far, but it's quite simple and useless most of the time, because you have to set a static "high limit" that might be very different depending on the device in our setup.

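sub main()
{
   // 85899345920 = 80 * 2^30: static upper limit for a plausible value
   if ($1 > 85899345920) {
      // implausibly large value: stop without returning a transformed value
      exit;
   } else {
      // multiply by 8, presumably to convert bytes to bits
      return $1 * 8;
   }
}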

komj

Yes, I do a similar thing.

As far as I can see across all the different devices in our network, correct counter values never have more than 10 digits, while incorrect counter values always have more than 14 digits. I therefore compare the actual counter value against a 12-digit number, and the workaround has worked fine so far.

Tursiops

Hi,

Just like everyone else, we have implemented transformation scripts as workarounds (mostly based on the speed of the interface in our case).
Of course the new counter32/64 types would be a much better solution. I assume they would actually handle resets correctly, in that they calculate the traffic that occurred across the reset? That's something our transformation script doesn't handle at all at present.
With some interfaces resetting every few minutes (yes, lots of traffic), we currently end up with a lot of dips to "0".

For which release is this planned? :D

Cheers
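
A speed-based check like the one mentioned above could look roughly like this. It is a Python sketch rather than an NXSL transformation script; the function name plausible_rate, its parameters, and the 10% headroom are assumptions made for illustration, not anything taken from NetXMS.

```python
from typing import Optional


def plausible_rate(delta_bytes: int, interval_s: float,
                   if_speed_bps: int, headroom: float = 1.1) -> Optional[float]:
    """Return the rate in bits per second, or None if it cannot be real.

    delta_bytes  - byte counter difference between two polls
    interval_s   - seconds between the two polls
    if_speed_bps - nominal interface speed in bits per second
    headroom     - tolerance factor above the nominal speed (assumed value)
    """
    if interval_s <= 0 or delta_bytes < 0:
        return None
    rate_bps = delta_bytes * 8 / interval_s
    if rate_bps > if_speed_bps * headroom:
        # Faster than the link can carry: treat as a counter reset artifact.
        return None
    return rate_bps


# 7.5 MB in 60 s on a 1 Gbit/s link is plausible.
ok = plausible_rate(7_500_000, 60, 1_000_000_000)
# A wrapped counter subtracted as unsigned 64-bit is rejected.
bad = plausible_rate(2 ** 64 - 2 ** 32 + 1500, 60, 1_000_000_000)
```

Unlike a static limit, the threshold follows each interface's speed, but a rejected sample is still lost: the traffic that crossed the reset is not recovered, which is where the dips come from.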

Victor Kirhenshtein

Hi,

I plan to implement it soon in the development branch, so it will make it into the next stable release. I'm not sure if it is wise to add new data types to the current stable release.

Best regards,
Victor

jstump

Any word on this?

I'm hoping to move away from Dude+PRTG to NetXMS only, and I'm not having much luck getting reliable traffic monitoring right now.

Dani@M3T

Victor

Any news about the counter32 and counter64 data types?

best regards
Dani