All requests time out, 100% CPU usage

Started by mk, November 02, 2014, 07:01:14 PM


mk

I just set up NetXMS 1.2.17 on Debian 7 with MySQL and added some hosts and configured some SNMP DCIs. While I was editing a transformation script on a DCI table in a template, the client stopped responding and everything I did would only trigger a "Request Timeout" message. I ended up restarting netxmsd, but that didn't help: whenever I started the client again, I got the same "Request Timeout" messages on everything and the netxmsd process was running at 100% CPU. I somehow managed to close the open tabs in the client and even deleted the device that was associated with the template I was editing. While this did solve the "Request Timeout" issues, I still see 100% CPU usage.
The device eventually came back and was associated with the same template again, at which point the "Request Timeout" issues returned.

When running netxmsd in debug mode, I saw that it threw several SYS_THREAD_HANG events related to the "Item Poller" and the "Syncer Thread". How do I go about resolving this issue? I'd like to get the CPU usage back down to normal numbers, and I'd of course like to figure out what caused the issue originally.

multix

If you applied the template to all nodes, then in my opinion you must delete the template.

I made the same mistake and could only fix it by deleting the template.

mk

The template is only applied to two nodes, not all ~30. I have another template that's applied to two different nodes which is working fine.

tomaskir

What do the DCI templates contain?

Do you have any complex transform scripts?

mk

Well, after waiting a few hours the CPU load seems to have gone down to normal. So I went back to see if I could cause the issue again. This is the script fragment that triggers it:

idxVoltage = $1->getColumnIndex("Voltage");
for (i = $1->rowCount -1; i >= 0; i++)
{
    if ($1->get(i, idxVoltage) == null || $1->get(i, idxVoltage) < 0)
    {
        $1->deleteRow(i);
    }
}

Here I was trying to remove those rows from the DCI table that had a negative value in the Voltage field. This is an 8-outlet power distribution unit that reports 42 outlets via SNMP, but only the first 8 actually report sensible values.

I see that NXSL:Table has the deleteRow method, but it's not documented much and not used in any examples, so is it broken and shouldn't be used? The 1.2.9 release notes mention the introduction of this method:
Quote from: ChangeLog
*
* 1.2.9
*
[...]
- New methods deleteColumn and deleteRow in NXSL class Table

Victor Kirhenshtein

There is an error in your script:

for (i = $1->rowCount -1; i >= 0; i++)

You start with the last row (say, 10 or so) and likely want to go down to row 0. But you increment i instead of decrementing it, so it went all the way up to 2147483647 (2^31 - 1), where the next increment overflows and wraps to a negative value, and only then did the loop stop. So your script ran more than two billion iterations, which is quite a lot for a scripting language even on modern hardware.

Best regards,
Victor
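
Editor's note: the arithmetic behind this explanation can be checked with a small sketch. It is written in Python rather than NXSL so it can be run outside a NetXMS server, and it assumes that the loop counter is a signed 32-bit integer, as the explanation above implies. The helper names wrap_int32 and iterations_until_negative are hypothetical, and the starting index 41 assumes the 42-row table described earlier in the thread.

```python
INT32_MAX = 2**31 - 1   # 2147483647, largest signed 32-bit value
INT32_MIN = -2**31      # -2147483648, smallest signed 32-bit value


def wrap_int32(value):
    """Reduce an arbitrary Python int to the signed 32-bit range (two's complement)."""
    return (value - INT32_MIN) % 2**32 + INT32_MIN


def iterations_until_negative(start):
    """Count how often the body of `for (i = start; i >= 0; i++)` runs
    when i is a signed 32-bit counter that wraps on overflow."""
    if start < 0:
        return 0
    # i takes every value from start up to INT32_MAX, then wraps below zero.
    return INT32_MAX - start + 1


start = 41  # assumed: last row index of a 42-row table
after_overflow = wrap_int32(INT32_MAX + 1)
iterations = iterations_until_negative(start)

print(after_overflow)
print(iterations)
```

The count is computed rather than looped, since actually running that many iterations is exactly the problem being described.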

mk

How could I possibly have overlooked that...? Of course, it's working just fine with the -- instead of the ++. Thanks!
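
Editor's note: the corrected pattern, walking the table from the last row down to row 0 and deleting as you go, can be sketched as follows. This is a Python model and not NXSL: a list of dictionaries stands in for the DCI table, the function name drop_invalid_rows is hypothetical, and the outlet data is made up for illustration. Iterating backwards matters because deleting a row shifts only the rows after it, which have already been visited.

```python
def drop_invalid_rows(rows, column):
    """Delete, in place, every row whose `column` value is None or negative."""
    # Count down from the last index to 0 (the `i--` form of the loop).
    for i in range(len(rows) - 1, -1, -1):
        value = rows[i].get(column)
        if value is None or value < 0:
            del rows[i]
    return rows


# Made-up sample data: two real outlets, two phantom ones.
outlets = [
    {"Outlet": 1, "Voltage": 230},
    {"Outlet": 2, "Voltage": 229},
    {"Outlet": 3, "Voltage": -1},
    {"Outlet": 4, "Voltage": None},
]

drop_invalid_rows(outlets, "Voltage")
remaining = [row["Outlet"] for row in outlets]
print(remaining)
```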