netxmsd segmentation faults

Started by Tursiops, May 13, 2016, 07:23:58 AM

Tursiops

Hi,

For the last couple of days the netxmsd process on our server has been segfaulting randomly. "Randomly" in that the debug logs (at level 8) do not show the same thing happening before each crash. It does usually take a couple of hours before the process crashes, though.

Not sure how to troubleshoot this.
I've been reading up on gdb and am following some suggestions from this post: http://stackoverflow.com/questions/16169022/debugging-a-running-daemon-using-gdb
Not sure if that is going to produce any useful output at all.

Anything else I can do to troubleshoot this?

Thanks

tomaskir

We need more information.

What version of NetXMS are you using?
What OS?

Are you compiling from source or using .deb packages?

Tursiops

NetXMS Version 2.0.3
Running on Ubuntu 14.04.4 LTS, using packages from "deb http://packages.netxms.org/ubuntu trusty main".

nxdbmgr comes back clean. But I guess that doesn't rule out some database corruption bringing NetXMS down.

tomaskir

Are you using "Interface::setExpectedState" NXSL functions by any chance?
There was a bug that would cause segfaults that is fixed in develop branch.

If not, your best bet in this case is to contact official Raden support, so they can help you out.
https://www.netxms.org/contact/

Victor Kirhenshtein

Hi,

Also, 2.0.3 may crash on receiving some SNMP traps (fixed in the develop branch as well).

Best regards,
Victor

Tursiops

We're not using the NXSL function for the ExpectedState (yet), but we certainly have some SNMP traps coming in.
Maybe some device we're monitoring or added recently is having issues and just happens to be sending one of those "bad" traps.
Guess I'll wait for 2.0.4 and see what happens.

tomaskir

I can provide you with built packages from the develop branch if you wish.

Tursiops

I may have found the culprit. There have been two segfaults right after the same type of DCI alarm was triggered. And a third one just now, "live" while I was connected: an alert relating to that type of DCI popped up and NetXMS died just as the message was displayed. However, there were several segfaults without a directly related log message. So to confirm, I've made some config changes and am now monitoring to see if the segfaults actually stop.

Will update the post once I have some more data (i.e. no segfault in the next 24 hours would be a good sign).
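
The kind of correlation described above (checking which alarm shows up shortly before each crash) can be sketched in Python. This is only an illustration: `events_before_crashes` is a hypothetical helper, the 60-second window is an arbitrary assumption, and the timestamps and messages are made-up placeholders, not data from this server.

```python
from collections import Counter
from datetime import datetime, timedelta

def events_before_crashes(crashes, events, window=timedelta(seconds=60)):
    """Count, per message, how many crashes had that message logged within `window` before them."""
    counts = Counter()
    for crash in crashes:
        # A set, so each message counts at most once per crash
        seen = {msg for ts, msg in events if crash - window <= ts <= crash}
        counts.update(seen)
    return counts

# Made-up placeholder data, only to show the call
crashes = [datetime(2000, 1, 1, 3, 0, 0), datetime(2000, 1, 1, 9, 30, 0)]
events = [
    (datetime(2000, 1, 1, 2, 59, 50), "alarm type A"),
    (datetime(2000, 1, 1, 2, 40, 0), "alarm type B"),
    (datetime(2000, 1, 1, 9, 29, 55), "alarm type A"),
]
print(events_before_crashes(crashes, events).most_common())
```

A message that precedes most crashes is a candidate, not proof; disabling the suspect configuration and watching for further crashes is still the confirming step.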

Tursiops

Looks like that was it. No more segfaults.
I have attached the template that is causing the problem.

After some digging in the database (prior to removing the template from all relevant devices) I found that the devices (old Intel Modular servers) at times return "SCM Comm Error" in response to requests for random values. The template includes a number of Integer and Int64 fields, including in SNMP tables. Not sure if that's what is causing the problem, i.e. NetXMS receiving "SCM Comm Error" and trying to put that into an Int64 field?
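
To make the hypothesis concrete, here is a minimal Python sketch of what a defensive conversion of a device reply into a signed 64-bit integer looks like. It is not NetXMS code and says nothing about how the server actually parses values; `parse_int64` is a hypothetical helper used purely for illustration.

```python
INT64_MIN = -2**63
INT64_MAX = 2**63 - 1

def parse_int64(raw):
    """Return raw as an int if it is a valid signed 64-bit value, else None."""
    try:
        value = int(raw.strip())
    except (ValueError, AttributeError):
        # Not a whole number (e.g. an error string) or not a string at all
        return None
    if not INT64_MIN <= value <= INT64_MAX:
        return None  # numeric, but does not fit into 64 bits
    return value

for reply in ["42", " -7 ", "SCM Comm Error", "9223372036854775808"]:
    print(repr(reply), "->", parse_int64(reply))
```

The point is that a text reply is rejected explicitly rather than being assumed numeric; whether a missing check of this kind is behind the crash is exactly what a backtrace would have to show.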

Victor Kirhenshtein

The template seems normal, nothing unusual. Even if the agent returns text in some fields, the server should handle it correctly. There were some fixes in 2.0.4, so you can try re-enabling this template once 2.0.4 is available. If that does not help, the only way to debug the issue will be to run netxmsd under a debugger and wait for a crash.

Best regards,
Victor

Tursiops

I decided to re-enable the template in 2.0.5 and unfortunately the issue persists.
Once I enable the template, the NetXMS server crashes within a few hours.
When I disable the template again, NetXMS runs stable and without issues.

This is driving me nuts... I'm not exactly a code/debug whiz, so what do I have to do to get some debugging data that would help with fixing this?

Victor Kirhenshtein

Could you please run netxmsd under gdb and send the backtrace after a crash? Here are the instructions: https://wiki.netxms.org/wiki/Running_NetXMS_under_debugger

Best regards,
Victor
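
For anyone unsure what "run under gdb and send the backtrace" amounts to, the following Python sketch builds a gdb command line that starts a program, waits until it stops (for example on a segfault), and then prints backtraces for all threads. The gdb options used (`-batch`, `-ex`, `--args`) and the `thread apply all bt` command are standard gdb; `gdb_backtrace_command` is a hypothetical helper, and the arguments netxmsd itself needs are not shown here, so take those from the wiki page linked above.

```python
import shlex

def gdb_backtrace_command(binary, program_args=()):
    """Build a gdb command that runs `binary` and dumps all thread backtraces once it stops."""
    return [
        "gdb",
        "-batch",                      # run the commands below, then exit gdb
        "-ex", "set pagination off",   # never pause the output waiting for a keypress
        "-ex", "run",                  # start the program; returns when it stops or exits
        "-ex", "thread apply all bt",  # backtrace of every thread at the stop point
        "--args", binary, *program_args,
    ]

cmd = gdb_backtrace_command("netxmsd")  # real netxmsd arguments omitted on purpose
print(" ".join(shlex.quote(part) for part in cmd))
```

Redirecting the output of the printed command to a file gives something that can be attached to a post. If the program exits normally instead of crashing, there are no threads left and the backtrace step produces nothing useful.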