netxmsd segfaults

Started by Tursiops, March 06, 2017, 07:07:18 AM


Tursiops

Hi,

On Friday evening, our NetXMS server started segfaulting. I can't pin down the cause, and the server won't run for more than about 15-20 minutes before it crashes again.
I tried to follow the instructions at https://wiki.netxms.org/wiki/Running_NetXMS_under_debugger to obtain a backtrace, but the result is that both the netxmsd service and gdb stop responding: gdb never returns to its prompt, so I can't get a trace. I have to open another session and kill the gdb process itself. If I kill netxmsd instead, gdb is no longer attached to the process and I can't get a trace either.

Anything else I can try to get that elusive backtrace?

Cheers

Tursiops

OK, using LD_PRELOAD=/lib/x86_64-linux-gnu/libSegFault.so netxmsd -q -D4 worked, assuming the result is what is expected.

I've attached a screenshot of the last few lines of standard debug logging and the segfault output.
It doesn't tell me anything, but hopefully it helps in determining what's going on...

Tursiops

I am currently restarting netxmsd in a loop via a bash script: when it crashes, it starts again. The segfault output is written to a separate file each time.
Based on that, it's currently crashing every 7-15 minutes. I've attached another text file with five more segfaults (I skipped the memory map part for these).
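
For anyone wanting to reproduce this setup, a restart loop of this kind can be sketched roughly as follows. This is a minimal illustration, not the exact script in use here: the function name run_with_restart, the CRASH_DIR location, the file naming, and the MAX_RUNS and RESTART_DELAY settings are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch: rerun a command every time it exits and keep each run's stderr
# (where libSegFault writes its report by default) in a separate file.
# CRASH_DIR, MAX_RUNS and RESTART_DELAY are illustrative assumptions.

CRASH_DIR="${CRASH_DIR:-/tmp/netxmsd-crashes}"
MAX_RUNS="${MAX_RUNS:-0}"            # 0 means loop until interrupted

run_with_restart() {
    local run=0 out status
    mkdir -p "$CRASH_DIR" || return 1
    while [ "$MAX_RUNS" -eq 0 ] || [ "$run" -lt "$MAX_RUNS" ]; do
        run=$((run + 1))
        # One file per run: timestamp plus run counter keeps names unique.
        out="$CRASH_DIR/run-$(date +%Y%m%d-%H%M%S)-$run.log"
        "$@" 2> "$out"
        status=$?
        echo "run $run exited with status $status, stderr saved to $out"
        sleep "${RESTART_DELAY:-5}"  # short pause before restarting
    done
}

# Example, using the same command line as in the earlier post:
# run_with_restart env LD_PRELOAD=/lib/x86_64-linux-gnu/libSegFault.so netxmsd -q -D4
```

Passing the preload through env keeps LD_PRELOAD limited to netxmsd itself rather than every helper command the loop runs.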

Is there anything else I can do/provide to help resolve these crashes?

Tursiops

It appears the issue is related to DCI tables.
All segfault backtraces mention DCTable.
I then ran netxmsd with the -D9 option for testing and found several prepare statements for a DCI table logged just before the crash, but no completed sync.
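
Checking that every backtrace mentions the same symbol can be done with a small helper along these lines. It is only a sketch: the function name count_logs_mentioning and the log file paths in the usage line are hypothetical, and it assumes the per-crash output files are plain text.

```shell
#!/usr/bin/env bash
# Sketch: report how many of the given crash logs mention a symbol.
# Uses a fixed-string search, so it also matches mangled C++ names
# that merely contain the text (e.g. "DCTable" inside a longer symbol).

count_logs_mentioning() {
    local symbol="$1" total hits
    shift
    total=$#
    # grep -l prints each matching file name once; wc -l counts them.
    hits=$(grep -lF -- "$symbol" "$@" 2>/dev/null | wc -l)
    echo "$((hits)) of $total logs mention $symbol"
}

# Example (hypothetical paths):
# count_logs_mentioning DCTable /tmp/netxmsd-crashes/run-*.log
```

If the count equals the number of logs, the symbol is a good candidate for where to look next; a lower count suggests more than one crash path.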

I had such an issue in the past with a template for Intel Modular servers. When I could not determine the underlying cause, I "simply" switched all tables in all templates to instances, and the system had been stable since. It appears I missed one: a single template (for a Netgear ReadyNAS) with a single device bound to it still had DCI tables in use.
It appears the NAS performed an automatic firmware upgrade to version 6.6.1 and within an hour NetXMS started crashing. It is not clear if the template for the NAS had not been applied until after the upgrade (auto-apply rules) or if the upgrade simply triggered something in the SNMP responses that caused the crashes of the server.

Since I disabled the DCI tables, the server has not crashed (around 90 minutes of uptime now, over 6 times as long as the longest interval between crashes before that). It looks like there is some kind of issue within the DCI table code that can lead to segfaults and server crashes.