Segfault after upgrade to 2.1-RC1

Started by Tursiops, May 15, 2017, 01:43:48 PM

Tursiops

Hi,

Looks like our NetXMS server decided to start segfaulting after the upgrade to 2.1-RC1.
Reading through the dump and not being a developer, I have no idea what the underlying cause is, so here goes:

*** Error in `netxmsd': malloc(): smallbin double linked list corrupted: 0x0000000035444410 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f6f596457e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x81d61)[0x7f6f5964fd61]
/lib/x86_64-linux-gnu/libc.so.6(__libc_calloc+0xba)[0x7f6f5965221a]
/usr/lib/x86_64-linux-gnu/libnetxms.so.2(_ZN11NXCPMessageC1EP12NXCP_MESSAGEi+0x227)[0x7f6f599c3b47]
/usr/lib/x86_64-linux-gnu/libnxsrv.so.2(_ZN15AgentConnection14receiverThreadEv+0x592)[0x7f6f59c20b72]
/usr/lib/x86_64-linux-gnu/libnxsrv.so.2(_ZN15AgentConnection21receiverThreadStarterEPv+0x9)[0x7f6f59c21039]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f6f57a2d6ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f6f596d482d]


Anyone else seeing similar crashes? Any idea what's causing them?

Cheers

Victor Kirhenshtein

Hi,

Could you please run netxmsd under a debugger and post a backtrace (https://wiki.netxms.org/wiki/Running_NetXMS_under_debugger)?

Best regards,
Victor

Tursiops

Hi Victor,

After the crash, the debugger prompt did not reappear.
It is just stuck showing me loads of lines in the following style:
[New Thread 0x7ffdd22b1700 (LWP 28753)]
These look pretty standard during NetXMS operation and not related to the actual crash.

The netxmsd process itself was still showing in the process list as well, but would not stop unless I sent a kill -9.
At that point, of course, the thread in gdb terminated as well and I could not run a backtrace.

Cheers

Tursiops

The random segfaults continue. I can't get a backtrace using the method described in the Wiki, because control never returns to gdb after the crash. Not sure if I am missing something?

Here's another segfault output (no idea if that helps at all):
*** Error in `netxmsd': corrupted double-linked list: 0x00007f33103c6710 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f339ec357e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x81f88)[0x7f339ec3ff88]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7f339ec415d4]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a315)[0x7f339ebe8315]
/lib/x86_64-linux-gnu/libc.so.6(+0x227fc)[0x7f339ebe07fc]
/lib/x86_64-linux-gnu/libc.so.6(+0x213a3)[0x7f339ebdf3a3]
/lib/x86_64-linux-gnu/libc.so.6(iconv_open+0x22a)[0x7f339ebdee2a]
/usr/lib/x86_64-linux-gnu/libnetxms.so.2(ucs4_to_ucs2+0x3a)[0x7f339ef9fa5a]
/usr/lib/x86_64-linux-gnu/libnetxms.so.2(_ZN11NXCPMessage3setEjhPKvbm+0x548)[0x7f339efb4758]
/usr/lib/x86_64-linux-gnu/libnxcore.so.2(_ZN6DCItem20fillLastValueMessageEP11NXCPMessagej+0x92)[0x7f339f4a7d52]
/usr/lib/x86_64-linux-gnu/libnxcore.so.2(_ZN20DataCollectionTarget25fillMessageInternalStage2EP11NXCPMessage+0x124)[0x7f339f4b9504]
/usr/lib/x86_64-linux-gnu/libnxcore.so.2(_ZN6NetObj11fillMessageEP11NXCPMessage+0x5e)[0x7f339f4e758e]
/usr/lib/x86_64-linux-gnu/libnxcore.so.2(_ZN13ClientSession12updateThreadEv+0x21c)[0x7f339f524dcc]
/usr/lib/x86_64-linux-gnu/libnxcore.so.2(_ZN13ClientSession19updateThreadStarterEPv+0x9)[0x7f339f525079]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f339d01d6ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f339ecc482d]


Cheers

Victor Kirhenshtein

Hi,

Can you configure your system to generate a core file after a crash? I can then analyze the core with gdb.

Best regards,
Victor

Tursiops

Hi Victor,

I am trying that now, will let you know how it goes.

I noticed something in the logs on the latest crash (prior to configuration change) which seemed odd:
[ERROR] SQL query failed (Query = "SELECT idata_value,idata_timestamp FROM idata_225 WHERE item_id=1047081695 ORDER BY idata_timestamp DESC LIMIT 32767"): 42P01 ERROR:  relation "idata_225" does not exist LINE 1: SELECT idata_value,idata_timestamp FROM idata_225 WHERE item...

I have no idea why NetXMS would try to query a table that doesn't exist for an item_id which definitely does not exist (way too high, max(item_id) in the items table is currently '668531'). This was literally the last entry in the log (I did not see anything like this in the previous crashes though).

Cheers

Tursiops

Hi Victor,

I had never had to do this before, so it was a bit of a learning experience.
Just putting the notes down here in case they ever come in handy for someone else (or as a reminder for myself if I need this again).

Edited /etc/security/limits.conf and added lines to allow "core" dumps up to 10GB (soft and hard limits). The default is "0", i.e. disabled.
Edited /etc/sysctl.conf and set the kernel.core_pattern to /var/tmp/core-%e-%t.

That worked fine, and I have sent links to the core dumps in a private message.
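
For anyone repeating the two changes above, a small check like the one below may help confirm that they took effect before waiting for the next crash. It is only a sketch: the function names are made up for illustration, it reports the limits of the session it is run from (which should match what a netxmsd started from that same session inherits, but not necessarily one started by an init system), and it assumes a Linux host where the pattern is exposed at /proc/sys/kernel/core_pattern.

```python
import resource
from pathlib import Path

GIB = 1024 ** 3


def describe_limit(value):
    """Render an rlimit value (in bytes) as readable text."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value} bytes ({value / GIB:.2f} GiB)"


def core_limits():
    """Return the (soft, hard) core file size limits of this process, in bytes."""
    return resource.getrlimit(resource.RLIMIT_CORE)


def core_pattern(path="/proc/sys/kernel/core_pattern"):
    """Return the kernel core_pattern string, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    soft, hard = core_limits()
    print("core soft limit:", describe_limit(soft))
    print("core hard limit:", describe_limit(hard))
    if soft == 0:
        # A soft limit of 0 means no core file will be written.
        print("core dumps are disabled for this session")
    print("core_pattern:", core_pattern())
```

If the soft limit still shows 0 after editing limits.conf, logging out and back in is usually needed, since those limits are applied at session start.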

Cheers

Tursiops

Hi,

Just an FYI: the issue persisted with the release of 2.1, but has now been resolved in the latest development build.

A huge thank you to Victor for logging into our system, determining the underlying cause (related to Active Agent Tunnels) and fixing it.

Cheers