Segfault after upgrade to 2.1-RC1

Tursiops · May 15, 2017, 01:43:48 PM

Hi,

Looks like our NetXMS server decided to start segfaulting after the upgrade to 2.1-RC1.
I have read through the dump but, not being a developer, I have no idea what the underlying cause is, so here goes:

*** Error in `netxmsd': malloc(): smallbin double linked list corrupted: 0x0000000035444410 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f6f596457e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x81d61)[0x7f6f5964fd61]
/lib/x86_64-linux-gnu/libc.so.6(__libc_calloc+0xba)[0x7f6f5965221a]
/usr/lib/x86_64-linux-gnu/libnetxms.so.2(_ZN11NXCPMessageC1EP12NXCP_MESSAGEi+0x227)[0x7f6f599c3b47]
/usr/lib/x86_64-linux-gnu/libnxsrv.so.2(_ZN15AgentConnection14receiverThreadEv+0x592)[0x7f6f59c20b72]
/usr/lib/x86_64-linux-gnu/libnxsrv.so.2(_ZN15AgentConnection21receiverThreadStarterEPv+0x9)[0x7f6f59c21039]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f6f57a2d6ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f6f596d482d]

Anyone else seeing similar crashes? Any idea what's causing them?

Cheers

Victor Kirhenshtein · May 15, 2017, 10:19:33 PM

Hi,

Could you please run netxmsd under a debugger and post a backtrace (https://wiki.netxms.org/wiki/Running_NetXMS_under_debugger)?

Best regards,
Victor

Tursiops · May 16, 2017, 08:56:53 AM

Hi Victor,

after the crash, the debugger prompt did not re-appear.
It's just stuck showing me loads of lines in the following style:

[New Thread 0x7ffdd22b1700 (LWP 28753)]
These look pretty standard during NetXMS operation and not related to the actual crash.

The netxmsd process itself was still showing in the process list as well, but would not stop unless I sent a kill -9.
At that point, of course, the thread in gdb terminated as well and I could not run a backtrace.

Cheers

Tursiops · May 26, 2017, 09:34:27 AM

The random segfaults continue. I can't get a backtrace using the method described in the Wiki - it never returns to gdb after crashing. Not sure if I am missing something?

Here's another segfault output (no idea if that helps at all):

*** Error in `netxmsd': corrupted double-linked list: 0x00007f33103c6710 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f339ec357e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x81f88)[0x7f339ec3ff88]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7f339ec415d4]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a315)[0x7f339ebe8315]
/lib/x86_64-linux-gnu/libc.so.6(+0x227fc)[0x7f339ebe07fc]
/lib/x86_64-linux-gnu/libc.so.6(+0x213a3)[0x7f339ebdf3a3]
/lib/x86_64-linux-gnu/libc.so.6(iconv_open+0x22a)[0x7f339ebdee2a]
/usr/lib/x86_64-linux-gnu/libnetxms.so.2(ucs4_to_ucs2+0x3a)[0x7f339ef9fa5a]
/usr/lib/x86_64-linux-gnu/libnetxms.so.2(_ZN11NXCPMessage3setEjhPKvbm+0x548)[0x7f339efb4758]
/usr/lib/x86_64-linux-gnu/libnxcore.so.2(_ZN6DCItem20fillLastValueMessageEP11NXCPMessagej+0x92)[0x7f339f4a7d52]
/usr/lib/x86_64-linux-gnu/libnxcore.so.2(_ZN20DataCollectionTarget25fillMessageInternalStage2EP11NXCPMessage+0x124)[0x7f339f4b9504]
/usr/lib/x86_64-linux-gnu/libnxcore.so.2(_ZN6NetObj11fillMessageEP11NXCPMessage+0x5e)[0x7f339f4e758e]
/usr/lib/x86_64-linux-gnu/libnxcore.so.2(_ZN13ClientSession12updateThreadEv+0x21c)[0x7f339f524dcc]
/usr/lib/x86_64-linux-gnu/libnxcore.so.2(_ZN13ClientSession19updateThreadStarterEPv+0x9)[0x7f339f525079]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f339d01d6ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f339ecc482d]

Cheers

Victor Kirhenshtein · May 30, 2017, 11:37:26 PM

Hi,

Can you configure your system to generate a core file after a crash? I can then analyze the core with gdb.

Best regards,
Victor

Tursiops · May 31, 2017, 06:26:12 AM

Hi Victor,

I am trying that now, will let you know how it goes.

I noticed something in the logs on the latest crash (prior to configuration change) which seemed odd:
[ERROR] SQL query failed (Query = "SELECT idata_value,idata_timestamp FROM idata_225 WHERE item_id=1047081695 ORDER BY idata_timestamp DESC LIMIT 32767"): 42P01 ERROR: relation "idata_225" does not exist LINE 1: SELECT idata_value,idata_timestamp FROM idata_225 WHERE item...

I have no idea why NetXMS would try to query a table that doesn't exist for an item_id which definitely does not exist (way too high, max(item_id) in the items table is currently '668531'). This was literally the last entry in the log (I did not see anything like this in the previous crashes though).

Cheers

Tursiops · June 01, 2017, 01:25:11 AM

Hi Victor,

I had never had to do this before, so it was a bit of a learning experience.
I am just putting the notes down here in case they ever come in handy for someone else (or as a reminder for myself if I need this again).

Edited /etc/security/limits.conf and added lines to allow "core" dumps up to 10GB (soft and hard limits). The default is "0", i.e. disabled.
Edited /etc/sysctl.conf and set the kernel.core_pattern to /var/tmp/core-%e-%t.

This worked fine, and I have sent links to the core dumps in a private message.
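
For anyone repeating these steps, here is a minimal sketch of how the two settings could be checked before waiting for the next crash. It is not part of NetXMS; the function names are made up for illustration. It only uses Python's standard `resource` module and the Linux file /proc/sys/kernel/core_pattern. Note that it reports the limits of the shell it is run from, which may differ from those of an already running netxmsd (for that, /proc/<pid>/limits is the place to look).

```python
import resource
from pathlib import Path

# Linux exposes the active kernel.core_pattern value here.
CORE_PATTERN_FILE = Path("/proc/sys/kernel/core_pattern")


def describe_limit(value):
    """Turn a raw RLIMIT_CORE value (in bytes) into readable text."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    if value == 0:
        return "disabled"
    return f"{value} bytes"


def core_dump_settings(pattern_file=CORE_PATTERN_FILE):
    """Return the core size limits of this process and the core pattern.

    Hypothetical helper: the limits are those inherited by the current
    process, not necessarily those of a running netxmsd.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    try:
        pattern = pattern_file.read_text().strip()
    except OSError:
        # Not on Linux, or /proc is not readable.
        pattern = None
    return {
        "soft": describe_limit(soft),
        "hard": describe_limit(hard),
        "pattern": pattern,
    }


if __name__ == "__main__":
    for name, value in core_dump_settings().items():
        print(f"{name}: {value}")
```

If the soft limit still shows as "disabled", the new limits have probably not been picked up by the session yet; logging in again (or restarting the service from a fresh session) is usually needed before they apply.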

Cheers

Tursiops · July 18, 2017, 07:40:21 AM

Hi,

Just an FYI: the issue persisted with the release of 2.1, but has now been resolved in the latest development build.

A huge thank you to Victor for logging into our system, determining the underlying cause (related to Active Agent Tunnels) and fixing it.

Cheers
