segfault netxmsd crash

edward.borst · August 25, 2014, 02:47:20 PM

Hi,

We are getting segfaults in netxmsd after upgrading to 1.2.16.
The error is:

netxmsd[32695]: segfault at 7f62ef0d666c ip 00007f62f3806dd0 sp 00007f62ef09ad50 error 4 in libnxcore.so.1.0.0

The OS is Oracle Linux (Red Hat) 6.3 with kernel 2.6.32-300.3.1.el6uek.x86_64.
The glibc version is glibc-2.12-1.80.el6_3.6.i686.

Victor Kirhenshtein · August 26, 2014, 12:54:54 PM

Hi,

Could you please run netxmsd under gdb and post the backtrace after it crashes? Here are the instructions: http://wiki.netxms.org/wiki/Running_NetXMS_under_debugger

Best regards,
Victor

edward.borst · August 26, 2014, 01:05:00 PM

I am already running it under gdb, but it has not crashed yet.
As soon as I have a trace I will send it to you.

regards,
Edward

edward.borst · August 26, 2014, 04:13:15 PM

Here it is:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffef576700 (LWP 19648)]
0x00007ffff7d04da7 in AlarmManager::watchdogThread (this=0x7ffff7ff7740) at alarm.cpp:1090
1090 if ((m_pAlarmList[i].dwTimeout > 0) &&
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.80.el6_3.6.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-10.el6.x86_64 libcom_err-1.41.12-14.el6.x86_64 libgcc-4.4.6-4.el6.x86_64 libselinux-2.0.94-5.3.el6.x86_64 libstdc++-4.4.6-4.el6.x86_64 mysql-libs-5.1.66-2.el6_3.x86_64 nss-softokn-freebl-3.12.9-11.el6.x86_64 openssl-1.0.0-27.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 0x00007ffff7d04da7 in AlarmManager::watchdogThread (this=0x7ffff7ff7740) at alarm.cpp:1090
#1 0x00007ffff7d04f29 in WatchdogThreadStarter (pArg=<value optimized out>) at alarm.cpp:204
#2 0x000000313d407851 in start_thread () from /lib64/libpthread.so.0
#3 0x000000313cce811d in clone () from /lib64/libc.so.6

regards,
Edward

edward.borst · August 28, 2014, 11:12:38 AM

Hi Victor,

Did you have a chance to look at the trace?
It keeps crashing several times a day.

Alex Kirhenshtein · August 28, 2014, 04:54:05 PM

Hello.

Yes, we spent some time on it, but unfortunately we can't find the root cause of this problem. Could you please run the server under valgrind until it crashes?

edward.borst · August 28, 2014, 05:23:31 PM

Of course we can.
Are there any options you want us to pass to valgrind?

edward.borst · August 28, 2014, 10:12:58 PM

It looks like this issue is caused by alarms created before 1.2.16.
We had about 6 old alarms with a sticky acknowledge set on them.
I was looking at the alarm.cpp code to see if I could find anything,
and I saw that this piece of code was added in the last release.
So I deleted (terminated) all of the old alarms, and we have not had a crash since.
New alarms have been coming in without errors.

regards,
Edward

Victor Kirhenshtein · August 28, 2014, 11:01:42 PM

Hi!

Do you mean that you've found a specific place in the added code that caused this crash?

Best regards,
Victor

edward.borst · August 31, 2014, 05:43:57 PM

Hello,
Unfortunately it crashed again, with the same error.
I'm now running it under valgrind with default settings, but that takes a lot of resources.
It has not crashed yet.

edward.borst · September 01, 2014, 09:07:45 AM

Here is part of an error report from Valgrind:

==6816== Thread 7:
==6816== Invalid read of size 4
==6816==    at 0x4C5CDD0: AlarmManager::watchdogThread() (alarm.cpp:1105)
==6816==    by 0x4C5CF28: WatchdogThreadStarter(void*) (alarm.cpp:204)
==6816==    by 0x313D407850: start_thread (in /lib64/libpthread-2.12.so)
==6816==    by 0x313CCE811C: clone (in /lib64/libc-2.12.so)
==6816==  Address 0x2c8c57bc is 52,620 bytes inside a block of size 243,984 free'd
==6816==    at 0x4A0610F: realloc (vg_replace_malloc.c:525)
==6816==    by 0x4C5EF8A: AlarmManager::newAlarm(char*, char*, int, int, unsigned int, unsigned int, Event*, unsigned int) (alarm.cpp:359)
==6816==    by 0x4C81704: EPRule::generateAlarm(Event*) (epp.cpp:531)
==6816==    by 0x4C81D4A: EPRule::processEvent(Event*) (epp.cpp:468)
==6816==    by 0x4C81F4B: EventPolicy::processEvent(Event*) (epp.cpp:814)
==6816==    by 0x4C862C9: EventProcessor(void*) (evproc.cpp:225)
==6816==    by 0x313D407850: start_thread (in /lib64/libpthread-2.12.so)
==6816==    by 0x313CCE811C: clone (in /lib64/libc-2.12.so)
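
One possible reading of this trace, offered as an assumption rather than a confirmed diagnosis: the watchdog thread reads an element of the alarm array while the event processor thread grows the same array with realloc(), which may move the block and free the old copy. The sketch below is not the NetXMS code. The names Alarm, AlarmList, add and countExpired are hypothetical and only illustrate the general pattern of taking the same lock for the realloc() and for the scan, and of not keeping a pointer into the array across lock releases.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <ctime>
#include <mutex>
#include <new>

// Hypothetical, simplified alarm record; field names are illustrative only.
struct Alarm
{
   uint32_t id;
   uint32_t timeout;     // seconds, 0 means "no timeout"
   time_t lastChange;    // time of last state change
};

// Hypothetical alarm list that grows with realloc().
class AlarmList
{
public:
   ~AlarmList() { free(m_items); }

   // Writer side: realloc() may move the block, so it runs under the lock.
   void add(const Alarm &a)
   {
      std::lock_guard<std::mutex> guard(m_lock);
      if (m_count == m_capacity)
      {
         size_t newCapacity = (m_capacity == 0) ? 16 : m_capacity * 2;
         void *p = realloc(m_items, newCapacity * sizeof(Alarm));
         if (p == nullptr)
            throw std::bad_alloc();
         m_items = static_cast<Alarm *>(p);
         m_capacity = newCapacity;
      }
      m_items[m_count++] = a;
   }

   // Reader side (the watchdog role): takes the same lock and reads
   // m_items and m_count only while holding it.
   size_t countExpired(time_t now)
   {
      std::lock_guard<std::mutex> guard(m_lock);
      size_t expired = 0;
      for (size_t i = 0; i < m_count; i++)
      {
         if ((m_items[i].timeout > 0) &&
             (m_items[i].lastChange + static_cast<time_t>(m_items[i].timeout) < now))
            expired++;
      }
      return expired;
   }

   size_t size()
   {
      std::lock_guard<std::mutex> guard(m_lock);
      return m_count;
   }

private:
   std::mutex m_lock;
   Alarm *m_items = nullptr;
   size_t m_count = 0;
   size_t m_capacity = 0;
};
```

In this sketch a watchdog loop would call countExpired() once per pass and release the lock between passes, so a concurrent add() can never free memory that the scan is still reading. Whether the real code misses a lock on some path, or the corruption comes from elsewhere, cannot be told from this trace alone.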

edward.borst · September 01, 2014, 09:10:34 AM

And here is another one:

==6816== Thread 66:
==6816== Conditional jump or move depends on uninitialised value(s)
==6816==    at 0x4CC549E: ClientSession::onAlarmUpdate(unsigned int, NXC_ALARM*) (session.cpp:4989)
==6816==    by 0x4C631ED: EnumerateClientSessions(void (*)(ClientSession*, void*), void*) (client.cpp:337)
==6816==    by 0x4C5F278: AlarmManager::newAlarm(char*, char*, int, int, unsigned int, unsigned int, Event*, unsigned int) (alarm.cpp:384)
==6816==    by 0x4C81704: EPRule::generateAlarm(Event*) (epp.cpp:531)
==6816==    by 0x4C81D4A: EPRule::processEvent(Event*) (epp.cpp:468)
==6816==    by 0x4C81F4B: EventPolicy::processEvent(Event*) (epp.cpp:814)
==6816==    by 0x4C862C9: EventProcessor(void*) (evproc.cpp:225)
==6816==    by 0x313D407850: start_thread (in /lib64/libpthread-2.12.so)
==6816==    by 0x313CCE811C: clone (in /lib64/libc-2.12.so)

I hope this helps in finding the cause of the crash.

edward.borst · September 02, 2014, 04:33:52 PM

Hello,

Any progress on this?
The server is still crashing several times a day.

Would it be an option to downgrade to 1.2.14?
This was our previous version.

Thanks,
Edward

Victor Kirhenshtein · September 02, 2014, 10:48:07 PM

Hi!

Could you please provide the full valgrind log (assuming there are more records than you have posted already)? The errors logged so far look like consequences of some memory corruption that happened earlier.

Best regards,
Victor

edward.borst · September 02, 2014, 10:55:03 PM

Hi,

Here is the full log.
It is huge, so I have compressed it.

Hope this helps.

NetXMS Support Forum
