segfault netxmsd crash

Started by edward.borst, August 25, 2014, 02:47:20 PM

edward.borst

Hi,

We are getting segfaults in netxmsd after upgrading to 1.2.16.
The error is:

netxmsd[32695]: segfault at 7f62ef0d666c ip 00007f62f3806dd0 sp 00007f62ef09ad50 error 4 in libnxcore.so.1.0.0

OS is Oracle Linux (RedHat) 6.3 with kernel: 2.6.32-300.3.1.el6uek.x86_64
glibc version is glibc-2.12-1.80.el6_3.6.i686
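
For anyone reading the kernel message above: the "error 4" field is the x86 page-fault error code. The sketch below decodes it. The function name describeSegfaultError is made up for illustration and is not part of NetXMS or any library; the bit meanings are the standard x86 ones (bit 0 = protection violation vs. page not present, bit 1 = write vs. read, bit 2 = user vs. kernel mode, bit 4 = instruction fetch).

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: turns the "error N" value from a Linux
// "segfault at ... error N" kernel message into a readable description.
// Only bits 0, 1, 2 and 4 of the x86 page-fault error code are decoded;
// the reserved-bit flag (bit 3) is ignored in this sketch.
std::string describeSegfaultError(uint32_t code)
{
   std::string text = (code & 0x04) ? "user-mode " : "kernel-mode ";

   if (code & 0x10)
      text += "instruction fetch";
   else
      text += (code & 0x02) ? "write" : "read";

   text += (code & 0x01) ? ", protection violation" : ", page not present";
   return text;
}
```

With this decoding, describeSegfaultError(4) gives "user-mode read, page not present", i.e. the process read from an address that was not mapped at all.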





Victor Kirhenshtein

Hi,

Can you please run netxmsd under gdb and post the backtrace after a crash? Here are the instructions: http://wiki.netxms.org/wiki/Running_NetXMS_under_debugger

Best regards,
Victor

edward.borst

I am already running it under gdb, but it has not crashed yet.
As soon as I have a trace I will send it to you.

regards,
Edward

edward.borst

Here it is:


Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffef576700 (LWP 19648)]
0x00007ffff7d04da7 in AlarmManager::watchdogThread (this=0x7ffff7ff7740) at alarm.cpp:1090
1090                            if ((m_pAlarmList.dwTimeout > 0) &&
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.80.el6_3.6.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-10.el6.x86_64 libcom_err-1.41.12-14.el6.x86_64 libgcc-4.4.6-4.el6.x86_64 libselinux-2.0.94-5.3.el6.x86_64 libstdc++-4.4.6-4.el6.x86_64 mysql-libs-5.1.66-2.el6_3.x86_64 nss-softokn-freebl-3.12.9-11.el6.x86_64 openssl-1.0.0-27.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x00007ffff7d04da7 in AlarmManager::watchdogThread (this=0x7ffff7ff7740) at alarm.cpp:1090
#1  0x00007ffff7d04f29 in WatchdogThreadStarter (pArg=<value optimized out>) at alarm.cpp:204
#2  0x000000313d407851 in start_thread () from /lib64/libpthread.so.0
#3  0x000000313cce811d in clone () from /lib64/libc.so.6


regards,
Edward

edward.borst

Hi Victor,

Did you have a chance to look at the trace?
It keeps crashing several times a day.

Alex Kirhenshtein

Hello.

Yes, we spent some time on it, but unfortunately could not find the root cause of this problem. Can you please run the server under valgrind until it crashes?

edward.borst

Of course we can.
Are there any particular valgrind options you want us to use?

edward.borst

It looks like this issue is coming from pre-1.2.16 alarms.
We had about 6 old alarms with a sticky acknowledge set on them.
I was looking at the alarm.cpp code to see if I could find anything,
and I saw that this piece of code was added in the last release.
So I deleted (terminated) all of the old alarms, and we have not had a crash since.
New alarms have been coming in with no error.

regards,
Edward

Victor Kirhenshtein

Hi!

Do you mean that you have found a specific place in the added code that caused this crash?

Best regards,
Victor

edward.borst

Hello,
Unfortunately it crashed again, with the same error.
I am now running it under valgrind with default settings, but that takes a lot of resources.
It has not crashed yet.

edward.borst

Here is part of an error report from Valgrind:

==6816== Thread 7:
==6816== Invalid read of size 4
==6816==    at 0x4C5CDD0: AlarmManager::watchdogThread() (alarm.cpp:1105)
==6816==    by 0x4C5CF28: WatchdogThreadStarter(void*) (alarm.cpp:204)
==6816==    by 0x313D407850: start_thread (in /lib64/libpthread-2.12.so)
==6816==    by 0x313CCE811C: clone (in /lib64/libc-2.12.so)
==6816==  Address 0x2c8c57bc is 52,620 bytes inside a block of size 243,984 free'd
==6816==    at 0x4A0610F: realloc (vg_replace_malloc.c:525)
==6816==    by 0x4C5EF8A: AlarmManager::newAlarm(char*, char*, int, int, unsigned int, unsigned int, Event*, unsigned int) (alarm.cpp:359)
==6816==    by 0x4C81704: EPRule::generateAlarm(Event*) (epp.cpp:531)
==6816==    by 0x4C81D4A: EPRule::processEvent(Event*) (epp.cpp:468)
==6816==    by 0x4C81F4B: EventPolicy::processEvent(Event*) (epp.cpp:814)
==6816==    by 0x4C862C9: EventProcessor(void*) (evproc.cpp:225)
==6816==    by 0x313D407850: start_thread (in /lib64/libpthread-2.12.so)
==6816==    by 0x313CCE811C: clone (in /lib64/libc-2.12.so)
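
The report above says the watchdog thread read from a block that realloc() had already released inside newAlarm(). One pattern that produces exactly this kind of report is a reader walking a realloc-grown array while a writer grows it, without both sides holding the same lock. The sketch below shows that pattern with the lock in place. It is only an illustration under that assumption: Alarm and AlarmList are hypothetical stand-ins, not the NetXMS AlarmManager code, and the real cause may be different.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <mutex>
#include <new>

// Hypothetical stand-in for an alarm record; only the timeout matters here.
struct Alarm
{
   uint32_t id;
   uint32_t timeout;   // 0 means "no timeout"
};

// Hypothetical list that grows with realloc(), guarded by one mutex.
class AlarmList
{
public:
   AlarmList() : m_count(0), m_items(nullptr) {}
   ~AlarmList() { std::free(m_items); }
   AlarmList(const AlarmList&) = delete;
   AlarmList& operator=(const AlarmList&) = delete;

   // Writer side: realloc() may move the array and free the old block,
   // so any pointer into the old block becomes invalid at this point.
   void add(uint32_t id, uint32_t timeout)
   {
      std::lock_guard<std::mutex> lock(m_mutex);
      void *p = std::realloc(m_items, (m_count + 1) * sizeof(Alarm));
      if (p == nullptr)
         throw std::bad_alloc();
      m_items = static_cast<Alarm*>(p);
      m_items[m_count].id = id;
      m_items[m_count].timeout = timeout;
      m_count++;
   }

   // Reader side (what a watchdog loop would do): the whole scan runs
   // under the same mutex, so the array cannot move in the middle of it.
   size_t countWithTimeout()
   {
      std::lock_guard<std::mutex> lock(m_mutex);
      size_t n = 0;
      for (size_t i = 0; i < m_count; i++)
      {
         if (m_items[i].timeout > 0)
            n++;
      }
      return n;
   }

private:
   std::mutex m_mutex;
   size_t m_count;
   Alarm *m_items;
};
```

If the reader took the lock only for part of its loop, or cached m_items before the writer ran, it would end up reading freed memory in the way the report describes.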

edward.borst

And here is another one:

==6816== Thread 66:
==6816== Conditional jump or move depends on uninitialised value(s)
==6816==    at 0x4CC549E: ClientSession::onAlarmUpdate(unsigned int, NXC_ALARM*) (session.cpp:4989)
==6816==    by 0x4C631ED: EnumerateClientSessions(void (*)(ClientSession*, void*), void*) (client.cpp:337)
==6816==    by 0x4C5F278: AlarmManager::newAlarm(char*, char*, int, int, unsigned int, unsigned int, Event*, unsigned int) (alarm.cpp:384)
==6816==    by 0x4C81704: EPRule::generateAlarm(Event*) (epp.cpp:531)
==6816==    by 0x4C81D4A: EPRule::processEvent(Event*) (epp.cpp:468)
==6816==    by 0x4C81F4B: EventPolicy::processEvent(Event*) (epp.cpp:814)
==6816==    by 0x4C862C9: EventProcessor(void*) (evproc.cpp:225)
==6816==    by 0x313D407850: start_thread (in /lib64/libpthread-2.12.so)
==6816==    by 0x313CCE811C: clone (in /lib64/libc-2.12.so)


Hope this helps in finding the cause of the crash.
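
As background on the second report: "Conditional jump or move depends on uninitialised value(s)" means a branch tested memory that was never written. Memory from malloc() or realloc() is not zeroed, so a record copied into such memory with some fields left unset will trigger it. The sketch below shows the usual way to avoid that, by value-initialising the record before filling it. AlarmRecord and makeAlarmRecord are invented names for illustration; nothing here is taken from the NetXMS sources, and the report may equally be a side effect of the invalid read.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical record, standing in for a structure such as NXC_ALARM.
struct AlarmRecord
{
   uint32_t id;
   uint32_t ackTimeout;
   uint8_t state;
   char message[64];
};

// Builds a record in which every field has a defined value,
// including the ones the caller does not care about.
AlarmRecord makeAlarmRecord(uint32_t id, const char *message)
{
   AlarmRecord rec = {};   // value-initialisation zeroes all members
   rec.id = id;
   // Copy at most 63 characters; the last byte stays 0 from the line above.
   std::strncpy(rec.message, message, sizeof(rec.message) - 1);
   return rec;
}
```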

edward.borst

Hello,

Any progress on this?
The server is still crashing several times...

Would it be an option to downgrade to 1.2.14?
That was our previous version.

Thanks,
Edward

Victor Kirhenshtein

Hi!

Can you please provide the full valgrind log (assuming there are more records than you have posted already)? The errors logged so far look like consequences of some memory corruption that happened earlier.

Best regards,
Victor

edward.borst

Hi,

Here is the full log.
It is huge, so I have compressed it.

Hope this helps.