Title: segfault netxmsd crash
Post by: edward.borst on August 25, 2014, 02:47:20 PM

Hi,

We are getting segfaults in netxmsd after upgrading to 1.2.16.
The error is:

netxmsd[32695]: segfault at 7f62ef0d666c ip 00007f62f3806dd0 sp 00007f62ef09ad50 error 4 in libnxcore.so.1.0.0

OS is Oracle Linux (RedHat) 6.3 with kernel: 2.6.32-300.3.1.el6uek.x86_64
glibc version is glibc-2.12-1.80.el6_3.6.i686

Title: Re: segfault netxmsd crash
Post by: Victor Kirhenshtein on August 26, 2014, 12:54:54 PM

Hi,

Can you please run netxmsd under gdb and post the backtrace after the crash? Here are the instructions: http://wiki.netxms.org/wiki/Running_NetXMS_under_debugger

Best regards,
Victor

Title: Re: segfault netxmsd crash
Post by: edward.borst on August 26, 2014, 01:05:00 PM

I am already running it under gdb, but it has not crashed yet...
As soon as I have a trace I will send it to you.

regards,
Edward

Title: Re: segfault netxmsd crash
Post by: edward.borst on August 26, 2014, 04:13:15 PM

here it is:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffef576700 (LWP 19648)]
0x00007ffff7d04da7 in AlarmManager::watchdogThread (this=0x7ffff7ff7740) at alarm.cpp:1090
1090 if ((m_pAlarmList.dwTimeout > 0) &&
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.80.el6_3.6.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-10.el6.x86_64 libcom_err-1.41.12-14.el6.x86_64 libgcc-4.4.6-4.el6.x86_64 libselinux-2.0.94-5.3.el6.x86_64 libstdc++-4.4.6-4.el6.x86_64 mysql-libs-5.1.66-2.el6_3.x86_64 nss-softokn-freebl-3.12.9-11.el6.x86_64 openssl-1.0.0-27.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 0x00007ffff7d04da7 in AlarmManager::watchdogThread (this=0x7ffff7ff7740) at alarm.cpp:1090
#1 0x00007ffff7d04f29 in WatchdogThreadStarter (pArg=<value optimized out>) at alarm.cpp:204
#2 0x000000313d407851 in start_thread () from /lib64/libpthread.so.0
#3 0x000000313cce811d in clone () from /lib64/libc.so.6

regards,
Edward

Title: Re: segfault netxmsd crash
Post by: edward.borst on August 28, 2014, 11:12:38 AM

Hi Victor,

Did you have a chance to look at the trace?
It keeps crashing several times a day.

Title: Re: segfault netxmsd crash
Post by: Alex Kirhenshtein on August 28, 2014, 04:54:05 PM

Hello.

Yes, we spent some time on it, but unfortunately we can't find the root of this problem. Can you please run the server under valgrind until it crashes?

Title: Re: segfault netxmsd crash
Post by: edward.borst on August 28, 2014, 05:23:31 PM

Of course we can.
Are there any specific options you want us to pass to valgrind?

Title: Re: segfault netxmsd crash
Post by: edward.borst on August 28, 2014, 10:12:58 PM

It looks like this issue is coming from pre-1.2.16 alarms.
We had about 6 old alarms with a sticky acknowledge set on them.
I was looking at the alarm.cpp code to see if I could find anything,
and I saw that this piece of code was added in the last release.
So I deleted (terminated) all of the old alarms, and we have not had a crash since.
New alarms have been coming in with no errors.

regards,
Edward

Title: Re: segfault netxmsd crash
Post by: Victor Kirhenshtein on August 28, 2014, 11:01:42 PM

Hi!

Do you mean that you've found a specific place in the added code which caused this crash?

Best regards,
Victor

Title: Re: segfault netxmsd crash
Post by: edward.borst on August 31, 2014, 05:43:57 PM

Hello,
Unfortunately it crashed again, with the same error.
I'm running it under valgrind now, but that takes a lot of resources. I run it with default settings.
It has not crashed yet.

Title: Re: segfault netxmsd crash
Post by: edward.borst on September 01, 2014, 09:07:45 AM

Here is part of an error reported by Valgrind:

==6816== Thread 7:
==6816== Invalid read of size 4
==6816==    at 0x4C5CDD0: AlarmManager::watchdogThread() (alarm.cpp:1105)
==6816==    by 0x4C5CF28: WatchdogThreadStarter(void*) (alarm.cpp:204)
==6816==    by 0x313D407850: start_thread (in /lib64/libpthread-2.12.so)
==6816==    by 0x313CCE811C: clone (in /lib64/libc-2.12.so)
==6816==  Address 0x2c8c57bc is 52,620 bytes inside a block of size 243,984 free'd
==6816==    at 0x4A0610F: realloc (vg_replace_malloc.c:525)
==6816==    by 0x4C5EF8A: AlarmManager::newAlarm(char*, char*, int, int, unsigned int, unsigned int, Event*, unsigned int) (alarm.cpp:359)
==6816==    by 0x4C81704: EPRule::generateAlarm(Event*) (epp.cpp:531)
==6816==    by 0x4C81D4A: EPRule::processEvent(Event*) (epp.cpp:468)
==6816==    by 0x4C81F4B: EventPolicy::processEvent(Event*) (epp.cpp:814)
==6816==    by 0x4C862C9: EventProcessor(void*) (evproc.cpp:225)
==6816==    by 0x313D407850: start_thread (in /lib64/libpthread-2.12.so)
==6816==    by 0x313CCE811C: clone (in /lib64/libc-2.12.so)
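
For readers following the trace: the watchdog thread reads (alarm.cpp:1105) from a block that was released by realloc() in AlarmManager::newAlarm (alarm.cpp:359). That is the usual signature of a reader scanning an array while a writer grows it without the two being serialized, whatever the underlying reason is. Below is a minimal sketch of the safe pattern. GuardedAlarmList and AlarmRecord are hypothetical names invented for illustration; this is not the actual NetXMS AlarmManager code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>

// Hypothetical stand-in for an alarm record; not the real NXC_ALARM layout.
struct AlarmRecord
{
   unsigned int id;
   unsigned int timeout;   // seconds; 0 means "no timeout"
};

// Hypothetical list that grows with realloc() and is guarded by one mutex.
class GuardedAlarmList
{
public:
   GuardedAlarmList() : m_items(nullptr), m_count(0) {}
   ~GuardedAlarmList() { std::free(m_items); }
   GuardedAlarmList(const GuardedAlarmList&) = delete;
   GuardedAlarmList& operator=(const GuardedAlarmList&) = delete;

   // Writer side: realloc() may move the block, so no reader may be
   // inside the array while this runs. The lock enforces that.
   bool add(unsigned int id, unsigned int timeout)
   {
      std::lock_guard<std::mutex> lock(m_mutex);
      void *p = std::realloc(m_items, (m_count + 1) * sizeof(AlarmRecord));
      if (p == nullptr)
         return false;   // on failure the old block is still valid
      m_items = static_cast<AlarmRecord*>(p);
      m_items[m_count].id = id;
      m_items[m_count].timeout = timeout;
      m_count++;
      return true;
   }

   // Reader side (a watchdog-style scan): the whole loop holds the same
   // lock, so the array cannot be moved or freed underneath it.
   size_t countWithTimeout()
   {
      std::lock_guard<std::mutex> lock(m_mutex);
      assert(m_items != nullptr || m_count == 0);
      size_t n = 0;
      for (size_t i = 0; i < m_count; i++)
         if (m_items[i].timeout > 0)
            n++;
      return n;
   }

private:
   std::mutex m_mutex;
   AlarmRecord *m_items;
   size_t m_count;
};
```

In this sketch one thread would call add() for each new alarm while another periodically calls countWithTimeout(). If the lock were missing or non-functional, the scan could dereference the old block after realloc() has released it, which is the kind of access Valgrind flags as an invalid read.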

Title: Re: segfault netxmsd crash
Post by: edward.borst on September 01, 2014, 09:10:34 AM

and here is another one:

==6816== Thread 66:
==6816== Conditional jump or move depends on uninitialised value(s)
==6816==    at 0x4CC549E: ClientSession::onAlarmUpdate(unsigned int, NXC_ALARM*) (session.cpp:4989)
==6816==    by 0x4C631ED: EnumerateClientSessions(void (*)(ClientSession*, void*), void*) (client.cpp:337)
==6816==    by 0x4C5F278: AlarmManager::newAlarm(char*, char*, int, int, unsigned int, unsigned int, Event*, unsigned int) (alarm.cpp:384)
==6816==    by 0x4C81704: EPRule::generateAlarm(Event*) (epp.cpp:531)
==6816==    by 0x4C81D4A: EPRule::processEvent(Event*) (epp.cpp:468)
==6816==    by 0x4C81F4B: EventPolicy::processEvent(Event*) (epp.cpp:814)
==6816==    by 0x4C862C9: EventProcessor(void*) (evproc.cpp:225)
==6816==    by 0x313D407850: start_thread (in /lib64/libpthread-2.12.so)
==6816==    by 0x313CCE811C: clone (in /lib64/libc-2.12.so)

Hope this helps in finding the cause of the crash.

Title: Re: segfault netxmsd crash
Post by: edward.borst on September 02, 2014, 04:33:52 PM

Hello,

Any progress on this?
The server is still crashing several times...

Would it be an option to downgrade to 1.2.14?
This was our previous version.

Thanks,
Edward

Title: Re: segfault netxmsd crash
Post by: Victor Kirhenshtein on September 02, 2014, 10:48:07 PM

Hi!

Can you please provide the full valgrind log (assuming there are more records than you have posted already)? The errors logged look like consequences of some memory corruption that happened earlier.

Best regards,
Victor

Title: Re: segfault netxmsd crash
Post by: edward.borst on September 02, 2014, 10:55:03 PM

Hi,

Here is the full log.
It is huge, so I have compressed it.

Hope this helps.

Title: Re: segfault netxmsd crash
Post by: edward.borst on September 04, 2014, 09:57:28 AM

Any news on this?

I'm open to any ideas that would speed up the resolution of this issue.

Should I add some more debug output to the code?
Reinstall the system? Move to another OS?

Thanks,
Edward

Title: Re: segfault netxmsd crash
Post by: Alex Kirhenshtein on September 05, 2014, 05:45:57 PM

Hello.

Could you please also provide us with the disassembly of two methods: AlarmManager::watchdogThread and AlarmManager::newAlarm?

This can be done with gdb:

$ gdb /opt/netxms/bin/netxmsd
(gdb) info functions AlarmManager::watchdogThread
(gdb) disassemble AlarmManager::watchdogThread
(gdb) disassemble AlarmManager::newAlarm

Title: Re: segfault netxmsd crash
Post by: edward.borst on September 05, 2014, 10:17:20 PM

Hi Alex,

Thanks for responding!
Attached is the output of the disassemble commands.

It may be good to know that I found out that the crash always occurs straight after a new alarm.
Not after each alarm, but after running for a couple of hours (I think after a couple of new alarms).
It also looks like it is related to a critical event (node down or similar).
I'm trying to reproduce the crash, but that is really difficult.

Today we have not had many new alarms, and it has been running all day now.
I expect that as soon as we start some maintenance (rebooting machines etc.) we will have a crash again.

Hope this helps.
Regards,
Edward

Title: Re: segfault netxmsd crash
Post by: edward.borst on September 05, 2014, 10:21:15 PM

Attached is a debug log (level 8) where you can see a new alarm coming in.
This was a node down alarm. Directly after this we got the SIGSEGV.

Title: Re: segfault netxmsd crash
Post by: Victor Kirhenshtein on September 06, 2014, 11:55:49 AM

Hi,

It seems to be one of the most mysterious bugs we've ever encountered. I've done load tests with a lot of alarm generation and termination, and the system works as expected. It seems to be some rare combination of events and/or environment. Is it possible to get remote access to your system to debug it in place? Alternatively, can you try to get a core dump after the crash (most likely it will not be generated by default; you'll have to enable it with ulimit) and send it to us along with the compiled binaries? If you run it as a virtual machine and can provide a VM image, that could also help us with debugging.

Best regards,
Victor

Title: Re: segfault netxmsd crash
Post by: edward.borst on September 06, 2014, 12:42:42 PM

Hi Victor,

Thanks for the reply!

I'm currently running it under GDB, so as soon as I have a crash I will generate a core dump.
Is that ok for you if I do it this way?

Another question:
Is it possible to fake some node down events for existing nodes?
That way I can try to reproduce the crash.

Regards,
Edward

Title: Re: segfault netxmsd crash
Post by: Victor Kirhenshtein on September 06, 2014, 01:51:40 PM

You can generate events using the nxevent command line tool.

Best regards,
Victor

Title: Re: segfault netxmsd crash
Post by: edward.borst on September 06, 2014, 10:29:56 PM

I have generated a core dump.
How would you like me to send/upload it?
Compressed, it is 9M (only the core file).
The compressed tar of the binaries is 70M.

Best regards,
Edward

Title: Re: segfault netxmsd crash
Post by: Alex Kirhenshtein on September 06, 2014, 10:54:57 PM

You can upload it to our anonymous FTP at ftp://netxms.org/upload/. This FTP is upload-only and supports encrypted TLS sessions (optional; you'll need a compatible FTP client like FileZilla).

Title: Re: segfault netxmsd crash
Post by: edward.borst on September 06, 2014, 11:49:47 PM

Thanks,

I have uploaded the following files:
core.tgz
server-binaries.tgz

do you need anything else?
regards,
Edward

Title: Re: segfault netxmsd crash
Post by: edward.borst on September 07, 2014, 12:37:34 AM

I have uploaded a new core file.
This one is generated by abrt
regards,
Edward

Title: Re: segfault netxmsd crash
Post by: Victor Kirhenshtein on September 07, 2014, 03:42:46 PM

Hi,

So far our analysis shows that the root of the problem is that some internal data structures are not initialized properly (if you are familiar with C++: it seems that the constructor for the global instance of the AlarmManager class is not called), although it is not clear why this happens. I tried your binaries on CentOS 6.3 (the closest that I could find) and they work as expected. Is it an option to upgrade the system to the latest versions of the kernel and glibc and try again? If this does not help, I'll rewrite the alarm manager initialization and provide you with an intermediate build for testing.
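
For readers less familiar with C++: a global object depends on its constructor being run by the runtime at program or library load time. One common way to remove that dependency is the construct-on-first-use idiom, sketched below. DemoManager and GetDemoManager are hypothetical names invented for illustration; this is not the NetXMS code and not necessarily the rewrite referred to in this post.

```cpp
#include <cassert>

// Hypothetical manager class; it only illustrates initialization order.
class DemoManager
{
public:
   DemoManager() : m_initialized(true), m_count(0) {}
   bool isInitialized() const { return m_initialized; }
   int addItem() { return ++m_count; }

private:
   bool m_initialized;
   int m_count;
};

// Construct-on-first-use accessor: the instance is created the first
// time this function is called, not during static initialization of
// the library, so every caller sees a fully constructed object.
DemoManager& GetDemoManager()
{
   static DemoManager instance;   // C++11 guarantees thread-safe initialization here
   assert(instance.isInitialized());
   return instance;
}
```

Callers would write GetDemoManager().addItem() instead of touching a global variable directly. The trade-off is a small check on each call and less control over when the object is destroyed at exit.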

Best regards,
Victor

Title: Re: segfault netxmsd crash
Post by: edward.borst on September 08, 2014, 01:56:48 PM

Hi Victor,

I have upgraded my system from Oracle Linux 6.3 to 6.5.
Unfortunately, glibc is not updated in that release.
netxmsd also crashed again after I gave it a flood of alarms.

So the situation has not changed so far.
Regards,
Edward

Title: Re: segfault netxmsd crash
Post by: edward.borst on September 08, 2014, 11:52:57 PM

Hi,

Some news:
I have migrated my complete system from Oracle Linux to Debian 7.6,
and guess what? The problem has not been reproduced yet...

I will leave this new system running for now and see how it goes.
Only one thing: I see netxmsd running at 100% continuously.
Maybe we have to look at that some time.

Regards,
Edward

Title: Re: segfault netxmsd crash
Post by: Victor Kirhenshtein on September 09, 2014, 06:42:48 PM

Hi,

When the system is at 100%, please attach to the netxmsd process with gdb and send me the output of the command

thread apply all bt

Best regards,
Victor

Title: Re: segfault netxmsd crash
Post by: edward.borst on September 09, 2014, 08:27:49 PM

Attached is the log from gdb.
Look for thread 9332; that is the one currently running at 100%.

Regards,
Edward

Title: Re: segfault netxmsd crash
Post by: Victor Kirhenshtein on September 09, 2014, 08:34:53 PM

Hi,

It seems that it is the update of the IP topology network map. I'll take a look at it.

Best regards,
Victor

NetXMS Support Forum

English Support => General Support => Topic started by: edward.borst on August 25, 2014, 02:47:20 PM