Hi,
We are getting segfaults on the netxmsd after upgrading to 1.2.16.
error is:
netxmsd[32695]: segfault at 7f62ef0d666c ip 00007f62f3806dd0 sp 00007f62ef09ad50 error 4 in libnxcore.so.1.0.0
OS is Oracle Linux (RedHat) 6.3 with kernel: 2.6.32-300.3.1.el6uek.x86_64
glibc version is glibc-2.12-1.80.el6_3.6.i686
Hi,
can you please run netxmsd under gdb and post the backtrace after a crash? Instructions are here: http://wiki.netxms.org/wiki/Running_NetXMS_under_debugger
Best regards,
Victor
I am already running it under gdb, but it has not crashed yet...
As soon as I have a trace I will send it to you.
regards,
Edward
here it is:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffef576700 (LWP 19648)]
0x00007ffff7d04da7 in AlarmManager::watchdogThread (this=0x7ffff7ff7740) at alarm.cpp:1090
1090 if ((m_pAlarmList.dwTimeout > 0) &&
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.80.el6_3.6.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-10.el6.x86_64 libcom_err-1.41.12-14.el6.x86_64 libgcc-4.4.6-4.el6.x86_64 libselinux-2.0.94-5.3.el6.x86_64 libstdc++-4.4.6-4.el6.x86_64 mysql-libs-5.1.66-2.el6_3.x86_64 nss-softokn-freebl-3.12.9-11.el6.x86_64 openssl-1.0.0-27.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 0x00007ffff7d04da7 in AlarmManager::watchdogThread (this=0x7ffff7ff7740) at alarm.cpp:1090
#1 0x00007ffff7d04f29 in WatchdogThreadStarter (pArg=<value optimized out>) at alarm.cpp:204
#2 0x000000313d407851 in start_thread () from /lib64/libpthread.so.0
#3 0x000000313cce811d in clone () from /lib64/libc.so.6
regards,
Edward
Hi Victor,
Did you have a chance to look at the trace?
it keeps crashing several times a day....
Hello.
Yes, we spent some time on it, but unfortunately could not find the root of this problem. Can you please run the server under valgrind until it crashes?
of course we can.
any options you want to give valgrind to check?
It looks like this issue is coming from pre-1.2.16 alarms.
We had about 6 old alarms with a sticky acknowledge set on them.
I was looking at the alarm.cpp code to see if I could find anything.
I saw that this piece of code was added in the last release.
So I deleted (terminated) all of the old alarms, and we have not had a crash since.
New alarms have been coming in with no error.
regards,
Edward
Hi!
you mean that you've found some specific place in added code which caused this crash?
Best regards,
Victor
Hello,
Unfortunately it crashed again, with the same error.
I'm running it now under valgrind with default settings, but that takes a lot of resources.
It has not crashed yet.
Here is a part from an error from Valgrind:
==6816== Thread 7:
==6816== Invalid read of size 4
==6816== at 0x4C5CDD0: AlarmManager::watchdogThread() (alarm.cpp:1105)
==6816== by 0x4C5CF28: WatchdogThreadStarter(void*) (alarm.cpp:204)
==6816== by 0x313D407850: start_thread (in /lib64/libpthread-2.12.so)
==6816== by 0x313CCE811C: clone (in /lib64/libc-2.12.so)
==6816== Address 0x2c8c57bc is 52,620 bytes inside a block of size 243,984 free'd
==6816== at 0x4A0610F: realloc (vg_replace_malloc.c:525)
==6816== by 0x4C5EF8A: AlarmManager::newAlarm(char*, char*, int, int, unsigned int, unsigned int, Event*, unsigned int) (alarm.cpp:359)
==6816== by 0x4C81704: EPRule::generateAlarm(Event*) (epp.cpp:531)
==6816== by 0x4C81D4A: EPRule::processEvent(Event*) (epp.cpp:468)
==6816== by 0x4C81F4B: EventPolicy::processEvent(Event*) (epp.cpp:814)
==6816== by 0x4C862C9: EventProcessor(void*) (evproc.cpp:225)
==6816== by 0x313D407850: start_thread (in /lib64/libpthread-2.12.so)
==6816== by 0x313CCE811C: clone (in /lib64/libc-2.12.so)
and here is another one:
==6816== Thread 66:
==6816== Conditional jump or move depends on uninitialised value(s)
==6816== at 0x4CC549E: ClientSession::onAlarmUpdate(unsigned int, NXC_ALARM*) (session.cpp:4989)
==6816== by 0x4C631ED: EnumerateClientSessions(void (*)(ClientSession*, void*), void*) (client.cpp:337)
==6816== by 0x4C5F278: AlarmManager::newAlarm(char*, char*, int, int, unsigned int, unsigned int, Event*, unsigned int) (alarm.cpp:384)
==6816== by 0x4C81704: EPRule::generateAlarm(Event*) (epp.cpp:531)
==6816== by 0x4C81D4A: EPRule::processEvent(Event*) (epp.cpp:468)
==6816== by 0x4C81F4B: EventPolicy::processEvent(Event*) (epp.cpp:814)
==6816== by 0x4C862C9: EventProcessor(void*) (evproc.cpp:225)
==6816== by 0x313D407850: start_thread (in /lib64/libpthread-2.12.so)
==6816== by 0x313CCE811C: clone (in /lib64/libc-2.12.so)
Hope this helps finding the crash cause.
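The first Valgrind report above shows the watchdog thread reading from a block that newAlarm had already passed to realloc. One common way such a report arises is a reader running without effective synchronization against a writer that grows the array. Whether that is the actual cause here is not established in this thread. A minimal C++11 sketch of the guarded pattern follows; AlarmRecord and GuardedAlarmList are hypothetical names and are not the NetXMS classes:

```cpp
#include <cassert>
#include <cstdlib>
#include <mutex>
#include <new>

// Hypothetical stand-in for an alarm record; only the timeout field matters here.
struct AlarmRecord
{
   unsigned int id;
   unsigned int timeout;   // seconds; 0 means "no timeout"
};

// Hypothetical growable list (NOT the NetXMS AlarmManager).
// Every access to m_items goes through m_lock, because realloc may move
// the block and invalidate any pointer obtained before the call.
class GuardedAlarmList
{
public:
   GuardedAlarmList() : m_items(nullptr), m_count(0) {}
   ~GuardedAlarmList() { free(m_items); }

   // Writer side: grow the array by one element under the lock.
   void add(unsigned int id, unsigned int timeout)
   {
      std::lock_guard<std::mutex> guard(m_lock);
      void *p = realloc(m_items, (m_count + 1) * sizeof(AlarmRecord));
      if (p == nullptr)
         throw std::bad_alloc();   // the old block is still valid and still owned
      m_items = static_cast<AlarmRecord*>(p);
      m_items[m_count].id = id;
      m_items[m_count].timeout = timeout;
      m_count++;
   }

   // Reader side (what a watchdog-style thread would call): the scan holds
   // the same lock, so the block cannot be moved while it is being read.
   size_t countWithTimeout()
   {
      std::lock_guard<std::mutex> guard(m_lock);
      size_t n = 0;
      for (size_t i = 0; i < m_count; i++)
      {
         if (m_items[i].timeout > 0)
            n++;
      }
      return n;
   }

private:
   AlarmRecord *m_items;
   size_t m_count;
   std::mutex m_lock;
};
```

The point of the sketch is only that the reader and the writer share one lock. If the lock itself is never initialized, the same symptom can appear even though the locking calls are present in the code.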
Hello,
Any progress on this?
Server is still crashing several times...
Would it be an option to downgrade to 1.2.14?
This was our previous version.
Thanks,
Edward
Hi!
can you please provide the full valgrind log (assuming there are more records than you have already posted)? The errors logged look like consequences of some memory corruption that happened earlier.
Best regards,
Victor
Hi,
Here is the full log.
It is huge, so I have compressed it.
Hope this helps.
any news on this?
I'm open to any ideas to speed up the resolution of this issue.
include some more info in the code?
reinstall system? move to other OS?
Thanks,
Edward
Hello.
Could you please also provide us with disassembly of two methods: AlarmManager::watchdogThread and AlarmManager::newAlarm?
This can be done with gdb:
$ gdb /opt/netxms/bin/netxmsd
(gdb) info functions AlarmManager::watchdogThread
(gdb) disassemble AlarmManager::watchdogThread
(gdb) disassemble AlarmManager::newAlarm
Hi Alex,
Thanks for responding!
Attached is the output of the disassemble commands.
It may be good to know that I found out that the crash always occurs straight after a new alarm.
Not after each alarm, but after running for a couple of hours (I think after a couple of new alarms).
It also looks like it is related to a critical event (node down or so).
I'm trying to reproduce the crash, but that is really difficult.
We have not had many new alarms today, and it has been running all day now.
I expect that as soon as we start some maintenance (rebooting machines etc.) we will have a crash again.
Hope this helps.
Regards,
Edward
Attached is a debug log (level 8) where you can see a new alarm coming in.
This was a node down alarm. Directly after this we got the SIGSEGV.
Hi,
it seems to be one of the most mysterious bugs we've ever encountered. I've done load tests with a lot of alarm generation and termination, and the system works as expected. It seems to be some rare combination of events and/or environment. Is it possible to get remote access to your system to debug it in place? Alternatively, can you try to get a core dump after a crash (most likely it will not be generated by default, you'll have to enable it with ulimit) and send it to us along with the compiled binaries? If you run it as a virtual machine and can provide a VM image, that could also help us with debugging.
Best regards,
Victor
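As an illustration of the ulimit suggestion above: the shell's "ulimit -c" adjusts the RLIMIT_CORE resource limit, and a process can raise its own soft limit in the same way through the POSIX setrlimit call. A minimal sketch, assuming a POSIX system; the helper name enableCoreDumps is hypothetical and is not part of NetXMS:

```cpp
#include <cassert>
#include <sys/resource.h>

// Hypothetical helper: raise the soft core-file size limit to the hard limit,
// the programmatic counterpart of raising it with "ulimit -c" in the shell.
// Returns true if both the query and the update succeeded.
bool enableCoreDumps()
{
   struct rlimit rl;
   if (getrlimit(RLIMIT_CORE, &rl) != 0)
      return false;
   rl.rlim_cur = rl.rlim_max;   // an unprivileged process may raise soft up to hard
   return setrlimit(RLIMIT_CORE, &rl) == 0;
}
```

If the hard limit is itself 0, the call succeeds but core files stay disabled, so the hard limit has to be raised by the administrator first.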
Hi Victor,
Thanks for the reply!
I'm currently running it under GDB, so as soon as I have a crash I will generate a core dump.
Is that ok for you if I do it this way?
Another question:
Is it possible to fake some node down events for existing nodes?
That way I can try to reproduce the crash.
Regards,
Edward
You can generate events using the nxevent command line tool.
Best regards,
Victor
I have a core dump generated.
How would you like me to send/upload it?
Compressed, it is 9M (only the core file).
The compressed tar of the binaries is 70M.
Best regards,
Edward
You can upload it to our anonymous FTP at ftp://netxms.org/upload/. This FTP is upload-only and supports encrypted TLS sessions (optional, you'll need a compatible FTP client like FileZilla).
Thanks,
I have uploaded the following files:
core.tgz
server-binaries.tgz
do you need anything else?
regards,
Edward
I have uploaded a new core file.
This one is generated by abrt
regards,
Edward
Hi,
so far our analysis shows that the root of the problem is that some internal data structures are not initialized properly (if you are familiar with C++: it seems that the constructor for the global instance of the AlarmManager class is not called), although it is not clear why this happens. I tried your binaries on CentOS 6.3 (the closest that I could find) and they work as expected. Is it an option to upgrade the system to the latest versions of kernel and glibc and try again? If this does not help, I'll rewrite the alarm manager initialization and provide you with an intermediate build for testing.
Best regards,
Victor
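For readers less familiar with the C++ issue described above: a global object depends on its constructor being run by the runtime before first use. One well-known way to avoid relying on that is the "construct on first use" idiom, where the instance is a function-local static. The sketch below only illustrates that idiom and is not the rewrite mentioned above; AlarmRegistry and alarmRegistry() are hypothetical names, and C++11 is assumed:

```cpp
#include <cassert>
#include <mutex>

// Hypothetical class standing in for a manager object with internal state
// that must be set up by its constructor before any use.
class AlarmRegistry
{
public:
   AlarmRegistry() : m_initialized(true), m_count(0) {}

   bool isInitialized() const { return m_initialized; }

   void add()
   {
      std::lock_guard<std::mutex> guard(m_lock);
      m_count++;
   }

   int count()
   {
      std::lock_guard<std::mutex> guard(m_lock);
      return m_count;
   }

private:
   bool m_initialized;
   int m_count;
   std::mutex m_lock;
};

// Construct on first use: the instance is created the first time this
// function is called, not during global static initialization.
// Since C++11 the initialization of a function-local static is thread-safe.
AlarmRegistry& alarmRegistry()
{
   static AlarmRegistry instance;
   return instance;
}
```

Callers use alarmRegistry().add() instead of touching a global variable, so the constructor is guaranteed to have run before the first access.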
Hi Victor,
I have upgraded my system from Oracle Linux 6.3 to 6.5
Unfortunately glibc is not updated in that release.
Also, netxmsd crashed again after I gave it a flood of alarms.
So the situation has not changed so far.
Regards,
Edward
Hi,
Some news:
I have migrated my complete system from Oracle Linux to Debian 7.6
and guess what? The problem has not reproduced yet...
I will leave this new system running for now and see how it goes.
Only one thing: I see netxmsd running at 100% CPU continuously.
Maybe we have to look at that some time.
Regards,
Edward
Hi,
when the system is at 100%, please attach to the netxmsd process with gdb and send me the result of the command
thread apply all bt
Best regards,
Victor
Attached is the log from gdb.
Look for thread 9332, because that is the one currently running at 100%.
Regards,
Edward
Hi,
it seems to be the update of the IP topology network map. I'll take a look at it.
Best regards,
Victor