segfault netxmsd crash

Started by edward.borst, August 25, 2014, 02:47:20 PM


edward.borst

Any news on this?

I'm open to any ideas to speed up the resolution of this issue.

Include some more info in the code?
Reinstall the system? Move to another OS?

Thanks,
Edward



Alex Kirhenshtein

Hello.

Could you please also provide us with disassembly of two methods: AlarmManager::watchdogThread and AlarmManager::newAlarm?

This can be done with gdb:
$ gdb /opt/netxms/bin/netxmsd
(gdb) info functions AlarmManager::watchdogThread
(gdb) disassemble AlarmManager::watchdogThread
(gdb) disassemble AlarmManager::newAlarm

edward.borst

Hi Alex,

Thanks for responding!
Attached is the disassembly output.


It may be good to know that the crash always occurs straight after a new alarm.
Not after every alarm, but after running for a couple of hours (I think after a couple of new alarms).
It also looks like it is related to a critical event (node down or similar).
I'm trying to reproduce the crash, but that is really difficult.

Today we have not had many new alarms, and it has been running all day.
I expect that as soon as we start some maintenance (rebooting machines etc.) we will have a crash again.

Hope this helps.
Regards,
Edward

edward.borst

Attached is a debug log (level 8) where you can see a new alarm coming in.
This was a node down alarm. Directly after this we got the SIGSEGV.


Victor Kirhenshtein

Hi,

This seems to be one of the most mysterious bugs we've ever encountered. I've done load tests with a lot of alarm generation and termination, and the system works as expected. It seems to be some rare combination of events and/or environment. Is it possible to get remote access to your system to debug it in place? Alternatively, can you try to get a core dump after the crash (most likely it will not be generated by default, you'll have to enable it with ulimit) and send it to us along with the compiled binaries? If you run it as a virtual machine and can provide a VM image, that could also help us with debugging.

Best regards,
Victor

edward.borst

Hi Victor,

Thanks for the reply!

I'm currently running it under GDB, so as soon as I have a crash I will generate a core dump.
Is it OK for you if I do it this way?


Another question:
Is it possible to fake some node down events for existing nodes?
That way I can try to reproduce the crash.

Regards,
Edward

Victor Kirhenshtein

You can generate events using the nxevent command line tool.

Best regards,
Victor

edward.borst

I have generated a core dump.
How would you like me to send/upload it?
It is 9M compressed (only the core file).
The compressed tar of the binaries is 70M.

Best regards,
Edward

Alex Kirhenshtein

You can upload it to our anonymous FTP at ftp://netxms.org/upload/. This FTP is upload-only and supports encrypted TLS sessions (optional, you'll need a compatible FTP client like FileZilla).

edward.borst

Thanks,

I have uploaded the following files:
core.tgz
server-binaries.tgz

Do you need anything else?
Regards,
Edward

edward.borst

I have uploaded a new core file.
This one was generated by abrt.
Regards,
Edward

Victor Kirhenshtein

Hi,

So far our analysis shows that the root of the problem is that some internal data structures are not initialized properly (if you are familiar with C++, it seems that the constructor for the global instance of the AlarmManager class is not called), although it is not clear why this happens. I tried your binaries on CentOS 6.3 (the closest I could find) and they work as expected. Is it an option to upgrade the system to the latest versions of the kernel and glibc and try again? If this does not help, I'll re-write the alarm manager initialization and provide you with an intermediate build for testing.
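
To illustrate what depending on a global instance's constructor means, and one common way to avoid that dependency, here is a minimal C++ sketch of the construct-on-first-use idiom. It is not NetXMS code: AlarmRegistry, registry() and new_alarm() are hypothetical names, and whether a re-written alarm manager initialization would look like this is purely an assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an alarm manager (NOT the NetXMS AlarmManager).
class AlarmRegistry {
public:
    AlarmRegistry() : m_ready(true) {}
    void add(int id) { m_alarms.push_back(id); }
    std::size_t count() const { return m_alarms.size(); }
    bool ready() const { return m_ready; }

private:
    std::vector<int> m_alarms;  // internal structure that needs the constructor
    bool m_ready;               // set only when the constructor has actually run
};

// Construct-on-first-use: the instance is a function-local static, so it is
// built the first time registry() is called rather than during the program's
// global static-initialization phase.
AlarmRegistry& registry() {
    static AlarmRegistry instance;
    return instance;
}

// Hypothetical caller: every access goes through registry(), so the object
// is guaranteed to be constructed before it is used.
std::size_t new_alarm(int id) {
    registry().add(id);
    return registry().count();
}
```

With this pattern the first call to new_alarm() constructs the registry, and later calls reuse the same object. Since C++11 the initialization of a function-local static is also thread-safe, which matters when several threads may raise the first alarm at the same time.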

Best regards,
Victor

edward.borst

Hi Victor,

I have upgraded my system from Oracle Linux 6.3 to 6.5.
Unfortunately, glibc is not updated in that release.
netxmsd still crashed after I gave it a flood of alarms.

So the situation has not changed so far.
Regards,
Edward

edward.borst

Hi,

Some news:
I have migrated my complete system from Oracle Linux to Debian 7.6,
and guess what? The problem has not been reproduced yet...

I will leave this new system running for now and see how it goes.
Only one thing: I see netxmsd running at 100% CPU continuously.
Maybe we have to look at that some time.

Regards,
Edward

Victor Kirhenshtein

Hi,

When the system is at 100%, please attach to the netxmsd process with gdb and send me the result of the command

thread apply all bt

Best regards,
Victor