NetXMS server seg. faults on libnxsnmp.so

Started by Anders, June 08, 2011, 10:26:48 AM

Anders

Hi,

My NetXMSd (Server 1.0.11) seems to be crashing once or twice every day with the following message from SYSLOG (I don't know if the SQL queries are related, but they seem to happen just before the crash):


Jun  7 21:41:39 HOSTNAME netxmsd[19204]: SQL query failed (Query = "SELECT var_value FROM config WHERE var_name='CapabilityExpirationTime'"):
Jun  7 21:41:39 HOSTNAME netxmsd[19204]: SQL query failed (Query = "SELECT var_value FROM config WHERE var_name='SMTPRetryCount'"):
Jun  7 21:41:39 HOSTNAME netxmsd[19204]: SQL query failed (Query = "SELECT var_value FROM config WHERE var_name='SMTPServer'"):
Jun  7 21:41:40 HOSTNAME kernel: [3067348.518912] netxmsd[19351]: segfault at 0 ip 0011432c sp af3812a0 error 6 in libnxsnmp.so.0.0.2[110000+e000]
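The kernel line above already carries some useful detail. A minimal sketch of how it could be decoded, assuming the usual x86 Linux message layout (`segfault at <address> ip <instruction pointer> ... in <library>[<base>+<size>]`); the function name `parse_segfault` is made up for this illustration:

```python
import re

# Kernel line copied from the syslog excerpt above.
LINE = ("Jun  7 21:41:40 HOSTNAME kernel: [3067348.518912] netxmsd[19351]: "
        "segfault at 0 ip 0011432c sp af3812a0 error 6 "
        "in libnxsnmp.so.0.0.2[110000+e000]")

# Assumed layout of the message; all numeric fields except "error" are hex.
PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def parse_segfault(line):
    """Hypothetical helper: extract the faulting address, the library name,
    and the offset of the faulting instruction inside that library."""
    m = PATTERN.search(line)
    if m is None:
        return None
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return {
        "fault_address": int(m.group("addr"), 16),
        "library": m.group("lib"),
        "offset": ip - base,  # instruction pointer relative to the load address
    }

info = parse_segfault(LINE)
print(info["library"], hex(info["offset"]), info["fault_address"])
# prints: libnxsnmp.so.0.0.2 0x432c 0
```

A fault address of 0 usually points to a NULL pointer being dereferenced. If the library was built with debug symbols, the offset can be passed to a tool such as `addr2line -e <library> 0x432c` to look up the source location.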


The only thing I can think of that might be related is that I recently added a MIB file for the D-Link Enterprise Access Point DAP-2690, which can be found in this ZIP file: ftp://ftp.dlink.com/Wireless/dap2690/Firmware/dap2690_FW_102.zip

I'm thankful for any advice. I found an earlier post with instructions on how to create a crash dump, so I will post one when the next incident occurs.

Thanks!

Victor Kirhenshtein

Hi!

The failed SQL queries suggest that something is wrong with the DB connection, but I'm not sure that it's related. A crash dump would be very helpful.

Best regards,
Victor

viesic

I have the same problem. We are monitoring over 200 printers with SNMP (page counters, toner levels, etc.).
The server was version 1.0.11 on Ubuntu Server. I don't remember how all this started, but now when I start netxmsd, the process quickly (within 1-2 hours) eats up all available memory and swap space, and then crashes with the same message in syslog, indicating a segfault in libnxsnmp.so.
I tried compiling a new server on FreeBSD with the same DB, and the result was the same (segfault in ~2 hours). I then updated the server to the newest version, 1.1.1. At first glance everything worked correctly, but later the server crashed anyway:
Jun  8 21:46:02 netxms kernel: pid 37095 (netxmsd), uid 0: exited on signal 11 (core dumped)
I will try to get a crash dump from the 1.0.11 Ubuntu server too.

Anders

Hi,

Hmm... I've added the following entries to my configuration file, netxmsd.conf, but for some reason it doesn't produce any dump upon crashing.


CreateCrashDumps = yes
FullCrashDumps = yes
DumpDirectory = /tmp/netxms # Yes, this folder exists and I've even tried changing the permissions to 777 just to be safe.


Any idea what I could be doing wrong?

Alex Kirhenshtein

Right now "CreateCrashDumps" is a Windows-only parameter.
You can run the server under gdb:

gdb /path/to/netxmsd

Then, at the GDB prompt:

(gdb) run -c /path/to/netxmsd.conf -D3

When it crashes, run:

thread apply all bt

Anders


Victor Kirhenshtein

Hi!

Is it possible that there is a low-memory condition on the NetXMS server? From the crash dump, it looks like some internal memory allocation fails.

Best regards,
Victor

Anders

Hi Victor,

You are right, it seems to be some sort of low memory condition.

The machine that's hosting my NetXMS server is equipped with 2 GB of RAM (I doubt that NetXMS should require more than that?).
A quick process listing shows that netxmsd uses 84% (!) of the available system memory.

Could this maybe be caused by a too-high NumberOfDataCollectors value? I have mine set to 50, as I'm monitoring about 50-60 machines with about 5-20 DCIs per machine, polled every 1-5 minutes.

Victor Kirhenshtein

Hi!

Looks like a memory leak in netxmsd. Can you please start monitoring the memory used by the netxmsd process and send me how it changes over time?

Best regards,
Victor
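One way to collect such numbers on Linux is to read the resident set size from /proc. This is only a sketch: it assumes a Linux host (a FreeBSD system would need a different source), the helper names `parse_vmrss_kb` and `read_rss_kb` are made up for illustration, and the sample text below is invented test data rather than real output from this server.

```python
import re

def parse_vmrss_kb(status_text):
    """Return the VmRSS value in kB from the text of /proc/<pid>/status,
    or None if the field is missing."""
    m = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else None

def read_rss_kb(pid):
    """Read the current resident set size of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vmrss_kb(f.read())

# Invented sample in the format of /proc/<pid>/status, for demonstration only.
SAMPLE = "Name:\tnetxmsd\nVmPeak:\t 1800000 kB\nVmRSS:\t 1720320 kB\n"
print(parse_vmrss_kb(SAMPLE))  # prints: 1720320
```

Calling `read_rss_kb(pid)` at a fixed interval and logging the result with a timestamp gives the kind of memory-over-time series requested here.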

Anders

Sure... just to clarify, you want me to monitor the memory usage over time of the netxmsd process?

Victor Kirhenshtein


Anders

I activated monitoring of the memory usage for the netxmsd process yesterday, just after your message. This is what has been logged so far (see the attachment).

For now, I have built a script that automatically checks for a netxmsd process and starts it if it isn't running, hence the "rollercoaster" look of the memory usage graph :)
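The script itself was not posted. A rough sketch of what such a watchdog might look like is given here, assuming `pgrep` is available on the host; `START_COMMAND` is a placeholder that must be replaced with whatever actually starts the server on a given system, and all function names are hypothetical.

```python
import subprocess

# Placeholder, not a real path: substitute the command that starts your server.
START_COMMAND = ["/path/to/start-netxmsd"]

def is_running(name="netxmsd"):
    """True if a process with exactly this name exists (pgrep exits with 0)."""
    result = subprocess.run(["pgrep", "-x", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0

def start_server():
    """Launch the server using the placeholder command above."""
    subprocess.run(START_COMMAND, check=False)

def ensure_running(check, start):
    """Call start() only when check() reports that the process is down.
    Returns True if a restart was triggered."""
    if check():
        return False
    start()
    return True

# Demonstration with stand-in callables, so nothing is actually started here.
calls = []
restarted = ensure_running(lambda: False, lambda: calls.append("start"))
print(restarted, calls)  # prints: True ['start']

# Real use, e.g. from cron every minute:
#   ensure_running(is_running, start_server)
```

A watchdog like this only hides the leak: each restart resets the memory usage, which is what produces the sawtooth pattern in the graph.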

Victor Kirhenshtein

Hi!

I have found and fixed some memory leaks, but none of them seems to be able to eat memory so fast. However, can you try installing the following version: https://www.netxms.org/download/dev/netxms-1.0.12-rc-19062011.tar.gz? If this does not help, could you please run your server under valgrind for 5-10 minutes and send me the log? The command for running the server under valgrind is the following:

valgrind --leak-check=full --undef-value-errors=no --log-file=/tmp/netxmsd_valgrind.log /opt/netxms/bin/netxmsd -D 5

To stop the server, enter the command "down" at the server's prompt.

Best regards,
Victor

Anders

Sure, I've just built a deb package with the following configuration options. I will keep you informed about the progress...


dh_auto_configure -- --prefix=/usr/ --with-openssl --with-gd --with-pgsql --with-nxhttpd --with-agent --with-server --with-client

Anders

It seems that 1.0.12-RC19062011 solved my problem :)

The netxmsd daemon has been stable at about 25-30 MB of memory for over 24 hours now.