Server running out of memory and killing NetXMS

Started by Millenium7, January 23, 2023, 12:18:04 AM


Millenium7

This might seem obvious.... add more memory. The thing is, I've bumped the VM from 4GB (which was fine a few months ago) up to 8GB without adding much more to the network, and it's still running out of memory after a few days. So I'm unsure if there's a memory leak, or a patch has made NetXMS consume a whole lot more memory all of a sudden, or what.....

This is on a hosted instance, so memory upgrades are not cheap.
How do I go about troubleshooting and finding out why NetXMS is consuming so much memory?
Otherwise are there any obvious things to look for in my main config or polling templates to adjust to bring the memory usage down?

Dawid Kellerman

Hi
I am not able to help you with this specific problem.
You will need to set the debug level higher, either in the config file or via the management console in the "Server Console" under Tools. If you need more help, just search for "debug" here on the forums.
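
For example - just a sketch, and the config file path is an assumption for a default install (it may differ on your system):

debug 6
    (typed into Tools -> Server Console; raises log verbosity at runtime)
DebugLevel = 6
    (added to /etc/netxmsd.conf; takes effect after restarting netxmsd)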

That will log more info so that you can see what is happening, and it will enable members who know more than me to assist you.

Good Luck

Filipp Sudanov

Hi!

What exact version of NetXMS are you running?
Can you show the output of "sh st" from Tools -> Server Console?


Are the default System -> NetXMS Server templates up to date? Updated template versions are not imported automatically - you can set the server configuration parameter Server.ImportConfigurationOnStartup to "Always" and restart the server.
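
For example (the menu location is from memory and may differ slightly between console versions):

Configuration -> Server Configuration -> Server.ImportConfigurationOnStartup = Always
    (then restart netxmsd so the bundled templates are re-imported on startup)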

From the server template, it would be worth looking at the graph for the DCI "NetXMS server: physical memory used by process" (together with "System: available physical memory").
Also the graph for "Server QueueSize DBWriter.Total: Current" - that's how much data is waiting to be written to the DB. If the DB is slow, this queue can grow and consume memory.
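
You can also peek at the queues directly from the server debug console - a sketch, assuming the command set in your version matches the current docs:

show queues
    (in Tools -> Server Console; lists internal queue sizes, including the DB writer queues)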


What about the DB - is it running on the same VM?
Is anything else running on this VM? The NetXMS web UI?

Millenium7

4.2.461 (the problem has been happening for a few versions now)

I have just updated and rebooted the server, but will get back to you.


NetXMS server, agent, web GUI and DB are all running on the same server (which has been fine for years, though).

The NetXMS server process memory usage is interesting. I only have retention going back 31 days, so it doesn't paint a full picture (and I've only restarted it once in the past month, when it was down for a few days), but memory appears to rise very steadily until it crashes. It's a very linear climbing line.

Filipp Sudanov

You may try to run netxmsd under valgrind:

valgrind --leak-check=full --undef-value-errors=no --log-file=vg.log netxmsd -D1

Running it for a day should collect enough information; then, when the process is stopped, valgrind will write a log file with information about memory leaks.
The problem is that valgrind requires much more RAM than netxmsd itself, as it keeps information about every allocated piece of memory, so chances are it won't start at all.
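
A rough outline of running it this way, assuming netxmsd is normally started by a systemd unit named "netxmsd" (the unit name may differ depending on how it was packaged):

systemctl stop netxmsd
valgrind --leak-check=full --undef-value-errors=no --log-file=vg.log netxmsd -D1
    (let it run for a day, then stop the process cleanly, e.g. with Ctrl+C, so valgrind can write vg.log)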


The other option is to compile NetXMS from sources, adding --enable-sanitizer to the ./configure command. This will print information about memory leaks to the console and consumes a bit less memory than the valgrind approach.
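
A sketch of such a build - --enable-sanitizer is the flag mentioned above, while --with-server and --with-pgsql are assumptions for a typical server build against PostgreSQL (pick the DB driver flag that matches your setup):

./configure --with-server --with-pgsql --enable-sanitizer
make
make install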


Also, can you give a list of the things you use in your monitoring - agent tunnels, SNMP, agent monitoring, actions that execute commands on the management server, notification channels - any functionality that is not configured out of the box?

Millenium7

Memory usage by NetXMS shows a linear upward pattern, and "System: available physical memory" shows exactly the opposite. So it definitely looks like a memory leak.

Database writes are pretty consistent - the queue spikes up to 1.4k and then drops back down to 0, only very occasionally going higher (the highest was 3.4k, just a one-off).

What's interesting is that "Agent communications: unsupported requests" as well as "Agent communications: failed requests" perfectly correlate with the memory usage graph - only going up, never down (until the server is restarted).
The vast majority of what I monitor is through SNMP. I don't use any other NetXMS agents - just the one on the server itself.

Could the agent be holding failed connections open indefinitely?

Filipp Sudanov

Agent communication DCIs are counters (similar to counters in SNMP) - they always increase, and that's fine. The interesting part is how fast they increase. As an option, you can add additional DCIs with delta calculation enabled.

It would be interesting to see graphs of these two Agent communication DCIs for the same period as the memory usage above.