Out of memory NetXMS v1.2.5

Started by millerpaint, February 02, 2013, 02:08:37 AM


Victor Kirhenshtein

Hi!

We are trying to find it. Can you describe what features you are using? Most important is to know about network discovery and scripts. Also, can it be that you have routers with very large routing tables (like full BGP table) monitored by NetXMS?

Best regards,
Victor

testos


It seems like a hardware failure.
A CentOS kernel is not supposed to panic under any circumstance.
There are no "normal" circumstances beyond a hardware failure or kernel bug that should cause one.
Try running memtest86 or memtest86+ and perform a BIOS address test on the physical machine.

Alex Kirhenshtein

Quote from: testos on February 07, 2013, 12:51:25 PM
A CentOS kernel is not supposed to panic under any circumstance.

That's pretty much the standard trace issued by the OOM killer.

Kevin: try using the attached script to get proper valgrind output. The idea is that the script monitors the RSS used by netxmsd and, if it exceeds 1 GB, attempts a graceful shutdown using nxadm before netxmsd is killed by the OOM killer.
You need to change the NXADM variable in the script according to your installation prefix.
Start the script, then start netxmsd under valgrind.
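The attached script is not reproduced in this thread. As a rough illustration of the idea only (not Alex's actual script), a watchdog of this kind could be sketched in Python as follows. The nxadm path, the "nxadm -c down" shutdown command, the polling interval, and all function names are assumptions made for the example.

```python
import subprocess
import time
from pathlib import Path

LIMIT_KB = 1024 * 1024          # 1 GB threshold from the post, expressed in kB
NXADM = "/usr/local/bin/nxadm"  # assumed path; adjust to your installation prefix


def rss_kb(status_text):
    """Return the VmRSS value (kB) from the text of /proc/<pid>/status, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def over_limit(status_text, limit_kb=LIMIT_KB):
    """True if the process RSS is known and exceeds the limit."""
    rss = rss_kb(status_text)
    return rss is not None and rss > limit_kb


def watch(pid, interval=5.0):
    """Poll the process; request a graceful shutdown once RSS exceeds the limit.

    Returns True if a shutdown was requested, False if the process
    disappeared first.
    """
    status = Path(f"/proc/{pid}/status")
    while True:
        try:
            text = status.read_text()
        except OSError:
            return False  # process has exited
        if over_limit(text):
            # Assumption: "nxadm -c down" asks the server to shut down cleanly.
            subprocess.run([NXADM, "-c", "down"], check=False)
            return True
        time.sleep(interval)
```

Calling watch(pid) with the PID of the running server process (the valgrind process when running under valgrind) would block until the threshold is crossed or the process exits.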

millerpaint

Hi,

Thanks for the feedback guys!

Testos, I do not believe this is hardware related, as this is a virtual server running on an IBM x3550 M4 (ESXi 5.0) alongside 12 other production VMs, and they are having no issues.

Victor, discovery is limited to 50 subnets on our MPLS network, which is pretty much all that we have.  I specify each subnet explicitly; I guess I could eliminate that list and just discover all subnets.  In the beginning, I added 10 subnets at a time because I didn't want to risk overloading our network.

After that, I am filtering the discovery results for a specific IP address range on each subnet (IP .1 thru .100).  I'm really not using much scripting yet, just changing the names of nodes to match SNMP host names, a couple of email alerts, and that's about it.  The routing tables should not be huge on any of the routers that NetXMS discovers.  That being said, our network provider may be doing things on their Cisco routers (which they own) that I am unaware of.

Alex, I will edit and then run the attached script per your recommendation.


-Kevin C.

millerpaint

Hi,

One thing seems strange, it crashes with out of memory right away running under valgrind.  If I run netxmsd normally, it can run for hours before crashing.


-Kevin C.

millerpaint

OK, good news Alex.  The script you provided shut down NetXMS gracefully after it reached the 1GB of RAM threshold.  It took 12-15 minutes running under valgrind before it crashed.  I have attached the valgrind log.


Thanks,

-Kevin C.

Victor Kirhenshtein

Hi!

Quote from: millerpaint on February 07, 2013, 06:33:43 PM
One thing seems strange, it crashes with out of memory right away running under valgrind.  If I run netxmsd normally, it can run for hours before crashing.

This is normal: when running under valgrind, a program takes tens of times more memory than when it runs normally. Valgrind allocates extra memory around each dynamically allocated block to detect boundary violations, etc.

Best regards,
Victor

millerpaint

Quote: This is normal: when running under valgrind, a program takes tens of times more memory than when it runs normally. Valgrind allocates extra memory around each dynamically allocated block to detect boundary violations, etc.

OK, that makes sense Victor.

I have attached a screenshot of my Network Discovery panel, so you can see more of the details of my configuration.  I also have two SNMP community strings listed, but they are not visible in the screenshot image.

I am using top to monitor the memory consumption of netxmsd - it seems to be consuming 1/10th of 1% of available RAM every few seconds, running in normal mode.


-Kevin C.

Alex Kirhenshtein

Unfortunately, there is nothing related to your issue in this log.
Could you please change the 1 GB limit to 2 GB in the script and rerun it?

Quote from: millerpaint on February 07, 2013, 06:57:38 PM
OK, good news Alex.  The script you provided shut down NetXMS gracefully after it reached the 1GB of RAM threshold.  It took 12-15 minutes running under valgrind before it crashed.  I have attached the valgrind log.

millerpaint

Hi Alex,

I modified the script for a 2GB threshold and re-ran it; the new log file is attached.  Hopefully this will provide some clues as to what is going on.

When monitoring with top, it seems to start consuming RAM when the timer reaches 14:41:

netxmsd starts out using .5% of the server's available RAM.  Then:
14:41 - .6%
15:50 - .7%
16:53 - .8%
17:69 - .9%
18:60 - 1%
19:20 - 1.1%
20:05 - 1.2%
etc.
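
For reference, the growth rate implied by these readings can be estimated with a few lines of Python. This is only a sketch: it assumes the timer values are minutes:seconds and uses just the first and last readings from the list, since some of the intermediate timestamps do not parse as valid mm:ss values. The function names are made up for the example.

```python
def to_seconds(stamp):
    """Convert an "mm:ss" timer reading to seconds (format is an assumption)."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def growth_per_minute(first, last):
    """Average growth in percentage points of RAM per minute between two readings."""
    (t0, p0), (t1, p1) = first, last
    return (p1 - p0) / (to_seconds(t1) - to_seconds(t0)) * 60


def minutes_until(current_pct, target_pct, rate):
    """Minutes until target_pct is reached, assuming the growth stays linear."""
    return (target_pct - current_pct) / rate


# First and last readings from the list above.
rate = growth_per_minute(("14:41", 0.6), ("20:05", 1.2))
print(f"{rate:.3f} percentage points per minute")
```

With those two readings the average works out to roughly 0.11 percentage points per minute. Whether the growth stays linear is an assumption that the data above cannot confirm.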


Thanks for your help!

-Kevin C.

millerpaint

FYI, I have completely disabled auto-discovery, and it is still running out of memory.


-Kevin C.

Victor Kirhenshtein

Hi!

We are still trying to figure out what could cause such high memory consumption. Can you please do the following:

1. Run netxmsd under valgrind with additional options:

valgrind --leak-check=full --undef-value-errors=no --show-reachable=yes --log-file=netxmsd-valgrind.log netxmsd -D3


(or modify the script sent by Alex by adding --show-reachable=yes to valgrind's command line).

2. Run valgrind's heap profiler:


valgrind --tool=massif --time-unit=ms --stacks=yes --threshold=0.5 --max-snapshots=1000 --log-file=netxmsd-massif.log netxmsd -D3


and send me the profiler's result (the file will be named massif.out.<pid>).
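
The massif.out.<pid> file is plain text and is normally viewed with valgrind's ms_print tool. For a quick look at where memory peaked before sending the file, the per-snapshot counters can also be pulled out with a short script. The following is a minimal sketch that reads only the snapshot, time, and mem_* lines and ignores the heap tree; the function names are made up for the example.

```python
FIELDS = ("time", "mem_heap_B", "mem_heap_extra_B", "mem_stacks_B")


def parse_snapshots(text):
    """Collect the numeric counters of every snapshot in a massif output file."""
    snapshots = []
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue
        if key == "snapshot":
            current = {"snapshot": int(value)}
            snapshots.append(current)
        elif current is not None and key in FIELDS:
            current[key] = int(value)
    return snapshots


def total_bytes(snap):
    """Heap + allocator overhead + stacks for one snapshot."""
    return (snap.get("mem_heap_B", 0)
            + snap.get("mem_heap_extra_B", 0)
            + snap.get("mem_stacks_B", 0))


def peak(snapshots):
    """Return the snapshot with the largest total memory, or None if empty."""
    return max(snapshots, key=total_bytes, default=None)
```

For example, peak(parse_snapshots(open(path).read())) returns the snapshot with the highest total, where path is the name of the massif output file; the time field of that snapshot is in milliseconds when --time-unit=ms is used.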

Also, can you please show me your configuration hook script and transformation scripts?

Best regards,
Victor

millerpaint

Hi Victor,

OK, I can do step 1 additional options with no problem.

With Step 2, you are asking me to run valgrind's heap profiler.  I have questions on that:

1) Should step 2 be run after the step 1 test has completed, i.e. after it has run out of memory?
2) Do I need to start Alex's script before running step 2?


-Kevin C.

Victor Kirhenshtein

Hi!

Yes, you should run step 2 after step 1 is completed. You can use Alex's script for step 2 too.

Best regards,
Victor

millerpaint

Hi Victor,

I am unable to attach the valgrind log from Step 1: it is about 600 KB in size, and your forum will not allow me to post it.

Can you please raise the limit of your attachment size on this forum, or else let me know your email address?


Thanks,

-Kevin C.