Out of memory NetXMS v1.2.5

Victor Kirhenshtein · February 07, 2013, 08:55:19 AM

Hi!

We are trying to find the cause. Can you describe which features you are using? Most important is to know about network discovery and scripts. Also, could it be that you have routers with very large routing tables (such as a full BGP table) monitored by NetXMS?

Best regards,
Victor

testos · February 07, 2013, 12:51:25 PM

This looks like a hardware failure.
The CentOS kernel is not supposed to panic under any circumstances.
Apart from a hardware failure or a kernel bug, there are no "normal" circumstances that should cause one.
Try running memtest86 or memtest86+ and performing a BIOS address test on the physical machine.

Alex Kirhenshtein · February 07, 2013, 04:43:45 PM

Quote from: testos on February 07, 2013, 12:51:25 PM
The CentOS kernel is not supposed to panic under any circumstances.

That's pretty much the standard trace issued by the OOM killer.

Kevin: try using the attached script to get proper valgrind output. The idea is that the script monitors the RSS used by netxmsd, and if it exceeds 1 GB, it attempts a graceful shutdown using nxadm before netxmsd is killed by the OOM killer.
You need to change the NXADM variable in the script to match your installation prefix.
Start the script, then start netxmsd under valgrind.
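Alex's attached script is not reproduced in the thread. The following is only a minimal sketch of the same idea, assuming bash, Linux procps `ps`/`pgrep`, and that `nxadm -c down` asks the server to shut down (check this against your NetXMS version). The install path, the script name `nxwatch.sh`, and the function names are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical RSS watchdog for netxmsd (a sketch, not Alex's actual script).
# Assumptions: Linux procps ps/pgrep, and `nxadm -c down` requests a server shutdown.

NXADM="${NXADM:-/usr/local/bin/nxadm}"   # assumed path; adjust to your installation prefix
LIMIT_MB="${LIMIT_MB:-1024}"             # shutdown threshold in MB (1 GB)
INTERVAL="${INTERVAL:-5}"                # seconds between checks

# Print the resident set size of a PID in KiB (Linux ps reports RSS in KiB).
rss_kb() {
    ps -o rss= -p "$1" 2>/dev/null | tr -d ' '
}

# Return success when an RSS value in KiB exceeds a limit given in MB.
over_limit() {
    local rss=$1 limit_mb=$2
    (( rss > limit_mb * 1024 ))
}

watch_netxmsd() {
    local pid rss
    while true; do
        # -f matches the full command line, which should still contain "netxmsd" under
        # valgrind; it also matches e.g. `tail -f netxmsd-valgrind.log`, so narrow if needed.
        pid=$(pgrep -o -f netxmsd)
        if [[ -n "$pid" ]]; then
            rss=$(rss_kb "$pid")
            if [[ -n "$rss" ]] && over_limit "$rss" "$LIMIT_MB"; then
                echo "$(date '+%F %T') pid $pid RSS ${rss} KiB > ${LIMIT_MB} MB, shutting down"
                "$NXADM" -c down   # assumed graceful-shutdown command
                exit 0
            fi
        fi
        sleep "$INTERVAL"
    done
}

# Start the loop only when invoked as: ./nxwatch.sh watch
if [[ "${1:-}" == "watch" ]]; then
    watch_netxmsd
fi
```

Start it with `./nxwatch.sh watch &` before launching netxmsd under valgrind; setting `LIMIT_MB=2048` in the environment raises the threshold to 2 GB without editing the file.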

millerpaint · February 07, 2013, 05:55:16 PM

Hi,

Thanks for the feedback guys!

Testos, I do not believe this is hardware related, as this is a virtual server running on an IBM x3550 M4 (ESXi 5.0) alongside 12 other production VMs, and they are having no issues.

Victor, discovery is limited to 50 subnets on our MPLS network, which is pretty much all that we have. I do specify each subnet; I guess I could eliminate that list and just discover all subnets. In the beginning, I started out adding 10 subnets at a time because I didn't want to risk overloading our network.

After that, I filter the discovery results to a specific IP address range on each subnet (.1 through .100). I'm really not using much scripting yet: just renaming nodes to match their SNMP host names, a couple of email alerts, and that's about it. The routing tables should not be huge on any of the routers that NetXMS discovers. That said, our network provider may be doing things on their Cisco routers (which they own) that I am unaware of.

Alex, I will edit and then run the attached script per your recommendation.

-Kevin C.

millerpaint · February 07, 2013, 06:33:43 PM

Hi,

One thing seems strange: it runs out of memory almost immediately under valgrind. If I run netxmsd normally, it can run for hours before crashing.

-Kevin C.

millerpaint · February 07, 2013, 06:57:38 PM

OK, good news, Alex. The script you provided shut NetXMS down gracefully after it reached the 1 GB RAM threshold. It ran for 12-15 minutes under valgrind before reaching that point. I have attached the valgrind log.

Thanks,

-Kevin C.

Victor Kirhenshtein · February 07, 2013, 07:31:33 PM

Hi!

Quote from: millerpaint on February 07, 2013, 06:33:43 PM
One thing seems strange: it runs out of memory almost immediately under valgrind. If I run netxmsd normally, it can run for hours before crashing.

This is normal: when running under valgrind, a program takes tens of times more memory than when it runs normally. Valgrind allocates extra memory around each dynamically allocated block to detect boundary violations, etc.

Best regards,
Victor

millerpaint · February 07, 2013, 09:31:06 PM

Quote
This is normal: when running under valgrind, a program takes tens of times more memory than when it runs normally. Valgrind allocates extra memory around each dynamically allocated block to detect boundary violations, etc.

OK, that makes sense Victor.

I have attached a screenshot of my Network Discovery panel so you can see more details of my configuration. I also have two SNMP community strings configured, but they are not visible in the screenshot.

I am using top to monitor the memory consumption of netxmsd. Running in normal mode, it seems to grow by 0.1% of available RAM every few seconds.
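Reading percentages off top by eye makes the growth rate hard to pin down. A minimal sketch that samples the RSS of a normally running netxmsd at a fixed interval into a CSV file, assuming bash and Linux procps; the file name, interval, script name `rss-log.sh`, and function names are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical RSS logger for netxmsd: appends one timestamped sample per INTERVAL
# seconds to a CSV file. Assumes Linux procps ps/pgrep.

OUT="${OUT:-netxmsd-rss.csv}"   # output file (arbitrary name)
INTERVAL="${INTERVAL:-60}"      # seconds between samples

# Format one CSV row: timestamp, pid, RSS in KiB, RSS in MiB (integer division).
format_sample() {
    local ts=$1 pid=$2 rss_kb=$3
    printf '%s,%s,%s,%s\n' "$ts" "$pid" "$rss_kb" "$(( rss_kb / 1024 ))"
}

log_rss() {
    local pid rss
    # Write a header only if the file does not exist yet or is empty.
    [[ -s "$OUT" ]] || echo "timestamp,pid,rss_kb,rss_mb" > "$OUT"
    while true; do
        # -x matches the exact process name, i.e. netxmsd started without valgrind.
        pid=$(pgrep -o -x netxmsd)
        if [[ -n "$pid" ]]; then
            rss=$(ps -o rss= -p "$pid" | tr -d ' ')
            [[ -n "$rss" ]] && format_sample "$(date '+%Y-%m-%dT%H:%M:%S')" "$pid" "$rss" >> "$OUT"
        fi
        sleep "$INTERVAL"
    done
}

# Start sampling only when invoked as: ./rss-log.sh log
if [[ "${1:-}" == "log" ]]; then
    log_rss
fi
```

Absolute KiB values with wall-clock timestamps are easier to compare between runs than percentages of total RAM, and the CSV can be attached or plotted directly.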

-Kevin C.

Alex Kirhenshtein · February 08, 2013, 03:07:20 PM

Quote from: millerpaint on February 07, 2013, 06:57:38 PM
OK, good news, Alex. The script you provided shut NetXMS down gracefully after it reached the 1 GB RAM threshold. It ran for 12-15 minutes under valgrind before reaching that point. I have attached the valgrind log.

Unfortunately, there is nothing related to your issue in this log.
Could you please change the 1 GB limit to 2 GB in the script and rerun it?

millerpaint · February 08, 2013, 08:57:40 PM

Hi Alex,

I modified the script for a 2 GB threshold and re-ran it; the new log file is attached. Hopefully this will provide some clues as to what is going on.

When monitoring with top, netxmsd seems to start consuming RAM when the timer reaches 14:41:

netxmsd starts out using 0.5% of the server's available RAM. Then:
14:41 - 0.6%
15:50 - 0.7%
16:53 - 0.8%
17:69 - 0.9%
18:60 - 1.0%
19:20 - 1.1%
20:05 - 1.2%
etc.

Thanks for your help!

-Kevin C.

millerpaint · February 08, 2013, 11:46:39 PM

FYI, I have completely disabled auto-discovery, and it is still running out of memory.

-Kevin C.

Victor Kirhenshtein · February 09, 2013, 10:27:08 PM

Hi!

We are still trying to figure out what could cause such high memory consumption. Can you please do the following:

1. Run netxmsd under valgrind with these additional options:

valgrind --leak-check=full --undef-value-errors=no --show-reachable=yes --log-file=netxmsd-valgrind.log netxmsd -D3

(or modify the script sent by Alex by adding --show-reachable=yes to valgrind's command line).

2. Run valgrind's heap profiler:

valgrind --tool=massif --time-unit=ms --stacks=yes --threshold=0.5 --max-snapshots=1000 --log-file=netxmsd-massif.log netxmsd -D3

and send me the profiler's result (the file will be named massif.out.<pid>).

Also, could you please show me your configuration hook script and transformation scripts?

Best regards,
Victor

millerpaint · February 10, 2013, 05:04:02 AM

Hi Victor,

OK, I can run step 1 with the additional options, no problem.

For step 2, you are asking me to run valgrind's heap profiler. I have some questions about that:

1) Should step 2 be run after I have completed the step 1 test and it has run out of memory?
2) Do I need to start Alex's script before running step 2?

-Kevin C.

Victor Kirhenshtein · February 10, 2013, 12:29:39 PM

Hi!

Yes, you should run step 2 after step 1 is completed. You can use Alex's script for step 2 too.

Best regards,
Victor

millerpaint · February 12, 2013, 06:49:59 PM

Hi Victor,

I am unable to attach the valgrind log from step 1; it is about 600 KB in size, and the forum will not allow me to post it.

Could you please raise the attachment size limit on this forum, or else let me know your email address?

Thanks,

-Kevin C.

NetXMS Support Forum

News:

Out of memory NetXMS v1.2.5

Victor Kirhenshtein

testos

Alex Kirhenshtein

millerpaint

millerpaint

millerpaint

Victor Kirhenshtein

millerpaint

Alex Kirhenshtein

millerpaint

millerpaint

Victor Kirhenshtein

millerpaint

Victor Kirhenshtein

millerpaint