Greetings,
NetXMS has been crashing with an out of memory condition on our CentOS server:
CentOS 6.3
Out of memory: Kill process 1852 (netxmsd) score 934 or sacrifice child
Killed process 1852, UID 0, (netxmsd) total-vm:8698304kB, anon-rss:3652072kB, file-rss:424kB
This virtual CentOS server has 4GB of RAM currently allocated - and we have 50 subnets, maybe 20 nodes on average per subnet.  NetXMS and MySQL db are the only things running on it.  4GB should be enough RAM I would think?
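For reference, the sizes in the kernel message above are in kB. A minimal Python sketch that converts them to GiB follows; the helper names `parse_oom_kill` and `kb_to_gib` are illustrative, not part of any NetXMS or kernel tooling.

```python
import re

# Kernel OOM-killer summary line, copied from the report above.
LINE = ("Killed process 1852, UID 0, (netxmsd) "
        "total-vm:8698304kB, anon-rss:3652072kB, file-rss:424kB")

def parse_oom_kill(line):
    """Return (pid, name, {field: size in kB}) for a 'Killed process' line."""
    head = re.search(r"Killed process (\d+).*?\((\S+?)\)", line)
    if head is None:
        raise ValueError("not an OOM-killer 'Killed process' line")
    sizes = {k: int(v) for k, v in re.findall(r"([\w-]+):(\d+)kB", line)}
    return int(head.group(1)), head.group(2), sizes

def kb_to_gib(kb):
    """Convert kB (1024 bytes each) to GiB."""
    return kb / (1024 * 1024)

pid, name, sizes = parse_oom_kill(LINE)
for field, kb in sizes.items():
    print(f"{name}[{pid}] {field}: {kb_to_gib(kb):.2f} GiB")
```

Read this way, netxmsd had about 3.48 GiB resident (anon-rss) when it was killed, which is most of the 4GB allocated to the VM.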
-Kevin C.
			
			
			
				Hi!
Looks like a memory leak in the NetXMS server. Would it be possible for you to run it under valgrind and send me the log? The correct command is:
valgrind --leak-check=full --undef-value-errors=no --log-file=netxmsd-valgrind.log netxmsd -D3
This will run netxmsd under valgrind in the foreground. Let it run for a few hours, then shut it down by entering "down" in the server's console, and send me netxmsd-valgrind.log.
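Once the server has been shut down cleanly, memcheck writes a LEAK SUMMARY section at the end of the log. The sketch below, in Python, pulls the totals out of that section. The sample text is made up to show the format and is not taken from this thread; `leak_summary` is an illustrative name.

```python
import re

# Illustrative memcheck excerpt; the numbers are invented for the example.
SAMPLE = """\
==1234== LEAK SUMMARY:
==1234==    definitely lost: 48 bytes in 3 blocks
==1234==    indirectly lost: 32 bytes in 2 blocks
==1234==      possibly lost: 96 bytes in 6 blocks
==1234==    still reachable: 1,024 bytes in 10 blocks
==1234==         suppressed: 0 bytes in 0 blocks
"""

SUMMARY_RE = re.compile(
    r"^==\d+==\s+(definitely lost|indirectly lost|possibly lost|"
    r"still reachable|suppressed):\s+([\d,]+) bytes in ([\d,]+) blocks")

def leak_summary(log_text):
    """Map each leak category to (bytes, blocks); empty if no summary exists."""
    result = {}
    for line in log_text.splitlines():
        m = SUMMARY_RE.match(line)
        if m:
            result[m.group(1)] = (int(m.group(2).replace(",", "")),
                                  int(m.group(3).replace(",", "")))
    return result

for category, (nbytes, blocks) in leak_summary(SAMPLE).items():
    print(f"{category}: {nbytes} bytes in {blocks} blocks")
```

An empty result usually means the process was killed before valgrind could write the summary, so the run has to end with a clean shutdown.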
Best regards,
Victor
			
			
			
				OK, will do, and then I will post the log when finished.
-Kevin C.
			
			
			
				When I run this under valgrind, it seems to hang on "Loading nodes....." on the Linux console, and I am unable to launch the NetXMS GUI (connection refused).
Is this expected behavior at this point?
-Kevin C.
			
			
			
				Normally it should not behave like this. Valgrind can slow an application down significantly, though; if you have a lot of nodes, initialization could take a while. Try waiting 5-10 minutes.
Best regards,
Victor
			
			
			
				OK, it did move beyond that point after some time.  I have attached the log as requested.
-Kevin C.
			
			
			
				Great. Now let it run for some time (watch that the memory used by netxmsd grows, so that the leak is caught), then stop the server and send me the log again. Valgrind records lost memory blocks only at process termination.
Best regards,
Victor
			
			
			
				OK, the process terminated, and I have attached the latest log file.
-Kevin C.
			
			
			
				Did you shut it down, or did it crash because it ran out of memory? I ask because the log is the same. Don't wait for the crash: you have to run it for some time and then shut it down correctly, otherwise valgrind will not be able to analyze the address space of the process and find memory leaks.
Best regards,
Victor
			
			
			
				Yes, it crashed.  I have started it once again, and will shut it down manually before it runs out of memory.
-Kevin C.
			
			
			
				It crashed again, within 25 minutes and before I had a chance to stop it   :(
I will start it again, and stop it after 15 minutes.
-Kevin C.
			
			
			
				OK, I stopped it after running for ~ 7 minutes - it was crashing in 10 minutes.  The log file is attached
-Kevin C.
			
			
			
				What version of MySQL are you using?
Check whether there is anything unusual in log files such as:
/var/log/messages
/var/log/nxagentd
/var/log/netxmsd.log
/var/log/mysqld.log
etc.
			
			
			
				The version of MySQL I am running is 5.1.61.
I'm not sure exactly what to look for in the log files.  There is some detailed information in messages:
=====================================================
Feb  5 10:02:58 netmgmt abrtd: Init complete, entering main loop
Feb  5 14:57:25 netmgmt kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Feb  5 14:57:25 netmgmt kernel: mysqld cpuset=/ mems_allowed=0
Feb  5 14:57:25 netmgmt kernel: Pid: 1604, comm: mysqld Not tainted 2.6.32-279.9.1.el6.x86_64 #1
Feb  5 14:57:25 netmgmt kernel: Call Trace:
Feb  5 14:57:25 netmgmt kernel: [<ffffffff810c4c71>] ? cpuset_print_task_mems_allowed+0x91/0xb0
Feb  5 14:57:25 netmgmt kernel: [<ffffffff811173e0>] ? dump_header+0x90/0x1b0
Feb  5 14:57:25 netmgmt kernel: [<ffffffff81214a0c>] ? security_real_capable_noaudit+0x3c/0x70
Feb  5 14:57:25 netmgmt kernel: [<ffffffff81117862>] ? oom_kill_process+0x82/0x2a0
Feb  5 14:57:25 netmgmt kernel: [<ffffffff811177a1>] ? select_bad_process+0xe1/0x120
Feb  5 14:57:25 netmgmt kernel: [<ffffffff81117ca0>] ? out_of_memory+0x220/0x3c0
Feb  5 14:57:25 netmgmt kernel: [<ffffffff811279be>] ? __alloc_pages_nodemask+0x89e/0x940
Feb  5 14:57:25 netmgmt kernel: [<ffffffff8115c51a>] ? alloc_pages_current+0xaa/0x110
Feb  5 14:57:25 netmgmt kernel: [<ffffffff811147e7>] ? __page_cache_alloc+0x87/0x90
Feb  5 14:57:25 netmgmt kernel: [<ffffffff8112a40b>] ? __do_page_cache_readahead+0xdb/0x210
Feb  5 14:57:25 netmgmt kernel: [<ffffffff8112a561>] ? ra_submit+0x21/0x30
Feb  5 14:57:25 netmgmt kernel: [<ffffffff81115b13>] ? filemap_fault+0x4c3/0x500
Feb  5 14:57:25 netmgmt kernel: [<ffffffff81136b6f>] ? __inc_zone_state+0x1f/0x70
Feb  5 14:57:25 netmgmt kernel: [<ffffffff8113ef14>] ? __do_fault+0x54/0x510
Feb  5 14:57:25 netmgmt kernel: [<ffffffff8113f4c7>] ? handle_pte_fault+0xf7/0xb50
Feb  5 14:57:25 netmgmt kernel: [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
Feb  5 14:57:25 netmgmt kernel: [<ffffffff811913fc>] ? core_sys_select+0x1ec/0x2c0
Feb  5 14:57:25 netmgmt kernel: [<ffffffff81140104>] ? handle_mm_fault+0x1e4/0x2b0
Feb  5 14:57:25 netmgmt kernel: [<ffffffff810444c9>] ? __do_page_fault+0x139/0x480
Feb  5 14:57:25 netmgmt kernel: [<ffffffff81278bec>] ? rb_erase+0x1bc/0x310
Feb  5 14:57:25 netmgmt kernel: [<ffffffff81012bd9>] ? read_tsc+0x9/0x20
Feb  5 14:57:25 netmgmt kernel: [<ffffffff8109cea9>] ? ktime_get_ts+0xa9/0xe0
Feb  5 14:57:25 netmgmt kernel: [<ffffffff8118fe58>] ? poll_select_copy_remaining+0xf8/0x150
Feb  5 14:57:25 netmgmt kernel: [<ffffffff810d6ad3>] ? audit_syscall_entry+0x63/0x2a0
Feb  5 14:57:25 netmgmt kernel: [<ffffffff8150380e>] ? do_page_fault+0x3e/0xa0
Feb  5 14:57:25 netmgmt kernel: [<ffffffff81500bc5>] ? page_fault+0x25/0x30
Feb  5 14:57:25 netmgmt kernel: Mem-Info:
Feb  5 14:57:25 netmgmt kernel: Node 0 DMA per-cpu:
Feb  5 14:57:25 netmgmt kernel: CPU    0: hi:    0, btch:   1 usd:   0
Feb  5 14:57:25 netmgmt kernel: Node 0 DMA32 per-cpu:
Feb  5 14:57:25 netmgmt kernel: CPU    0: hi:  186, btch:  31 usd:  61
Feb  5 14:57:25 netmgmt kernel: Node 0 Normal per-cpu:
Feb  5 14:57:25 netmgmt kernel: CPU    0: hi:  186, btch:  31 usd:  74
Feb  5 14:57:25 netmgmt kernel: active_anon:671184 inactive_anon:251894 isolated_anon:0
Feb  5 14:57:25 netmgmt kernel: active_file:74 inactive_file:923 isolated_file:0
Feb  5 14:57:25 netmgmt kernel: unevictable:0 dirty:0 writeback:0 unstable:0
Feb  5 14:57:25 netmgmt kernel: free:21743 slab_reclaimable:2109 slab_unreclaimable:13036
Feb  5 14:57:25 netmgmt kernel: mapped:125 shmem:0 pagetables:4921 bounce:0
Feb  5 14:57:25 netmgmt kernel: Node 0 DMA free:15684kB min:248kB low:308kB high:372kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15292kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Feb  5 14:57:25 netmgmt kernel: lowmem_reserve[]: 0 3000 4010 4010
Feb  5 14:57:25 netmgmt kernel: Node 0 DMA32 free:54324kB min:50372kB low:62964kB high:75556kB active_anon:2236200kB inactive_anon:559016kB active_file:260kB inactive_file:3692kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3072160kB mlocked:0kB dirty:0kB writeback:0kB mapped:492kB shmem:0kB slab_reclaimable:212kB slab_unreclaimable:260kB kernel_stack:0kB pagetables:6660kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:401 all_unreclaimable? yes
Feb  5 14:57:25 netmgmt kernel: lowmem_reserve[]: 0 0 1010 1010
Feb  5 14:57:25 netmgmt kernel: Node 0 Normal free:16964kB min:16956kB low:21192kB high:25432kB active_anon:448536kB inactive_anon:448560kB active_file:36kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1034240kB mlocked:0kB dirty:0kB writeback:0kB mapped:8kB shmem:0kB slab_reclaimable:8224kB slab_unreclaimable:51884kB kernel_stack:3744kB pagetables:13024kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:73 all_unreclaimable? yes
Feb  5 14:57:25 netmgmt kernel: lowmem_reserve[]: 0 0 0 0
Feb  5 14:57:25 netmgmt kernel: Node 0 DMA: 1*4kB 4*8kB 2*16kB 2*32kB 3*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15684kB
Feb  5 14:57:25 netmgmt kernel: Node 0 DMA32: 123*4kB 123*8kB 35*16kB 12*32kB 5*64kB 1*128kB 3*256kB 3*512kB 36*1024kB 6*2048kB 0*4096kB = 54324kB
Feb  5 14:57:25 netmgmt kernel: Node 0 Normal: 433*4kB 227*8kB 105*16kB 45*32kB 25*64kB 10*128kB 7*256kB 1*512kB 3*1024kB 1*2048kB 0*4096kB = 16972kB
Feb  5 14:57:25 netmgmt kernel: 1745 total pagecache pages
Feb  5 14:57:25 netmgmt kernel: 733 pages in swap cache
Feb  5 14:57:25 netmgmt kernel: Swap cache stats: add 1035805, delete 1035072, find 1381/1742
Feb  5 14:57:25 netmgmt kernel: Free swap  = 0kB
Feb  5 14:57:25 netmgmt kernel: Total swap = 4128760kB
Feb  5 14:57:25 netmgmt kernel: 1048560 pages RAM
Feb  5 14:57:25 netmgmt kernel: 67324 pages reserved
Feb  5 14:57:25 netmgmt kernel: 267 pages shared
Feb  5 14:57:25 netmgmt kernel: 955440 pages non-shared
Feb  5 14:57:25 netmgmt kernel: [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
Feb  5 14:57:25 netmgmt kernel: [  498]     0   498     2795        0   0     -17         -1000 udevd
Feb  5 14:57:25 netmgmt kernel: [ 1111]     0  1111     6909       28   0     -17         -1000 auditd
Feb  5 14:57:25 netmgmt kernel: [ 1136]     0  1136    62271       44   0       0             0 rsyslogd
Feb  5 14:57:25 netmgmt kernel: [ 1178]    32  1178     4743       15   0       0             0 rpcbind
Feb  5 14:57:25 netmgmt kernel: [ 1196]    29  1196     5836        1   0       0             0 rpc.statd
Feb  5 14:57:25 netmgmt kernel: [ 1208]     0  1208     1143       10   0       0             0 mdadm
Feb  5 14:57:25 netmgmt kernel: [ 1234]     0  1234     6290        1   0       0             0 rpc.idmapd
Feb  5 14:57:25 netmgmt kernel: [ 1328]    81  1328     7944        1   0       0             0 dbus-daemon
Feb  5 14:57:25 netmgmt kernel: [ 1340]     0  1340    47289        1   0       0             0 cupsd
Feb  5 14:57:25 netmgmt kernel: [ 1365]     0  1365     1019        0   0       0             0 acpid
Feb  5 14:57:25 netmgmt kernel: [ 1374]    68  1374     6323      111   0       0             0 hald
Feb  5 14:57:25 netmgmt kernel: [ 1375]     0  1375     4526        1   0       0             0 hald-runner
Feb  5 14:57:25 netmgmt kernel: [ 1403]     0  1403     5055        1   0       0             0 hald-addon-inpu
Feb  5 14:57:25 netmgmt kernel: [ 1414]    68  1414     4451        1   0       0             0 hald-addon-acpi
Feb  5 14:57:25 netmgmt kernel: [ 1435]     0  1435    96427       31   0       0             0 automount
Feb  5 14:57:25 netmgmt kernel: [ 1451]     0  1451     1564        0   0       0             0 mcelog
Feb  5 14:57:25 netmgmt kernel: [ 1463]     0  1463    16018        0   0     -17         -1000 sshd
Feb  5 14:57:25 netmgmt kernel: [ 1471]    38  1471     7540       30   0       0             0 ntpd
Feb  5 14:57:25 netmgmt kernel: [ 1507]     0  1507    27050        1   0       0             0 mysqld_safe
Feb  5 14:57:25 netmgmt kernel: [ 1596]    27  1596   176877     2710   0       0             0 mysqld
Feb  5 14:57:25 netmgmt kernel: [ 1687]     0  1687    19669       22   0       0             0 master
Feb  5 14:57:25 netmgmt kernel: [ 1697]    89  1697    19732       22   0       0             0 qmgr
Feb  5 14:57:25 netmgmt kernel: [ 1711]     0  1711    27543        1   0       0             0 abrtd
Feb  5 14:57:25 netmgmt kernel: [ 1719]     0  1719    27016        1   0       0             0 abrt-dump-oops
Feb  5 14:57:25 netmgmt kernel: [ 1727]     0  1727    29301       23   0       0             0 crond
Feb  5 14:57:25 netmgmt kernel: [ 1738]     0  1738     5363        5   0       0             0 atd
Feb  5 14:57:25 netmgmt kernel: [ 1751]     0  1751    19274        1   0       0             0 login
Feb  5 14:57:25 netmgmt kernel: [ 1753]     0  1753     1015        1   0       0             0 mingetty
Feb  5 14:57:25 netmgmt kernel: [ 1756]     0  1756     1015        1   0       0             0 mingetty
Feb  5 14:57:25 netmgmt kernel: [ 1757]     0  1757     3091        0   0     -17         -1000 udevd
Feb  5 14:57:25 netmgmt kernel: [ 1759]     0  1759     1015        1   0       0             0 mingetty
Feb  5 14:57:25 netmgmt kernel: [ 1760]     0  1760     3091        0   0     -17         -1000 udevd
Feb  5 14:57:25 netmgmt kernel: [ 1762]     0  1762     1015        1   0       0             0 mingetty
Feb  5 14:57:25 netmgmt kernel: [ 1764]     0  1764     1015        1   0       0             0 mingetty
Feb  5 14:57:25 netmgmt kernel: [ 1771]     0  1771   143729        1   0       0             0 console-kit-dae
Feb  5 14:57:25 netmgmt kernel: [ 1837]     0  1837    27083        1   0       0             0 bash
Feb  5 14:57:25 netmgmt kernel: [ 1857]     0  1857    92804      612   0       0             0 nxagentd
Feb  5 14:57:25 netmgmt kernel: [ 1871]     0  1871  2179534   918835   0       0             0 netxmsd
Feb  5 14:57:25 netmgmt kernel: [ 4627]    89  4627    19689       16   0       0             0 pickup
Feb  5 14:57:25 netmgmt kernel: Out of memory: Kill process 1871 (netxmsd) score 937 or sacrifice child
Feb  5 14:57:25 netmgmt kernel: Killed process 1871, UID 0, (netxmsd) total-vm:8718136kB, anon-rss:3674960kB, file-rss:380kB
==============================================
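The process table in the dump above reports total_vm and rss in pages rather than kB. The Python sketch below converts a few of its rows, assuming 4 kB pages (the x86_64 default); `PAGE_KB`, `ROW_RE` and `parse_rows` are illustrative names.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, the x86_64 default

# Three rows copied from the kernel's process table above.
TABLE = """\
Feb  5 14:57:25 netmgmt kernel: [ 1596]    27  1596   176877     2710   0       0             0 mysqld
Feb  5 14:57:25 netmgmt kernel: [ 1857]     0  1857    92804      612   0       0             0 nxagentd
Feb  5 14:57:25 netmgmt kernel: [ 1871]     0  1871  2179534   918835   0       0             0 netxmsd
"""

# Columns: [pid] uid tgid total_vm rss cpu oom_adj oom_score_adj name
ROW_RE = re.compile(
    r"\[\s*(\d+)\]\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(-?\d+)\s+(-?\d+)\s+(\S+)\s*$")

def parse_rows(text):
    """Return one dict per process row, with sizes converted to kB."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.search(line)
        if m:
            rows.append({
                "pid": int(m.group(1)),
                "total_vm_kb": int(m.group(4)) * PAGE_KB,
                "rss_kb": int(m.group(5)) * PAGE_KB,
                "name": m.group(9),
            })
    return rows

rows = parse_rows(TABLE)
for row in sorted(rows, key=lambda r: r["rss_kb"], reverse=True):
    print(f'{row["name"]:<10} rss={row["rss_kb"] / 1024:.1f} MiB')
```

The 4 kB assumption checks out against the kernel's own summary: 918835 pages is 3675340 kB, which equals the reported anon-rss plus file-rss (3674960kB + 380kB), and 2179534 pages is the reported total-vm of 8718136kB. By resident size, netxmsd is far larger than mysqld at the moment of the kill.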
			
			
			
				Does anyone have any ideas on what may be causing this out of memory condition?
Any help would be greatly appreciated.
-Kevin C.
			
			
			
				Hi!
We are trying to find it. Can you describe which features you are using? The most important things to know about are network discovery and scripts. Also, could it be that NetXMS is monitoring routers with very large routing tables (such as a full BGP table)?
Best regards,
Victor
			
			
			
				
It seems like a hardware failure.
Centos kernel is not supposed to panic under any circumstance.
There are no "normal" circumstances, beyond a hardware failure or kernel bug, that should cause one.
Try running memtest86 or memtest86+, and perform a BIOS address test on the physical machine.
			
			
			
				Quote from: testos on February 07, 2013, 12:51:25 PM
Centos kernel is not supposed to panic under any circumstance.
That's pretty much the standard trace issued by the OOM killer.
Kevin: try using the attached script to get proper valgrind output. The idea is that the script monitors the RSS used by netxmsd and, if it exceeds 1 GB, attempts a graceful shutdown using nxadm before netxmsd is killed by the OOM killer.
You need to change the NXADM variable in the script according to your installation prefix.
Start the script, then start netxmsd under valgrind.
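The attached script itself is not reproduced in this thread. The following Python sketch only illustrates the idea described above and is not Alex's script. The `NXADM` path, the polling interval, and the use of `nxadm -c down` to send the console "down" command are all assumptions to verify against your installation.

```python
import re
import subprocess
import time

NXADM = "/usr/local/bin/nxadm"   # assumption: adjust to your installation prefix
LIMIT_KB = 1024 * 1024           # 1 GB threshold, as described above

def rss_kb(status_text):
    """Extract VmRSS (in kB) from the text of /proc/<pid>/status."""
    m = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else None

def over_limit(status_text, limit_kb=LIMIT_KB):
    """True if the process's resident set size exceeds the limit."""
    rss = rss_kb(status_text)
    return rss is not None and rss > limit_kb

def watch(pid, interval=5):
    """Poll a process; request a graceful shutdown once RSS exceeds the limit."""
    while True:
        try:
            with open(f"/proc/{pid}/status") as f:
                status = f.read()
        except OSError:
            return  # process has already exited
        if over_limit(status):
            # assumption: 'nxadm -c down' issues the server console "down" command
            subprocess.run([NXADM, "-c", "down"], check=False)
            return
        time.sleep(interval)
```

Call `watch(pid)` with the PID of the valgrind process. Under valgrind the process may not be listed under the name netxmsd, so the PID is passed in explicitly here instead of being looked up by name.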
			
 
			
			
				Hi,
Thanks for the feedback guys!
Testos, I do not believe this is hardware related, as this is a virtual server running on an IBM x3550 M4 (ESXi 5.0) alongside 12 other production VMs, and they are having no issues.
Victor, discovery is limited to 50 subnets on our MPLS network, which is pretty much all that we have.  I do specify each subnet, I guess I could eliminate that list and just discover all subnets.  In the beginning, I started out adding 10 subnets at a time, I didn't want to risk overloading our network.
After that, I am filtering the discovery results for a specific IP address range on each subnet (IP .1 thru .100).  I'm really not using much scripting yet, just changing the names of nodes to match SNMP host names, a couple of email alerts, and that's about it.  The routing tables should not be huge on any of the routers that NetXMS discovers.  That being said, our network provider may be doing things on their Cisco routers (which they own) that I am unaware of.
Alex, I will edit and then run the attached script per your recommendation.
-Kevin C.
			
			
			
				Hi,
One thing seems strange, it crashes with out of memory right away running under valgrind.  If I run netxmsd normally, it can run for hours before crashing.
-Kevin C.
			
			
			
				OK, good news Alex.  The script you provided shut down NetXMS gracefully after it reached the 1GB of RAM threshold.  It took 12-15 minutes running under valgrind before it crashed.  I have attached the valgrind log.
Thanks,
-Kevin C.
			
			
			
				Hi!
Quote from: millerpaint on February 07, 2013, 06:33:43 PM
One thing seems strange, it crashes with out of memory right away running under valgrind.  If I run netxmsd normally, it can run for hours before crashing.
This is normal: a program running under valgrind takes tens of times more memory than when it runs normally. Valgrind allocates extra memory around each dynamically allocated block to detect boundary violations, etc.
Best regards,
Victor
			
				Quote: This is normal: a program running under valgrind takes tens of times more memory than when it runs normally. Valgrind allocates extra memory around each dynamically allocated block to detect boundary violations, etc.
OK, that makes sense Victor.
I have attached a screenshot of my Network Discovery panel, so you can see more about the details of my configuration.  I also have (2) SNMP community strings listed, but they are not visible in the screenshot image.
I am using top to monitor the memory consumption of netxmsd - it seems to be consuming 1/10th of 1% of available RAM every few seconds, running in normal mode.
-Kevin C.
			
 
			
			
				Unfortunately, there is nothing related to your issue in this log.
Could you please change the 1 GB limit in the script to 2 GB and rerun it?
Quote from: millerpaint on February 07, 2013, 06:57:38 PM
OK, good news Alex.  The script you provided shut down NetXMS gracefully after it reached the 1GB of RAM threshold.  It took 12-15 minutes running under valgrind before it crashed.  I have attached the valgrind log.
			
				Hi Alex,
I modified the script for a 2GB threshold and re-ran it, the new log file is attached.  Hopefully this will provide some clues as to what is going on.
When monitoring with top, it seems to start consuming RAM when the timer reaches 14:41:
netxmsd starts out using 0.5% of the server's available RAM.  Then:
14:41 - .6%
15:50 - .7%
16:53 - .8%
17:69 - .9%
18:60 - 1%
19:20 - 1.1%
20:05 - 1.2%
etc.
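The readings above can be turned into a rough growth rate. The Python sketch below assumes the timestamps are elapsed mm:ss values and uses only the first and last samples. Two readings (17:69 and 18:60) have a seconds field of 60 or more; they are kept verbatim and do not affect the endpoint calculation. The extrapolation is linear and ignores swap.

```python
# (timestamp, percent of RAM used by netxmsd), as listed above.
SAMPLES = [("14:41", 0.6), ("15:50", 0.7), ("16:53", 0.8), ("17:69", 0.9),
           ("18:60", 1.0), ("19:20", 1.1), ("20:05", 1.2)]

def to_seconds(stamp):
    """Convert an 'mm:ss' reading to seconds (assumed elapsed time)."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)

def growth_per_minute(samples):
    """Percentage points gained per minute between first and last sample."""
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    elapsed_min = (to_seconds(t1) - to_seconds(t0)) / 60
    return (p1 - p0) / elapsed_min

rate = growth_per_minute(SAMPLES)
# Linear extrapolation from the last reading to 100% of RAM.
remaining_hours = (100 - SAMPLES[-1][1]) / rate / 60

print(f"growth: {rate:.3f} percentage points per minute")
print(f"time to reach 100% of RAM at this rate: {remaining_hours:.1f} hours")
```

On these numbers the growth is roughly 0.1 percentage point per minute, which extrapolates to many hours before RAM is exhausted. That fits the earlier observation that netxmsd runs for hours outside valgrind before being killed.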
Thanks for your help!
-Kevin C.
			
			
			
				FYI, I have completely disabled auto-discovery, and it is still running out of memory.
-Kevin C.
			
			
			
				Hi!
We are still trying to figure out what could cause such a high memory consumption. Can you please do the following:
1. Run netxmsd under valgrind with additional options:
valgrind --leak-check=full --undef-value-errors=no --show-reachable=yes --log-file=netxmsd-valgrind.log netxmsd -D3
(or modify script sent by Alex by adding --show-reachable=yes to valgrind's command line).
2. Run valgrind's heap profiler:
valgrind --tool=massif --time-unit=ms --stacks=yes --threshold=0.5 --max-snapshots=1000 --log-file=netxmsd-massif.log netxmsd -D3
and send me profiler's result (it will be named massif.out.<pid>).
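The massif.out.<pid> file is plain text and is normally viewed with valgrind's `ms_print` tool. As a quick check before sending it, the Python sketch below finds the snapshot with the highest total memory. The sample text is a made-up excerpt that shows the file format and is not data from this server; `snapshots` and `total_bytes` are illustrative names.

```python
import re

# Illustrative excerpt in massif's output format; the numbers are invented.
SAMPLE = """\
desc: --time-unit=ms --stacks=yes
cmd: netxmsd -D3
time_unit: ms
#-----------
snapshot=0
#-----------
time=0
mem_heap_B=0
mem_heap_extra_B=0
mem_stacks_B=0
heap_tree=empty
#-----------
snapshot=1
#-----------
time=5000
mem_heap_B=1048576
mem_heap_extra_B=4096
mem_stacks_B=2048
heap_tree=empty
"""

FIELD_RE = re.compile(
    r"^(snapshot|time|mem_heap_B|mem_heap_extra_B|mem_stacks_B)=(\d+)$")

def snapshots(text):
    """Return one dict of numeric fields per snapshot in a massif.out file."""
    snaps = []
    for line in text.splitlines():
        m = FIELD_RE.match(line)
        if not m:
            continue  # skip header, separators and heap_tree lines
        key, value = m.group(1), int(m.group(2))
        if key == "snapshot":
            snaps.append({"snapshot": value})
        elif snaps:
            snaps[-1][key] = value
    return snaps

def total_bytes(snap):
    """Heap + allocator overhead + stacks for one snapshot."""
    return (snap.get("mem_heap_B", 0) + snap.get("mem_heap_extra_B", 0)
            + snap.get("mem_stacks_B", 0))

peak = max(snapshots(SAMPLE), key=total_bytes)
print(f'peak: snapshot {peak["snapshot"]} at t={peak["time"]} ms, '
      f'{total_bytes(peak)} bytes')
```

To use it on a real run, read the profiler output into a string and pass it to `snapshots` in place of `SAMPLE`.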
Also, can you please show me your configuration hook script and transformation scripts?
Best regards,
Victor
			
			
			
				Hi Victor,
OK, I can do step 1 additional options with no problem.
With Step 2, you are asking me to run valgrind's heap profiler.  I have questions on that:
1) Should I run step 2 after the step 1 test has finished, i.e. once it runs out of memory?
2) Do I need to start Alex's script before running step 2?
-Kevin C.
			
			
			
				Hi!
Yes, you should run step 2 after step 1 is completed. You can use Alex's script for step 2 too.
Best regards,
Victor
			
			
			
				Hi Victor,
I am unable to attach the valgrind log from Step 1, it is about 600k in size, and your forum will not allow me to post it.
Can you please raise the limit of your attachment size on this forum, or else let me know your email address?
Thanks,
-Kevin C.
			
			
			
				You can send it to [email protected].
Best regards,
Victor
			
				OK thanks - all log files you have requested from valgrind have been sent to that email address.
-Kevin C.