Out of memory NetXMS v1.2.5

Started by millerpaint, February 02, 2013, 02:08:37 AM


millerpaint

Greetings,

NetXMS has been crashing with an out of memory condition on our CentOS server:

CentOS 6.3
Out of memory: Kill process 1852 (netxmsd) score 934 or sacrifice child
Killed process 1852, UID 0, (netxmsd) total-vm:8698304kB, anon-rss:3652072kB, file-rss:424kB

This virtual CentOS server currently has 4GB of RAM allocated, and we monitor 50 subnets with maybe 20 nodes per subnet on average.  NetXMS and its MySQL database are the only things running on it.  I would think 4GB should be enough RAM?


-Kevin C.
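
For readers who want to put the kill line into perspective, here is a minimal sketch that converts its kB figures to GiB. The helper names (parse_kill_line, kb_to_gib) are made up for this illustration and are not part of NetXMS or any kernel tool; the input string is copied from the post above.

```python
import re

# Kill line copied from the post above.
KILL_LINE = ("Killed process 1852, UID 0, (netxmsd) "
             "total-vm:8698304kB, anon-rss:3652072kB, file-rss:424kB")

def parse_kill_line(line):
    """Return the kB figures from an OOM-killer 'Killed process' line."""
    fields = re.findall(r"([a-z-]+):(\d+)kB", line)
    return {name: int(kb) for name, kb in fields}

def kb_to_gib(kb):
    """Convert kB (1024 bytes, as the kernel reports them) to GiB."""
    return kb / (1024 * 1024)

if __name__ == "__main__":
    for name, kb in parse_kill_line(KILL_LINE).items():
        print(f"{name}: {kb_to_gib(kb):.2f} GiB")
```

The anon-rss value works out to roughly 3.5 GiB, so netxmsd alone was holding most of the 4GB allocated to the VM when it was killed.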


Victor Kirhenshtein

Hi!

Looks like a memory leak in the NetXMS server. Would it be possible for you to run it under valgrind and send me the log? The correct command is

valgrind --leak-check=full --undef-value-errors=no --log-file=netxmsd-valgrind.log netxmsd -D3

This will run netxmsd under valgrind in the foreground. Let it run for a few hours, then shut it down by entering "down" in the server's console, and send me netxmsd-valgrind.log.

Best regards,
Victor

millerpaint

OK, will do, and then I will post the log when finished.


-Kevin C.

millerpaint

When I run this under valgrind, it seems to hang on "Loading nodes....." on the Linux console, and I am unable to launch the NetXMS GUI (connection refused).


Is this expected behavior at this point?

-Kevin C.

Victor Kirhenshtein

Normally it should not behave like this. Valgrind sometimes slows an application down significantly, so if you have a lot of nodes, initialization can take a while. Try waiting 5-10 minutes.

Best regards,
Victor

millerpaint

OK, it did move beyond that point after some time.  I have attached the log as requested.


-Kevin C.

Victor Kirhenshtein

Great. Now let it run for some time (watch the memory used by netxmsd grow, so that the leak is actually caught), then stop the server and send me the log again. Valgrind records lost memory blocks only at process termination.

Best regards,
Victor
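
One way to watch the memory grow is to poll VmRSS in /proc/<pid>/status. The following is only a sketch: the function names, the 30-second interval and the command-line arguments are assumptions for illustration, and when netxmsd runs under valgrind the PID to watch is that of the valgrind process itself.

```python
import re
import sys
import time

def rss_kb(status_text):
    """Extract VmRSS in kB from the text of /proc/<pid>/status, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None

def watch(pid, limit_kb, interval=30):
    """Print RSS every `interval` seconds; return once it reaches limit_kb."""
    while True:
        with open(f"/proc/{pid}/status") as f:
            rss = rss_kb(f.read())
        print(f"{time.strftime('%H:%M:%S')} VmRSS={rss} kB")
        if rss is None or rss >= limit_kb:
            return rss
        time.sleep(interval)

if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python watch_rss.py <pid> <limit_kb>
    watch(int(sys.argv[1]), int(sys.argv[2]))
```

When the script returns, the server can still be shut down with "down" in its console, well before the kernel's OOM killer steps in.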

millerpaint

OK, the process terminated, and I have attached the latest log file.


-Kevin C.

Victor Kirhenshtein

Did you shut it down, or did it just crash from lack of memory? I ask because the log is the same. Don't wait for the crash: you have to run it for some time and then shut it down correctly, otherwise valgrind will not be able to analyze the address space of the process and find memory leaks.

Best regards,
Victor

millerpaint

Yes, it crashed.  I have started it once again, and will shut it down manually before it runs out of memory.


-Kevin C.

millerpaint

It crashed again, within 25 minutes and before I had a chance to stop it   :(

I will start it again, and stop it after 15 minutes.


-Kevin C.

millerpaint

OK, I stopped it after it had run for ~7 minutes, since it was crashing within 10 minutes.  The log file is attached.


-Kevin C.

testos

What version of MySQL are you using?
Check whether anything looks strange in log files such as:
/var/log/messages
/var/log/nxagentd
/var/log/netxmsd.log
/var/log/mysqld.log
etc.

millerpaint

The version of MySQL I am running is 5.1.61.

I'm not sure exactly what to look for in the log files.  There is some detailed information in messages:

=====================================================
Feb  5 10:02:58 netmgmt abrtd: Init complete, entering main loop
Feb  5 14:57:25 netmgmt kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Feb  5 14:57:25 netmgmt kernel: mysqld cpuset=/ mems_allowed=0
Feb  5 14:57:25 netmgmt kernel: Pid: 1604, comm: mysqld Not tainted 2.6.32-279.9.1.el6.x86_64 #1
Feb  5 14:57:25 netmgmt kernel: Call Trace:
Feb  5 14:57:25 netmgmt kernel: [<ffffffff810c4c71>] ? cpuset_print_task_mems_allowed+0x91/0xb0
Feb  5 14:57:25 netmgmt kernel: [<ffffffff811173e0>] ? dump_header+0x90/0x1b0
Feb  5 14:57:25 netmgmt kernel: [<ffffffff81214a0c>] ? security_real_capable_noaudit+0x3c/0x70
Feb  5 14:57:25 netmgmt kernel: [<ffffffff81117862>] ? oom_kill_process+0x82/0x2a0
Feb  5 14:57:25 netmgmt kernel: [<ffffffff811177a1>] ? select_bad_process+0xe1/0x120
Feb  5 14:57:25 netmgmt kernel: [<ffffffff81117ca0>] ? out_of_memory+0x220/0x3c0
Feb  5 14:57:25 netmgmt kernel: [<ffffffff811279be>] ? __alloc_pages_nodemask+0x89e/0x940
Feb  5 14:57:25 netmgmt kernel: [<ffffffff8115c51a>] ? alloc_pages_current+0xaa/0x110
Feb  5 14:57:25 netmgmt kernel: [<ffffffff811147e7>] ? __page_cache_alloc+0x87/0x90
Feb  5 14:57:25 netmgmt kernel: [<ffffffff8112a40b>] ? __do_page_cache_readahead+0xdb/0x210
Feb  5 14:57:25 netmgmt kernel: [<ffffffff8112a561>] ? ra_submit+0x21/0x30
Feb  5 14:57:25 netmgmt kernel: [<ffffffff81115b13>] ? filemap_fault+0x4c3/0x500
Feb  5 14:57:25 netmgmt kernel: [<ffffffff81136b6f>] ? __inc_zone_state+0x1f/0x70
Feb  5 14:57:25 netmgmt kernel: [<ffffffff8113ef14>] ? __do_fault+0x54/0x510
Feb  5 14:57:25 netmgmt kernel: [<ffffffff8113f4c7>] ? handle_pte_fault+0xf7/0xb50
Feb  5 14:57:25 netmgmt kernel: [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
Feb  5 14:57:25 netmgmt kernel: [<ffffffff811913fc>] ? core_sys_select+0x1ec/0x2c0
Feb  5 14:57:25 netmgmt kernel: [<ffffffff81140104>] ? handle_mm_fault+0x1e4/0x2b0
Feb  5 14:57:25 netmgmt kernel: [<ffffffff810444c9>] ? __do_page_fault+0x139/0x480
Feb  5 14:57:25 netmgmt kernel: [<ffffffff81278bec>] ? rb_erase+0x1bc/0x310
Feb  5 14:57:25 netmgmt kernel: [<ffffffff81012bd9>] ? read_tsc+0x9/0x20
Feb  5 14:57:25 netmgmt kernel: [<ffffffff8109cea9>] ? ktime_get_ts+0xa9/0xe0
Feb  5 14:57:25 netmgmt kernel: [<ffffffff8118fe58>] ? poll_select_copy_remaining+0xf8/0x150
Feb  5 14:57:25 netmgmt kernel: [<ffffffff810d6ad3>] ? audit_syscall_entry+0x63/0x2a0
Feb  5 14:57:25 netmgmt kernel: [<ffffffff8150380e>] ? do_page_fault+0x3e/0xa0
Feb  5 14:57:25 netmgmt kernel: [<ffffffff81500bc5>] ? page_fault+0x25/0x30
Feb  5 14:57:25 netmgmt kernel: Mem-Info:
Feb  5 14:57:25 netmgmt kernel: Node 0 DMA per-cpu:
Feb  5 14:57:25 netmgmt kernel: CPU    0: hi:    0, btch:   1 usd:   0
Feb  5 14:57:25 netmgmt kernel: Node 0 DMA32 per-cpu:
Feb  5 14:57:25 netmgmt kernel: CPU    0: hi:  186, btch:  31 usd:  61
Feb  5 14:57:25 netmgmt kernel: Node 0 Normal per-cpu:
Feb  5 14:57:25 netmgmt kernel: CPU    0: hi:  186, btch:  31 usd:  74
Feb  5 14:57:25 netmgmt kernel: active_anon:671184 inactive_anon:251894 isolated_anon:0
Feb  5 14:57:25 netmgmt kernel: active_file:74 inactive_file:923 isolated_file:0
Feb  5 14:57:25 netmgmt kernel: unevictable:0 dirty:0 writeback:0 unstable:0
Feb  5 14:57:25 netmgmt kernel: free:21743 slab_reclaimable:2109 slab_unreclaimable:13036
Feb  5 14:57:25 netmgmt kernel: mapped:125 shmem:0 pagetables:4921 bounce:0
Feb  5 14:57:25 netmgmt kernel: Node 0 DMA free:15684kB min:248kB low:308kB high:372kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15292kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Feb  5 14:57:25 netmgmt kernel: lowmem_reserve[]: 0 3000 4010 4010
Feb  5 14:57:25 netmgmt kernel: Node 0 DMA32 free:54324kB min:50372kB low:62964kB high:75556kB active_anon:2236200kB inactive_anon:559016kB active_file:260kB inactive_file:3692kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3072160kB mlocked:0kB dirty:0kB writeback:0kB mapped:492kB shmem:0kB slab_reclaimable:212kB slab_unreclaimable:260kB kernel_stack:0kB pagetables:6660kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:401 all_unreclaimable? yes
Feb  5 14:57:25 netmgmt kernel: lowmem_reserve[]: 0 0 1010 1010
Feb  5 14:57:25 netmgmt kernel: Node 0 Normal free:16964kB min:16956kB low:21192kB high:25432kB active_anon:448536kB inactive_anon:448560kB active_file:36kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1034240kB mlocked:0kB dirty:0kB writeback:0kB mapped:8kB shmem:0kB slab_reclaimable:8224kB slab_unreclaimable:51884kB kernel_stack:3744kB pagetables:13024kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:73 all_unreclaimable? yes
Feb  5 14:57:25 netmgmt kernel: lowmem_reserve[]: 0 0 0 0
Feb  5 14:57:25 netmgmt kernel: Node 0 DMA: 1*4kB 4*8kB 2*16kB 2*32kB 3*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15684kB
Feb  5 14:57:25 netmgmt kernel: Node 0 DMA32: 123*4kB 123*8kB 35*16kB 12*32kB 5*64kB 1*128kB 3*256kB 3*512kB 36*1024kB 6*2048kB 0*4096kB = 54324kB
Feb  5 14:57:25 netmgmt kernel: Node 0 Normal: 433*4kB 227*8kB 105*16kB 45*32kB 25*64kB 10*128kB 7*256kB 1*512kB 3*1024kB 1*2048kB 0*4096kB = 16972kB
Feb  5 14:57:25 netmgmt kernel: 1745 total pagecache pages
Feb  5 14:57:25 netmgmt kernel: 733 pages in swap cache
Feb  5 14:57:25 netmgmt kernel: Swap cache stats: add 1035805, delete 1035072, find 1381/1742
Feb  5 14:57:25 netmgmt kernel: Free swap  = 0kB
Feb  5 14:57:25 netmgmt kernel: Total swap = 4128760kB
Feb  5 14:57:25 netmgmt kernel: 1048560 pages RAM
Feb  5 14:57:25 netmgmt kernel: 67324 pages reserved
Feb  5 14:57:25 netmgmt kernel: 267 pages shared
Feb  5 14:57:25 netmgmt kernel: 955440 pages non-shared
Feb  5 14:57:25 netmgmt kernel: [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
Feb  5 14:57:25 netmgmt kernel: [  498]     0   498     2795        0   0     -17         -1000 udevd
Feb  5 14:57:25 netmgmt kernel: [ 1111]     0  1111     6909       28   0     -17         -1000 auditd
Feb  5 14:57:25 netmgmt kernel: [ 1136]     0  1136    62271       44   0       0             0 rsyslogd
Feb  5 14:57:25 netmgmt kernel: [ 1178]    32  1178     4743       15   0       0             0 rpcbind
Feb  5 14:57:25 netmgmt kernel: [ 1196]    29  1196     5836        1   0       0             0 rpc.statd
Feb  5 14:57:25 netmgmt kernel: [ 1208]     0  1208     1143       10   0       0             0 mdadm
Feb  5 14:57:25 netmgmt kernel: [ 1234]     0  1234     6290        1   0       0             0 rpc.idmapd
Feb  5 14:57:25 netmgmt kernel: [ 1328]    81  1328     7944        1   0       0             0 dbus-daemon
Feb  5 14:57:25 netmgmt kernel: [ 1340]     0  1340    47289        1   0       0             0 cupsd
Feb  5 14:57:25 netmgmt kernel: [ 1365]     0  1365     1019        0   0       0             0 acpid
Feb  5 14:57:25 netmgmt kernel: [ 1374]    68  1374     6323      111   0       0             0 hald
Feb  5 14:57:25 netmgmt kernel: [ 1375]     0  1375     4526        1   0       0             0 hald-runner
Feb  5 14:57:25 netmgmt kernel: [ 1403]     0  1403     5055        1   0       0             0 hald-addon-inpu
Feb  5 14:57:25 netmgmt kernel: [ 1414]    68  1414     4451        1   0       0             0 hald-addon-acpi
Feb  5 14:57:25 netmgmt kernel: [ 1435]     0  1435    96427       31   0       0             0 automount
Feb  5 14:57:25 netmgmt kernel: [ 1451]     0  1451     1564        0   0       0             0 mcelog
Feb  5 14:57:25 netmgmt kernel: [ 1463]     0  1463    16018        0   0     -17         -1000 sshd
Feb  5 14:57:25 netmgmt kernel: [ 1471]    38  1471     7540       30   0       0             0 ntpd
Feb  5 14:57:25 netmgmt kernel: [ 1507]     0  1507    27050        1   0       0             0 mysqld_safe
Feb  5 14:57:25 netmgmt kernel: [ 1596]    27  1596   176877     2710   0       0             0 mysqld
Feb  5 14:57:25 netmgmt kernel: [ 1687]     0  1687    19669       22   0       0             0 master
Feb  5 14:57:25 netmgmt kernel: [ 1697]    89  1697    19732       22   0       0             0 qmgr
Feb  5 14:57:25 netmgmt kernel: [ 1711]     0  1711    27543        1   0       0             0 abrtd
Feb  5 14:57:25 netmgmt kernel: [ 1719]     0  1719    27016        1   0       0             0 abrt-dump-oops
Feb  5 14:57:25 netmgmt kernel: [ 1727]     0  1727    29301       23   0       0             0 crond
Feb  5 14:57:25 netmgmt kernel: [ 1738]     0  1738     5363        5   0       0             0 atd
Feb  5 14:57:25 netmgmt kernel: [ 1751]     0  1751    19274        1   0       0             0 login
Feb  5 14:57:25 netmgmt kernel: [ 1753]     0  1753     1015        1   0       0             0 mingetty
Feb  5 14:57:25 netmgmt kernel: [ 1756]     0  1756     1015        1   0       0             0 mingetty
Feb  5 14:57:25 netmgmt kernel: [ 1757]     0  1757     3091        0   0     -17         -1000 udevd
Feb  5 14:57:25 netmgmt kernel: [ 1759]     0  1759     1015        1   0       0             0 mingetty
Feb  5 14:57:25 netmgmt kernel: [ 1760]     0  1760     3091        0   0     -17         -1000 udevd
Feb  5 14:57:25 netmgmt kernel: [ 1762]     0  1762     1015        1   0       0             0 mingetty
Feb  5 14:57:25 netmgmt kernel: [ 1764]     0  1764     1015        1   0       0             0 mingetty
Feb  5 14:57:25 netmgmt kernel: [ 1771]     0  1771   143729        1   0       0             0 console-kit-dae
Feb  5 14:57:25 netmgmt kernel: [ 1837]     0  1837    27083        1   0       0             0 bash
Feb  5 14:57:25 netmgmt kernel: [ 1857]     0  1857    92804      612   0       0             0 nxagentd
Feb  5 14:57:25 netmgmt kernel: [ 1871]     0  1871  2179534   918835   0       0             0 netxmsd
Feb  5 14:57:25 netmgmt kernel: [ 4627]    89  4627    19689       16   0       0             0 pickup
Feb  5 14:57:25 netmgmt kernel: Out of memory: Kill process 1871 (netxmsd) score 937 or sacrifice child
Feb  5 14:57:25 netmgmt kernel: Killed process 1871, UID 0, (netxmsd) total-vm:8718136kB, anon-rss:3674960kB, file-rss:380kB

==============================================
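
A note on reading the dump above: the total_vm and rss columns of the process table are in pages, not kB. The sketch below converts three rows copied from the table, assuming 4 kB pages (the usual size on x86_64); the regular expression and function names are illustrative only.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, the usual size on x86_64

# Three rows copied from the dump above (syslog prefix removed).
SAMPLE = """\
[ 1596]    27  1596   176877     2710   0       0             0 mysqld
[ 1857]     0  1857    92804      612   0       0             0 nxagentd
[ 1871]     0  1871  2179534   918835   0       0             0 netxmsd
"""

# Columns: [pid] uid tgid total_vm rss cpu oom_adj oom_score_adj name
ROW = re.compile(
    r"\[\s*(\d+)\]\s+\d+\s+\d+\s+(\d+)\s+(\d+)\s+\d+\s+-?\d+\s+-?\d+\s+(\S+)"
)

def parse_rows(text):
    """Yield (pid, name, total_vm_kb, rss_kb) for each process-table row."""
    for m in ROW.finditer(text):
        pid, total_vm, rss, name = m.groups()
        yield int(pid), name, int(total_vm) * PAGE_KB, int(rss) * PAGE_KB

def biggest(text):
    """Return the row with the largest resident set."""
    return max(parse_rows(text), key=lambda row: row[3])

if __name__ == "__main__":
    print(biggest(SAMPLE))
```

For netxmsd this gives 3675340 kB resident, which equals anon-rss plus file-rss in the "Killed process" line (3674960 + 380), and 8718136 kB of total_vm, which matches total-vm there. "mysqld invoked oom-killer" only means that an allocation by mysqld triggered the killer; the process holding the memory was netxmsd, and with "Free swap = 0kB" both RAM and swap were exhausted.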

millerpaint

Does anyone have any ideas on what may be causing this out of memory condition?

Any help would be greatly appreciated.


-Kevin C.