I know from the many NMS installations I have implemented and tested that something is wrong here. I am just not sure where to start digging and am hoping to get some pointers. We are running NetXMS 1.2.1 (from the .deb files) on a quad-core Xeon machine with 4 GB RAM and Ubuntu 12.04. We currently have about 200 nodes and 1500 DCIs set up. The machine is running NOTHING other than NetXMS and the services needed to support the installation. Here is a screen grab from top:
# top
top - 11:15:55 up 18:36, 1 user, load average: 14.94, 14.19, 14.36
Tasks: 30 total, 1 running, 29 sleeping, 0 stopped, 0 zombie
Cpu(s): 23.2%us, 48.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 28.0%st
Mem: 4194304k total, 436432k used, 3757872k free, 0k buffers
Swap: 2097152k total, 0k used, 2097152k free, 176968k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
284 root 20 0 2282m 24m 3592 S 400 0.6 3679:06 netxmsd
236 mysql 20 0 1654m 211m 8000 S 24 5.2 617:58.69 mysqld
291 root 20 0 632m 4884 1460 S 10 0.1 194:04.51 nxagentd
1 root 20 0 24024 2024 1340 S 0 0.0 0:00.15 init
2 root 20 0 0 0 0 S 0 0.0 0:00.00 kthreadd/105
...
I have considered recompiling from source, but I prefer to use packages where available, for ease of installation and hopefully optimal build settings. Can anybody else confirm they are running the .deb files under the latest Ubuntu LTS?
I do have a question about the software itself, though. Are containers inside the "Infrastructure Services" node just logical groupings? That is what I have assumed, so we have nodes that appear under three or more containers. We have set up an "all nodes" container, plus containers grouped by device type, by location, and by a few other criteria we use to analyze the nodes in our network. I am assuming the software polls each node once, no matter how many containers it appears under. If that is NOT the case, then I probably have too much activity going on.
Thanks!
The situation is definitely not normal. Can you please open the server debug console (from the management console or with nxadm -i) and send me the output of the following commands:
show sessions
show pollers
show queue
show stats
show flags
show watchdog
You are correct, containers in the infrastructure tree are for logical grouping only. Polling is not related to a node's location in the tree, and a node is polled only once no matter how many times it appears there.
Best regards,
Victor
The two clients are my Android phone and my Windows machine. The Android app is very nice, by the way, but it is definitely battery intensive.
netxmsd: show sessions
ID STATE CIPHER USER [CLIENT]
0 idle AES-256 [email protected] [nxjclient/1.2.1 (Linux 2.6.32.9-00005-g2440aba; libnxcl 1.2.1)]
1 idle AES-128 [email protected] [nxjclient/1.2.1 (Windows 7 6.1; libnxcl 1.2.1)]
2 active sessions
netxmsd: show pollers
PT TIME STATE
S 11/Jul/2012 11:55:57 wait
S 11/Jul/2012 11:55:59 wait
S 11/Jul/2012 11:56:01 wait
S 11/Jul/2012 11:56:01 wait
S 11/Jul/2012 11:56:01 wait
S 11/Jul/2012 11:55:54 wait
S 11/Jul/2012 11:55:57 wait
S 11/Jul/2012 11:56:01 wait
S 11/Jul/2012 11:55:54 wait
S 11/Jul/2012 11:55:58 wait
S 11/Jul/2012 11:56:02 wait
S 11/Jul/2012 11:55:57 wait
S 11/Jul/2012 11:55:58 wait
S 11/Jul/2012 11:56:02 wait
S 11/Jul/2012 11:55:53 wait
S 11/Jul/2012 11:56:01 wait
S 11/Jul/2012 11:55:54 wait
S 11/Jul/2012 11:56:01 wait
S 11/Jul/2012 11:56:01 wait
S 11/Jul/2012 11:56:02 wait
S 11/Jul/2012 11:56:01 wait
S 11/Jul/2012 11:55:59 wait
S 11/Jul/2012 11:55:58 wait
S 11/Jul/2012 11:56:01 wait
S 11/Jul/2012 11:55:58 wait
C 11/Jul/2012 11:54:11 wait
C 11/Jul/2012 11:51:45 wait
C 11/Jul/2012 11:55:59 wait
C 11/Jul/2012 11:55:57 poll: xxx00-au-nn.excel.net [2268] - capability check
C 11/Jul/2012 11:54:04 wait
C 11/Jul/2012 11:56:01 poll: xxx00-au-nn.excel.net [2275] - capability check
C 11/Jul/2012 11:51:51 wait
C 11/Jul/2012 11:52:04 wait
C 11/Jul/2012 11:55:18 wait
C 11/Jul/2012 11:55:18 wait
C 11/Jul/2012 11:51:56 wait
C 11/Jul/2012 11:55:35 wait
C 11/Jul/2012 11:56:03 wait
C 11/Jul/2012 11:55:25 poll: xxx03-au-xx.excel.net [2210] - capability check
C 11/Jul/2012 11:56:03 wait
R 11/Jul/2012 11:51:33 wait
R 11/Jul/2012 11:50:57 wait
R 11/Jul/2012 11:51:41 wait
R 11/Jul/2012 11:52:23 wait
R 11/Jul/2012 11:51:29 wait
R 11/Jul/2012 11:51:29 wait
R 11/Jul/2012 11:51:29 wait
R 11/Jul/2012 11:56:00 wait
R 11/Jul/2012 11:52:04 wait
R 11/Jul/2012 11:55:49 poll: xxx05-rtr-00.excel.net [2043]
D 11/Jul/2012 11:53:08 wait
N 10/Jul/2012 16:39:09 wait
N 10/Jul/2012 16:39:09 wait
N 10/Jul/2012 16:39:09 wait
N 10/Jul/2012 16:39:09 wait
N 10/Jul/2012 16:39:09 wait
N 10/Jul/2012 16:39:09 wait
N 10/Jul/2012 16:39:09 wait
N 10/Jul/2012 16:39:09 wait
N 10/Jul/2012 16:39:09 wait
N 10/Jul/2012 16:39:09 wait
T 11/Jul/2012 11:54:14 wait
T 11/Jul/2012 11:55:19 wait
T 11/Jul/2012 11:55:13 wait
T 11/Jul/2012 11:55:19 wait
T 11/Jul/2012 11:55:35 wait
T 11/Jul/2012 11:54:41 wait
T 11/Jul/2012 11:56:03 wait
T 11/Jul/2012 11:56:01 poll: xxx00-au-nn.excel.net [2268]
T 11/Jul/2012 11:55:08 wait
T 11/Jul/2012 11:54:13 wait
B 10/Jul/2012 16:41:09 wait
B 10/Jul/2012 16:41:09 wait
B 10/Jul/2012 16:41:09 wait
B 10/Jul/2012 16:41:09 wait
B 10/Jul/2012 16:41:09 wait
B 10/Jul/2012 16:41:09 wait
B 10/Jul/2012 16:41:09 wait
B 10/Jul/2012 16:41:09 wait
B 10/Jul/2012 16:41:09 wait
B 10/Jul/2012 16:41:09 wait
A 11/Jul/2012 05:11:51 wait
netxmsd: show queue
Condition poller : 0
Configuration poller : 0
Topology poller : 0
Data collector : 0
Database writer : 0
Database writer (IData) : 0
Event processor : 0
Network discovery poller : 0
Node poller : 1587
Routing table poller : 0
Status poller : 0
netxmsd: show stats
Total number of objects: 1580
Number of monitored nodes: 172
Number of collectable DCIs: 1396
netxmsd: show flags
Flags: 0x4310067D
AF_DAEMON = 1
AF_USE_SYSLOG = 0
AF_ENABLE_NETWORK_DISCOVERY = 1
AF_ACTIVE_NETWORK_DISCOVERY = 1
AF_LOG_SQL_ERRORS = 1
AF_DELETE_EMPTY_SUBNETS = 1
AF_ENABLE_SNMP_TRAPD = 1
AF_ENABLE_ZONING = 0
AF_SYNC_NODE_NAMES_WITH_DNS = 0
AF_CHECK_TRUSTED_NODES = 1
AF_WRITE_FULL_DUMP = 0
AF_RESOLVE_NODE_NAMES = 1
AF_CATCH_EXCEPTIONS = 0
AF_INTERNAL_CA = 0
AF_DB_LOCKED = 1
AF_ENABLE_MULTIPLE_DB_CONN = 1
AF_DB_CONNECTION_LOST = 0
AF_NO_NETWORK_CONNECTIVITY = 0
AF_EVENT_STORM_DETECTED = 0
AF_SERVER_INITIALIZED = 1
AF_SHUTDOWN = 0
netxmsd: show watchdog
Thread Interval Status
----------------------------------------------------------------------------
Item Poller 20 Running
Syncer Thread 130 Running
Poll Manager 60 Running
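For readers skimming the console output above: every queue in `show queue` is empty except "Node poller", which sits at 1587. A minimal Python sketch for picking out backlogged queues from pasted output follows. The function names and the zero threshold are illustrative assumptions, not part of NetXMS, and the parser only assumes the "name : number" layout shown in this thread.

```python
# Hypothetical helper: parse pasted "show queue" output and list non-empty queues.
# Assumes each queue line looks like "Queue name : <integer>", as in this thread.

def parse_queue_output(text):
    """Return {queue name: size} for every line of the form 'name : number'."""
    queues = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        if not sep:
            continue  # no colon on this line
        value = value.strip()
        if value.isdigit():
            queues[name.strip()] = int(value)
    return queues


def backlogged(queues, threshold=0):
    """Return the queues whose size is above the (assumed) threshold."""
    return {name: size for name, size in queues.items() if size > threshold}


SAMPLE = """\
netxmsd: show queue
Condition poller : 0
Configuration poller : 0
Topology poller : 0
Data collector : 0
Database writer : 0
Database writer (IData) : 0
Event processor : 0
Network discovery poller : 0
Node poller : 1587
Routing table poller : 0
Status poller : 0
"""

if __name__ == "__main__":
    for name, size in backlogged(parse_queue_output(SAMPLE)).items():
        print(f"{name}: {size} queued")
```

Running it against the output pasted above prints only the "Node poller" line.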
Could it be that you have router(s) with very large routing tables, such as a full BGP table?
Best regards,
Victor
Yes, I have multiple full BGP tables on the core router. Also, depending on what you consider large, some of the internal routers have a lot of routes as well.
What are routing table and topology polling being used for at this point? Since you are not auto-creating dependencies, I am not sure what they are for, other than potentially discovery.
Potentially we can just turn this off on all devices. I did turn those two off on the core router and have not seen any change, but maybe it has a lot queued up that it still needs to process. I will give it some time.
Turn off routing and discovery polls on those BGP routers (you did that already, if I understand correctly). Restart netxmsd after that if possible.
Routing table polls are used for discovery, and can also be used for monitoring routing changes via the Net.IP.NextHop internal parameter.
Best regards,
Victor
Ok, so on the core router I checked:
- Disable routing table polling
- Disable topology polling
- Disable network discovery polling
I stopped netxmsd and verified that the load did indeed drop to very near zero, which it did almost right away. I then restarted the server and waited a few hours in case it needed time to calm down. This had no effect: load averages were at the same level as before.
Downloaded the source code to the machine and compiled using:
./configure --prefix=/usr --with-server --with-mysql --with-agent
All went well and then installed/restarted again. Had no positive / negative effect other than what appears to be a minimal amount more RAM being used - so you apparently have better compiler settings than I do :)
Open to any further suggestions. Thanks!
Some time passed already since July 1st, but just in case - could it be this problem: https://www.netxms.org/forum/announcements/high-cpu-usage-after-172012/?
Could you also please attach debugger to netxmsd process and dump stack trace for all threads:
# gdb /usr/bin/netxmsd
(gdb) attach <netxmsd_pid>
(gdb) thread apply all bt
(gdb) detach
Best regards,
Victor
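The interactive gdb session above can also be run non-interactively, which makes it easier to capture the output in a file for attaching to a post. The sketch below is only one way to do it: the helper names and the output path are hypothetical, and it assumes a gdb that supports the standard `-p`, `-batch` and `-ex` options. In batch mode gdb exits after running the command, which should release the process again.

```python
# Hypothetical wrapper around gdb for dumping all thread backtraces of a process.
import subprocess


def gdb_backtrace_command(pid):
    """Build the gdb argv equivalent to: attach <pid>, thread apply all bt, detach."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_backtraces(pid, out_path):
    """Run gdb against the given PID and write its output to out_path.

    Needs enough privileges to attach to the process (typically root).
    """
    with open(out_path, "w") as out:
        return subprocess.run(
            gdb_backtrace_command(pid),
            stdout=out,
            stderr=subprocess.STDOUT,
            check=False,
        ).returncode


if __name__ == "__main__":
    # 284 is the netxmsd PID from the top output earlier in this thread;
    # substitute the current PID on your own system.
    print(" ".join(gdb_backtrace_command(284)))
```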
Yeah, I had already run into the July 1st post and have ensured that this is not the issue we are running into.
I have attached the gdb output, thanks!
Ok, this has been resolved. While I had checked for the leap second bug on the machine itself, I had not checked the host it is virtualized on. The host was affected by the issue, and after fixing it the NetXMS server machine returned to normal. In fact, the load is now crazy LOW (0.06), even though everything is running perfectly from what I can tell. Odd, because other machines on this host did not seem to be having this issue.
Sorry about this and do appreciate all of the assistance!
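In hindsight, the top output in the first post carried a hint that the problem might lie outside the guest: the Cpu(s) line shows 28.0%st, and `st` is steal time, i.e. CPU time the virtual machine wanted but did not get from the host. A small Python sketch for pulling that figure out of a pasted top header follows. The function names and the 10% threshold are arbitrary assumptions for illustration, and the parser only handles the older "23.2%us" layout shown in this thread.

```python
# Hypothetical helper: extract CPU percentages from an old-style top "Cpu(s)" line.
import re


def parse_top_cpu_line(line):
    """Return e.g. {'us': 23.2, 'sy': 48.7, ..., 'st': 28.0} from a 'Cpu(s):' line."""
    return {key: float(value) for value, key in re.findall(r"([\d.]+)%(\w+)", line)}


def high_steal(cpu, threshold=10.0):
    """True if steal time is above the (assumed, arbitrary) threshold in percent."""
    return cpu.get("st", 0.0) > threshold


LINE = "Cpu(s): 23.2%us, 48.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 28.0%st"

if __name__ == "__main__":
    cpu = parse_top_cpu_line(LINE)
    if high_steal(cpu):
        print(f"steal time is {cpu['st']}%: worth checking the virtualization host")
```

A high steal value does not prove the host is at fault, but it is a cheap signal that the host deserves a look alongside the guest.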