Server Performance

Started by lweidig, July 11, 2012, 07:29:33 PM

lweidig

I know from the many NMS systems I have implemented and tested that something is up; I'm just not sure where to start digging to find the issue, and I'm hoping to get some pointers. We are running NetXMS 1.2.1 (from the .deb packages) on a quad-core Xeon with 4 GB RAM under Ubuntu 12.04. We currently have about 200 nodes and 1,500 DCIs set up, and the machine is running NOTHING other than NetXMS and the services supporting the installation. Here is output from top:


# top
top - 11:15:55 up 18:36,  1 user,  load average: 14.94, 14.19, 14.36
Tasks:  30 total,   1 running,  29 sleeping,   0 stopped,   0 zombie
Cpu(s): 23.2%us, 48.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si, 28.0%st
Mem:   4194304k total,   436432k used,  3757872k free,        0k buffers
Swap:  2097152k total,        0k used,  2097152k free,   176968k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                             
  284 root      20   0 2282m  24m 3592 S  400  0.6   3679:06 netxmsd                                             
  236 mysql     20   0 1654m 211m 8000 S   24  5.2 617:58.69 mysqld                                               
  291 root      20   0  632m 4884 1460 S   10  0.1 194:04.51 nxagentd                                             
    1 root      20   0 24024 2024 1340 S    0  0.0   0:00.15 init                                                 
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd/105
...


I have considered recompiling from source, but I prefer to use packages when available, for ease of installation and (hopefully) optimal build settings. Can anybody else confirm they are running the .deb packages under the latest Ubuntu LTS?

I do have a question about the software itself, though. Are containers inside the "Infrastructure Services" tree just logical groupings? That is what I have assumed, so we do have nodes that appear under 3+ containers. We have set up an "all nodes" container, containers grouped by device type, containers grouped by location, and a few other ways we like to analyze the nodes in our network. I am assuming the software polls each node only once, no matter how many times it appears under various containers. If that is NOT the case, then I probably have far too much polling activity going on.

Thanks!

Victor Kirhenshtein

This situation is definitely not normal. Could you please open the server debug console (from the management console or with nxadm -i) and send me the output of the following commands (one way to capture them all non-interactively is sketched after the list):

show sessions
show pollers
show queue
show stats
show flags
show watchdog
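
If it is more convenient, something like this should collect everything in one file (a sketch; it assumes this build's nxadm accepts the -c option for one-shot command execution - if it does not, just run the commands from the interactive console):

# Run each debug console command and collect the output in one file
for cmd in "show sessions" "show pollers" "show queue" \
           "show stats" "show flags" "show watchdog"; do
    echo "=== $cmd ==="
    nxadm -c "$cmd"
done > netxmsd-debug.txt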


You are correct: containers in the infrastructure tree are for logical grouping only. Polling is not related to a node's placement in containers, and a node is polled only once no matter how many times it appears in the tree.

Best regards,
Victor

lweidig

The two clients are my Android phone and my Windows machine. The Android app is very nice, by the way, but it is definitely battery-intensive.



netxmsd: show sessions
ID  STATE                    CIPHER   USER [CLIENT]
0   idle                     AES-256  [email protected].249 [nxjclient/1.2.1 (Linux 2.6.32.9-00005-g2440aba; libnxcl 1.2.1)]
1   idle                     AES-128  [email protected].131 [nxjclient/1.2.1 (Windows 7 6.1; libnxcl 1.2.1)]

2 active sessions

netxmsd: show pollers
PT  TIME                   STATE
S   11/Jul/2012 11:55:57   wait
S   11/Jul/2012 11:55:59   wait
S   11/Jul/2012 11:56:01   wait
S   11/Jul/2012 11:56:01   wait
S   11/Jul/2012 11:56:01   wait
S   11/Jul/2012 11:55:54   wait
S   11/Jul/2012 11:55:57   wait
S   11/Jul/2012 11:56:01   wait
S   11/Jul/2012 11:55:54   wait
S   11/Jul/2012 11:55:58   wait
S   11/Jul/2012 11:56:02   wait
S   11/Jul/2012 11:55:57   wait
S   11/Jul/2012 11:55:58   wait
S   11/Jul/2012 11:56:02   wait
S   11/Jul/2012 11:55:53   wait
S   11/Jul/2012 11:56:01   wait
S   11/Jul/2012 11:55:54   wait
S   11/Jul/2012 11:56:01   wait
S   11/Jul/2012 11:56:01   wait
S   11/Jul/2012 11:56:02   wait
S   11/Jul/2012 11:56:01   wait
S   11/Jul/2012 11:55:59   wait
S   11/Jul/2012 11:55:58   wait
S   11/Jul/2012 11:56:01   wait
S   11/Jul/2012 11:55:58   wait
C   11/Jul/2012 11:54:11   wait
C   11/Jul/2012 11:51:45   wait
C   11/Jul/2012 11:55:59   wait
C   11/Jul/2012 11:55:57   poll: xxx00-au-nn.excel.net [2268] - capability check
C   11/Jul/2012 11:54:04   wait
C   11/Jul/2012 11:56:01   poll: xxx00-au-nn.excel.net [2275] - capability check
C   11/Jul/2012 11:51:51   wait
C   11/Jul/2012 11:52:04   wait
C   11/Jul/2012 11:55:18   wait
C   11/Jul/2012 11:55:18   wait
C   11/Jul/2012 11:51:56   wait
C   11/Jul/2012 11:55:35   wait
C   11/Jul/2012 11:56:03   wait
C   11/Jul/2012 11:55:25   poll: xxx03-au-xx.excel.net [2210] - capability check
C   11/Jul/2012 11:56:03   wait
R   11/Jul/2012 11:51:33   wait
R   11/Jul/2012 11:50:57   wait
R   11/Jul/2012 11:51:41   wait
R   11/Jul/2012 11:52:23   wait
R   11/Jul/2012 11:51:29   wait
R   11/Jul/2012 11:51:29   wait
R   11/Jul/2012 11:51:29   wait
R   11/Jul/2012 11:56:00   wait
R   11/Jul/2012 11:52:04   wait
R   11/Jul/2012 11:55:49   poll: xxx05-rtr-00.excel.net [2043]
D   11/Jul/2012 11:53:08   wait
N   10/Jul/2012 16:39:09   wait
N   10/Jul/2012 16:39:09   wait
N   10/Jul/2012 16:39:09   wait
N   10/Jul/2012 16:39:09   wait
N   10/Jul/2012 16:39:09   wait
N   10/Jul/2012 16:39:09   wait
N   10/Jul/2012 16:39:09   wait
N   10/Jul/2012 16:39:09   wait
N   10/Jul/2012 16:39:09   wait
N   10/Jul/2012 16:39:09   wait
T   11/Jul/2012 11:54:14   wait
T   11/Jul/2012 11:55:19   wait
T   11/Jul/2012 11:55:13   wait
T   11/Jul/2012 11:55:19   wait
T   11/Jul/2012 11:55:35   wait
T   11/Jul/2012 11:54:41   wait
T   11/Jul/2012 11:56:03   wait
T   11/Jul/2012 11:56:01   poll: xxx00-au-nn.excel.net [2268]
T   11/Jul/2012 11:55:08   wait
T   11/Jul/2012 11:54:13   wait
B   10/Jul/2012 16:41:09   wait
B   10/Jul/2012 16:41:09   wait
B   10/Jul/2012 16:41:09   wait
B   10/Jul/2012 16:41:09   wait
B   10/Jul/2012 16:41:09   wait
B   10/Jul/2012 16:41:09   wait
B   10/Jul/2012 16:41:09   wait
B   10/Jul/2012 16:41:09   wait
B   10/Jul/2012 16:41:09   wait
B   10/Jul/2012 16:41:09   wait
A   11/Jul/2012 05:11:51   wait

netxmsd: show queue
Condition poller                 : 0
Configuration poller             : 0
Topology poller                  : 0
Data collector                   : 0
Database writer                  : 0
Database writer (IData)          : 0
Event processor                  : 0
Network discovery poller         : 0
Node poller                      : 1587
Routing table poller             : 0
Status poller                    : 0

netxmsd: show stats
Total number of objects:     1580
Number of monitored nodes:   172
Number of collectable DCIs:  1396

netxmsd: show flags
Flags: 0x4310067D
  AF_DAEMON                        = 1
  AF_USE_SYSLOG                    = 0
  AF_ENABLE_NETWORK_DISCOVERY      = 1
  AF_ACTIVE_NETWORK_DISCOVERY      = 1
  AF_LOG_SQL_ERRORS                = 1
  AF_DELETE_EMPTY_SUBNETS          = 1
  AF_ENABLE_SNMP_TRAPD             = 1
  AF_ENABLE_ZONING                 = 0
  AF_SYNC_NODE_NAMES_WITH_DNS      = 0
  AF_CHECK_TRUSTED_NODES           = 1
  AF_WRITE_FULL_DUMP               = 0
  AF_RESOLVE_NODE_NAMES            = 1
  AF_CATCH_EXCEPTIONS              = 0
  AF_INTERNAL_CA                   = 0
  AF_DB_LOCKED                     = 1
  AF_ENABLE_MULTIPLE_DB_CONN       = 1
  AF_DB_CONNECTION_LOST            = 0
  AF_NO_NETWORK_CONNECTIVITY       = 0
  AF_EVENT_STORM_DETECTED          = 0
  AF_SERVER_INITIALIZED            = 1
  AF_SHUTDOWN                      = 0

netxmsd: show watchdog
Thread                                           Interval Status
----------------------------------------------------------------------------
Item Poller                                      20       Running
Syncer Thread                                    130      Running
Poll Manager                                     60       Running

Victor Kirhenshtein

Could it be that you have router(s) with very large routing tables, like a full BGP table?

Best regards,
Victor

lweidig

Yes, I have multiple full BGP tables on the core router. Also, depending on what you consider large, some of the internal routers have a lot of routes as well.
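
To put a number on "large", counting rows in the SNMP routing table gives a rough idea (a sketch; the hostname and community string are placeholders, and some devices only populate the older ipRouteTable at 1.3.6.1.2.1.4.21 instead):

# Count entries in ipCidrRouteDest (RFC 2096); a full BGP table
# returns several hundred thousand rows
snmpwalk -v2c -c public core-router 1.3.6.1.2.1.4.24.4.1.1 | wc -l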

What are the routing table and topology polls actually being used for at this point? Since you are not auto-creating dependencies, I am not sure what they feed other than, potentially, discovery.

Potentially we can just turn this off on all devices. I did turn those two polls off on the core router and have not seen any change yet, but maybe there is a large backlog it still needs to process. I will give it some time.

Victor Kirhenshtein

Turn off routing and discovery polls on those BGP routers (you have already done that, if I understand correctly). Restart netxmsd after that if possible.

Routing table polls are used for discovery, and they can also be used for routing-change monitoring via the Net.IP.NextHop internal parameter.
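
For illustration, a routing-change DCI would use an internal parameter along these lines (the destination address here is only a placeholder, and the exact argument syntax should be checked against the documentation); a change in the returned next hop signals a route change toward that destination:

Net.IP.NextHop(192.0.2.1)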

Best regards,
Victor

lweidig

Ok, so on the core router I checked:
   - Disable routing table polling
   - Disable topology polling
   - Disable network discovery polling

I stopped netxmsd and verified that the load did indeed drop to very near zero, which it did almost right away. I then restarted the server and waited a few hours in case it needed time to calm down. That had no effect; the load averages stayed at the same level.

I downloaded the source code to the machine and compiled it using:
   ./configure --prefix=/usr --with-server --with-mysql --with-agent
All went well, and I then installed and restarted again. That had no positive or negative effect, other than what appears to be a minimal amount more RAM being used - so you apparently have better compiler settings than I do :)
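
For completeness, the rest was the usual autotools flow (a sketch; it assumes the build dependencies such as the MySQL client headers are already in place, and that the init script name matches your install):

make
sudo make install
sudo /etc/init.d/netxmsd restart   # adjust to however netxmsd is managed locally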

Open to any further suggestions.  Thanks!

Victor Kirhenshtein

Some time has passed since July 1st already, but just in case: could it be this problem: https://www.netxms.org/forum/announcements/high-cpu-usage-after-172012/?

Could you also please attach a debugger to the netxmsd process and dump stack traces for all threads:


# gdb /usr/bin/netxmsd
(gdb) attach <netxmsd_pid>
(gdb) thread apply all bt
(gdb) detach
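
The same can be captured non-interactively in one line if that is easier (a sketch; the output file name is arbitrary):

# Dump backtraces of all netxmsd threads into a file
gdb -p $(pidof netxmsd) -batch -ex "thread apply all bt" > netxmsd-threads.txt 2>&1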


Best regards,
Victor

lweidig

Yeah, I had already run into the July 1st post, and I have ensured that is not the issue we are running into.

I have attached the gdb output, thanks!

lweidig

OK, this has been resolved. While I had checked for the leap second bug on the machine itself, I had not checked the host it was virtualized on top of. That host was affected by the issue, and after fixing it the NetXMS server machine returned to normal - in fact, crazy LOW CPU usage (a load average of 0.06), even though everything is running perfectly from what I can tell. Odd, because other machines on this host did not seem to be having the issue.
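
For anyone who finds this later: the widely circulated workaround for the July 2012 leap second bug was simply to reset the system clock on the affected machine (here, the hypervisor host), e.g.:

# Resetting the time clears the stuck timer state left behind
# by the leap second insertion
date -s "$(LC_ALL=C date)"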

Sorry about this, and I do appreciate all of the assistance!