Performance issues with netxms

Started by cyril, August 03, 2017, 06:40:12 PM


cyril

Hi! We have been experiencing performance issues with NetXMS lately. There was no particular point at which NetXMS started to lag, but currently we have these problems:

  • 'Force DCI poll' button does not work
  • Large delays in DCI polling, e.g. 10 minutes between polls where a 60-second poll interval is configured
Some diagnostic info:

nxadm -c 'sh stat'
Total number of objects:     3924
Number of monitored nodes:   394
Number of collectable DCIs:  12063

nxadm -c 'sh q'
Data collector                   : 6921 (floats around 3-7K)
DCI cache loader                 : 0
Database writer                  : 0
Database writer (IData)          : 0
Database writer (raw DCI values) : 0
Event processor                  : 0
Node poller                      : 0
Syslog processing                : 0
Syslog writer                    : 0

Number of DCI collectors: 200 (increasing this number from 25 did not have much effect)
Average time to queue DCI for polling for last minute: 4
load average: 0.97, 0.93, 0.92
2 CPUs
free -h
             total       used       free     shared    buffers     cached
Mem:          2.0G       1.9G       106M       540M       138M       716M
-/+ buffers/cache:       1.0G       961M
Swap:         2.0G       330M       1.6G

iostat 60 5
Linux 3.16.0-4-amd64 (netxms) 08/03/2017 _x86_64_ (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          18.28    0.01    5.13    2.92    0.00   73.67

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              66.76       140.82       562.47  685000987 2735994668

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          14.50    0.00    3.95    2.40    0.00   79.14

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              45.10       141.40       346.67       8484      20800

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          33.37    0.00   10.06    2.58    0.00   54.00

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             150.00        22.33      1707.20       1340     102432

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.64    0.00    2.47    0.29    0.00   88.60

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              26.60         1.33       289.00         80      17340

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          27.19    0.00    7.45    1.59    0.00   63.77

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              58.25         9.47       583.13        568      34988


Tursiops

Hi,

We have around 5x the number of DCIs you have and are running with 4000 data collectors (of course this all depends on how often you actually poll and on your poller timeouts). In general, when I see the issue you are describing, I simply keep upping the number of collectors until things work as expected, and at some point possibly the number of database writers as well so they can keep up.
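
If it helps, the relevant server configuration variables on our side look roughly like the snippet below. The variable names are from our 2.x server, so double-check them under Configuration -> Server Configuration; the values are just examples, not a recommendation:

NumberOfDataCollectors = 4000
NumberOfDatabaseWriters = 4

I believe these only take effect after a server restart, and the database writer count only really matters once the IData writer queue starts growing.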

Cheers

cyril

Thanks, increasing the number of data collectors mostly solved the issue. But there is one problem left: we have a NetXMS agent proxy which handles many custom ExternalParameter actions, and it lags too. Is there any way to tune it?

Tursiops

Some ExternalParameters we call can run for quite a while (i.e. in those cases lag is expected), so we increased the ExecTimeout on those agents. Otherwise the commands ran into timeouts, the agent terminated them, and we never got the results. Not sure if that is what you are experiencing? If you have really long-running scripts/commands, you may also want to look at ExternalParametersProvider (we haven't used it ourselves yet, so I can't give any advice on actual configuration/usage).
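
For reference, the relevant part of nxagentd.conf on one of our agents looks something like this. The script paths, the CustomCheck parameter name and the numeric values are just examples, and the ExternalParametersProvider line follows the documented command:interval syntax; since we haven't used it ourselves, treat that one as a sketch:

# Give external commands up to 30 seconds before the agent kills them
# (ExecTimeout is in milliseconds, if I remember correctly)
ExecTimeout = 30000

# Long-running check, one process spawned per poll (example name/path)
ExternalParameter = CustomCheck(*):/opt/scripts/custom_check.sh $1

# Alternative: run one script periodically and cache its Parameter=Value
# output, so polls are answered from the cache instead of spawning a
# process every time (the trailing number should be the interval in seconds)
ExternalParametersProvider = /opt/scripts/all_checks.sh:60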

If your proxy is connecting to a lot of other agents/systems and polling a lot of DCIs, you may want to increase MaxSessions on that proxy. I think the default is 32. We have a couple of sites where we had to increase this.
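
In the proxy's nxagentd.conf that is just one line, something like the following (64 is an arbitrary example value, size it to how many concurrent sessions you actually expect):

# Raise the limit on concurrent sessions to this agent (default is 32, as far as I know)
MaxSessions = 64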