Large-scale NetXMS deployments?

Started by jermudgeon, March 17, 2020, 08:10:13 PM

jermudgeon

Given the lack of a comprehensive NetXMS scaling guide, I'm interested in talking with people who run NetXMS in any of the following scenarios:

1) Deployments with horizontal load balancing of polling, discovery, etc.
2) Deployments with large devices — hundreds of interfaces, thousands of MAC addresses
3) Deployments with at least 5000 nodes, scaling up to 20k-30k nodes

In particular, I'm looking for solid scaling values for:

A) netxmsd RAM usage per node
B) average nodes / poller (thread pool POLLERS) (in the context of avg DCI per node)

I'm having issues around the 100k DCI mark, but it doesn't appear to be the DCIs themselves that are the bottleneck.

TIA
Jeremy

Filipp Sudanov

Below are some thoughts and information about three real NetXMS installations that could give insight into some of your questions. We don't have information on the number of pollers, but I think that the "medium" and "larger" systems are using proxy functionality to poll a lot of their data.

----

It's really hard to get an exact estimate of the resources needed, since they depend on what exactly your NetXMS installation does. E.g. running a lot of NXSL scripts would require more CPU cores, etc.
A typical recommendation to start with would be a system with 4 CPU cores and 8GB of RAM. SSD storage for the database would be preferred.
The amount of space for the database depends on how often you collect the DCIs and for how long you store them (that is, the polling interval and retention time).
As a rough guess, every datapoint requires about 50-100 bytes of disk space on average; this depends on the data type and your database vacuuming settings.
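
As a rough illustration of the estimate above, here is a minimal Python sketch of the arithmetic. The function names and the example inputs (1000 datapoints per second, one year of retention) are made up for illustration and are not taken from any real installation. The 50-100 bytes per datapoint is the rough guess mentioned above, and the result ignores indexes, non-DCI tables and any compression.

```python
SECONDS_PER_DAY = 86400


def datapoints_per_second(dci_count, avg_polling_interval_s):
    """Average collection rate if every DCI is polled once per interval."""
    return dci_count / avg_polling_interval_s


def estimate_db_size_gb(dps, retention_days, bytes_per_datapoint):
    """Rough on-disk size of DCI history in GB (1 GB = 1e9 bytes)."""
    stored_datapoints = dps * retention_days * SECONDS_PER_DAY
    return stored_datapoints * bytes_per_datapoint / 1e9


if __name__ == "__main__":
    # Hypothetical inputs: 60000 DCIs polled every 60 s, kept for 365 days.
    rate = datapoints_per_second(60000, 60)
    low = estimate_db_size_gb(rate, 365, 50)
    high = estimate_db_size_gb(rate, 365, 100)
    print(f"{rate:.0f} datapoints/s -> roughly {low:.0f}-{high:.0f} GB")
```

Halving the retention time or doubling the polling interval halves the estimate, which is why those two settings matter more than the node count.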

Here are some data (a bit rounded) for three real NetXMS installations.
The number of nodes does not play much of a role; the number of collected DCIs is far more important, since collecting and saving them is what actually takes resources.
Only regularly collected DCIs are present in the data (no scheduled DCIs, no syslog, traps, etc), but they usually make up most of the data.
The "Medium" system has the database on a separate machine, while the other two have it on the same machine as NetXMS server.

All systems are using SSD storage and a Postgres database without Timescale (but the admins are considering moving to Timescale).

Smaller system
Objects:                          20500
Monitored nodes:                  2200
Collectible DCIs:                 34000
Collected datapoints per second:  370
Stored datapoints, millions:      9000
RAM:                              24
CPU cores:                        8
DB size:                          200


Medium system
Objects:                          56000
Monitored nodes:                  2700
Collectible DCIs:                 186000
Collected datapoints per second:  980
Stored datapoints, millions:      15000
NetXMS server RAM:                6
NetXMS server CPU cores:          4
DB server RAM:                    68
DB server CPU cores:              6
DB size:                          1500


Larger system
Objects:                          41500
Monitored nodes:                  5400
Collectible DCIs:                 372000
Collected datapoints per second:  2290
Stored datapoints, millions:      44000
RAM:                              160
CPU cores:                        20
DB size:                          1500
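
To make the three installations easier to compare, the figures above can be reduced to a few per-node ratios. This is only a sketch: the dictionary layout and key names are mine, and the "implied interval" assumes every collectible DCI produces exactly one datapoint per poll, which may not hold on a real system.

```python
# Figures copied from the three installations listed above.
systems = {
    "Smaller": {"objects": 20500, "nodes": 2200, "dcis": 34000, "dps": 370},
    "Medium": {"objects": 56000, "nodes": 2700, "dcis": 186000, "dps": 980},
    "Larger": {"objects": 41500, "nodes": 5400, "dcis": 372000, "dps": 2290},
}


def derive_ratios(s):
    """Per-node ratios and the average collection interval implied by the rate."""
    return {
        "objects_per_node": s["objects"] / s["nodes"],
        "dcis_per_node": s["dcis"] / s["nodes"],
        # Assumption: one datapoint per DCI per poll.
        "implied_interval_s": s["dcis"] / s["dps"],
    }


ratios = {name: derive_ratios(s) for name, s in systems.items()}

if __name__ == "__main__":
    for name, r in ratios.items():
        print(
            f"{name:8s} objects/node={r['objects_per_node']:.1f} "
            f"DCIs/node={r['dcis_per_node']:.1f} "
            f"implied interval={r['implied_interval_s']:.0f}s"
        )
```

Running the same calculation against your own object, node and DCI counts gives a quick way to see where a deployment differs from these three.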

jermudgeon

Thanks Filipp — that's very helpful. Do you have object counts for the three deployments? I'm having more issues with high object counts than with high DCI counts.

I am currently testing in the ~100k DCI range, and that's gone well for a number of months — working with TimeScaleDB.

However, I began adding nodes (without adding significant DCIs) with many interfaces, and as my object count rose over 1 million (1,000,000) I began seeing escalating CPU usage that didn't seem to scale linearly, even with topology and route table scanning turned off.

In addition, I'm seeing fairly poor performance when deleting objects, but the behavior isn't yet clear enough to engage support.

Filipp Sudanov

Added object counts to the reply above. All three installations are well below 1 million.

jermudgeon

Thanks again, Filipp. That's quite interesting; I'm seeing an average of 10x more objects per node than in your examples. This could be due to device types, of course. I will see if I can determine relative counts for different types of objects.