Hi,
We are investigating an issue where objects disappeared after a NetXMS server crash. We are also seeing unusual behaviour with the database connection pool and parts of the UI, and would appreciate guidance on possible causes.
Environment
NetXMS: v5.2.3
Database: MariaDB 10.11
NetXMS server and DB run on separate VMs with separate disks
NetXMS server specs:
16 CPU cores
64 GB RAM
System size:
Objects: 23022
Nodes: 1129
Interfaces: 21081
Access Points: 0
Sensors: 0
Collectible DCIs: 14890
MaxTransactionSize = 1000
Incident
The NetXMS server crashed after the root partition on the NetXMS server VM became full. The database is hosted on a different VM with its own storage and did not run out of space.
When bringing NetXMS back online:
Some objects that had been created several days earlier had disappeared.
Other, newer objects and data were still present (including alarms for DCIs that had disappeared), suggesting that only some metadata was lost or never committed.
Shutdown behaviour
When attempting to restart the NetXMS service:
The server did not shut down gracefully.
It had to be killed manually.
Database connection pool observation
During the investigation we ran show dbcp and noticed that for extended periods (10–30 seconds) it reported:
0 database connections in use
This occurs even though the system is actively monitoring ~23k objects and ~14.8k collectible DCIs.
show dbstat
SQL query counters:
Total .......... 1100783
SELECT ......... 306641
Non-SELECT ..... 794106
Long running ... 0
Failed ......... 0
Background writer requests:
DCI data ....... 246179
DCI raw data ... 246179
Others ......... 234
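To put these counters in proportion, here is a minimal Python sketch that parses the show dbstat output above and computes the share of non-SELECT statements. It only reads the pasted text; the helper names (parse_counters, non_select_share) and the parsing regex are our own illustration and are not part of NetXMS.

```python
import re

# Output of "show dbstat", pasted verbatim from this post.
DBSTAT_OUTPUT = """\
SQL query counters:
Total .......... 1100783
SELECT ......... 306641
Non-SELECT ..... 794106
Long running ... 0
Failed ......... 0
Background writer requests:
DCI data ....... 246179
DCI raw data ... 246179
Others ......... 234
"""

# Hypothetical pattern: matches lines shaped like "Label ..... 123".
COUNTER_LINE = re.compile(r"^(.*?)\s*\.{2,}\s*(\d+)\s*$")


def parse_counters(text):
    """Return {label: value} for every counter line; section titles are skipped."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def non_select_share(counters):
    """Fraction of all SQL queries that were not SELECT statements."""
    return counters["Non-SELECT"] / counters["Total"]


if __name__ == "__main__":
    counters = parse_counters(DBSTAT_OUTPUT)
    print(f"Non-SELECT share: {non_select_share(counters):.1%}")
    print("Failed:", counters["Failed"], "/ Long running:", counters["Long running"])
```

By these numbers the server is clearly issuing writes and none are counted as failed or long running, which is part of why the idle connection pool surprised us.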
Additional anomalies
We are also seeing unusual behaviour in both the web client and the desktop client:
Some log views take an extremely long time to load pages, or do not load at all.
Pagination for logs sometimes does not work or stalls.
These behaviours appear in both the web UI and the native NetXMS client.
This makes us suspect there may be an issue with database access, internal queues, or blocked threads.
Questions
Could a full root partition on the NetXMS server prevent metadata commits to the database even if the DB is on a separate system?
Is it possible that object metadata was only held in server memory and never committed, resulting in the objects disappearing after the crash?
What could cause the DB connection pool to show 0 connections in use for long periods on a system of this size?
Could this indicate blocked writer threads, transaction batching issues, or internal locks?
Could the log loading and pagination issues in both clients be related to the same underlying database or thread issue?
Any guidance on what diagnostics we should run (thread state, DB writer queues, etc.) would be greatly appreciated.
Thanks
Darren
Additional info that I missed from the main post
show thread
MAIN
Threads.............. 8 (8/512)
Load average......... 0.00 0.01 0.00
Current load......... 0%
Usage................ 1%
Active requests...... 0
Scheduled requests... 1
Total requests....... 104912
Thread starts........ 0
Thread stops......... 0
Wait time EMA........ 0 ms
Wait time SMA........ 0 ms
Wait time SD......... 0 ms
Queue size EMA....... 0
Queue size SMA....... 0
Queue size SD........ 0
AGENT
Threads.............. 32 (32/256)
Load average......... 0.00 0.00 0.00
Current load......... 0%
Usage................ 12%
Active requests...... 0
Scheduled requests... 0
Total requests....... 0
Thread starts........ 0
Thread stops......... 0
Wait time EMA........ 0 ms
Wait time SMA........ 0 ms
Wait time SD......... 0 ms
Queue size EMA....... 0
Queue size SMA....... 0
Queue size SD........ 0
POLLERS
Threads.............. 80 (10/500)
Load average......... 15.51 28.01 35.31
Current load......... 2%
Usage................ 16%
Active requests...... 2
Scheduled requests... 0
Total requests....... 367629
Thread starts........ 427
Thread stops......... 357
Wait time EMA........ 274 ms
Wait time SMA........ 152 ms
Wait time SD......... 272 ms
Queue size EMA....... 7
Queue size SMA....... 0
Queue size SD........ 0
FILE-TRANSFER
Threads.............. 2 (2/16)
Load average......... 0.00 0.00 0.00
Current load......... 0%
Usage................ 12%
Active requests...... 0
Scheduled requests... 0
Total requests....... 0
Thread starts........ 0
Thread stops......... 0
Wait time EMA........ 0 ms
Wait time SMA........ 0 ms
Wait time SD......... 0 ms
Queue size EMA....... 0
Queue size SMA....... 0
Queue size SD........ 0
NPE
Threads.............. 1 (1/1024)
Load average......... 0.00 0.00 0.00
Current load......... 0%
Usage................ 0%
Active requests...... 0
Scheduled requests... 0
Total requests....... 0
Thread starts........ 0
Thread stops......... 0
Wait time EMA........ 0 ms
Wait time SMA........ 0 ms
Wait time SD......... 0 ms
Queue size EMA....... 0
Queue size SMA....... 0
Queue size SD........ 0
DATACOLL
Threads.............. 17 (10/300)
Load average......... 0.00 0.07 0.42
Current load......... 0%
Usage................ 5%
Active requests...... 0
Scheduled requests... 0
Total requests....... 275896
Thread starts........ 17
Thread stops......... 10
Wait time EMA........ 80 ms
Wait time SMA........ 0 ms
Wait time SD......... 0 ms
Queue size EMA....... 0
Queue size SMA....... 0
Queue size SD........ 0
CLIENT
Threads.............. 16 (16/2048)
Load average......... 0.00 0.00 0.00
Current load......... 12%
Usage................ 0%
Active requests...... 2
Scheduled requests... 0
Total requests....... 84567
Thread starts........ 0
Thread stops......... 0
Wait time EMA........ 0 ms
Wait time SMA........ 0 ms
Wait time SD......... 0 ms
Queue size EMA....... 0
Queue size SMA....... 0
Queue size SD........ 0
DISCOVERY
Threads.............. 8 (8/64)
Load average......... 0.00 0.00 0.00
Current load......... 0%
Usage................ 12%
Active requests...... 0
Scheduled requests... 0
Total requests....... 0
Thread starts........ 0
Thread stops......... 0
Wait time EMA........ 0 ms
Wait time SMA........ 0 ms
Wait time SD......... 0 ms
Queue size EMA....... 0
Queue size SMA....... 0
Queue size SD........ 0
SCHEDULER
Threads.............. 1 (1/64)
Load average......... 0.00 0.00 0.00
Current load......... 0%
Usage................ 1%
Active requests...... 0
Scheduled requests... 1
Total requests....... 40
Thread starts........ 0
Thread stops......... 0
Wait time EMA........ 0 ms
Wait time SMA........ 0 ms
Wait time SD......... 0 ms
Queue size EMA....... 0
Queue size SMA....... 0
Queue size SD........ 0
MOBILE
Threads.............. 4 (4/256)
Load average......... 0.00 0.00 0.00
Current load......... 0%
Usage................ 1%
Active requests...... 0
Scheduled requests... 0
Total requests....... 0
Thread starts........ 0
Thread stops......... 0
Wait time EMA........ 0 ms
Wait time SMA........ 0 ms
Wait time SD......... 0 ms
Queue size EMA....... 0
Queue size SMA....... 0
Queue size SD........ 0
PACKAGE-MANAGER
Threads.............. 2 (2/25)
Load average......... 0.00 0.00 0.00
Current load......... 0%
Usage................ 8%
Active requests...... 0
Scheduled requests... 0
Total requests....... 0
Thread starts........ 0
Thread stops......... 0
Wait time EMA........ 0 ms
Wait time SMA........ 0 ms
Wait time SD......... 0 ms
Queue size EMA....... 0
Queue size SMA....... 0
Queue size SD........ 0
show queue
Data collector : 105
DCI cache loader : 0
Template updater : 0
Database writer : 0
Database writer (IData) : 2
Database writer (raw DCI values) : 6119
Event processor : 0
Event log writer : 0
Poller : 0
Node discovery poller : 0
SNMP trap processor : 0
SNMP trap writer : 0
Syslog processor : 0
Syslog writer : 0
Scheduler : 0
Windows event processor : 0
Windows event writer : 0
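To make the queue listing easier to scan, here is a minimal Python sketch that parses the show queue output above and lists any queue larger than a threshold. The threshold of 1000 is an arbitrary value chosen for illustration, and the helper names are our own; neither comes from NetXMS.

```python
import re

# Output of "show queue", pasted verbatim from this post.
QUEUE_OUTPUT = """\
Data collector : 105
DCI cache loader : 0
Template updater : 0
Database writer : 0
Database writer (IData) : 2
Database writer (raw DCI values) : 6119
Event processor : 0
Event log writer : 0
Poller : 0
Node discovery poller : 0
SNMP trap processor : 0
SNMP trap writer : 0
Syslog processor : 0
Syslog writer : 0
Scheduler : 0
Windows event processor : 0
Windows event writer : 0
"""

# Hypothetical pattern: matches lines shaped like "Queue name : 123".
QUEUE_LINE = re.compile(r"^(.*?)\s*:\s*(\d+)\s*$")


def parse_queues(text):
    """Return {queue name: current size} for every line of the listing."""
    queues = {}
    for line in text.splitlines():
        match = QUEUE_LINE.match(line)
        if match:
            queues[match.group(1)] = int(match.group(2))
    return queues


def backlogged(queues, threshold):
    """Return only the queues whose size exceeds the (arbitrary) threshold."""
    return {name: size for name, size in queues.items() if size > threshold}


if __name__ == "__main__":
    for name, size in backlogged(parse_queues(QUEUE_OUTPUT), threshold=1000).items():
        print(f"{name}: {size}")
```

With the output above, only the raw DCI values writer queue is flagged. A single snapshot cannot tell whether that queue is draining or growing, so we can collect several samples over time if that would help.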