Hi,
We are investigating an issue where objects disappeared after a NetXMS crash and we are seeing unusual behaviour with the database connection pool and parts of the UI. We would appreciate guidance on possible causes.
Environment
NetXMS: v5.2.3
Database: MariaDB 10.11
NetXMS server and DB run on separate VMs with separate disks
NetXMS server specs:
16 CPU cores
64 GB RAM
System size:
Objects: 23022
Nodes: 1129
Interfaces: 21081
Access Points: 0
Sensors: 0
Collectible DCIs: 14890
MaxTransactionSize = 1000
Incident
The NetXMS server crashed after the root partition on the NetXMS server VM became full. The database is hosted on a different VM with its own storage and did not run out of space.
When bringing NetXMS back online:
Some objects that had been created several days earlier had disappeared.
Other, newer objects and data were still present (including alarms for DCIs that had disappeared), suggesting that only some metadata was lost or never committed.
Shutdown behaviour
When attempting to restart the NetXMS service:
The server did not shut down gracefully.
It had to be killed manually.
Database connection pool observation
During investigation we ran show dbcp and noticed that for extended periods (10–30 seconds) it reports:
0 database connections in use
This occurs even though the system is actively monitoring ~23k objects and ~14.8k collectible DCIs.
show dbstat
SQL query counters:
Total .......... 1100783
SELECT ......... 306641
Non-SELECT ..... 794106
Long running ... 0
Failed ......... 0
Background writer requests:
DCI data ....... 246179
DCI raw data ... 246179
Others ......... 234
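In case it helps with comparing snapshots over time, here is a minimal Python sketch that parses the counters pasted above. The function name `parse_dbstat` is our own, and the line format is assumed to match the output exactly as shown here; it is not a NetXMS tool or API.

```python
import re

def parse_dbstat(text):
    """Parse 'Name ..... value' lines into {section: {name: int}}.

    Assumes section headers end with ':' and counters use a run of
    dots between the name and the integer value, as in the paste above.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            # Start of a new section such as "SQL query counters:"
            current = line[:-1]
            sections[current] = {}
            continue
        m = re.match(r"^(.+?)\s*\.{2,}\s*(\d+)$", line)
        if m and current is not None:
            sections[current][m.group(1).strip()] = int(m.group(2))
    return sections

# Sample copied from the show dbstat output above.
SAMPLE = """\
SQL query counters:
Total .......... 1100783
SELECT ......... 306641
Non-SELECT ..... 794106
Long running ... 0
Failed ......... 0
Background writer requests:
DCI data ....... 246179
DCI raw data ... 246179
Others ......... 234
"""

stats = parse_dbstat(SAMPLE)
queries = stats["SQL query counters"]
non_select_share = queries["Non-SELECT"] / queries["Total"]
print(f"Non-SELECT share of all queries: {non_select_share:.1%}")
```

Running this against two snapshots taken a known interval apart would give the write rate over that interval, which could then be compared with the periods where show dbcp reports 0 connections in use.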
Additional anomalies
We are also seeing unusual behaviour in both the web client and the desktop client:
Some log views take an extremely long time to load pages or do not load at all.
Pagination for logs sometimes does not work or stalls.
These behaviours appear in both the web UI and the native NetXMS client.
This makes us suspect there may be an issue with database access, internal queues, or blocked threads.
Questions
Could a full root partition on the NetXMS server prevent metadata commits to the database even if the DB is on a separate system?
Is it possible that object metadata was only held in server memory and never committed, resulting in the objects disappearing after the crash?
What could cause the DB connection pool to show 0 connections in use for long periods on a system of this size?
Could this indicate blocked writer threads, transaction batching issues, or internal locks?
Could the log loading and pagination issues in both clients be related to the same underlying database or thread issue?
Any guidance on what diagnostics we should run (thread state, DB writer queues, etc.) would be greatly appreciated.
Thanks
Darren