Netxmsd hangs during startup

Started by bdefloo, February 21, 2024, 07:14:51 PM


bdefloo

Hi,

We have had NetXMS 4.4.2 running on a Windows Server 2019 machine for a few months. However, after the server was restarted yesterday, we are no longer able to connect. The netxmsd.exe process is running and using ~50% CPU, but judging by netxmsd.log it appears to freeze mid-startup (see attached).

I ran nxdbmgr check and check-data-tables with clean results, tried reinstalling NetXMS, and installed it on a different machine pointed at the same database, all with the same result.
I suspect something got corrupted in the database and is causing the server to lock up.

We're using MSSQL Server 2016 and the ODBC driver. The database is accessible through SQL Server Management Studio and appears to be otherwise healthy (as also attested by nxdbmgr).

Are there any further troubleshooting steps I can take?
I can provide debuglevel 9 logging or other detailed information via private message if needed. I haven't found any "smoking gun" error messages so far, unfortunately.

Should I try an update to a later version?

Thanks in advance!

bdefloo

After looking at https://www.netxms.org/forum/general-support/the-table-alarm_event-is-full-in-netxms-4-5-3-1-version/ I checked the alarm_event table and found it contained 52 million rows and took up most of the database size.

I truncated this table, and the NetXMS Core service started right back up, so it appears this was the cause of the issue!
Unfortunately I didn't have an error message to point in that direction.

I have Alarms.HistoryRetentionTime and Events.LogRetentionTime set to 15 days. Is there another server parameter relevant to the alarm_event table?
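
For anyone wanting to repeat the row-count check described in this post, here is a minimal sketch. It assumes only a Python DB-API connection; the helper name table_row_counts is hypothetical, and the demo uses an in-memory SQLite database with stand-in tables rather than the real NetXMS schema. Against MSSQL, a pyodbc connection could presumably be passed in instead, and counting a table with tens of millions of rows may take a while.

```python
import sqlite3


def table_row_counts(conn, tables):
    """Return {table_name: row_count}, ordered from largest to smallest."""
    counts = {}
    cur = conn.cursor()
    for name in tables:
        # Table names cannot be bound as query parameters, so accept
        # only plain identifiers (letters, digits, underscores).
        if not name.replace("_", "").isalnum():
            raise ValueError(f"unexpected table name: {name!r}")
        cur.execute(f"SELECT COUNT(*) FROM {name}")
        counts[name] = cur.fetchone()[0]
    return dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))


# Demo with stand-in tables; the column layout here is an assumption
# made for illustration, not the actual NetXMS schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alarm_event (alarm_id INTEGER, event_id INTEGER)")
conn.execute("CREATE TABLE event_log (event_id INTEGER)")
conn.executemany("INSERT INTO alarm_event VALUES (?, ?)",
                 [(1, i) for i in range(500)])
conn.executemany("INSERT INTO event_log VALUES (?)",
                 [(i,) for i in range(20)])

print(table_row_counts(conn, ["event_log", "alarm_event"]))
# prints {'alarm_event': 500, 'event_log': 20}
```

Sorting by size puts the runaway table first, which is the signal that was missing from the server log in this case.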

Victor Kirhenshtein

This is quite strange. Do you have alarms with huge repeat counts?

Best regards,
Victor

bdefloo

I found the cause: someone had checked the "inverse rule" checkbox on the event condition of the "show alarm when node is down" EPP rule.
This caused every event except the actual node down one to trigger and be correlated to a NODE_DOWN_%i alarm per node.

I've corrected this and will monitor the table growth.
We probably will have to do some alarm cleanup as well.
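
To illustrate why the inverted condition filled the table so quickly, here is a small simulation. It is not NetXMS code: the Rule class, the correlate function, and the sample event list are all hypothetical, and the rule matching is reduced to a simple set lookup.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Rule:
    """Toy stand-in for an event processing rule (hypothetical)."""
    event_names: frozenset
    inverted: bool = False

    def matches(self, event_name):
        hit = event_name in self.event_names
        # With the "inverse rule" flag set, the condition is negated.
        return not hit if self.inverted else hit


def correlate(rule, events):
    """Count matching events per alarm key (NODE_DOWN_<node id>)."""
    per_alarm = Counter()
    for event_name, node_id in events:
        if rule.matches(event_name):
            per_alarm[f"NODE_DOWN_{node_id}"] += 1
    return dict(per_alarm)


# Illustrative sample of (event name, node id) pairs.
events = [
    ("SYS_NODE_DOWN", 7),
    ("SYS_NODE_UP", 7),
    ("SYS_THRESHOLD_REACHED", 7),
    ("SYS_IF_DOWN", 9),
]

normal = Rule(frozenset({"SYS_NODE_DOWN"}))
broken = Rule(frozenset({"SYS_NODE_DOWN"}), inverted=True)

print(correlate(normal, events))  # prints {'NODE_DOWN_7': 1}
print(correlate(broken, events))  # prints {'NODE_DOWN_7': 2, 'NODE_DOWN_9': 1}
```

With the flag set, the one event the rule was meant for is ignored, while every other event from every node is attached to that node's alarm, so the per-alarm event history grows with the total event volume.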