netxmsd 2.2.16 process hang on CentOS 7.6

Started by scomoletti, September 18, 2019, 05:29:39 PM

scomoletti

I have a small NetXMS instance running in Docker, used primarily for third-party EMS integration and alarm consolidation. It has 19 nodes, each with one table DCI that executes a local script on the server and collects the output into the table. It worked fine for about a month, until we had a disk space issue. I couldn't identify which process was at fault, but restarting netxmsd and mariadb recovered 15 GB of disk space. We did not clean up any files manually.
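Space that comes back after a restart without any manual cleanup is often held by files that were deleted while a process still had them open. The thread does not confirm that this was the cause here. As a minimal sketch of how one might check next time, assuming a Linux host where `/proc/<pid>/fd` is readable (usually this needs root, and inside Docker it only sees the container's own processes), the following lists such files per process. The function name and the `proc_root` parameter are made up for illustration.

```python
import os

# The kernel appends this suffix to the link target of an unlinked file.
DELETED_SUFFIX = " (deleted)"


def find_deleted_open_files(proc_root="/proc"):
    """Return (pid, fd, path, size_bytes) for unlinked files that are still open.

    size_bytes is None when the descriptor cannot be stat'ed.
    """
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            link = os.path.join(fd_dir, fd)
            try:
                target = os.readlink(link)
            except OSError:
                continue  # descriptor closed while we were scanning
            if not target.endswith(DELETED_SUFFIX):
                continue
            try:
                size = os.stat(link).st_size
            except OSError:
                size = None
            path = target[: -len(DELETED_SUFFIX)]
            results.append((int(entry), int(fd), path, size))
    return results


if __name__ == "__main__":
    # Largest files first, so the likely culprit is at the top.
    rows = sorted(find_deleted_open_files(), key=lambda r: r[3] or 0, reverse=True)
    for pid, fd, path, size in rows:
        print(pid, fd, size, path)
```

Entries such as anonymous memory files can also carry the deleted suffix, so a large size on a path under the database or log directory is the interesting signal, not the mere presence of a row.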

After the restart, mariadb repaired itself and netxmsd started without issue. It ran for several hours, after which we started having problems again. The initial symptom was an inability to log in via the nxmc web client or the full client. The logs showed no errors. nxadm from the shell worked fine and had no issues with any of the show commands. I set debug to 9, but all I saw in the log was ItemPoller calling queueitems for the 19 nodes once per second, followed by agent.conn sending/receiving 7 messages for DCI_DATA. Show watchdog indicated that both Syncer Thread and Poller Manager were not responding. Item poller was running, and the ad hoc/recurrent schedulers were sleeping as normal. I wasn't able to find anything else useful to indicate what caused the hang. A restart of netxmsd corrected the issue.

I'm thinking there may be some database damage that caused housekeeping to hang. I have rescheduled housekeeping to run at a time when I'll be around to watch it. Does anyone have ideas on how to troubleshoot this better? If the problem continues, my next step is to export the configs and rebuild the database. I'm not worried about the history or anything, but I'd much rather know what's going on before I blow everything away.
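One low-effort way to narrow down when the threads stop responding is to record the watchdog output at regular intervals, so the log shows the last time Syncer Thread and Poller Manager were still healthy. The sketch below is only an illustration: it assumes `nxadm` is on the PATH and accepts a command through `-c` (check this against your nxadm's own usage output, and add credentials if your version requires them). The helper names `snapshot` and `watch` and the log file name are hypothetical. The output is stored raw, with no attempt to parse it.

```python
import subprocess
import time
from datetime import datetime


def snapshot(command, timeout=30):
    """Run one command and return its raw output under a timestamped header."""
    stamp = datetime.now().isoformat(timespec="seconds")
    try:
        proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
        body = proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        body = "command timed out\n"
    except OSError as exc:
        body = f"could not run command: {exc}\n"
    return f"=== {stamp} {' '.join(command)}\n{body}"


def watch(command, log_path, interval=60, iterations=None):
    """Append a snapshot to log_path every `interval` seconds.

    Runs forever when iterations is None.
    """
    count = 0
    while iterations is None or count < iterations:
        with open(log_path, "a") as log:
            log.write(snapshot(command))
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval)


if __name__ == "__main__":
    # Assumed invocation; adjust to match your installation.
    watch(["nxadm", "-c", "show watchdog"], "watchdog.log")
```

Comparing the timestamp of the first "not responding" entry with the housekeeping schedule and the server log at debug level 9 would show whether the hang lines up with housekeeping or with something else.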

Victor Kirhenshtein

I would suggest upgrading to 2.2.17 or 3.0. We have fixed various deadlocks since 2.2.16, so there is a good chance that your problem is already fixed.

Best regards,
Victor

scomoletti

Already testing with 3.0, which I had planned as our next upgrade, but knowing this I may jump to 2.2.17 before that. Thanks!