I'm running a NetXMS server at version 3.9.178 with multiple zones for our customers; one zone has a single proxy node configured. Whenever I try to update to a newer build, the server service starts, but the "Poll Manager" goes to "not responding" within a few seconds. If I'm fast, I can connect with the console right after the service starts; after that, any new connection hangs once the login credentials have been supplied.
I'm not sure, but I suspect it has to do with this item from the changelog:
*
* 3.9.229
*
[...]
- Fixed server deadlock related to multi-zone configuration
[...]
What is the best way to analyse this problem? I tried raising the debug level, but I'm not sure what to look for: there are many nodes configured, and many connections happen after server start.
The server runs on Windows Server 2019 with MySQL as the database.
I spent a few hours investigating and narrowing down the issue:
- It has nothing to do with agent communication; it also happens in an isolated testing environment without outside communication.
- The console login hang has nothing to do with the "Poll Manager"; it also happens when the Poll Manager does not go into the "not responding" state.
- If I set all nodes to "Unmanaged", everything works fine. If I enable a few nodes, for example four nodes in two different zones, the following happens:
- Sometimes, a few seconds after the server has started, it is impossible to log in. This does not happen every time. The console hangs at "Synchronizing objects".
- The command "show pollers" shows two "zone" pollers which never seem to finish.
- There is nothing helpful in the logs, even at level 9, at least to my eyes.
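To make a level 9 log easier to read on a server with many nodes, one option is to filter it down to lines mentioning a few keywords such as "zone" or "poller". This is only a minimal sketch: it assumes the server log is a plain text file with one entry per line, and the function name `filter_log_lines`, the keywords, and the log path passed on the command line are illustrative choices, not anything defined by NetXMS.

```python
import re
import sys
from typing import Iterable, Iterator


def filter_log_lines(lines: Iterable[str], keywords: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that contain at least one keyword (case-insensitive)."""
    keywords = [k for k in keywords if k]
    if not keywords:
        # Nothing to filter on: yield nothing rather than the whole log.
        return
    # Escape keywords so they are matched literally, not as regex syntax.
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    for line in lines:
        if pattern.search(line):
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Usage (hypothetical path): python filter_log.py <path-to-server-log> zone poller
    if len(sys.argv) < 3:
        sys.exit("usage: filter_log.py LOGFILE KEYWORD [KEYWORD ...]")
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for hit in filter_log_lines(log, sys.argv[2:]):
            print(hit)
```

Filtering this way does not identify a deadlock by itself, but it reduces the log to the entries around the zone pollers that appear stuck.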
I tried the latest build, 3.9.298, with the same results. On build 3.9.178 and older everything works fine; this behavior started with 3.9.229.
If I can test anything or do some debugging, please let me know.
Please take a dump file of the netxmsd.exe process. You can do this in Windows Task Manager -> Details: right-click the process and choose "Create dump file". You can share a link to it in a private message.
Thanks, I've sent you a private message.
Here's a build that should have this issue fixed: https://cloud.radensolutions.com/s/8cgm7fBWHjr92ZS
Thank you, looks good! At least in my offline testing environment, I can no longer reproduce any server lockups. I will update the production environment when the next official build is released.
Just letting you know: We upgraded the server in the meantime and everything is working as expected! Thank you for your assistance, great work!