"Poll Manager" hangs since 3.9.229 on server start

Started by Benjamin Dill, September 30, 2021, 10:01:20 AM


Benjamin Dill

I'm running a NetXMS server at version 3.9.178 with multiple zones for our customers; one zone has a single proxy node configured. Whenever I try to update to a newer build, the server service starts up, but the "Poll Manager" goes to "not responding" within a few seconds. If I'm fast, I can connect with the console right after the service starts; after that, any new connection hangs once the login credentials have been supplied.

I'm not sure but I suspect it has to do with this item from the changelog:
*
* 3.9.229
*
[...]
- Fixed server deadlock related to multi-zone configuration
[...]


What is the best way to analyse this problem? I tried raising the debug level, but I'm not sure what I should look for: there are many nodes configured, and many connections are made after server start.
The server runs on Windows Server 2019 with MySQL as database.

Benjamin Dill

#1
I invested a few hours to investigate and narrow down the issue:

  • It has nothing to do with agent communication; it also happens in an isolated testing environment without outside communication.
  • The console login hang has nothing to do with the "Poll Manager"; it also happens when the Poll Manager does not go into the "not responding" state.
  • If I set all nodes to "Unmanaged", everything works fine. If I enable a few nodes, for example four nodes in two different zones, the following happens:

    • Sometimes, a few seconds after the server has been started, it is impossible to log in; this does not happen every time. The login hangs at "Synchronizing objects".
    • The command "show pollers" shows two "zone" pollers which never seem to finish.
  • There is nothing helpful in the logs, even at level 9, at least to my eyes.
I tried the latest build, 3.9.298, with the same results. On build 3.9.178 and older everything works fine; this behavior started with 3.9.229.
If I can test anything or do some debugging, please let me know.

Filipp Sudanov

Please take a dump file of the netxmsd.exe process. The option is in Windows Task Manager -> Details. You can share a link to it in a private message.

Benjamin Dill


Filipp Sudanov


Benjamin Dill

Thank you, looks good! At least in my offline testing environment, I can no longer reproduce any server locks. I will update the production environment when the next official build is released.

Benjamin Dill

Just letting you know: we upgraded the server in the meantime, and everything is working as expected! Thank you for your assistance, great work!