Hi,
Recently updated to 1.2.7, but having an issue with Syncer Thread. A few minutes after server start, it stops responding and won't recover again.
Item Poller 20 Running
Syncer Thread 130 Not responding
Poll Manager 60 Running
The nxadm command show pollers indicates that some pollers seem to be stuck, in particular topology pollers. Show queues reports a large number for the Topology poller queue; the other queues are fine.
Is it possible to completely disable topology polling at the server level?
Data collection and alarms seem to work. What is the impact of a non-responding Syncer Thread? Any ideas on how to fix this?
Thanks in advance!
Hi!
Due to changes in 1.2.7, you can occasionally get a "syncer thread not responding" message. That's OK, and you can safely ignore it, unless it stays stuck in the not responding state forever. Can you double check that the syncer thread really never comes back from the not responding state?
Best regards,
Victor
Unfortunately it seems stuck. It has been 9 hours since the last restart of netxmsd, and it has been in the Not responding state ever since. Besides showing up in the alarm browser, this is also logged to syslog. But there have been no more occurrences since the initial one right after server start, so I'm quite sure that it has been stuck all day.
You mention that some pollers seem to be stuck. Please send me a list of these pollers and information about the nodes they are trying to process.
Best regards,
Victor
Further investigation... started a backup on a cold standby server. Now the Syncer thread seems fine and no pollers are stuck. But when the server is doing configuration polls, there are a lot of Node down alarms and "Unable to create raw socket for ICMP protocol" messages in syslog. The Node down alarms do recover, though, until the next configuration poll cycle.
OK, the usual file descriptor limit. ulimit -n shows 1024; I increased it and restarted the server. No node downs, no errors in syslog, but... the stuck pollers and the Syncer thread problem are back...
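For anyone who wants to check the same limit from a script rather than with ulimit -n, here is a minimal sketch in Python. It only inspects and raises the limit for the process that runs it; how netxmsd itself is started and which limit it inherits is not covered here, and the function names are made up for illustration.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open file descriptor limits of this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_nofile(target):
    """Raise the soft limit toward target, capped at the hard limit.

    Returns the soft limit in effect afterwards. The limit is never lowered.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Only call setrlimit when there is actually something to raise.
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("current limits (soft, hard):", nofile_limits())
    # 4096 is an arbitrary example value, not a recommendation.
    print("soft limit after raise:", raise_soft_nofile(4096))
```

A process started from this script (or from a shell after ulimit -n) inherits the raised soft limit, which is why the server has to be restarted for a new value to take effect.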
Reverted to 1024. Pollers/Syncer OK but lots of node downs. Now that the Syncer thread is alive, I will try to disable routing/topology polling on a bunch of nodes to see if I can get the best of both worlds.
Hi,
very interesting observation. Looks like something got broken when a large number of sockets can be open in parallel. I'll look in this direction.
Best regards,
Victor
Btw, how many nodes are you monitoring? And what kind of nodes (more switches/routers or servers)? Could it be that you have network devices with very large MAC tables or routing tables?
Best regards,
Victor
Hi, sorry for the late reply.
Reverted to 1.2.6. No Syncer Thread problems, however some pollers still might get stuck. I have a feeling that this has been the case in previous versions as well, but it is not a big problem.
For your question about number of nodes, output from nxadmc:
netxmsd: sh stat
Total number of objects: 5988
Number of monitored nodes: 1530
Number of collectable DCIs: 45318
Almost 900 nodes with agents, but no large core routers/switches.
Best regards,
Kjell
Hi, I've also been having this issue since upgrading.
W2003K server monitoring about 40 Windows servers and about 20 other devices.
I often get timeout errors when applying templates and polling configurations. This has only been happening since the upgrade.
Pictures speak a thousand words....
Hi!
This issue should be fixed in 1.2.8 release.
Best regards,
Victor
Thanks guys, keep up the good work...
Great to hear that I was not alone in this issue.
Any E.T.A. for 1.2.8? Any solution until then?
E.T.A. is today :) Most components are already packed, I'll put everything onto the web site during the day.
Best regards,
Victor
;D Wooot nice news!