NetXMS Support Forum

English Support => General Support => Topic started by: Dani@M3T on April 14, 2015, 05:00:37 PM

Title: Best Practice to analyse event "SYS_THREAD_HANG"?
Post by: Dani@M3T on April 14, 2015, 05:00:37 PM
We have a lot of events "SYS_THREAD_HANG" (Thread "Item Poller" is not responding).
Most of the time only for a few seconds, then "SYS_THREAD_RUNNING" again.

What is the best way to analyse this and find the root cause?
(maybe a good point for the troubleshooting paragraph of the administrator guide)

In server log we get:
[14-Apr-2015 15:30:51.580] [ERROR] Thread "Item Poller" does not respond to watchdog thread

Server debug log is attached.

We use NetXMS V2.0-M3 (linux x64, PostgreSQL).

Thanks!
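
As a first step, it may help to see which threads the watchdog complains about and how often. The following is a minimal sketch, assuming the server log uses the line format quoted above; the function name `count_thread_hangs` and the second sample line are hypothetical and only there for illustration.

```python
import re
from collections import Counter

# Matches the watchdog error line format quoted in the post above.
WATCHDOG_RE = re.compile(
    r'^\[[^\]]+\] \[ERROR\] Thread "([^"]+)" does not respond to watchdog thread'
)

def count_thread_hangs(lines):
    """Return a Counter mapping thread name -> number of watchdog reports."""
    hangs = Counter()
    for line in lines:
        match = WATCHDOG_RE.match(line.strip())
        if match:
            hangs[match.group(1)] += 1
    return hangs

sample = [
    '[14-Apr-2015 15:30:51.580] [ERROR] Thread "Item Poller" does not respond to watchdog thread',
    '[14-Apr-2015 15:31:02.114] [INFO] some unrelated message',  # hypothetical line
]
print(count_thread_hangs(sample))
```

To run it against a real log, pass an open file object instead of `sample` (the log path depends on your installation).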
Title: Re: Best Practice to analyse event "SYS_THREAD_HANG"?
Post by: Victor Kirhenshtein on April 17, 2015, 08:00:28 PM
Hi,

there could be different reasons for such messages. Most often it's a bug in the server code, or a lack of resources. The item poller may hang if the data collection configuration of one of the nodes is locked for too long. Do you monitor the server's internal queues? You can also check the size of the queues when the item poller hangs (by running the command "show queues" on the server's debug console).

Best regards,
Victor
Title: Re: Best Practice to analyse event "SYS_THREAD_HANG"?
Post by: Dani@M3T on April 20, 2015, 06:55:20 PM
Thanks Victor. I didn't see anything in 'show queues'. At the moment I have no more such events. But I don't know why ;-) I will come back to this if it recurs.
Title: Re: Best Practice to analyse event "SYS_THREAD_HANG"?
Post by: Andreas@rc on August 24, 2015, 09:17:00 AM
Hi,

I'm facing the same problem. NetXMS started with this recently. Sometimes the server seems able to recover, but still ... this is not good.

I'm running version 1.2.17 on Windows 2008 R2 64-bit, with an MSSQL DB on a different server.

netxmsd: sh que
Condition poller                 : 0
Configuration poller             : 0
Topology poller                  : 0
Data collector                   : 0
Database writer                  : 0
Database writer (IData)          : 0
Database writer (raw DCI values) : 0
Event processor                  : 0
Network discovery poller         : 0
Node poller                      : 0
Routing table poller             : 0
Status poller                    : 0
Syslog processing                : 0
Syslog writer                    : 0


What's the recommended debug level to help resolve this issue?
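
If the queue listing is captured regularly, a small script can flag any queue that is backing up. This is only a sketch, assuming the "name : size" layout shown above; `parse_queues` and `backlogged` are hypothetical helper names, not part of NetXMS.

```python
def parse_queues(text):
    """Parse 'Name : size' lines into a dict; other lines are skipped."""
    queues = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        if sep and value.strip().isdigit():
            queues[name.strip()] = int(value.strip())
    return queues

def backlogged(queues, threshold=0):
    """Return only the queues whose size exceeds the threshold."""
    return {name: size for name, size in queues.items() if size > threshold}

# A few lines copied from the console output above.
sample = """netxmsd: sh que
Condition poller                 : 0
Data collector                   : 0
Database writer (raw DCI values) : 0
"""
queues = parse_queues(sample)
print(backlogged(queues))
```

An empty result, as in the listing above, suggests the hang is not caused by a queue backlog at the moment of capture.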
Title: Re: Best Practice to analyse event "SYS_THREAD_HANG"?
Post by: tomaskir on August 24, 2015, 11:13:41 AM
1.2.17 has a few bugs that can cause this.

I really recommend you try 2.0-RC1, it has a lot of fixes and optimizations that will probably fix this for you.
Title: Re: Best Practice to analyse event "SYS_THREAD_HANG"?
Post by: Dani@M3T on August 24, 2015, 01:56:55 PM
We are seeing this problem again as well, here on V2.0-RC1 on Linux x64.
Title: Re: Best Practice to analyse event "SYS_THREAD_HANG"?
Post by: Andreas@rc on August 25, 2015, 12:26:40 PM
As a workaround, I increased the polling interval (from 60 to 120 sec) for most of the DCIs.

No thread hang so far, but still a bit early to celebrate.

Size of my setup:
Total number of objects:     17788
Number of monitored nodes:   401
Number of collectable DCIs:  2959
Title: Re: Best Practice to analyse event "SYS_THREAD_HANG"?
Post by: Andreas@rc on September 09, 2015, 08:39:34 AM
Well, the error now occurs every week rather than whenever the server feels like it, so the workaround is only a partial success. I can now schedule a restart to prevent this from happening, but I would still prefer a proper solution.
Title: Re: Best Practice to analyse event "SYS_THREAD_HANG"?
Post by: Dani@M3T on September 09, 2015, 09:58:56 AM
We also have no solution yet.
Title: Re: Best Practice to analyse event "SYS_THREAD_HANG"?
Post by: Victor Kirhenshtein on September 10, 2015, 07:15:38 PM
Hi,

when you encounter a hang next time, could you please capture the process state using the following instructions?

1. Check netxmsd PID
2. Run

gdb netxmsd

(use full path if necessary). It will show (gdb) prompt.

3. Within gdb run

attach <netxmsd_pid>

it should show (gdb) prompt again.

4.  Run

thread apply all bt

and send me the output (it could be quite long, so make sure your terminal has a large enough scrollback buffer).

5. Run

detach

to detach from netxmsd and let it run.

Best regards,
Victor
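
The interactive steps above can also be run non-interactively, which avoids the terminal buffer problem because the output goes straight to a file. The following is a sketch, assuming gdb is on the PATH and that you have permission to attach to the process; the function names, the PID 12345 and the output path are placeholders.

```python
import subprocess

def build_gdb_command(pid, gdb="gdb"):
    """Build a batch-mode gdb command: attach, dump all backtraces, detach."""
    return [
        gdb, "-p", str(pid), "-batch",
        "-ex", "thread apply all bt",
        "-ex", "detach",
    ]

def capture_backtraces(pid, out_path):
    """Run gdb against the given PID and write its output to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(
            build_gdb_command(pid),
            stdout=out,
            stderr=subprocess.STDOUT,
            check=True,
        )

# 12345 is a placeholder; use the real netxmsd PID from step 1.
print(" ".join(build_gdb_command(12345)))
```

Calling `capture_backtraces(pid, "netxmsd-threads.txt")` while the hang is in progress would then produce a file that can be attached to a post.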
Title: Re: Best Practice to analyse event "SYS_THREAD_HANG"?
Post by: Dani@M3T on September 21, 2015, 12:59:03 PM
Hi Victor

Here is the output.

thanks
Dani
Title: Re: Best Practice to analyse event "SYS_THREAD_HANG"?
Post by: Victor Kirhenshtein on September 21, 2015, 11:37:31 PM
Hi,

from the trace I see that the server is blocked on reading one of the Service.Check(...) parameters (and I see there are a lot of them), which probably takes some time to complete. Because a DCI is locked while it is being polled, the item poller thread blocks when it attempts to access this DCI, and so the watchdog warning is triggered.
Actually, a new feature called "agent side DCI cache", introduced in 2.0-RC1, can help: you can turn on the agent cache for those DCIs (or all DCIs). This will switch them from polling mode to push mode, thus eliminating long locks on the server side.

Best regards,
Victor
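
The locking behaviour described above can be illustrated with a toy model. This is not NetXMS code, only a sketch in which a plain lock stands in for the DCI lock, one thread stands in for the slow Service.Check poll, and the main thread plays the item poller; all names are hypothetical.

```python
import threading

dci_lock = threading.Lock()           # stands in for the per-DCI lock
poll_started = threading.Event()
poll_may_finish = threading.Event()

def slow_poll():
    """Hold the DCI lock for as long as the simulated slow check runs."""
    with dci_lock:
        poll_started.set()
        poll_may_finish.wait()        # stands in for the slow network check

def item_poller_can_access(timeout):
    """Return True if the lock could be taken within the timeout."""
    acquired = dci_lock.acquire(timeout=timeout)
    if acquired:
        dci_lock.release()
    return acquired

worker = threading.Thread(target=slow_poll)
worker.start()
poll_started.wait()

# While the slow poll holds the lock, the item poller cannot proceed;
# this is the situation in which a watchdog would report a hang.
blocked = not item_poller_can_access(0.1)

poll_may_finish.set()                 # the slow check completes
worker.join()
recovered = item_poller_can_access(0.1)
print(blocked, recovered)
```

Once the slow poll finishes, the lock is free again, which matches the pattern of a short SYS_THREAD_HANG followed by SYS_THREAD_RUNNING.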
Title: Re: Best Practice to analyse event "SYS_THREAD_HANG"?
Post by: Dani@M3T on September 22, 2015, 03:34:06 PM
Hi Victor

Thank you for your analysis.
I changed all 'Service-Check' DCIs to Agent Cache Mode = ON (these are all DCIs with source node = NetXMS server). This seems to be the solution (OK, it is a little early for an all-clear)!

Is this feature fully implemented now in V2.0-RC1?
Is there any disadvantage to cache mode = on, or is it OK to change all agent DCIs to cache mode = on?
What are the possible values for the server configuration variable "DefaultAgentCacheMode"? I couldn't find it in the wiki.
Is there any short documentation of the agent cache mode feature?


thanks
Dani
Title: Re: Best Practice to analyse event "SYS_THREAD_HANG"?
Post by: Dani@M3T on October 20, 2015, 12:35:59 PM
Is there any documentation of the cache feature?
There are 3 levels for this feature:

Are there limitations with V2.0-RC2?

thanks
Dani
Title: Re: Best Practice to analyse event "SYS_THREAD_HANG"?
Post by: Victor Kirhenshtein on October 21, 2015, 04:16:47 PM
Hi,

it should be fully functional in 2.0-RC2. We will update documentation soon. In short:

1. DefaultAgentCacheMode can be either 1 (on) or 2 (off);
2. agent level "default" means using DefaultAgentCacheMode settings;
3. DCI level "default" means using agent level settings.

Best regards,
Victor
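
The three rules above amount to a simple fallback chain: DCI level, then agent level, then the server default. A minimal sketch of that resolution follows; the values 1 (on) and 2 (off) come from the post above, while using 0 for "default" and the name `effective_cache_mode` are assumptions made for illustration.

```python
from enum import IntEnum

class CacheMode(IntEnum):
    DEFAULT = 0   # assumption: numeric value for "default" is not stated above
    ON = 1        # from the post: DefaultAgentCacheMode 1 = on
    OFF = 2       # from the post: DefaultAgentCacheMode 2 = off

def effective_cache_mode(dci_mode, agent_mode, server_default):
    """Resolve the cache mode for one DCI using the fallback chain."""
    if server_default == CacheMode.DEFAULT:
        raise ValueError("DefaultAgentCacheMode must be ON or OFF")
    if dci_mode != CacheMode.DEFAULT:
        return dci_mode               # explicit DCI setting wins
    if agent_mode != CacheMode.DEFAULT:
        return agent_mode             # otherwise the agent level setting
    return server_default             # otherwise DefaultAgentCacheMode

mode = effective_cache_mode(CacheMode.DEFAULT, CacheMode.DEFAULT, CacheMode.OFF)
print(mode.name)
```

With everything left at "default", the result is whatever DefaultAgentCacheMode is set to, so changing that one server variable affects every DCI that has no explicit setting.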
Title: Re: Best Practice to analyse event "SYS_THREAD_HANG"?
Post by: Borgso on May 23, 2016, 06:27:24 PM
Sorry for necroing this topic. We have the same issue and have just upgraded to 2.0.3, so we are now able to use this AgentCache function.

On the source node, there is a script polling custom services for values.
Without the agent cache this gives 1 poll each second; with the cache on, it polls much more often within one second.
But it does not save any data to the DCI history. Does "Origin" need to be "Push" rather than "NetXMS agent" when the AgentCache function is on?

Is this function the reason for the SQLite requirement on the agent, and is any agent-side configuration needed (e.g. DB location)?
Title: Re: Best Practice to analyse event "SYS_THREAD_HANG"?
Post by: Victor Kirhenshtein on May 27, 2016, 09:38:36 AM
No, origin needs to be "NetXMS agent". Yes, this is the main reason why the agent needs SQLite. Usually no additional configuration is required; the agent will create a local database in a reasonable location ($install_prefix/var/lib/netxms if built from sources, /var/lib/netxms if installed from the deb package).
There is a bug in 2.0.3 (fixed in 2.0.4) that makes cached mode actually work only for servers listed as MasterServers in the agent configuration.

Best regards,
Victor