Best Practice to analyse event "SYS_THREAD_HANG"?

Started by Dani@M3T, April 14, 2015, 05:00:37 PM


Dani@M3T

We get a lot of "SYS_THREAD_HANG" events (Thread "Item Poller" is not responding).
Most of the time it lasts only a few seconds, then "SYS_THREAD_RUNNING" follows again.

What is the best way to analyse this and find the root cause?
(this might be a good topic for the troubleshooting section of the administrator guide)

In the server log we get:
[14-Apr-2015 15:30:51.580] [ERROR] Thread "Item Poller" does not respond to watchdog thread

Server debug log is attached.

We use NetXMS V2.0-M3 (Linux x64, PostgreSQL).

Thanks!

Victor Kirhenshtein

Hi,

there could be different reasons for such messages. Most often it is a bug in the server code or a lack of resources. The item poller may hang if the data collection configuration of one of the nodes stays locked for too long. Do you monitor the server's internal queues? You can also check the size of the queues when the item poller hangs (by running the command "show queues" on the server's debug console).
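
For reference, one way to look at the queues from the server host's command line is sketched below. This is only a sketch and assumes the nxadm administration tool shipped with the server is installed and netxmsd is running locally:

# open the interactive server debug console
nxadm -i
# then, at the netxmsd prompt, inspect the internal queues
show queues

If one of the counters keeps growing while the item poller is hung, that queue is a good candidate for the bottleneck.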

Best regards,
Victor

Dani@M3T

Thanks Victor. I didn't see anything in 'show queues'. At the moment I have no more such events, but I don't know why ;-) I will come back to this if it re-occurs.

Andreas@rc

Hi,

I'm facing the same problem; NetXMS started doing this recently. Sometimes the server seems able to recover, but still... this is not good.

I'm running version 1.2.17 on Windows 2008 R2 64-bit, with an MSSQL database on a different server.

netxmsd: sh que
Condition poller                 : 0
Configuration poller             : 0
Topology poller                  : 0
Data collector                   : 0
Database writer                  : 0
Database writer (IData)          : 0
Database writer (raw DCI values) : 0
Event processor                  : 0
Network discovery poller         : 0
Node poller                      : 0
Routing table poller             : 0
Status poller                    : 0
Syslog processing                : 0
Syslog writer                    : 0


What's the recommended debug level to help resolve this issue?

tomaskir

1.2.17 has a few bugs that can cause this.

I really recommend you try 2.0-RC1; it has a lot of fixes and optimizations that will probably fix this for you.

Dani@M3T

We are seeing this problem again as well, here on V2.0-RC1 on Linux x64.

Andreas@rc

As a workaround, I increased the polling interval (from 60 to 120 seconds) for most of the DCIs.

No thread hang so far, but still a bit early to celebrate.

Size of my setup:
Total number of objects:     17788
Number of monitored nodes:   401
Number of collectable DCIs:  2959

Andreas@rc

Quote from: Andreas@rc on August 25, 2015, 12:26:40 PM
As a workaround, I increased the polling interval (from 60 to 120 seconds) for most of the DCIs.

No thread hang so far, but still a bit early to celebrate.

Size of my setup:
Total number of objects:     17788
Number of monitored nodes:   401
Number of collectable DCIs:  2959


Well, only a partial success: the error now occurs every week instead of whenever the server feels like it. At least I can schedule a restart to prevent it from happening, but I would still prefer a proper solution.

Dani@M3T

We also have no solution yet.

Victor Kirhenshtein

Hi,

when you encounter the hang next time, could you please capture the process state using the following instructions?

1. Check netxmsd PID
2. Run

gdb netxmsd

(use the full path if necessary). It will show the (gdb) prompt.

3. Within gdb run

attach <netxmsd_pid>

It should show the (gdb) prompt again.

4.  Run

thread apply all bt

and send me the output (it could be quite long, so make sure you have an adequate scrollback buffer in your terminal).

5. Run

detach

to detach from netxmsd and let it run.
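
If you prefer a single non-interactive capture, the same backtrace can be taken in one go. This is just a sketch assuming gdb is installed and you are allowed to attach to the process (run it as root or as the user owning netxmsd):

# <netxmsd_pid> is the PID from step 1
gdb -p <netxmsd_pid> -batch -ex "thread apply all bt" > netxmsd-backtrace.txt

gdb detaches automatically when the batch run finishes, so the server keeps running afterwards.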

Best regards,
Victor

Dani@M3T

Hi Victor

Here the output.

thanks
Dani

Victor Kirhenshtein

Hi,

from the trace I see that the server is blocked reading one of the Service.Check(...) parameters (and I see there are a lot of them), which probably takes some time to complete. A DCI is locked while it is being polled, so when the item poller thread attempts to access this DCI it blocks, and the watchdog warning is triggered.
The new "agent side DCI cache" feature introduced in 2.0-RC1 can actually help here - you can turn on the agent cache for those DCIs (or for all DCIs). This switches them from polling mode to push mode, eliminating the long locks on the server side.

Best regards,
Victor

Dani@M3T

Hi Victor

Thank you for your analysis.
I changed all 'Service.Check' DCIs to Agent Cache Mode = ON (these are all DCIs with source node = NetXMS server). This seems to be the solution (OK, it is a little bit early for the all-clear)!

Is this feature fully implemented now in V2.0-RC1?
Is there also a disadvantage to cache mode = on, or is it OK to change all agent DCIs to cache mode = on?
What are the possible values for the server configuration variable "DefaultAgentCacheMode"? I couldn't find it in the wiki.
Is there a short documentation of the agent cache mode feature?


thanks
Dani

Dani@M3T

Is there documentation of the cache feature?
There are 3 levels for this feature:

  • server default setting 'DefaultAgentCacheMode' (which number is which setting?)
  • setting on the agent (on, off, default; does 'default' mean inheriting the server setting?)
  • setting on the individual DCI (on, off, default)

Are there limitations with V2.0-RC2?

thanks
Dani

Victor Kirhenshtein

Hi,

it should be fully functional in 2.0-RC2. We will update the documentation soon. In short:

1. DefaultAgentCacheMode can be either 1 (on) or 2 (off);
2. agent level "default" means using the DefaultAgentCacheMode setting;
3. DCI level "default" means using the agent level setting.
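
For example, a quick way to inspect or change the server default from the command line could look like the sketch below. It is only a sketch: it assumes your nxdbmgr build supports the get/set subcommands for server configuration variables (otherwise the variable can be edited in the server configuration editor of the management console), and a restart of netxmsd may be required before a changed value takes effect:

# show the current value (1 = on, 2 = off)
nxdbmgr get DefaultAgentCacheMode
# switch the server-wide default to "on"
nxdbmgr set DefaultAgentCacheMode 1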

Best regards,
Victor