Best Practice to analyse event "SYS_THREAD_HANG"?

Started by Dani@M3T, April 14, 2015, 05:00:37 PM


Dani@M3T

We get a lot of "SYS_THREAD_HANG" events (Thread "Item Poller" is not responding).
Most of the time it lasts only a few seconds, then "SYS_THREAD_RUNNING" follows again.

What is the best way to analyse this and find the root cause?
(this might be a good topic for the troubleshooting section of the administrator guide)

In the server log we get:
[14-Apr-2015 15:30:51.580] [ERROR] Thread "Item Poller" does not respond to watchdog thread

Server debug log is attached.

We use NetXMS V2.0-M3 (Linux x64, PostgreSQL).

Thanks!

Victor Kirhenshtein

Hi,

there could be different reasons for such messages. Most often it is a bug in the server code or a lack of resources. The item poller may hang if the data collection configuration of one of the nodes stays locked for too long. Do you monitor the server's internal queues? You can also check the size of the queues when the item poller hangs (by running the command "show queues" on the server's debug console).
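
For reference, one way to look at the queues from the server host's command line is sketched below. This is only a sketch and assumes the nxadm administration tool shipped with the server is installed and netxmsd is running locally:

# open the interactive server debug console
nxadm -i
# then, at the netxmsd prompt, inspect the internal queues
show queues

If one of the counters keeps growing while the item poller is hung, that queue is a good candidate for the bottleneck.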

Best regards,
Victor

Dani@M3T

Thanks Victor. I didn't see anything in 'show queues'. At the moment I have no more such events, but I don't know why ;-) I will come back to this if it re-occurs.

Andreas@rc

Hi,

I'm facing the same problem; NetXMS started doing this recently. Sometimes the server seems able to recover, but still... this is not good.

I'm running version 1.2.17 on Windows 2008 R2 64-bit, with an MSSQL database on a different server.

netxmsd: sh que
Condition poller                 : 0
Configuration poller             : 0
Topology poller                  : 0
Data collector                   : 0
Database writer                  : 0
Database writer (IData)          : 0
Database writer (raw DCI values) : 0
Event processor                  : 0
Network discovery poller         : 0
Node poller                      : 0
Routing table poller             : 0
Status poller                    : 0
Syslog processing                : 0
Syslog writer                    : 0


What's the recommended debug level to help resolve this issue?

tomaskir

1.2.17 has a few bugs that can cause this.

I really recommend you try 2.0-RC1; it has a lot of fixes and optimizations that will probably fix this for you.

Dani@M3T

We are seeing this problem again as well, here on V2.0-RC1 on Linux x64.

Andreas@rc

As a workaround, I increased the polling interval (from 60 to 120 seconds) for most of the DCIs.

No thread hang so far, but still a bit early to celebrate.

Size of my setup:
Total number of objects:     17788
Number of monitored nodes:   401
Number of collectable DCIs:  2959

Andreas@rc

Quote from: Andreas@rc on August 25, 2015, 12:26:40 PM
As a workaround, I increased the polling interval (from 60 to 120 seconds) for most of the DCIs.

No thread hang so far, but still a bit early to celebrate.

Size of my setup:
Total number of objects:     17788
Number of monitored nodes:   401
Number of collectable DCIs:  2959


Well, only a partial success: the error now occurs every week instead of whenever the server feels like it. At least I can schedule a restart to prevent it from happening, but I would still prefer a proper solution.

Dani@M3T

We also have no solution yet.

Victor Kirhenshtein

Hi,

when you encounter the hang next time, could you please capture the process state using the following instructions?

1. Check netxmsd PID
2. Run

gdb netxmsd

(use the full path if necessary). It will show the (gdb) prompt.

3. Within gdb run

attach <netxmsd_pid>

It should show the (gdb) prompt again.

4.  Run

thread apply all bt

and send me the output (it could be quite long, so make sure you have an adequate scrollback buffer in your terminal).

5. Run

detach

to detach from netxmsd and let it run.
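
If you prefer a single non-interactive capture, the same backtrace can be taken in one go. This is just a sketch assuming gdb is installed and you are allowed to attach to the process (run it as root or as the user owning netxmsd):

# <netxmsd_pid> is the PID from step 1
gdb -p <netxmsd_pid> -batch -ex "thread apply all bt" > netxmsd-backtrace.txt

gdb detaches automatically when the batch run finishes, so the server keeps running afterwards.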

Best regards,
Victor

Dani@M3T

Hi Victor

Here the output.

thanks
Dani

Victor Kirhenshtein

Hi,

from the trace I see that the server is blocked reading one of the Service.Check(...) parameters (and I see there are a lot of them), which probably takes some time to complete. A DCI is locked while it is being polled, so when the item poller thread attempts to access this DCI it blocks, and the watchdog warning is triggered.
The new "agent side DCI cache" feature introduced in 2.0-RC1 can actually help here - you can turn on the agent cache for those DCIs (or for all DCIs). This switches them from polling mode to push mode, eliminating the long locks on the server side.

Best regards,
Victor

Dani@M3T

Hi Victor

Thank you for your analysis.
I changed all 'Service.Check' DCIs to Agent Cache Mode = ON (these are all DCIs with source node = NetXMS server). This seems to be the solution (OK, it is a little bit early for the all-clear)!

Is this feature fully implemented now in V2.0-RC1?
Is there also a disadvantage to cache mode = on, or is it OK to change all agent DCIs to cache mode = on?
What are the possible values for the server configuration variable "DefaultAgentCacheMode"? I couldn't find it in the wiki.
Is there a short documentation of the agent cache mode feature?


thanks
Dani

Dani@M3T

Is there documentation of the cache feature?
There are 3 levels for this feature:

  • server default setting 'DefaultAgentCacheMode' (which number is which setting?)
  • setting on the agent (on, off, default; does 'default' mean inheriting the server setting?)
  • setting on the individual DCI (on, off, default)

Are there limitations with V2.0-RC2?

thanks
Dani

Victor Kirhenshtein

Hi,

it should be fully functional in 2.0-RC2. We will update the documentation soon. In short:

1. DefaultAgentCacheMode can be either 1 (on) or 2 (off);
2. agent level "default" means using the DefaultAgentCacheMode setting;
3. DCI level "default" means using the agent level setting.
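
For example, a quick way to inspect or change the server default from the command line could look like the sketch below. It is only a sketch: it assumes your nxdbmgr build supports the get/set subcommands for server configuration variables (otherwise the variable can be edited in the server configuration editor of the management console), and a restart of netxmsd may be required before a changed value takes effect:

# show the current value (1 = on, 2 = off)
nxdbmgr get DefaultAgentCacheMode
# switch the server-wide default to "on"
nxdbmgr set DefaultAgentCacheMode 1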

Best regards,
Victor