Timeout Error Terminating Alarms

Started by RickB, June 28, 2018, 06:26:24 PM


RickB

Almost always, when I terminate more than 1 alarm, I get a timeout error.  I switched from the native database to MSSQL to see if that made a difference, but it did not.  Are there any settings, etc. that need to be changed to help prevent this?

Thank you,

-Rick Beaber

StanHubble

We also have this problem when resolving or terminating alarms, as well as with other operations on nodes or containers (renaming, binding, etc.). The timeout dialog is displayed, but the operation DOES happen. It just takes some time to show up in the management console.

Victor Kirhenshtein

Hi,

could you please provide statistics for your server(s)?
- CPU, RAM, disk I/O usage for server machine
- Output of the following commands on debug console:
  - show stats
  - show queues
  - show threads
  - show watchdog
  - show pollers
  - show msgwq
  - show dbcp
  - show dbstats
  - show flags

Best regards,
Victor
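For anyone collecting the same diagnostics, the debug console commands requested above can be captured in one pass. This is only a minimal sketch: it assumes the `nxadm` tool is on the PATH of the server machine and that its `-c` option runs a single console command. The helper names (`run_nxadm`, `collect`, `format_report`) are made up for illustration and are not part of NetXMS.

```python
import subprocess
from typing import Callable, Dict, List

# Commands requested in the post above, as typed on the debug console.
COMMANDS: List[str] = [
    "show stats", "show queues", "show threads", "show watchdog",
    "show pollers", "show msgwq", "show dbcp", "show dbstats", "show flags",
]


def run_nxadm(command: str) -> str:
    """Run one debug console command via nxadm (assumed to be on PATH)."""
    result = subprocess.run(
        ["nxadm", "-c", command],
        capture_output=True, text=True, timeout=60,
    )
    return result.stdout + result.stderr


def collect(runner: Callable[[str], str] = run_nxadm) -> Dict[str, str]:
    """Return a mapping of command -> captured output, in request order."""
    return {cmd: runner(cmd) for cmd in COMMANDS}


def format_report(outputs: Dict[str, str]) -> str:
    """Join the outputs into one text report with a header per command."""
    parts = []
    for cmd, out in outputs.items():
        parts.append(f"=== {cmd} ===\n{out.rstrip()}\n")
    return "\n".join(parts)


# Typical use on the server machine (not executed here):
#   print(format_report(collect()))
```

Running it while the slow operation is in progress, as Rick does later in the thread, is what makes the numbers useful. Review the report for IP addresses before posting it publicly.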

RickB

Results of the server commands sent via PM, as they contain IP addresses.

I ran the commands during a terminate alarms session, and noticed the pollers were at 100% usage.  Is this possibly the issue?

Thanks,

-Rick

Tatjana Dubrovica

Hi,

Yes, it might be the issue. Can you please update your system to 2.2.8 and check the poller load with the "show threads" command? If "Current load" is more than 100%, then can you please take a thread dump using the https://github.com/netxms/netxms/blob/master/tools/capture_netxmsd_threads.sh script? Please check that the debug packages are installed if the server was installed from the repository.

Script usage:
Run the script with the path to the netxmsd binary as the first parameter; the script will generate a thread dump in the /tmp folder.

RickB

Updated to 2.2.8. Pollers still show 100%. Timeouts still occur when terminating multiple alarms.

My NetXMS server is running on Windows Server 2016. I was able to install gdb and have tried running the script in a Git Bash shell, but I get these errors:

$ sh debug.sh
ps: unknown option -- x
Try `ps --help' for more information.
"c:\netxms\bin\netxmsd.exe": not in executable format: File format not recognized
C:/Users/ADMBEA~1/AppData/Local/Temp/3/capture_netxmsd_threads.gdb:3: Error in sourced command file:
set logging: No such file or directory.

Could you tell me what I'm doing wrong?

Thank you,

-Rick

Tursiops

I can't help with thread debugging, but just in regards to the 100% poller usage:
Have you attempted to increase your ThreadPool.Poller.MaxSize in the server configuration? If you are constantly running at 100% usage for that poller, increasing the maximum may help with that. Default value is 250, ours is currently set to ~6000.

I acknowledge/resolve/terminate dozens of alarms each morning, and when I've seen this timeout issue in the past, it was usually one of these:
- The Syncer thread is running and taking forever. While this is running, the interface is painful to use. We're using the maximum of 64 threads for this, which has reduced these occurrences significantly.
- Insufficient RAM on the server. It would still work, but queues would start increasing and everything would slow down.
- I have some large object list expanded, e.g. a template or container with hundreds of nodes in it. This slows the entire interface down. Closing such containers/templates tends to speed things up again. No idea if that's a Java thing or something else.

Not sure if any of the above might apply to you, but maybe it helps.

Tatjana Dubrovica

The debug script was created for Linux systems. On Windows you can create a dump from Task Manager by right-clicking the process and choosing "Create Dump File". Sorry for not asking about the OS first.

RickB

Tatjana:
Dump file created. It is approx. 73 MB zipped, which is too large to attach to this message. What is the best way to get it to you?

Thank you,

-Rick


Victor Kirhenshtein

Hi,

just sent upload link in PM.

Best regards,
Victor

jermudgeon

I'm seeing this same problem in 2.2.14. Thread loads appear normal.

Scenario to reproduce:

1. Create templated DCIs (~80k DCIs in this instance).
2. Disable DCI templates.
3. 80k alarms are created upon disabling DCIs.
4. Terminate alarms (4096 at a time).

I can prevent the client from crashing by immediately closing the alarm window after terminating 4096 alarms. The job will still time out, but the alarms will be terminated.
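
Step 4 of the scenario above (terminating in groups of 4096) can be pictured as a simple batching loop. The sketch below is generic and makes several assumptions: `terminate_batch` is a hypothetical callable standing in for whatever client call actually terminates a list of alarm IDs, and a client-side timeout is assumed to surface as Python's `TimeoutError`. Batches that time out are recorded for later verification rather than treated as failures, since earlier posts report that the operation still completes on the server. Whether smaller batches avoid the timeout is not established in this thread.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def batches(ids: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive lists of at most `size` alarm IDs."""
    if size <= 0:
        raise ValueError("size must be positive")
    batch: List[int] = []
    for alarm_id in ids:
        batch.append(alarm_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def terminate_in_batches(
    ids: Iterable[int],
    terminate_batch: Callable[[List[int]], None],  # hypothetical client call
    size: int = 4096,  # batch size mentioned in the post above
) -> Tuple[int, List[List[int]]]:
    """Terminate alarms batch by batch.

    Returns the number of alarms whose batch returned normally, plus the
    batches that timed out and should be re-checked on the server later.
    """
    confirmed = 0
    timed_out: List[List[int]] = []
    for batch in batches(ids, size):
        try:
            terminate_batch(batch)
            confirmed += len(batch)
        except TimeoutError:
            # The request may still have been processed server-side.
            timed_out.append(batch)
    return confirmed, timed_out
```

With 80k alarms and the default size this makes 20 calls; the `timed_out` list tells you which ID ranges to look at again once the server has caught up.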