Server crashed again, possibly database writer queue?

Started by xenth, May 13, 2008, 09:17:36 AM


xenth

I've had NetXMS running unattended for around 4-5 days and it crashed. It didn't collect any data after this occurred  :(

The log file shows the same problem as before:


[11-May-2008 17:32:46] Thread "Item Poller" does not respond to watchdog thread
[11-May-2008 17:34:06] Thread "Poll Manager" does not respond to watchdog thread
[11-May-2008 17:34:26] Thread "Syncer Thread" does not respond to watchdog thread


However, this time there were no substantial queues. To prove this, here's a graph of all the pollers.

Yellow: Status poller queue
Blue:    Data collector queue
Purple: Config poller queue
Light blue: Database writer queue
Green: Average time to queue DCIs



When I look at the entire graph over 4 days I see that the database writer queue can get pretty high from time to time, just for a moment, but still.

Here's a graph over 4 days, same colour codes apply.


xenth

Here is an example of how bad the database writer queue can get:




:(

xenth

Happened again just now  >:(


[13-May-2008 09:40:58] Thread "Item Poller" does not respond to watchdog thread
[13-May-2008 09:41:58] Thread "Poll Manager" does not respond to watchdog thread
[13-May-2008 09:42:38] Thread "Syncer Thread" does not respond to watchdog thread

Victor Kirhenshtein

Hello!

Please try to upgrade to 0.2.21. It's already available at https://www.netxms.org/download/netxms-0.2.21.exe but not announced yet, as we are doing some final testing. Your problem looks very similar to one solved in 0.2.21, related to SNMP data collection.

Best regards,
Victor

xenth

I am very interested in the changelog if you have it available  :)

I'm going to try upgrading the server when I have the time.

Thank you.


Victor Kirhenshtein

Change log:

- Multiple network maps implemented
- Added parameter ListenAddress to all services (server, web server, agent)
- New possible value for UseInterfaceAliases - concatenate name with alias
- Added possibility to create custom message in event matching script and
  use it in alarms and actions
- WMI subagent added
- SNMP sysDescr and agent's uname now polled and displayed
- New features in Windows console:
        - Possibility to use non-local timezone in Windows console
        - Default graph settings can be changed
- AIX subagent: implemented System.CPU.LoadAvg* and System.Uptime parameters
- Fixed issues: #193, #194, #198, #204, #209, #211, #212, #213, #214, #215

Best regards,
Victor

xenth

I'm using it now :)

I'll let you know if I experience strange things

xenth

The problem with the database writer queue hasn't been fixed unfortunately  :(
Here's a graph showing my problem


Alex Kirhenshtein

It looks normal; most of the time the queue is empty. These peaks are caused by the housekeeping process, which runs every hour (it loads quite a bit of data, runs a vacuum (on PostgreSQL), etc.).

The DB Writer queue size indicates a problem only when its value is above zero most of the time.

However, if the DB load is too high, you can try to increase the housekeeping interval in the server config; this should help a bit.
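
To illustrate the "above zero most of the time" rule, here is a minimal Python sketch. It is not part of NetXMS; the function name `queue_looks_backed_up` and the 50% cutoff used for "most of the time" are assumptions made for the example.

```python
def queue_looks_backed_up(samples, busy_fraction=0.5):
    """Return True if the queue was non-empty in more than
    `busy_fraction` of the collected samples.

    `samples` is a list of queue sizes, one per poll.
    The 0.5 default is an assumed reading of "most of the time".
    """
    if not samples:
        return False
    nonzero = sum(1 for size in samples if size > 0)
    return nonzero / len(samples) > busy_fraction


# Short hourly housekeeping peaks: the queue is empty almost all the time.
hourly_peaks = [0] * 58 + [500, 300]

# A writer that never catches up: the queue is rarely empty.
constant_backlog = [5, 8, 0, 12, 3]
```

With these inputs, `hourly_peaks` is not flagged even though its peak values are large, while `constant_backlog` is flagged despite its small values.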

xenth

Ahh I see :)

But I want to get an alert if one of the queues is too high. What threshold do you recommend that I set on the database writer queue?

Alex Kirhenshtein

I'd set something like "average >= 10 for last 5 polls"
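
As a rough sketch of what "average >= 10 for last 5 polls" means arithmetically, the Python below keeps a sliding window of the last 5 values and compares their mean to 10. This is only an illustration and not how the NetXMS server evaluates thresholds internally; the class name `AverageThreshold` and the choice to stay silent until 5 values have been collected are assumptions.

```python
from collections import deque


class AverageThreshold:
    """Fires when the mean of the last `polls` values is >= `limit`."""

    def __init__(self, limit=10.0, polls=5):
        self.limit = limit
        # deque with maxlen drops the oldest value automatically
        self.samples = deque(maxlen=polls)

    def update(self, value):
        """Record one polled value and return True if the threshold is hit."""
        self.samples.append(value)
        # Assumption: do not fire until the window is full.
        if len(self.samples) < self.samples.maxlen:
            return False
        return sum(self.samples) / len(self.samples) >= self.limit
```

Note that averaging smooths short spikes but does not hide large ones: a single value of 50 or more within five polls is enough to bring the mean to 10 even if the other four values are zero.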