Server crash

Started by xenth, April 30, 2008, 05:25:56 PM

xenth

Yesterday I started monitoring all the nodes in my own network as a big test.
However, when I try to log in today, the console is stuck at "synchronizing objects".

The core service and SQL are still running, but I will restart them both now to see if that solves the problem.

This is the log file data:


[30-Apr-2008 15:16:05] Thread "Item Poller" does not respond to watchdog thread
[30-Apr-2008 15:17:05] Thread "Poll Manager" does not respond to watchdog thread
[30-Apr-2008 15:18:45] Thread "Syncer Thread" does not respond to watchdog thread


This is a very big problem as I want to keep it running 24/7. Is there any way to determine what caused this and how it can be prevented
in the future?

Thanks in advance.

-UPDATE-
I couldn't stop the MySQL service, but I stopped and started the NetXMS core and I can log in now.

-UPDATE2-
Deleting DCIs is causing timeouts. Is this a related problem? I think so.

-UPDATE3-
It was just a matter of time: it happened again.


[30-Apr-2008 16:29:36] Log file opened
[30-Apr-2008 16:29:37] Database driver "mysql.ddr" loaded and initialized successfully
[30-Apr-2008 16:29:37] Stalled database lock removed
[30-Apr-2008 16:29:39] Failed to load template object with id 412 from database
[30-Apr-2008 16:29:39] NetXMS Server started
[30-Apr-2008 16:34:39] Thread "Item Poller" does not respond to watchdog thread
[30-Apr-2008 16:35:59] Thread "Poll Manager" does not respond to watchdog thread
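
A minimal sketch of how the watchdog warnings in a log like the one above could be extracted for later comparison. It assumes the log format is exactly as pasted in this post; the names `WATCHDOG_RE`, `parse_watchdog_lines` and `SAMPLE_LOG` are made up for illustration and are not part of NetXMS.

```python
import re
from datetime import datetime

# Matches lines such as:
# [30-Apr-2008 16:34:39] Thread "Item Poller" does not respond to watchdog thread
WATCHDOG_RE = re.compile(
    r'^\[(?P<ts>[^\]]+)\] Thread "(?P<thread>[^"]+)" '
    r'does not respond to watchdog thread$'
)


def parse_watchdog_lines(lines):
    """Return (timestamp, thread_name) pairs for every watchdog warning."""
    hits = []
    for line in lines:
        match = WATCHDOG_RE.match(line.strip())
        if match:
            # Timestamp format assumed from the pasted log (English month names).
            ts = datetime.strptime(match.group("ts"), "%d-%b-%Y %H:%M:%S")
            hits.append((ts, match.group("thread")))
    return hits


# Lines copied from the log in this post.
SAMPLE_LOG = [
    '[30-Apr-2008 16:29:39] NetXMS Server started',
    '[30-Apr-2008 16:34:39] Thread "Item Poller" does not respond to watchdog thread',
    '[30-Apr-2008 16:35:59] Thread "Poll Manager" does not respond to watchdog thread',
]

for ts, thread in parse_watchdog_lines(SAMPLE_LOG):
    print(ts.isoformat(), thread)
```

Only the two watchdog lines are reported; the "NetXMS Server started" line is skipped because it does not match the pattern.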



Alex Kirhenshtein

Hi.

When it hangs again, please do this:

Run "nxadm -i" on the server; this should give you access to the server's console.

In nxadm, execute the following commands:
show mutex
show pollers
show queues
show stats
show watchdog


Also, it would be great if you could attach a debugger to the process and make a minidump or get thread info.

If you are running Windows, this can be done using WinDbg (a free download from Microsoft):
*) run WinDbg, press F6 (attach to the process), select netxmsd.
*) type: ".dump c:\netxms.dump"

If you are running Unix, you can use gdb:
*) run "gdb /path/to/netxmsd"
*) type: "attach netxmsd_pid"
*) type: "thread apply all bt"; this should give you a large output with the state of all threads.
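
The gdb steps above can also be run non-interactively, which makes it easier to save the output to a file. The following is only a sketch: it relies on gdb's standard `-p`, `-batch` and `-ex` options, and the function names and the output path are hypothetical.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb command line that attaches to `pid`, prints a backtrace
    for every thread, then exits (batch mode detaches automatically)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def save_thread_backtraces(pid, out_path):
    """Run gdb against `pid` and write everything it prints to `out_path`."""
    with open(out_path, "w") as out:
        subprocess.run(
            gdb_backtrace_command(pid),
            stdout=out,
            stderr=subprocess.STDOUT,
            check=True,
        )


if __name__ == "__main__":
    # Show the command that would be run for an example PID.
    print(" ".join(gdb_backtrace_command(1234)))
```

To capture a real trace, call `save_thread_backtraces(netxmsd_pid, "netxmsd-threads.txt")` with the actual process ID. Attaching usually requires running as the same user as netxmsd, or as root.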

xenth

I am sorry, minidump is currently not an option (working over internet, the firewall is configured to not give that server http access).
I will get it for you on monday.

Here is the output from the commands:


NetXMS Server Remote Console V0.2.20 Ready
Enter "help" for command list

netxmsd: show mutex
Mutex status:
  g_hMutexIdIndex: locked for reading
  g_hMutexNodeIndex: locked for reading
  g_hMutexSubnetIndex: unlocked
  g_hMutexInterfaceIndex: unlocked

netxmsd: show pollers
PT  TIME                   STATE
S   01/May/2008 17:36:26   wait
S   01/May/2008 17:37:17   poll: [Censored] - Oki 3530 [372] - child poll
S   01/May/2008 17:36:26   wait
S   01/May/2008 17:36:36   wait
S   01/May/2008 17:37:18   wait
S   01/May/2008 17:36:26   wait
S   01/May/2008 17:36:26   wait
S   01/May/2008 17:36:26   wait
S   01/May/2008 17:37:13   wait
S   01/May/2008 17:36:26   wait
C   01/May/2008 17:35:17   poll: [Censored] - [Censored] [369] - interface check
C   01/May/2008 17:37:18   poll: [Censored]- [Censored] [440] - interface check
C   01/May/2008 17:36:51   poll: [Censored] - [Censored] [444] - capability check
C   01/May/2008 17:36:00   poll: [Censored] - [Censored] [425] - interface check
R   01/May/2008 17:34:19   wait
R   01/May/2008 17:34:27   wait
R   01/May/2008 17:35:16   wait
R   01/May/2008 17:34:36   wait
R   01/May/2008 17:34:36   wait
D   01/May/2008 17:33:06   wait
N   01/May/2008 17:33:06   wait
N   01/May/2008 17:33:06   wait
N   01/May/2008 17:33:06   wait
N   01/May/2008 17:33:06   wait
N   01/May/2008 17:33:06   wait
N   01/May/2008 17:33:06   wait
N   01/May/2008 17:33:06   wait
N   01/May/2008 17:33:06   wait
N   01/May/2008 17:33:06   wait
N   01/May/2008 17:33:06   wait
A   01/May/2008 17:33:06   wait

netxmsd: show queues
Condition poller                 : 0
Configuration poller             : 0
Data collector                   : 330
Database writer                  : 0
Event processor                  : 0
Network discovery poller         : 0
Node poller                      : 0
Routing table poller             : 0
Status poller                    : 0

netxmsd: show status
ERROR: Invalid SHOW subcommand

netxmsd: show stats
Total number of objects:     103
Number of monitored nodes:   36
Number of collectable DCIs:  383

netxmsd: show watchdog
Thread                                           Interval Status
----------------------------------------------------------------------------
Item Poller                                      20       Running
Syncer Thread                                    130      Running
Poll Manager                                     60       Running


Please note that I have changed the object names to: [Censored]

Thanks again  :)
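
As an illustration of how the "show queues" output above could be checked automatically, here is a minimal sketch. It assumes the output has been captured as plain text in the "name : count" layout shown in this post; `parse_queue_sizes` and `backlogged` are hypothetical helper names.

```python
def parse_queue_sizes(text):
    """Parse 'name : count' lines from "show queues" output into a dict."""
    sizes = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        # Skip prompts, blank lines and anything without a numeric count.
        if sep and value.strip().isdigit():
            sizes[name.strip()] = int(value.strip())
    return sizes


def backlogged(sizes, threshold=0):
    """Return only the queues holding more than `threshold` entries."""
    return {name: n for name, n in sizes.items() if n > threshold}


# Output copied from this post.
SHOW_QUEUES = """\
Condition poller                 : 0
Configuration poller             : 0
Data collector                   : 330
Database writer                  : 0
Event processor                  : 0
Network discovery poller         : 0
Node poller                      : 0
Routing table poller             : 0
Status poller                    : 0
"""

print(backlogged(parse_queue_sizes(SHOW_QUEUES)))
```

With the output from this post, the only queue reported is "Data collector" with 330 waiting entries.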

xenth

Anything else I can try until I get you the minidump data?  :(

xenth

Here is the requested dump

This is the dump I made while I could not connect to the server via the console.
Just after I made this dump, the whole core service shut down; this is new behaviour.

Download dump file: http://www.mediafire.com/?madxezmwjem
Mirror: http://rapidshare.com/files/112653022/netxms.dump.html

I hope I made the dump correctly, please let me know as soon as possible.
Thanks in advance.


xenth

I just noticed something very disturbing, the polling isn't going properly.

My set intervals aren't being listened to, I'll give you an example, this is one of my workstations.
I have set the interval to 60 seconds, yet it does NOT poll every 60 seconds  :'(

It should poll every 60 seconds, yet when I view the history:


05-May-2008 07:13:29 527216640
05-May-2008 07:09:30 527216640
05-May-2008 07:05:48 527216640
05-May-2008 07:01:57 527216640
05-May-2008 06:58:12 527216640
05-May-2008 06:54:25 527216640
05-May-2008 06:50:26 527216640
05-May-2008 06:46:32 527216640
05-May-2008 06:42:42 527216640
05-May-2008 06:38:50 527216640


This is a VERY big problem as well, is it related to my other problem?  :(
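
To put a number on the gap between the configured and the actual polling interval, the history above can be fed through a short script. This is a minimal sketch that assumes each history line is "date time value" exactly as pasted; `poll_intervals` is a hypothetical helper name.

```python
from datetime import datetime

# History copied from this post (newest sample first).
HISTORY = """\
05-May-2008 07:13:29 527216640
05-May-2008 07:09:30 527216640
05-May-2008 07:05:48 527216640
05-May-2008 07:01:57 527216640
05-May-2008 06:58:12 527216640
05-May-2008 06:54:25 527216640
05-May-2008 06:50:26 527216640
05-May-2008 06:46:32 527216640
05-May-2008 06:42:42 527216640
05-May-2008 06:38:50 527216640
"""


def poll_intervals(history_text):
    """Return the gaps in seconds between consecutive samples, oldest first."""
    stamps = sorted(
        datetime.strptime(" ".join(line.split()[:2]), "%d-%b-%Y %H:%M:%S")
        for line in history_text.splitlines()
        if line.strip()
    )
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


gaps = poll_intervals(HISTORY)
mean_gap = sum(gaps) / len(gaps)
print(gaps)
print(mean_gap)
```

For these ten samples the gaps range from 222 to 239 seconds and average 231 seconds, so the value is being collected roughly every four minutes instead of every 60 seconds.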

Alex Kirhenshtein

Thanks for the dump, I'm checking it.

Victor Kirhenshtein

Quote from: xenth on May 05, 2008, 10:27:55 AM
I just noticed something very disturbing, the polling isn't going properly.

My set intervals aren't being listened to, I'll give you an example, this is one of my workstations.
I have set the interval to 60 seconds, yet it does NOT poll every 60 seconds  :'(

It should poll every 60 seconds, yet when I view the history:

This is a VERY big problem as well, is it related to my other problem?  :(

Yes, most likely they are related. If you look at the result of the "show queues" console command, you can see quite a big number (330) in the data collector queue. This means that at the moment you typed this command, 330 data collection requests were waiting for processing because all data collectors were busy. You can try to increase the number of data collectors to 40 or 50 (by changing the server's parameter NumberOfDataCollectors); this may help a bit, but will not remove the problem completely.
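
A rough way to see why the queue grows is Little's law: the average number of busy collectors equals the request arrival rate multiplied by the average time one request takes. The sketch below is a back-of-the-envelope estimate, not how the NetXMS server schedules work internally. It assumes all 383 DCIs reported by "show stats" are polled every 60 seconds, and the 0.1 s and 5 s request times are hypothetical values chosen only to show the effect of requests that wait for a timeout.

```python
import math


def collectors_needed(dci_count, interval_s, mean_request_s):
    """Estimate how many collectors are busy on average (Little's law):
    arrival rate in requests per second times seconds per request."""
    arrival_rate = dci_count / interval_s
    return math.ceil(arrival_rate * mean_request_s)


# Assumption: every DCI is polled every 60 s and answers in about 0.1 s.
fast = collectors_needed(383, 60, 0.1)

# Assumption: every request instead blocks for a 5 s timeout.
slow = collectors_needed(383, 60, 5.0)

print(fast, slow)
```

Under these assumptions a single collector keeps up when replies are quick, while 32 are needed once every request waits 5 seconds. If a request takes that long on average, a small collector pool falls behind and the queue keeps growing.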

Some additional questions: do you have SNMP on the nodes? Do you use SNMP for data collection?

Best regards,
Victor

xenth

Hi,

I set NumberOfDataCollectors to 90 and the data collector queue is now at around 150. Should I keep increasing the number of collectors?

Some of the workstations run Windows XP with the SNMP service, but I am not running any DCIs for those.
I am, however, monitoring several routers, modems, etc. with SNMP.

To be more precise: I am monitoring about 7 or 8 nodes with SNMP, with an average of about 6-7 DCIs per SNMP node.

Thank you for your time.

Victor Kirhenshtein

Hi!

Further increasing the number of data collectors will not help. We have a similar problem at one of our customers' sites, and it's related to SNMP. I'll publish an updated version of the NetXMS server today; it may help.

Best regards,
Victor

xenth

Great! I can't wait to test it  :)

The queue problem is fixed now anyway; I'll post another topic with some questions regarding it.

xenth

Hi, small update.

I have now set "NumberOfDataCollectors" to 100 and everything is running as smoothly as ever, with 0 in the queues :)
Every problem I had with crashing and timeouts appears to be gone.

:)

Victor Kirhenshtein

Very interesting information... Thank you for reporting!

Best regards,
Victor

xenth

You're welcome. Unfortunately, not everything is solved  :(

On all the workstations I am monitoring I have disabled status polling, personally I think that's where the problem is coming from.

When everyone is working on the machines (and they are reachable by netxms) everything is fine with the current settings.

However, at the moment (it's 20:00 here) I am seeing a queue of around 100 for the data collectors and the same problems are happening again: I can't delete anything, I get timeouts, and I have to restart the NetXMS service.

You see what's happening is this:

08:00 - 17:00     -    Server requests data from agents and gets replies     -     no queues, no issues.
17:00 - 08:00     -    Server requests data from agents and the packets are blocked by the firewall because the destination address does not exist at that moment      -      queues and lots of issues.

-UPDATE: Here's the router's log message for when it tries to poll offline nodes, if it helps:
Exceed MAX incomplete, sent TCP RST

Putting my workstations back on status polling is not an option because they go down so frequently  :(
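
The day/night pattern described above can be checked from outside NetXMS with a quick TCP reachability probe that uses a short timeout. This is a standalone sketch, not something the server does itself. It assumes the agents listen on TCP port 4700 (believed to be the NetXMS agent default; adjust if yours differs), and both `agent_reachable` and the example address are hypothetical.

```python
import socket


def agent_reachable(host, port=4700, timeout_s=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s.

    A powered-off workstation behind a firewall typically fails only after
    the full timeout, which is the delay that ties up a collector."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


if __name__ == "__main__":
    # Hypothetical workstation address, for illustration only.
    print(agent_reachable("192.0.2.10", timeout_s=1.0))
```

Running the probe against the workstations during and after office hours would show whether the offline machines fail quickly or only after the full timeout.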