NetXMS Support Forum

English Support => General Support => Topic started by: xenth on April 30, 2008, 05:25:56 PM

Title: Server crash
Post by: xenth on April 30, 2008, 05:25:56 PM
Yesterday I started monitoring all the nodes in my own network as a big test.
However, when I try to log in today, the console is stuck at "synchronizing objects".

The core service and SQL are still running, but I will restart them both now to see if that solves the problem.

This is the log file data:


[30-Apr-2008 15:16:05] Thread "Item Poller" does not respond to watchdog thread
[30-Apr-2008 15:17:05] Thread "Poll Manager" does not respond to watchdog thread
[30-Apr-2008 15:18:45] Thread "Syncer Thread" does not respond to watchdog thread


This is a very big problem, as I want to keep it running 24/7. Is there any way to determine what caused this
and how it can be prevented in the future?

Thanks in advance.

-UPDATE-
I couldn't stop the MySQL service, but I stopped and started the NetXMS core and I can log in now.

-UPDATE2-
Deleting DCIs is causing timeouts. Is this a related problem? I think so.

-UPDATE3-
It was just a matter of time; it happened again:


[30-Apr-2008 16:29:36] Log file opened
[30-Apr-2008 16:29:37] Database driver "mysql.ddr" loaded and initialized successfully
[30-Apr-2008 16:29:37] Stalled database lock removed
[30-Apr-2008 16:29:39] Failed to load template object with id 412 from database
[30-Apr-2008 16:29:39] NetXMS Server started
[30-Apr-2008 16:34:39] Thread "Item Poller" does not respond to watchdog thread
[30-Apr-2008 16:35:59] Thread "Poll Manager" does not respond to watchdog thread


Title: Re: Server crash
Post by: Alex Kirhenshtein on April 30, 2008, 09:24:03 PM
Hi.

When it hangs again, please do this:

Run "nxadm -i" on the server; this should give you access to the server's console.

In nxadm, execute the following commands:
show mutex
show pollers
show queues
show stats
show watchdog


It would also be great if you could attach a debugger to the process and make a minidump or get thread info.

If you are running Windows, this can be done using WinDbg (freeware from Microsoft):
*) run WinDbg, press F6 (attach to the process), select netxmsd.
*) type: ".dump c:\netxms.dump"

If you are running Unix, you can use gdb:
*) run "gdb /path/to/netxmsd"
*) type: "attach netxmsd_pid"
*) type: "thread apply all bt"; this should give you a large output with the state of all threads.
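
The gdb steps above can also be run non-interactively. This is only a minimal sketch, assuming gdb is on the PATH and you already know the process ID of netxmsd. The helper names build_gdb_command and dump_threads are made up for illustration and are not part of NetXMS.

```python
import subprocess


def build_gdb_command(pid):
    """Build a gdb command line that attaches to `pid`, prints the
    backtrace of every thread, and then exits (batch mode)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_threads(pid, out_path):
    """Run gdb against the given process and save its output to a file."""
    result = subprocess.run(
        build_gdb_command(pid), capture_output=True, text=True, check=False
    )
    with open(out_path, "w") as out:
        out.write(result.stdout)
```

For example, dump_threads(1234, "netxmsd-threads.txt") would write the thread states of process 1234 to a text file that can be attached to a forum post. Attaching usually requires the same user as the process, or root.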
Title: Re: Server crash
Post by: xenth on May 01, 2008, 06:44:01 PM
I am sorry, a minidump is currently not an option (I am working over the internet, and the firewall is configured to not give that server HTTP access).
I will get it for you on Monday.

Here is the output from the commands:


NetXMS Server Remote Console V0.2.20 Ready
Enter "help" for command list

netxmsd: show mutex
Mutex status:
  g_hMutexIdIndex: locked for reading
  g_hMutexNodeIndex: locked for reading
  g_hMutexSubnetIndex: unlocked
  g_hMutexInterfaceIndex: unlocked

netxmsd: show pollers
PT  TIME                   STATE
S   01/May/2008 17:36:26   wait
S   01/May/2008 17:37:17   poll: [Censored] - Oki 3530 [372] - child poll
S   01/May/2008 17:36:26   wait
S   01/May/2008 17:36:36   wait
S   01/May/2008 17:37:18   wait
S   01/May/2008 17:36:26   wait
S   01/May/2008 17:36:26   wait
S   01/May/2008 17:36:26   wait
S   01/May/2008 17:37:13   wait
S   01/May/2008 17:36:26   wait
C   01/May/2008 17:35:17   poll: [Censored] - [Censored] [369] - interface check
C   01/May/2008 17:37:18   poll: [Censored]- [Censored] [440] - interface check
C   01/May/2008 17:36:51   poll: [Censored] - [Censored] [444] - capability check
C   01/May/2008 17:36:00   poll: [Censored] - [Censored] [425] - interface check
R   01/May/2008 17:34:19   wait
R   01/May/2008 17:34:27   wait
R   01/May/2008 17:35:16   wait
R   01/May/2008 17:34:36   wait
R   01/May/2008 17:34:36   wait
D   01/May/2008 17:33:06   wait
N   01/May/2008 17:33:06   wait
N   01/May/2008 17:33:06   wait
N   01/May/2008 17:33:06   wait
N   01/May/2008 17:33:06   wait
N   01/May/2008 17:33:06   wait
N   01/May/2008 17:33:06   wait
N   01/May/2008 17:33:06   wait
N   01/May/2008 17:33:06   wait
N   01/May/2008 17:33:06   wait
N   01/May/2008 17:33:06   wait
A   01/May/2008 17:33:06   wait

netxmsd: show queues
Condition poller                 : 0
Configuration poller             : 0
Data collector                   : 330
Database writer                  : 0
Event processor                  : 0
Network discovery poller         : 0
Node poller                      : 0
Routing table poller             : 0
Status poller                    : 0

netxmsd: show status
ERROR: Invalid SHOW subcommand

netxmsd: show stats
Total number of objects:     103
Number of monitored nodes:   36
Number of collectable DCIs:  383

netxmsd: show watchdog
Thread                                           Interval Status
----------------------------------------------------------------------------
Item Poller                                      20       Running
Syncer Thread                                    130      Running
Poll Manager                                     60       Running


Please note that I have changed the object names to: [Censored]

Thanks again  :)
Title: Re: Server crash
Post by: xenth on May 03, 2008, 02:27:58 AM
Is there anything else I can try until I get you the minidump data?  :(
Title: Re: Server crash
Post by: xenth on May 05, 2008, 10:10:25 AM
Here is the requested dump

This is the dump I made while I could not connect to the server via the console.
Just after I made this dump, the whole core service shut down. This is new behaviour.

Download dump file: http://www.mediafire.com/?madxezmwjem
Mirror: http://rapidshare.com/files/112653022/netxms.dump.html

I hope I made the dump correctly, please let me know as soon as possible.
Thanks in advance.

Title: Re: Server crash
Post by: xenth on May 05, 2008, 10:27:55 AM
I just noticed something very disturbing: the polling isn't working properly.

My configured intervals aren't being honored. Here is an example from one of my workstations.
I have set the interval to 60 seconds, yet it does NOT poll every 60 seconds  :'(

It should poll every 60 seconds, yet when I view the history:


05-May-2008 07:13:29 527216640
05-May-2008 07:09:30 527216640
05-May-2008 07:05:48 527216640
05-May-2008 07:01:57 527216640
05-May-2008 06:58:12 527216640
05-May-2008 06:54:25 527216640
05-May-2008 06:50:26 527216640
05-May-2008 06:46:32 527216640
05-May-2008 06:42:42 527216640
05-May-2008 06:38:50 527216640


This is a VERY big problem as well. Is it related to my other problem?  :(
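
As a quick check of the history above, the gaps between samples can be computed directly. This is a minimal sketch: the timestamps are copied from the post, while the function name poll_gaps is made up for illustration. It assumes an English locale so that %b parses "May".

```python
from datetime import datetime

# Timestamps copied from the DCI history in the post (newest first).
HISTORY = [
    "05-May-2008 07:13:29",
    "05-May-2008 07:09:30",
    "05-May-2008 07:05:48",
    "05-May-2008 07:01:57",
    "05-May-2008 06:58:12",
    "05-May-2008 06:54:25",
    "05-May-2008 06:50:26",
    "05-May-2008 06:46:32",
    "05-May-2008 06:42:42",
    "05-May-2008 06:38:50",
]


def poll_gaps(timestamps):
    """Return the gaps in seconds between consecutive samples, oldest first."""
    times = sorted(datetime.strptime(t, "%d-%b-%Y %H:%M:%S") for t in timestamps)
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


gaps = poll_gaps(HISTORY)
print(gaps)                   # gap between each pair of samples, in seconds
print(sum(gaps) / len(gaps))  # average gap in seconds
```

For these ten samples the gaps range from 222 to 239 seconds, with an average of 231 seconds, which is nearly four times the configured 60 seconds.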
Title: Re: Server crash
Post by: Alex Kirhenshtein on May 05, 2008, 02:45:38 PM
Thanks for the dump, I'm checking it.
Title: Re: Server crash
Post by: xenth on May 05, 2008, 05:04:02 PM
Great! :)
Title: Re: Server crash
Post by: Victor Kirhenshtein on May 05, 2008, 06:09:25 PM
Quote from: xenth on May 05, 2008, 10:27:55 AM
I just noticed something very disturbing: the polling isn't working properly.

My configured intervals aren't being honored. Here is an example from one of my workstations.
I have set the interval to 60 seconds, yet it does NOT poll every 60 seconds  :'(

It should poll every 60 seconds, yet when I view the history:

This is a VERY big problem as well. Is it related to my other problem?  :(

Yes, most likely they are related. If you take a look at the result of the "show queues" console command, you can see quite a big number (330) in the data collector queue. This means that at the moment you typed this command, 330 requests for collecting data were waiting for processing, because all data collectors were busy. You can try to increase the number of data collectors to 40 or 50 (by changing the server's parameter NumberOfDataCollectors). This may help a bit, but will not remove the problem completely.
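
To spot a backlog like this without reading the listing by eye, the "show queues" output can be parsed. This is a minimal sketch that assumes the output has the "name : number" layout shown earlier in this thread. The function names parse_queues and backlogged are made up for illustration and are not part of NetXMS.

```python
# Output of "show queues" as posted earlier in this thread.
SHOW_QUEUES_OUTPUT = """\
Condition poller                 : 0
Configuration poller             : 0
Data collector                   : 330
Database writer                  : 0
Event processor                  : 0
Network discovery poller         : 0
Node poller                      : 0
Routing table poller             : 0
Status poller                    : 0
"""


def parse_queues(text):
    """Parse 'name : size' lines into a dict mapping queue name to size."""
    queues = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        if sep and value.strip().isdigit():
            queues[name.strip()] = int(value.strip())
    return queues


def backlogged(queues, threshold=0):
    """Return only the queues whose size exceeds the threshold."""
    return {name: size for name, size in queues.items() if size > threshold}


print(backlogged(parse_queues(SHOW_QUEUES_OUTPUT)))
```

With the listing from this thread, only the data collector queue is reported as backlogged.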

Some additional questions: do you have SNMP on the nodes? Do you use SNMP for data collection?

Best regards,
Victor
Title: Re: Server crash
Post by: xenth on May 05, 2008, 08:29:22 PM
Hi,

I set NumberOfDataCollectors to 90 and the data collector queue is now at around 150. Should I keep increasing the number of collectors?

Some of the workstations run Windows XP with the SNMP service, but I am not running any DCIs for those.
I am, however, monitoring several routers, modems, etc. with SNMP.

To be more precise: I am monitoring about 7 or 8 nodes with SNMP, with an average of about 6-7 DCIs per SNMP node.

Thank you for your time.
Title: Re: Server crash
Post by: Victor Kirhenshtein on May 06, 2008, 09:15:16 AM
Hi!

A further increase of data collectors will not help. We have a similar problem at one of our customers' sites, and it's related to SNMP. I'll publish an updated version of the NetXMS server today; it may help.

Best regards,
Victor
Title: Re: Server crash
Post by: xenth on May 06, 2008, 10:34:45 AM
Great! I can't wait to test it  :)

The queue problem is fixed now anyway. I'll post another topic with some questions regarding it.
Title: Re: Server crash
Post by: xenth on May 06, 2008, 02:12:30 PM
Hi, small update.

I have now set NumberOfDataCollectors to 100 and everything is running as smoothly as ever, with 0 in the queues :)
Every problem I had with crashing and timeouts appears to be gone.

:)
Title: Re: Server crash
Post by: Victor Kirhenshtein on May 06, 2008, 02:16:14 PM
Very interesting information... Thank you for reporting!

Best regards,
Victor
Title: Re: Server crash
Post by: xenth on May 06, 2008, 09:05:03 PM
You're welcome. Unfortunately, not everything is solved  :(

I have disabled status polling on all the workstations I am monitoring. Personally, I think that's where the problem is coming from.

When everyone is working on the machines (and they are reachable by NetXMS), everything is fine with the current settings.

However, at the moment (it's 20:00 here) I am seeing a queue of around 100 for the data collectors and the same problems are happening again: I can't delete anything, I get timeouts, and I have to restart the NetXMS service.

You see what's happening is this:

08:00 - 17:00     -    Server requests data from agents and gets replies     -     no queues, no issues.
17:00 - 08:00     -    Server requests data from agents and the packets are blocked by the firewall because the destination address does not exist at that moment      -      queues and lots of issues.

-UPDATE: Here's the router's log message for when it tries to poll offline nodes, if it helps:
Exceed MAX incomplete, sent TCP RST

Putting my workstations back on status polling is not an option because they go down so frequently  :(
Title: Re: Server crash
Post by: Victor Kirhenshtein on May 06, 2008, 10:00:44 PM
It can be a problem for NetXMS, because if it determines during a status poll that a node is down, data collection will not be scheduled for that node. But if there are no status polls, NetXMS will try to establish a TCP connection for each parameter to be collected. With a connection timeout of about 1 minute, this can easily cause long data collection queues.
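
A rough back-of-envelope estimate shows why the queue grows. This is only a sketch under simplifying assumptions: each collection attempt against an unreachable node blocks one data collector for the full connection timeout, and every DCI is retried once per polling interval. The function name collectors_needed and the count of 150 offline DCIs are hypothetical; only the 60-second interval and the timeout of about 1 minute come from this thread.

```python
import math


def collectors_needed(offline_dcis, timeout_s, interval_s):
    """Estimate how many collectors are kept busy by DCIs on unreachable
    nodes, if each attempt blocks one collector for the full timeout."""
    return math.ceil(offline_dcis * timeout_s / interval_s)


# Hypothetical: 150 DCIs on nodes that are switched off after hours,
# polled every 60 seconds, with a 60 second connection timeout.
print(collectors_needed(150, 60, 60))

# The same DCIs polled every 300 seconds would tie up far fewer collectors.
print(collectors_needed(150, 60, 300))
```

Under these assumptions, whenever the estimate exceeds NumberOfDataCollectors the queue keeps growing, which would fit the pattern of the backlog appearing only while the workstations are offline.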

What is the reason for disabling status polls? If you don't want alarms, mails, etc. for SYS_NODE_DOWN events, you can just block them in the first rule of event processing policy.

Best regards,
Victor
Title: Re: Server crash
Post by: xenth on May 06, 2008, 10:13:13 PM
Hi  :)

That makes a lot of sense.

But if I block those specific nodes in the policy, that will only remove them from my alarm list (which is what I want). They will still be marked with a red dot in front of them, and in the network summary they will be displayed as critical nodes.

This is the specific reason I've blocked status polling on those nodes.

Basically what I wanted to achieve:
Only important nodes need to be marked with a red dot and displayed as critical in the network summary. Is there any other way to achieve this?

Thanks again for all your work, I appreciate it.

Title: Re: Server crash
Post by: Victor Kirhenshtein on May 07, 2008, 12:01:30 AM
Node status is calculated from two sources: active alarms for that node and the status of child objects (usually interfaces). If you want the node to remain in NORMAL status even if some or all interfaces go down, you should change the status propagation algorithm for the child objects of that node. You can select the "fixed value" propagation algorithm for interface objects and set the value to NORMAL.

Best regards,
Victor
Title: Re: Server crash
Post by: xenth on May 07, 2008, 11:39:38 AM
Great! This appears to be working perfectly :)