NetXMS Server Crashing

Started by Netvoid, November 05, 2010, 05:15:49 PM


Netvoid

The server is Windows 2008 R2 64-bit with a SQL Server 2008 database on a separate machine. It is running the 1.0.5 server with 1.0.6 clients.

The following error is happening,



We have about 450 agents and maybe a total of 1500 nodes. Only about 4-6 DCIs per node, and the vast majority are just monitoring Windows event logs for errors and criticals.

I am noticing that a fair number of agents drop connectivity and then restore it in blocks of about 10-20 every 10-20 minutes. The crashes only started after we went from about 300 agents installed to 450 agents installed. I put the numberofstatuspollers up to 50 at first and that didn't seem to help, so now I am putting the number back down to 30. That seems related but I'm not sure. The network utilization is steady at about 2-3 percent. The processor is steady at about 10-20 percent. Disk IO is nothing...

Any suggestions based on this information would be helpful.

Victor Kirhenshtein

I'll try to check this. Also, could you please enable crash dump generation on the server and send the generated dumps to [email protected]? This can greatly simplify debugging. You can enable crash dump generation by setting two parameters in netxmsd.conf:

CreateCrashDumps = yes
DumpDirectory = some existing directory writable by server process

Also, could you please run the command "nxadm -i" on the server machine and post the output of the following commands:

show pollers
show queues
show flags
show mutex

Best regards,
Victor

Netvoid

I enabled the crash dumps, next time it happens I'll post.

These commands were run about 2-3 minutes after a reboot to get the system operational again. I'll run them again in a few hours to see if something sticks out later on.

show pollers:



show queues:

netxmsd: show queues
Condition poller                 : 0
Configuration poller             : 3590
Data collector                   : 7096
Database writer                  : 0
Event processor                  : 0
Network discovery poller         : 0
Node poller                      : 1
Routing table poller             : 2307
Status poller                    : 1940
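
For tracking these queue sizes across repeated runs, the listing above could be parsed with a small script. This is only a sketch, assuming the output has been pasted or saved as plain text in the same "name : number" layout shown here; parse_queues and growing_queues are hypothetical helper names and are not part of NetXMS.

```python
def parse_queues(text):
    """Parse 'show queues' output into a {queue name: size} dict.

    Assumes each data line looks like 'Name   : 123', as in the listing above.
    Lines whose value is not a plain number (such as the echoed command) are skipped.
    """
    sizes = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        if sep and value.isdigit():
            sizes[name.strip()] = int(value)
    return sizes


def growing_queues(previous, current):
    """Return {queue name: increase} for queues larger now than in the previous snapshot."""
    return {
        name: size - previous[name]
        for name, size in current.items()
        if name in previous and size > previous[name]
    }


# Snapshot copied from the output above.
SNAPSHOT = """\
netxmsd: show queues
Condition poller                 : 0
Configuration poller             : 3590
Data collector                   : 7096
Database writer                  : 0
Event processor                  : 0
Network discovery poller         : 0
Node poller                      : 1
Routing table poller             : 2307
Status poller                    : 1940
"""

if __name__ == "__main__":
    sizes = parse_queues(SNAPSHOT)
    # List queues from largest to smallest to see where the backlog sits.
    for name, size in sorted(sizes.items(), key=lambda item: item[1], reverse=True):
        print(f"{name:<28} {size}")
```

Comparing two snapshots taken an hour apart with growing_queues would show which queues keep filling rather than draining.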

show flags:

netxmsd: show flags
Flags: 0x43300257
  AF_DAEMON                        = 1
  AF_USE_SYSLOG                    = 1
  AF_ENABLE_NETWORK_DISCOVERY      = 1
  AF_ACTIVE_NETWORK_DISCOVERY      = 0
  AF_LOG_SQL_ERRORS                = 1
  AF_DELETE_EMPTY_SUBNETS          = 0
  AF_ENABLE_SNMP_TRAPD             = 1
  AF_ENABLE_ZONING                 = 0
  AF_SYNC_NODE_NAMES_WITH_DNS      = 0
  AF_CHECK_TRUSTED_NODES           = 1
  AF_WRITE_FULL_DUMP               = 0
  AF_RESOLVE_NODE_NAMES            = 1
  AF_CATCH_EXCEPTIONS              = 1
  AF_INTERNAL_CA                   = 0
  AF_DB_LOCKED                     = 1
  AF_ENABLE_MULTIPLE_DB_CONN       = 1
  AF_DB_CONNECTION_LOST            = 0
  AF_NO_NETWORK_CONNECTIVITY       = 0
  AF_EVENT_STORM_DETECTED          = 0
  AF_SERVER_INITIALIZED            = 1
  AF_SHUTDOWN                      = 0

show mutex:

netxmsd: show mutex
Mutex status:
  g_hMutexIdIndex: unlocked
  g_hMutexNodeIndex: unlocked
  g_hMutexSubnetIndex: unlocked
  g_hMutexInterfaceIndex: unlocked

Thanks

Netvoid

Here is the dump capture; the .mdmp file was empty.


NETXMSD CRASH DUMP
Sat Nov 06 12:33:35 2010

EXCEPTION: C0000005 (Access violation) at 7571C120

NetXMS Version: 1.0.5
OS Version: Windows Server 2008 R2 Build 7600
Processor architecture: Intel x86

Register information:
 eax=0015DFC0  ebx=00057800  ecx=0002F9F3  edx=00000000
 esi=0015DFBC  edi=0015DFFC  ebp=054DECE0  esp=054DECD8
 cs=0023  ds=002B  es=002B  ss=002B  fs=0053  gs=002B  flags=00010616

Call stack:
 [msvcrt:7571C120]: (function-name not available)
 [libnetxms:0022A532]: (function-name not available)
 [nxcore:1000F10B]: (function-name not available)
 [nxcore:100114E4]: (function-name not available)
 [(module-name not available):054DEE74]: (function-name not available)
 [(module-name not available):4B8AF098]: (function-name not available)
 [libnxsrv:002564BA]: (function-name not available)

Netvoid

Seems to be crashing every 7-9 hours, same debug info in the dump file each time. Tomorrow I'll try to run those nxadm debug statements every hour or so in order to track the situation.

Thanks

Victor Kirhenshtein

Hi!

Please try to upgrade to 1.0.7, and select "Install PDB files" in components during the upgrade. If it continues to crash and you are running on a multiprocessor/multicore server, try to bind the netxmsd process to a single core by adding

ProcessAffinityMask = 1

to netxmsd.conf. This will bind the server process to CPU #0. If you wish to bind to CPU #1 or another CPU, use value 2 for CPU #1, 4 for CPU #2, 8 for CPU #3, and so on.
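
The values listed above follow a one-bit-per-CPU pattern, so the value for a given CPU number can be computed by shifting 1 left by that number. A minimal sketch of that arithmetic follows; affinity_mask_for_cpu is a hypothetical helper name used only for illustration, and it covers only the single-CPU values described above.

```python
def affinity_mask_for_cpu(cpu_index):
    """Return the mask value with only the bit for cpu_index set (CPU #0 -> 1)."""
    if cpu_index < 0:
        raise ValueError("CPU index must be non-negative")
    return 1 << cpu_index


if __name__ == "__main__":
    # Print the netxmsd.conf line for each of the first four CPUs.
    for cpu in range(4):
        print(f"CPU #{cpu}: ProcessAffinityMask = {affinity_mask_for_cpu(cpu)}")
```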

Best regards,
Victor

Netvoid

Thanks Victor,

I tried all of this, and I'm now running on 1.0.7 with ProcessAffinityMask = 1.

I received the following crash dump. The system ran for about 4 hours before the application faulted. Here are the contents of the dump file.

NETXMSD CRASH DUMP
Tue Nov 09 11:52:15 2010

EXCEPTION: C0000005 (Access violation) at 776AC120

NetXMS Version: 1.0.7
OS Version: Windows Server 2008 R2 Build 7600
Processor architecture: Intel x86

Register information:
  eax=00163F00  ebx=00059000  ecx=00006394  edx=00000000
  esi=00163EFC  edi=00163FFC  ebp=04F1ECE0  esp=04F1ECD8
  cs=0023  ds=002B  es=002B  ss=002B  fs=0053  gs=002B  flags=00010612

Call stack:
  [msvcrt:776AC120]: (function-name not available)
  [libnetxms:C:\Source\NetXMS-1.0.x\src\libnetxms\queue.cpp:95]: Queue::Put
  [nxcore:C:\Source\NetXMS-1.0.x\src\server\core\dbwrite.cpp:55]: QueueSQLRequest
  [nxcore:C:\Source\NetXMS-1.0.x\src\server\core\dcitem.cpp:1017]: DCItem::processNewValue
  [nxcore:C:\Source\NetXMS-1.0.x\src\server\core\node.cpp:3514]: Node::processNewDciValue
  [nxcore:C:\Source\NetXMS-1.0.x\src\server\core\datacoll.cpp:147]: DataCollector
  [(module-name not available):04F1FF48]: (function-name not available)
  [(module-name not available):03808570]: (function-name not available)

Victor Kirhenshtein

Could you please try putting the attached msvcrt.dll into the NetXMS bin directory and restarting the server?

Best regards,
Victor

Victor Kirhenshtein

Sorry, it looks like this will not help, because modern Windows versions ignore a local copy of msvcrt.dll and always use the one from the system directory, since it is on the list of "known DLLs". I suspect there could be an incompatibility with new versions of msvcrt.dll, because netxmsd.exe is built with VC6. I'll try to do a special build this evening.

Best regards,
Victor


Netvoid

Victor,

First, thank you very much for all this support. I would be happy to find a way to donate, or, is Raden Solutions a direct connection to you folks? I am completely open to changing my OS from Win2008 R2 64bit to anything you suggest for better outcomes. Although I am also happy to continue helping us both resolve this issue.

Just let me know.

Regards,

Victor Kirhenshtein

Thank you! I hope that we will be able to finally solve this issue without forcing you to switch to another OS.
Donations are welcome in any form :) In fact, even your patience while we resolve this strange issue is very helpful, because we are unable to reproduce it on our systems. And yes, Raden Solutions is owned by me and Alex.

Attached is another build of libnetxms. Could you try to run the server with it?

Best regards,
Victor

Netvoid

Victor,

I replaced these files in the bin folder but had no luck, because the service wouldn't start. I rebooted just in case and still had no luck; the service wouldn't start. I restored the files and the system was able to come up again.

Regards,

Clark


Victor Kirhenshtein

I created a full build with changed memory management. The installer is available at https://www.netxms.org/download/rc/netxms-1.0.8-rc1.exe. It can be installed as usual, and rolled back to 1.0.7 if needed.

Best regards,
Victor

Netvoid

Yeah, I tried this one and it also failed to start properly. The agent and core on the server give a similar error on startup attempts.




Netvoid

Attaching results from,

show pollers
show queues
show flags
show mutex

After server has been up for about an hour rather than right after startup. I expect server crash in 2-4 hours. I will try to run these commands every hour while the server is up to help pinpoint....

Also ran a show stats,

netxmsd: show stats
Total number of objects:     9309
Number of monitored nodes:   3649
Number of collectable DCIs:  8147