NetXMS Support Forum

English Support => General Support => Topic started by: Netvoid on November 05, 2010, 05:15:49 PM

Title: NetXMS Server Crashing
Post by: Netvoid on November 05, 2010, 05:15:49 PM
The server is Windows 2008 R2 64-bit, with the SQL Server 2008 database on a separate machine. We are running the 1.0.5 server with 1.0.6 clients.

The following error is happening:

(http://s3.postimage.org/Sy60S.jpg) (http://postimage.org/image/32qwucx1g/)

We have about 450 agents and maybe a total of 1500 nodes. There are only about 4-6 DCIs per node; the vast majority just monitor Windows event logs for errors and criticals.

I am noticing that a fair number of agents drop connectivity and then restore it, in blocks of about 10-20 every 10-20 minutes. The crashes only started after we went from about 300 installed agents to 450. I raised NumberOfStatusPollers to 50 at first and that didn't seem to help, so now I am putting it back down to 30. That seems related, but I'm not sure. Network utilization is steady at about 2-3 percent. The processor is steady at about 10-20 percent. Disk I/O is negligible.

Any suggestions based on this information would be helpful.
Title: Re: NetXMS Server Crashing
Post by: Victor Kirhenshtein on November 05, 2010, 11:32:59 PM
I'll try to check this. Also, could you please enable crash dump generation on the server and send the generated dumps to [email protected]? This can greatly simplify debugging. You can enable crash dump generation by setting two parameters in netxmsd.conf:

CreateCrashDumps = yes
DumpDirectory = some existing directory writable by server process

Also, could you please run the command "nxadm -i" on the server machine and post the output of the following commands:

show pollers
show queues
show flags
show mutex

Best regards,
Victor
Title: Re: NetXMS Server Crashing
Post by: Netvoid on November 06, 2010, 06:00:41 PM
I enabled the crash dumps; I'll post them the next time it happens.

These commands were run about 2-3 minutes after a reboot to get the system operational again. I'll run them again in a few hours to see if something sticks out later on.

show pollers:

(http://s4.postimage.org/RfMQr.jpg) (http://postimage.org/image/304k0q944/)

show queues:

netxmsd: show queues
Condition poller                 : 0
Configuration poller             : 3590
Data collector                   : 7096
Database writer                  : 0
Event processor                  : 0
Network discovery poller         : 0
Node poller                      : 1
Routing table poller             : 2307
Status poller                    : 1940

show flags:

netxmsd: show flags
Flags: 0x43300257
  AF_DAEMON                        = 1
  AF_USE_SYSLOG                    = 1
  AF_ENABLE_NETWORK_DISCOVERY      = 1
  AF_ACTIVE_NETWORK_DISCOVERY      = 0
  AF_LOG_SQL_ERRORS                = 1
  AF_DELETE_EMPTY_SUBNETS          = 0
  AF_ENABLE_SNMP_TRAPD             = 1
  AF_ENABLE_ZONING                 = 0
  AF_SYNC_NODE_NAMES_WITH_DNS      = 0
  AF_CHECK_TRUSTED_NODES           = 1
  AF_WRITE_FULL_DUMP               = 0
  AF_RESOLVE_NODE_NAMES            = 1
  AF_CATCH_EXCEPTIONS              = 1
  AF_INTERNAL_CA                   = 0
  AF_DB_LOCKED                     = 1
  AF_ENABLE_MULTIPLE_DB_CONN       = 1
  AF_DB_CONNECTION_LOST            = 0
  AF_NO_NETWORK_CONNECTIVITY       = 0
  AF_EVENT_STORM_DETECTED          = 0
  AF_SERVER_INITIALIZED            = 1
  AF_SHUTDOWN                      = 0

show mutex:

netxmsd: show mutex
Mutex status:
  g_hMutexIdIndex: unlocked
  g_hMutexNodeIndex: unlocked
  g_hMutexSubnetIndex: unlocked
  g_hMutexInterfaceIndex: unlocked

Thanks
Title: Re: NetXMS Server Crashing
Post by: Netvoid on November 07, 2010, 01:55:36 AM
Here is the dump capture; the mdmp file was empty.


NETXMSD CRASH DUMP
Sat Nov 06 12:33:35 2010

EXCEPTION: C0000005 (Access violation) at 7571C120

NetXMS Version: 1.0.5
OS Version: Windows Server 2008 R2 Build 7600
Processor architecture: Intel x86

Register information:
 eax=0015DFC0  ebx=00057800  ecx=0002F9F3  edx=00000000
 esi=0015DFBC  edi=0015DFFC  ebp=054DECE0  esp=054DECD8
 cs=0023  ds=002B  es=002B  ss=002B  fs=0053  gs=002B  flags=00010616

Call stack:
 [msvcrt:7571C120]: (function-name not available)
 [libnetxms:0022A532]: (function-name not available)
 [nxcore:1000F10B]: (function-name not available)
 [nxcore:100114E4]: (function-name not available)
 [(module-name not available):054DEE74]: (function-name not available)
 [(module-name not available):4B8AF098]: (function-name not available)
 [libnxsrv:002564BA]: (function-name not available)
Title: Re: NetXMS Server Crashing
Post by: Netvoid on November 08, 2010, 12:56:52 AM
It seems to be crashing every 7-9 hours, with the same debug info in the dump file each time. Tomorrow I'll try to run those nxadm debug commands every hour or so in order to track the situation.

Thanks
Title: Re: NetXMS Server Crashing
Post by: Victor Kirhenshtein on November 09, 2010, 02:24:28 PM
Hi!

Please try to upgrade to 1.0.7, and select "Install PDB files" in the component list during the upgrade. If it continues to crash and you are running on a multiprocessor/multicore server, try to bind the netxmsd process to a single core by adding

ProcessAffinityMask = 1

to netxmsd.conf. This will bind the server process to CPU #0. If you wish to bind it to CPU #1 or another CPU, use value 2 for CPU #1, 4 for CPU #2, 8 for CPU #3, and so on.
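
The values listed above follow a one-bit-per-CPU pattern. A minimal Python sketch of that arithmetic, assuming zero-based CPU numbering as in the post; the helper name affinity_mask is hypothetical and not part of NetXMS, and combining several CPUs in one mask is standard affinity-mask behaviour rather than something this thread confirms for netxmsd:

```python
def affinity_mask(*cpus: int) -> int:
    """Return a bitmask with bit N set for each zero-based CPU number N."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu  # CPU #0 -> 1, CPU #1 -> 2, CPU #2 -> 4, ...
    return mask


# Print the single-CPU values mentioned in the post.
for cpu in range(4):
    print(f"CPU #{cpu}: ProcessAffinityMask = {affinity_mask(cpu)}")
```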

Best regards,
Victor
Title: Re: NetXMS Server Crashing
Post by: Netvoid on November 09, 2010, 11:00:46 PM
Thanks Victor,

I tried all of this; I'm now running 1.0.7 with ProcessAffinityMask = 1.

I received the following crash dump. The system ran for about 4 hours before the application faulted. Here are the contents of the dump file.

NETXMSD CRASH DUMP
Tue Nov 09 11:52:15 2010

EXCEPTION: C0000005 (Access violation) at 776AC120

NetXMS Version: 1.0.7
OS Version: Windows Server 2008 R2 Build 7600
Processor architecture: Intel x86

Register information:
  eax=00163F00  ebx=00059000  ecx=00006394  edx=00000000
  esi=00163EFC  edi=00163FFC  ebp=04F1ECE0  esp=04F1ECD8
  cs=0023  ds=002B  es=002B  ss=002B  fs=0053  gs=002B  flags=00010612

Call stack:
  [msvcrt:776AC120]: (function-name not available)
  [libnetxms:C:\Source\NetXMS-1.0.x\src\libnetxms\queue.cpp:95]: Queue::Put
  [nxcore:C:\Source\NetXMS-1.0.x\src\server\core\dbwrite.cpp:55]: QueueSQLRequest
  [nxcore:C:\Source\NetXMS-1.0.x\src\server\core\dcitem.cpp:1017]: DCItem::processNewValue
  [nxcore:C:\Source\NetXMS-1.0.x\src\server\core\node.cpp:3514]: Node::processNewDciValue
  [nxcore:C:\Source\NetXMS-1.0.x\src\server\core\datacoll.cpp:147]: DataCollector
  [(module-name not available):04F1FF48]: (function-name not available)
  [(module-name not available):03808570]: (function-name not available)
Title: Re: NetXMS Server Crashing
Post by: Victor Kirhenshtein on November 10, 2010, 10:15:43 AM
Could you please try putting the attached msvcrt.dll into the NetXMS bin directory and restarting the server?

Best regards,
Victor
Title: Re: NetXMS Server Crashing
Post by: Victor Kirhenshtein on November 10, 2010, 10:36:35 AM
Sorry, it looks like this will not help, because modern Windows versions ignore a local copy of msvcrt.dll and always use the one from the system directory, since it is on the list of "known DLLs". I suspect that there could be an incompatibility with new versions of msvcrt.dll, because netxmsd.exe is built with VC6. I'll try to do a special build this evening.

Best regards,
Victor

Title: Re: NetXMS Server Crashing
Post by: Netvoid on November 10, 2010, 09:17:17 PM
Victor,

First, thank you very much for all this support. I would be happy to find a way to donate, or, is Raden Solutions a direct connection to you folks? I am completely open to changing my OS from Win2008 R2 64bit to anything you suggest for better outcomes. Although I am also happy to continue helping us both resolve this issue.

Just let me know.

Regards,
Title: Re: NetXMS Server Crashing
Post by: Victor Kirhenshtein on November 10, 2010, 10:27:28 PM
Thank you! I hope that we will be able to finally solve this issue without forcing you to switch to another OS.
Donations are welcome in any form :) In fact, even your patience while we resolve this strange issue is very helpful, because we are unable to reproduce it on our systems. And yes, Raden Solutions is owned by me and Alex.

Attached is another build of libnetxms. Could you try to run the server with it?

Best regards,
Victor
Title: Re: NetXMS Server Crashing
Post by: Netvoid on November 10, 2010, 10:53:54 PM
Victor,

I replaced these files in the bin folder but had no luck: the service wouldn't start. I rebooted just in case, and the service still wouldn't start. I restored the original files and the system was able to come up again.

Regards,

Clark

Title: Re: NetXMS Server Crashing
Post by: Victor Kirhenshtein on November 11, 2010, 12:44:46 AM
I created a full build with changed memory management. The installer is available at https://www.netxms.org/download/rc/netxms-1.0.8-rc1.exe. It can be installed as usual, and rolled back to 1.0.7 if needed.

Best regards,
Victor
Title: Re: NetXMS Server Crashing
Post by: Netvoid on November 11, 2010, 01:19:27 AM
Yeah, I tried this one and it also failed to start properly. The agent and core on the server give a similar error on startup attempts.

(http://img585.imageshack.us/img585/2056/errf.jpg) (http://img585.imageshack.us/i/errf.jpg/)

Title: Re: NetXMS Server Crashing
Post by: Netvoid on November 11, 2010, 07:48:58 PM
Attaching results from:

show pollers
show queues
show flags
show mutex

These were taken after the server had been up for about an hour, rather than right after startup. I expect a server crash in 2-4 hours. I will try to run these commands every hour while the server is up to help pinpoint the issue.

I also ran show stats:

netxmsd: show stats
Total number of objects:     9309
Number of monitored nodes:   3649
Number of collectable DCIs:  8147
Title: Re: NetXMS Server Crashing
Post by: Victor Kirhenshtein on November 11, 2010, 08:11:29 PM
Hi!

The numbers are quite high, especially the database writer queue. I would suggest changing some server parameters, taking into account the number of nodes you have:

NumberOfStatusPollers = 100
NumberOfConfigurationPollers = 25
NumberOfRoutingTablePollers = 25

NumberOfDatabaseWriters = 5


I would also suggest increasing the status and configuration polling intervals, which are one minute and one hour respectively. My recommendation is to set the configuration polling interval to at least 4 hours and the status polling interval to 90 seconds. This can be done by setting the following parameters:

StatusPollingInterval = 90
ConfigurationPollingInterval = 14400


Best regards,
Victor


Title: Re: NetXMS Server Crashing
Post by: Victor Kirhenshtein on November 11, 2010, 08:17:33 PM
A few things I forgot:

- Remove the ProcessAffinityMask option from netxmsd.conf. It does not help with the crash issue, and it prevents the server from using all CPUs in the system.

- Some additional questions:

1. How much memory does the netxmsd process consume? Is it growing?
2. What is the state of the SQL server? Such a high value for the database writer queue means that either the NetXMS server cannot send requests as fast as needed, or the SQL server cannot process them as fast as needed. Increasing the value of NumberOfDatabaseWriters will speed up request sending, but will the SQL server be able to handle them?
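
One way to see which queue is backing up is to compare two "show queues" snapshots. A minimal Python sketch, assuming the output keeps the "name : number" layout seen in this thread; the function names parse_queues and queue_growth are hypothetical, and the two snapshots are copied from other posts in the thread:

```python
import re

# Snapshot taken a few minutes after a restart (from an earlier post).
SNAPSHOT_1 = """\
netxmsd: show queues
Condition poller                 : 0
Configuration poller             : 3590
Data collector                   : 7096
Database writer                  : 0
Event processor                  : 0
Network discovery poller         : 0
Node poller                      : 1
Routing table poller             : 2307
Status poller                    : 1940
"""

# Snapshot taken after about an hour of uptime (from another post).
SNAPSHOT_2 = """\
netxmsd: show queues
Condition poller                 : 0
Configuration poller             : 2978
Data collector                   : 0
Database writer                  : 181423
Event processor                  : 0
Network discovery poller         : 0
Node poller                      : 0
Routing table poller             : 0
Status poller                    : 2316
"""


def parse_queues(text: str) -> dict:
    """Turn 'Queue name : N' lines into {name: N}; other lines are ignored."""
    queues = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(.+?)\s*:\s*(\d+)\s*$", line)
        if match:
            queues[match.group(1)] = int(match.group(2))
    return queues


def queue_growth(before: dict, after: dict) -> dict:
    """Return the change in size of every queue present in the later snapshot."""
    return {name: size - before.get(name, 0) for name, size in after.items()}


growth = queue_growth(parse_queues(SNAPSHOT_1), parse_queues(SNAPSHOT_2))
# List the queues that grew the most first.
for name, delta in sorted(growth.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<28} {delta:+d}")
```

A queue that keeps growing between snapshots is being filled faster than it is drained; a single pair of snapshots cannot tell which side is at fault.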

Best regards,
Victor
Title: Re: NetXMS Server Crashing
Post by: Netvoid on November 11, 2010, 08:58:29 PM
This hour's stats:


netxmsd: show queues
Condition poller                 : 0
Configuration poller             : 2978
Data collector                   : 0
Database writer                  : 181423
Event processor                  : 0
Network discovery poller         : 0
Node poller                      : 0
Routing table poller             : 0
Status poller                    : 2316

netxmsd: show mutex
Mutex status:
  g_hMutexIdIndex: unlocked
  g_hMutexNodeIndex: unlocked
  g_hMutexSubnetIndex: unlocked
  g_hMutexInterfaceIndex: unlocked

netxmsd: show flags
Flags: 0x43300257
  AF_DAEMON                        = 1
  AF_USE_SYSLOG                    = 1
  AF_ENABLE_NETWORK_DISCOVERY      = 1
  AF_ACTIVE_NETWORK_DISCOVERY      = 0
  AF_LOG_SQL_ERRORS                = 1
  AF_DELETE_EMPTY_SUBNETS          = 0
  AF_ENABLE_SNMP_TRAPD             = 1
  AF_ENABLE_ZONING                 = 0
  AF_SYNC_NODE_NAMES_WITH_DNS      = 0
  AF_CHECK_TRUSTED_NODES           = 1
  AF_WRITE_FULL_DUMP               = 0
  AF_RESOLVE_NODE_NAMES            = 1
  AF_CATCH_EXCEPTIONS              = 1
  AF_INTERNAL_CA                   = 0
  AF_DB_LOCKED                     = 1
  AF_ENABLE_MULTIPLE_DB_CONN       = 1
  AF_DB_CONNECTION_LOST            = 0
  AF_NO_NETWORK_CONNECTIVITY       = 0
  AF_EVENT_STORM_DETECTED          = 0
  AF_SERVER_INITIALIZED            = 1
  AF_SHUTDOWN                      = 0

netxmsd: show stats
Total number of objects:     9309
Number of monitored nodes:   3649
Number of collectable DCIs:  8147
Title: Re: NetXMS Server Crashing
Post by: Netvoid on November 11, 2010, 09:16:58 PM
Memory usage of the netxmsd process before making all these changes and restarting was holding at about 130 MB.

The database should be healthy. I'll watch the performance logs on that after these changes, but the SQL server has low usage and is on a higher-performance segment of our SAN.

These suggestions,

StatusPollingInterval = 90
ConfigurationPollingInterval = 14400

I had already moved the status interval to 90, so I bumped it to 120. I set the configuration interval to 14400 as you suggested; it was 3600.

I just restarted with the revised settings and removed the processor affinity. I'll check and post the stats in an hour.



Title: Re: NetXMS Server Crashing
Post by: Netvoid on November 11, 2010, 10:43:08 PM
I am noticing a tremendous number of these disconnects and reconnects in the event log since I revised the settings. I attached a screenshot. I put the status polling interval back down to 90 from 120 in case that is the cause.

Memory is sitting at 110 MB after being up for an hour or so.

The queues look like this now:

netxmsd: show queues
Condition poller                 : 0
Configuration poller             : 523
Data collector                   : 5343
Database writer                  : 0
Event processor                  : 0
Network discovery poller         : 0
Node poller                      : 0
Routing table poller             : 0
Status poller                    : 190

Title: Re: NetXMS Server Crashing
Post by: Victor Kirhenshtein on November 11, 2010, 11:11:46 PM
The queues look much better now. The only possibly problematic number is the data collector's queue. Is it always like this, or was it just a short spike? By the way, by default the server creates DCIs for internal queue sizes. If those DCIs are configured, could you post charts of them for the last hour or two?
Title: Re: NetXMS Server Crashing
Post by: Netvoid on November 12, 2010, 12:14:25 AM
The client connectivity loss/restore entries are still quite high. Dozens every few minutes, as shown by the attached event log.

Title: Re: NetXMS Server Crashing
Post by: Netvoid on November 12, 2010, 01:53:11 AM
The server is responding very slowly now. For example, the event log takes a minute to load, with a progress bar I had not seen before. The log is riddled with agent connection lost/restored messages: 50-100 clients are losing and restoring connections per minute.


The queues are currently:

netxmsd: show queues
Condition poller                 : 0
Configuration poller             : 0
Data collector                   : 0
Database writer                  : 0
Event processor                  : 0
Network discovery poller         : 0
Node poller                      : 0
Routing table poller             : 0
Status poller                    : 132


netxmsd: show mutex
Mutex status:
  g_hMutexIdIndex: unlocked
  g_hMutexNodeIndex: unlocked
  g_hMutexSubnetIndex: unlocked
  g_hMutexInterfaceIndex: unlocked

netxmsd: show flags
Flags: 0x43300257
  AF_DAEMON                        = 1
  AF_USE_SYSLOG                    = 1
  AF_ENABLE_NETWORK_DISCOVERY      = 1
  AF_ACTIVE_NETWORK_DISCOVERY      = 0
  AF_LOG_SQL_ERRORS                = 1
  AF_DELETE_EMPTY_SUBNETS          = 0
  AF_ENABLE_SNMP_TRAPD             = 1
  AF_ENABLE_ZONING                 = 0
  AF_SYNC_NODE_NAMES_WITH_DNS      = 0
  AF_CHECK_TRUSTED_NODES           = 1
  AF_WRITE_FULL_DUMP               = 0
  AF_RESOLVE_NODE_NAMES            = 1
  AF_CATCH_EXCEPTIONS              = 1
  AF_INTERNAL_CA                   = 0
  AF_DB_LOCKED                     = 1
  AF_ENABLE_MULTIPLE_DB_CONN       = 1
  AF_DB_CONNECTION_LOST            = 0
  AF_NO_NETWORK_CONNECTIVITY       = 0
  AF_EVENT_STORM_DETECTED          = 0
  AF_SERVER_INITIALIZED            = 1
  AF_SHUTDOWN                      = 0

The memory utilization is holding at about 121 MB.
Title: Re: NetXMS Server Crashing
Post by: Netvoid on November 12, 2010, 02:01:26 AM
After a couple of minutes the event viewer started responding more quickly again.

The queues went to all zero.

And just as a heads up this is what my SQL server activity is looking like over the last while...
Title: Re: NetXMS Server Crashing
Post by: Sumit Pandya on November 12, 2010, 10:24:05 AM
Very random post... Please ensure that your network connectivity is stable. I have experienced random NetXMS problems when network availability fluctuates.
Just try installing the "Microsoft Loopback Adapter" from the Add Hardware Wizard and assign it some private IP (other than your network's). I'm sure there is no harm in installing a loopback adapter.
Title: Re: NetXMS Server Crashing
Post by: Netvoid on November 12, 2010, 05:20:33 PM
The server has not crashed in almost 24 hours. The only remaining issue seems to be the agents dropping and restoring connectivity.
Title: Re: NetXMS Server Crashing
Post by: Victor Kirhenshtein on November 12, 2010, 05:45:53 PM
I suspect that it may be hitting some limit on open connections. There was a limit of around 4000 outgoing TCP connections in Windows 2003; I'm not sure how it is in Windows 2008. Could you please run netstat -n and count the connections to external IPs on port 4700? And how many of them are not in the ESTABLISHED state?
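
The count requested above can be scripted. A minimal Python sketch, assuming the four-column Windows "netstat -n" layout (Proto, Local Address, Foreign Address, State); the function name count_states is hypothetical and the sample addresses are invented for illustration:

```python
from collections import Counter

# Invented sample in the Windows `netstat -n` layout; not real data.
SAMPLE = """\
Active Connections

  Proto  Local Address          Foreign Address        State
  TCP    10.0.0.5:49201         10.0.1.11:4700         ESTABLISHED
  TCP    10.0.0.5:49202         10.0.1.12:4700         TIME_WAIT
  TCP    10.0.0.5:49203         10.0.1.13:4700         SYN_SENT
  TCP    10.0.0.5:49204         10.0.1.14:4700         ESTABLISHED
  TCP    10.0.0.5:49205         10.0.2.20:1433         ESTABLISHED
  TCP    127.0.0.1:49206        127.0.0.1:4700         ESTABLISHED
"""


def count_states(netstat_output: str, port: int = 4700) -> Counter:
    """Count TCP connections to non-loopback hosts on `port`, grouped by state."""
    counts = Counter()
    suffix = f":{port}"
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] != "TCP":
            continue  # skip headers, blank lines and UDP rows
        foreign = fields[2]
        if foreign.startswith("127."):
            continue  # loopback is not an external agent
        if foreign.endswith(suffix):
            counts[fields[3]] += 1
    return counts


states = count_states(SAMPLE)
total = sum(states.values())
not_established = total - states["ESTABLISHED"]
print(f"total={total} not_established={not_established} by_state={dict(states)}")
```

On the server itself, the text passed to count_states could come from subprocess.run(["netstat", "-n"], capture_output=True, text=True).stdout.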

Best regards,
Victor
Title: Re: NetXMS Server Crashing
Post by: Netvoid on November 12, 2010, 07:21:33 PM
292 total for port 4700; about 90 are in TIME_WAIT or SYN_SENT.

After a fresh reboot and a few minutes time, I see about 950 with 600 in TIME_WAIT.
Title: Re: NetXMS Server Crashing
Post by: Sumit Pandya on November 15, 2010, 03:49:26 PM
If there is some socket-related limitation, it can be overcome by following the guidelines below.
Start Registry Editor. Browse to, and then click the following key in the registry:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
On the Edit menu, click New, DWORD Value, and then add the following registry values
1. Reduce the client TCP/IP socket connection timeout value from the default value of 240 seconds
Value name: TcpTimedWaitDelay
Value data: <Enter a decimal value between 30 and 240 here>

2. Increase the number of ephemeral ports that can be dynamically allocated to clients:
Value name: MaxUserPort
Value data: <Enter a decimal value between 5000 and 65534 here>

Victor, Please consider using "SO_REUSEADDR" socket setting, by calling "setsockopt()"

Reference for all registry values: http://support.microsoft.com/default.aspx?scid=kb;en-us;314053
http://technet.microsoft.com/en-us/library/bb726981.aspx#EDAA
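
The two values above could also be prepared as a .reg file instead of being typed into Registry Editor. A minimal Python sketch that only generates the file text and does not touch the registry; the function name tcpip_reg_file and the example values 30 and 65534 are illustrative assumptions, and whether these settings have any effect depends on the Windows version:

```python
# Registry key named in the post above.
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def tcpip_reg_file(timed_wait_delay: int, max_user_port: int) -> str:
    """Build .reg file text for the two DWORD values, checking the quoted ranges."""
    if not 30 <= timed_wait_delay <= 240:
        raise ValueError("TcpTimedWaitDelay must be between 30 and 240")
    if not 5000 <= max_user_port <= 65534:
        raise ValueError("MaxUserPort must be between 5000 and 65534")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY}]",
        # .reg files store DWORD values as eight hexadecimal digits.
        f'"TcpTimedWaitDelay"=dword:{timed_wait_delay:08x}',
        f'"MaxUserPort"=dword:{max_user_port:08x}',
        "",
    ]
    return "\n".join(lines)


print(tcpip_reg_file(30, 65534))
```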
Title: Re: NetXMS Server Crashing
Post by: Netvoid on November 15, 2010, 05:12:47 PM
Still getting hundreds of agent timeouts per minute with negligible amounts of visible system resource utilization.

Also, timeouts are very common even with the console and the command-line nxadm -i, even when running the console locally.

For example, the "show queues" command at times never responds on the server. If I close it and run it a few more times it will eventually give results. According to the results nothing is queued.

The netxmsd process is at about 300 MB and holding; the server has been running for at least 4 days now without a crash. It takes about 5-10 timeouts and retries with the console app to get it connected.
Title: Re: NetXMS Server Crashing
Post by: Victor Kirhenshtein on November 15, 2010, 07:46:24 PM
A few additional questions:

Do you have a Windows firewall running on server machine?
Are there any error messages in the netxmsd log?

Best regards,
Victor
Title: Re: NetXMS Server Crashing
Post by: Netvoid on November 15, 2010, 10:42:45 PM
Yes, Windows Firewall is running: disabled for the domain profile, enabled for private and public. It is not currently logging.
I don't see any netxmsd log or error.
Title: Re: NetXMS Server Crashing
Post by: Victor Kirhenshtein on November 16, 2010, 12:32:29 AM
Could you please try installing the updated build: https://www.netxms.org/download/rc/netxms-1.0.8-rc2.exe. If it does not solve the problem, please set logging to a file (by setting LogFile = some_file in netxmsd.conf) and run netxmsd.exe with the -D 6 command line option (this will enable debug output). It will generate a lot of messages in the log file - I'm interested in messages containing the words "agent unreachable, error=".

Best regards,
Victor
Title: Re: NetXMS Server Crashing
Post by: Netvoid on November 16, 2010, 05:54:45 PM
Okay, I performed the update and I pretty much don't see a difference.

Here are a variety of entries for the error you were asking about.

This is the most common one; about 98% of the entries are this:

agent unreachable, error=910, socketError=0

Then we have some of these,

agent unreachable, error=500, socketError=0

A few of these,

agent unreachable, error=500, socketError=10053
agent unreachable, error=500, socketError=10054
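
Entries like the ones above can be tallied from the debug log with a short script. A minimal Python sketch, assuming each message keeps the "agent unreachable, error=N, socketError=M" form shown here; the function name tally_unreachable is hypothetical and the sample lines are illustrative, not taken from the real log:

```python
import re
from collections import Counter

PATTERN = re.compile(r"agent unreachable, error=(\d+), socketError=(\d+)")

# Illustrative lines only; real log lines will carry extra prefix text.
SAMPLE_LINES = [
    "agent unreachable, error=910, socketError=0",
    "agent unreachable, error=910, socketError=0",
    "agent unreachable, error=500, socketError=0",
    "some unrelated debug message",
    "agent unreachable, error=910, socketError=0",
    "agent unreachable, error=500, socketError=10053",
    "agent unreachable, error=500, socketError=10054",
]


def tally_unreachable(lines) -> Counter:
    """Count (error, socketError) pairs across all matching log lines."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)  # search() tolerates any line prefix
        if match:
            counts[(int(match.group(1)), int(match.group(2)))] += 1
    return counts


# Winsock codes: 10053 is WSAECONNABORTED, 10054 is WSAECONNRESET.
for (error, socket_error), n in tally_unreachable(SAMPLE_LINES).most_common():
    print(f"error={error} socketError={socket_error}: {n}")
```

To run it against a real log, pass an open file object instead of SAMPLE_LINES, since the function accepts any iterable of lines.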

Title: Re: NetXMS Server Crashing
Post by: Victor Kirhenshtein on November 16, 2010, 10:17:07 PM
Hi!

I'm still unable to reproduce this issue or find any clue to the source of the problem. Please try replacing libnetxms.dll with the attached one - it has further changes in the communication code. At the least, it may give more meaningful error codes in the "agent unreachable" messages in the log.

Also, have you experienced this issue from the beginning, or did it appear after the latest upgrades?

Best regards,
Victor
Title: Re: NetXMS Server Crashing
Post by: Netvoid on November 16, 2010, 10:49:52 PM
It all seemed to start after we went from about 150-200 agents to about 400+ agents.

I'm going to be moving the system to a 32-bit server today or tomorrow to see if that eliminates the connection issue. I don't think it would be related, but it's worth a try.

Title: Re: NetXMS Server Crashing
Post by: Victor Kirhenshtein on November 16, 2010, 11:01:32 PM
It could also be a Windows 2008 issue. We have an installation in Riga with 700+ agents, and I have never seen issues like that. They are running on Windows Server 2003 x86.

Best regards,
Victor
Title: Re: NetXMS Server Crashing
Post by: Sumit Pandya on November 17, 2010, 06:46:59 AM
Do you guys have any reluctance about the suggestions I made:
1. Microsoft Loopback adapter
2. Registry values for MaxUserPort and TcpTimedWaitDelay
3. Using SO_REUSEADDR by setsockopt()
Please give them a try. I understand that some setups have a much higher load than the current one, but you need to understand that every setup/network is different!
Title: Re: NetXMS Server Crashing
Post by: Victor Kirhenshtein on November 17, 2010, 10:05:47 AM
Hi!

The last special build uses the SO_REUSEADDR option - it makes no difference. As for MaxUserPort, things are different on Windows 2008 - http://support.microsoft.com/kb/929851. So by default we should have about 16000 ports for outgoing connections, which is much more than required.

Best regards,
Victor