NetXMS Server Crashing

Started by Netvoid, November 05, 2010, 05:15:49 PM


Victor Kirhenshtein

Hi!

The numbers are quite high, especially the database writer queue. I would suggest changing some server parameters, taking into account the number of nodes you have:

NumberOfStatusPollers = 100
NumberOfConfigurationPollers = 25
NumberOfRoutingTablePollers = 25

NumberOfDatabaseWriters = 5


I would also suggest increasing the status and configuration polling intervals, which are one minute and one hour respectively. My recommendation is to set the configuration polling interval to at least 4 hours, and the status polling interval to 90 seconds. This can be done by setting the following parameters:

StatusPollingInterval = 90
ConfigurationPollingInterval = 14400


Best regards,
Victor



Victor Kirhenshtein

A few things I forgot:

- Remove the ProcessAffinityMask option from netxmsd.conf. It does not help with the crash issue, and it prevents the server from using all CPUs in the system.

- Some additional questions:

1. How much memory does the netxmsd process consume? Is it growing?
2. What is the state of the SQL server? Such a high value for the database writer queue means that either the NetXMS server cannot send requests as fast as needed or the SQL server cannot process them as fast as needed. Increasing NumberOfDatabaseWriters will speed up request sending, but will the SQL server be able to handle them?

Best regards,
Victor

Netvoid

This hour's stats:


netxmsd: show queues
Condition poller                 : 0
Configuration poller             : 2978
Data collector                   : 0
Database writer                  : 181423
Event processor                  : 0
Network discovery poller         : 0
Node poller                      : 0
Routing table poller             : 0
Status poller                    : 2316

netxmsd: show mutex
Mutex status:
  g_hMutexIdIndex: unlocked
  g_hMutexNodeIndex: unlocked
  g_hMutexSubnetIndex: unlocked
  g_hMutexInterfaceIndex: unlocked

netxmsd: show flags
Flags: 0x43300257
  AF_DAEMON                        = 1
  AF_USE_SYSLOG                    = 1
  AF_ENABLE_NETWORK_DISCOVERY      = 1
  AF_ACTIVE_NETWORK_DISCOVERY      = 0
  AF_LOG_SQL_ERRORS                = 1
  AF_DELETE_EMPTY_SUBNETS          = 0
  AF_ENABLE_SNMP_TRAPD             = 1
  AF_ENABLE_ZONING                 = 0
  AF_SYNC_NODE_NAMES_WITH_DNS      = 0
  AF_CHECK_TRUSTED_NODES           = 1
  AF_WRITE_FULL_DUMP               = 0
  AF_RESOLVE_NODE_NAMES            = 1
  AF_CATCH_EXCEPTIONS              = 1
  AF_INTERNAL_CA                   = 0
  AF_DB_LOCKED                     = 1
  AF_ENABLE_MULTIPLE_DB_CONN       = 1
  AF_DB_CONNECTION_LOST            = 0
  AF_NO_NETWORK_CONNECTIVITY       = 0
  AF_EVENT_STORM_DETECTED          = 0
  AF_SERVER_INITIALIZED            = 1
  AF_SHUTDOWN                      = 0

netxmsd: show stats
Total number of objects:     9309
Number of monitored nodes:   3649
Number of collectable DCIs:  8147
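
As a rough back-of-the-envelope check on the intervals and poller counts suggested earlier in the thread, the sketch below estimates how many polls per second the server has to start and how much time each poller has per node before the queue starts to grow. It assumes every monitored node is polled once per interval and that each poller thread handles one node at a time; both are simplifying assumptions, not a description of how netxmsd actually schedules polls. The function names are made up for illustration.

```python
# Rough load estimate. Assumes one poll per node per interval and
# one node per poller thread at a time (simplifying assumptions).

MONITORED_NODES = 3649  # "Number of monitored nodes" from show stats


def polls_per_second(nodes: int, interval_seconds: int) -> float:
    """Average number of polls that must be started each second."""
    return nodes / interval_seconds


def seconds_per_poll_budget(nodes: int, interval_seconds: int, pollers: int) -> float:
    """Average time one poller may spend on a node without falling behind."""
    return pollers * interval_seconds / nodes


if __name__ == "__main__":
    for interval in (60, 90, 120):
        rate = polls_per_second(MONITORED_NODES, interval)
        print(f"status polling every {interval} s: {rate:.1f} polls/s")

    # Suggested settings: 100 status pollers at 90 s, 25 configuration pollers at 14400 s
    print(f"status poll budget: {seconds_per_poll_budget(MONITORED_NODES, 90, 100):.1f} s")
    print(f"configuration poll budget: {seconds_per_poll_budget(MONITORED_NODES, 14400, 25):.1f} s")
```

Under these assumptions, moving the status interval from 60 to 90 seconds drops the required rate from roughly 61 to roughly 41 polls per second, and 100 status pollers would each have about 2.5 seconds per node.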

Netvoid

Memory usage of the netxmsd process was holding at about 130 MB before I made all these changes and restarted.

The database should be healthy. I'll watch the performance logs on that after these changes, but the SQL server has low usage and sits on a higher-performance segment of our SAN.

These suggestions,

StatusPollingInterval = 90
ConfigurationPollingInterval = 14400

I had already moved the status interval to 90, so I bumped it to 120. I set the configuration interval to 14400 as you suggested; it was 3600.

Just restarted with the revised setting and removed the processor affinity. I'll check and post the stats in an hour.




Netvoid

I am noticing a tremendous number of these disconnects and reconnects in the event log since I revised the settings. I attached a screenshot. I put the status polling interval back down to 90 from 120 in case that is the cause.

Memory is sitting at 110 MB after being up for an hour or so.

Queues look like this now:

netxmsd: show queues
Condition poller                 : 0
Configuration poller             : 523
Data collector                   : 5343
Database writer                  : 0
Event processor                  : 0
Network discovery poller         : 0
Node poller                      : 0
Routing table poller             : 0
Status poller                    : 190
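
For comparing snapshots like the ones posted so far, a small parser for pasted "show queues" output can help. This is a minimal sketch: it only relies on the "name : number" layout visible in this thread, and the threshold of 1000 is an arbitrary illustrative value, not a NetXMS recommendation. The function names are hypothetical.

```python
# Parse pasted "show queues" console output and flag large queues.
# The threshold is an arbitrary illustrative value.

def parse_queues(text: str) -> dict[str, int]:
    """Return {queue name: size} for every 'name : number' line."""
    queues = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        # Skip lines without a colon or without a numeric value,
        # such as the "netxmsd: show queues" prompt line.
        if sep and value.isdigit():
            queues[name.strip()] = int(value)
    return queues


def backlogged(queues: dict[str, int], threshold: int = 1000) -> list[str]:
    """Names of queues whose size exceeds the threshold, sorted by name."""
    return sorted(name for name, size in queues.items() if size > threshold)


BEFORE = """netxmsd: show queues
Configuration poller             : 2978
Data collector                   : 0
Database writer                  : 181423
Status poller                    : 2316
"""

AFTER = """netxmsd: show queues
Configuration poller             : 523
Data collector                   : 5343
Database writer                  : 0
Status poller                    : 190
"""

if __name__ == "__main__":
    print("before:", backlogged(parse_queues(BEFORE)))
    print("after: ", backlogged(parse_queues(AFTER)))
```

With the two snapshots from this thread, the first flags the configuration poller, database writer, and status poller queues, while the second flags only the data collector queue.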


Victor Kirhenshtein

The queues look much better now. The only possibly problematic number is the data collector queue. Is it always like this, or was it just a short spike? By the way, the server creates DCIs for internal queue sizes by default. If those DCIs are configured, could you post charts of them for the last hour or two?

Netvoid

The client connectivity loss/restore entries are still quite frequent: dozens every few minutes, as shown by the attached event log.


Netvoid

The server is responding very slowly now. For example, the event log takes a minute to load, with a progress bar I had not seen before. The log is riddled with agent connection lost/restored messages. 50-100 clients are losing and restoring connections per minute.


The queues are currently:

netxmsd: show queues
Condition poller                 : 0
Configuration poller             : 0
Data collector                   : 0
Database writer                  : 0
Event processor                  : 0
Network discovery poller         : 0
Node poller                      : 0
Routing table poller             : 0
Status poller                    : 132


netxmsd: show mutex
Mutex status:
  g_hMutexIdIndex: unlocked
  g_hMutexNodeIndex: unlocked
  g_hMutexSubnetIndex: unlocked
  g_hMutexInterfaceIndex: unlocked

netxmsd: show flags
Flags: 0x43300257
  AF_DAEMON                        = 1
  AF_USE_SYSLOG                    = 1
  AF_ENABLE_NETWORK_DISCOVERY      = 1
  AF_ACTIVE_NETWORK_DISCOVERY      = 0
  AF_LOG_SQL_ERRORS                = 1
  AF_DELETE_EMPTY_SUBNETS          = 0
  AF_ENABLE_SNMP_TRAPD             = 1
  AF_ENABLE_ZONING                 = 0
  AF_SYNC_NODE_NAMES_WITH_DNS      = 0
  AF_CHECK_TRUSTED_NODES           = 1
  AF_WRITE_FULL_DUMP               = 0
  AF_RESOLVE_NODE_NAMES            = 1
  AF_CATCH_EXCEPTIONS              = 1
  AF_INTERNAL_CA                   = 0
  AF_DB_LOCKED                     = 1
  AF_ENABLE_MULTIPLE_DB_CONN       = 1
  AF_DB_CONNECTION_LOST            = 0
  AF_NO_NETWORK_CONNECTIVITY       = 0
  AF_EVENT_STORM_DETECTED          = 0
  AF_SERVER_INITIALIZED            = 1
  AF_SHUTDOWN                      = 0

Memory utilization is holding at about 121 MB.

Netvoid

After a couple minutes the event viewer started responding more quickly again..

The queues went to all zero.

And just as a heads up this is what my SQL server activity is looking like over the last while...

Sumit Pandya

A somewhat random suggestion: please make sure that your network connectivity is stable. I have experienced random NetXMS problems when network availability fluctuates.
Try installing the "Microsoft Loopback Adapter" from the Add Hardware Wizard and assign it a private IP (other than your network). I'm sure there is no harm in installing a loopback adapter.

Netvoid

The server has not crashed in almost 24 hours. The only remaining issue seems to be the agents losing and restoring connectivity.

Victor Kirhenshtein

I suspect that it may hit some limits for open connections. There was a limit of around 4000 outgoing TCP connections in Windows 2003, I'm not sure how it is in Windows 2008. Could you please run netstat -n and count connections to external IPs on port 4700? And how many of them are not in ESTABLISHED state?
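
Counting by hand gets tedious with hundreds of connections, so here is a minimal sketch that tallies connection states for a given remote port from saved `netstat -n` output. It assumes the four-column Windows layout (protocol, local address, foreign address, state); other platforms print extra columns and would need a different parser. The addresses in the sample are made-up documentation addresses, and the function name is hypothetical.

```python
# Tally TCP connection states for one remote port from "netstat -n" text.
# Assumes the Windows four-column layout: Proto, Local, Foreign, State.
from collections import Counter


def count_connections_by_state(netstat_output: str, port: int = 4700) -> Counter:
    """Count TCP rows whose foreign (remote) port equals `port`, grouped by state."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Header lines and UDP rows do not match this shape and are skipped.
        if len(fields) != 4 or fields[0].upper() != "TCP":
            continue
        foreign_port = fields[2].rsplit(":", 1)[-1]
        if foreign_port == str(port):
            states[fields[3]] += 1
    return states


# Illustrative sample only; the addresses are not from the thread.
SAMPLE = """
Active Connections

  Proto  Local Address          Foreign Address        State
  TCP    192.0.2.10:1501        198.51.100.21:4700     ESTABLISHED
  TCP    192.0.2.10:1502        198.51.100.22:4700     TIME_WAIT
  TCP    192.0.2.10:1503        198.51.100.23:4700     SYN_SENT
  TCP    192.0.2.10:1504        198.51.100.24:4700     ESTABLISHED
  TCP    192.0.2.10:4701        198.51.100.30:52011    ESTABLISHED
"""

if __name__ == "__main__":
    states = count_connections_by_state(SAMPLE)
    total = sum(states.values())
    not_established = total - states["ESTABLISHED"]
    print(f"total: {total}, not ESTABLISHED: {not_established}")
    print(dict(states))
```

To use it on real data, save the output of `netstat -n` to a file and pass the file contents to the function.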

Best regards,
Victor

Netvoid

292 total for port 4700, about 90 are in TIME_WAIT or SYN_SENT...

After a fresh reboot and a few minutes time, I see about 950 with 600 in TIME_WAIT.

Sumit Pandya

If this is a socket-related limitation, it can be overcome with the guidelines below.
Start Registry Editor. Browse to, and then click, the following key in the registry:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
On the Edit menu, click New, DWORD Value, and then add the following registry values:
1. Reduce the client TCP/IP socket connection timeout value from the default value of 240 seconds:
Value name: TcpTimedWaitDelay
Value data: <Enter a decimal value between 30 and 240 here>

2. Increase the number of ephemeral ports that can be dynamically allocated to clients:
Value name: MaxUserPort
Value data: <Enter a decimal value between 5000 and 65534 here>
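
To see why these two values matter together, the sketch below estimates the highest rate of new outgoing connections that can be sustained before the ephemeral ports run out. It assumes that every closed connection keeps its local port in TIME_WAIT for the full TcpTimedWaitDelay and that ephemeral ports are allocated from 1025 up to MaxUserPort, which matches the older Windows defaults as far as I know; newer Windows versions manage the dynamic port range differently, so treat the numbers as an illustration only. The function names are hypothetical.

```python
# Estimate the sustainable rate of new outgoing TCP connections.
# Assumptions: ports are taken from 1025..MaxUserPort, and each closed
# connection holds its port in TIME_WAIT for the full delay.

FIRST_EPHEMERAL_PORT = 1025  # assumed start of the ephemeral range


def ephemeral_port_count(max_user_port: int) -> int:
    """Number of ports available for outgoing connections."""
    return max_user_port - FIRST_EPHEMERAL_PORT + 1


def max_sustained_connect_rate(max_user_port: int, timed_wait_delay: int) -> float:
    """New connections per second that can be opened without exhausting ports."""
    return ephemeral_port_count(max_user_port) / timed_wait_delay


if __name__ == "__main__":
    # Old defaults: MaxUserPort = 5000, TcpTimedWaitDelay = 240 seconds
    print(f"defaults: {max_sustained_connect_rate(5000, 240):.1f} connections/s")
    # Most aggressive values from the ranges above
    print(f"tuned:    {max_sustained_connect_rate(65534, 30):.1f} connections/s")
```

With MaxUserPort at 5000 there are 3976 usable ports, which is in line with the limit of around 4000 outgoing connections mentioned earlier in the thread.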

Victor, Please consider using "SO_REUSEADDR" socket setting, by calling "setsockopt()"

The registry settings above are taken from http://support.microsoft.com/default.aspx?scid=kb;en-us;314053
http://technet.microsoft.com/en-us/library/bb726981.aspx#EDAA

Netvoid

Still getting hundreds of agent timeouts per minute, with negligible visible system resource utilization.

Also, timeouts in the console and in the command line nxadm -i are very common, even when running the console locally.

For example, the "show queues" command at times never responds on the server. If I close it and run it a few more times, it will eventually give results. According to the results, nothing is queued.

The netxmsd process is at about 300 MB and holding, and the server has been running for at least 4 days now without a crash. It takes about 5-10 timeouts and retries with the console app to get it connected.