Agent service crashes, delaying boot times, BIG problem

Started by xenth, May 28, 2008, 10:22:50 AM

Previous topic - Next topic

xenth

On all the machines that I am monitoring, startup times are quite long

Yesterday I found out that the NetXMS agent is causing this problem, basically the agent service is started but it immediately crashes, after some time it starts up again and everything is fine. The readouts are also fine.

This problem is causing a two minute delay (or longer) on boot times for some of my more slower machines  >:(

I am using version 0.2.21, I haven't yet tested on X64 machines but most likely the problem occurs there as well. So far I have only tested windows XP machines for this problem, I will test vista machines as well if you need that infromation.

Config file:


#
# NetXMS agent configuration file
# Created by agent installer at Tue May 13 14:14:16 2008
#

MasterServers = [censored]
LogFile = {syslog}
FileStore = C:\NetXMS\var
SubAgent = winperf.nsm


Here are all the syslog reports (not running an english OS so I translated some things)

1st one

Type of event: Information
Source of event: NetXMS Win32 Agent
Category of event: None
Event-ID: 3
Date 28-5-2008
Time: 9:04:57
User: none
Computer: Censored
Description:
Subagent "WINNT.NSM" loaded successfully


2nd

Type of event: error
Source of event: Service Control Manager
Category of event: None
event-ID: 7022
Date: 28-5-2008
Time: 9:06:32
User: none
Computer: [censored]
Description:

The NeTXMS agent-service has crashed at startup.


20 Second after that 5 other syslog events regarding the agent are created with the following descriptions:

3d:
Counter set B is empty, collector thread for that set will not start

4th:
Subagent "winperf.nsm" loaded successfully

5th:
Counter set C is empty, collector thread for that set will not start

6th:
Listening on socket 0.0.0.0:4700

7th (23 seconds after error or so):
NetXMS Agent started

This is a very big problem as you can understand, if you need any other info please ask.


Victor Kirhenshtein

Hi!

Sounds very strange, I'll check the agent startup code. Just one question so far: do you use host name or IP address in MasterServers parameter?

Best regards,
Victor

xenth

Thank you!

I use the IP address, it's impossible to use the host name in my setup (Set the firewall to block DNS to NetXMS section)



xenth

Hi, any updates you can give me on this problem are very welcome  :)

xenth


Victor Kirhenshtein

Sorry for delay, I was quite busy :( Could you please do some additional tests?

1. Comment out winperf.nsm subagent - I suspect that it can be the cause of crash.
2. If agent service crashes without winperf.nsm, run nxagentd.exe from command line with -D switch - it will cause agent to log additional debug information.

Best regards,
Victor

Victor Kirhenshtein

One more question: how counter names \System\System Up Time and \System\Threads looks in your localized Windows?

Best regards,
Victor

xenth

Quote from: Victor Kirhenshtein on June 04, 2008, 09:18:38 AM
Sorry for delay, I was quite busy :( Could you please do some additional tests?

1. Comment out winperf.nsm subagent - I suspect that it can be the cause of crash.
2. If agent service crashes without winperf.nsm, run nxagentd.exe from command line with -D switch - it will cause agent to log additional debug information.

Best regards,
Victor


Here are some tests I ran for you:

Config file change:

Quote
#
# NetXMS agent configuration file
# Created by agent installer at Tue May 13 14:14:16 2008
#

MasterServers = 192.168.150.9
LogFile = {syslog}
FileStore = C:\NetXMS\var
# SubAgent = winperf.nsm

Time from login screen to usable system: 2.40 Minutes  :(

Agent completely disabled, time: 2.05 Minutes

Standard settings: 3 minutes

Winperf renamed and taken out of config: 2.30 minutes

Placed config in C: root to test nxagentd -D : 3.4 minutes


System.uptime: Gives value of 574  (on for 5 minutes ~ )
system.threads: gives value of 386

NetXMS agent crash is gone after I make sure winperf doesn't start but slow start up isn't fixed by this.

Hope this gives you enough info!

Victor Kirhenshtein

Hello!

I'm still unable to reproduce your situation nor identify possible cause. Could you please change logging to file by placing


LogFile = C:\nxagentd.log


to agent's config, run nxagentd -D and send me nxagentd.log file?

Also, one more test for crash: try to uncomment winperf.nsm and add the following to agent's config:


*WinPerf
EnableDefaultCounters = no


will it crash as well?

Best regards,
Victor

xenth

Hi :)

Here is your requested log file, I left a comment for you at the bottom


**************************************************************
[10-Jun-2008 11:28:08] Log file opened
[10-Jun-2008 11:28:08] Debug: Log file opened
[10-Jun-2008 11:28:08] Debug: Subagent API initialized
[10-Jun-2008 11:28:09] Subagent "WINNT.NSM" loaded successfully
[10-jun-2008 11:28:19] Debug: WINPERF: Counter translated: \System\Threads ==> \Systeem\Threads
[10-jun-2008 11:28:19] Debug: WINPERF: Counter translated: \System\System Up Time ==> \Systeem\Systeem ingeschakeld
[10-jun-2008 11:28:19] Debug: WINPERF: Counter translated: \System\Processor Queue Length ==> \Systeem\Lengte van de processorwachtrij
[10-jun-2008 11:28:19] Debug: WINPERF: Counter translated: \System\Processor Queue Length ==> \Systeem\Lengte van de processorwachtrij
[10-jun-2008 11:28:19] Debug: WINPERF: Counter translated: \System\Processor Queue Length ==> \Systeem\Lengte van de processorwachtrij
[10-jun-2008 11:28:19] Debug: WINPERF: Counter translated: \Processor(_Total)\% Processor Time ==> \Processor(_Total)\Percentage processortijd
[10-jun-2008 11:28:19] Debug: WINPERF: Counter translated: \Processor(_Total)\% Processor Time ==> \Processor(_Total)\Percentage processortijd
[10-jun-2008 11:28:19] Debug: WINPERF: Counter translated: \Processor(_Total)\% Processor Time ==> \Processor(_Total)\Percentage processortijd
[10-jun-2008 11:28:19] Debug: WINPERF: Counter translated: \PhysicalDisk(_Total)\Avg. Disk Queue Length ==> \Fysieke schijf(_Total)\Gemiddelde wachtrijlengte voor schijf
[10-jun-2008 11:28:19] Debug: WINPERF: Counter translated: \PhysicalDisk(_Total)\% Disk Time ==> \Fysieke schijf(_Total)\Percentage schijftijd
[10-jun-2008 11:28:19] Debug: WINPERF: Counter translated: \Processor(0)\% Processor Time ==> \Processor(0)\Percentage processortijd
[10-jun-2008 11:28:19] Debug: WINPERF: Counter translated: \Processor(0)\% Processor Time ==> \Processor(0)\Percentage processortijd
[10-jun-2008 11:28:19] Debug: WINPERF: Counter translated: \Processor(0)\% Processor Time ==> \Processor(0)\Percentage processortijd
[10-jun-2008 11:28:19] Subagent "winperf.nsm" loaded successfully
[10-jun-2008 11:28:19] Counter set B is empty, collector thread for that set will not start
[10-jun-2008 11:28:19] Counter set C is empty, collector thread for that set will not start
[10-jun-2008 11:28:20] Unable to bind socket: Elk socketadres (protocol/netwerkadres/poort) kan normaal slechts één keer worden gebruikt.


=========== Victor: Bottom line of this log says:   Every socket address (protocol/networkadress/port) can only be used once. ===========


I am rebooting right now to test the default counter config setting :)


xenth

Hi!

After changing the config fiel in netxms\etc to your specified settings the crashing problem has been fixed :)

What does this mean?

Victor Kirhenshtein

This means that WINPERF subagent has some problems when translating English counter names to Dutch names. I will check that code. It never runs on English version of Windows (it's why I was unable to reproduce this problem on my systems), and as far as I know runs without problem on Russian version.

Regarding startup time - I see that there are only 12 seconds between agent start and socket binding error (11:28:08 - 11:28:20). Socket error most likely caused by already running nxagentd.exe instance. Could you please check that agent service is stopped and nobody listens on port 4700, and run agent in debug mode again so we have full startup timing?

Best regards,
Victor