Errors Starting 2.2.7

Started by StanHubble, July 23, 2018, 01:00:15 AM

StanHubble

Hi,
We are having some problems with 2.2.7. nxdbmgr reports clean on both "check" and "check_data_tables".
On startup things seem to start OK, but the server logs an error:


2018.07.22 12:48:57.007 *D* 28 network device drivers loaded
2018.07.22 12:48:57.007 *D* Built-in objects created
2018.07.22 12:48:57.007 *D* SQLite version 3.22.0
2018.07.22 12:48:57.007 *I* Database driver "sqlite.ddr" loaded and initialized successfully
2018.07.22 12:48:57.007 *D* [db.conn            ] New DB connection opened: handle=000000FE67A8E630
2018.07.22 12:48:57.007 *D* Caching object configuration tables
2018.07.22 12:49:17.694 *D* Loading built-in object properties...
2018.07.22 12:49:17.710 *D* Loading zones...
2018.07.22 12:49:17.710 *D* NetObj::loadCommonProperties() failed for object Default [4] class=6
2018.07.22 12:49:17.819 *D* Loading conditions...

It continues on and loads everything else, then starts polling and tunnel creation. If I check "show q" and "show thr" in nxadm, everything looks OK. "show watchdog" responds:

netxmsd: sho w
Thread                                           Interval Status
----------------------------------------------------------------------------
Item Poller                                      10       Running
Syncer Thread                                    30       Sleeping
Poll Manager                                     5        Sleeping
Ad hoc scheduler                                 5        Sleeping
Recurrent scheduler                              5        Sleeping

A couple of minutes later polling seems to stop (with a lot of active requests waiting) and the watchdog does not respond any more.

netxmsd: sho thr
MAIN
   Threads:            42 (8/1024)
   Load average:       0.32 0.12 0.04
   Current load:       0%
   Usage:              4%
   Active requests:    0
   Scheduled requests: 0

POLLERS
   Threads:            4096 (2048/4096)
   Load average:       10964.91 3197.29 1102.58
   Current load:       414%
   Usage:              100%
   Active requests:    16967
   Scheduled requests: 0

DATACOLL
   Threads:            1024 (10/1024)
   Load average:       1220.03 569.57 211.84
   Current load:       2%
   Usage:              100%
   Active requests:    28
   Scheduled requests: 0

SCHEDULER
   Threads:            1 (1/64)
   Load average:       0.00 0.00 0.00
   Current load:       0%
   Usage:              1%
   Active requests:    0
   Scheduled requests: 0

AGENT
   Threads:            61 (4/512)
   Load average:       12.35 2.86 0.94
   Current load:       100%
   Usage:              11%
   Active requests:    61
   Scheduled requests: 0

SYNCER
   Threads:            1 (1/10)
   Load average:       0.00 0.00 0.00
   Current load:       0%
   Usage:              10%
   Active requests:    0
   Scheduled requests: 0

netxmsd: sho w
Thread                                           Interval Status
----------------------------------------------------------------------------



I can leave this for hours and nothing changes.  The only activities are tunnels opening and closing and threads stopping in data collection due to inactivity.

The only other error message that I can find in the log files is:

Cannot establish connection with ISC peer ::1

I have restarted the netxmsd service many times.
I have rebooted multiple times.
Windows updates are all up to date (Windows Server 2012 R2 running in a Hyper-V VM, 32 GB, 10 CPUs).
Database is on a separate 2012 server with MSSQL 2014 with lots of space.

Any help would be appreciated



Victor Kirhenshtein

Hi,

So "show watchdog" actually shows an empty list? Can you enter any command in nxadm after that?
Could you please create a process dump for netxmsd.exe (using Task Manager) and provide it to us for debugging?

Best regards,
Victor

StanHubble

"show watchdog" displays the dashes and then hangs; Ctrl-C drops you back to the command line.
I can restart nxadm afterwards and run "show queue" or "show thread", but "sho stat", "sho watchdog", and "show tunnel" all hang.

I created the process dump but it is almost 10GB.  Zipped it is 500MB.  Will this forum accept an upload that big?

Victor Kirhenshtein

Quote from: StanHubble on July 23, 2018, 02:07:02 PM
I created the process dump but it is almost 10GB.  Will this forum accept an upload that big?

I don't think so. I'll send you an upload link via PM.

Best regards,
Victor

StanHubble

I hate #$%@#$ Windows.......
I removed and reapplied the latest set of Windows updates and, 10 reboots later,
re-allowed NetXMS through the Windows firewall rules (even though the firewall is disabled), held my left foot while drinking a beer,
and like magic netxmsd starts properly.

I think it was the beer that did it.

Victor Kirhenshtein

I doubt it's a Windows bug; more likely it was just lucky timing. I checked the provided process dump: the server is deadlocked in code related to applying templates and sending events. There were execution errors in template auto-apply scripts. In that case the server generates an event, and generating this event from within the auto-apply code somehow caused the deadlock. I will investigate it further. As a workaround, I suggest checking the auto-apply scripts for possible errors.
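
For illustration only, the kind of deadlock described here can be sketched in a few lines of Python. This is not NetXMS server code: `apply_template`, `post_event`, the event name, and the single non-reentrant lock are all hypothetical, and the server's actual locking is assumed rather than known. The lock is taken with a timeout so the sketch reports the problem instead of hanging.

```python
import threading

# Hypothetical stand-in for an internal server lock (not NetXMS code).
# threading.Lock is non-reentrant: the owning thread cannot take it twice.
object_lock = threading.Lock()


def post_event(name, timeout=0.2):
    """Event generation that needs the same lock the caller may already hold."""
    # A plain acquire() would block forever if the caller holds the lock;
    # the timeout lets this sketch report the deadlock instead of hanging.
    if not object_lock.acquire(timeout=timeout):
        return f"deadlock: {name} could not take the lock"
    try:
        return f"posted: {name}"
    finally:
        object_lock.release()


def apply_template(script_ok):
    """Auto-apply step; on a script error it posts an event while still locked."""
    with object_lock:
        if script_ok:
            return "applied"
        # Hypothetical event name, used only for this illustration.
        return post_event("script_error_event")


if __name__ == "__main__":
    print(apply_template(True))       # normal path, no event generated
    print(apply_template(False))      # error path re-enters the lock
    print(post_event("standalone"))   # same call works when the lock is free
```

In a sketch like this, the usual ways out are to post the event after the lock is released or to use a reentrant lock; which approach fits the server is not something this thread states.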

Best regards,
Victor

StanHubble

Yeah! I was too optimistic..... it died after about 12 hours and won't restart.
I have checked the DCIs throwing errors, and they were complaining about null values. They were all dependent DCIs that used a GETDCIVALUEBYDESCRIPTION function in the transformation script.
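
A null guard is the usual way to keep a dependent script from failing when its source value is not available yet. The following is a Python sketch of that pattern only, not NXSL: `get_dci_value_by_description` is a hypothetical stand-in for the lookup function, and the cached values and description strings are made up for illustration.

```python
from typing import Optional

# Hypothetical cache of last collected values, keyed by DCI description.
# A value of None models a DCI that exists but has not collected anything yet.
last_values = {"CPU usage": "42.5", "Free memory": None}


def get_dci_value_by_description(description: str) -> Optional[str]:
    """Stand-in lookup: returns None when the DCI is missing or has no value."""
    return last_values.get(description)


def transform(description: str, scale: float) -> Optional[float]:
    """Dependent-DCI transformation that tolerates a missing source value."""
    raw = get_dci_value_by_description(description)
    if raw is None:
        # Source value not available (for example, shortly after startup):
        # return None instead of doing arithmetic on a null.
        return None
    return float(raw) * scale
```

With the guard in place, a missing source value produces no result for that poll rather than a script execution error.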





Victor Kirhenshtein

Would you like to test build with experimental patch that changes some internal synchronization mechanism?

Best regards,
Victor

StanHubble


Victor Kirhenshtein

This is the download link for both the minimal and full installers: https://cloud.netxms.org/index.php/s/yWU2wcC3c4WO2Y6

Best regards,
Victor

StanHubble

Thanks for the help. It has been 24 hours now since I installed 2.2.7.17 and everything seems to be running fine.

Startup was longer than it used to be, but I can live with that.
I still have a number of errors in some Template and Container autobind scripts during startup; the same scripts behave normally once the system is up and running. They complain about null values in variables that are populated either from a GETDCIVALUEBYDESCRIPTION function or from a custom attribute of the node being tested.

Now that 2.2.7 recognizes the node type "Virtual", I can change some of these autobind scripts to key on this instead, but there is a problem. The config poll correctly identifies actual virtual machines (if they have an agent) on both VMware and Hyper-V. But it also identifies Hyper-V host machines as virtual (with both 2.2.6 and 2.2.7 agents).
I don't have visibility into any VMware host machines, so I can't say whether it is the case with them as well.



Thanks again
Stan

Victor Kirhenshtein

There is an issue registered for incorrect Hyper-V host detection: https://track.radensolutions.com/issue/NX-1472. A possible reason is that we use the CPUID "hypervisor" bit to detect whether the agent is running within a VM, and I suppose that the Hyper-V host OS is in fact a privileged VM, so the agent runs inside a virtual environment anyway. We will add additional checks in 2.2.8 to report the host as a physical machine.
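
As a rough illustration of the check in question: the "hypervisor present" flag is bit 31 of ECX returned by CPUID leaf 1. The Python sketch below only tests that bit in a register value passed in as an integer; it does not execute CPUID itself. The `is_hyperv_root` flag is a hypothetical placeholder for whatever additional check is used, which this thread does not describe.

```python
HYPERVISOR_BIT = 1 << 31  # CPUID leaf 1, ECX bit 31 ("hypervisor present")


def hypervisor_bit_set(ecx: int) -> bool:
    """Return True if the hypervisor-present bit is set in a CPUID.1:ECX value."""
    return bool(ecx & HYPERVISOR_BIT)


def classify(ecx: int, is_hyperv_root: bool = False) -> str:
    """Classify a machine from the CPUID bit plus one hypothetical extra check."""
    if not hypervisor_bit_set(ecx):
        return "physical"
    # is_hyperv_root is a hypothetical input: the CPUID bit alone cannot tell
    # a guest VM apart from a host OS that itself runs on top of a hypervisor.
    return "physical" if is_hyperv_root else "virtual"
```

For example, `classify(0x80000000)` yields "virtual", while the same register value with `is_hyperv_root=True` yields "physical".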

Best regards,
Victor