Cluster node disappears

Started by Sumit Pandya, December 07, 2010, 07:11:41 AM

Sumit Pandya

Hi, I created a cluster under All Services and created 3 nodes inside that cluster object. I configured 2 DCIs with SNMP polling and a custom port on the cluster object. When I restart my computer, the cluster nodes disappear. This has happened both times I created a cluster. In my first attempt on Saturday I configured 2 nodes inside the cluster, and on Monday I configured 3 nodes. Below is netxms.log, which confirms there is a problem:

[07-Dec-2010 10:18:57] Log file opened
[07-Dec-2010 10:18:58] Database driver "sqlite.ddr" loaded and initialized successfully
[07-Dec-2010 10:18:59] Failed to load node object with id 31 from database
[07-Dec-2010 10:18:59] Failed to load node object with id 32 from database
[07-Dec-2010 10:18:59] Failed to load node object with id 41 from database
[07-Dec-2010 10:18:59] Failed to load node object with id 42 from database
[07-Dec-2010 10:18:59] Failed to load node object with id 44 from database
[07-Dec-2010 10:18:59] Inconsistent database: cluster object 28 has reference to non-existing node object 31
[07-Dec-2010 10:18:59] Failed to load cluster object with id 28 from database
[07-Dec-2010 10:18:59] Inconsistent database: cluster object 40 has reference to non-existing node object 41
[07-Dec-2010 10:18:59] Failed to load cluster object with id 40 from database
[07-Dec-2010 10:18:59] Inconsistent database: ServiceRoot object has reference to non-existing child object 40
[07-Dec-2010 10:18:59] Inconsistent database: ServiceRoot object has reference to non-existing child object 44
[07-Dec-2010 10:18:59] NetXMS Server started
[07-Dec-2010 10:18:59] Listening for SNMP traps on UDP socket 0.0.0.0:162
[07-Dec-2010 10:18:59] Listening for client connections on TCP socket 0.0.0.0:4701
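
For anyone triaging a similar log, the failures above can be grouped with a short script. This is only a minimal sketch: the regular expressions are written against the exact message wording shown in this log, and the names `summarize` and `SAMPLE_LOG` are made up for illustration, not part of NetXMS.

```python
import re

# Lines copied from the netxms.log excerpt in this post.
SAMPLE_LOG = """\
[07-Dec-2010 10:18:59] Failed to load node object with id 31 from database
[07-Dec-2010 10:18:59] Failed to load node object with id 32 from database
[07-Dec-2010 10:18:59] Failed to load node object with id 41 from database
[07-Dec-2010 10:18:59] Failed to load node object with id 42 from database
[07-Dec-2010 10:18:59] Failed to load node object with id 44 from database
[07-Dec-2010 10:18:59] Inconsistent database: cluster object 28 has reference to non-existing node object 31
[07-Dec-2010 10:18:59] Failed to load cluster object with id 28 from database
[07-Dec-2010 10:18:59] Inconsistent database: cluster object 40 has reference to non-existing node object 41
[07-Dec-2010 10:18:59] Failed to load cluster object with id 40 from database
[07-Dec-2010 10:18:59] Inconsistent database: ServiceRoot object has reference to non-existing child object 40
[07-Dec-2010 10:18:59] Inconsistent database: ServiceRoot object has reference to non-existing child object 44
"""

# Patterns assume the message wording seen in this particular log.
FAILED = re.compile(r"Failed to load (\w+) object with id (\d+) from database")
DANGLING = re.compile(
    r"Inconsistent database: (.+?) has reference to "
    r"non-existing (?:node|child) object (\d+)"
)


def summarize(log_text):
    """Return (failed, dangling) collected from the log text.

    failed   maps object class -> list of ids that failed to load
    dangling is a list of (referencing object, missing object id)
    """
    failed = {}
    dangling = []
    for line in log_text.splitlines():
        match = FAILED.search(line)
        if match:
            failed.setdefault(match.group(1), []).append(int(match.group(2)))
            continue
        match = DANGLING.search(line)
        if match:
            dangling.append((match.group(1), int(match.group(2))))
    return failed, dangling


if __name__ == "__main__":
    failed, dangling = summarize(SAMPLE_LOG)
    print("failed to load:", failed)
    print("dangling references:", dangling)
```

Grouping the messages this way makes the chain visible: the node objects fail first, and the cluster and ServiceRoot errors follow from references to those missing nodes.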

Victor Kirhenshtein

Hi!

Looks like a serious bug. Could you please run netxmsd with full debug on (by adding option -D 9) and send me the debug log?

Best regards,
Victor

Sumit Pandya

The debug log is attached. The problem is very easy to reproduce. I used nxadm -> down to do a proper shutdown, and after bringing the server back up I see that my entire cluster is missing. You may use SNMP on the Host-IP and OID for verification.

Victor Kirhenshtein

Looks like the database is damaged for an unknown reason. If I understand correctly, you are running the NetXMS server as a service, and the database corruption occurs when you restart your PC. I suspect that during shutdown the server process is terminated before it completes the database update, so the database becomes inconsistent. To check this, could you please do the following:

1. Run netxmsd with debug level 6 (by adding -D 6 to the netxmsd.exe command line)
2. Before restarting PC, stop NetXMS Core service manually
3. Restart PC and check database for correctness

If the same problem occurs, send me the debug log. If the problem disappears, try restarting the PC as usual (without stopping the NetXMS service), and if the problem reappears, send me the debug log.

Best regards,
Victor

Sumit Pandya

Yes, there is database corruption under some operations. I did a fresh install and created 2 nodes inside a cluster. The cluster has 2 SNMP OIDs defined as DCIs. Then I deleted a node and tried to create the node again with the same name and IP. That is when the issue occurred. Please try to reproduce it yourself; it is very easy. For your information, I create everything as "unmanaged" and change the cluster nodes to managed once I have configured everything.

I'm having another serious problem. If I forget to create a node as unmanaged, NetXMS tries to poll my SNMP agent on the default port. There is nothing on port 161, so NetXMS marks the node as inactive from then on. I see an alert for node down. But then how do I make that node active again? Despite all my attempts, I failed to communicate with that node.

After facing this problem I deleted the node and tried to recreate it. That is where the bug was exposed!

Victor Kirhenshtein

Hi!

I've found a bug (or feature :) ). NetXMS expects that every node which has an IP address is bound to at least one IP subnet. Usually this is done automatically during the configuration poll. As you have configuration polls disabled, the subnet binding never occurs, so on startup the server considers these nodes problematic (because they do not have a subnet binding) and fails to load them. I'll change this in the next release. As a workaround for the current version, you can do a manual configuration poll on each node once. This will create the subnet object and bind the nodes to the subnet. Also, you can try to replace nxcore.dll and nxdbmgr.exe with the attached patched versions. This should allow the server to load correctly even if nodes do not have a subnet binding.

Best regards,
Victor
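
The load-time behaviour described in the post above can be illustrated with a toy model. This is not NetXMS code: the `Node` class, the `load_nodes` function and the `strict` flag are all hypothetical names, and the patched behaviour is assumed to be simply "skip the subnet check", which is a reading of the post rather than something taken from the source.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a stored node object."""
    node_id: int
    ip_address: str  # empty string means the node has no IP address
    subnet_ids: list = field(default_factory=list)


def load_nodes(nodes, strict=True):
    """Split nodes into (loaded ids, rejected ids).

    With strict=True, a node that has an IP address but no subnet
    binding is rejected, mirroring the behaviour described above.
    With strict=False, such nodes are loaded anyway (assumed to be
    what the patched version does).
    """
    loaded, rejected = [], []
    for node in nodes:
        unbound = bool(node.ip_address) and not node.subnet_ids
        if strict and unbound:
            rejected.append(node.node_id)
        else:
            loaded.append(node.node_id)
    return loaded, rejected


if __name__ == "__main__":
    # Made-up data: node 31 was never configuration-polled, so it has no subnet.
    nodes = [
        Node(31, "10.0.0.1"),
        Node(32, "10.0.0.2", [5]),
        Node(33, ""),
    ]
    print(load_nodes(nodes, strict=True))
    print(load_nodes(nodes, strict=False))
```

In this model, running a configuration poll corresponds to filling in `subnet_ids`, after which the node passes the strict check as well.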

jdl

Dear Victor,

Did you solve this issue, or was something put in the latest releases to alleviate this issue of disappearing cluster nodes?
I'm currently using your latest release (1.1.10) on CentOS 6.0. I have created two clusters, X and Y, which each have two nodes. All nodes have an interface on network A. The nodes of cluster X have an interface on network B and those of cluster Y have an interface on network C.

Auto Discovery is on. Both clusters have been created as described in the forum. Nodes were discovered and allocated to clusters. I left the setup to stabilize over the weekend. Today, all nodes have disappeared. Both clusters are correctly populated. Polling of any cluster node is successful. Each node has SNMP and the NetXMS agent activated. Some other nodes exist and are correctly discovered in networks A, B and C.

Any advice is welcome.

Best regards,

JDamien

Victor Kirhenshtein

Sorry for not returning to this issue for so long. Looking through the code didn't give me any ideas. Is it possible to do either: 1) run the system for some time with a high debug level (at least 5, 6 is better) and send me the debug log after the nodes disappear; or 2) give me remote access to the system so I can look around and possibly get additional information?

Best regards,
Victor

jdl

Will do point 1)
For the moment, in the log, we can see that all nodes not included in a cluster are polled. We get messages "Starting status poll for node <IP@> (ID: <ID>)" and "Finished status poll for node <IP@> (ID: <ID>)"
For those nodes that are in a cluster the messages are different: "Starting discovery poll for node <node_name> (<IP@>) in zone 0", then "Discovery poll for node <node_name> (<IP@>) - reading routing table" and finally "Finished discovery poll for node <node_name> (<IP@>)".

Those cluster nodes only appear under "Infrastructure Services", not in the "Entire Network" section and its network subsections. Our NetXMS server appears in both containers and is not a cluster.

Br,

Jdamien

Victor Kirhenshtein

It's very strange that cluster nodes do not appear under "Entire Network". How is your cluster object configured? Also, what interfaces are seen on the cluster nodes?

Best regards,
Victor