netxmsd not listening on port 4701

Started by thierryc, March 16, 2018, 06:37:53 PM

thierryc

Hello,

NetXMS was working fine for some time, but then suddenly, without any warning, it was no longer reachable.
I am running version 2.2.4 on Ubuntu, with Tomcat 7 for the web interface.

netstat -nl  gives me:

tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN     
tcp        0      0 127.0.0.1:21784         0.0.0.0:*               LISTEN     
tcp        0      0 127.0.0.1:6010          0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:4700            0.0.0.0:*               LISTEN     
tcp        0      0 127.0.0.1:3306          0.0.0.0:*               LISTEN     
tcp6       0      0 :::22                   :::*                    LISTEN     
tcp6       0      0 ::1:6010                :::*                    LISTEN     
tcp6       0      0 :::4700                 :::*                    LISTEN     
tcp6       0      0 127.0.0.1:8005          :::*                    LISTEN     
tcp6       0      0 :::8080                 :::*                    LISTEN     
udp        0      0 0.0.0.0:68              0.0.0.0:*                 
     

using: select var_name,var_value,default_value from config;
I see:
| CaseInsensitiveLoginNames                         | 0                                            | 0                              |
| CheckTrustedNodes                                 | 0                                            | 0                              |
| ClientListenerPort                                | 4701                                         | 4701                           |
| ClusterContainerAutoBind                          | 0                                            | 0                              |

Starting netxmsd at debug level 9, I do not see any errors, but neither do I see any lines of the form:

[INFO ] Listening for  .... connections

Any idea what could be wrong?

Thanks
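
The netstat check above can also be scripted. The following is only a minimal sketch, not part of NetXMS: the helper name `is_port_listening` is hypothetical, and it assumes the check runs on the server itself against 127.0.0.1. It simply attempts a TCP connection to each port and reports whether the connection was accepted.

```python
import socket


def is_port_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port is accepted."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 4700 and 4701 are the two ports discussed in this post.
    for port in (4700, 4701):
        state = "listening" if is_port_listening("127.0.0.1", port) else "not listening"
        print(f"port {port}: {state}")
```

A successful connection only shows that something accepts connections on that port; it does not confirm that the listener is netxmsd.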

Update:

If I delete the records in alarms, event_log and alarm_events and restart netxmsd, it works for some time.

After 10-20 minutes it stops working again.

Last logs:

Quote

2018.03.20 15:10:24.652 *D* New thread started in thread pool MAIN
2018.03.20 15:10:24.652 *D* New thread started in thread pool MAIN
2018.03.20 15:10:24.652 *D* New thread started in thread pool MAIN
2018.03.20 15:10:24.652 *D* New thread started in thread pool MAIN
2018.03.20 15:10:31.563 *D* [obj.poll.node      ] AcceptNewNode(192.168.37.52): host is not reachable
2018.03.20 15:10:31.566 *D* [obj.poll.node      ] NodePoller: processing node 192.168.37.144/24 in zone 0
2018.03.20 15:10:39.553 *D* Started topology poll for node BTZ-FW2101 [123]
2018.03.20 15:10:39.553 *D* Finished instance discovery poll for node BTZ-FW2101 (ID: 123)
2018.03.20 15:10:39.555 *D* VLAN list retrieved from node BTZ-FW2101 [123]
2018.03.20 15:10:39.555 *D* Failed to get switch forwarding database from node BTZ-FW2101 [123]
2018.03.20 15:10:39.556 *D* Link layer topology retrieved for node BTZ-FW2101 [123] (5 connections found)
2018.03.20 15:10:39.557 *D* Link layer topology processed for node BTZ-FW2101 [123]
2018.03.20 15:10:39.558 *D* Finished topology poll for node BTZ-FW2101 [123]
2018.03.20 15:10:40.592 *D* Started topology poll for node LUSCBCENG-SW01.net.eng [121]
2018.03.20 15:10:40.592 *D* Finished instance discovery poll for node LUSCBCENG-SW01.net.eng (ID: 121)
2018.03.20 15:10:41.652 *D* VLAN list retrieved from node LUSCBCENG-SW01.net.eng [121]
2018.03.20 15:10:43.257 *D* Switch forwarding database retrieved for node LUSCBCENG-SW01.net.eng [121]
2018.03.20 15:10:43.540 *D* Link layer topology retrieved for node LUSCBCENG-SW01.net.eng [121] (5 connections found)
2018.03.20 15:10:43.541 *D* Link layer topology processed for node LUSCBCENG-SW01.net.eng [121]
2018.03.20 15:10:43.541 *D* Finished topology poll for node LUSCBCENG-SW01.net.eng [121]
2018.03.20 15:10:45.208 *D* Starting instance discovery poll for node 192.168.37.15 (ID: 1696)
2018.03.20 15:10:45.208 *D* Node is marked as unreachable, instance discovery poll aborted
2018.03.20 15:10:45.208 *D* Finished instance discovery poll for node 192.168.37.15 (ID: 1696)
2018.03.20 15:10:45.208 *D* Started topology poll for node 192.168.37.15 [1696]
2018.03.20 15:10:45.208 *D* Failed to get switch forwarding database from node 192.168.37.15 [1696]
2018.03.20 15:10:45.209 *D* Link layer topology retrieved for node 192.168.37.15 [1696] (1 connections found)
2018.03.20 15:10:45.209 *D* Link layer topology processed for node 192.168.37.15 [1696]
2018.03.20 15:10:45.209 *D* Finished topology poll for node 192.168.37.15 [1696]
2018.03.20 15:10:49.829 *E* Thread "Item Poller" does not respond to watchdog thread
2018.03.20 15:10:50.965 *D* [obj.netmap         ] NetworkMap::updateContent(L2 [911]): cannot get topology information for node BTZ-SW2101 [133]
2018.03.20 15:10:51.034 *D* Stopping worker thread in thread pool DATACOLL due to inactivity
2018.03.20 15:10:54.641 *D* Stopping worker thread in thread pool POLLERS due to inactivity
2018.03.20 15:11:03.552 *D* [obj.poll.node      ] AcceptNewNode(192.168.37.144): host is not reachable
2018.03.20 15:11:03.558 *D* [obj.poll.node      ] NodePoller: processing node 192.168.37.163/24 in zone 0
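
At debug level 9 the log is mostly `*D*` lines, so the single `*E*` entry is easy to miss. A rough sketch for filtering such a log follows. It assumes every line has the layout seen in the excerpt above (timestamp, a one-letter marker between asterisks, then the message) and that `*E*` marks an error; the function name `find_errors` and the `SAMPLE` list are only illustrative.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt: "YYYY.MM.DD HH:MM:SS.mmm *X* message"
LOG_LINE = re.compile(
    r"^(\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \*([A-Z])\* (.*)$"
)


def find_errors(lines):
    """Return (timestamp, message) pairs for lines marked *E*."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group(2) == "E":
            stamp = datetime.strptime(match.group(1), "%Y.%m.%d %H:%M:%S.%f")
            hits.append((stamp, match.group(3)))
    return hits


# Three lines copied from the log excerpt above.
SAMPLE = [
    "2018.03.20 15:10:45.209 *D* Finished topology poll for node 192.168.37.15 [1696]",
    '2018.03.20 15:10:49.829 *E* Thread "Item Poller" does not respond to watchdog thread',
    "2018.03.20 15:10:51.034 *D* Stopping worker thread in thread pool DATACOLL due to inactivity",
]

if __name__ == "__main__":
    for stamp, message in find_errors(SAMPLE):
        print(stamp.isoformat(), message)
```

Lines that do not match the assumed layout are skipped silently, so the filter would have to be adjusted if the real log contains multi-line entries.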

Victor Kirhenshtein

Hi,

Please check whether the netxmsd process is crashing. If it is, please provide a core dump or a stack trace from the core dump.

Best regards,
Victor

cholo7

This has also happened to me, and I'm very thankful I found this post. Port 4701 was not listening. After I deleted all entries in event_log and restarted the netxmsd server, netstat showed 4701 listening again. I could not find a core dump because the server does not crash: it keeps running but never opens port 4701.

$ netxmsd -v
NetXMS Server Version 2.2.6 Build 9513 (2.2.6-42-g8ddf439) (UNICODE)
NXCP: 4.48.1.13 (AES-256, Blowfish-256, IDEA, 3DES, AES-128, Blowfish-128)
Built with: IBM XL C/C++ for AIX, V13.1.2 (5725-C72, 5765-J07)

Victor Kirhenshtein

Hi,

It seems that the server hangs during startup. Can you try to capture the threads with the attached script (you'll need gdb installed on the system)?

Best regards,
Victor

Victor Kirhenshtein

Sorry, I didn't notice that you are running the server on AIX. This script won't work there. I will check how to capture thread stack traces on AIX.

Best regards,
Victor

Victor Kirhenshtein

It looks like the server hangs while importing templates. Please try setting the server configuration parameter ImportConfigurationOnStartup to false (0). You can do this from the command line with the database manager:


nxdbmgr set ImportConfigurationOnStartup 0


Best regards,
Victor

cholo7

I'm at a critical stage of the project implementation and cannot test this right away. I'll provide feedback once the issue occurs again, perhaps when the alarm / event logs again exceed 100k records. I'll set this parameter and see if it works. Thank you.

Victor Kirhenshtein

A high number of active alarms can also be an issue, as the server caches them on startup. Usually you should not have many active alarms, though; if you do, it likely points to a flaw in your event processing logic.

Best regards,
Victor