Hi there,
I moved NetXMS from an old server to two new servers. I did the migration with the
nxdbmgr migrate command, following these (https://wiki.netxms.org/wiki/How_to_migrate_to_another_database) instructions. The migration itself succeeded, but NetXMS always crashes after some time (approx. 30 - 60 minutes, sometimes longer). After that I can no longer log in to NetXMS: the login gets stuck at
"Objekte synchronisieren" (synchronizing objects) until the timeout.
(https://i.ibb.co/5x1D6YL/Screenshot-2019-12-31-Net-XMS-Management-Console.png)
(https://i.ibb.co/GTG9qkG/Screenshot-2019-12-31-Net-XMS-Management-Console-1.png)
If you are already logged in at the time of the crash, however, you are not kicked out but stay logged in.
The following error messages then appear in the NetXMS log file:
2019.12.31 12:48:36.346 *E* [ ] Thread "Poll Manager" does not respond to watchdog thread
2019.12.31 12:48:56.347 *E* [ ] Thread "Syncer Thread" does not respond to watchdog thread
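To spot these hangs automatically, a minimal Python sketch like the one below can scan log lines for the watchdog error. The regular expression is derived only from the two lines quoted in this post, and the name find_watchdog_errors is made up for illustration; other NetXMS versions may format the message differently.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2019.12.31 12:48:36.346 *E* [ ] Thread "Poll Manager" does not respond to watchdog thread
# (pattern is an assumption based on the two log lines quoted in the post)
WATCHDOG_RE = re.compile(
    r'^(\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \*E\* \[[^\]]*\] '
    r'Thread "([^"]+)" does not respond to watchdog thread'
)


def find_watchdog_errors(lines):
    """Return a list of (timestamp, thread_name) for every watchdog error line."""
    hits = []
    for line in lines:
        match = WATCHDOG_RE.match(line.strip())
        if match:
            timestamp = datetime.strptime(match.group(1), "%Y.%m.%d %H:%M:%S.%f")
            hits.append((timestamp, match.group(2)))
    return hits


if __name__ == "__main__":
    sample = [
        '2019.12.31 12:48:36.346 *E* [ ] Thread "Poll Manager" does not respond to watchdog thread',
        '2019.12.31 12:48:56.347 *E* [ ] Thread "Syncer Thread" does not respond to watchdog thread',
    ]
    for timestamp, thread in find_watchdog_errors(sample):
        print(timestamp.isoformat(), thread)
```

To use it on a real log, pass an open file instead of the sample list, for example open("/var/log/netxmsd", errors="replace").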
Information about my configuration: there is a NetXMS server and a separate database server. In
/etc/netxmsd.conf the database server is set with
DBServer=. Previously (with the old server) the NetXMS server and the database were on the same machine.
Old server:
- Debian GNU/Linux 8 (jessie)
- NetXMS Server Version 2.2.13 Build 9518 (2.2.13)
New server:
NetXMS-Server:
- Ubuntu 18.04.3 LTS
- NetXMS Server Version 3.1.261 Build 3.1-261-ga5f9451ddf
NetXMS configuration file at /etc/netxmsd.conf:
## Logging
# Log file name
LogFile=/var/log/netxmsd
# Increase logging verbosity, 0 (only errors) to 9 (verbose debug)
DebugLevel=7
## Database configuration.
## Uncomment and setup ONE section.
## Option #1 - SQLite (for test installations only):
#DBDriver=sqlite.ddr
#DBName=/var/lib/netxms/netxms.db
## Option #2 - PostgreSQL (recommended):
#DBDriver=pgsql.ddr
#DBServer=127.0.0.1
#DBName=netxms
#DBLogin=netxms
#DBPassword=netxms
## Option #3 - MySQL:
DBDriver=mysql.ddr
DBServer=10.10.11.20
DBName=netxms
DBLogin=******
DBPassword=********************************
## Option #4 - Oracle:
#DBDriver=oracle.ddr
#DBServer=//127.0.0.1:1521/ORCL # Instant Client connection string or SID
#DBLogin=netxms
#DBPassword=netxms
## Option #5 - unixODBC/FreeTDS:
#DBDriver=odbc.ddr
#DBServer=NETXMS_DSN
#DBLogin=netxms
#DBPassword=netxms
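Since the file asks for exactly ONE database section to be active, a quick sanity check can list the settings that are not commented out. The following is only a sketch: the helper names active_settings and active_db_drivers are invented here, and it assumes that everything after a # on a line is a comment, which may not match how netxmsd itself parses the file (for example, a password containing # would be cut off).

```python
def active_settings(conf_text):
    """Return (key, value) pairs for all lines that are not commented out.

    Assumption: anything after '#' is a comment. This is a simplification,
    not a description of the real netxmsd parser.
    """
    pairs = []
    for raw_line in conf_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            pairs.append((key.strip(), value.strip()))
    return pairs


def active_db_drivers(conf_text):
    """Return the values of every uncommented DBDriver line."""
    return [value for key, value in active_settings(conf_text) if key == "DBDriver"]


if __name__ == "__main__":
    # Shortened copy of the configuration shown above.
    sample = """\
## Option #2 - PostgreSQL (recommended):
#DBDriver=pgsql.ddr
#DBServer=127.0.0.1
## Option #3 - MySQL:
DBDriver=mysql.ddr
DBServer=10.10.11.20
DBName=netxms
"""
    drivers = active_db_drivers(sample)
    if len(drivers) != 1:
        print("Expected exactly one active DBDriver, found:", drivers)
    else:
        print("Active driver:", drivers[0])
```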
MySQL-Server:
- Ubuntu 18.04.3 LTS
- mysql Ver 14.14 Distrib 5.7.28, for Linux (x86_64) using EditLine wrapper
I hope someone knows what the problem is.
Thanks in advance!
If you check the running processes on the NetXMS server, is netxmsd still in the process list at the moment it hangs? I mean, does it actually crash and terminate, or does it hang?
Yes, it is still in the process list at the moment it hangs.
# ps -A | grep netx
21769 ? 08:01:05 netxmsd
# service netxmsd status
● netxmsd.service - NetXMS Server
Loaded: loaded (/lib/systemd/system/netxmsd.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2019-12-31 11:59:29 CET; 1 day 1h ago
Process: 21762 ExecStart=/usr/bin/netxmsd -d (code=exited, status=0/SUCCESS)
Main PID: 21769 (netxmsd)
Tasks: 498 (limit: 4915)
CGroup: /system.slice/netxmsd.service
└─21769 /usr/bin/netxmsd -d
Dez 31 11:59:29 netxms systemd[1]: Starting NetXMS Server...
Dez 31 11:59:29 netxms systemd[1]: netxmsd.service: Can't open PID file /var/run/netxmsd.pid (yet?) after start: No such file or directory
Dez 31 11:59:29 netxms systemd[1]: Started NetXMS Server.
Ok, let's try to get some debug information with this script: https://github.com/netxms/netxms/blob/master/tools/capture_netxmsd_threads.sh
For this to work you need gdb and all relevant netxms-*-dbg packages installed. Was NetXMS installed from packages?
The script will produce an output file in /tmp. Please attach it here.
Yes, NetXMS was installed from packages. I used this (https://www.netxms.org/documentation/adminguide/installation.html#installing-on-debian-or-ubuntu) guide to install NetXMS. Before I started the script, I installed gdb and all relevant NetXMS packages with these commands:
# apt install netxms-*-dbg
# apt install gdb
And after that:
# ./capture_netxmsd_threads.sh
At the moment I executed the script, NetXMS was hanging.
In /tmp I found netxmsd-threads.21769.20200101-212952. I attached this file here.
I forgot to mention that you should wait for NetXMS to hang first and then launch the script. So can you please wait for it to hang on its own and then run the script?
In my previous post I executed the script while NetXMS was already hanging. I had not restarted the netxmsd service since the crash I mentioned in my first post:
2019.12.31 12:48:36.346 *E* [ ] Thread "Poll Manager" does not respond to watchdog thread
2019.12.31 12:48:56.347 *E* [ ] Thread "Syncer Thread" does not respond to watchdog thread
So NetXMS was still hanging.
I don't know if this helps, but I did it again:
I restarted the netxmsd service and waited for NetXMS to hang. Then I launched the script. I have attached the output file here.
Log file:
2020.01.02 10:16:52.647 *E* [ ] Thread "Poll Manager" does not respond to watchdog thread
2020.01.02 10:17:12.647 *E* [ ] Thread "Syncer Thread" does not respond to watchdog thread
I also attached the log file here. As you can see, I launched the script 3 minutes after NetXMS hung. Don't be confused by the time in the filename; it differs because of the one-hour German time offset.
When NetXMS hangs, it doesn't save any changes that I make.
I have found out that there is a difference between a full reboot and restarting only the netxmsd service.
For example, when I make some changes and run service netxmsd restart
my changes are still there.
But when I run reboot
my changes are gone.
And some changes are removed after a few reboots, which is a very big problem.
When I run # nxdbmgr check
after a reboot, I get this error: Container 9190 contains non-existing child 79719. Fix it? (Yes/No/All/Skip) yes
Now my saved nodes are gone.
Yesterday we fixed a bug that caused a deadlock on object access. It may be the root cause of your issue as well. We will publish a new patch release for 3.1 today - please check whether it helps.
Best regards,
Victor
Hi,
thanks for fixing the bug. I will try it as soon as I get the new patch.
Could you please update the packages? :)
The deb build is in progress and will be available within an hour or so.
Best regards,
Victor
Hello,
thank you for the update. I think the bug behind the errors Thread "Poll Manager" does not respond to watchdog thread and Thread "Syncer Thread" does not respond to watchdog thread has been fixed, because there has been no error in more than 8 hours of uptime. But the other bug, which rainerh (https://www.netxms.org/forum/profile/?u=59258) also mentioned here (https://www.netxms.org/forum/configuration/cannot-delete-templates-group/), still exists. Here is a copy of his post:
Hello,
I want to delete 2 (old) Template Groups.
After a reboot, the 2 groups are still there.
How can I delete these groups permanently?
Thank you
I have a new problem, and I think the cause is the same as above:
When I create a new node, it works fine until I reboot the NetXMS server.
After the reboot, the newly created node is deleted.
I attached the netxmsd log file here.
Now I checked the DB writer queue:
# nxadm -i
netxmsd: show queues
I got this output:
netxmsd: show queues
Data collector : 459
DCI cache loader : 0
Template updates : 0
Database writer : 0
Database writer (IData) : 0
Database writer (raw DCI values) : 10321
Event processor : 0
Event log writer : 0
Poller : 0
Node discovery poller : 0
Syslog processing : 0
Syslog writer : 0
Scheduler : 0
When I executed this command multiple times, I noticed that the "Database writer (raw DCI values)" value climbs to approximately 12,000 - 20,000 and then quickly drops to approx. 1,000.
netxmsd: show queues
Data collector : 0
DCI cache loader : 0
Template updates : 0
Database writer : 0
Database writer (IData) : 0
Database writer (raw DCI values) : 1263
Event processor : 0
Event log writer : 0
Poller : 0
Node discovery poller : 0
Syslog processing : 0
Syslog writer : 0
Scheduler : 0
netxmsd: show queues
Data collector : 0
DCI cache loader : 0
Template updates : 0
Database writer : 0
Database writer (IData) : 73
Database writer (raw DCI values) : 14532
Event processor : 0
Event log writer : 0
Poller : 0
Node discovery poller : 0
Syslog processing : 0
Syslog writer : 0
Scheduler : 0
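The queue listings above are easier to compare once they are turned into numbers. Here is a small Python sketch that parses pasted "show queues" output and records the largest size seen per queue. The helper names parse_queues and peak_sizes are invented for this example, and the "name : number" layout is assumed from the output as it appears in this thread.

```python
def parse_queues(text):
    """Parse pasted 'show queues' output into {queue name: size}.

    Lines without a numeric value after the last ':' (for example the
    'netxmsd: show queues' prompt line) are ignored.
    """
    queues = {}
    for line in text.splitlines():
        name, separator, value = line.rpartition(":")
        value = value.strip()
        if separator and value.isdigit():
            queues[name.strip()] = int(value)
    return queues


def peak_sizes(samples):
    """Return the largest size seen for each queue across several samples."""
    peaks = {}
    for text in samples:
        for name, size in parse_queues(text).items():
            peaks[name] = max(size, peaks.get(name, 0))
    return peaks


if __name__ == "__main__":
    # Excerpts from the first and second listing above.
    first = """netxmsd: show queues
Data collector : 459
Database writer (raw DCI values) : 10321
"""
    second = """netxmsd: show queues
Data collector : 0
Database writer (raw DCI values) : 1263
"""
    for name, size in peak_sizes([first, second]).items():
        print(name, size)
```

Sampling the command at a fixed interval and feeding the captures to peak_sizes would show whether the raw DCI values queue keeps draining or grows without bound.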
Hello,
I found one of my problems.
With ImportConfigurationOnStartup set to "Only missing elements" (the default), if I delete the template "Windows" or "Generic UNIX", it comes back on the next boot, because these templates belong to NetXMS and were not created by me.
If I change the value to "Never", they do not come back after a reboot.
Templates that I created myself can be deleted at any time.
Thank you
But one problem remains: when I create a new node, it is deleted about 30 seconds after I reboot NetXMS.
After the reboot I can still see the node in the Management Console, and after some seconds (about 30) the node is removed automatically.
This shows how my nodes get deleted:
1. I create a new Node
(https://i.ibb.co/rtSWjtR/Screenshot-2020-01-09-Net-XMS-admin-127-0-0-1.png) (https://ibb.co/LPWGFPX)
2. You see the node is there
(https://i.ibb.co/zFVhMP0/Screenshot-2020-01-09-Net-XMS-admin-127-0-0-1-1.png) (https://ibb.co/0Kyj7Yd)
3. Now I reboot the server
# reboot
4. For a short time the node is still there
(https://i.ibb.co/B6R6RWT/Screenshot-2020-01-09-Net-XMS-admin-127-0-0-1-2.png) (https://ibb.co/5jVjVps)
5. But after about 1 minute the node disappears
(https://i.ibb.co/3fYDzHR/Screenshot-2020-01-09-Net-XMS-admin-127-0-0-1-3.png) (https://ibb.co/28YDNBZ)
I attached the log of this test here.
You can click on the images for better quality.
I hope someone knows what the problem is.
Thanks in advance!
Hello
I have found the cause, but I cannot work around it.
I have 2 networks with the identical address range 192.168.11.0/24
Router 1 at customer Voelk has 192.168.11.1/24 on interface X1
Router 2 at customer Koelbl has 192.168.11.1/24 on interface X5
Both have the same address and I cannot change it.
2020.01.09 17:43:56.175 *D* [poll.conf ] Primary IP address 192.168.11.1 of node Voelk SonicWALL TZ-500 [1268] found on interface X5 of node Koelbl Service SonicWALL NSA 2650 [80201]
2020.01.09 17:43:56.175 *D* [poll.conf ] Node Voelk SonicWALL TZ-500 [1268] is a duplicate of node Koelbl Service SonicWALL NSA 2650 [80201]
2020.01.09 17:43:56.175 *D* [poll.conf ] Removing node Koelbl Service SonicWALL NSA 2650 [80201] as duplicate
How can I resolve this problem?
Thank you
Rainer
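To find out which nodes the configuration poll removed as duplicates, the poll.conf debug lines quoted above can be filtered with a short script. This is only a sketch: the two regular expressions are based solely on the three log lines in this post, and the function name removed_duplicates is made up for the example.

```python
import re

# Patterns derived from the log lines quoted above (an assumption; other
# NetXMS versions may word these messages differently).
DUPLICATE_RE = re.compile(
    r"Node (.+) \[(\d+)\] is a duplicate of node (.+) \[(\d+)\]"
)
REMOVED_RE = re.compile(r"Removing node (.+) \[(\d+)\] as duplicate")


def removed_duplicates(lines):
    """Return (node_name, node_id) for every node removed as a duplicate."""
    removed = []
    for line in lines:
        match = REMOVED_RE.search(line)
        if match:
            removed.append((match.group(1), int(match.group(2))))
    return removed


if __name__ == "__main__":
    sample = [
        "2020.01.09 17:43:56.175 *D* [poll.conf ] Node Voelk SonicWALL TZ-500 [1268] "
        "is a duplicate of node Koelbl Service SonicWALL NSA 2650 [80201]",
        "2020.01.09 17:43:56.175 *D* [poll.conf ] Removing node "
        "Koelbl Service SonicWALL NSA 2650 [80201] as duplicate",
    ]
    for name, node_id in removed_duplicates(sample):
        print(node_id, name)
    pair = DUPLICATE_RE.search(sample[0])
    if pair:
        print("kept:", pair.group(1), "duplicate:", pair.group(3))
```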
Hi,
one option is to use zones and put those routers into different zones. Another option is to mark the internal interfaces as "exclude from topology". The second option is easier if you are not interested in anything behind those interfaces.
Best regards,
Victor
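Why zones help can be illustrated with a toy model. The sketch below is not NetXMS code; it only shows the general idea that a lookup keyed by IP address alone sees the two routers as the same node, while a lookup keyed by (zone, IP address) keeps them apart. The zone numbers 1 and 2 and the function name find_address_conflicts are hypothetical.

```python
from collections import defaultdict


def find_address_conflicts(nodes, use_zones):
    """Return lookup keys that are claimed by more than one node.

    nodes is an iterable of (name, zone, ip) tuples. With use_zones=False
    the key is the IP address only; with use_zones=True it is (zone, ip).
    This is a simplified illustration, not the real duplicate check.
    """
    index = defaultdict(list)
    for name, zone, ip in nodes:
        key = (zone, ip) if use_zones else ip
        index[key].append(name)
    return {key: names for key, names in index.items() if len(names) > 1}


if __name__ == "__main__":
    # Node names taken from the log above; zone numbers are made up.
    nodes = [
        ("Voelk SonicWALL TZ-500", 1, "192.168.11.1"),
        ("Koelbl Service SonicWALL NSA 2650", 2, "192.168.11.1"),
    ]
    print("without zones:", find_address_conflicts(nodes, use_zones=False))
    print("with zones:   ", find_address_conflicts(nodes, use_zones=True))
```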
Hello Victor,
I tried "exclude from topology" on only one router, but that did not work.
Then I set up some zones. Now it seems to work fine and no more nodes are deleted.
Thank you very much
Rainer