duplicate nodes problem | can't save Nodes

Started by Woody, December 31, 2019, 04:50:00 PM

Woody

Hi there,
I moved NetXMS from an old server to two new servers. I did the move with the nxdbmgr migrate command and with the help of these instructions. Everything worked, except that NetXMS always crashes after some time (approx. 30 to 60 minutes, sometimes longer). After that I can no longer log in to NetXMS: the client stays stuck at "Objekte synchronisieren" (synchronizing objects) until the timeout.

However, if you are already logged in at the time of the crash, you are not kicked out and stay logged in.

The following error messages can then be seen in the NetXMS log file:


2019.12.31 12:48:36.346 *E* [                   ] Thread "Poll Manager" does not respond to watchdog thread
2019.12.31 12:48:56.347 *E* [                   ] Thread "Syncer Thread" does not respond to watchdog thread
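
For anyone who wants to pull these entries out of a long log file, here is a minimal sketch. The regular expression is an assumption based only on the two lines quoted above (other NetXMS versions may format the message differently), and find_watchdog_errors is a hypothetical helper name, not part of NetXMS.

```python
import re
from datetime import datetime

# Pattern assumed from the two log lines quoted in this post.
WATCHDOG_RE = re.compile(
    r'^(?P<ts>\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \*E\* \[[^\]]*\] '
    r'Thread "(?P<thread>[^"]+)" does not respond to watchdog thread'
)

def find_watchdog_errors(lines):
    """Return a list of (timestamp, thread name) for each watchdog error line."""
    hits = []
    for line in lines:
        m = WATCHDOG_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y.%m.%d %H:%M:%S.%f")
            hits.append((ts, m.group("thread")))
    return hits

# The two lines from the log excerpt above.
sample = [
    '2019.12.31 12:48:36.346 *E* [                   ] Thread "Poll Manager" does not respond to watchdog thread',
    '2019.12.31 12:48:56.347 *E* [                   ] Thread "Syncer Thread" does not respond to watchdog thread',
]

for ts, thread in find_watchdog_errors(sample):
    print(ts.isoformat(), thread)
```

To run it against the real log, the lines could be read from the file configured as LogFile (here /var/log/netxmsd) instead of the sample list.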


Information about my configuration:

There is a NetXMS server and a separate database server. The database server is set with DBServer= in /etc/netxmsd.conf. Previously (on the old server) NetXMS and the database ran on the same machine.

Old server:

  • Debian GNU/Linux 8 (jessie)
  • NetXMS Server Version 2.2.13 Build 9518 (2.2.13)

New server:
NetXMS-Server:

  • Ubuntu 18.04.3 LTS
  • NetXMS Server Version 3.1.261 Build 3.1-261-ga5f9451ddf

NetXMS configuration file at /etc/netxmsd.conf:

## Logging
# Log file name
LogFile=/var/log/netxmsd

# Increase logging verbosity, 0 (only errors) to 9 (verbose debug)
DebugLevel=7

## Database configuration.
## Uncomment and setup ONE section.

## Option #1 - SQLite (for test installations only):
#DBDriver=sqlite.ddr
#DBName=/var/lib/netxms/netxms.db

## Option #2 - PostgreSQL (recommended):
#DBDriver=pgsql.ddr
#DBServer=127.0.0.1
#DBName=netxms
#DBLogin=netxms
#DBPassword=netxms

## Option #3 - MySQL:
DBDriver=mysql.ddr
DBServer=10.10.11.20
DBName=netxms
DBLogin=******
DBPassword=********************************

## Option #4 - Oracle:
#DBDriver=oracle.ddr
#DBServer=//127.0.0.1:1521/ORCL # Instant Client connection string or SID
#DBLogin=netxms
#DBPassword=netxms

## Option #5 - unixODBC/FreeTDS:
#DBDriver=odbc.ddr
#DBServer=NETXMS_DSN
#DBLogin=netxms
#DBPassword=netxms


MySQL-Server:

  • Ubuntu 18.04.3 LTS
  • mysql  Ver 14.14 Distrib 5.7.28, for Linux (x86_64) using  EditLine wrapper

I hope someone knows what the problem is.
Thanks in advance!

Filipp Sudanov

If you check the running processes on the NetXMS server, is netxmsd present in the list at the moment it hangs? I mean, does it actually crash and terminate, or does it hang?

Woody

Yes, it is present in the list of processes at the moment it hangs.


# ps -A | grep netx
21769 ?        08:01:05 netxmsd



# service netxmsd status
● netxmsd.service - NetXMS Server
   Loaded: loaded (/lib/systemd/system/netxmsd.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2019-12-31 11:59:29 CET; 1 day 1h ago
  Process: 21762 ExecStart=/usr/bin/netxmsd -d (code=exited, status=0/SUCCESS)
Main PID: 21769 (netxmsd)
    Tasks: 498 (limit: 4915)
   CGroup: /system.slice/netxmsd.service
           └─21769 /usr/bin/netxmsd -d

Dez 31 11:59:29 netxms systemd[1]: Starting NetXMS Server...
Dez 31 11:59:29 netxms systemd[1]: netxmsd.service: Can't open PID file /var/run/netxmsd.pid (yet?) after start: No such file or directory
Dez 31 11:59:29 netxms systemd[1]: Started NetXMS Server.

Filipp Sudanov

Ok, let's try to get some debug information with this script: https://github.com/netxms/netxms/blob/master/tools/capture_netxmsd_threads.sh

In order for this to work you should have gdb and all relevant netxms-*-dbg packages installed. Is netxms installed from packages?

The script will produce an output file in /tmp. Please attach it here.

Woody

Yes, NetXMS was installed from packages. I used this guide to install NetXMS. Before I started the script, I installed gdb and all relevant NetXMS packages with these commands:
# apt install netxms-*-dbg
# apt install gdb

And after that:
# ./capture_netxmsd_threads.sh

At the moment I executed the script, NetXMS was hanging.
In /tmp I found netxmsd-threads.21769.20200101-212952. I have attached this file here.

Filipp Sudanov

I forgot to mention that you should wait for NetXMS to hang first and then launch the script. So can you please wait for it to hang on its own and then run the script?

Woody

In my previous post I executed the script while NetXMS was hanging. I have not restarted the netxmsd service since the crash I mentioned in my first post.

2019.12.31 12:48:36.346 *E* [                   ] Thread "Poll Manager" does not respond to watchdog thread
2019.12.31 12:48:56.347 *E* [                   ] Thread "Syncer Thread" does not respond to watchdog thread

So NetXMS still hangs.

I don't know if this helps, but I tried it anyway:
I have now restarted the netxmsd service and waited for NetXMS to hang. Then I launched the script. I have attached the resulting file here.
Log file:

2020.01.02 10:16:52.647 *E* [                   ] Thread "Poll Manager" does not respond to watchdog thread
2020.01.02 10:17:12.647 *E* [                   ] Thread "Syncer Thread" does not respond to watchdog thread


I also attached the log file here. As you can see, I launched the script 3 minutes after NetXMS hung. Don't be confused by the time in the filename; it differs because of the German time offset of one hour.

Woody

I have found out that there is a difference between a full reboot and restarting only the netxmsd service.
For example, when I make some changes and run service netxmsd restart, my changes are still there.
But when I reboot the machine, my changes are gone.
Some changes are removed after a few reboots, and that is a very big problem.
When I run # nxdbmgr check after a reboot, I get this error: Container 9190 contains non-existing child 79719. Fix it? (Yes/No/All/Skip) yes
Now my saved nodes are gone.

Victor Kirhenshtein

Yesterday we fixed a bug that causes a deadlock on object access. It may be the root cause of your issue as well. We will publish a new patch release for 3.1 today; please check whether it helps.

Best regards,
Victor

Woody

Hi,
thanks for fixing the bug. I will try it as soon as I get the new patch.


Victor Kirhenshtein

The deb build is in progress and will be available within an hour or so.

Best regards,
Victor

Woody

Hello,
thank you for the update. I think the bug with 'Thread "Poll Manager" does not respond to watchdog thread' and 'Thread "Syncer Thread" does not respond to watchdog thread' has been fixed, because there have been no errors in more than 8 hours of uptime. But the other bug, which rainerh also mentioned here, still exists. Here is a copy of his post:

Hello,
I want to delete 2 (old) Template Groups.
After a reboot, the 2 groups are still there.
How can I delete these groups permanently?

Thank you

I have a new problem, and I think the cause is the same as above:
When I create a new node, it works fine until I reboot the NetXMS server.
After the reboot, the newly created node is deleted.


I attached the netxmsd log file here.
I have also checked the database writer queues:

# nxadm -i
netxmsd: show queues

I got this output:

netxmsd: show queues
Data collector                   : 459
DCI cache loader                 : 0
Template updates                 : 0
Database writer                  : 0
Database writer (IData)          : 0
Database writer (raw DCI values) : 10321
Event processor                  : 0
Event log writer                 : 0
Poller                           : 0
Node discovery poller            : 0
Syslog processing                : 0
Syslog writer                    : 0
Scheduler                        : 0


When I execute this command multiple times, I notice that the "Database writer (raw DCI values)" value climbs to approximately 12,000 to 20,000 and then quickly drops to approx. 1,000.


netxmsd: show queues
Data collector                   : 0
DCI cache loader                 : 0
Template updates                 : 0
Database writer                  : 0
Database writer (IData)          : 0
Database writer (raw DCI values) : 1263
Event processor                  : 0
Event log writer                 : 0
Poller                           : 0
Node discovery poller            : 0
Syslog processing                : 0
Syslog writer                    : 0
Scheduler                        : 0

netxmsd: show queues
Data collector                   : 0
DCI cache loader                 : 0
Template updates                 : 0
Database writer                  : 0
Database writer (IData)          : 73
Database writer (raw DCI values) : 14532
Event processor                  : 0
Event log writer                 : 0
Poller                           : 0
Node discovery poller            : 0
Syslog processing                : 0
Syslog writer                    : 0
Scheduler                        : 0
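
To follow the queue sizes over several runs instead of comparing them by eye, the output could be parsed with something like the sketch below. It assumes the "name : number" layout shown in the snapshots above; parse_queues is a hypothetical helper, and the values are the ones from the three snapshots in this post.

```python
def parse_queues(output):
    """Parse 'show queues' console output into a {queue name: size} dict."""
    queues = {}
    for line in output.splitlines():
        # Split on the last colon; queue names contain no colon in the
        # output shown above.
        name, sep, value = line.rpartition(":")
        value = value.strip()
        if sep and value.isdigit():
            queues[name.strip()] = int(value)
    return queues

# Shortened copy of the first snapshot above.
snapshot = """\
netxmsd: show queues
Data collector                   : 459
Database writer                  : 0
Database writer (IData)          : 0
Database writer (raw DCI values) : 10321
"""

queues = parse_queues(snapshot)
print(queues["Database writer (raw DCI values)"])

# Raw DCI queue sizes from the three snapshots, in the order posted.
raw_dci_sizes = [10321, 1263, 14532]
print("min:", min(raw_dci_sizes), "max:", max(raw_dci_sizes))
```

The prompt echo line ("netxmsd: show queues") is skipped because the part after its colon is not a number.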

rainerh

Hello,

I found one of my problems.
When ImportConfigurationOnStartup is set to "Only missing elements" (the default) and I delete the template "Windows" or "Generic UNIX", it comes back on the next boot, because these templates ship with NetXMS and were not created by me.
If I change the value to "Never", they do not come back after a reboot.
Templates that I created myself can be deleted at any time.

Thank you

But one problem remains: when I create a new node, it is deleted about 30 seconds after I reboot NetXMS.
After the reboot I can still see the node in the Management Console, and after some seconds (about 30) it is removed automatically.