Massive Packet Loss and Server Restarts on 4.0

Started by Storm-Donovan, February 14, 2022, 02:27:00 PM


Storm-Donovan

On Friday morning I upgraded our installation from 3.9 to 4.0.  Ever since, the daemon restarts every hour on the hour and we have massive packet loss reported within NetXMS.  If I ping the devices from the command line, there is no packet loss, but NetXMS reports over half of our devices are unreachable by ICMP at any given time.

Everything worked fine on 3.9. I'm at my wit's end and am about ready to restore from Thursday night's backup.  What can I do to resolve this without going back to 3.9?

Storm-Donovan

A bit about our installation:

We're running on Debian 11.2 with MariaDB 10.6.  The system is a container on a Proxmox v7 host.  Below are our statistics from Proxmox; as you can see, we are not stressing the CPU or RAM.  We have a 100GB partition for the SQL DB and it's only using 32GB of space, so we haven't run out of hard drive space.

Storm-Donovan

Here's the output from journalctl with debug level 9.  The crashes happen every hour on the hour, and there's nothing here except that the daemon gets a SEGV (signal 11) and then restarts.

Feb 15 09:00:30 netxms netxmsd[104008]: Node::connectToAgent(dhcp1 [101832]): already connected
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 755 [Warning] Aborted connection 755 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 731 [Warning] Aborted connection 731 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 741 [Warning] Aborted connection 741 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 739 [Warning] Aborted connection 739 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 734 [Warning] Aborted connection 734 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 756 [Warning] Aborted connection 756 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 742 [Warning] Aborted connection 742 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 740 [Warning] Aborted connection 740 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 738 [Warning] Aborted connection 738 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 752 [Warning] Aborted connection 752 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 746 [Warning] Aborted connection 746 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 749 [Warning] Aborted connection 749 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 737 [Warning] Aborted connection 737 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 744 [Warning] Aborted connection 744 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms systemd[1]: netxmsd.service: Main process exited, code=killed, status=11/SEGV
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 750 [Warning] Aborted connection 750 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 735 [Warning] Aborted connection 735 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 748 [Warning] Aborted connection 748 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 743 [Warning] Aborted connection 743 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 732 [Warning] Aborted connection 732 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 745 [Warning] Aborted connection 745 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 754 [Warning] Aborted connection 754 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 757 [Warning] Aborted connection 757 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 751 [Warning] Aborted connection 751 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 733 [Warning] Aborted connection 733 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms systemd[1]: netxmsd.service: Failed with result 'signal'.
Feb 15 09:00:50 netxms systemd[1]: netxmsd.service: Consumed 38min 56.319s CPU time.
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 753 [Warning] Aborted connection 753 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 758 [Warning] Aborted connection 758 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 747 [Warning] Aborted connection 747 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 736 [Warning] Aborted connection 736 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 760 [Warning] Aborted connection 760 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 759 [Warning] Aborted connection 759 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)
Feb 15 09:00:50 netxms systemd[1]: netxmsd.service: Scheduled restart job, restart counter is at 19.
Feb 15 09:00:50 netxms systemd[1]: Stopped NetXMS Server.
Feb 15 09:00:50 netxms systemd[1]: netxmsd.service: Consumed 38min 56.319s CPU time.
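
With this much MariaDB noise around the crash, it can help to filter the journal down to the lines that matter. The following is a minimal Python sketch, not part of NetXMS: the function name `summarize` and the regular expressions are assumptions written against the line formats shown in the excerpt above. It pulls out the systemd "Main process exited" events, the last restart counter, and a count of aborted DB connections.

```python
import re

# Patterns are assumptions based on the journal lines pasted in this thread.
CRASH_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ systemd\[\d+\]: "
    r"(?P<unit>\S+): Main process exited, code=killed, "
    r"status=(?P<num>\d+)/(?P<sig>\w+)"
)
RESTART_RE = re.compile(r"restart counter is at (?P<count>\d+)")


def summarize(lines):
    """Return crash events, the last restart counter, and aborted-connection count."""
    crashes = []
    restart_counter = None
    aborted = 0
    for line in lines:
        crash = CRASH_RE.match(line)
        if crash:
            crashes.append((crash["ts"], crash["unit"], crash["sig"]))
        restart = RESTART_RE.search(line)
        if restart:
            restart_counter = int(restart["count"])
        if "Aborted connection" in line:
            aborted += 1
    return {
        "crashes": crashes,
        "restart_counter": restart_counter,
        "aborted_connections": aborted,
    }


# Four lines copied from the excerpt above, used as sample input.
SAMPLE = [
    "Feb 15 09:00:30 netxms netxmsd[104008]: Node::connectToAgent(dhcp1 [101832]): already connected",
    "Feb 15 09:00:50 netxms mariadbd[990]: 2022-02-15  9:00:50 755 [Warning] Aborted connection 755 to db: 'netxms' user: 'netxms' host: 'localhost' (Got an error reading communication packets)",
    "Feb 15 09:00:50 netxms systemd[1]: netxmsd.service: Main process exited, code=killed, status=11/SEGV",
    "Feb 15 09:00:50 netxms systemd[1]: netxmsd.service: Scheduled restart job, restart counter is at 19.",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

The aborted MariaDB connections all carry the same timestamp as the SEGV, which is consistent with them being a side effect of the daemon dying rather than the cause.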

Storm-Donovan

Alright, so, we just updated MariaDB to v10.7 to see if it was a MariaDB issue.  We started the daemon at 09:58:57 and it ran for 1m33s before SEGV/11 and restart.  It's doing something every hour on the hour that's causing it to crash.

Storm-Donovan

Here's the sigsegv for the daemon.

[10890669.800354] SNMPTrapRecv[4081047]: segfault at 2562 ip 00007f0d34a3ad04 sp 00007f0d071ed880 error 4 in libnxcore.so.40.0.0[7f0d34856000+220000]
[10890669.800382] Code: 49 89 e8 4c 89 e9 48 89 44 24 18 be 05 00 00 00 31 c0 48 8d 15 ed 2b 11 00 48 8d 3d 9e 16 11 00 e8 41 02 e2 ff 48 8b 74 24 60 <80> be 62 25 00 00 00 0f 85 0f 06 00 00 48 8d 35 08 65 1c 00 80 3e
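
The kernel line above carries a few useful numbers: the faulting address, the instruction pointer, and the base and size of the mapping the instruction pointer fell in. As a rough sketch (the helper name `parse_segfault` is made up for illustration, and the regular expression assumes the x86-64 Linux message format shown above), the fields can be split out and the instruction pointer turned into an offset inside that mapping:

```python
import re

# Assumed format: "<comm>[<pid>]: segfault at <addr> ip <ip> sp <sp>
#                  error <code> in <object>[<base>+<size>]"
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Parse a kernel segfault line; return None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    return {
        "thread": m["comm"],
        "fault_addr": int(m["addr"], 16),
        "error": int(m["err"]),
        "object": m["obj"],
        # Offset of the faulting instruction from the start of the mapping.
        "offset": ip - base,
    }


LINE = (
    "[10890669.800354] SNMPTrapRecv[4081047]: segfault at 2562 "
    "ip 00007f0d34a3ad04 sp 00007f0d071ed880 error 4 "
    "in libnxcore.so.40.0.0[7f0d34856000+220000]"
)

if __name__ == "__main__":
    info = parse_segfault(LINE)
    print(info["thread"], info["object"], hex(info["offset"]), hex(info["fault_addr"]))
```

For this line the offset works out to 0x1e4d04 inside the libnxcore.so.40.0.0 mapping. That offset is relative to the mapping the kernel printed, which is not necessarily the start of the file, so treat it as a hint rather than an exact symbol address. The small fault address (0x2562), together with the `<80> be 62 25 00 00` bytes in the Code line (a byte compare at offset 0x2562 from a register), looks like a read through a null object pointer, though only a backtrace with debug symbols can confirm that.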

Storm-Donovan

We're at 5, almost 6 days since our NetXMS installation worked.  Is there anyone that can assist with these issues?

Victor Kirhenshtein

Hi, do you have a core dump from the crash? If not, could you run netxmsd under gdb until it crashes? If you can, make sure that you have the following packages installed:

netxms-server-dbg
netxms-dbdrv-mariadb-dbg

Then run it as follows in a terminal:

gdb netxmsd
At the gdb prompt, enter
run -D1

When the server crashes, you will get the gdb prompt again. Enter the command
bt

and share the output.

Best regards,
Victor
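
If sitting at an interactive gdb prompt for up to an hour is inconvenient, the same steps can probably be run unattended. The sketch below is an assumption on top of the instructions above, not something from the NetXMS documentation: it builds a gdb command line using gdb's standard `-batch`, `-ex` and `--args` options so that gdb runs `netxmsd -D1`, prints `bt` when the process stops on the crash, and then exits. The helper names are hypothetical.

```python
import shlex
import subprocess


def gdb_backtrace_argv(binary="netxmsd", program_args=("-D1",)):
    """Build a gdb command line that runs the program and prints a backtrace."""
    return [
        "gdb", "-batch",   # exit after the -ex commands have been processed
        "-ex", "run",      # start the program; returns when it stops or exits
        "-ex", "bt",       # backtrace of the thread that received the signal
        "--args", binary, *program_args,
    ]


def run_under_gdb(log_path="netxmsd-gdb.log"):
    """Run the server under gdb and save everything gdb prints to log_path."""
    with open(log_path, "w") as log:
        return subprocess.run(
            gdb_backtrace_argv(), stdout=log, stderr=subprocess.STDOUT
        ).returncode


if __name__ == "__main__":
    # Only print the command here; call run_under_gdb() to actually start it.
    print(shlex.join(gdb_backtrace_argv()))
```

The log file name is arbitrary. Stop the regular netxmsd service first so that two server instances do not run against the same database.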

Storm-Donovan

[New Thread 0x7fffa662f700 (LWP 223140)]
--Type <RET> for more, q to quit, c to continue without paging--RET

Thread 97 "SNMPTrapRecv" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffcd0f6700 (LWP 221049)]
ProcessTrap (pdu=0x7fff899f7f00, srcAddr=..., zoneUIN=0, srcPort=62369, snmpTransport=0x7fffcc61d000, localEngine=<optimized out>, isInformRq=false) at ../../../src/server/include/nms_objects.h:3363
3363    ../../../src/server/include/nms_objects.h: No such file or directory.
(gdb) bt
#0  ProcessTrap (pdu=0x7fff899f7f00, srcAddr=..., zoneUIN=0, srcPort=62369, snmpTransport=0x7fffcc61d000, localEngine=<optimized out>, isInformRq=false) at ../../../src/server/include/nms_objects.h:3363
#1  0x00007ffff7dcd653 in SNMPTrapReceiver () at snmptrap.cpp:932
#2  0x00007ffff7cb6146 in ThreadCreate_Wrapper_0 (function=<optimized out>) at ../../../include/nms_threads.h:504
#3  0x00007ffff7a53ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007ffff779adef in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95