nxAgent, crash, sqlite (4.3.3,linux,x86_64,static)

Started by sodalist, April 04, 2023, 12:21:11 PM

Previous topic - Next topic

sodalist

Hello,

Problem description:
- agent crashes immediately after trying to vacuum local sqlite database

Environment description:
- LXC container running (buster) on Proxmox running (bullseye) all up2date @04.04.2023
- kernel v6.2.6
- agent nxagent-4.3.3-linux-x86_64-static

Etc:
- sqlite database was created by nxagent-4.3.3-linux-x86_64-static
- filesystem zfs

---
# /opt/netxms/bin/nxagentd -v
NetXMS Core Agent Version 4.3.3 Build 4.3-416-g5911d89 (NON-UNICODE)

# sha256sum /opt/netxms/bin/nxagentd
2c208fb7eb7d8b4a99babad8bf2a11717d666652d8a0a5edb78c1e9ff62f1568  /opt/netxms/bin/nxagentd

# gdb /opt/netxms/bin/nxagentd

GNU gdb (Debian 8.2.1-2+b3) 8.2.1
...
(gdb) run -f -D 9
Starting program: /opt/netxms/bin/nxagentd -f -D 9
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
...
2023.04.04 08:59:26.011 *I* [config            ] Main configuration file: /etc/nxagentd.conf
2023.04.04 08:59:26.011 *I* [config            ] Configuration tree:
2023.04.04 08:59:26.011 *I* [config            ] config
2023.04.04 08:59:26.011 *I* [config            ]  +- CORE
2023.04.04 08:59:26.011 *I* [config            ]      +- ServerConnection
2023.04.04 08:59:26.011 *I* [config            ]      |    value: <removed>
2023.04.04 08:59:26.011 *I* [config            ]      +- Servers
2023.04.04 08:59:26.011 *I* [config            ]      |    value: <removed>
2023.04.04 08:59:26.011 *I* [config            ]      +- ControlServers
2023.04.04 08:59:26.011 *I* [config            ]      |    value: <removed>
2023.04.04 08:59:26.011 *I* [config            ]      +- MasterServers
2023.04.04 08:59:26.011 *I* [config            ]          value: <removed>
...
2023.04.04 08:59:26.014 *D* [crypto            ] Validating ciphers
2023.04.04 08:59:26.014 *D* [crypto            ]    AES-256 enabled
2023.04.04 08:59:26.014 *D* [crypto            ]    Blowfish-256 enabled
2023.04.04 08:59:26.014 *D* [crypto            ]    IDEA enabled
2023.04.04 08:59:26.014 *D* [crypto            ]    3DES enabled
2023.04.04 08:59:26.014 *D* [crypto            ]    AES-128 enabled
2023.04.04 08:59:26.014 *D* [crypto            ]    Blowfish-128 enabled
2023.04.04 08:59:26.014 *I* [crypto            ] Crypto library initialized (OpenSSL 1.1.1i  8 Dec 2020)
2023.04.04 08:59:26.014 *D* [dload              ] DLOpen: file="(null)", module=0x7ffff7ffe190
2023.04.04 08:59:26.014 *D* [dload              ] DLGetSymbolAddr: module=0x7ffff7ffe190, symbol=drvAPIVersion, address=0xbae490
2023.04.04 08:59:26.014 *D* [dload              ] DLGetSymbolAddr: module=0x7ffff7ffe190, symbol=drvName, address=0xbae488
2023.04.04 08:59:26.015 *D* [dload              ] DLGetSymbolAddr: module=0x7ffff7ffe190, symbol=drvCallTable, address=0xbae480
2023.04.04 08:59:26.015 *D* [db.drv.sqlite      ] SQLite version 3.36.0
2023.04.04 08:59:26.015 *I* [db.drv            ] Database driver ":self:" loaded and initialized successfully
2023.04.04 08:59:26.015 *D* [db.conn            ] DBConnect: server=(null) db=/opt/netxms/var/lib/netxms/nxagentd.db login=(null) schema=(null)
2023.04.04 08:59:26.015 *D* [db.conn            ] New DB connection opened: handle=0xc4afe0
2023.04.04 08:59:26.016 *D* [db.query          ] DB Library: global long running query threshold set to 250
2023.04.04 08:59:26.020 *I* [db.agent          ] Local database opened successfully
...
[New Thread 0x7ffff7856700 (LWP 1161774)]
.
2023.04.04 08:59:28.122 *D* [db.query          ] COMMIT TRANSACTION successful (level 0)
...
2023.04.04 08:59:33.123 *D* [db.query          ] BEGIN TRANSACTION successful (level 1)
2023.04.04 08:59:33.124 *D* [db.query          ] COMMIT TRANSACTION successful (level 0)
2023.04.04 08:59:33.124 *D* [dc                ] Database writer: 61 records inserted
...
2023.04.04 08:59:54.533 *D* [dc                ] 0 SNMP targets received from server <removed>
2023.04.04 08:59:54.533 *D* [dc                ] 0 proxies received from server <removed>
2023.04.04 08:59:54.533 *D* [dc                ] 62 data collection elements received from server <removed> (extended data: YES)
2023.04.04 08:59:54.534 *D* [dc                ] Data collection for server <removed> reconfigured
2023.04.04 08:59:54.534 *D* [comm.cs.64        ] Sending message CMD_REQUEST_COMPLETED (ID 6; size 32; uncompressed)
2023.04.04 08:59:54.534 *D* [comm.cs.64        ] Outgoing message dump:
  ** 000000 | 00 1D 50 00 00 00 00 20 00 00 00 06 00 00 00 01 | ..P.... ........
  ** 000010 | 00 00 00 1C 00 00 00 00 00 00 00 00 00 00 00 00 | ................
  ** code=0x001D (CMD_REQUEST_COMPLETED) version=5 flags=0x0000 id=6 size=32 numFields=1
  ** 000000: [    28] INT32      0
...
2023.04.04 08:59:54.564 *D* [comm.cs.64        ] Message dump:
  ** 000000 | 00 BA 50 00 00 00 00 10 00 00 00 07 00 00 00 00 | ..P.............
  ** code=0x00BA (CMD_ENABLE_AGENT_TRAPS) version=5 flags=0x0000 id=7 size=16 numFields=0

2023.04.04 08:59:54.564 *D* [comm.cs.64        ] Received message CMD_ENABLE_AGENT_TRAPS (7)
2023.04.04 08:59:54.565 *D* [notifications      ] RegisterSessionForNotifications: starting background sync for server <removed>
2023.04.04 08:59:54.565 *D* [comm.cs.64        ] Sending message CMD_REQUEST_COMPLETED (ID 7; size 32; uncompressed)
2023.04.04 08:59:54.565 *D* [comm.cs.64        ] Outgoing message dump:
  ** 000000 | 00 1D 50 00 00 00 00 20 00 00 00 07 00 00 00 01 | ..P.... ........
  ** 000010 | 00 00 00 1C 00 00 00 00 00 00 00 00 00 00 00 00 | ................
  ** code=0x001D (CMD_REQUEST_COMPLETED) version=5 flags=0x0000 id=7 size=32 numFields=1
  ** 000000: [    28] INT32      0

2023.04.04 08:59:54.565 *D* [notifications      ] Notification synchronization started
2023.04.04 08:59:54.565 *D* [notifications      ] Vacuum local database
free(): invalid pointer

Thread 5 "$COMM/WRK" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff7856700 (LWP 1161774)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) thread
[Current thread is 5 (Thread 0x7ffff7856700 (LWP 1161774))]
(gdb) backtrace
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7ad6535 in __GI_abort () at abort.c:79
#2  0x00007ffff7b2d648 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff7c372a0 "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007ffff7b33d6a in malloc_printerr (str=str@entry=0x7ffff7c3544e "free(): invalid pointer") at malloc.c:5359
#4  0x00007ffff7b3559c in _int_free (av=<optimized out>, p=<optimized out>, have_lock=<optimized out>) at malloc.c:4180
#5  0x0000000000547068 in sqlite3_free (p=0x2) at sqlite3.c:28180
#6  0x000000000056dc48 in sqlite3DbFree (p=<optimized out>, db=0x0) at sqlite3.c:28239
#7  sqlite3DeleteColumnNames (db=0x0, pTable=<optimized out>, pTable=<optimized out>) at sqlite3.c:47566
#8  0x000000000056f9b2 in deleteTable (db=0x0, pTable=0xc90cc8) at sqlite3.c:113163
#9  0x000000000056fb19 in sqlite3DeleteTable (db=<optimized out>, pTable=<optimized out>) at sqlite3.c:113180
#10 0x00000000005701ab in sqlite3SchemaClear (p=0xc3d618) at sqlite3.c:118313
#11 0x00000000005702bb in sqlite3ResetAllSchemasOfConnection (db=0xc3cd18) at sqlite3.c:113070
#12 0x00000000005c95c3 in sqlite3RunVacuum (pzErrMsg=pzErrMsg@entry=0xc47390, db=db@entry=0xc3cd18, iDb=<optimized out>, pOut=0x0) at sqlite3.c:143149
#13 0x00000000005a644b in sqlite3VdbeExec (p=<optimized out>) at sqlite3.c:93837
#14 0x00000000005ae200 in sqlite3Step (p=<optimized out>) at sqlite3.c:84869
#15 sqlite3_step (pStmt=0xc472e8) at sqlite3.c:19390
#16 0x00000000005aef85 in sqlite3_exec (db=0xc3cd18, zSql=<optimized out>, xCallback=0x0, pArg=0x0, pzErrMsg=0x0) at sqlite3.c:126198
#17 0x000000000052ea20 in QueryInternal (conn=conn@entry=0xc3d1b0, query=query@entry=0x7fffd4002af0 "VACUUM", errorText=errorText@entry=0x7ffff7854890 L"") at sqlite/sqlite.cpp:261
#18 0x000000000052eab3 in Query (connection=0xc3d1b0, query=<optimized out>, errorText=0x7ffff7854890 L"") at sqlite/sqlite.cpp:287
#19 0x00000000004b0412 in DBQueryEx (hConn=hConn@entry=0xc4afe0, query=query@entry=0x7e189e "VACUUM", errorText=errorText@entry=0x7ffff78558d0 "") at session.cpp:245
#20 0x00000000004b067f in DBQuery (hConn=hConn@entry=0xc4afe0, query=query@entry=0x7e189e "VACUUM") at session.cpp:291
#21 0x0000000000495098 in NotificationSync (session=std::shared_ptr<class CommSession> (use count 3, weak count 1) = {...}) at nproc.cpp:256
#22 0x0000000000495a0c in __ThreadPoolExecute_SharedPtr_Wrapper<CommSession> (arg=0x7fffc8015050) at ../../../include/nms_threads.h:993
#23 0x00000000004e4c4d in ProcessSerializedRequests (data=0x7fffc8079400) at tp.cpp:487
#24 0x00000000004e4eb9 in WorkerThread (threadInfo=0xc63de0) at tp.cpp:199
#25 0x00000000004e6b2a in ThreadCreate_Wrapper_1<WorkerThreadInfo*> (context=0xc63e00) at ../../include/nms_threads.h:539
#26 0x00007ffff7fa1fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#27 0x00007ffff7bad06f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)

thank you,

Br, Aleš

Victor Kirhenshtein

Hi,

from the dump it looks like bug within SQLite. I updated bundled SQLite to latest version (3.41.2) - if this will not help, we will need to investigate further.
It also worth trying to use matching agent package from deb repository - it will have dependence on SQLite version built for that particular distribution.

Best regards,
Victor

sodalist

Hi,

Thank you.

We will update agent and monitor systems after release. This error was only occurrence of failure that was
repeatable 8/10 cases. I did not want to report issues with agent from deb repository as we are noticing
multiple errors (segfaults) after v4.0.x -> v4.3.0 + debian (stretch,buster,bullseye) latest updates.
We are still unable to isolate/repeat most of them. (this is not critc, just did not want to throw everything
at once and clear any issues that could be due our infrastructure beforehand)

Br,

Aleš

sodalist

Hello,

I apologize for late reply. Excluding potential infrastructure issues took longer than anticipated.
After analysis of all crashes, issue was found to be associated with single debian/bullseye package
update:

old: lxcfs=4.0.12-pve1
new: lxcfs=5.0.3-pve1

Further analysis was not done.

Upgrade of package causes segmentation fault, general protection faults in multiple nxagentd linked libraries
that unfortunately manifest in multiple nxagentd failures, consequential local sqlite database corruption
and associated 2nd level of errors. We observed crashes in: libnxsqlite.so, libc-2.31.so, libnetxms.so.43...

Downgrading proxmox v7 lxcfs package to 4.0.12-pve1 resolved issues.

thank you for your help and apologies for any inconvenience caused.

Br, Aleš