Hi,
I upgraded from 4.2.395 to 4.4.2 and now the server crashes on a strange SQL query. We had this query on 4.2.395 as well, but it never caused a crash.
Any idea what this query is?
2023.09.14 16:04:19.220 *E* [db.drv ] SQL query failed (Query = "UPDATE nodes SET primary_ip=?,primary_name=?,snmp_port=?,capabilities=?,snmp_version=?,community=?,agent_port=?,secret=?,snmp_oid=?,uname=?,agent_version=?,platform_name=?,poller_node_id=?,zone_guid=?,proxy_node=?,snmp_proxy=?,icmp_proxy=?,required_polls=?,use_ifxtable=?,usm_auth_password=?,usm_priv_password=?,usm_methods=?,snmp_sys_name=?,bridge_base_addr=?,down_since=?,driver_name=?,rack_image_front=?,rack_position=?,rack_height=?,physical_container_id=?,boot_time=?,agent_cache_mode=?,snmp_sys_contact=?,snmp_sys_location=?,last_agent_comm_time=?,syslog_msg_count=?,snmp_trap_count=?,node_type=?,node_subtype=?,ssh_login=?,ssh_password=?,ssh_key_id=?,ssh_port=?,ssh_proxy=?,port_rows=?,port_numbering_scheme=?,agent_comp_mode=?,tunnel_id=?,lldp_id=?,fail_time_snmp=?,fail_time_agent=?,fail_time_ssh=?,rack_orientation=?,rack_image_rear=?,agent_id=?,agent_cert_subject=?,hypervisor_type=?,hypervisor_info=?,icmp_poll_mode=?,chassis_placement_config=?,vendor=?,product_code=?,product_name=?,product_version=?,serial_number=?,cip_device_type=?,cip_status=?,cip_state=?,eip_proxy=?,eip_port=?,hardware_id=?,cip_vendor_code=?,agent_cert_mapping_method=?,agent_cert_mapping_data=?,snmp_engine_id=?,snmp_context_engine_id=?,syslog_codepage=?,snmp_codepage=?,ospf_router_id=?,mqtt_proxy=?,modbus_proxy=?,modbus_tcp_port=?,modbus_unit_id=? WHERE id=?"): [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]String or binary data would be truncated.
2023.09.14 16:04:19.220 *D* [event.proc ] EVENT SYS_DB_QUERY_FAILED [52] at {0} (ID:57617035 F:0x0001 S:4 TAGS:"") FROM netxmsd: Database query failed (Query: UPDATE nodes SET primary_ip=?,primary_name=?,snmp_port=?,capabilities=?,snmp_version=?,community=?,agent_port=?,secret=?,snmp_oid=?,uname=?,agent_version=?,platform_name=?,poller_node_id=?,zone_guid=?,proxy_node=?,snmp_proxy=?,icmp_proxy=?,required_polls=?,use_ifxtable=?,usm_auth_password=?,usm_priv_password=?,usm_methods=?,snmp_sys_name=?,bridge_base_addr=?,down_since=?,driver_name=?,rack_image_front=?,rack_position=?,rack_height=?,physical_container_id=?,boot_time=?,agent_cache_mode=?,snmp_sys_contact=?,snmp_sys_location=?,last_agent_comm_time=?,syslog_msg_count=?,snmp_trap_count=?,node_type=?,node_subtype=?,ssh_login=?,ssh_password=?,ssh_key_id=?,ssh_port=?,ssh_proxy=?,port_rows=?,port_numbering_scheme=?,agent_comp_mode=?,tunnel_id=?,lldp_id=?,fail_time_snmp=?,fail_time_agent=?,fail_time_ssh=?,rack_orientation=?,rack_image_rear=?,agent_id=?,agent_cert_subject=?,hypervisor_type=?,hypervisor_info=?,icmp_poll_mode=?,chassis_placement_config=?,vendor=?,product_code=?,product_name=?,product_version=?,serial_number=?,cip_device_type=?,cip_status=?,cip_state=?,eip_proxy=?,eip_port=?,hardware_id=?,cip_vendor_code=?,agent_cert_mapping_method=?,agent_cert_mapping_data=?,snmp_engine_id=?,snmp_context_engine_id=?,syslog_codepage=?,snmp_codepage=?,ospf_router_id=?,mqtt_proxy=?,modbus_proxy=?,modbus_tcp_port=?,modbus_unit_id=? WHERE id=?; Error: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]String or binary data would be truncated.)
Did the server crash, or did the query just fail? If it crashed, please share the stack trace from the core file, or just send us the core file and we'll check it.
Running the server in Docker makes getting a core dump a challenge.
I was about to install the server on a clean temporary VM to test it without Docker when I noticed an issue.
Why does netxms still depend on libssl1.1 and not libssl3?
Quote from: MarcusH on September 15, 2023, 10:03:20 AM
Why does netxms still depend on libssl1.1 and not libssl3?
Because Debian 11 ships with OpenSSL 1.1:
root@da539131bae5:/# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 11 (bullseye)
Release: 11
Codename: bullseye
root@da539131bae5:/# apt-cache search libssl
libssl-ocaml - OCaml bindings for OpenSSL (runtime)
libssl-ocaml-dev - OCaml bindings for OpenSSL
libssl-dev - Secure Sockets Layer toolkit - development files
libssl-doc - Secure Sockets Layer toolkit - development documentation
libssl1.1 - Secure Sockets Layer toolkit - shared libraries
libssl-utils-clojure - library for SSL certificate management on the JVM
root@da539131bae5:/# apt-cache show libssl-dev|grep Version
Version: 1.1.1n-0+deb11u5
Version: 1.1.1n-0+deb11u4
There is no reference to libssl1.1 in the official packages for Debian 12:
root@d483efbfc136:~# ldd /usr/bin/netxmsd | grep ssl
libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x00007ffb9ae88000)
root@d483efbfc136:~# dpkg -l netxms-server
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===================-============-============-=================================
ii netxms-server:amd64 4.4.2-1 amd64 meta package
Thanks, this was my bad: I imported an old sources list for NetXMS.
As for the issue, I reverted to 4.2.395, which at least makes the server stable. This strange query is still printed to the server console, and it is the only thing printed there. Is there any way to figure out what creates this query?
Quote from: MarcusH on September 15, 2023, 10:03:20 AM
Running server on docker making core dump a challenge.
It's rather straightforward if you can control the host's core pattern:
sysctl -w kernel.core_pattern='/core/core.%e.%p.%t'
docker volume create core_vol
docker run --ulimit core=-1 --mount source=core_vol,target=/core container
sysctl -w kernel.core_pattern=core # reset it back to the default
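To know which file name to look for in the volume, here is a minimal sketch of how the kernel expands the three specifiers used in the pattern above. It assumes the usual meanings from the Linux core(5) manual page: %e is the process or thread comm value (at most 15 characters), %p is the PID, and %t is the dump time in seconds since the epoch. The function name and the sample PID and timestamp are made up for illustration.

```python
import re

def expand_core_pattern(pattern, comm, pid, epoch):
    """Expand %e, %p, %t and %% the way the kernel would for a core file name."""
    values = {
        "e": comm[:15],   # comm value is truncated to 15 characters
        "p": str(pid),    # PID of the dumped process
        "t": str(epoch),  # time of dump, seconds since the epoch
        "%": "%",         # literal percent sign
    }
    return re.sub(r"%([ept%])", lambda m: values[m.group(1)], pattern)

# Hypothetical PID and timestamp, only to show the shape of the resulting name.
print(expand_core_pattern("/core/core.%e.%p.%t", "netxmsd", 4242, 1694772259))
```

Because %e can reflect a thread name rather than the executable name, the file in /core may not always start with core.netxmsd.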
Quote from: Alex Kirhenshtein on September 15, 2023, 10:31:49 AM
Quote from: MarcusH on September 15, 2023, 10:03:20 AM
Running server on docker making core dump a challenge.
It's rather straightforward if you can control the host's core pattern:
sysctl -w kernel.core_pattern='/core/core.%e.%p.%t'
docker volume create core_vol
docker run --ulimit core=-1 --mount source=core_vol,target=/core container
sysctl -w kernel.core_pattern=core # reset it back to the default
I have a core dump, but it gives a lot of reference errors and shows no stack trace.
Quote from: MarcusH on September 15, 2023, 10:36:51 AM
I have a core dump, but it gives a lot of reference errors and shows no stack trace.
Have you installed the netxms-dbg package? It contains all debug symbols for the product.
Quote from: MarcusH on September 15, 2023, 10:30:30 AM
Any way to figure out what creates this query?
It's executed by the object syncer thread, which saves node changes back into the database.
From the error message it's unclear which field is not accepted by the SQL server; this link might help with tracing it: https://stackoverflow.com/a/62905763
Quote from: Alex Kirhenshtein on September 15, 2023, 10:38:19 AM
Quote from: MarcusH on September 15, 2023, 10:36:51 AM
I have a core dump, but it gives a lot of reference errors and shows no stack trace.
Have you installed the netxms-dbg package? It contains all debug symbols for the product.
I have not; I will have a look at this on the test VM.
Quote from: Alex Kirhenshtein on September 15, 2023, 10:40:07 AM
Quote from: MarcusH on September 15, 2023, 10:30:30 AM
Any way to figure out what creates this query?
It's executed by the object syncer thread, which saves node changes back into the database.
From the error message it's unclear which field is not accepted by the SQL server; this link might help with tracing it: https://stackoverflow.com/a/62905763
I thought all the =? values were there for log obfuscation, but even the trace on the SQL server only shows =?. I guess that would cause this issue, since a lot of the columns are int and it tries to write ? into them.
Any idea what could generate this type of node update?
Ah, what I see in the trace is the "INSERT INTO event_log" for the issue; that explains the "=?".
Think I found it:
exec sp_prepexec @p1 output,N'@P1 varchar(15),@P2 varchar(15),@P3 int,@P4 int,@P5 int,@P6 varchar(7),@P7 int,@P8 varchar(1),@P9 varchar(567)
@P9 is declared as varchar(567); P9 is snmp_oid, and that column holds at most 255 characters.
This has been strange. I removed it and will see if the issue is gone.
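The check done by hand above can be sketched in a few lines: map each @Pn from the trace to the n-th column of the SET list and flag any declared varchar size above the column limit. This is only an illustration, not part of NetXMS. It assumes the declared size reflects the length of the bound value; the only limit taken from the thread is snmp_oid = 255 (other limits would have to come from the actual schema), and the query is shortened to the first nine columns to match the captured declarations.

```python
import re

# First nine columns of the UPDATE statement from the server log.
QUERY = ("UPDATE nodes SET primary_ip=?,primary_name=?,snmp_port=?,capabilities=?,"
         "snmp_version=?,community=?,agent_port=?,secret=?,snmp_oid=? WHERE id=?")

# Parameter declarations as they appear in the SQL Server trace.
DECLS = ("@P1 varchar(15),@P2 varchar(15),@P3 int,@P4 int,@P5 int,"
         "@P6 varchar(7),@P7 int,@P8 varchar(1),@P9 varchar(567)")

# Assumption: only this limit is known from the thread.
COLUMN_LIMITS = {"snmp_oid": 255}

def set_columns(query):
    """Return the column names of the SET list, in placeholder order."""
    set_list = re.search(r"\bSET\s+(.*?)\s+WHERE\b", query, re.S).group(1)
    return [part.split("=")[0].strip() for part in set_list.split(",")]

def oversized(query, decls, limits):
    """List (column, declared_size, limit) for parameters larger than the column."""
    columns = set_columns(query)
    found = []
    for number, size in re.findall(r"@P(\d+)\s+n?varchar\((\d+)\)", decls):
        index = int(number) - 1
        if index >= len(columns):
            continue  # parameter belongs to the WHERE clause, not the SET list
        limit = limits.get(columns[index])
        if limit is not None and int(size) > limit:
            found.append((columns[index], int(size), limit))
    return found

print(oversized(QUERY, DECLS, COLUMN_LIMITS))
```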
I removed the node that caused the faulty query on poll, and now the 4.4.2 server is stable.
I noticed that there is another thread where the server also restarted on "SQL query failed". Is this intended behavior now, or a bug?
Scratch that, the server still crashes, now without any obvious error.
I might not have time to trace this error and may revert to 4.2.395 again.
I had some time today to look at this issue, and it is very elusive.
My knowledge of debugging is limited, and the core dump that is saved shows nothing, just a few addresses that point to ??
I have tried running with debug level 6 to see if anything shows up in the logs. There are no errors there, but I see a trend.
Line before "Log file opened" is always "NetworkDeviceDriver::getInterfaces"
example:
2023.09.20 09:45:19.570 *D* [ndd.common ] NetworkDeviceDriver::getInterfaces(0x7f0dae84c740): completed, ifList=0x7f0db5c81300
2023.09.20 09:45:21.466 *I* [logger ] Log file opened (rotation policy 2, max size 16777216)
2023.09.20 09:45:21.466 *I* [startup ] Starting NetXMS server version 4.4.2 build tag 4.4-568-g3a9a8aa557
Searching for 0x7f0dae84c740 in the log, I found which node it was:
2023.09.20 09:45:18.722 *D* [node.iface ] Node::getInterfaceList(node=TPFIBSW02 [10402]): calling driver (useIfXTable=true)
2023.09.20 09:45:18.722 *D* [ndd.common ] NetworkDeviceDriver::getInterfaces(0x7f0dae84c740,true)
I started the server and quickly unmanaged TPFIBSW02, and now the server has been running for a while without crashing.
Since nothing is written to the log even at level 6, I guess a core dump is needed to find out why it crashes when polling this node for interfaces, but there I am at a loss.
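The manual search described above could be scripted roughly as follows. This is a sketch, not a NetXMS tool: it looks at the line right before each "Log file opened" entry and resolves the driver handle back to a node name. It assumes that a Node::getInterfaceList line is followed by the matching NetworkDeviceDriver::getInterfaces call, which may not hold when several pollers interleave in the log; the function and pattern names are made up.

```python
import re

LOG = """\
2023.09.20 09:45:18.722 *D* [node.iface ] Node::getInterfaceList(node=TPFIBSW02 [10402]): calling driver (useIfXTable=true)
2023.09.20 09:45:18.722 *D* [ndd.common ] NetworkDeviceDriver::getInterfaces(0x7f0dae84c740,true)
2023.09.20 09:45:19.570 *D* [ndd.common ] NetworkDeviceDriver::getInterfaces(0x7f0dae84c740): completed, ifList=0x7f0db5c81300
2023.09.20 09:45:21.466 *I* [logger ] Log file opened (rotation policy 2, max size 16777216)
"""

CALL = re.compile(r"Node::getInterfaceList\(node=(.+?) \[(\d+)\]\)")
START = re.compile(r"NetworkDeviceDriver::getInterfaces\((0x[0-9a-f]+),")
DONE = re.compile(r"NetworkDeviceDriver::getInterfaces\((0x[0-9a-f]+)\): completed")

def suspects_before_restart(lines):
    """Return (node name, node id) for the last driver call before each restart."""
    handle_to_node = {}   # driver handle -> node seen just before the call
    last_node = None
    previous = ""
    suspects = []
    for line in lines:
        if (m := CALL.search(line)):
            last_node = (m.group(1), int(m.group(2)))
        elif (m := START.search(line)) and last_node:
            handle_to_node[m.group(1)] = last_node
        elif "Log file opened" in line:
            # A restart: inspect the line logged immediately before it.
            m = DONE.search(previous) or START.search(previous)
            if m:
                suspects.append(handle_to_node.get(m.group(1)))
        previous = line
    return suspects

print(suspects_before_restart(LOG.splitlines()))
```

Run against the full server log (for example via open(path).read().splitlines()), a node that shows up repeatedly in the result is a good candidate for unmanaging while the crash is investigated.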
Do you have any non-Ethernet interfaces there with MAC address longer than 6 bytes?
Quote from: Alex Kirhenshtein on September 20, 2023, 02:29:27 PM
Do you have any non-Ethernet interfaces there with MAC address longer than 6 bytes?
Not that I can see.