Server Consistently reaching "Too many client sessions open"

Started by michaelk, April 13, 2022, 04:08:04 AM

michaelk

Hi,

Currently, every couple of days we have to restart our NetXMS server because we can no longer connect to it by any method (desktop client, web client). When I check the logs in /var/log/netxmsd, I see the following line repeated numerous times:
" 2022.04.13 10:00:44.397 *W* [client.session     ] Too many client sessions open - unable to accept new client connection"

When I run the nxadm command "show sessions", I see 240 sessions listed as "DESKTOP <not logged in> [n/a]". This would sometimes be 255; however, Grafana usually occupies around 15 sessions (and is still not able to get data).

I have attached the output from the "show sessions" command and the netxmsd log file from when the server was started.

Versions
Desktop Client - 4.0.2227
NetXMS Server (Linux/Ubuntu) - 4.0.2227
radensolutions-netxms-datasource - 1.2.3
Grafana - 8.3.0
Java Version:
    openjdk 11.0.14.1 2022-02-08
    OpenJDK Runtime Environment (build 11.0.14.1+1-Ubuntu-0ubuntu1.18.04)
    OpenJDK 64-Bit Server VM (build 11.0.14.1+1-Ubuntu-0ubuntu1.18.04, mixed mode, sharing)

Filipp Sudanov

We will add the IP address to the list of sessions in a future version.
For now, you can set the debug level to 5. This can be done on the fly in Tools->Server Console:
debug 5

The messages that we want to see are these:
2022.04.13 13:10:01.442 *D* [                   ] SocketListener/Clients: Incoming connection from 127.0.0.1
2022.04.13 13:10:01.442 *D* [                   ] SocketListener/Clients: Connection from 127.0.0.1 accepted

So you can grep your server log for "SocketListener/Clients".

This way we will see from where these connections are coming.
Do you have any other integrations besides Grafana (nxshell, WEB API)?
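
For a busy log, the grep can be taken one step further by counting incoming connections per source address. The following is only a minimal sketch: it assumes the debug level 5 message format quoted above, and the function name count_incoming is made up for illustration.

```python
import re
from collections import Counter

# Matches the listener message quoted above, e.g.
# "... SocketListener/Clients: Incoming connection from 127.0.0.1"
INCOMING = re.compile(r"SocketListener/Clients: Incoming connection from (\S+)")


def count_incoming(lines):
    """Return a Counter mapping source address -> number of incoming connections."""
    counts = Counter()
    for line in lines:
        match = INCOMING.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The two sample lines from this post; only the "Incoming connection" line is counted.
sample_lines = [
    "2022.04.13 13:10:01.442 *D* [                   ] SocketListener/Clients: Incoming connection from 127.0.0.1",
    "2022.04.13 13:10:01.442 *D* [                   ] SocketListener/Clients: Connection from 127.0.0.1 accepted",
]
print(count_incoming(sample_lines))  # Counter({'127.0.0.1': 1})
```

To run it against a real log, pass an open file object to count_incoming instead of the sample list; the log location depends on the installation.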

michaelk

Hi Filipp,

Thank you for your reply and advice. I will update the debug level now and will have to wait a bit for those logs to come through, as it can sometimes happen randomly when the queue fills.

I will post back here in the coming days when I find something. To answer your question: no, currently it is only the Grafana integration and the WebUI.

michaelk

Hi Team,

Sorry for the delay; as this problem only happens every couple of days to weeks, it is annoying to reproduce. I did some testing and was able to narrow it down to the netxms-websvc.war web service, which I use to build Grafana dashboards. I tested with no war files, then with the root only, and then with netxms-websvc only; the problem came back when netxms-websvc was reintroduced. Has anyone else experienced this? We would really like to continue using our Grafana dashboards.

Filipp Sudanov

Can you check in the Audit Log in NetXMS whether there are any unsuccessful logins happening?

michaelk

Hi Filipp,

I checked the audit log and found no failed login attempts. It appears that the grafana user (used by Grafana over the API) logs in successfully and then logs out 5-10 minutes later. I believe this would be normal behavior?

I have also attached a screenshot in case it helps (I was unable to attach the image, as I received an "entity too large" error from nginx, so I uploaded it to imgur instead: https://imgur.com/a/aHEaSMp)

A little more information, in case that helps as well:

  • When this max session problem happens, NetXMS seems to stop collecting metrics, as there is missing data in our graphs once we restart it
  • If I enter "nxadm -i" and try to kill the dead sessions with "kill %SESSION_ID%", it says the session was killed, yet "show sessions" shows that the session is still there. So it appears these sessions can't be killed either; the only way around it is to restart the server

Filipp Sudanov

Yes, Grafana users logging out 5-10 minutes later is normal.

Also, sessions that are marked as DESKTOP but without any user could be connections to port 4701 (e.g. just a telnet to that port would produce such a session). You can use netstat to check from which IPs these connections may be coming.

If data collection is interrupted, it could be that something actually hangs in the netxmsd process. The other possibility is some issue with the DB.
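
As a rough sketch of the netstat check, the output of `netstat -tn` can be grouped by remote address and TCP state for local port 4701. This assumes the usual Linux column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); the function name peers_on_port and the sample addresses are invented for illustration.

```python
from collections import Counter


def peers_on_port(netstat_output, port=4701):
    """Count (remote address, TCP state) pairs for connections to a local port.

    Expects the text printed by `netstat -tn` on Linux.
    """
    counts = Counter()
    suffix = f":{port}"
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # skip the header lines and anything that is not TCP
        local, remote, state = fields[3], fields[4], fields[5]
        if local.endswith(suffix):
            # rpartition drops only the trailing port, so IPv6 addresses survive
            counts[(remote.rpartition(":")[0], state)] += 1
    return counts


# Invented sample in netstat -tn layout (documentation addresses, not real data).
sample_netstat = """Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 192.0.2.10:4701         192.0.2.20:51234        ESTABLISHED
tcp        0      0 192.0.2.10:4701         192.0.2.20:51240        CLOSE_WAIT
tcp        0      0 192.0.2.10:22           192.0.2.30:40000        ESTABLISHED
"""
for (address, state), n in peers_on_port(sample_netstat).items():
    print(address, state, n)
```

A large number of entries from one address, or many connections stuck in a single state, would point at the client worth investigating.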

I've attached a script that collects debug information. If you observe the issue, please run this script and share the results.

The second script is here: https://raw.githubusercontent.com/netxms/netxms/master/tools/capture_netxmsd_threads.sh
It collects information about netxmsd threads. When the issue happens, please run this script 3 times at 20-60 second intervals. It will produce files in /tmp. It requires gdb on the system.

There's a new release. We suggest upgrading, since if a bug is found, it will be fixed in the new version anyway.

P.S. The nx-collect-server-diag script uses nxadm, which requires authentication in 4.1. The script has not been updated yet to do that. You can turn this off by changing the server parameter Server.Security.RestrictLocalConsoleAccess
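
The repeated capture can be scripted so the interval is not forgotten while the server is misbehaving. This is a sketch only: the helper name capture_repeatedly is hypothetical, and the command in the usage comment assumes the script was saved to the current directory and needs no arguments, which should be checked against the script itself.

```python
import subprocess
import time


def capture_repeatedly(command, runs=3, interval=30,
                       run=subprocess.run, sleep=time.sleep):
    """Run `command` `runs` times, pausing `interval` seconds between runs.

    `run` and `sleep` are injectable so the loop can be exercised without
    actually starting a process or waiting.
    """
    results = []
    for i in range(runs):
        results.append(run(command, check=False))
        if i < runs - 1:
            sleep(interval)  # no pause needed after the final capture
    return results


# Hypothetical usage (assumed path, no arguments):
# capture_repeatedly(["sh", "./capture_netxmsd_threads.sh"], runs=3, interval=30)
```

The default interval of 30 seconds sits inside the 20-60 second range mentioned above.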



Nem0

Version 4.3.7 - problem not resolved

Altlinux 10 + PostgreSQL 14 + TimeScaleDB 


Nem0

 41  | 10.177.2.4                     | 127.0.0.1            | WEB     | NONE     | DashBoards           | nxjclient/4.3.7 (Linux 5.10.170-std-def-alt1; libnxcl 4.3.7)
 67  | 10.177.2.4                     | 127.0.0.1            | WEB     | NONE     | DashBoards           | nxjclient/4.3.7 (Linux 5.10.170-std-def-alt1; libnxcl 4.3.7)
 76  | 10.177.2.4                     | 127.0.0.1            | WEB     | NONE     | DashBoards           | nxjclient/4.3.7 (Linux 5.10.170-std-def-alt1; libnxcl 4.3.7)
 57  | 10.177.2.4                     | 127.0.0.1            | WEB     | NONE     | DashBoards           | nxjclient/4.3.7 (Linux 5.10.170-std-def-alt1; libnxcl 4.3.7)
 40  | 10.177.2.4                     | 127.0.0.1            | WEB     | NONE     | DashBoards           | nxjclient/4.3.7 (Linux 5.10.170-std-def-alt1; libnxcl 4.3.7)
 95  | 10.177.2.4                     | 127.0.0.1            | WEB     | NONE     | DashBoards           | nxjclient/4.3.7 (Linux 5.10.170-std-def-alt1; libnxcl 4.3.7)
 11  | 10.177.2.4                     | 127.0.0.1            | WEB     | NONE     | DashBoards           | nxjclient/4.3.7 (Linux 5.10.170-std-def-alt1; libnxcl 4.3.7)
 52  | 10.177.2.4                     | 127.0.0.1            | WEB     | NONE     | DashBoards           | nxjclient/4.3.7 (Linux 5.10.170-std-def-alt1; libnxcl 4.3.7)
 50  | 10.177.2.4                     | 127.0.0.1            | WEB     | NONE     | DashBoards           | nxjclient/4.3.7 (Linux 5.10.170-std-def-alt1; libnxcl 4.3.7)

99 active sessions


MOBILE DEVICE SESSIONS
 ID  | CIPHER   | USER                 | CLIENT
-----+----------+----------------------+----------------------------

0 active sessions

Could a timestamp for the session open time, and the idle time in seconds, be added for open sessions?
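
For a long listing like the one above, a small tally per client type and user makes it easier to see which account holds the sessions. This is a minimal sketch that assumes the seven-column, pipe-separated rows shown in this post; the header row is not included here, so treating the fourth field as the client type and the sixth as the user is an assumption, and tally_sessions is a made-up name.

```python
from collections import Counter


def tally_sessions(listing):
    """Count sessions per (client type, user) in pasted session-list rows."""
    counts = Counter()
    for line in listing.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Keep only data rows: seven fields with a numeric session ID first.
        if len(fields) != 7 or not fields[0].isdigit():
            continue
        counts[(fields[3], fields[5])] += 1  # assumed positions: type, user
    return counts


# Two rows copied from the listing above, plus lines that must be skipped.
sample_listing = """
 67  | 10.177.2.4                     | 127.0.0.1            | WEB     | NONE     | DashBoards           | nxjclient/4.3.7 (Linux 5.10.170-std-def-alt1; libnxcl 4.3.7)
 76  | 10.177.2.4                     | 127.0.0.1            | WEB     | NONE     | DashBoards           | nxjclient/4.3.7 (Linux 5.10.170-std-def-alt1; libnxcl 4.3.7)

99 active sessions
 ID  | CIPHER   | USER                 | CLIENT
"""
print(tally_sessions(sample_listing))  # Counter({('WEB', 'DashBoards'): 2})
```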

Victor Kirhenshtein

We recently made a fix (in 4.4.x) for stalled client sessions from the web API; that may fix this issue too.

Best regards,
Victor