NetXMS crashes, out of sockets

Started by Borgso, November 13, 2016, 09:05:51 AM


Borgso

Our server has been unstable since upgrading to the 2.0.x branch.
We have also been adding more nodes over the same period, so the problem could exist in older versions too.

Server Setup:
OS: Ubuntu 14.04.05-LTS (ESXi)
CPU: 4x E5-2690 @ 2.90GHz
Mem: 8GB

Server stats:
Total number of objects:     10490
Number of monitored nodes:   3737
Number of collectable DCIs:  33514

Server config:
PollerThreadPoolBaseSize: 300
PollerThreadPoolMaxSize: 800
NumberOfDataCollectors: 800


We have been discussing this on Telegram, and last night one of our NOC staff had some time to debug it and found this:

-- Quote --
It seems that NetXMS doesn't handle more than 1024 sockets very well: it crashes when it tries to retransmit data after the send buffers have filled up on an fd equal to or larger than 1024.

_opt_netxms206_bin_netxmsd.0.crash

(gdb) bt
#0  0x00007ff8ffe79c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ff8ffe7d028 in __GI_abort () at abort.c:89
#2  0x00007ff8ffeb62a4 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7ff8fffc2113 "*** %s ***: %s terminated\n") at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007ff8fff4dbbc in __GI___fortify_fail (msg=<optimized out>, msg@entry=0x7ff8fffc20aa "buffer overflow detected") at fortify_fail.c:38
#4  0x00007ff8fff4ca90 in __GI___chk_fail () at chk_fail.c:28
#5  0x00007ff8fff4db07 in __fdelt_chk (d=<optimized out>) at fdelt_chk.c:25
#6  0x00007ff9004603bb in SendEx (hSocket=1149, data=data@entry=0x7ff8b226e580, len=1016, flags=flags@entry=0, mutex=0x7ff8b4172160) at tools.cpp:1084
#7  0x00007ff90097dd1f in ClientSession::sendMessage (this=0x7ff8b418a910, msg=<optimized out>) at session.cpp:1588
#8  0x00007ff900980060 in ClientSession::sendAllObjects (this=this@entry=0x7ff8b418a910, pRequest=pRequest@entry=0x7ff8b02cbcf0) at session.cpp:2294
#9  0x00007ff90099f08d in ClientSession::processingThread (this=0x7ff8b418a910) at session.cpp:798
#10 0x00007ff90099f219 in ClientSession::processingThreadStarter (pArg=<optimized out>) at session.cpp:215
#11 0x00007ff900210184 in start_thread (arg=0x7ff7c5359700) at pthread_create.c:312
#12 0x00007ff8fff3d37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


2016-11-12_22-34

(gdb) bt
#0  0x00007f47494aac37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f47494ae028 in __GI_abort () at abort.c:89
#2  0x00007f47494e72a4 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7f47495f3113 "*** %s ***: %s terminated\n") at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007f474957ebbc in __GI___fortify_fail (msg=<optimized out>, msg@entry=0x7f47495f30aa "buffer overflow detected") at fortify_fail.c:38
#4  0x00007f474957da90 in __GI___chk_fail () at chk_fail.c:28
#5  0x00007f474957eb07 in __fdelt_chk (d=<optimized out>) at fdelt_chk.c:25
#6  0x00007f4749a913bb in SendEx (hSocket=1180, data=data@entry=0x7f470036ab00, len=424, flags=flags@entry=0, mutex=0x7f470c262ff0) at tools.cpp:1084
#7  0x00007f4749faed1f in ClientSession::sendMessage (this=0x7f470c17dce0, msg=<optimized out>) at session.cpp:1588
#8  0x00007f4749faf0a5 in ClientSession::updateThread (this=0x7f470c17dce0) at session.cpp:658
#9  0x00007f4749faf2b9 in ClientSession::updateThreadStarter (pArg=<optimized out>) at session.cpp:224
#10 0x00007f4749841184 in start_thread (arg=0x7f4627cc1700) at pthread_create.c:312
#11 0x00007f474956e37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


2016-11-12_22-46.crash

(gdb) bt
#0  0x00007f77edd68c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f77edd6c028 in __GI_abort () at abort.c:89
#2  0x00007f77edda52a4 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7f77edeb1113 "*** %s ***: %s terminated\n") at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007f77ede3cbbc in __GI___fortify_fail (msg=<optimized out>, msg@entry=0x7f77edeb10aa "buffer overflow detected") at fortify_fail.c:38
#4  0x00007f77ede3ba90 in __GI___chk_fail () at chk_fail.c:28
#5  0x00007f77ede3cb07 in __fdelt_chk (d=<optimized out>) at fdelt_chk.c:25
#6  0x00007f77ee34f3bb in SendEx (hSocket=1125, data=data@entry=0x7f77980f3380, len=424, flags=flags@entry=0, mutex=0x7f77a004f3e0) at tools.cpp:1084
#7  0x00007f77ee86cd1f in ClientSession::sendMessage (this=0x7f77a0239b70, msg=<optimized out>) at session.cpp:1588
#8  0x00007f77ee86d0a5 in ClientSession::updateThread (this=0x7f77a0239b70) at session.cpp:658
#9  0x00007f77ee86d2b9 in ClientSession::updateThreadStarter (pArg=<optimized out>) at session.cpp:224
#10 0x00007f77ee0ff184 in start_thread (arg=0x7f76c3891700) at pthread_create.c:312
#11 0x00007f77ede2c37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111



excerpt of SendEx(SOCKET hSocket, const void *data, size_t len, int flags, MUTEX mutex) in tools.cpp:

do
{
retry:
#ifdef MSG_NOSIGNAL
   nRet = send(hSocket, ((char *)data) + (len - nLeft), nLeft, flags | MSG_NOSIGNAL);
#else
   nRet = send(hSocket, ((char *)data) + (len - nLeft), nLeft, flags);
#endif
   if (nRet <= 0)
   {
      if ((WSAGetLastError() == WSAEWOULDBLOCK)
#ifndef _WIN32
          || (errno == EAGAIN)
#endif
         )
      {
         // Wait until socket becomes available for writing
         struct timeval tv;
         fd_set wfds;

         tv.tv_sec = 60;
         tv.tv_usec = 0;
         FD_ZERO(&wfds);
         FD_SET(hSocket, &wfds);
         nRet = select(SELECT_NFDS(hSocket + 1), NULL, &wfds, NULL, &tv);
         if ((nRet > 0) || ((nRet == -1) && (errno == EINTR)))
            goto retry;
      }
      break;
   }
   nLeft -= nRet;
} while (nLeft > 0);


Line 1084 of tools.cpp (frame #6 in all three backtraces) is FD_SET(hSocket, &wfds);


To quote "man select":
Quote
       An  fd_set is a fixed size buffer.  Executing FD_CLR() or FD_SET() with
       a value of fd that is negative or is equal to or larger than FD_SETSIZE
       will result in undefined behavior.


hSocket is 1149, 1180 and 1125 in our crashdumps.


FD_SETSIZE on Linux is 1024:
Quote
    /usr/include/sys/select.h:#define   FD_SETSIZE      __FD_SETSIZE
    /usr/include/bits/typesizes.h:#define   __FD_SETSIZE        1024
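
To make the limit concrete, here is a minimal sketch of a range check that could sit in front of any FD_SET() call. FitsInFdSet is a hypothetical helper name made up for this post, not something that exists in the NetXMS sources; it only illustrates the condition from the man page quote above.

```cpp
#include <sys/select.h>

// Hypothetical helper (not part of NetXMS): returns true only when fd
// can legally be passed to FD_SET()/FD_CLR()/FD_ISSET().
// Anything negative or >= FD_SETSIZE is undefined behaviour for those macros.
static bool FitsInFdSet(int fd)
{
   return (fd >= 0) && (fd < FD_SETSIZE);
}
```

With FD_SETSIZE at 1024, the descriptors from our crash dumps (1149, 1180 and 1125) would all fail this check. A guard like this only turns the abort into a failure the caller has to handle; it does not make select() usable for such descriptors.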


Also consider the conditions for a crash. send() must first fail with WSAEWOULDBLOCK (or EAGAIN on non-Windows builds), meaning that the send buffers are full. This can happen if the network is saturated or if the other side simply doesn't acknowledge the received data. Only then, and only if the socket fd is equal to or larger than 1024, is FD_SET() called with an out-of-range descriptor, which leads to this crash. This would explain the inconsistent behaviour and the perceived correlation with external factors.
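
One possible way around the limit, as a rough sketch only: wait for writability with poll() instead of select(). poll() takes the descriptor as a plain int in a struct pollfd, so there is no fixed-size fd_set to overflow. WaitWritable and SendAll below are hypothetical names invented for this example; this is not the actual NetXMS code or fix, it assumes a POSIX system, and it ignores the mutex and the Windows branch of the real SendEx().

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

// MSG_NOSIGNAL is not available everywhere; fall back to 0 as the excerpt does.
#ifdef MSG_NOSIGNAL
static const int EXTRA_SEND_FLAGS = MSG_NOSIGNAL;
#else
static const int EXTRA_SEND_FLAGS = 0;
#endif

// Hypothetical helper: block until fd is writable or the timeout expires.
// Returns >0 when poll() reports an event, 0 on timeout, -1 on error.
// Works for any valid descriptor value, including those >= FD_SETSIZE.
static int WaitWritable(int fd, int timeoutMs)
{
   struct pollfd pfd;
   pfd.fd = fd;
   pfd.events = POLLOUT;
   pfd.revents = 0;

   int rc;
   do
   {
      rc = poll(&pfd, 1, timeoutMs);
   } while ((rc == -1) && (errno == EINTR));  // restart if interrupted by a signal
   return rc;
}

// Hypothetical send loop with the same shape as the excerpt above:
// keep sending until everything is written, waiting up to 60 seconds
// whenever the send buffer is full. Returns len on success, -1 on failure.
static ssize_t SendAll(int fd, const void *data, size_t len, int flags)
{
   assert((data != nullptr) || (len == 0));

   size_t left = len;
   while (left > 0)
   {
      ssize_t rc = send(fd, static_cast<const char *>(data) + (len - left),
                        left, flags | EXTRA_SEND_FLAGS);
      if (rc > 0)
      {
         left -= static_cast<size_t>(rc);
         continue;
      }
      if ((rc < 0) && ((errno == EAGAIN) || (errno == EWOULDBLOCK)))
      {
         // Send buffer full: wait for space, then retry the send
         if (WaitWritable(fd, 60000) > 0)
            continue;
      }
      return -1;  // timeout or hard error
   }
   return static_cast<ssize_t>(len);
}
```

If poll() reports an error condition instead of writability, the loop simply retries send(), which then fails with the real error and ends the loop.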