Netxms Agent Cache Problem (timeout synchronization)

Started by innerfire, September 05, 2018, 11:59:31 AM

Previous topic - Next topic

innerfire

Hello, Everyone

I Post here because we can't find the solution of particular problem with cache synchronization (that we enabled before a while).
We use Server/Client version 2.2.8

This is the output log from nxagent log file:


2018.09.05 09:03:06.068 *D* ReconciliationThread: timeout on bulk send
2018.09.05 09:03:19.267 *D* [CS-0(1086)] Server capabilities: IPv6: yes; bulk reconciliation: yes; compression: yes
2018.09.05 09:03:27.419 *D* [CS-0(1087)] Server capabilities: IPv6: yes; bulk reconciliation: yes; compression: yes
2018.09.05 09:03:35.535 *D* [CS-0(1088)] Server capabilities: IPv6: yes; bulk reconciliation: yes; compression: yes
2018.09.05 09:03:36.815 *D* [CS-1(1335)] Server capabilities: IPv6: yes; bulk reconciliation: yes; compression: yes
2018.09.05 09:03:36.948 *D* [CS-3(1336)] Server capabilities: IPv6: yes; bulk reconciliation: yes; compression: yes
2018.09.05 09:03:37.107 *D* [CS-4(1337)] Server capabilities: IPv6: yes; bulk reconciliation: yes; compression: yes
2018.09.05 09:03:37.107 *D* [CS-5(1338)] Server capabilities: IPv6: yes; bulk reconciliation: yes; compression: yes
2018.09.05 09:04:06.121 *D* ReconciliationThread: timeout on bulk send
2018.09.05 09:04:49.216 *D* [CS-0(1930)] Server capabilities: IPv6: yes; bulk reconciliation: yes; compression: yes
2018.09.05 09:04:57.492 *D* [CS-0(1943)] Server capabilities: IPv6: yes; bulk reconciliation: yes; compression: yes
2018.09.05 09:05:05.404 *D* [CS-0(1944)] Server capabilities: IPv6: yes; bulk reconciliation: yes; compression: yes
2018.09.05 09:05:06.174 *D* ReconciliationThread: timeout on bulk send


And here is the configuration file оf nxagentd.conf:


#Log File
LogFile=/var/log/nxagentd
LogHistorySize = 6
#IP white list, can contain multiple records separated by comma.
#CIDR notation supported for subnets.
MasterServers = *.*.*.*
ServerConnection = *.*.*.*
DebugLevel = 6
EnableWatchdog = yes
EnableProxy = yes
EnableSNMPProxy = yes
DataReconciliationBlockSize = 64
DataReconciliationTimeout = 60000
MaxSessions = 32
SubAgent = ping.nsm


We try with different variables size (MaxSessions, DataReconciliationBlockSize), but nothing helps.

Both of machines running on linux ubuntu server.

Now performance tab of nodes are empty, because of synchronisation failure.

And in the first attachment is the output of nxagent in debug level 9 and second is  from the server.



innerfire

Hello, Everyone

Now i found another problem with data synchronization, we disable the cache from agent polling properties and now we have another problem i post log from agent and server below:
from agent:

2018.09.19 10:26:50.385 *D* [CS-43(639)] Receiver thread stopped
2018.09.19 10:26:50.385 *D* [tunnel             ] *.*.*.*: Channel 638 closed
2018.09.19 10:26:50.385 *D* [tunnel             ] *.*.*.*: Sending message CMD_CLOSE_CHANNEL
2018.09.19 10:26:50.385 *D* [CS-33(629)] Session with *.*.*.* closed
2018.09.19 10:26:50.385 *D* [CS-33(629)] Session unregistered
2018.09.19 10:26:50.385 *D* [CS-33(629)] Receiver thread stopped
2018.09.19 10:26:50.385 *D* [tunnel             ] *.*.*.*: Channel 628 closed
2018.09.19 10:26:50.385 *D* [tunnel             ] *.*.*.*: Sending message CMD_CLOSE_CHANNEL
2018.09.19 10:26:50.385 *D* [CS-41(637)] Session with *.*.*.* closed
2018.09.19 10:26:50.385 *D* [CS-41(637)] Session unregistered
2018.09.19 10:26:50.386 *D* [CS-41(637)] Receiver thread stopped
2018.09.19 10:26:50.386 *D* [tunnel             ] *.*.*.*: Channel 636 closed
2018.09.19 10:26:50.386 *D* [tunnel             ] *.*.*.*: Sending message CMD_CLOSE_CHANNEL
2018.09.19 10:26:50.386 *D* [CS-40(636)] Session with *.*.*.* closed
2018.09.19 10:26:50.386 *D* [CS-40(636)] Session unregistered
2018.09.19 10:26:50.386 *D* [CS-40(636)] Receiver thread stopped
2018.09.19 10:26:50.386 *D* [tunnel             ] *.*.*.*: Channel 635 closed
2018.09.19 10:26:50.386 *D* [tunnel             ] *.*.*.*: Sending message CMD_CLOSE_CHANNEL
2018.09.19 10:26:50.386 *D* [CS-31(627)] Session with *.*.*.* closed
2018.09.19 10:26:50.386 *D* [CS-31(627)] Session unregistered
2018.09.19 10:26:50.386 *D* [CS-31(627)] Receiver thread stopped
2018.09.19 10:26:50.386 *D* [tunnel             ] *.*.*.*: Channel 626 closed
2018.09.19 10:26:50.386 *D* [tunnel             ] *.*.*.*: Sending message CMD_CLOSE_CHANNEL
2018.09.19 10:26:50.386 *D* [CS-25(621)] Session with *.*.*.* closed
2018.09.19 10:26:50.386 *D* [CS-25(621)] Session unregistered
2018.09.19 10:26:50.386 *D* [CS-25(621)] Receiver thread stopped
2018.09.19 10:26:50.386 *D* [tunnel             ] *.*.*.*: Channel 620 closed
2018.09.19 10:26:50.386 *D* [tunnel             ] *.*.*.*: Sending message CMD_CLOSE_CHANNEL
2018.09.19 10:26:50.387 *D* [CS-2(598)] Session with *.*.*.* closed
2018.09.19 10:26:50.387 *D* [CS-2(598)] Session unregistered
2018.09.19 10:26:50.387 *D* [CS-2(598)] Receiver thread stopped
2018.09.19 10:26:50.387 *D* [tunnel             ] *.*.*.*: Channel 597 closed
2018.09.19 10:26:50.387 *D* [tunnel             ] *.*.*.*: Sending message CMD_CLOSE_CHANNEL
2018.09.19 10:26:50.387 *D* [CS-1(175)] Communication channel closed by peer
2018.09.19 10:26:50.387 *D* [CS-1(175)] Session with *.*.*.* closed
2018.09.19 10:26:50.387 *D* [CS-1(175)] Session unregistered
2018.09.19 10:26:50.387 *D* [CS-1(175)] Receiver thread stopped


From server:
2018.09.19 11:28:38.118 *D* [agent.tunnel.231   ] createChannel: new channel created (ID=620)
2018.09.19 11:28:38.120 *D* [client.session.1   ] Sending update for object Gi1/0/12 [8998]
2018.09.19 11:28:38.120 *D* StatusPoll(UGA_OBUL_SA_MSK01.opportunitybank.co.ug [4279]): boot time set to 1536857606 from SNMP
2018.09.19 11:28:38.120 *D* [agent.conn.36      ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261283) - cannot accept new data
2018.09.19 11:28:38.122 *D* [agent.conn.1456    ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261283) - cannot accept new data
2018.09.19 11:28:38.122 *D* Finished status poll for node UGA_OBUL_SA_MSK01.opportunitybank.co.ug (ID: 4279)
2018.09.19 11:28:38.123 *D* [agent.tunnel.231   ] createChannel: new channel created (ID=608)
2018.09.19 11:28:38.123 *D* [agent.conn.11      ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261283) - cannot accept new data
2018.09.19 11:28:38.124 *D* [agent.conn.2088053 ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261283) - cannot accept new data
2018.09.19 11:28:38.125 *D* [agent.conn.1467    ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261283) - cannot accept new data
2018.09.19 11:28:38.127 *D* [agent.conn.558509  ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261283) - cannot accept new data
2018.09.19 11:28:38.127 *D* [client.session.1   ] Sending update for object Gi1/0/12 [6664]
2018.09.19 11:28:38.128 *D* [agent.conn.421     ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261283) - cannot accept new data
2018.09.19 11:28:38.179 *D* [obj.sync           ] Object Ethernet1/0/25 [33226] modified
2018.09.19 11:28:38.180 *D* [agent.conn.1887    ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261282) - cannot accept new data
2018.09.19 11:28:38.182 *D* [agent.conn.2093904 ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261282) - cannot accept new data
2018.09.19 11:28:38.182 *D* [client.session.1   ] Sending update for object Gi1/0/4 [21193]
2018.09.19 11:28:38.183 *D* [agent.conn.32      ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261282) - cannot accept new data
2018.09.19 11:28:38.183 *D* [agent.conn.9893    ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261282) - cannot accept new data
2018.09.19 11:28:38.184 *D* [agent.conn.2026    ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261282) - cannot accept new data
2018.09.19 11:28:38.185 *D* [agent.conn.1962    ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261282) - cannot accept new data
2018.09.19 11:28:38.189 *D* [agent.conn.514313  ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261282) - cannot accept new data
2018.09.19 11:28:38.240 *D* [client.session.1   ] Sending update for object Gi1/0/17 [1467]
2018.09.19 11:28:38.244 *D* [agent.conn.1988    ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261283) - cannot accept new data
2018.09.19 11:28:38.245 *D* [agent.conn.1887    ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261282) - cannot accept new data
2018.09.19 11:28:38.247 *D* [agent.conn.476008  ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261282) - cannot accept new data
2018.09.19 11:28:38.247 *D* [agent.conn.1796826 ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261282) - cannot accept new data
2018.09.19 11:28:38.302 *D* [agent.conn.2093904 ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261274) - cannot accept new data
2018.09.19 11:28:38.302 *D* [agent.conn.1887    ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261274) - cannot accept new data
2018.09.19 11:28:38.358 *D* [agent.conn.1883    ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261274) - cannot accept new data
2018.09.19 11:28:38.360 *D* [agent.conn.580284  ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261273) - cannot accept new data
2018.09.19 11:28:38.361 *D* [agent.conn.36      ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261271) - cannot accept new data
2018.09.19 11:28:38.362 *D* [agent.conn.157     ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261270) - cannot accept new data
2018.09.19 11:28:38.362 *D* [agent.conn.11      ] AgentConnectionEx::processBulkCollectedData: database writer queue is too large (261270) - cannot accept new data

Victor Kirhenshtein

Hi,

what is your idata writer queue size? You can check it in server debug console with command "show queue". If you have historical graph for idata queue please also share it.
Also, how many DCIs you have? You can check it in server debug console with command "show stats".

Best regards,
Victor

innerfire

#3
when i execute the "show queue"
i get numbers only in "Database writer (raw DCI values) : 20763".

The command "show stats"
give me this output:

Objects......... 50649
Monitored nodes.... 2198
Collectible DCIs... 58084
Active alarms....  4061

Here is the some screenshots from database queue on server that i attach.

Victor Kirhenshtein

It looks that from time to time database cannot cope with incoming data rate and so writer queue grows. You may need to fine tune your database server. But it seems to happen on few occasions so it should not be a root cause for data synchronization timeouts.

Best regards,
Victor

Victor Kirhenshtein

Can you please reproduce timeout issue again, with server running at debug level 7 and agent at debug level 6, and send me synchronized log files for period during synchronization?

Best regards,
Victor