NXAgent local SQLite database useless when it becomes large

Started by blazarov, June 02, 2017, 10:36:27 AM


blazarov

Hello team,
we are extensively using the NXAgent local caching (data reconciliation) function, which is great.
Recently we found a nasty limitation of the SQLite-based implementation.

When a busy agent with lots of DCIs loses connectivity to the server for a relatively long period (several hours or more), its local SQLite cache database (the dc_queue table in particular) quickly grows large. The larger it gets, the slower the SELECT queries become, which stretches out the reconciliation cycles between the agent and the server. This basically leads to a snowball effect: things get worse and worse over time, and the agent can never catch up, sync all cached data to the server, and get back to sending fresh data.
We now have an agent with 7+ million rows in the dc_queue table in SQLite, and every SELECT query takes around 40-50 seconds to complete:

[02-Jun-2017 10:22:31.816] [DEBUG] Long running query: "SELECT server_id,dci_id,dci_type,dci_origin,status_code,snmp_target_guid,timestamp,value FROM dc_queue WHERE server_id=3095327485888869043 ORDER BY timestamp LIMIT 1024" [43475 ms]
[02-Jun-2017 10:22:32.638] [DEBUG] ReconciliationThread: 1024 records sent
[02-Jun-2017 10:23:12.913] [DEBUG] Long running query: "SELECT server_id,dci_id,dci_type,dci_origin,status_code,snmp_target_guid,timestamp,value FROM dc_queue WHERE server_id=3095327485888869043 ORDER BY timestamp LIMIT 1024" [40224 ms]
[02-Jun-2017 10:23:14.085] [DEBUG] ReconciliationThread: 1024 records sent
[02-Jun-2017 10:23:58.584] [DEBUG] Long running query: "SELECT server_id,dci_id,dci_type,dci_origin,status_code,snmp_target_guid,timestamp,value FROM dc_queue WHERE server_id=3095327485888869043 ORDER BY timestamp LIMIT 1024" [44448 ms]
[02-Jun-2017 10:23:59.432] [DEBUG] ReconciliationThread: 1024 records sent
[02-Jun-2017 10:24:43.598] [DEBUG] Long running query: "SELECT server_id,dci_id,dci_type,dci_origin,status_code,snmp_target_guid,timestamp,value FROM dc_queue WHERE server_id=3095327485888869043 ORDER BY timestamp LIMIT 1024" [44115 ms]
[02-Jun-2017 10:25:13.633] [DEBUG] ReconciliationThread: timeout on bulk send
[02-Jun-2017 10:26:03.604] [DEBUG] Long running query: "SELECT server_id,dci_id,dci_type,dci_origin,status_code,snmp_target_guid,timestamp,value FROM dc_queue WHERE server_id=3095327485888869043 ORDER BY timestamp LIMIT 1024" [49917 ms]
[02-Jun-2017 10:26:04.483] [DEBUG] ReconciliationThread: 1024 records sent
[02-Jun-2017 10:26:49.148] [DEBUG] Long running query: "SELECT server_id,dci_id,dci_type,dci_origin,status_code,snmp_target_guid,timestamp,value FROM dc_queue WHERE server_id=3095327485888869043 ORDER BY timestamp LIMIT 1024" [44614 ms]
[02-Jun-2017 10:26:50.456] [DEBUG] ReconciliationThread: 1024 records sent
[02-Jun-2017 10:27:35.038] [DEBUG] Long running query: "SELECT server_id,dci_id,dci_type,dci_origin,status_code,snmp_target_guid,timestamp,value FROM dc_queue WHERE server_id=3095327485888869043 ORDER BY timestamp LIMIT 1024" [44006 ms]
[02-Jun-2017 10:27:38.335] [DEBUG] ReconciliationThread: 1024 records sent
[02-Jun-2017 10:28:30.247] [DEBUG] Long running query: "SELECT server_id,dci_id,dci_type,dci_origin,status_code,snmp_target_guid,timestamp,value FROM dc_queue WHERE server_id=3095327485888869043 ORDER BY timestamp LIMIT 1024" [51862 ms]
[02-Jun-2017 10:28:34.130] [DEBUG] ReconciliationThread: 1024 records sent
[02-Jun-2017 10:29:26.638] [DEBUG] Long running query: "SELECT server_id,dci_id,dci_type,dci_origin,status_code,snmp_target_guid,timestamp,value FROM dc_queue WHERE server_id=3095327485888869043 ORDER BY timestamp LIMIT 1024" [52457 ms]
[02-Jun-2017 10:29:27.455] [DEBUG] ReconciliationThread: 1024 records sent
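
To see where the time goes, the same statement can be run under EXPLAIN QUERY PLAN against a copy of the cache database (a sketch only; the exact plan text depends on the SQLite version and on the agent's actual schema):

EXPLAIN QUERY PLAN
SELECT server_id,dci_id,dci_type,dci_origin,status_code,snmp_target_guid,timestamp,value
  FROM dc_queue
 WHERE server_id=3095327485888869043
 ORDER BY timestamp
 LIMIT 1024;
-- a plan line containing "USE TEMP B-TREE FOR ORDER BY" would mean SQLite has to
-- sort every matching row on each poll, so the query cost grows with the table size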

The timings above work out to a reconciliation rate of roughly 1,400-1,500 records per minute (1,024 rows per ~45-second cycle), which is much lower than the rate at which new data arrives, so we are already in a snowball situation that keeps getting worse and will never catch up.
So far the only solution we have is to delete the SQLite database, which immediately fixes the situation but costs us valuable monitoring data. Unfortunately the database format is very different between the agent and the server, and so far we haven't found a working way to "manually" dump the SQLite data and then import it into the server database. That would be a nice option for manually fixing such situations.
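
For the record, the cached rows themselves are readable with the standard sqlite3 shell, so a raw export is possible even if there is no supported way to load it back into the server (a sketch; the database file location and the exact column set may differ by agent version, the columns below are taken from the logged query):

-- run in the sqlite3 command-line shell against a copy of the agent's local database,
-- e.g. after ".mode csv" and ".output dc_queue.csv", to keep a raw dump of the cached values
SELECT server_id, dci_id, dci_type, dci_origin, status_code, snmp_target_guid, timestamp, value
  FROM dc_queue
 ORDER BY timestamp;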

The hardware that runs the NXAgent is pretty decent, and unfortunately giving the VM more CPU/RAM or putting it on faster storage (we even tried all-flash storage) does not help significantly. It looks like SQLite is capped at using just one core.

So after this long introduction, I have several questions:

  • Is there an option to use a real database such as MySQL or PostgreSQL for the local agent caching DB? Since this is very critical for us, we can live with some advanced installation or configuration just to make it work.
  • Does my understanding and analysis of the issue and its cause make sense?
  • Any other ideas on how we can solve this problem, or maybe work around it?

blazarov

Hi,
Just to update with progress.

After a quick chat with Victor, he realized that the indexing of the dc_queue table in SQLite was suboptimal. He released a new development build of the source with optimized indexes; the commit is here:
https://git.netxms.org/public/netxms.git/commit/51e38637be0790e18f0b38318ec3233bab76304c?js=1
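
For anyone hitting the same issue before upgrading: the kind of index that matches this query pattern (WHERE server_id = ? ORDER BY timestamp LIMIT 1024) looks roughly like the sketch below. The index name is an assumption; the actual schema change is in the commit above:

-- lets SQLite satisfy both the filter and the ordering by walking the index,
-- so each poll touches only the first 1024 matching rows instead of sorting millions
CREATE INDEX IF NOT EXISTS idx_dc_queue_server_time ON dc_queue(server_id, timestamp);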

We have just updated, and so far the results look amazing: at least a 5-10x performance increase. Data reconciliation that previously seemed impossible to catch up on now works like a charm. CPU utilization of the machine is also very low now, compared to a constant 100% on one core before.

We will do some more tests, continue to monitor, and post more feedback.

Thanks Victor!