NetXMS - SQL Performance

Started by bdefloo, December 18, 2012, 05:35:22 PM


bdefloo

Hi,

While investigating why our NetXMS setup still crashes a few times a day, I noticed that the database writer queue grows very large. It starts climbing about 30 minutes after a crash and reaches roughly 400K to 750K before the server crashes again due to a memory access violation in a random module.

Our NetXMS server is running on a Windows 2003 x86 server with MSSQL 2008 Express. I noticed in the activity monitor that a particular query was taking over a minute to complete and causing a lot of disk read activity:
SELECT event_source FROM event_log WHERE event_source=50220

Searching the code led me to the CleanDeletedObjects function in the housekeeping module of the server. The reason it's so expensive is that SQL Server has to scan all the records in the event log (in my case, about 8 million of them for the default 90 days) to check whether that particular event_source is used anywhere, as it's not indexed. Some things could probably be optimized here.

First off, it's searching the event log for the object ID of an interface, while the event source is always a node ID, if I'm not mistaken. Could a filter be added based on the object_class field of the deleted_objects table?
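
As an illustration, the housekeeper could skip the event log check for non-node objects with something along these lines (the column names and class value are assumptions on my part, based on the table name):

SELECT object_id FROM deleted_objects WHERE object_class=2 -- hypothetical: assuming class 2 means node

and only run the expensive event_log lookup for the IDs this returns.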

Secondly, if a record does exist for that particular object ID, all the remaining records are still processed. This could be resolved with the EXISTS condition, which returns as soon as one matching record is found:
e.g. IF EXISTS (SELECT event_source FROM event_log WHERE event_source=50220) SELECT 1 ELSE SELECT 0
http://msdn.microsoft.com/en-us/library/ms188336.aspx
However, I'm not sure this keyword is supported by all of the database engines NetXMS supports.
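
If EXISTS turns out not to be portable, limiting the result to one row might work as a fallback; a rough sketch (syntax differs per engine, so each would need its own variant):

SELECT TOP 1 event_source FROM event_log WHERE event_source=50220 -- MS SQL
SELECT event_source FROM event_log WHERE event_source=50220 LIMIT 1 -- MySQL, PostgreSQL, SQLite

Either way the engine can stop scanning as soon as it finds a match.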

Thirdly, I'm seeing the query repeated multiple times for the same object ID. Could it be timing out and retrying the delete of the same object over and over? My deleted_objects table holds 2,722 records, so something does seem to be going wrong.

Whether this is related to our crashes I don't know, but fixing it could be a good performance improvement for anyone with an environment large enough for it to be a problem. In the meantime, I've reduced my event log retention to 14 days to see if it alleviates some of the stress on the SQL server.

Victor Kirhenshtein

Hi!

Thanks for pointing this out! The housekeeper code is quite wrong here... One possible solution could be adding an index on event_log.event_source and replacing

SELECT event_source FROM event_log WHERE event_source=?

with

SELECT COUNT(event_source) FROM event_log WHERE event_source=?

or, on MS SQL, with

IF EXISTS (SELECT event_source FROM event_log WHERE event_source=50220) SELECT 1 ELSE SELECT 0

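For reference, the index itself would be something like this (the index name is just illustrative):

CREATE INDEX idx_event_log_source ON event_log(event_source)

With that index in place, both the COUNT and the EXISTS variants become cheap index lookups instead of full table scans.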

Another option is to remove this check altogether, as it serves only one purpose: preventing the creation of a new object with an ID that was used before and is still referenced in the event log. However, this can only happen if the object(s) with the highest IDs are deleted and the server is restarted before a new object is created. It's probably better to just keep track of the highest used ID elsewhere.
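
A minimal sketch of that idea, assuming the existing config table (var_name/var_value) is reused and "LastObjectId" is a hypothetical variable name:

UPDATE config SET var_value='50220' WHERE var_name='LastObjectId' -- hypothetical variable name

On startup the server would read this value back and allocate new IDs above it, so the event log never needs to be consulted.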

Best regards,
Victor

bdefloo

Hi,

Adding an index may indeed greatly reduce the issue. I'm not sure using COUNT will make a difference, though, as it still has to check all the records instead of stopping after the first one found.

Just storing the last used object ID and always handing out truly unique IDs may be the safest approach, as old IDs could also still be referenced in the event processing policy, in other logs, or even hardcoded in some random script, leading to unpredictable results for the user.

The high database writer queue length is also gone now: the queue is usually around 30-50, with an occasional spike to 2000 when a large number of hourly DCIs come in, which is gone by the next sample.

The server also hasn't crashed since I reduced the event log size, so the memory access violations are probably somehow tied to a large database writer queue.

Thanks for considering my suggestions, and for running an open source project; I wouldn't have found this problem otherwise!

Victor Kirhenshtein

Hi!

I've decided to drop the deleted_objects table completely and just store the last used ID.

If your database writer queue is always around 30-50, you can try increasing the number of database writers: as updates usually go to different tables, they can often run in parallel quite effectively. The default number of writers is 1.
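
The number of writers is controlled by a server configuration variable (NumberOfDatabaseWriters, if memory serves); for example, directly in the database (server restart required):

UPDATE config SET var_value='4' WHERE var_name='NumberOfDatabaseWriters' -- assumes the config table layout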

Best regards,
Victor

bdefloo

Hi,

The number of database writers is already set to 8.

I took a look in the history for the queue length DCI, and it appears it's often 0 with frequent spikes up to 20-40, which makes it look worse than it is in a line graph. There don't seem to be any more performance issues at the moment, so I'm going to let it run for a while to see if it's stable now.

Thanks for the help!

Kind regards,
Bastiaan