Housekeeper / SQL Server Deadlock

Started by pzandvoort, June 09, 2020, 05:07:00 PM

Previous topic - Next topic

pzandvoort

Many nights at 2am, when the housekeeper runs, we get a few deadlocks on SQL Server.

SQL query failed (Query = "INSERT INTO idata_2301 (item_id,idata_timestamp,idata_value,raw_value) VALUES (1061,1591683195,'Ok','3')"): [Microsoft][SQL Server Native Client 11.0][SQL Server]Transaction (Process ID 57) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

We also see some of the DELETE statements from the housekeeper failing for similar reasons: "SQL query failed (Query = "DELETE FROM idata_2637 WHERE (item_id=1817 AND idata_timestamp<1583907634) OR ..." [Microsoft][SQL Server Native Client 11.0][SQL Server]Transaction (Process ID 63) was deadlocked on lock | communication buffer resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

This doesn't happen every night, and the number of deadlocks isn't the same every time, but it's often enough that it's waking people up at 2am.

Anyone have any thoughts on where to go to troubleshoot?

netxms 3.3-323
SQL Server 2016 SP2 CU12 (13.0.5698.0)

Thanks!
Peter

Filipp Sudanov

There's nothing special that NetXMS would be doing when working with the DB - during housekeeping it's issuing these DELETE statement and parallel to that INSERT statements do happen. So it's upon to the database settings how long it's waiting for these statements, until it decides to kill them.
Some information could be found here: https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2005/ms181150(v=sql.90)?redirectedfrom=MSDN

Also - how big is your system and the database?
What's the output of "sh stat" in server console?

Can you show the output of the following queries:
SELECT var_name, var_value FROM config WHERE var_name like 'DefaultDCI%';

SELECT polling_interval, retention_time, count(polling_interval) FROM items GROUP BY polling_interval, retention_time ORDER BY retention_time, polling_interval;


pzandvoort

Hi Philipp,

Thanks for your quick reply!
The system isn't anything special, but not completely bad: Dell R420, Dual E5-2430 @ 2.2GHz, 32GB Ram, 2x 1TB drive in RAID1. It's normally pretty bored. It's not doing anything other than NetXMS and SQL. The database is 19GB.

sh stat:
- netxmsd: Objects 2473
- Monitored Nodes: 141
- Collectible DCIs: 3689

SELECT var_name, var_value FROM config WHERE var_name like 'DefaultDCI%';
DefaultDCIPollingInterval   60
DefaultDCIRetentionTime   90

SELECT polling_interval, retention_time, count(polling_interval) FROM items GROUP BY polling_interval, retention_time ORDER BY retention_time, polling_interval;
0   0   3830
300   0   24
3600   0   14
60   90   1

So most things just poll every minute and retain for 90 days.

Isn't a deadlock different from a regular lock? If all NetXMS is doing is just deletes and some inserts, I would maybe expect some regular locks/waits, but they resolve themselves given enough time. From what I understand, a deadlock is normally something special where 2 things wait for each other in a circle with no way out. (https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2005/ms177433%28v%3dsql.90%29)



Filipp Sudanov


pzandvoort

None happened last night, but here's the one from yesterday.



<deadlock>
<victim-list>
  <victimProcess id="process14ebe971468" />
</victim-list>
<process-list>
  <process id="process14ebe971468" taskpriority="0" logused="40088" waitresource="PAGE: 5:1:1192460 " waittime="5506" ownerId="16049071" transactionname="user_transaction" lasttranstarted="2020-06-10T02:02:30.297" XDES="0x14eb61df590" lockMode="IX" schedulerid="24" kpid="19196" status="suspended" spid="60" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2020-06-10T02:02:31.320" lastbatchcompleted="2020-06-10T02:02:31.320" lastattention="1900-01-01T00:00:00.320" clientapp="NetXMS" hostname="MONITOR1" hostpid="8632" loginname="CAMPUSDOOR\monitor" isolationlevel="read committed (2)" xactid="16049071" currentdb="5" currentdbname="netxms" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
   <executionStack>
    <frame procname="adhoc" line="1" stmtstart="98" stmtend="294" sqlhandle="0x020000006b0a903581f03fcc658c71f5b8537b2dde9aac390000000000000000000000000000000000000000">
unknown    </frame>
    <frame procname="adhoc" line="1" stmtend="236" sqlhandle="0x0200000065e8a32aad77be1795d7ff93c21bc8140ac3c6b70000000000000000000000000000000000000000">
unknown    </frame>
   </executionStack>
   <inputbuf>
INSERT INTO idata_2349 (item_id,idata_timestamp,idata_value,raw_value) VALUES (1109,1591768951,'91.331634','91.331634')   </inputbuf>
  </process>
  <process id="process14ebe036ca8" taskpriority="0" logused="3221500" waitresource="PAGE: 5:1:1193485 " waittime="4987" ownerId="16049218" transactionname="DELETE" lasttranstarted="2020-06-10T02:02:30.450" XDES="0x14ead5fb650" lockMode="U" schedulerid="4" kpid="12052" status="suspended" spid="61" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2020-06-10T02:02:29.383" lastbatchcompleted="2020-06-10T02:02:29.383" lastattention="1900-01-01T00:00:00.383" clientapp="NetXMS" hostname="MONITOR1" hostpid="8632" loginname="CAMPUSDOOR\monitor" isolationlevel="read committed (2)" xactid="16049218" currentdb="5" currentdbname="netxms" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
   <executionStack>
    <frame procname="adhoc" line="1" stmtend="4878" sqlhandle="0x0200000083c4a620a2fa781ce6e93abf014fae0f6f1089850000000000000000000000000000000000000000">
unknown    </frame>
   </executionStack>
   <inputbuf>
DELETE FROM idata_2349 WHERE (item_id=1101 AND idata_timestamp&lt;1583992949) OR (item_id=1102 AND idata_timestamp&lt;1583992949) OR (item_id=1103 AND idata_timestamp&lt;1583992949) OR (item_id=1104 AND idata_timestamp&lt;1583992949) OR (item_id=1105 AND idata_timestamp&lt;1583992949) OR (item_id=1106 AND idata_timestamp&lt;1583992949) OR (item_id=1107 AND idata_timestamp&lt;1583992949) OR (item_id=1109 AND idata_timestamp&lt;1583992949) OR (item_id=1111 AND idata_timestamp&lt;1583992949) OR (item_id=1173 AND idata_timestamp&lt;1583992949) OR (item_id=1174 AND idata_timestamp&lt;1583992949) OR (item_id=1175 AND idata_timestamp&lt;1583992949) OR (item_id=1178 AND idata_timestamp&lt;1583992949) OR (item_id=1179 AND idata_timestamp&lt;1583992949) OR (item_id=1180 AND idata_timestamp&lt;1583992949) OR (item_id=1181 AND idata_timestamp&lt;1583992949) OR (item_id=1184 AND idata_timestamp&lt;1583992949) OR (item_id=1185 AND idata_timestamp&lt;1583992949) OR (item_id=1186 AND idata_timestamp&lt;1583992949) OR (item_id=1187 AND idata_timestamp&lt;1583992949) OR (item_id=14205   </inputbuf>
  </process>
</process-list>
<resource-list>
  <pagelock fileid="1" pageid="1192460" dbid="5" subresource="FULL" objectname="netxms.dbo.idata_2349" id="lock14dc42cf500" mode="X" associatedObjectId="72057594066829312">
   <owner-list>
    <owner id="process14ebe036ca8" mode="X" />
   </owner-list>
   <waiter-list>
    <waiter id="process14ebe971468" mode="IX" requestType="wait" />
   </waiter-list>
  </pagelock>
  <pagelock fileid="1" pageid="1193485" dbid="5" subresource="FULL" objectname="netxms.dbo.idata_2349" id="lock14e39817400" mode="IX" associatedObjectId="72057594066829312">
   <owner-list>
    <owner id="process14ebe971468" mode="IX" />
   </owner-list>
   <waiter-list>
    <waiter id="process14ebe036ca8" mode="U" requestType="wait" />
   </waiter-list>
  </pagelock>
</resource-list>
</deadlock>


There were quite a few on 6/9.... Screenshot attached.
Below is the last one from 6/9.
<deadlock>
<victim-list>
  <victimProcess id="process14ebe047468" />
</victim-list>
<process-list>
  <process id="process14ebe047468" taskpriority="0" logused="0" waitresource="PAGE: 5:1:2303392 " waittime="10425" ownerId="14398297" transactionname="DELETE" lasttranstarted="2020-06-09T02:20:42.220" XDES="0x14ebde89020" lockMode="U" schedulerid="6" kpid="16336" status="suspended" spid="63" sbid="0" ecid="8" priority="0" trancount="0" lastbatchstarted="2020-06-09T02:20:34.203" lastbatchcompleted="2020-06-09T02:20:34.203" lastattention="1900-01-01T00:00:00.203" clientapp="NetXMS" hostname="MONITOR1" hostpid="8632" isolationlevel="read committed (2)" xactid="14398297" currentdb="5" currentdbname="netxms" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
   <executionStack>
    <frame procname="adhoc" line="1" stmtend="7324" sqlhandle="0x020000009844b73552aecc44670076b3495534916ebe69be0000000000000000000000000000000000000000">
unknown    </frame>
   </executionStack>
   <inputbuf>
DELETE FROM idata_2637 WHERE (item_id=1817 AND idata_timestamp&lt;1583907634) OR (item_id=1818 AND idata_timestamp&lt;1583907634) OR (item_id=1819 AND idata_timestamp&lt;1583907634) OR (item_id=1822 AND idata_timestamp&lt;1583907634) OR (item_id=1823 AND idata_timestamp&lt;1583907634) OR (item_id=1824 AND idata_timestamp&lt;1583907634) OR (item_id=1825 AND idata_timestamp&lt;1583907634) OR (item_id=1828 AND idata_timestamp&lt;1583907634) OR (item_id=1829 AND idata_timestamp&lt;1583907634) OR (item_id=1830 AND idata_timestamp&lt;1583907634) OR (item_id=1831 AND idata_timestamp&lt;1583907634) OR (item_id=1970 AND idata_timestamp&lt;1583907634) OR (item_id=24756 AND idata_timestamp&lt;1583907634) OR (item_id=24757 AND idata_timestamp&lt;1583907634) OR (item_id=24758 AND idata_timestamp&lt;1583907634) OR (item_id=24759 AND idata_timestamp&lt;1583907634) OR (item_id=24760 AND idata_timestamp&lt;1583907634) OR (item_id=24761 AND idata_timestamp&lt;1583907634) OR (item_id=24762 AND idata_timestamp&lt;1583907634) OR (item_id=24763 AND idata_timestamp&lt;1583907634) OR (item_   </inputbuf>
  </process>
  <process id="process14ebe037c28" taskpriority="0" logused="9584" waitresource="PAGE: 5:1:2303257 " waittime="9865" ownerId="14398160" transactionname="user_transaction" lasttranstarted="2020-06-09T02:20:41.787" XDES="0x14eb6185390" lockMode="IX" schedulerid="4" kpid="16576" status="suspended" spid="59" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2020-06-09T02:20:43.150" lastbatchcompleted="2020-06-09T02:20:43.150" lastattention="1900-01-01T00:00:00.150" clientapp="NetXMS" hostname="MONITOR1" hostpid="8632" loginname="CAMPUSDOOR\monitor" isolationlevel="read committed (2)" xactid="14398160" currentdb="5" currentdbname="netxms" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
   <executionStack>
    <frame procname="adhoc" line="1" stmtstart="98" stmtend="294" sqlhandle="0x020000003e4d2b1ad6d052f85d4ee2831814cedcb3d912610000000000000000000000000000000000000000">
unknown    </frame>
    <frame procname="adhoc" line="1" stmtend="218" sqlhandle="0x0200000059d8b81b923ecd02bb46bd15cb5d570c8e585c430000000000000000000000000000000000000000">
unknown    </frame>
   </executionStack>
   <inputbuf>
INSERT INTO idata_2637 (item_id,idata_timestamp,idata_value,raw_value) VALUES (24770,1591683643,'Running','0')   </inputbuf>
  </process>
  <process id="process14ebe06f848" taskpriority="0" logused="10000" waittime="4455" schedulerid="11" kpid="15704" status="suspended" spid="63" sbid="0" ecid="7" priority="0" trancount="0" lastbatchstarted="2020-06-09T02:20:34.203" lastbatchcompleted="2020-06-09T02:20:34.203" lastattention="1900-01-01T00:00:00.203" clientapp="NetXMS" hostname="MONITOR1" hostpid="8632" isolationlevel="read committed (2)" xactid="14398297" currentdb="5" currentdbname="netxms" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
   <executionStack>
    <frame procname="adhoc" line="1" stmtend="7324" sqlhandle="0x020000009844b73552aecc44670076b3495534916ebe69be0000000000000000000000000000000000000000">
unknown    </frame>
   </executionStack>
   <inputbuf>
DELETE FROM idata_2637 WHERE (item_id=1817 AND idata_timestamp&lt;1583907634) OR (item_id=1818 AND idata_timestamp&lt;1583907634) OR (item_id=1819 AND idata_timestamp&lt;1583907634) OR (item_id=1822 AND idata_timestamp&lt;1583907634) OR (item_id=1823 AND idata_timestamp&lt;1583907634) OR (item_id=1824 AND idata_timestamp&lt;1583907634) OR (item_id=1825 AND idata_timestamp&lt;1583907634) OR (item_id=1828 AND idata_timestamp&lt;1583907634) OR (item_id=1829 AND idata_timestamp&lt;1583907634) OR (item_id=1830 AND idata_timestamp&lt;1583907634) OR (item_id=1831 AND idata_timestamp&lt;1583907634) OR (item_id=1970 AND idata_timestamp&lt;1583907634) OR (item_id=24756 AND idata_timestamp&lt;1583907634) OR (item_id=24757 AND idata_timestamp&lt;1583907634) OR (item_id=24758 AND idata_timestamp&lt;1583907634) OR (item_id=24759 AND idata_timestamp&lt;1583907634) OR (item_id=24760 AND idata_timestamp&lt;1583907634) OR (item_id=24761 AND idata_timestamp&lt;1583907634) OR (item_id=24762 AND idata_timestamp&lt;1583907634) OR (item_id=24763 AND idata_timestamp&lt;1583907634) OR (item_   </inputbuf>
  </process>
  <process id="process14ebe01d848" taskpriority="0" logused="10000" waittime="9768" schedulerid="1" kpid="13264" status="suspended" spid="63" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2020-06-09T02:20:34.203" lastbatchcompleted="2020-06-09T02:20:34.203" lastattention="1900-01-01T00:00:00.203" clientapp="NetXMS" hostname="MONITOR1" hostpid="8632" loginname="CAMPUSDOOR\monitor" isolationlevel="read committed (2)" xactid="14398297" currentdb="5" currentdbname="netxms" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
   <executionStack>
    <frame procname="adhoc" line="1" stmtend="7324" sqlhandle="0x020000009844b73552aecc44670076b3495534916ebe69be0000000000000000000000000000000000000000">
unknown    </frame>
   </executionStack>
   <inputbuf>
DELETE FROM idata_2637 WHERE (item_id=1817 AND idata_timestamp&lt;1583907634) OR (item_id=1818 AND idata_timestamp&lt;1583907634) OR (item_id=1819 AND idata_timestamp&lt;1583907634) OR (item_id=1822 AND idata_timestamp&lt;1583907634) OR (item_id=1823 AND idata_timestamp&lt;1583907634) OR (item_id=1824 AND idata_timestamp&lt;1583907634) OR (item_id=1825 AND idata_timestamp&lt;1583907634) OR (item_id=1828 AND idata_timestamp&lt;1583907634) OR (item_id=1829 AND idata_timestamp&lt;1583907634) OR (item_id=1830 AND idata_timestamp&lt;1583907634) OR (item_id=1831 AND idata_timestamp&lt;1583907634) OR (item_id=1970 AND idata_timestamp&lt;1583907634) OR (item_id=24756 AND idata_timestamp&lt;1583907634) OR (item_id=24757 AND idata_timestamp&lt;1583907634) OR (item_id=24758 AND idata_timestamp&lt;1583907634) OR (item_id=24759 AND idata_timestamp&lt;1583907634) OR (item_id=24760 AND idata_timestamp&lt;1583907634) OR (item_id=24761 AND idata_timestamp&lt;1583907634) OR (item_id=24762 AND idata_timestamp&lt;1583907634) OR (item_id=24763 AND idata_timestamp&lt;1583907634) OR (item_   </inputbuf>
  </process>
</process-list>
<resource-list>
  <pagelock fileid="1" pageid="2303392" dbid="5" subresource="FULL" objectname="netxms.dbo.idata_2637" id="lock14ddf8c6080" mode="IX" associatedObjectId="72057594076790784">
   <owner-list>
    <owner id="process14ebe037c28" mode="IX" />
   </owner-list>
   <waiter-list>
    <waiter id="process14ebe047468" mode="U" requestType="wait" />
   </waiter-list>
  </pagelock>
  <pagelock fileid="1" pageid="2303257" dbid="5" subresource="FULL" objectname="netxms.dbo.idata_2637" id="lock14e0026c980" mode="U" associatedObjectId="72057594076790784">
   <owner-list>
    <owner id="process14ebe06f848" mode="U" />
   </owner-list>
   <waiter-list>
    <waiter id="process14ebe037c28" mode="IX" requestType="wait" />
   </waiter-list>
  </pagelock>
  <exchangeEvent id="Pipe15935351630" WaitType="e_waitPipeNewRow" waiterType="Producer" nodeId="2" tid="11" ownerActivity="receivedData" waiterActivity="tryToSendData" merging="true" spilling="true" waitingToClose="true">
   <owner-list>
    <owner id="process14ebe01d848" />
   </owner-list>
   <waiter-list>
    <waiter id="process14ebe06f848" />
   </waiter-list>
  </exchangeEvent>
  <exchangeEvent id="Pipe14d50d18a30" WaitType="e_waitPipeGetRow" waiterType="Coordinator" nodeId="2" tid="0" ownerActivity="sentData" waiterActivity="needMoreData" merging="true" spilling="false" waitingToClose="false">
   <owner-list>
    <owner id="process14ebe047468" />
   </owner-list>
   <waiter-list>
    <waiter id="process14ebe01d848" />
   </waiter-list>
  </exchangeEvent>
</resource-list>
</deadlock>


Filipp Sudanov

Second one is surprising - there are several processes trying to delete from one and the same table with the same queries.
Could you set debug level for housekeeper to 7
nxadm -c "debug housekeeper 7"
This will produced detailed log for housekeeper process. We'd be interested to see the log for the night when such deadlock with several DELETE statements occurs.

pzandvoort

I've set the debug level to 7 for the housekeeper.
Of course, no deadlocks last night. It's the proverbial watched pot that never boils :)

pzandvoort

Here we go! 4 deadlocks last night....

FYI: our logging goes to the Windows event log, so "eventlog 2020-06-23.csv" is an export of that. Note the reverse order! I have replace devicenames with *** to protect the innocent.
Also attached: the 4 deadlock events from SQL Server.

I don't see anything specific in here, but hope this helps!
Thanks,

Peter

Filipp Sudanov

As per some post on stackoverflow, mssql is able to get into a deadlock on simultaneous insert and delete due to some things related to indexes.

Some changes will be introduced into db writing mechanism in coming patch release. Currently, when housekeeping is happening, there are two separate queues - one for housekeeping tasks and another for new data inserts.
In the coming version a running housekeeping query will pause insert queries. After housekeeper query completes, it will check the length of current insert queue, if it's above Housekeeper.Throttle.HighWatermark setting housekeeper will pause (until reaching LowWatermark), otherwise another housekeeper query would run.
As a result this should fix the deadlocking issue.

pzandvoort

Filipp,

Thanks for the research.
I look forward to the next version!

Peter