Last weekend, my brother-in-law and I fished a couple lakes in northern Wisconsin in anticipation of a tournament we will be competing in next month. We did very well on the first lake, seeing several fish and hooking up with one of them, but the second lake we didn’t even see a fish. The fish that we did hook up with has been weighing on my mind quite heavily. My brother-in-law had just moved us to our second spot of the morning. He fired off a cast while I was changing to a lure that was more appropriate for the location we were fishing. Within a few seconds of his lure hitting the water a large muskie had come up and hit the lure. He set the hook solidly and was soon fighting a very energetic fish. The fish was only hooked in the beak and had a good chance of shaking the lure before we got it in the net. The muskie darted through the water, moving left and then coming back right. The second time we got it to the boat, I lowered the net into the water and stabbed toward the fish. It, however, wasn’t ready to be netted. It turned away from the net and, in the process, the hook slipped out. As it swam away I pushed the net after it and got it halfway in, but I couldn’t keep up and it swam back down into the depths. Just seeing such a large fish (I figure it was about 45″, based on the multiple views of it I got) was very exciting, but failing to get it in the net was crushing. In retrospect, I did one important thing wrong. I didn’t wait for my brother-in-law’s command to attempt the net job. Since it was his fish, I should have waited until he said “Now!” to attempt to net the fish. There is a chance the fish would have shook off before we got it into the net, but that was his call to make. In the excitement and chaos of the moment, the thought didn’t even cross my mind. I feel bad for having cost him the successful catch of a nice muskie, but it did teach me an important lesson. I just wish the lesson could have been at my own expense instead of someone else’s.
In part 1 of this post I described how an upgrade of a third party application was causing problems in TempDB by allocating TempDB objects and not deallocating the objects via the code. Since the connections were staying open, TempDB continued to use up space until it ran out and crashed the SQL Server service. Here in part 2, I will describe how I found exactly what exactly was sitting in TempDB and what was causing the objects not to deallocate.
I began by considering the results of the SQL statement described in my part1. I noticed that several spids seemed to often allocate tempdb internal object space without deallocating it, while others deallocated it just as I’d expect.
I decided that if I could capture what those spids where doing when they allocated more space, I would be able to find code that I could provide the application support staff to prove they were not deallocating objects properly. My plan was to pick a spid or two that appeared to be consistent offenders, such as 213 above, and run a trace on them. At the same time I planned to use some SQL that would show internal objects being allocated and deallocated in TempDB. Microsoft has provided trace flag 1106, which logs info like this to the sys.dm_os_ring_buffers dmv when it is turned on. However, this trace flag has the potential to grind database processing to a halt, and should only be used in production with great care (Microsoft recommends not using in production at all). Once this is turned on, I would match the SQL commands caught in the profiler trace with the allocation events caught from the trace flag and shown via this query:
SELECT TOP 500 record.value('(Record/@id)', 'int') AS record_id, inf.cpu_ticks, inf.ms_ticks, CONVERT (VARCHAR, DATEADD (MINUTE, -1 * ((CAST(inf.cpu_ticks AS BIGINT) / CAST(inf.ms_ticks AS BIGINT)) - CAST([timestamp] AS BIGINT))/60000, GETDATE()), 126) AS EventTime, [timestamp] , record.value('(Record/@id)', 'int') AS RingBuffer_Record_Id, record.value('(Record/ALLOC/Event)', 'int') AS AllocationEventType, record.value('(Record/ALLOC/SpId)', 'int') AS SpId, record.value('(Record/ALLOC/EcId)', 'int') AS EcId, record.value('(Record/ALLOC/PageId)', 'nvarchar(50)') AS AllocatedPageId, record.value('(Record/ALLOC/AuId)', 'nvarchar(50)') AS AllocationUnitId, record.value('(Record/ALLOC/LsId)', 'nvarchar(50)') AS LsId, t.record FROM sys.dm_os_sys_info inf CROSS JOIN ( SELECT TIMESTAMP, CONVERT (XML, record) AS record FROM sys.dm_os_ring_buffers WHERE ring_buffer_type = 'RING_BUFFER_ALLOC_TRACE' AND ( record LIKE '%<Event>23</Event>%' -- uniform extent allocation OR record LIKE '%<Event>22</Event>%' -- uniform extent deallocation OR record LIKE '%<Event>24</Event>%' -- mixed extent allocation OR record LIKE '%<Event>25</Event>%' -- mixed extent deallocation OR record LIKE '%<Event>10</Event>%' -- page allocation OR record LIKE '%<Event>11</Event>%' -- page deallocation )) AS t ORDER BY record.value('(Record/@id)', 'int') ASC
My plan was to run this trace first thing in the morning. I needed to run it during the course of our normal business operations because this was when the internal object was building up, but I also didn’t want to run it during our peak time of day. As I began to prepare to run it the next morning, I was setting up the profiler trace and something caught my attention.
The application uses cursors for all database operations instead of the much superior method of stored procedures. I happen to notice that some transactions had sp_cursoropen and sp_cursorfetch, but no sp_cursorclose. Many did have sp_cursorclose lines, but many also didn’t. In the above screenshot, you can see the top line is the last statement in the previous transaction, 164507091. Then in transaction 164507096 you have an sp_cursoropen and an sp_cursorfetch, but there is not sp_cursorclose. The last line indicates a new transaction has begun. This led me to believe the problem was a bunch of open cursors for each of these connections.
My last step was to confirm this using the sys.dm_exec_cursors dynamic management function. When I checked SPID 213, which had been open about 2 1/2 days, I found a whopping 18,000 open cursors:
When I checked SPID 181, which was on the top of the first screenshot above, it should no open cursors:
This is consistent with the fact that 181 has deallocated almost the same amount of Internal Object that it has allocated. In this Microsoft Technet article it states that TempDB internal objects are used to store query results from cursors.
I sent this explanation, along with the results of the trace, to the Application’s support staff. Using the trace they were able to find the places in their code where they were not closing their cursors and fix that code.