Why is TempDB Filling with Internal Object? Part 2

Last weekend, my brother-in-law and I fished a couple lakes in northern Wisconsin in anticipation of a tournament we will be competing in next month.  We did very well on the first lake, seeing several fish and hooking up with one of them, but the second lake we didn’t even see a fish.  The fish that we did hook up with has been weighing on my mind quite heavily.  My brother-in-law had just moved us to our second spot of the morning.  He fired off a cast while I was changing to a lure that was more appropriate for the location we were fishing.  Within a few seconds of his lure hitting the water a large muskie had come up and hit the lure.  He set the hook solidly and was soon fighting a very energetic fish.  The fish was only hooked in the beak and had a good chance of shaking the lure before we got it in the net.  The muskie darted through the water, moving left and then coming back right.  The second time we got it to the boat, I lowered the net into the water and stabbed toward the fish.  It, however, wasn’t ready to be netted.  It turned away from the net and, in the process, the hook slipped out.  As it swam away I pushed the net after it and got it halfway in, but I couldn’t keep up and it swam back down into the depths.  Just seeing such a large fish (I figure it was about 45″, based on the multiple views of it I got) was very exciting, but failing to get it in the net was crushing.  In retrospect, I did one important thing wrong.  I didn’t wait for my brother-in-law’s command to attempt the net job.  Since it was his fish, I should have waited until he said “Now!” to attempt to net the fish.  There is a chance the fish would have shook off before we got it into the net, but that was his call to make.  In the excitement and chaos of the moment, the thought didn’t even cross my mind.  I feel bad for having cost him the successful catch of a nice muskie, but it did teach me an important lesson.  I just wish the lesson could have been at my own expense instead of someone else’s.
In part 1 of this post I described how an upgrade of a third party application was causing problems in TempDB by allocating TempDB objects and not deallocating the objects via the code. Since the connections were staying open, TempDB continued to use up space until it ran out and crashed the SQL Server service. Here in part 2, I will describe how I found exactly what exactly was sitting in TempDB and what was causing the objects not to deallocate.
I began by considering the results of the SQL statement described in my part1. I noticed that several spids seemed to often allocate tempdb internal object space without deallocating it, while others deallocated it just as I’d expect.
20160922-fix3

I decided that if I could capture what those spids where doing when they allocated more space, I would be able to find code that I could provide the application support staff to prove they were not deallocating objects properly. My plan was to pick a spid or two that appeared to be consistent offenders, such as 213 above, and run a trace on them. At the same time I planned to use some SQL that would show internal objects being allocated and deallocated in TempDB. Microsoft has provided trace flag 1106, which logs info like this to the sys.dm_os_ring_buffers dmv when it is turned on. However, this trace flag has the potential to grind database processing to a halt, and should only be used in production with great care (Microsoft recommends not using in production at all). Once this is turned on, I would match the SQL commands caught in the profiler trace with the allocation events caught from the trace flag and shown via this query:

SELECT TOP 500
record.value('(Record/@id)[1]', 'int') AS record_id,
inf.cpu_ticks,
inf.ms_ticks,
CONVERT (VARCHAR, DATEADD (MINUTE, -1 * ((CAST(inf.cpu_ticks AS BIGINT) / CAST(inf.ms_ticks AS BIGINT)) - CAST([timestamp] AS BIGINT))/60000, GETDATE()), 126) AS EventTime,
[timestamp] ,
 record.value('(Record/@id)[1]', 'int') AS RingBuffer_Record_Id,
 record.value('(Record/ALLOC/Event)[1]', 'int') AS AllocationEventType,
 record.value('(Record/ALLOC/SpId)[1]', 'int') AS SpId,
 record.value('(Record/ALLOC/EcId)[1]', 'int') AS EcId,
 record.value('(Record/ALLOC/PageId)[1]', 'nvarchar(50)') AS AllocatedPageId,
 record.value('(Record/ALLOC/AuId)[1]', 'nvarchar(50)') AS AllocationUnitId,
 record.value('(Record/ALLOC/LsId)[1]', 'nvarchar(50)') AS LsId,
 t.record
FROM sys.dm_os_sys_info inf CROSS JOIN (
SELECT TIMESTAMP, CONVERT (XML, record) AS record 
FROM sys.dm_os_ring_buffers 
WHERE ring_buffer_type = 'RING_BUFFER_ALLOC_TRACE'
AND ( record LIKE '%<Event>23</Event>%' -- uniform extent allocation
OR record LIKE '%<Event>22</Event>%' -- uniform extent deallocation
OR record LIKE '%<Event>24</Event>%' -- mixed extent allocation
OR record LIKE '%<Event>25</Event>%' -- mixed extent deallocation
OR record LIKE '%<Event>10</Event>%' -- page allocation
OR record LIKE '%<Event>11</Event>%' -- page deallocation
)) AS t
    ORDER BY record.value('(Record/@id)[1]', 'int') ASC 

My plan was to run this trace first thing in the morning. I needed to run it during the course of our normal business operations because this was when the internal object was building up, but I also didn’t want to run it during our peak time of day. As I began to prepare to run it the next morning, I was setting up the profiler trace and something caught my attention.
20160922-fix4

The application uses cursors for all database operations instead of the much superior method of stored procedures. I happen to notice that some transactions had sp_cursoropen and sp_cursorfetch, but no sp_cursorclose. Many did have sp_cursorclose lines, but many also didn’t. In the above screenshot, you can see the top line is the last statement in the previous transaction, 164507091. Then in transaction 164507096 you have an sp_cursoropen and an sp_cursorfetch, but there is not sp_cursorclose. The last line indicates a new transaction has begun. This led me to believe the problem was a bunch of open cursors for each of these connections.
My last step was to confirm this using the sys.dm_exec_cursors dynamic management function. When I checked SPID 213, which had been open about 2 1/2 days, I found a whopping 18,000 open cursors:
20150817 Tempdb 213 Open Cursors
When I checked SPID 181, which was on the top of the first screenshot above, it should no open cursors:
20150817 Tempdb 181 Open Cursors None
This is consistent with the fact that 181 has deallocated almost the same amount of Internal Object that it has allocated. In this Microsoft Technet article it states that TempDB internal objects are used to store query results from cursors.

I sent this explanation, along with the results of the trace, to the Application’s support staff. Using the trace they were able to find the places in their code where they were not closing their cursors and fix that code.

Why is TempDB Filling with Internal Object? Part 1

It is important to study the underwater landscape while fishing. There are always nuances to every location, and by finding these “spot on the spot” locations, you can know where to focus your efforts. For example, you might be fishing a large weedy bay that spans several acres. Before diving in and fishing the middle, you’d want to fish the weed edges, where the predators hide just inside the weeds, ready to ambush any bait fish that foolishly wanders into open water. An example of a spot on a spot would be a clump of weeds that extends out of the main weedbed and into the open water. This clump could conceal a large predator and at the same time give the predator a way to watch a lot of open water. Another spot might be a small hole in the main weedbed. A predator can wait for a baitfish to swim out into the hole where there is no cover, and then rush out and strike. If you aren’t paying close attention to the underwater environment while you are fishing, it can be easy to miss these spots.

We recently upgraded one of our third party applications (our biggest one, in fact), and we ran into some problems that involved a very close study of SQL Server to resolve. After the upgrade I was told by some of the application’s support staff that other customers who had upgraded were having problems (not a good sign). The main problem was TempDB filling up and not releasing it’s used space, and the only way to fix was to reboot SQL Server or recycle all of the application’s services. This was a problem that the application’s developers had not been able to figure out, so of course I was lured by the challenge of figuring out a tough problem.

The first problem was that the vendor could not see TempDB filling up. They only saw the files growing. They confirmed that they were monitoring the tempdb space used by clicking on the properties in the SSMS object explorer and looking at the space available in the General tab.
20150814 TempDB Pt1 Properties
I found that this uses the sp_spaceused system stored procedure to find these values.
20150814 TempDB Pt1 sp_spaceused
In this MSDN article, Microsoft explains that the unallocated_space considers only data pages. We can see that it does not consider internal objects by looking at the sys.dm_db_file_space_usage dynamic management view. This view is specific to TempDB and shows how much space is being used by each different TempDB object – Version Store, User Object, and Internal Object.
20150814 TempDB Pt1 space used DMV
By looking at this screenshot, you can already see that far more than 1 GB of space is being used. Each page hold 8KB of data and there are 520,440 pages being used by internal objects – that’s 4.16 GB being used. This explains why the support staff was not able to see TempDB filling up. I had already set up monitoring on our system to track this as well, and I was able to pull up the following graph:
20150814 TempDB Pt1 Space Used Graph
So I was able to establish that TempDB was filling up with Internal Objects – now the question was why. Microsoft explains here that the Internal Objects are created by the database engine and are used for, among other things, storing intermediate results for spools or sorting, as well as cursor results. Tempdb usually deallocates these objects after they are finished being used, but that is often triggered by a connection to the database being closed. So let’s look at the connections to SQL Server by the third party application. The app uses a connection pool which aims to reduce connection overhead by leaving connections open for a certain period of time. I created a job to look at our connections once every five minutes to see how long they’d been open and how much internal object had been allocated and deallocated. I did this through the use of the stored procedure sp_who2 joined to the dmv’s sys.dm_exec_sessions, sys.dm_db_task_space_usage, and sys.dm_db_session_space_usage. Here is the SQL statement that I put together:

    CREATE TABLE #sp_who2
        (
          SPID INT ,
          Status VARCHAR(1000) NULL ,
          Login SYSNAME NULL ,
          HostName SYSNAME NULL ,
          BlkBy SYSNAME NULL ,
          DBName SYSNAME NULL ,
          Command VARCHAR(1000) NULL ,
          CPUTime INT NULL ,
          DiskIO INT NULL ,
          LastBatch VARCHAR(1000) NULL ,
          ProgramName VARCHAR(1000) NULL ,
          SPID2 INT ,
          RequestID INT NULL ,
          rundate DATETIME DEFAULT ( GETDATE() )
        ) 


    INSERT  INTO #sp_who2
            ( SPID ,
              Status ,
              Login ,
              HostName ,
              BlkBy ,
              DBName ,
              Command ,
              CPUTime ,
              DiskIO ,
              LastBatch ,
              ProgramName ,
              SPID2 ,
              RequestID
            )
            EXEC sp_who2




    CREATE TABLE #DBCCOutput
        (
          EventType NVARCHAR(100) ,
          Parameters INT ,
          EventInfo VARCHAR(MAX)
        )
    CREATE TABLE #SQL
        (
          spid INT ,
          EventType NVARCHAR(100) ,
          Parameters INT ,
          EventInfo VARCHAR(MAX)
        )


    DECLARE @max INT;
    DECLARE @i INT;
    SELECT  @max = MAX(session_id)
    FROM    sys.dm_exec_sessions
    SET @i = 51
    WHILE @i <= @max
        BEGIN
            IF EXISTS ( SELECT  session_id
                        FROM    sys.dm_exec_sessions
                        WHERE   session_id = @i )
                INSERT  #DBCCOutput
                        EXEC ( 'DBCC INPUTBUFFER (' + @i + ')'
                            )
            INSERT  INTO #SQL
                    ( spid ,
                      EventType ,
                      Parameters ,
                      EventInfo
                   )
                    SELECT  @i ,
                            EventType ,
                            Parameters ,
                            EventInfo
                    FROM    #DBCCOutput;
            TRUNCATE TABLE #DBCCOutput;
            SET @i = @i + 1
        END;

   


    WITH    all_task_usage
              AS ( SELECT   session_id ,
                            SUM(internal_objects_alloc_page_count) AS task_internal_objects_alloc_page_count ,
                            SUM(internal_objects_dealloc_page_count) AS task_internal_objects_dealloc_page_count
                   FROM     sys.dm_db_task_space_usage
                   GROUP BY session_id
                 ),
            InternalObjectPages
              AS ( SELECT   R1.session_id ,
                            R1.internal_objects_alloc_page_count
                            + R2.task_internal_objects_alloc_page_count AS session_internal_objects_alloc_page_count ,
                            R1.internal_objects_dealloc_page_count
                            + R2.task_internal_objects_dealloc_page_count AS session_internal_objects_dealloc_page_count
                   FROM     sys.dm_db_session_space_usage AS R1
                            INNER JOIN all_task_usage AS R2 ON R1.session_id = R2.session_id
                   WHERE    R1.internal_objects_alloc_page_count
                            + R2.task_internal_objects_alloc_page_count <> 0
                            OR R1.internal_objects_dealloc_page_count
                            + R2.task_internal_objects_dealloc_page_count <> 0
                 )

                SELECT  sp_sql.spid ,
                        ses.login_time ,
                        ses.last_request_start_time ,
                        ses.last_request_end_time ,
                        sp_sql.EventInfo ,
                        spwho2.Login ,
                        spwho2.HostName ,
                        spwho2.DBName ,
                        InternalObjectPages.session_internal_objects_alloc_page_count ,
                        InternalObjjectPages.session_internal_objects_dealloc_page_count ,
                        ses.host_process_id,
                       GETDATE() AS CaptureDate
                FROM    InternalObjectPages
                        INNER JOIN #sp_who2 spwho2 ON spwho2.SPID = InternalObjectPages.session_id
                        INNER JOIN #SQL sp_sql ON sp_sql.spid = spwho2.SPID
                        INNER JOIN sys.dm_exec_sessions ses ON ses.session_id = InternalObjectPages.session_id;

    DROP TABLE #SQL
    DROP TABLE #DBCCOutput
    DROP TABLE #sp_who2

And here are the results:

20160922-fix
Note that these are all connections from one of the application servers to the database using the app’s SQL login. The EventInfo field is the last SQL that was run. If you compare the login time with the last request start and end times, you can see some connections have been open a long time. In the case of spid 182, the connection had been open for a day and a half but it had not be doing anything for the last 14 hours. You can also see that there were far more pages overall that had been allocated than had been deallocated. Pages can be deallocated directly by the code, or when the connection is closed. It appears the application’s code was not properly deallocating the internal object, and since the connections were being reused, the internal object was stuck indefinitely in a state of allocation, even though it was not really being used anymore.
I was able to prove this by doing a reboot of the Host Application2, then running the SQL statment again:
20160922-fix2
When comparing this screenshot to the one directly above, you can see that all the internal objects that were allocated by Application2 (primarily spids 182,74,242,234…) were now gone. The connections closed and the internal objects were immediately deallocated.
Here’s my overall TempDB space usage after the reboot:
20150814 TempDB Pt1 Graph After App2 Reboot
At this point I had proved that these connections were causing TempDB to blow up, but what exactly was sitting in TempDB taking up space? Look out for Part2 to see how I looked in even more detail to figure out what was really happening.