Never Assume – An Index Defragmentation Story

I can remember muskie fishing a couple years ago with my brother in law. We were fishing down a shoreline and moved to a part of the lake where I’d never even seen a muskie before. I was wondering how much longer I should wait before asking him if he wanted to move on to a new location. Suddenly he encountered a viscious strike, and shortly after landed a fat 38″ muskie. Sometimes you think you know where the fish will be, but you never really know you so you’ve got to check.

This week I got an email from one of our vendors. The vendor was reminding me that I had declined some of their maintenance script for their database. I didn’t see the need for their script as I run maintenance over all the databases on the instance at the same time. I update statistics and rebuild/reorganize indexes based on fragmentation levels and page counts. Their maintenance also included a file shrink, which is usually a waste of resources – size it out correctly the first time! Anyway, I responded that the software was performing fine and my maintenance plan was covering everything that theirs would have. Then, just for kicks and giggles, I decided to check out the fragmentaton on the database in question. It is a small database and only 6 tables are over 1000 pages anyway, so fragmentation probably wouldn’t ever cause a noticable performance impact, but I was surprised to see some pretty high numbers. I went to my defragmentation log and could not find a record of any of those tables being defragmented in the recent past. I went in and checked the SQL Agent job and found it was running as expected. Then I found the issue – the stored procedure I use to loop through the tables and defragment takes a @databases parameter that can be filled in if you only want to defragment select databases. Since there were a few I wanted to skip, I put all the others in, but the parameter was defined as varchar(128), which was woefully short. The defragmentation routine was only getting the first several databases that I had specified, and the others were not being touched.
I quickly corrected this by extending the parameter to varchar(4000) and ran it (overnight) to catch up. Now I can see fragmentation is down where it should be. The lesson in this story is to check and double check. As a DBA you can never be too careful, and you don’t ever really know something unless you can prove it by looking at the raw data.


Sargability – make sure your indexes are being used!

My favorite time of year to fish is fall. The fish are bulking up for the winter, so they are fatter than usual. They also will often binge feed, providing some great fishing. However, fall can be rather uncomfortable here in Wisconsin. We get freezing temperatures, blustery days, and snow or rainfail for extended periods in the fall. Just this last October I was fishing cold conditions, with a pretty strong wind and light snow that turned heavy every so often. I remember looking around, seeing no one else on the lake, and thinking to myself “This is living!”. I’ve come to enjoy the sting of the weather because I associate it with the joy of fishing and being in the outdoors.
One thing that can really sting database end users is a slow query. Web pages that take seconds to load instead of miliseconds and BI queries that take minutes instead of seconds can make people awfully grumpy toward the DBA. One way to make sure you can aviod this sting is to ensure your queries are sargable. Sarbable is a made up term that comes from the expression Search ARGument ABLE. Basically, you want to make sure any fields that are referenced in the WHERE, ORDER BY, GROUP BY, and HAVING clauses are able to use an index. The main way to do this is to avoid functions in your predicate. Let’s take a look at a quick example. I’ll create a table with a clustered primary key and a some nonclustered indexes and fill it with some data:

--Create table with primary key clustered
CREATE TABLE dbo.BaitPurchases (ID INT IDENTITY(1,1), Baitname VARCHAR(255) NOT NULL, Purchasedate DATETIME2 NULL, Price MONEY NULL
([ID] ASC)

--Create nonclustered index on PurchaseDate
CREATE NONCLUSTERED INDEX IX_BaitPurchases_PurchaseDate ON [dbo].BaitPurchases
   Purchasedate ASC
)INCLUDE (Baitname)

--Create nonclustered index on Baitname
CREATE NONCLUSTERED INDEX IX_BaitPurchases_Baitname ON [dbo].BaitPurchases
   Baitname ASC
)INCLUDE (Price)

--Create nonclustered index on Price
CREATE NONCLUSTERED INDEX IX_BaitPurchases_Price ON [dbo].BaitPurchases
   Price ASC
)INCLUDE (Baitname)

--Throw in some data
INSERT INTO dbo.BaitPurchases
        ( Baitname, Purchasedate, Price )
VALUES  ( 'Bulldawg','2010-04-29',15.99),
( 'Double Cowgirl','2010-04-30',16.99),
( 'Shallow Raider','2010-05-29',15.49),
( 'Depth Raider','2010-05-30',18.99),
( 'Reef Runner','2011-05-29',14.99),
( 'Mepps Muskie Killer','2011-04-29',12.49),
( 'Suick Thriller','2011-04-30',16.09),
( 'Softtail Phantom','2011-05-30',26.99),
( 'Shallow Invader','2012-04-29',18.29),
( 'Medussa','2012-04-30',17.99),
( 'Grandma','2012-05-29',15.39),
( 'Jake','2012-05-30',15.39),
( 'Believer','2013-04-29',16.59),
( 'Harasser','2013-04-30',13.49),
( 'Doubledawg','2013-05-29',17.99),
( 'Buchertail 700','2013-05-30',13.89),
( 'Hawg Wobbler','2014-04-29',21.99),
( 'Pacemaker','2014-04-30',20.99),
( 'Top Raider','2014-05-29',18.50),
( 'Swim Whizz','2014-05-30',16.98),
( 'Creek Chub Pike Minnow','2015-01-12',15.00)

Now we’ll look for all items that were purchased in 2012, attempting to utilize the nonclustered index that was created on the Purchasedate field:

SELECT Baitname, Purchasedate
FROM dbo.BaitPurchases
WHERE Purchasedate BETWEEN '2013-01-01' AND '2013-12-31 23:59:59.999'

20150112 Sargability Date Seek

You can see from the execution plan that the database engine is doing an index seek on our nonclustered index, just as we were expecting.  However, your junior DBA comes across this code and thinks it’ll be much more readable if he re-writes it with a function to look at the purchase year:

SELECT  Baitname ,
FROM    dbo.BaitPurchases
WHERE   YEAR(Purchasedate) = 2013

This does look more readable to me, but if we take a look at the execution plan, we find a very bad negative consequence of this change:

20150112 Sargability Date Scan

We are now doing a nonclustered index scan, meaning the entirety of the index was read. This doesn’t make a big difference in our current tiny table, but with a table with millions of rows you could be seeing the query execution time going from less than a second to minutes.

Let’s take a look at one more common filtering technique that causes index scans instead of seeks. Let’s say you can’t quite remember the name of that lure, but you’re sure it started with an ‘S’. The obvious way to search for this would be:

SELECT  Baitname
FROM    dbo.BaitPurchases
WHERE   Baitname LIKE 'S%'

Looking at the execution plan, we can see we got a nonclustered index seek, so we’re happy with our results.
20150112 Sargability String Seek

However, now we remembered the name didn’t start with an ‘S’, but it definitely had ‘Raider’ in it somewhere. Now we change up our query like this:

SELECT  Baitname
FROM    dbo.BaitPurchases
WHERE   Baitname LIKE '%Raider%'

Now we’re getting a nonclustered index scan:
20150112 Sargability String Scan

Before, the database engine could look at the beginning of each baitname to find the results we were looking for. With the index, the engine jumped right to the section of the index that had the ‘S’s and returned the results from there. Now it has to look in every single bait name and see if it contains the string ‘Raider’. Searches like the second one are usually going to be slow, but SQL Server does provide one tool for getting that data a bit faster, and that is Full-Text indexes. I’ll discuss those in a later blog post.

So take the sting out of database queries by making sure your indexes are being used.

SQL Server Table Structure

Fishing Hotspot maps are a popular tool to help fishermen catch more fish. Particularly helpful when you want to try a new lake, these maps contain the depth and key underwater features of a lake, as well as places where you are likely to hook into a fish. The map will also contain a list of key features of the lake, such as water clarity, max depth, acreage, and bottom type. These maps can sometimes be compiled into entire books, so you’ll always have a map of a nearby lake no matter where you are in the state. The table of contents in these books provide an alphabetical list of lakes, so if I want to find the map for Lake Tomahawk, I just look it up in the table of contents, find it’s on page 96, and open the book to that page. But let’s say I want to choose a small lake to fish – I have half a day to fish and I want to cover an entire lake. The only way to find all lakes under 300 acres would be to page through the entire book and check the acreage on each page to see if it is less than 300.

Wisconsin Fishing Guide
Relational Database Management Systems (RDBMS) use indexes to allow queries to quickly find information like this. There are two basic types of indexes in SQL Server – Clustered and NonClustered. Before we get into those, let’s briefly touch on the lack of an index. If there is no clustered index on a table, the table is referred to as a heap. Although data in a heap is usually stored in the order rows are inserted into the table, the Database Engine is free to move the data at will for the purpose of storing the rows efficiently. This means INSERTs into a heap are usually fast – SQL Server can just throw the data whereever. However, SELECTs, UPDATEs, and DELETEs are going to be considerably slower because the IO system is going to have to read data from many different places on the disk.
There are a few other disadvantages of heaps. One is that the only way to reclaim space is to rebuild the table. If a table is 1 GB in size at its largest, then all the data is deleted out, the size will remain 1 GB until the table is rebuilt. The second is that rebuilding can be a burden, pparticularlywhen nonclustered indexes are involved. We’ll discusses this more later, but nonclustered indexes on a heap contain a pointer between the data that is contained in the index and the remainder of the data for that row, which is stored in the heap. Since rebuilding the table moves the data, the nonclustered index needs to also be rebuilt so the row pointer can be updated to point to the correct location. This is not the case for nonclustered indexes on tables with a clustered index. So the point to take away from this little digression is that there a certain times a heap will be preferable over a clustered index, but in most cases a table will be more performant with a clustered index.
So what is a clustered index? A clustered index is the index that contains all the data in the bottom level of index. For this reason, the clustered index contains all the data in the table, and provides a logical order to it. Each table can only have one clustered index, and once it has this index it is no longer a heap. Data in the table will always be logically stored in order according to the definition of the index. A clustered index is organized in what is referred to as a B-tree structure. This can be shown in a hierarchical diagram with the root node on top, intermediate index pages in the middle, and the leaf node or data pages on bottom. Let’s demonstrate this with an example. I create a table with a clustered index and fill it as follows:

CREATE TABLE dbo.ClusteredIndex
      ID INT IDENTITY(1, 1) ,
      string VARCHAR(108) ,
      digits DECIMAL(38, 3)
CREATE CLUSTERED INDEX IX_ClusteredIndex_ID ON dbo.ClusteredIndex (ID)
SELECT TOP 1000000
        IDENTITY( INT,1,1 ) AS N
INTO    #temp
FROM    MASTER.dbo.syscolumns sc1 ,
        MASTER.dbo.syscolumns sc2
INSERT  INTO dbo.ClusteredIndex
        ( string ,
        SELECT  REPLICATE(NEWID(), ABS(CHECKSUM(NEWID()) % 3) + 1) ,--Random string of length 36, 72, or 108
                CAST(CHECKSUM(NEWID()) AS FLOAT) / 1000 --Random decimal
        FROM    #temp

Next I find all pages being used by the table by using DBCC IND(database, table, index number):

--Get a list of pages used by the table
 DBCC IND('sandbox', 'clusteredindex',1)

20141231 Table Structure Clustered Pages

There are a few things to note from these results for our discussion. First is the PagePID. This is the actual pageID within SQL Server. This is unique to the database and represents on 8KB page. Next is the PageType. This shows what is contained on the page. 10 indicates the IAM page, which I will explain in a later blog post. 2 indicates the index page, rather than a data page, which is represented by a 1. The NextPagePID and PrevPagePID indicate the logical ordering of the table. So Page 303 is the first data page in the table, followed by 15762 and so on.
To demonstrate the B-tree structure of the clustered index, let’s take a look at the top Index level page. I find this page by finding the record with the PageType = 2 and the highest index level. For this table the highest index level is 2, as seen here:

20141231 Table Structure Root Node

To do this we’ll use DBCC Page:

DBCC PAGE (sandbox/*database*/,1/*file number*/,16383/*Page Number*/,3/*Print Option 0 - 3*/)

20141231 Table Structure Root Node Index

This undocumented SQL Statement provides a very interesting look at the root node of the index. From here we can see each of the intermediate nodes PageID, which is represented by the ChildPageId field. Let’s dig into one of the intermediate nodes before explaining this in a bit more detail. The first ChildPageId is 15761, so lets display the contents of this page:

DBCC PAGE (sandbox/*database*/,1/*file number*/,15761/*Page Number*/,3/*Print Option 0 - 3*/)

20141231 Table Structure Clustered Intermediate Nodes

We now see the contents of the intermediate layer of the B-tree. We can see that again each row has a ChildPageID and at least two key fields. The ChildPageId field is a pointer to the data pages that hold the data for this part of the index. In our example we can see that Page 303 is going to contain all the data in the table from ClusteredIndex.ID = 1 through ClusteredIndex.ID = 78. Page 15762 will contain all data in the table from ClusteredIndex.ID = 80 through ClusteredIndex.ID = 152, and so on. We can see this based on the fields that are marked with (key). In this example, the index was only clustered on the ClusteredIndex.ID field, but if we had multiple fields here they would each get a separate column in this result set. The UNIQUIFIER is added on by SQL in case there are very dense or large rows that cause the clustered key to span multiple pages. The UNIQUIFIER tracks which of these pages come first in the order of the index. So a graphical representation of this clustered index would be:

20141231 Table Structure Clustered B-Tree

Lastly, if we grab the first data page we get the actual records in the table (after some initial header information, excluded from the screenshot below):

20141231 Table Structure Leaf Node

So we can see that a clustered index orders the data in the table, and also contains all data for the table in the leaf node of the index.  If a clustered index orders the table and contains all the data, why do I need a nonclustered index?  Let’s think back to our example of the book of fishing maps.  The book has all the data, and the table of contents show where each map is by lake name.  But we wanted to find all the lakes that were 300 acres or less, remember?  So we create a nonclustered index – that is, a new table of contents to list lake maps by size, starting with the smallest and working our way up.  Now we can see the names of all the small lakes together.  We’ll still have to flip around the book to see the actual maps, but our new “lake size” table of contents let’s us know which pages to go to.

I create an example using identical table structure as before, but this time I’ll add a nonclustered index onto the decimal field:

CREATE TABLE dbo.NonClusteredIndex
      ID INT IDENTITY(1, 1) ,
      string VARCHAR(108) ,
      digits DECIMAL(38, 3)
CREATE CLUSTERED INDEX IX_NonClusteredIndex_ID ON dbo.NonClusteredIndex (ID)
CREATE NONCLUSTERED INDEX IX_NonclusteredIndex_digit ON dbo.NonClusteredIndex (digits)
SELECT TOP 1000000
        IDENTITY( INT,1,1 ) AS N
INTO    #temp
FROM    MASTER.dbo.syscolumns sc1 ,
        MASTER.dbo.syscolumns sc2
INSERT  INTO dbo.NonClusteredIndex
        ( string ,
        SELECT  REPLICATE(NEWID(), ABS(CHECKSUM(NEWID()) % 3) + 1) ,--Random string of length 36, 72, or 108
                CAST(CHECKSUM(NEWID()) AS FLOAT) / 1000 --Random decimal
        FROM    #temp

To show the advantage we have now given ourselves, lets try selecting all the records with a negative number in the digits column, first from the dbo.ClusteredIndex table, and then from the dbo.NonClusteredIndex table.  We’ll look at what’s happening behind the scenes using Profiler.  Here’s the SELECT from the table without the nonclustered index:

20141231 Table Structure Clustered Index Select

I’ve highlighted the important stats, 343 CPU, 13,536 Reads, and it took 900 miliseconds to complete.  Here’s the SELECT from the table with the nonclustered index:

20141231 Table Structure NonClustered Index Select

We can see a very impressive performance improvement by selecting from a nonclustered index.  CPU has been cut by more than 50%, the number of reads has fallen by 88%, and the duration has also fallen.  We can see how the database engine found our data by looking at the execution plan.  Here is the plan for the Select from dbo.ClusteredIndex:

20141231 Table Structure Clustered Index Plan

We can see that SQL had to scan the entire clustered index to find our data – this means the entire table was read.  SQL Server has even suggested that we can improve performance by adding a nonclustered index.  Here is the execution plan for the Select from dbo.Nonclusteredindex:

20141231 Table Structure NonClustered Index Plan

Here the database engine was able to use a index seek on the Nonclustered index.  This means it didn’t even have to traverse the entire nonclustered index – it was able to simply find the section of the index needed and read those records.

The last interesting thing to note relates to our earlier point about how nonclustered indexes reference the remaining data for heaps versus clustered indexes.  If we look at the lowest level of a page from our current nonclustered index (which lives in a table that contains a clustered index), we can see the clustered index is contained on the nonclustered index pages:

20141231 Table Structure Nonclustered page with clustered index

Here, we see the ID column is included, even though the nonclustered index was only built on the digits column.  This changes if we first drop the clustered index, then look at the lowest level of the nonclustered index:

DROP INDEX IX_NonClusteredIndex_ID ON dbo.NonClusteredIndex

Here is the lowest level of the nonclustered index:

20141231 Table Structure Nonclustered page with heap

Now we see a Row ID instead of the clustered index key.  This tells the database engine where to look for the rest of the data relating to this row (in our case, the ID and string values that correspond to each particular digits value).  The two things to note on this:

  1. Again, since this points to a physical location, any time that heap data moves (such as during a table rebuild) the nonclustered index will need to be rebuilt.  This does not happen during a clustered index rebuild because the nonclustered index only has a pointer to the clustered key.
  2. This illustrates the reason for keeping clustered index keys small.  Since each nonclustered index contains the clustered index key, a large clustered index key will also force our nonclustered indexes to be larger than may be necessary.

So, there is my (long) explanation of table structures in SQL Server.  Hopefully this will help you understand why certain best practices exist, and how to improve performance on your tables.