Index a Computed Column to Improve Query Performance

This summer started out with a below-average fishing volume. The birth of my fourth child in June used up much of the time I would normally spend fishing. As the summer grew late, I was able to get out a bit more, but it’s mostly been a small lake near my house. I’ve only gotten down to Pewaukee to muskie fish once, though I am happy to say I did have some luck that outing. The fish came in hot behind my lure, but veered off when it got to the side of the boat. As I began the figure eight, I turned to my brother-in-law to let him know I saw a fish and to estimate the size. While I was looking at him the fish came back and struck. After a brief but spirited fight, we netted the fish, a 40.5″ muskie. She was very healthy and I feel blessed to have gotten one my first time out this summer.

20180728 Pewaukee Musky

Doing a figure-eight while musky fishing is something you should do on every cast, just like making sure every query you’re executing is using an index as expected.  In some cases though, indexes can be hard to use.  I wrote an earlier post about SARGability – making sure your indexes can be used.  As I explained in that post, some things, such as functions on the columns being searched, can prevent SQL Server from doing a seek on the index.

I recently was facing a situation like that while loading a data warehouse table.  These loads move a large number of rows and can be very time-consuming, so it is important to make the performance as efficient as possible.  In this case, I was matching customers in one table to customers in another by comparing their addresses and cities.  The relevant part of the join was:

REPLACE(LEFT(Table1.StreetAddress, 10) + LEFT(Table1.City, 5), ' ', '') = REPLACE(LEFT(Table2.StreetAddress, 10) + LEFT(Table2.City, 5), ' ', '')

Both the REPLACE and LEFT functions prevent SQL Server from seeking the index.  An initial test run of the join on the two big tables came in at about 3 hours, much longer than I could afford.

To solve the problem, I created a view that added a computed column to hold the match code:

CREATE VIEW [cwi].[Customer Sold To Indexed Address Match]
WITH SCHEMABINDING

AS
SELECT dncomp AS [CMP Company]
     , dnjynb AS [CST Customer #]
    , [...] AS [Many other columns]
    , REPLACE(LEFT(dnl1tx, 10) + LEFT(dnl5tx, 5), ' ', '') AS AddressMatch
FROM vendorschema.vendortable;
GO

The important thing to note here is the SCHEMABINDING.  This tells SQL Server that the view depends on the underlying object (in this case, vendorschema.vendortable) staying the same.  If I were to go into vendorschema.vendortable and attempt to drop the dnl1tx column, SQL Server would raise an error informing me that the cwi.[Customer Sold To Indexed Address Match] view depends on that column, and the drop would fail.
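As a quick illustration (using the column and view names from above), attempting to drop a bound column fails:

```sql
-- With the schemabound view in place, dropping a referenced column fails:
ALTER TABLE vendorschema.vendortable DROP COLUMN dnl1tx;
-- Fails with an error along the lines of:
-- Msg 5074: The object 'Customer Sold To Indexed Address Match'
-- is dependent on column 'dnl1tx'.
```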

Once I have the computed column set up, I can build an index on it just as I would any other column:

CREATE UNIQUE CLUSTERED INDEX [IX_lwt_CustSoldToIndexAddressMatch_AddressMatch] ON [cwi].[Customer Sold To Indexed Address Match]
(
   [AddressMatch] ASC,
   [CMP Company] ASC,
   [CST Customer #] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
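One caveat worth noting: outside Enterprise edition, the optimizer won't match an indexed view automatically, so the query has to reference the view directly with the NOEXPAND hint to get the seek. A sketch, with a hypothetical match value:

```sql
-- NOEXPAND tells the engine to use the view's own index rather than
-- expanding the view definition against the base table:
SELECT [CMP Company], [CST Customer #]
FROM cwi.[Customer Sold To Indexed Address Match] WITH (NOEXPAND)
WHERE AddressMatch = '123MainStMilwa';  -- hypothetical match code
```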

Now when I run the same query that was before taking about 3 hours, the results are returned in a minute and a half. That’s an improvement any DBA can be proud of!


How Index Fill Factor affects new rows

Fishing is not what it used to be. I recently began reading the Wisconsin Outdoor News regularly. It has helped me to understand just how pervasive fish stocking is. Without it, the population of fishermen in Wisconsin, which includes me, would quickly overcome the population of fish, leaving our lakes and rivers largely empty. Stocking fish isn’t a bad thing; it allows more people to enjoy fishing. It allows for bigger bag limits and greater participation. But I do believe we should not trick ourselves into thinking that fishing is the same now as it was before we settled and tamed our great state. Many of the lakes we enjoy fishing have species of fish that are not native to that body of water. So yes, it’s great that my local lake has a nice population of walleye (most of which have been stocked), but the natural balance between the species of fish in the lake has been artificially disrupted for the purpose of catering to fishermen who prize walleye above all else. Is it fair to fishermen who enjoy fishing other species, or even to the ecosystem of the lake, to maintain this stocking? That’s a tough question to ask. We just don’t know what the long-term effects of maintaining such an unnatural balance will be.

Fish Stocking.jpg
One thing I like about SQL Server is you can test almost anything, so you can be sure what the long-term effects of a change will be. I recently answered a trivia question regarding Index Fill Factor incorrectly, so I decided to test out the question in SQL Server. The question asked whether fill factor affects newly inserted rows on the table. The fill factor of an index controls how much empty space the database engine will attempt to leave on each page, for the purpose of minimizing page splits during updates or inserts.

For example, if I have a page that is 100% full and someone updates a varchar(1000) field to go from a value using 10 bytes to a value using 250 bytes, SQL Server’s got some work to do. It can’t just fill in the empty space on the page because there is none. It’s going to have to take some rows of data on the page, move them to another page, create a pointer on the original page to the new page, and then perform the data update. This is a lot of extra work, and it increases index fragmentation. So if you have a table where you expect many updates, you might intentionally leave some blank space on each page. That way updates that increase the space used by a row can simply push the extra data down into the free space without causing a page split.
Now, I had assumed that when SQL Server inserts new rows onto the end of a table, it inserts those rows respecting the fill factor defined on the table’s index. However, when I submitted that answer to the trivia question, it was marked incorrect. So I decided to set up this test: create an index with a low fill factor, then insert a bunch of rows at the end. If the overall used space as seen in sys.dm_db_index_physical_stats goes up, then we can be confident the new rows are being inserted at close to 100% full. If it stays about the same, then we know the new rows are being inserted with the fill factor defined on the index.

First I create a new table for the test:

CREATE TABLE [dbo].[GuggTest_20180411](
   [id] [int] IDENTITY(1,1) NOT NULL,
   [somevarchardata] [varchar](100) NULL,
   [somenumericdata] [decimal](10, 2) NULL,
 CONSTRAINT [PK_GuggTest_20180411] PRIMARY KEY CLUSTERED 
(
   [id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 20) ON [PRIMARY]
) ON [PRIMARY]
GO

Next I’m going to fill it in with some data:

DECLARE @i INT = 0;

WHILE @i < 100000
BEGIN
    INSERT INTO dbo.GuggTest_20180411
    (
        somevarchardata
      , somenumericdata
    )
    SELECT CONVERT(VARCHAR(100), NEWID())
         , ROUND(RAND(CHECKSUM(NEWID())) * (100000000), 2);

    SELECT @i = @i + 1;
END;

Now I’ll query the system DMV to see the existing page space usage:

20180411 Index Usage Before Rebuild

The leaf level is where we expect to see the page space used around 20%. To find the leaf level, we look for the lowest index level in the DMV. We can see that even now the average space used is close to 100%, despite the index having been created with a fill factor of 20%.  To get it down to 20%, I’ll rebuild the index:

ALTER INDEX [PK_GuggTest_20180411] ON [dbo].[GuggTest_20180411] REBUILD PARTITION = ALL WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 20)

Now I’ll rerun the DMV to see the space used, which should be closer to 20%.

20180411 Index Usage After Rebuild.png

So we can see the average space used is now just over 20%.  (As an aside, we can also see the page count has ballooned from 770 to 3,704.  This shows the danger of reducing the fill factor.  Selecting all rows from this table just became almost five times as expensive!)

I’ll now insert another 100,000 rows into this table.  If the new rows go in with a fill factor of 20%, the average page space used metric should remain around 20%.  If they go in at 100%, I expect the average space used to go up significantly.

DECLARE @i INT = 0;

WHILE @i < 100000
BEGIN
    INSERT INTO dbo.GuggTest_20180411
    (
        somevarchardata
      , somenumericdata
    )
    SELECT CONVERT(VARCHAR(100), NEWID())
         , ROUND(RAND(CHECKSUM(NEWID())) * (100000000), 2);

    SELECT @i = @i + 1;
END;

And now I check the DMV one last time:

20180411 Index Usage After Insert.png

We can see the space used increased to 34%, indicating the new rows are not inserted with the specified fill factor, contrary to what I had assumed.

Never Assume – An Index Defragmentation Story

I can remember muskie fishing a couple years ago with my brother-in-law. We were fishing down a shoreline and moved to a part of the lake where I’d never even seen a muskie before. I was wondering how much longer I should wait before asking him if he wanted to move on to a new location. Suddenly he had a vicious strike, and shortly after landed a fat 38″ muskie. Sometimes you think you know where the fish will be, but you never really know, so you’ve got to check.

Muskie
This week I got an email from one of our vendors. The vendor was reminding me that I had declined some of their maintenance script for their database. I didn’t see the need for their script as I run maintenance over all the databases on the instance at the same time. I update statistics and rebuild/reorganize indexes based on fragmentation levels and page counts. Their maintenance also included a file shrink, which is usually a waste of resources – size it out correctly the first time! Anyway, I responded that the software was performing fine and my maintenance plan was covering everything that theirs would have.
Then, just for kicks and giggles, I decided to check out the fragmentation on the database in question. It is a small database and only 6 tables are over 1,000 pages, so fragmentation probably wouldn’t ever cause a noticeable performance impact, but I was surprised to see some pretty high numbers. I went to my defragmentation log and could not find a record of any of those tables being defragmented in the recent past. I went in and checked the SQL Agent job and found it was running as expected. Then I found the issue – the stored procedure I use to loop through the tables and defragment them takes a @databases parameter that can be filled in if you only want to defragment select databases. Since there were a few I wanted to skip, I listed all the others, but the parameter was defined as varchar(128), which was woefully short. The defragmentation routine was only getting the first several databases I had specified, and the others were not being touched.
I quickly corrected this by extending the parameter to varchar(4000) and ran the job overnight to catch up. Now I can see fragmentation is down where it should be. The lesson in this story is to check and double-check. As a DBA you can never be too careful, and you don’t really know something unless you can prove it by looking at the raw data.
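The truncation itself is easy to reproduce; here’s a minimal sketch (the database name is made up):

```sql
-- Assigning a 340-character list to a VARCHAR(128) silently cuts off the tail:
DECLARE @databases VARCHAR(128) = REPLICATE('SomeDatabaseName,', 20);
SELECT LEN(@databases);  -- 128, not 340: everything past that never reached the loop
```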

Sargability – make sure your indexes are being used!

My favorite time of year to fish is fall. The fish are bulking up for the winter, so they are fatter than usual. They also will often binge feed, providing some great fishing. However, fall can be rather uncomfortable here in Wisconsin. We get freezing temperatures, blustery days, and snow or rain for extended periods. Just this last October I was fishing cold conditions, with a pretty strong wind and a light snow that turned heavy every so often. I remember looking around, seeing no one else on the lake, and thinking to myself, “This is living!” I’ve come to enjoy the sting of the weather because I associate it with the joy of fishing and being in the outdoors.
One thing that can really sting database end users is a slow query. Web pages that take seconds to load instead of milliseconds and BI queries that take minutes instead of seconds can make people awfully grumpy toward the DBA. One way to avoid this sting is to ensure your queries are sargable. Sargable is a made-up term that comes from the expression Search ARGument ABLE. Basically, you want to make sure any fields referenced in the WHERE, ORDER BY, GROUP BY, and HAVING clauses are able to use an index. The main way to do this is to avoid functions on columns in your predicates. Let’s take a look at a quick example. I’ll create a table with a clustered primary key and some nonclustered indexes and fill it with some data:

--Create table with primary key clustered
CREATE TABLE dbo.BaitPurchases (
    ID INT IDENTITY(1,1),
    Baitname VARCHAR(255) NOT NULL,
    Purchasedate DATETIME2 NULL,
    Price MONEY NULL,
    PRIMARY KEY CLUSTERED ([ID] ASC)
) ON [PRIMARY]
GO

--Create nonclustered index on PurchaseDate
CREATE NONCLUSTERED INDEX IX_BaitPurchases_PurchaseDate ON [dbo].BaitPurchases
(
   Purchasedate ASC
)INCLUDE (Baitname)
GO

--Create nonclustered index on Baitname
CREATE NONCLUSTERED INDEX IX_BaitPurchases_Baitname ON [dbo].BaitPurchases
(
   Baitname ASC
)INCLUDE (Price)
GO

--Create nonclustered index on Price
CREATE NONCLUSTERED INDEX IX_BaitPurchases_Price ON [dbo].BaitPurchases
(
   Price ASC
)INCLUDE (Baitname)
GO


--Throw in some data
INSERT INTO dbo.BaitPurchases
        ( Baitname, Purchasedate, Price )
VALUES  ( 'Bulldawg','2010-04-29',15.99),
( 'Double Cowgirl','2010-04-30',16.99),
( 'Shallow Raider','2010-05-29',15.49),
( 'Depth Raider','2010-05-30',18.99),
( 'Reef Runner','2011-05-29',14.99),
( 'Mepps Muskie Killer','2011-04-29',12.49),
( 'Suick Thriller','2011-04-30',16.09),
( 'Softtail Phantom','2011-05-30',26.99),
( 'Shallow Invader','2012-04-29',18.29),
( 'Medussa','2012-04-30',17.99),
( 'Grandma','2012-05-29',15.39),
( 'Jake','2012-05-30',15.39),
( 'Believer','2013-04-29',16.59),
( 'Harasser','2013-04-30',13.49),
( 'Doubledawg','2013-05-29',17.99),
( 'Buchertail 700','2013-05-30',13.89),
( 'Hawg Wobbler','2014-04-29',21.99),
( 'Pacemaker','2014-04-30',20.99),
( 'Top Raider','2014-05-29',18.50),
( 'Swim Whizz','2014-05-30',16.98),
( 'Creek Chub Pike Minnow','2015-01-12',15.00)

Now we’ll look for all items that were purchased in 2013, attempting to utilize the nonclustered index that was created on the Purchasedate field:

SELECT Baitname, Purchasedate
FROM dbo.BaitPurchases
WHERE Purchasedate BETWEEN '2013-01-01' AND '2013-12-31 23:59:59.999'

20150112 Sargability Date Seek

You can see from the execution plan that the database engine is doing an index seek on our nonclustered index, just as we were expecting.  However, your junior DBA comes across this code and thinks it’ll be much more readable if he re-writes it with a function to look at the purchase year:

SELECT  Baitname ,
        Purchasedate
FROM    dbo.BaitPurchases
WHERE   YEAR(Purchasedate) = 2013

This does look more readable to me, but if we take a look at the execution plan, we find a negative consequence of this change:

20150112 Sargability Date Scan

We are now doing a nonclustered index scan, meaning the entire index was read. This doesn’t make a big difference in our current tiny table, but on a table with millions of rows you could see query execution time go from less than a second to minutes.
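If readability is the goal, the filter can stay sargable by comparing the bare column against a date range; a half-open range also sidesteps any end-of-day rounding questions:

```sql
-- Same filter as YEAR(Purchasedate) = 2013, but the column stays "naked",
-- so the optimizer can seek IX_BaitPurchases_PurchaseDate:
SELECT Baitname, Purchasedate
FROM dbo.BaitPurchases
WHERE Purchasedate >= '2013-01-01'
  AND Purchasedate <  '2014-01-01';
```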

Let’s take a look at one more common filtering technique that causes index scans instead of seeks. Let’s say you can’t quite remember the name of that lure, but you’re sure it started with an ‘S’. The obvious way to search for this would be:

SELECT  Baitname
FROM    dbo.BaitPurchases
WHERE   Baitname LIKE 'S%'

Looking at the execution plan, we can see we got a nonclustered index seek, so we’re happy with our results.
20150112 Sargability String Seek

However, now we remembered the name didn’t start with an ‘S’, but it definitely had ‘Raider’ in it somewhere. Now we change up our query like this:

SELECT  Baitname
FROM    dbo.BaitPurchases
WHERE   Baitname LIKE '%Raider%'

Now we’re getting a nonclustered index scan:
20150112 Sargability String Scan

Before, the database engine could look at just the beginning of each Baitname to find the results we were looking for. With the index, the engine jumped right to the section of the index that has the ‘S’s and returned the results from there. Now it has to look at every single bait name and see whether it contains the string ‘Raider’. Searches like the second one are usually going to be slow, but SQL Server does provide one tool for getting that data a bit faster: Full-Text indexes. I’ll discuss those in a later blog post.

So take the sting out of database queries by making sure your indexes are being used.

SQL Server Table Structure

Fishing Hotspot maps are a popular tool to help fishermen catch more fish. Particularly helpful when you want to try a new lake, these maps contain the depth and key underwater features of a lake, as well as places where you are likely to hook into a fish. The map will also contain a list of key features of the lake, such as water clarity, max depth, acreage, and bottom type. These maps can sometimes be compiled into entire books, so you’ll always have a map of a nearby lake no matter where you are in the state. The table of contents in these books provides an alphabetical list of lakes, so if I want to find the map for Lake Tomahawk, I just look it up in the table of contents, find it’s on page 96, and open the book to that page. But let’s say I want to choose a small lake to fish – I have half a day to fish and I want to cover an entire lake. The only way to find all lakes under 300 acres would be to page through the entire book and check the acreage on each page to see if it is less than 300.

Wisconsin Fishing Guide
Relational Database Management Systems (RDBMS) use indexes to allow queries to quickly find information like this. There are two basic types of indexes in SQL Server – Clustered and NonClustered. Before we get into those, let’s briefly touch on the lack of an index. If there is no clustered index on a table, the table is referred to as a heap. Although data in a heap is usually stored in the order rows are inserted into the table, the Database Engine is free to move the data at will for the purpose of storing the rows efficiently. This means INSERTs into a heap are usually fast – SQL Server can just throw the data wherever. However, SELECTs, UPDATEs, and DELETEs are going to be considerably slower because the IO system will have to read data from many different places on the disk.
There are a few other disadvantages of heaps. One is that the only way to reclaim space is to rebuild the table. If a table grows to 1 GB and then all the data is deleted out of it, the size will remain 1 GB until the table is rebuilt. The second is that rebuilding can be a burden, particularly when nonclustered indexes are involved. We’ll discuss this more later, but a nonclustered index on a heap contains a pointer between the data contained in the index and the remainder of the data for that row, which is stored in the heap. Since rebuilding the table moves the data, the nonclustered index also needs to be rebuilt so the row pointer can be updated to point to the correct location. This is not the case for nonclustered indexes on tables with a clustered index. So the point to take away from this little digression is that there are certain times a heap is preferable to a clustered index, but in most cases a table will perform better with a clustered index.
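Related to those row pointers: when a heap row grows and gets moved, SQL Server leaves a forwarding stub behind, and those forwarded records are their own source of extra reads. You can watch for them with sys.dm_db_index_physical_stats; a sketch, with a made-up table name:

```sql
-- forwarded_record_count is only populated for heaps; index_id 0 targets the heap:
SELECT forwarded_record_count, page_count, avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.SomeHeapTable'),  -- hypothetical heap table
         0, NULL, 'DETAILED');
```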
So what is a clustered index? A clustered index is an index whose bottom level contains all the data in the table. For this reason, the clustered index contains all the data in the table and provides a logical order to it. Each table can only have one clustered index, and once it has one it is no longer a heap. Data in the table will always be logically stored in order according to the definition of the index. A clustered index is organized in what is referred to as a B-tree structure. This can be shown as a hierarchical diagram with the root node on top, intermediate index pages in the middle, and the leaf node or data pages on the bottom. Let’s demonstrate this with an example. I create a table with a clustered index and fill it as follows:

CREATE TABLE dbo.ClusteredIndex
    (
      ID INT IDENTITY(1, 1) ,
      string VARCHAR(108) ,
      digits DECIMAL(38, 3)
    )
CREATE CLUSTERED INDEX IX_ClusteredIndex_ID ON dbo.ClusteredIndex (ID)
SELECT TOP 1000000
        IDENTITY( INT,1,1 ) AS N
INTO    #temp
FROM    MASTER.dbo.syscolumns sc1 ,
        MASTER.dbo.syscolumns sc2
INSERT  INTO dbo.ClusteredIndex
        ( string ,
          digits
        )
        SELECT  REPLICATE(NEWID(), ABS(CHECKSUM(NEWID()) % 3) + 1) ,--Random string of length 36, 72, or 108
                CAST(CHECKSUM(NEWID()) AS FLOAT) / 1000 --Random decimal
        FROM    #temp
DROP TABLE #temp

Next I find all pages being used by the table by using DBCC IND(database, table, index number):

--Get a list of pages used by the table
 DBCC IND('sandbox', 'clusteredindex',1)

20141231 Table Structure Clustered Pages

There are a few things to note from these results for our discussion. First is the PagePID. This is the actual page ID within SQL Server; it is unique within the database and represents one 8KB page. Next is the PageType, which shows what is contained on the page: 10 indicates an IAM page, which I will explain in a later blog post; 2 indicates an index page; and 1 indicates a data page. The NextPagePID and PrevPagePID indicate the logical ordering of the table. So page 303 is the first data page in the table, followed by 15762, and so on.
To demonstrate the B-tree structure of the clustered index, let’s take a look at the root index page. I find this page by finding the record with PageType = 2 and the highest index level. For this table the highest index level is 2, as seen here:

20141231 Table Structure Root Node

To do this we’ll use DBCC Page:

DBCC TRACEON (3604);
DBCC PAGE (sandbox/*database*/,1/*file number*/,16383/*Page Number*/,3/*Print Option 0 - 3*/)

20141231 Table Structure Root Node Index

This undocumented DBCC statement provides a very interesting look at the root node of the index. From here we can see each intermediate node’s PageID, represented by the ChildPageId field. Let’s dig into one of the intermediate nodes before explaining this in more detail. The first ChildPageId is 15761, so let’s display the contents of that page:

DBCC TRACEON (3604);
DBCC PAGE (sandbox/*database*/,1/*file number*/,15761/*Page Number*/,3/*Print Option 0 - 3*/)

20141231 Table Structure Clustered Intermediate Nodes

We now see the contents of the intermediate layer of the B-tree. We can see that again each row has a ChildPageId and at least two key fields. The ChildPageId field is a pointer to the data page that holds the data for this part of the index. In our example we can see that page 303 will contain all the data in the table from ClusteredIndex.ID = 1 through ClusteredIndex.ID = 78. Page 15762 will contain all data in the table from ClusteredIndex.ID = 80 through ClusteredIndex.ID = 152, and so on. We can see this based on the fields that are marked with (key). In this example, the index was clustered only on the ClusteredIndex.ID field, but if we had multiple key fields they would each get a separate column in this result set. The UNIQUIFIER column appears because the clustered index was not created as unique; SQL Server adds a hidden uniquifier to distinguish rows that share the same clustered key value. So a graphical representation of this clustered index would be:

20141231 Table Structure Clustered B-Tree

Lastly, if we grab the first data page we get the actual records in the table (after some initial header information, excluded from the screenshot below):

20141231 Table Structure Leaf Node

So we can see that a clustered index orders the data in the table, and also contains all data for the table in the leaf node of the index.  If a clustered index orders the table and contains all the data, why do I need a nonclustered index?  Let’s think back to our example of the book of fishing maps.  The book has all the data, and the table of contents shows where each map is by lake name.  But we wanted to find all the lakes that were 300 acres or less, remember?  So we create a nonclustered index – that is, a new table of contents that lists lake maps by size, starting with the smallest and working our way up.  Now we can see the names of all the small lakes together.  We’ll still have to flip around the book to see the actual maps, but our new “lake size” table of contents lets us know which pages to go to.

I create an example using identical table structure as before, but this time I’ll add a nonclustered index onto the decimal field:

CREATE TABLE dbo.NonClusteredIndex
    (
      ID INT IDENTITY(1, 1) ,
      string VARCHAR(108) ,
      digits DECIMAL(38, 3)
    )
CREATE CLUSTERED INDEX IX_NonClusteredIndex_ID ON dbo.NonClusteredIndex (ID)
CREATE NONCLUSTERED INDEX IX_NonclusteredIndex_digit ON dbo.NonClusteredIndex (digits)
SELECT TOP 1000000
        IDENTITY( INT,1,1 ) AS N
INTO    #temp
FROM    MASTER.dbo.syscolumns sc1 ,
        MASTER.dbo.syscolumns sc2
INSERT  INTO dbo.NonClusteredIndex
        ( string ,
          digits
        )
        SELECT  REPLICATE(NEWID(), ABS(CHECKSUM(NEWID()) % 3) + 1) ,--Random string of length 36, 72, or 108
                CAST(CHECKSUM(NEWID()) AS FLOAT) / 1000 --Random decimal
        FROM    #temp
DROP TABLE #temp

To show the advantage we have now given ourselves, let’s try selecting all the records with a negative number in the digits column, first from the dbo.ClusteredIndex table, and then from the dbo.NonClusteredIndex table.  We’ll look at what’s happening behind the scenes using Profiler.  Here’s the SELECT from the table without the nonclustered index:

20141231 Table Structure Clustered Index Select

I’ve highlighted the important stats: 343 CPU, 13,536 reads, and 900 milliseconds to complete.  Here’s the SELECT from the table with the nonclustered index:

20141231 Table Structure NonClustered Index Select

We can see a very impressive performance improvement by selecting from a nonclustered index.  CPU has been cut by more than 50%, the number of reads has fallen by 88%, and the duration has also fallen.  We can see how the database engine found our data by looking at the execution plan.  Here is the plan for the Select from dbo.ClusteredIndex:

20141231 Table Structure Clustered Index Plan

We can see that SQL had to scan the entire clustered index to find our data – this means the entire table was read.  SQL Server has even suggested that we can improve performance by adding a nonclustered index.  Here is the execution plan for the Select from dbo.Nonclusteredindex:

20141231 Table Structure NonClustered Index Plan

Here the database engine was able to use an index seek on the nonclustered index.  This means it didn’t even have to traverse the entire nonclustered index – it was able to simply find the section of the index it needed and read those records.

The last interesting thing to note relates to our earlier point about how nonclustered indexes reference the remaining data for heaps versus clustered indexes.  If we look at the lowest level of a page from our current nonclustered index (which lives in a table that contains a clustered index), we can see the clustered index is contained on the nonclustered index pages:

20141231 Table Structure Nonclustered page with clustered index

Here, we see the ID column is included, even though the nonclustered index was only built on the digits column.  This changes if we first drop the clustered index, then look at the lowest level of the nonclustered index:

DROP INDEX IX_NonClusteredIndex_ID ON dbo.NonClusteredIndex

Here is the lowest level of the nonclustered index:

20141231 Table Structure Nonclustered page with heap

Now we see a Row ID instead of the clustered index key.  This tells the database engine where to look for the rest of the data relating to this row (in our case, the ID and string values that correspond to each particular digits value).  The two things to note on this:

  1. Again, since this points to a physical location, any time that heap data moves (such as during a table rebuild) the nonclustered index will need to be rebuilt.  This does not happen during a clustered index rebuild because the nonclustered index stores the clustered key itself, not a physical location.
  2. This illustrates the reason for keeping clustered index keys small.  Since each nonclustered index contains the clustered index key, a large clustered index key will also force our nonclustered indexes to be larger than may be necessary.
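To make point 2 concrete, here’s a hypothetical comparison. Every row of every nonclustered index carries the clustered key, so a wide key taxes all of them:

```sql
-- 4-byte clustered key: each nonclustered index row carries 4 extra bytes.
CREATE TABLE dbo.NarrowKey
    (ID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED, val VARCHAR(50));

-- 16-byte clustered key: each nonclustered index row carries 16 extra bytes,
-- quadrupling the key overhead in every index built on the table.
CREATE TABLE dbo.WideKey
    (ID UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY CLUSTERED, val VARCHAR(50));
```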

So, there is my (long) explanation of table structures in SQL Server.  Hopefully this will help you understand why certain best practices exist, and how to improve performance on your tables.