Evaluating the use of Sparse Columns

A few summers ago I was fishing a very clear lake up in Oneida county, WI for muskie. Oneida county and the county just to the north of it, Vilas county, offer some of the best muskie fishing in the Midwest. It was a hot, calm day with not a cloud in the sky. These types of conditions are usually not very good for muskie – they are ambush predators and prefer low light conditions or stained water so they can close in on their prey without being noticed until the last second. I was fishing a deep weedline with a deep-running crankbait. As I was reeling in I felt the dull tug of weed getting caught in the hooks of the lure. I gave a sharp tug with the rod to try and free the lure, but the weeds would not let go. As my lure limped back to the boat, I was surprised to see a good sized muskie following the lure even though it was surrounded by the weeds it was dragging. One fun thing about muskie is their willingness to attack lures right next to the boat, and for this reason the seasoned muskie angler will swim the bait in a circle a few times at the end of each retrieve (commonly called a figure-eight). Although I was sure the fish wouldn’t hit the fouled lure, I still went into my standard figure-eight. I was surprised and completely unprepared as the muskie shot forward and grabbed the lure, still full of weeds, about a quarter way through my first circle. I had a split second decision on which way to try the hookset, and I made the wrong one. I tried setting the hood back against the direction the fish was moving, but I ended up striking the side of the boat with my rod, and the fish dropped the lure and swam off. I had made a wrong decision, and had to live with losing that fish. Fortunately when deciding whether or not to use Sparse columns in SQL Server, we don’t have a split second decision – we can experiment to see whether they are a good fit for our particular tables.
Previously I had posted on the amount of space used by fixed length fields, especially when storing the NULL value. One way to get around this “limitation” of these data types is to use Sparse columns. Microsoft’s Developer Network defines Sparse columns as ordinary columns that have an optimized storage for NULL values. Sparse columns reduce the space needed for NULL values, but the tradeoff is the space used and cost to retrieve the non-NULL values. Microsoft suggests using sparse columns when the space saved is at least 20 to 40 percent. There are also a couple of important restrictions on Sparse columns:

  • Obviously, the column has to be nullable (no identity or primary key)
  • These data types are not allowed to use sparse columns: text, ntext, image, timestamp, user-defined, geometry, or geography
  • The columns can’t have a default value or be a computed column
  • Sparse columns are incompatible with compression. Sparse columns can’t be added to compressed tables and tables with existing sparse columns can’t be compressed

That last point is pretty important because you can often get significant space savings with compression. The main downside of compression is the extra CPU usage required when working with these tables, but that’s another discussion altogether.
When SQL Server is changing a column from sparse to nonsparse or vice versa, that table is unavailable (locked). This is because SQL Server follows this procedure to make the change:

  1. Adds a new column to the table with the new storage size and format.
  2. Copies the data from the old row to the new
  3. Removes the old column
  4. Rebuilds the table or clustered index to reclaim the space used by the old column.

It’s important to remember there is a row size limit in SQL Server that is 8060 bytes for tables without Sparse columns and 8018 bytes for tables with Sparse columns. If changing a column from nonsparse to Sparse violates these limits, the change will fail.
From a storage perspective, Sparse columns work by not storing any data for columns that have been marked sparse and are NULL. However, in order to store values in Sparse columns that aren’t NULL, SQL server has to not only store the value, but also what column is holding that value. This causes the size required for each non-NULL value to increase by 4 to 8 bytes (for example, non-NULL integers increase from 4 to 8 bytes).

Now that I’ve hopefully provided a little background on what Sparse columns are and how they work, let’s look at an example. I have created a table with 1 million records in an integer field.


CREATE TABLE dbo.SparseTest
    (
      ID INT IDENTITY(1, 1) ,
      IntegerField INT ,
      CONSTRAINT PK_SparseTest_ID PRIMARY KEY CLUSTERED ( ID ASC )
        WITH ( PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
               IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON,
               ALLOW_PAGE_LOCKS = ON ) ON [PRIMARY]
    )
ON  [PRIMARY]
DECLARE @int INT = 1
WHILE @int <= 1000000
    BEGIN
        INSERT  INTO dbo.SparseTest
                ( IntegerField )
        VALUES  ( CHECKSUM(NEWID()) )
        SELECT  @int = @int + 1
    END

This table is full of values – none are NULL:

20141223 No Sparse No Null

The initial size of this table is about 16 MB:

20141223 No Sparse Size

Next I update the table definition to make IntegerField a Sparse column:

ALTER TABLE dbo.SparseTest ALTER COLUMN IntegerField INT SPARSE
ALTER TABLE dbo.SparseTest REBUILD

I found I did have to rebuild the table before I was able to see the change in used space.  Now our table with the Sparse column with no NULL values has ballooned to 28 MB!

20141223 Sparse Size

I change 25% of the values in the Sparse column to NULL with this statement:

UPDATE  dbo.SparseTest
SET     IntegerField = NULL
WHERE   ID % 4 = 0

20141223 Sparse Size 25 Percent NULL

The size has shrunk to approximately 4 MB, to 24 MB.  Next I change another 25% of the values to NULL.  Now half of the values in the table are NULL.

UPDATE  dbo.SparseTest
SET     IntegerField = NULL
WHERE   ID % 4 = 1

The size has shrunk another 4 MB; we are now down to about 20 MB for the entire table.

20141223 Sparse Size 50 Percent NULL

I update another 25% to NULL so now 75% of the values in the Sparse column are NULL.

UPDATE  dbo.SparseTest
SET     IntegerField = NULL
WHERE   ID % 4 = 2

20141223 Sparse Size 75 Percent NULL

We now see the table size is down to 16 MB, the same size as when the column was nonSparse.  One last update puts the table down to 12.5 MB.

UPDATE  dbo.SparseTest
SET     IntegerField = NULL

20141223 Sparse Size 100 Percent NULL

So this demonstrates the size differences of the entire table based on the amount of NULLs in a Sparse column.  The real lesson here is to experiment with Sparse columns on your tables if you have many NULL values, but understand that you need to check to make sure making columns Sparse does indeed reduce the amount of space used.  And don’t forget, this isn’t a one time decision.  The percentage of total values in a table that are NULL can change often; don’t be afraid to track the NULL percentage over time to make sure you are keeping your table optimized.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s