There is a joke among fisherman about what boat really means. The common acronyms are Break Out Another Thousand or Bankruptcy On A Trailer. These jokes are funny, but also true – owning a boat is expensive. This winter I decided to take the boat in to the marina to have it tuned up. It was having trouble starting consistently, but aside from that it ran pretty well. Now that I’ve gotten it back, I’m anxious to take it out and see how much better it performs. It feels good to know that it should be running better, but it did set me back a pretty penny.
This is part 2 in my 3 part series on the high level overview of setting up a data warehouse. My previous post dealt with the first two decisions, selecting the business process and choosing the level of granularity. This part will focus on identifying and building the dimension tables. Dimensions provide the who, what, when, where, why, and how context surrounding the business process. For my example business process, retail orders, these will include:
- Who is making each purchase?
- What are they buying?
- Which purchase channel or store are they purchasing it from?
- When are they making the purchase?
To answer these questions, I’ve identified four distinct dimension tables that will be needed: Customer, Product, Order Channel, and Date.
I’ll go over a few best practices for creating dimension tables, then I’ll provide examples for creating the first three, then I’ll describe the Date table in a bit more detail.
Each table needs a primary key that will be linked to the fact table through the use of a foreign key. The best practice is to use a surrogate key. Although using the system of record’s natural key may seem like a good idea, this can cause major complications down the line, should anything in the source system change. Additionally, this allows for the easy integration of multiple source systems, improves performance, and supports attribute change tracking (I may get into slowly changing attribute dimension tables in a later post). That said, I would include the natural key as an attribute in the dimension table.
Each table should be as denormalized as possible. Using a star schema, with all or most dimension tables joining only to the fact table, will get much better performance than the snowflake schema, where a dimension may have to go through one or more other dimension tables to eventually be related to the fact table. As an example, if you have a product table where each product has a department, category, and color, you would include the descriptions for these right in the dimension table rather than having a separate product department, product category, and product color tables that link to the product dimension table. Although this uses more space in the database, the space used by dimension tables pales in comparison to the fact tables. Additionally, the performance is much better without having to do all the additional joins to get the data that is needed.
Each table needs to have a row for the unknown value. This will allow us to report when we don’t have all the information. For example, let’s say someone returns a product that we have never sold. The employee working in the returns area accepts it back by mistake. Now to get rid of it, we list it as a non-SKU’d product for a low price. We still want to report on the sale in our data warehouse, but we don’t have any real product to tie it to. In that case, we’d tie it to a product called “Unknown”. By doing this we can keep NULLs out of our fact table foreign keys.
The last thing I’ll note on dimension tables is the load process. Because of the relative simplicity of our system, do the ETL all within stored procedures in SQL Server. Many data warehouses will require the use of Integration Services (SSIS). I would recommend sticking with the stored procedures when possible, for simplicity’s sake. However, there will be scenarios where SSIS is required. I have fact tables in my environment that are truncated and reloaded each night, as well as fact tables that are incrementally loaded (only new and changed rows are loaded). Because of this, I am required to preserve the surrogate keys in my dimension tables. With that in mind, my load process is as follows:
- Load any newly created attributes (new products, customers, etc.)
- Update any aggregate info (customer lifetime order value, product last purchase date, etc.).
I include logging on the load process, but I’ll show that in greater detail on the next post when I discuss fact tables.
So to show examples of tables for our scenario, I might have the following tables:
CREATE TABLE [dbo].[dimCustomer]( [CustomerKey] [INT] IDENTITY(1,1) NOT NULL, [CustomerID] [DECIMAL](9, 0) NOT NULL, [CustomerName] [VARCHAR](100) NULL, [CustomerAddress] [VARCHAR](100) NULL, [CustomerCity] [VARCHAR](75) NULL, [CustomerStateAbbrv] [VARCHAR](3) NULL, [CustomerCountry] [VARCHAR](30) NULL, [CustomerZIP] [VARCHAR](20) NULL, [CustomerEmail] [VARCHAR](255) NULL, [LifetimeOrderValue] [DECIMAL](18, 2) NULL, [FirstOrderChannel] [VARCHAR](25) NULL, [FirstOrderHadFreeShipping] [BIT] NULL, CONSTRAINT [PK_dimCustomer] PRIMARY KEY CLUSTERED ( [CustomerKey] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] ) ON [PRIMARY] CREATE TABLE [dbo].[dimProduct]( [ProductKey] [int] IDENTITY(1,1) NOT NULL, [Style] [varchar](20) NULL, [StyleDescription] [varchar](40) NULL, [SkuCode] [varchar](14) NULL, [SkuDescription] [varchar](40) NULL, [ColorCode] [varchar](4) NULL, [ColorName] [varchar](40) NULL, [LongSkuDivision] [varchar](30) NULL, [LongSkuDepartment] [varchar](30) NULL, [CountryOfOrigin] [varchar](30) NULL, CONSTRAINT [PK_dimProduct] PRIMARY KEY CLUSTERED ( [ProductKey] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] ) ON [PRIMARY] CREATE TABLE [dbo].[dimOrderChannel]( [OrdChanKey] [INT] IDENTITY(1,1) NOT NULL, [OrderChannel] [VARCHAR](50) NULL, CONSTRAINT [PK_dimOrderChannel] PRIMARY KEY CLUSTERED ( [OrdChanKey] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] ) ON [PRIMARY]
The date table requires a bit more explanation. This is the only dimension table where using a logical primary key, rather than a simple identity integer, makes sense. I prefer creating the primary key as yyyymmdd.
Virtually every model will need a date table, because almost every process that is being reported on occurred at or during a specific time. The date table can be built in advance, can be used across multiple models, and can be used in a single model multiple times. For example, Amazon may want to report on orders both on the purchase date/time and the ship date/time. Both dimensions can use a single date dimension table. It is helpful to put tons of descriptors into your date table. It is not uncommon to see an array of info including day of week, day of year, fiscal day of year, holiday indicator, etc. Below is the code I used to create my date dimesion table.
CREATE TABLE [dbo].[dimDate]( [DateKey] [INT] NOT NULL, [Date] [DATETIME2](0) NULL, [Year] [SMALLINT] NULL, [Quarter] [TINYINT] NULL, [Month] [TINYINT] NULL, [MonthName] [VARCHAR](20) NULL, [WeekOfMonth] [TINYINT] NULL, [WeekOfYear] [TINYINT] NULL, [DayOfWeek] [TINYINT] NULL, [DayOfWeekName] [VARCHAR](14) NULL, [DayOfMonth] [TINYINT] NULL, [DayOfYear] [SMALLINT] NULL, CONSTRAINT [PK_dimDate] PRIMARY KEY CLUSTERED ( [DateKey] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] ) ON [PRIMARY] SELECT TOP 10000 IDENTITY( INT,1,1 ) AS N INTO #tally FROM MASTER.dbo.syscolumns sc1 CROSS JOIN MASTER.dbo.syscolumns sc2; WITH cte AS ( SELECT DATEADD(DAY, N - 1, '2000-01-01') AS Date FROM #tally ) INSERT INTO dbo.dimDate SELECT YEAR(cte.Date) * 10000 + MONTH(cte.Date) * 100 + DAY(cte.Date) AS DateKey, Date , YEAR(cte.Date) AS YEAR , DATEPART(QUARTER,cte.Date) AS Quarter, MONTH(cte.Date) AS MONTH , DATENAME(MONTH, cte.Date) AS MonthName , DATEPART(ww, cte.Date) + 1 - DATEPART(ww, CAST(DATEPART(mm, cte.Date) AS VARCHAR) + '/1/' + CAST(DATEPART(yy, cte.Date) AS VARCHAR)) AS WeekOfMonth , DATEPART(wk, cte.Date) AS WeekOfYear , DATEPART(dw,cte.Date) AS DayOfWeek, DATENAME(dw, cte.Date) AS DayOfWeekName , DAY(cte.Date) AS DayOfMonth , DATEPART(DAYOFYEAR, cte.Date) AS DayOfYear , FROM cte UNION SELECT -1, NULL, NULL, NULL, NULL, 'Unknown', NULL, NULL, NULL, 'Unknown', NULL, NULL; DROP TABLE #tally;
So I gave you a high level overview of creating the dimension tables in a data warehouse. Stay tune for part three, where I look at the fact tables.