Data Warehouse Fact Tables

Being an avid fisherman, I love good fishing art.  I found the picture above using a Google search.  The pictures online are good, but aren’t comparable to seeing them in person.  I recently saw a picture of a muskie called Windfall Muskie, painted by Terry Doughty.  This is the best painting of a muskie I’ve seen so far, a beautiful piece of art.  I love seeing how an artist can capture the thrill of fishing on a canvas.

Previously, I described how to begin setting up a data warehouse by choosing a business process, selecting the grain, and building the dimension tables.  This post will focus on the last step, building the fact tables.  The fact tables contain a foreign key link to each associated dimension table as well as the measures that we will be using in the reports.  The measures are usually numeric values that can be aggregated in the report.  For example, for retail orders I may want to measure product price, discount, tax paid, and other metrics related to price and cost paid in an order.

Fact tables are usually very long and narrow, as opposed to dimension tables, which are generally short and wide.  A fact table holds one row for each transaction in your business process, based on your chosen grain.  For retail orders at the order-line grain, each row in the fact table is an order line.  The table would have an Order ID, Order Line Number, Line Price, Quantity Purchased, and foreign keys to all the descriptive dimension tables, such as Customer, Product, Purchase Date, and Purchase Location.
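
To make that concrete, here is a rough sketch of what such a fact table could look like, using the key and measure names that appear in the query later in this post.  The exact data types, and treating Order ID and Order Line Number as the primary key, are my assumptions rather than a definitive design.

CREATE TABLE [dbo].[factOrderPlaced](
   [OrderID] [INT] NOT NULL,           -- degenerate dimension: the order number itself
   [OrderLineNumber] [INT] NOT NULL,
   [CustomerKey] [INT] NOT NULL,       -- foreign keys to the dimension tables
   [ProductKey] [INT] NOT NULL,
   [OrderDateKey] [INT] NOT NULL,
   [OrdChanKey] [INT] NOT NULL,
   [OrderQuantity] [INT] NULL,         -- the measures that get aggregated in reports
   [SalePrice] [DECIMAL](18, 2) NULL,
 CONSTRAINT [PK_factOrderPlaced] PRIMARY KEY CLUSTERED
(
   [OrderID] ASC,
   [OrderLineNumber] ASC
)) ON [PRIMARY]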

Once the fact table is built and loaded, you can easily find totals at a level aggregated by any of the descriptors.  For example, you can get the quantity purchased by store, the total order value by customer, or the most popular product by date.  Here is an example of how you can join the fact to dimension tables:

SELECT  dc.CustomerStateAbbrv ,
        dp.LongSkuDivision ,
        dd.[Month] ,
        oc.OrderChannel ,
        SUM(fop.OrderQuantity) AS TotalQuantity ,
        SUM(fop.SalePrice) AS TotalSalePrice
FROM    dbo.factOrderPlaced fop
INNER JOIN dbo.dimCustomer dc ON dc.CustomerKey = fop.CustomerKey
INNER JOIN dbo.dimProduct dp ON dp.ProductKey = fop.ProductKey
INNER JOIN dbo.dimDate dd ON dd.DateKey = fop.OrderDateKey
INNER JOIN dbo.dimOrderChannel oc ON oc.OrdChanKey = fop.OrdChanKey
GROUP BY dc.CustomerStateAbbrv ,
        dp.LongSkuDivision ,
        dd.[Month] ,
        oc.OrderChannel;

I hope this three-part series has given you a basic idea of how to go about starting a data warehouse.


Data Warehouse Dimension Tables

There is a joke among fishermen about what boat really means. The common acronyms are Break Out Another Thousand or Bankruptcy On A Trailer. These jokes are funny, but also true – owning a boat is expensive. This winter I decided to take the boat in to the marina to have it tuned up. It was having trouble starting consistently, but aside from that it ran pretty well. Now that I’ve gotten it back, I’m anxious to take it out and see how much better it performs. It feels good to know that it should be running better, but it did set me back a pretty penny.
This is part 2 in my 3-part series giving a high-level overview of setting up a data warehouse. My previous post dealt with the first two decisions: selecting the business process and choosing the level of granularity. This part will focus on identifying and building the dimension tables. Dimensions provide the who, what, when, where, why, and how context surrounding the business process. For my example business process, retail orders, these will include:

  • Who is making each purchase?
  • What are they buying?
  • Which purchase channel or store are they purchasing it from?
  • When are they making the purchase?

To answer these questions, I’ve identified four distinct dimension tables that will be needed: Customer, Product, Order Channel, and Date.
I’ll go over a few best practices for creating dimension tables, provide example definitions for the first three, and then describe the Date table in a bit more detail.
Each table needs a primary key that will be linked to the fact table through the use of a foreign key. The best practice is to use a surrogate key. Although using the system of record’s natural key may seem like a good idea, this can cause major complications down the line, should anything in the source system change. Additionally, this allows for the easy integration of multiple source systems, improves performance, and supports attribute change tracking (I may get into slowly changing attribute dimension tables in a later post). That said, I would include the natural key as an attribute in the dimension table.
Each table should be as denormalized as possible. Using a star schema, with all or most dimension tables joining only to the fact table, will get much better performance than the snowflake schema, where a dimension may have to go through one or more other dimension tables to eventually be related to the fact table. As an example, if you have a product table where each product has a department, category, and color, you would include the descriptions for these right in the dimension table rather than having a separate product department, product category, and product color tables that link to the product dimension table. Although this uses more space in the database, the space used by dimension tables pales in comparison to the fact tables. Additionally, the performance is much better without having to do all the additional joins to get the data that is needed.
Each table needs to have a row for the unknown value. This will allow us to report when we don’t have all the information. For example, let’s say someone returns a product that we have never sold. The employee working in the returns area accepts it back by mistake. Now to get rid of it, we list it as a non-SKU’d product for a low price. We still want to report on the sale in our data warehouse, but we don’t have any real product to tie it to. In that case, we’d tie it to a product called “Unknown”. By doing this we can keep NULLs out of our fact table foreign keys.
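As a quick sketch of how that row might be seeded, here is one way to insert an “Unknown” member with a surrogate key of -1 into the product dimension defined later in this post. The placeholder values are just a convention I’m assuming, not a requirement.

SET IDENTITY_INSERT dbo.dimProduct ON;

INSERT INTO dbo.dimProduct
        ( ProductKey, Style, StyleDescription, SkuCode, SkuDescription,
          ColorCode, ColorName, LongSkuDivision, LongSkuDepartment, CountryOfOrigin )
SELECT  -1, 'UNK', 'Unknown', 'UNK', 'Unknown',
        'UNK', 'Unknown', 'Unknown', 'Unknown', 'Unknown'
WHERE   NOT EXISTS ( SELECT 1 FROM dbo.dimProduct WHERE ProductKey = -1 );  -- seed it only once

SET IDENTITY_INSERT dbo.dimProduct OFF;
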
The last thing I’ll note on dimension tables is the load process. Because of the relative simplicity of our system, I do all of the ETL within stored procedures in SQL Server. Many data warehouses will require the use of Integration Services (SSIS), and there will be scenarios where SSIS is the right tool, but I recommend sticking with stored procedures when possible, for simplicity’s sake. I have fact tables in my environment that are truncated and reloaded each night, as well as fact tables that are incrementally loaded (only new and changed rows are loaded). Because of this, I need to preserve the surrogate keys in my dimension tables. With that in mind, my dimension load process is as follows (a simplified sketch appears after the list):

  1. Load any newly created attributes (new products, customers, etc.)
  2. Update any aggregate info (customer lifetime order value, product last purchase date, etc.).
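
Here is a simplified sketch of what those two steps might look like for the customer dimension. The staging table, dbo.stgCustomer, and its columns are assumptions for illustration; the real source will vary.

-- Step 1: add any customers we haven't seen before, preserving existing surrogate keys
INSERT INTO dbo.dimCustomer
        ( CustomerID, CustomerName, CustomerAddress, CustomerCity, CustomerStateAbbrv,
          CustomerCountry, CustomerZIP, CustomerEmail )
SELECT  s.CustomerID, s.CustomerName, s.CustomerAddress, s.CustomerCity, s.CustomerStateAbbrv,
        s.CustomerCountry, s.CustomerZIP, s.CustomerEmail
FROM    dbo.stgCustomer s   -- assumed staging table loaded from the source system
WHERE   NOT EXISTS ( SELECT 1 FROM dbo.dimCustomer dc WHERE dc.CustomerID = s.CustomerID );

-- Step 2: refresh aggregate attributes such as lifetime order value
UPDATE  dc
SET     dc.LifetimeOrderValue = s.LifetimeOrderValue
FROM    dbo.dimCustomer dc
        INNER JOIN dbo.stgCustomer s ON s.CustomerID = dc.CustomerID;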

I include logging in the load process, but I’ll show that in greater detail in the next post, when I discuss fact tables.
So, to show examples for our scenario, I might have the following tables:

CREATE TABLE [dbo].[dimCustomer](
   [CustomerKey] [INT] IDENTITY(1,1) NOT NULL,
   [CustomerID] [DECIMAL](9, 0) NOT NULL,
   [CustomerName] [VARCHAR](100) NULL,
   [CustomerAddress] [VARCHAR](100) NULL,
   [CustomerCity] [VARCHAR](75) NULL,
   [CustomerStateAbbrv] [VARCHAR](3) NULL,
   [CustomerCountry] [VARCHAR](30) NULL,
   [CustomerZIP] [VARCHAR](20) NULL,
   [CustomerEmail] [VARCHAR](255) NULL,
   [LifetimeOrderValue] [DECIMAL](18, 2) NULL,
   [FirstOrderChannel] [VARCHAR](25) NULL,
   [FirstOrderHadFreeShipping] [BIT] NULL,
 CONSTRAINT [PK_dimCustomer] PRIMARY KEY CLUSTERED 
(
   [CustomerKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]


CREATE TABLE [dbo].[dimProduct](
   [ProductKey] [int] IDENTITY(1,1) NOT NULL,
   [Style] [varchar](20) NULL,
   [StyleDescription] [varchar](40) NULL,
   [SkuCode] [varchar](14) NULL,
   [SkuDescription] [varchar](40) NULL,
   [ColorCode] [varchar](4) NULL,
   [ColorName] [varchar](40) NULL,
   [LongSkuDivision] [varchar](30) NULL,
   [LongSkuDepartment] [varchar](30) NULL,
   [CountryOfOrigin] [varchar](30) NULL,
 CONSTRAINT [PK_dimProduct] PRIMARY KEY CLUSTERED 
(
   [ProductKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

CREATE TABLE [dbo].[dimOrderChannel](
   [OrdChanKey] [INT] IDENTITY(1,1) NOT NULL,
   [OrderChannel] [VARCHAR](50) NULL,
 CONSTRAINT [PK_dimOrderChannel] PRIMARY KEY CLUSTERED 
(
   [OrdChanKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

The date table requires a bit more explanation. This is the only dimension table where using a logical primary key, rather than a simple identity integer, makes sense. I prefer creating the primary key as yyyymmdd.
Virtually every model will need a date table, because almost every process being reported on occurs at or during a specific time. The date table can be built in advance, can be used across multiple models, and can even be used multiple times within a single model. For example, Amazon may want to report on orders by both the purchase date/time and the ship date/time; both roles can use a single date dimension table. It is helpful to put plenty of descriptors into your date table. It is not uncommon to see an array of info including day of week, day of year, fiscal day of year, holiday indicator, etc. Below is the code I used to create my date dimension table.

CREATE TABLE [dbo].[dimDate](
   [DateKey] [INT] NOT NULL,
   [Date] [DATETIME2](0) NULL,
   [Year] [SMALLINT] NULL,
   [Quarter] [TINYINT] NULL,
   [Month] [TINYINT] NULL,
   [MonthName] [VARCHAR](20) NULL,
   [WeekOfMonth] [TINYINT] NULL,
   [WeekOfYear] [TINYINT] NULL,
   [DayOfWeek] [TINYINT] NULL,
   [DayOfWeekName] [VARCHAR](14) NULL,
   [DayOfMonth] [TINYINT] NULL,
   [DayOfYear] [SMALLINT] NULL,
 CONSTRAINT [PK_dimDate] PRIMARY KEY CLUSTERED 
(
   [DateKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

-- Build a tally table of 10,000 sequential numbers to drive the date generation
SELECT TOP 10000
        IDENTITY( INT, 1, 1 ) AS N
INTO    #tally
FROM    MASTER.dbo.syscolumns sc1
        CROSS JOIN MASTER.dbo.syscolumns sc2;

WITH    cte
          AS ( SELECT   DATEADD(DAY, N - 1, '2000-01-01') AS Date
               FROM     #tally
             )
    INSERT INTO dbo.dimDate
    SELECT  YEAR(cte.Date) * 10000 + MONTH(cte.Date) * 100 + DAY(cte.Date) AS DateKey ,
            cte.Date ,
            YEAR(cte.Date) AS [Year] ,
            DATEPART(QUARTER, cte.Date) AS [Quarter] ,
            MONTH(cte.Date) AS [Month] ,
            DATENAME(MONTH, cte.Date) AS MonthName ,
            DATEPART(ww, cte.Date) + 1 - DATEPART(ww, CAST(DATEPART(mm, cte.Date) AS VARCHAR) + '/1/' + CAST(DATEPART(yy, cte.Date) AS VARCHAR)) AS WeekOfMonth ,
            DATEPART(wk, cte.Date) AS WeekOfYear ,
            DATEPART(dw, cte.Date) AS [DayOfWeek] ,
            DATENAME(dw, cte.Date) AS DayOfWeekName ,
            DAY(cte.Date) AS DayOfMonth ,
            DATEPART(DAYOFYEAR, cte.Date) AS [DayOfYear]
    FROM    cte
    UNION
    SELECT  -1 ,        -- the "Unknown" date row
            NULL ,
            NULL ,
            NULL ,
            NULL ,
            'Unknown' ,
            NULL ,
            NULL ,
            NULL ,
            'Unknown' ,
            NULL ,
            NULL;

DROP TABLE #tally;

That’s a high-level overview of creating the dimension tables in a data warehouse. Stay tuned for part three, where I’ll look at the fact tables.

Data Warehouse Creation

Spring fishing for walleye on the Wolf River can be really hot.  When the walleye are running up-river to spawn in the marshes, they can be extremely thick.  Catching them can be fairly easy.  The one bad thing about this is that almost every angler knows it.  As you can see in the picture above, boats can get stacked right on top of each other.  I was hoping to head up one day to try to get a limit of eaters, but I haven’t been in the mood to fight the crowds lately.

I’ve recently implemented a data warehouse at work.  A data warehouse is a storehouse for information collected from a wide range of sources across an organization.  This storehouse must make it easy for users to access the information they need in a timely manner, it must be consistent, it must be adaptable to change, and most importantly it must be trustworthy.  This is the first time I’ve ever set up a data warehouse.  I’m going to spend the next couple posts explaining the steps I followed in setting it up.

I started by studying the Ralph Kimball method for dimensional modeling.  I used The Data Warehouse Toolkit, 3rd Edition.  I feel it’s very important to spend time researching and planning in advance, because poor design can be very difficult and onerous to fix.

The Kimball method proposes a four step dimensional design process:

  1. Select the business process
  2. Declare the grain
  3. Identify the dimensions
  4. Identify the facts

We chose retail orders as the business process we wanted to report on.  It’s a good idea to choose a fairly simple process to start with.

I’m going to save the dimension and fact creation for later blog posts, but I will discuss the grain here.  The grain is basically the detail level of the fact table.  The Kimball method suggests starting at the atomic grain, or the lowest level at which data is captured by a given business process.  For my purpose, since I began with retail orders, the lowest level is the order line.  Other grains that I could have considered would have been at the order level or even the daily, weekly, or yearly order level.  Every time you go up a level, you lose detail about the orders.  For example, at the order line level I can see a separate product for each line.  But if I look at the order level, I can no longer see the individual products within the order.  If I go up another level and look at all orders taken in a day, I lose the different customers that placed orders.

The only advantage of using a higher grain is that you will be dealing with less data, since it has already been aggregated, which makes processing run faster.  To offset the slower performance at the lower grains, analysis cubes can be used.  These cubes pre-aggregate various cross sections of the data so analysis can be performed quickly at the aggregate level while still preserving the underlying detail.

Stay tuned for my next post where I will define and describe dimension table creation.

Ordering attributes in Power BI

Earlier this summer I got to fish for Salmon on Lake Michigan for the first time.  A little while ago I got to do another first: deep ocean fishing off the coast of California.  My wife and I took a vacation out near Los Angeles since one of my college buddies was getting married.  We stayed an extra couple days after the wedding and spent one of them on a fishing charter out of Santa Barbara.

It was a cool experience, definitely not the type of fishing I’m used to.  The boat had about 30 people standing around the outside with fishing rods ready.  The captain stopped us over a pod of fish and everyone dropped their lines at the same time.  Since we fished between 125 and 300 foot depths, it took 2 to 3 minutes for the bait (which was sliced squid and chopped up anchovies) to reach the fish.  After catching a fish it took another 2 – 3 minutes to reel it up.  Since most people had monofilament and we had so much line out, it was pretty difficult to feel the bites.  If you weren’t sure, you basically wasted the 5 – 6 minutes dropping down and reeling back up.

My wife and I finished with 21 rockfish (the limit) and one ocean whitefish.  The whitefish tasted incredibly good.  It was a fun trip and definitely worth trying at least once.  I also want to commend the fishing crew that took us out.  They were patient and efficient, and did a great job making sure everyone had fun and caught fish.

(Photo: all the fish my wife and I caught. The ocean whitefish is in the upper right, partially obscured by a rockfish’s tail.)

I’ve been delving heavily into Business Intelligence at work the last few months.  I had no previous Business Intelligence training or experience, and am now being asked to create a data warehouse and build various reports off it.  It is quite a daunting task, and I hope to share what I’ve learned in the next several blog posts.  I have chosen to follow the strategy proposed by Ralph Kimball for creating the data warehouse, and I’ve found it is serving me well.  This blog post, however, will focus on a single problem and the solution I found.

Power BI is Microsoft’s venture into the self-service business intelligence world.  The two main players in this market are Qlikview and Tableau, but Power BI is a rapidly growing solution with a low price point, so we have decided to pursue it for now.  I attended a SQL Saturday pre-con dedicated to Power BI back in April and came away very impressed with the tool.  However, there are still a lot of holes that need to be filled.  One such hole is the ability to sort columns.  If you import all your data into the model, you can choose to sort a column by another column in the table.  This works well, for example, when you want to sort by month number but show the month name on the visual’s axis.  However, there is currently no way to do this when the data stays in the source, locally or in Azure, and the model queries it each time it loads (the DirectQuery approach).  A simple way to get around this is to prepend the month number to the month name, producing a visual like the one shown below.
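
One way to build that label is right in the source query.  Here’s a rough T-SQL sketch; dbo.Orders and OrderDate are placeholder names for illustration, not objects from my actual model.

SELECT  RIGHT('0' + CAST(MONTH(OrderDate) AS VARCHAR(2)), 2)
        + ' ' + DATENAME(MONTH, OrderDate) AS MonthLabel   -- e.g. '09 September'
FROM    dbo.Orders;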

(Screenshot: a chart with the month number prefixed to each month name, so the months sort in calendar order.)

This looks ok, though not great.  However, I came upon another version of this problem that wouldn’t look quite so good.  I need to display a bunch of apparel sizes.  These sizes would have names like Small, Medium, and Large, but would also include shoe sizes, which are simply numbers like 10, 10.5, 11, etc.

Prepending numbers to order the size names would result in labels such as 14 10, 15 10.5, and 16 11, which would be all kinds of confusing.  After thinking about it for a little while, I came up with a solution that I felt was pretty clever: I decided to put a number of zero-width space characters in front of each name.  The more zero-width spaces there are in front of a name, the later that name shows up in the sort order.

In order to do this, I had to convert the size columns from varchar to nvarchar, since the zero-width space is a Unicode character.  Then I adjusted the names using a size order table, which stores the sort order in an integer column:

(Screenshot: the size order table, with each size and its integer sort order.)
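
The varchar-to-nvarchar conversion itself is a quick change; something like the sketch below, where the column lengths are my assumption.

ALTER TABLE dbo.dimProduct ALTER COLUMN SizeCode NVARCHAR(14) NULL;
ALTER TABLE dbo.dimProduct ALTER COLUMN SizeName NVARCHAR(40) NULL;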

I used the ROW_NUMBER window function and the REPLICATE function to put the correct number of zero-width spaces in front of each size name and code.

--Prepend zero-width spaces to the sizes for sorting purposes
WITH    cte
          AS ( SELECT   Size ,
                        ROW_NUMBER() OVER ( ORDER BY [Order] ) rn
               FROM     dbo.SizeOrder
             )
    UPDATE  dprod
    SET     dprod.SizeCode = REPLICATE(NCHAR(8203), cte.rn) + cte.Size ,
            dprod.SizeName = REPLICATE(NCHAR(8203), cte.rn) + sizedesc.[SEW Desc]
    FROM    cte
            INNER JOIN dbo.dimProduct dprod ON cte.Size = dprod.SizeCode
            INNER JOIN dbo.[SKU Element 2] sizedesc ON cte.Size = sizedesc.[SEW Code];

For the final result, here is my visual with sizes that are sorted smallest to largest rather than alphabetically, as would normally be the case:

(Screenshot: price total by size, with the sizes sorted from smallest to largest.)

I hope this helps if you ever need to sort on a text column but not alphabetically.