Spring fishing for walleye on the Wolf River can be really hot. When the walleye are running up-river to spawn in the marshes, they can be extremely thick. Catching them can be fairly easy. The one bad thing about this is that almost every angler knows it. As you can see in the picture above, boats can get stacked right on top of each other. I was hoping to head up one day to try to get a limit of eaters, but I haven’t been in the mood to fight the crowds lately.
I’ve recently implemented a data warehouse at work. A data warehouse is a storehouse for information collected from a wide range of sources across an organization. This storehouse must make it easy for users to access the information they need in a timely manner. It must also be consistent, adaptable to change, and, most importantly, trustworthy. This is the first time I’ve ever set up a data warehouse. I’m going to spend the next couple of posts explaining the steps I followed in setting it up.
I started by studying Ralph Kimball’s method for dimensional modeling, using The Data Warehouse Toolkit, 3rd Edition. I feel it’s very important to spend time researching and planning in advance, because a poor design can be very difficult to fix later.
The Kimball method proposes a four-step dimensional design process:
- Select the business process
- Declare the grain
- Identify the dimensions
- Identify the facts
We chose retail orders as the business process we wanted to report on. It’s a good idea to choose a fairly simple process to start with.
I’m going to save the dimension and fact creation for later blog posts, but I will discuss the grain here. The grain is basically the detail level of the fact table: it defines what a single row represents. The Kimball method suggests starting at the atomic grain, which is the lowest level at which data is captured by a given business process. For my purpose, since I began with retail orders, the lowest level is the order line. Other grains that I could have considered would have been the order level or even the daily, weekly, or yearly order level. Every time you go up a level you lose details about the order. For example, at the order line level I can see a separate product for each line. But if I look at the order level, I can no longer see the individual products within the order. If I go up another level and look at all orders taken on a day, I lose the different customers that placed those orders.
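To make the loss of detail concrete, here is a minimal sketch in Python. The order lines, customer names, products, and the `roll_up` helper are all made up for illustration; they are not from my actual warehouse. It starts at the order line grain and rolls the same rows up to the order level and then to the daily level.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class OrderLine:
    order_id: int
    order_date: date
    customer: str
    product: str
    quantity: int
    amount: float


# Hypothetical sample data at the atomic grain: one row per order line.
ORDER_LINES = [
    OrderLine(1001, date(2024, 4, 1), "Alice", "Jig", 3, 9.00),
    OrderLine(1001, date(2024, 4, 1), "Alice", "Line", 1, 12.50),
    OrderLine(1002, date(2024, 4, 1), "Bob", "Jig", 2, 6.00),
    OrderLine(1003, date(2024, 4, 2), "Alice", "Net", 1, 25.00),
]


def roll_up(lines, key):
    """Sum the amount per key; every column not in the key is discarded."""
    totals = defaultdict(float)
    for line in lines:
        totals[key(line)] += line.amount
    return dict(totals)


# Order grain: one row per order. The product column is gone.
by_order = roll_up(ORDER_LINES, key=lambda l: (l.order_id, l.customer))

# Daily grain: one row per day. Now the customer is gone as well.
by_day = roll_up(ORDER_LINES, key=lambda l: l.order_date)

print(by_order)
print(by_day)
```

Four order lines become three orders and then two days. There is no way to get the products back from `by_order` or the customers back from `by_day`, which is why starting at the atomic grain is the safer choice.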
The only advantage of using a higher level is that you will be dealing with less data since it has been aggregated, which makes processing run faster. To offset the slower processing at the lower levels, Analysis Cubes can be used. These cubes pre-aggregate various cross sections of the data so analysis can be performed quickly at the aggregate level while the atomic detail is still preserved underneath.
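The idea behind pre-aggregation can be sketched in a few lines of Python. This is only a toy illustration under my own assumptions, not how a real cube engine stores or computes anything; the rows, the dimension names, and the `build_cube` function are hypothetical. It computes the total amount once for every combination of dimensions, so later questions become simple lookups.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical order-line rows: (order_date, customer, product, amount).
ROWS = [
    ("2024-04-01", "Alice", "Jig", 9.00),
    ("2024-04-01", "Alice", "Line", 12.50),
    ("2024-04-01", "Bob", "Jig", 6.00),
    ("2024-04-02", "Alice", "Net", 25.00),
]
DIMENSIONS = ("order_date", "customer", "product")


def build_cube(rows, dimensions):
    """Pre-compute the total amount for every subset of the dimensions."""
    cube = {}
    indexes = range(len(dimensions))
    for size in range(len(dimensions) + 1):
        for subset in combinations(indexes, size):
            totals = defaultdict(float)
            for row in rows:
                # Group by the chosen dimensions; the amount is the last column.
                totals[tuple(row[i] for i in subset)] += row[-1]
            cube[tuple(dimensions[i] for i in subset)] = dict(totals)
    return cube


cube = build_cube(ROWS, DIMENSIONS)

# Aggregate questions are now dictionary lookups instead of table scans.
grand_total = cube[()][()]
jig_total = cube[("product",)][("Jig",)]
alice_april_1 = cube[("order_date", "customer")][("2024-04-01", "Alice")]

print(grand_total, jig_total, alice_april_1)
```

The original rows in `ROWS` are never thrown away, so the detail is still there when someone needs to drill down. The trade-off is extra storage and build time, since three dimensions already produce eight aggregate tables.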
Stay tuned for my next post, where I will define and describe dimension table creation.