The Baker’s Dozen: 13 Tips for Basics of Data Warehousing and Dimensional Modeling
Imagine you’re working for a company that has accumulated a tremendous amount of transaction data. The business users want to perform all sorts of analysis, monitoring, and analytics on the data. Some OLTP developers might reply with, “Just create views or stored procedures to query all the data the way the users want.” Many companies initially take that approach - however, just as certain technologies and system hardware configurations don’t scale well, certain methodologies don’t scale well either. Fortunately, this is where data warehousing and dimensional modeling can help. In this article, I’ll introduce developers to the basics of data warehousing and dimensional modeling - information that might help you provide even more value for your company.
This area of interest has many good books, including “desert island classics” like the famous Design Patterns: Elements of Reusable Object-Oriented Software by the Gang of Four and Refactoring: Improving the Design of Existing Code by Martin Fowler. If you want to learn about data warehousing and dimensional modeling, then THE book to read is The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, by Ralph Kimball. Note that this is NOT a book on technology - it is a book about methodologies and repeatable patterns for assembling data based on business entities. Kimball wrote the second edition in 2002, and the book is just as relevant today as it was a decade ago.
This article will introduce many of the concepts that Kimball covers. I’m going to keep the examples very basic. I strongly recommend you check out the Kimball website: http://www.kimballgroup.com. You’ll find many categories on the website, including an excellent section of design tips: http://www.kimballgroup.com/category/design-tips/
(Note that I’m not formally associated with Kimball or the Kimball Group in any way. I’ve implemented his methodologies in work projects and I - along with countless other developers - have benefited from his efforts.)
What’s on the Menu?
Here are the topics for this article:
- Major components of a data warehouse
- Cumulative transactional fact tables
- Factless fact tables
- Periodic snapshot fact tables
- General contents of dimension tables
- Snowflake dimension schemas
- Role-playing dimensions
- Junk dimensions
- Many-to-many dimension relationships
- Type 2 Slowly Changing Dimensions (SCD)
- A word about storing NULL foreign keys in fact tables - DON’T!
- Conformed (common) dimensions
- What (not) to store in fact tables
Tip 1: Major Components of a Data Warehouse
While you can walk into ten different companies and see ten different database topologies, a very common challenge that project managers face is integrating data from multiple data sources. Whether a database team needs to integrate shipment, spending, budget, and retail data - or patient demographic, diagnosis, claims, and billing data - there will always be the need to cleverly combine data from different sources to produce one clean version of “the historical truth.”
The need arises (in part) from the fact that nearly every organization has a value chain - a progression (chain) of related activities associated with a company’s business processes. Most of these activities generate data, though the collection process might span many disparate systems, with data captured at different levels of granularity, for different time periods, etc. A goal of data warehousing is to bring all of the data “under one umbrella,” after a series of validations, groupings, splits, and other forms of synthesis.
Figure 1 shows a diagram that many developers have probably seen at some point in their careers: a high-level process flow that begins with multiple data sources (often transactional systems) and ends with a data warehouse (composed of a set of data marts that might, in turn, be used to create OLAP cubes for more advanced analytic purposes).
Figure 1: Basic diagram of a data warehouse environment.
Note that developers will rarely transfer data directly from the original OLTP sources to the data warehouse - usually they will load data into a temporary staging area (nightly, weekly, etc.) and perform validations or possibly “re-shape” the data before eventually pumping it into the data warehouse/data marts.
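To make that pattern concrete, here’s a minimal T-SQL sketch of a nightly staging load. All of the object names here (Staging.StageSales, dw.FactSales, dw.DimProduct, dw.DimDate, and the SourceOLTP database) are hypothetical, and a production ETL process would more likely use a tool such as SSIS than a hand-written script.

-- Hypothetical nightly staging load; all object names are illustrative.
DECLARE @LastLoadDate date = DATEADD(DAY, -1, CAST(GETDATE() AS date));

-- Step 1: land the raw extract from the OLTP source in a staging table.
TRUNCATE TABLE Staging.StageSales;

INSERT INTO Staging.StageSales (OrderNumber, ProductSKU, OrderDate, SalesAmount)
SELECT soh.SalesOrderNumber, p.ProductNumber, soh.OrderDate, sod.LineTotal
FROM SourceOLTP.Sales.SalesOrderHeader AS soh
JOIN SourceOLTP.Sales.SalesOrderDetail AS sod
   ON sod.SalesOrderID = soh.SalesOrderID
JOIN SourceOLTP.Production.Product AS p
   ON p.ProductID = sod.ProductID
WHERE soh.OrderDate >= @LastLoadDate;     -- incremental extract

-- Step 2: validate/re-shape, then move only clean rows into the fact table,
-- translating business keys into surrogate dimension keys along the way.
INSERT INTO dw.FactSales (ProductKey, DateKey, SalesAmount)
SELECT dp.ProductKey, dd.DateKey, s.SalesAmount
FROM Staging.StageSales AS s
JOIN dw.DimProduct AS dp ON dp.ProductSKU = s.ProductSKU
JOIN dw.DimDate AS dd ON dd.FullDate = CAST(s.OrderDate AS date)
WHERE s.SalesAmount IS NOT NULL;          -- a basic validation rule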
One of the signature characteristics of most data warehouses is that the data is structured into fact and dimension tables. Fact tables contain the business measures (metrics), which can be anything from revenue to costs to premium payment amounts. Dimension tables provide business context to the fact tables - anything from products to customers to cost centers. Ultimately, business users evaluate the measures “by” the different related business dimensions.
These fact and dimension tables are usually organized in a de-normalized (star-schema) form. This is often culture shock to long-term OLTP developers who are used to databases in third-normal form. The thing to remember is that in an OLTP system, normalization stores data as efficiently and consistently as possible; in a data warehouse, the goal is to retrieve data as efficiently as possible.
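As a hypothetical illustration of that flattening, here’s what a denormalized product dimension might look like. Attributes that would occupy three or four separate tables in third-normal form all live in one wide dimension row (again, the names are purely illustrative):

-- Illustrative only: a denormalized (flattened) product dimension.
-- In third-normal form, subcategory, category, and brand would each be
-- separate tables; in a star schema they collapse into one wide row.
CREATE TABLE dw.DimProduct
(
   ProductKey      int IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
   ProductSKU      nvarchar(25) NOT NULL,          -- business key
   ProductName     nvarchar(50) NOT NULL,
   Color           nvarchar(15) NULL,
   ListPrice       money        NOT NULL,
   SubcategoryName nvarchar(50) NOT NULL,          -- denormalized attribute
   CategoryName    nvarchar(50) NOT NULL,          -- denormalized attribute
   BrandName       nvarchar(50) NOT NULL           -- denormalized attribute
);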
Figure 2 shows a simple but meaningful star-schema (fact-dimension) model using some of the AdventureWorks data. The fact and dimension tables are usually related through surrogate integer keys. Each dimension table contains a surrogate primary key for each business entity value, along with a business key and all of the attributes that further describe the entity value (for instance, a specific product has a SKU number, a color, a list price, and belongs to a particular subcategory, category, brand, etc.). The fact tables contain foreign key references to the dimension values, along with the measures. End users will often “roll up” the measures by any related dimension attribute.
Figure 2: A basic data model using the Microsoft AdventureWorks demo database.
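As a final sketch of the “roll up” idea, here’s a typical star-join query against the hypothetical dw tables above: sum the revenue measure and group it by whichever dimension attributes the user cares about (a DimDate table with a CalendarYear column is assumed here):

-- Roll up revenue by product category and calendar year
-- (continuing the illustrative dw.* tables sketched earlier).
SELECT dp.CategoryName,
      dd.CalendarYear,
      SUM(f.SalesAmount) AS TotalRevenue
FROM dw.FactSales AS f
JOIN dw.DimProduct AS dp ON dp.ProductKey = f.ProductKey
JOIN dw.DimDate AS dd ON dd.DateKey = f.DateKey
GROUP BY dp.CategoryName, dd.CalendarYear
ORDER BY dp.CategoryName, dd.CalendarYear;

Because the dimension is flattened, the query needs only one join per dimension - no chain of subcategory/category lookups - which is exactly the retrieval efficiency the star schema is designed for.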
By: Kevin S. Goff
Kevin S. Goff, a Microsoft SQL Server MVP, has been writing “The Baker’s Dozen” productivity series in CoDe Magazine since 2004. Kevin hosts a weekly webcast on SQL Server and Business Intelligence topics, at http://www.BDBIRadio.com.
Kevin has been an applications developer, architect, and technology mentor since 1987. Kevin has worked in many lines of business, including insurance, manufacturing, health care, consumer packaged goods, accounting and finance, advertising, and many others. Along the way, Kevin has won several awards in both the public and private sector for applications development. Additionally, Kevin has taught SQL Server, Business Intelligence, and Data Warehousing classes to hundreds of students over a period of five years and has built numerous custom courseware modules along the way.
Kevin has authored one book on reporting development and contributed chapters on MDX programming to a second book. He is currently developing a set of commercial video training courses for SQL Server and BI topics. He has been a SQL Server MVP since 2010 and was previously a .NET/C# MVP from 2005 to 2009.
Kevin is a frequent speaker at User Group and other community events (SQL Saturday, SharePoint Saturday, Code Camp) in the Mid-Atlantic region, and speaks occasionally at conferences. In 2012, Kevin worked with Microsoft TechNet Radio on a 13-week webcast series on new features in SQL Server 2012.
For more information, check out Kevin’s main site at http://www.KevinSGoff.net, or email him at firstname.lastname@example.org