Data Warehouse – Let’s Build it!!! – Part3

Now we are in the process of last step in desinging/building a data warehouse i.e. developing a SCHEMA. There are a number of ways in which data can be represented in Data Warehouse. One way of representing data in Data Warehouse is by means of ‘STAR’ Schema.

STAR SchemaA relational schema whose design represents a multidimensional data model. The star schema consists of one or more fact tables and one or more dimension tables that are related through foreign keys.

It is the Simplest Schema and is more effective for handling simpler queries. It is called so, because it’s design resembles a STAR, with FACT table in middle surrounded by Dimension tables.

So for any kind of Schema(Star or Snowflake) building that is based on ‘DIMENSION’ model, the first thing it requires is to identify FACTS and DIMENSIONs.

So how we gonna determine what a FACT & DIMENSION tables consists of ?

This information can be drawn from the Business people during the Information Gathering. Based on the answers provided by the business people for the following queries, we can determine the contents of FACT & DIMENSION tables –

1. What Busniess is looking for(or Concentrating on) ?
2. What measures/facts they want to track?
3. Based on what CRITERIA they want to track those measures/facts?
4. What level of granularity of the data, the business is looking for?
etc.,

Now let us see how this information can be used to build FACT and DIMENSION tables.

FACT Table is centralized table that is surrounded by one or more Dimension tables. It holds measures/facts and these measures are numeric in nature and additive across dimensions.

Basically a FACT table contains 2 types of columns –

* Facts/measures
* Foreign Keys to Dimension Tables

The Facts/measures in Fact table are grouped together(summarized/aggregated) by grain (level of detail) and stored in the fact table. In normal operations, each day contains several transactions. These details are stored in standard OLTP system. However, DWH might not record this level of details. Instead it could SUMMARIZE or AGGREGATE. This Summarization of data reduces the number of records in FACT table. One of the most important decision in building a star schema is the granularity of the FACT table. The lower the granularity, the more records that will exist in the fact table. This granularity(or frequency) of the data is determined by the lowest level of granularity of each dimension table (i.e.,per day, per week, per month, per year…).

The primary key of a fact table is usually a composite key that is made up of all of its foreign keys. These fields are typically integer values. The fields in FACT table increases as the number of DIMENSIONs increases. Each added dimension can greatly increase the size of the FACT table.

DIMENSION Table contains attributes that describe fact records in the fact table. Dimensions categorize and describe data warehouse facts and measures. A dimension table has a primary key column that uniquely identifies each dimension record (row). The dimension table is associated with a fact table using this key.

It is advised to use Surrogate key as Primary key in Dimension table as against the derived keys from the source tables. A Surrogate key is a non-intelligent, system-generated, numeric values(something like a sequence number) assigned as primary key of a dimension. Using a Surrogate key will simplify the join between a fact table and dimension table. Usually, the join on the single, small numeric value will be faster, when compared to multi-part natural keys that often consist of character values that can add up to a significant amount of space when your fact tables consist of millions or billions of records.

Discussing about Dimensions will not be finished without mentioning ‘SLOWLY CHANGING DIMENSION (SCD). The attributes of a given row might change as time goes by. For example, ‘Marital Status’ of a customer might change from ‘SINGLE’ to ‘MARRIED’ or someother status overtime. In case of normal operational database, it is just get updated by overwriting the old value. But when it comes to Data Warehouse database, it should maintain/track how the information is changed overtime. It follows certain strategies to deal with that kind of situation – This is what we call ‘Slowly Changing Dimension’ and the following are the types –

SCD Type One – In this case – Simply overwrite the old value(s), doing so there will be no history of old values. This is useful when it is not necessary to keep hisotory of those values.
SCD Type Two – In this case – Add a new row containing the new value(s), and distinguish between the rows using Tuple-versioning techniques.
SCD Type Three – This is the mid-way method between Type1 and Type2. Here we don’t add an additional record, but add a new field that contains OLD VALUE.

Before concluding this post, let’s summarize the points for developing a STAR SCHEMA –

* Identify what measures/facts does the business want to track for?
* Identify FACTS & DIMENSION tables
* Identify thier attributes
* Identify the Primary Key (surrogate ?) for Dimension tables.
* Identify the level of Granularity of the FACT table data.
* How to Deal with SCD?

Thus we conclude here and discuss about SNOWFLACKE schema in next post.

This entry was posted in Data Warehouse. Bookmark the permalink.

Leave a comment