In previous posts, we come across –
What a Data Warehouse is? What a Data Warehouse is not? Why it is necessary? What needs to be considered while loading data into Data Warehouse
Now, from this post onwards we move forward in designing/building a Data Warehouse.
The initial points that we need to consider while building a Data Warehouse are –
*Keep it simple *Fast data access *Inputs to Data Warehouse(from where Data Warehouse got it's data)
But, how can you achieve this? – Does the normal Relational (two-dimensional) model is suitable for this kind of environment? Answer is NO. Of course this works well with Transactional systems, where Normalized tables are used to reduce the redundancy and require lots of table joins so as to retrieve the necessary data. Analytical queries against the Normalized tables will be very slow.
So, a Normal two-dimensional (rows, columns) way of storing data doesn’t work for a Data Warehouse. This means, data should be stored in Multi-dimensional form.
Eventough Dimensional model is different from its counterpart i.e. Relational model; it too consists of tables, rows, columns etc., and nothing more than that. The only difference is, it stores the data in different kinds of tables like Fact table, Dimension table etc., and reference each other just like in Relational model. This is the Physical form(model), that users use to access data warehouse data. But for building/modelling a data warehouse we should not just limit to final organization of the data. But as mentioned earlier we also need to consider from where data warehouse gets its data from. In normal operational database, data is directly fed into database by means of User Inputs. Whereas for Data Warehouse it is not the case, data is fed from various sources like Relational databases, non-relational databases, Flat files etc., Make note that we cannot directly load from source database to data warehouse database, as their designs are fundamentally different. So, a mapping is necessary between Source databases and data warehouse database.
What you infer from the above is that – A complete Data Warehouse Process is divided into 3 stages –
In the Extraction phase, it is required to identify from where and from what sources Data Warehouse gets its data. Once identified the sources, it is necessary to analyze suitability of the data with the target(i.e Data Warehouse). If the Source and the target are in the same format(design), then extraction will be straight-forward (a simple Export will do). Otherwise, additional analysis like Data Profiling (process of checking the suitability of the data, whether it meets expected pattern or not.)is used before being loaded into the data warehouse. In the Transformation Phase, necessary modifications needs to be performed on the extracted Source Data, so as to meet the business and technical needs of the target database. This phase also used as STAGING AREA for cleansing and transform of the extracted data. In the Loading Phase, once the data that is extracted and transformed into required format, it is directly loaded into the Data Warehouse.
During the entire Data Warehouse process, we will be taking numerous decisions through out all the phases, that will impact the end result. For example, in the first phase, we decide on which business elements are required from the source databases and how they related with each other and the model(ER model) they used will also impact the final organization of data in data warehouse.
So Keep in mind that while designing /building a data warehouse, we should take entire Data Warehouse process into consideration, but not just limit to the Data Warehouse itself.
Thus conclude this here and move forward in the next post with the physical model of data warehouse.