MS BI Blog: SQL:Data warehousing :An Overview

DATA-WAREHOUSING

What is Data-warehousing?

A Data warehouse is the collection of the historic data in a multidimensional space for a business or an application or an organization which aids the Business Analyst of an organization in Decision making. With the increase in competition among the companies in any form of business, a platform for data analysis has become very essential in key decision making. A Data warehouse holds the historic information separately from the system; it provides the key business information which are extracted and cleansed from the system database using several methods of ETL(Extract, Transact and Load).In the last couple of decades the data-warehousing has evolved significantly due to millions of dollars invested by several companies around the globe.

As defined by William H. Inmon “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process”

The data available in the data warehouse is classified according to the subjects compared to the transaction data available in the system. The data-warehouse normally tend to have n number of sources like relational database, excel or text files or sometimes even another data warehouse. The data from these multiple sources are integrated in the data-warehouse after performing several data cleansing, filtering, processing techniques in the ETL stage. The data is normally presented according to the time scale which is the most important dimension. The data warehouse is separated from the actual database where the daily transactions take place. The main motivation for doing so is to have better performance in both which serves different purpose. The data warehouse is initially loaded with the data on its creation and from then onwards it is periodically updated with the latest data from its source database. The data warehouse has no relation with the transactions and its purely for the data analysis. More often the Data warehouse may not contain the most up to date information.

The organization’s normal day to day information like user registration, bank transactions, and products sold, products bought etc are stored in the database which is referred as on-line transaction processing (OLTP) systems. This is the primary source for the data-warehouse which is also called as the on-line analytical processing (OLAP) systems. There are several distinguishing features between them. The OLTP is used by several thousand compared to OLAP which is used only the executives for decision making. The OLTP stores the current information where transaction takes place where as the OLAP stores the historic information. The OLTP is designed using ER modeling where as the OLAP follows either the snowflake or star schema. The size of the OLAP database is much bigger compared to the OLAP.

OLTP VS OLAP

	OLTP	OLAP
Purpose	Data Transaction	Data Analysis
Type of Access	Read/Write	Read(mostly)
Design Type	Entity Relationship model	Star/Snowflake schema
No of Users	1000s	100s
User Type	Software Professional, Admin	Analysts and Executives
Data Type	Flat data	Multidimensional data
Functionality	Day to day transaction data	Historic data
Database size	Less than 1 GB	More than 100 GB
Importance to	Quick transaction of data	Response time for data retrieval
Data is	Non summarized	summarized

CUBE: A MULTIDIMENSIONAL DATA MODEL

The data-warehouse stores the data in a multidimensional model, in the form of Cubes. The data in the cubes are defined my dimensions and facts.

Dimensions are the entities which define the data. The user would like to analyze the data according to the dimensions. Example time

Facts are actual numerical values or records that are mapped to the dimensions. They are also referred as measures. Example revenue

Let us consider a simple example of an Automobile Company which produces three different types of cars SUV, mid range cars and luxury cars in the regions Asia, North America and Europe. In this example time, car type, region are dimensions and revenue, no of cars sold are the measures (facts).Each dimension has a dimension table containing the properties of dimension like dimCarType will have the fields cartypeid, cartype, description, enginepower etc. The fact table holds the actual data including the measures and derived measures which are mapped to the different dimensions.

The following table shows the 2 dimensional view of the revenue data of all the car types of the automobile manufacturer for the region Asia and for the months April to June.

Region = “Asia”
Time	Car Type
Time	SUV	Mid Range	Luxury
April	34.1	56	12
May	23	46	10.5
June	42	45	9
July	40	43	8.2

*Revenue in millions of U.S Dollars

Now let us consider a 3 dimensional view of the revenue data of all the car types of the automobile manufacturer for all the regions and for the months April to June.

	Region = “Asia”			Region= “Europe”			Region= “North America”
Time	SUV	Mid Range	Luxury	SUV	Mid Range	Luxury	SUV	Mid Range	Luxury
April	34.1	56	12	30.1	40	9	30.1	57	10
May	23	46	10.5	20	44	8	32	54	9.5
June	42	45	9	36	41	9.1	31	41	5
July	40	43	8.2	39	40	8.9	33	40.4	6.2

*Revenue in millions of U.S Dollars

It is normal tendency to think cube as a 3 dimensional structure, the cube referred in data warehousing is n-dimensional. We can also have n number of dimensions, a 4 dimensional cube is a series of 3D cubes, and a 5 dimensional cube is a series of 4D cubes and so on.

n Dimensional Data has a series of (n-1) Dimensional Cubes

A 3 Dimensional data cube of the revenue data for the dimensions time (months), region, and car type.

SCHEMAS OF MULTIDIMENSIONAL DATABASE

For OLTP database the schema is generally an Entity Relationship model i.e the schema contains the entities whose relationships are mapped. For a Data-warehouse which is a multidimensional database we have the star schema, snowflake schema and the fact constellation schema.

STAR SCHEMA

Star schema is the most commonly used schema for modeling the multidimensional database. The Star schema has a central fact table which is the primary table holding all the numeric values (measures) and the various dimension tables are mapped to the Fact table. The name “Star Schema” is because the dimension tables surround the central fact table making it appear like a star burst.

Following is the star schema for the Data warehouse of automobile manufacturing company

SNOWFLAKE SCHEMA

Snow flake is a derivation of the star schema. Even here the dimension tables are mapped to centre fact table, the only difference is that the dimension tables are further normalized in snowflake schema.

Snowflake schema for the Data warehouse of automobile manufacturing company

FACT CONSTELLATION

In the data-warehouse more often there is a requirement for having two different fact tables consisting two different categories of data. In that case the fact constellation schema allows the fact tables to share the dimension tables among them. The prevents the duplicity and the repetition of the dimension tables

In the above schema diagram there are two fact tables FactRevenue and FactEmployeeSalary which are mapped to the dimensions. Region is a common dimension between both the fact tables.

A Data warehouse normally concentrates on multiple subjects across the organization, where as a data mart is a subset of a data warehouse which concentrates on one particular subject. Therefore fact constellation schema is very common in modeling a data warehouse for interrelated fact tables each concentrating on one subject. The Star schema and snowflake schema is commonly used in modeling data marts where there is mostly one subject.

MEASURES

Measures are the numerical values that are stored in the fact table which are mapped against the different dimensions Measures can be broadly classified as distributive, algebraic and holistic.

Distributive Measure is computed by a distributive aggregate function. Example sum(),count(),max()
Algebraic Measure is computed by a algebraic aggregate function. Example standard deviation, average
Holistic Measure is something which cannot be calculated by an algebraic function. Example Rank

CONCEPT HIERARCHIES

Concept hierarchy is the mapping of low level hierarchies to the higher level hierarchies in the dimension table. In time dimension the low level hierarchy day can be mapped to the immediate higher level hierarchy week. Similarly the week can be mapped to the immediate higher level hierarchy month and so on. The concept hierarchy for the time dimension will be day

The conceptual hierarchy provides the facility to the user to view the data in various perspectives.

OLAP OPERATIONS

ROLL UP

The roll up operation aggregates the data in an increasing order in the conceptual hierarchy from a lower level to higher level. In the example mentioned before the data is aggregated by week and then the data is aggregated according to its higher level in hierarchy month and so on.

DRILL DOWN

Drill down is the opposite of roll up. The data is aggregated in decreasing order in the conceptual hierarchy from a higher level to lower level. In the time dimension the data is grouped by ‘ week’ and then the next grouping is in the lower level ‘day’.

SLICE AND DICE

Slice operation is choosing one particular dimension in the n dimensional cube which results in a sub cube where as dice operation is done by choosing two or more dimensions in the cube to result in a sub cube.

PIVOT

This is also referred as rotate where the axis is rotated to provide a different presentation of the data in the n dimensional cube.

Data warehouse Design

Designing an effective data warehouse for an organizations needs and business is highly challenging. There are a lot of intricacies involved in designing the data warehouse. There are different views which need to be considered before initiating the design of the data warehouse.

The four different types of views that needs to be considered are

Top-down view

It is the selection of the most relevant information on which the data warehouse needs to be built. The information should serve the current requirements and the future trend as well.

Data Source view

It is the information that is available in the data sources for the data warehouse. This information should be documented.

Data warehouse view

The finalized fact table and the dimension table for the data warehouse and almost any information that is stored in the data warehouse. The derived measures which are calculated from the direct measures are also included.

Business Query view

It indicates the view of the end users. The final output what the end users would like to see from the data warehouse.

DATA WAREHOUSE ARCHITECTURE

The data warehouse is normal designed as three tier architecture.

The lower most tier is a relational database where the data from various servers are populated after several data cleansing, transformations which are done by the ETL tools. In the relational database the data is prepared in such a way that all the relevant information can be easily populated in the data warehouse. This lower tier database is also referred as a staging database.

The middle tier is the multidimensional OLAP cube where the data is populated in the form of facts and measures from the relational database. The data from the lower layer to middle layer does not undergo much transformation. The data is only organized into formulated dimensions and measures.

The top most tier is the front end for the analyst and the executives who would view the data from the cube. It contains tools which aid in analysis and data mining.

DATA MINING IN DATA WAREHOUSE

The data warehouse is used in organization from several industrial sectors in making important strategic decisions by performing data analysis. Over the years the data warehouse has evolved a lot due to constant change in the purpose it being used for. Initially it was used for generating predefined reports later on it was used for analysis of the detailed data and then gradually for multidimensional analysis using OLAP tools. Now the data warehouse may also be used for data mining. Based on the purpose the data warehouse tools may be categorized as

Data access tools
Data reporting tools
Data analysis tools
Data mining tools

Metadata which is a part of the data mart provides information of the data that is available in the data warehouse. Metadata is the data about the data. It helps the business people understand the data warehouse.

The data mining tools help to find the hidden patterns in the data available in the data warehouse and in turn aids in data prediction. On line analytical processing is similar to data mining, however it does not cover the wide range of data mining. The OLAP operations such as drill down, roll up, slice and dice can derive the information from the data warehouse, but Information processing is not exactly data mining. Data mining helps the executives to understand the customers, to fix rates for their product, to provide offers to compete in the market.

On-line Analytical Mining(OLAM)

OLAM is the integration of On-Line Analytical processing with data mining. It is also known as OLAP mining. In order to effectively perform data mining it is very important to analyze the data in the data warehouse as well. This where OLAM comes into play. OLAP operations like drilling, pivoting, dicing and slicing help the user to analyze the data in various levels of granularities in sub sets of the cube.Hence it provides very good flexibility to the user in order to perform data mining. The OLAM is very similar to OLAP in terms of what it does. The OLAM is used for analytical mining in data cubes where as the OLAP is used for analytical processing in the cubes. The OLAM performs several tasks like

Concept description
Association
Classification
Prediction
Clustering
Time series analysis

The Data mining requires data to be cleansed and transformed in order to extract the patterns and information from them. Since the data available in the data warehouse is consistent and integrated and data can be viewed in several perspective it serves as a great source for the data mining.

MS BI Blog

Sunday, May 22, 2011

SQL:Data warehousing :An Overview

1 comment:

Creating a Meta Data Driven SSIS Solution with Biml