Slowly Changing Dimension - The SCD

The SCD's - Slowly Changing Dimensions

As the name suggest, this concept talks about how do we wish to manage the changes happening in the Dimension table. This concept is specifically for Dimension table.
The Fact tables are usually continuously changing, as in the data in the Fact table changes continuously as against the data in the Dimension table do not change continuously hence called as slowly changing Dimension.

Let us take an example of Employee Table, to understand this.

Consider an Employee, with Employee_ID 101, Name as 'Steve', who is currently working in India as location.



Now consider the Employee is shifted to Japan. In this case, because the location of the Employee is not very frequently changing, Hence called slowly changing Dimension.
Since the location of the Employee is changed, the updates needs to be reflected in the Dimension table. This is the "Slowly Changing Dimension" problem.


There are in general three ways to solve this type of problem, and they are categorized as follows:
Type 1: The new record replaces the original record. We no longer have the old record available.

Type 2: A new record is added along with the old record. So we have Current as well as Old record.


Type 3: The original record is modified to reflect the change. 

SO taking the above 3 scenarios in consideration, we have:

1. SCD 1(Slowly Changing Dimension 1)

In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept.
So in the above example, as Steve has been shifted to Japan the entry in the Dimension table now looks like,


So we do not have the original record. 

SCD1, only the current data, no Historical data is stored.

Advantages:
1. This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.
2. Less space required, as we are not storing the historical data.

Disadvantages:
1. No history available, so we cannot trace back to see any historical record.

When to use Type 1:
SCD 1 should be used when it is not necessary for the data warehouse to keep track of historical changes.

2. SCD 2(Slowly Changing Dimension 2)

In SCD2, the new information is added as a new record. In other words, new and old record are present as 2 separate records.

So in the above example, as Steve has been shifted to Japan the entry in the Dimension table now looks like,


So we have the original record along with the current record. 

SCD2, both the current data and Historical data is stored.

The problem in the above case is with Employee ID, Employee ID being Primary Key cannot be same. Also Steve being a same person, cannot have 2 Employee ID's.

This problem can be managed by using the surrogate key concept, by adding another column and giving it a unique value. So the data looks like:

This helps in maintaining the proper sequence for the data, also helps in maintaining the versions.
Then the other issues, How to identify the current and historical data. For that we can create another column which indicates which record is current and which one is historical.

All this is done at ETL layer.

Advantages:
1.  We can use both the current as well as historical records for gathering the information.

Disadvantages:
1.  This will cause the size of the table to grow fast. 
2.  This implementation complicates the ETL process.

When to use Type 2:
SCD2 should be used when it is necessary for the data warehouse to track historical changes.

3. SCD 3(Slowly Changing Dimension 3)

In SCD 3, the new information is added as a new column, as against a new record in SCD2. In other words, new and old record are present in record, but in different column.

So in the above example, as Steve has been shifted to Japan the entry in the Dimension table now looks like,


So when building a Datawarehouse with SCD3, we first decide how many versions of a particular entity we wish to maintain, accordingly the columns in the table will be added, and hence in future will be maintained.

So we have the original record along with the current record, stored in a same record, but different columns.

SCD3, Partial history is stored. Partial as in, what if Steve is again relocated to China.


So we lost the data representing India, hence the partial history.


Advantages:
1. This does not significantly increase the size of the table

2. This allows us to keep some part of history.

Disadvantages:
1.  This SCD will not be able to maintain all the historical data. 

2.  This implementation also complicates the ETL process.

When to use Type 3:
SCD 3 should be used when it is not necessary to maintain the complete history of a particular entity, and the company is happy with partial history.

Hope you are clear about the concept of SCD now.

No comments:

Post a Comment