The Temporary Storage - Staging Area

In the previous post we discussed about the Architecture of Datawarehouse, where we saw the definition of Staging Area.

Let's first understand - What is Staging Area?

Few important things to understand in the concept of Staging Area,
a. What is staging area?
b. Is staging area diffrent from actual datawarehouse space,
c. Is the structure of Staging area different from datawarehouse tables,
d. What is the difference in the Data in Staging Area and Datawarehouse?
e. Is Staging Area mandatory in a Datawarehouse Architecture?
f. How and why Staging Area is required?

Now let's understand all these points:

Note: Explained below is a sample scenario where staging area is used/not used.

a. What is Staging Area?
Ans: Staging area is common place where all the data is stored prior to being utilized and loaded into Datawarehouse. 

Staging area stores the data in tables. The count and the structure of tables in a staging area depends on the business requirements.

b. Is staging area different from Actual datawarehouse space, AND
c. Is the structure of Staging area different from datawarehouse tables,
Ans: Yes, the staging area structure can be different from the Datawarehouse structure, or it can be same also. It purely depends on the business requirements.

d. What is the difference in the Data in Staging Area and Datawarehouse?
Ans: The data in Staging area and datawarehouse is also different, as in the data in the staging area is transactional data, where as the data in datawarehouse is summarized and analytical.

e. Is Staging Area mandatory in a Datawarehouse Architecture?
Ans: The staging area is optional, as in it can be present in an Datawarehouse Architecture or may not be present. It again depends on the amount of data generated and the requirement of the transactional data. The staging area will be created if a particular company needs to utilize the transactional data kept in Staging area.

f. How and why Staging Area is required?
Let's take a scenario:

Consider a example of a small telecom company. which started its operations in year 2000.

  • Name: XYZ
  • Number of customers: 100000
  • Average Transaction per day per customer: 100
  • Total years for which data needs to be stored in datawarehouse: 20 Years


Taking above details in consideration, let's see how many records are being generated for the Telecom Company:

100000(Total Customers) * 100 (Average transactions/day) * 30 (per month) * 12 (per year) * 20 (total number of years)

i.e, 100000 * 100 * 30 * 12 * 20  =  72000000000 (A big number)

Now, it takes a huge amount of space to store this amount of data. Ideally a company wish to store all the data they generate, but to store all these records they will require a huge space which in-turn will require lots of investment in purchasing the space. So there comes the compromise.

Company needs to compromise as in the data which they wish to store. There can be various types of compromise which can be done. Few of them listed below:

1. Reduce the total number of years from 20 to may be 10. So I make my data half. - Which is not a good option.
2. Summarization the total transactions (i.e, 100 in this case), based on the analysis to be done in future. - A better option. 


--------------------
Now how the Summarization can be done?
Consider we are in year 2001, that time also on an average the number of transaction generated per consumer is 100. 
Out of these 100 transaction, consider one of the transaction:
Date: 01-01-2001.
Time: 12:00 AM

Now in year 2015, is that individual transaction of any use, or does that make any sense to analyse that particular record in year 2015. NO.

So what can we do, We can summarize the data at a particular level, so as to make that summarized data useful for analysis. This will also help me in reducing the unwanted data to be stored in the Datawarehouse.

I can summarize the total transaction generated per customer per month (i.e, 100 * 30), and make it one record which will also be the total bill generated for the particular customer/month. Now I can store the total billed amount/per month/per customer, which will be useful data for analysis in year 2013 also.

So my data is reduces from :

100000 * 100 * 30 * 12 * 20  =  72000000000

to 

100000 * 1 * 12 * 20 = 24000000

This is just a sample example I took for explaining how summarization can be done. There are various ways in which summarization can be done. 
----------------------

Now let's go back to Staging Area concept.

We just saw how we can do summarization and how can we reduce the unwanted data.
Again going back to above example, 72000000000 are transactional data and 24000000 is summarized data(Analytical data).

Out of 72000000000, 100000 * 100 * 30 = 30,00,00000 are generated per month.

We discussed that we will summarize these many records and store in datawarehouse.. But still do I need to store these 30,00,00000 records for any specific use. YES.
To generate the bill of the customer per month. Using these records only the bill of the customer will be generated. 
This is where staging area comes in picture. we will create a temporary place where we will store these 30,00,00000 records for 1 month. Once the bill is generated, I can summarize the data at the end of one month and load the summarized data into Datawarehouse.

This is a sample scenario where we need the data warehouse. 

So a staging area is a place where we store the transactional data for a small period of time (depending on the business need) and then summarize that transactional data and load the data in a datawarehouse.

This is how we can utilize the staging area for business needs.

So if a company is having enough space to store all the data they generate, they may skip the staging area and directly load the Datawarehouse, which is not usually the scenario.