Datawarehousing - The definition - Explained in simple terms

In this post we will be talking about the Definition of Data Warehouse.

In the couple of previous posts we have seen some very basic information about datawarehouse. I have tried to portray datawarehouse in simple terms.


Datawarehousing, as the name suggests is a game played with data. The data for today, yesterday, day before yesterday, a week old data, an year old data stretching may be up-to the ages of your grand grand father.

The data warehouse is mainly the term used in the companies to manage the data. The larger the company, the bigger the data, the bigger requirement to manage the data.

The concept of a data warehouse is to have a centralized database that is used to capture information from different parts of the business process. The definition of a data warehouse can be determined by the collection of data and how it is used by the company and individuals that it supports. Data warehousing is the method used by businesses where they can create and maintain information for a company wide view of the data.

Lets discuss about how actually we can define datawarehouse - as given by pioneers of Datawarehousing.


-- A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context.


Other one - a little technical definition...


-- A process of transforming data into information and making it available to users in a timely enough manner to make a difference



Lets understand these points in details:

1. Single and complete: A single/complete store of data???

Here is an example:

Let's say you started a small boutique in Ohio, United States - you sell authentic dresses.

Whatever transactions where done, you are saving those in your personal computer in an excel sheet. Since its just a start-up you do not have many transactions - say 10 transactions per day, and you can use your excel sheet to store the details.
So on an average you are generating 100(transactions/day) * 30 (days/month) * 12 (months/year)
i.e, 10 * 30 * 12 = 3600.

Example of some transaction you saved.

So we have 3600 records in your Excel Sheet per year.

Now suppose after an year you stand good in your business and think your market is growing and you wish to analyse the data and see which brand sells best, or you want to analyse which quarter the sell is at its best etc. 
Since you have all your data in a single personal computer of yours, you can just easily analyse the data and find out all you need and accordingly you can make strategies.

With gods grace and your hard work, you thought of expanding your business and you started a new boutique in Ohio - so now you have 2 boutiques in Ohio. You hired a person to manage the boutique - he takes care of managing the new boutique and storing the data in a personal computer at new shop.


After few days you wanted to analyse the performance of new boutique - How will you do that???

1. You travel to new boutique - check the data in personal computer there and analyse.
2. You call the manager of new boutique with the data to your place - you check the data and do analysis.
3. You ask the data from new boutique over email - and you analyse.

All this is possible because - 

a. You are located in a quite small geographical space of world.
b. Your data is not that huge.

After 2-3 years, you are now expanding your business to other states of US - you are planning to start 10 new boutiques - now the larger once. Father more in next few years, you have shops in 5 different continents of this world.


CAN YOU NOW

1. travel to each boutique in 5 different continents to check the data in personal computers at each boutiques???
2. call the manager of each boutique with the data to your place with the data for doing analysis???
3. Ask for the data from each boutique over email - and do analyse????

ANSWER - NO


Reason???

1. You are now located in quite a larger geographical area.
2. You cannot travel to each location to check the data and do analysis
3. You do not have skills to analyse the data which is HUGE now.

SOLUTION - A DATAWAREHOUSE

You need a centralized location where all your data is stored. Datawarehouse is the solution. You have data across your enterprise stored at a common location, which you can use to analyse with ease.

2. Consistent: What do we mean by consistent in Datawarehouse?


As we discussed in above example, that you have hired managers to manage the store and in turn its the manager's responsibility now to manage the data as well.


Now suppose there is a data entry operator in Japan and other in China. The ways in which they enter the data will be different. See the example of data snapshot again:

                            


In this example, the Date of Purchase is not stored in a common format. now if you wish to do analysis by looking the data, it will be difficult to do so and there are chances that you miss some data for a particular analysis.
Here, sales done in FEBRUARY is represented in different format i.e, 2-Feb-13 and 5/2/2013.
1. There are chances that you miss some entry while you analyse using data stored in excel sheet.
2. You will lose the data if you use SQL to extract the data, because you are not aware of different formats in which data is stored.

SOLUTION - DATAWAREHOUSE.


Datawarehouse stored the data in consistent format. Means, when we store the above data in Datawarehouse, the data will be converted in a common format(example - DD-MM-YYYY) before loading into datawarehouse.


Now you can understand what advantages it will have.

1. You can just easily extract the data.
2. You will not miss any data due to in-consistent data format.

I hope now you understand the definition:

singlecomplete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context.


The most popular definition came from Bill Inmon (Father of Data warehousing), who provided the following:

-- A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.

Subject-Oriented: Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.
Here is an example:

We will illustrate the above example.

The boutique which you started is grand success now. You have around 100 stored across globe now and you are generating huge amount of data. Now there is a need for proper analysis of that data so that you can do the business and make future policies for your stores across different continents.
For that you decided to build a datawarehouse. 
As you are aware datawarehouse is combination of multiple tables. How many table? What type of table? What will be the structure of table?

When you run a business, you know that there are few important aspects of your business i.e. there are few important entities of your business which are must to run your business. For example - 

1. Employees: This is very important entity of your business to run.
2. Product: This is the entity which tell you for what all products you do business.
3. Location: This is the entity which tell you what all locations you are doing your business.
4. Customer: This entity is most important which tell you about all you customers.
    There are many other, we will take only 4 for example.

Now a good Datawarehouse design says - The important entities of your business should be present as a separate table in datawarehouse i.e, you create a EMPLOYEE table, a PRODUCT table, a LOCATION table, a CUSTOMER table.


This is called as SUBJECT ORIENTED - The important aspects/entity of your business should be present as a separate table.


Why this is important:

Since these are important entity of your business, you create a separate table for these entities and you store data accordingly. Now you have four table with data:
1. Employee Table: Contains data for all the employees you have across enterprise.
2. Product Table: Contains details of all the products you do business across globe.
3. Location Table: Contains details of all the location where you do business.
4. Customer Table: Contains details of the customer with whom you do business.

When you wish to extract some data related to a particular Employee, you will go to employee table and get that data, you need not search in random across other tables. This saves your time and resources. Also gives faster result.



Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
This is similar to the explanation we saw for SINGLE source of data in above explanation.


Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, 20 years or even older data from a data warehouse. This contrasts with a transactions system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, where a data warehouse can hold all addresses associated with a customer.

Non-volatile: Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

I hope you are not in a better position to understand about datawarehouse. Atleast the Definition now !!!

In next post we will see more concepts of datawarehouse. 

A Big warehouse - Data Warehouse



Similarity in a Warehouse and Data Warehouse:

If you are a new born baby or heard about datawarehosuing for first time - here are some definitions of Datawarehouse for you.
We will see the definition given by pioneers later in my post. 

a. Datawarehouse is a place where I can store huge amount of data in a systematic manner.

b. Datawarehouse similar Database to is nothing but combination of multiple tables linked to each other through primary key and foreign constraints.
c. Datawarehouse stores data collected over multiple years(Historical).
d. Datawarehouse is used for analyzing the data for decision making purposes of the enterprise.
e. Datawarehouse is a place where we store huge amount of data, from which we can easily extract the data.

Now read this:

Datawarehouse is combination of multiple table linked to each other which stores the historical data in systematic manner which is used for analyzing the data for decision making purposes.

Moving forward - what actually we mean by all these terms.


What do we mean by storing the data in systematic manner and why it is important.

Don't understand anything out of this !!!!

Let's talk about some familiar terms first.


Have you ever got a chance to see any warehouse(not the city retail shop) - The big storage place usually out of city where companies keep there products. And based on the requirements the products are dispatched to retail shops.


Take a look at fig(a) below, a place which is not well arranged and not maintained. Now if I try to go there and get some thing (say - a PENCIL) out of that, it will be really difficult for me.


Fig(a) - A warehouse which is not systematically arranged. 

Disadvantages of this place:

1. I am not sure if I have PENCIL.
2. If I am sure PENCIL is there, I am not sure where it will be
3. I am not sure how many PENCIL's I have?
4. Even if I am aware that I have pencil in that huge stock, I am not sure how will i reach there and take that out..

Now have a look at fig(b) below, a nice place !!!



Fig(b) - A warehouse which is systematically arranged.

Advantages of this place (you all can now easily answer..)

1. I am aware of the place where PENCIL is present.
2. I can easily go there and search for the PENCIL.
3. I will save my time, since I know the place where PENCIL is kept.
4. Since place is properly arranged, I will be aware how many PENCIL's we have stored.

Now relate above figure(a) as your data and fig(b) as datawarehouse.


Only having the data is not enough, having the data systematically arranged is important so that you can use that data for analysis and other purposes in future. - This is where Datawarehouse helps.


You all know every company which works in whatever corner of this world generates Data - some generate huge data(Amazon.com, Wallmart etc) and some generate little data(a new start up). So when we see there are so many advantages of having a Datawarehouse why don't every company have a datawarehouse.


Friends - you might have heard - Every coin has 2 sides. So when Datawarehouse brings in so many advantages, it comes to you with some disadvantages.


To list a few:

1. Heavy cost of building the datawarehouse - you need integration tools, you need reporting tool, You need the storage devices etc.
2. Heavy cost of maintaining the datawarehouse.

The cost is the primary aspect which stops many companies from having a datawarehouse.



In the next post we will see the definition of Datawarehousing.