Data Warehouse Series - Slowly Changing Dimensions

When extracting data from an OLTP business database into a DW (data warehouse), especially during the incremental extractions that follow the first full import, we often run into this question: some data in the business database has changed, so should those changes also be reflected in the data warehouse? Which data in the data warehouse should change, and which should not? And given these changes, how should the dimension tables in the data warehouse be designed to meet those needs?

Changes to data in a business database are natural and normal. A customer's contact information, such as a phone number, may change when the customer moves, and commodity prices rise and fall over time. In the business database these values are simply modified, and the change takes effect in the actual business immediately. Data in a data warehouse has different characteristics: first, it is static historical data; second, it is rarely changed and not deleted; third, it grows on a regular schedule. Its main purpose is data analysis. Analysis therefore places requirements on historical data: some data must reflect how it changed over time, while for other data the history is not needed. The question is how to control which is which.

Suppose a batch of data has been loaded from the business database into the data warehouse for the first time, and the business database contains a customer record like this one.

The customer BIWORK lives in Beijing and is currently a BI development engineer. Assume that BIWORK moves from Beijing to Sanya because of Beijing's air quality (PM2.5) and other reasons. This information is then updated in the business database.

 

When this information is next extracted from the business database, how should the data warehouse handle it? Suppose the synchronization between the business database and the data warehouse simply applies the update directly to the existing row in the data warehouse. If we then build reports with some simple statistical analysis, all of BIWORK's sales in the data warehouse point to BIWORK's new location, the city of Sanya, even though all of BIWORK's earlier purchases actually took place while BIWORK was living in Beijing. This is a very simple example, but it shows how a change to basic information can lead to wrong results in data summaries and analysis. And this scenario really does come up in practice.

To solve problems like this, we need to understand a very important concept in data warehousing - the Slowly Changing Dimension (SCD).

Slowly Changing Dimension Type 1 (Type 1 SCD)

In the data warehouse, we can keep the data always consistent with the business data. For the Customer dimension, the Business Key from the business database - CustomerID - is used to track changes in the business data; once a change occurs, the old data is overwritten.

The DW finds the record by the CustomerID from the business database, takes the latest City value, and updates it directly in the DW.
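As a rough illustration of the overwrite described above, here is a minimal Python sketch. The in-memory list standing in for the dimension table, the helper name apply_type1, and the CustomerID value 1001 are all assumptions made for the example; a real load would do this with an UPDATE in the warehouse.

```python
# Hypothetical in-memory stand-in for the DW Customer dimension table.
dim_customer = [
    {"CustomerID": 1001, "Name": "BIWORK", "City": "Beijing"},
]


def apply_type1(dim_rows, source_row, attribute):
    """Type 1: overwrite `attribute` on every row matching the Business Key."""
    changed = 0
    for row in dim_rows:
        same_key = row["CustomerID"] == source_row["CustomerID"]
        if same_key and row[attribute] != source_row[attribute]:
            row[attribute] = source_row[attribute]  # the old value is lost
            changed += 1
    return changed


# Row as it now looks in the business database (assumed sample data).
source_row = {"CustomerID": 1001, "Name": "BIWORK", "City": "Sanya"}
updated = apply_type1(dim_customer, source_row, "City")
```

After the call there is still exactly one row for BIWORK, and nothing in the dimension records that the city was ever Beijing.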

Slowly Changing Dimension Type 2 (Type 2 SCD)

However, a data warehouse mostly aggregates and analyzes relatively static historical data, so it should preserve historical data from the business systems as much as possible and capture how that data really changed. In the example above, the analysis might need to show that BIWORK's purchase amounts were stable overall in 2012 but dropped starting in 2013, and that the reason may be related to the city: Beijing may have relatively more stores than Sanya. In a case like this, we should not simply update BIWORK's current city in the data warehouse; instead we should add a new row recording that BIWORK now lives in Sanya.

But simply adding a new row to the DW creates a new problem. Customers in the DW are identified by CustomerID, which comes from the business database, where it is unique. Once the DW adds rows to preserve the history of a business record, CustomerID is no longer guaranteed to be unique in the DW, and other DW tables related to this table can no longer tell which Customer row to reference. In fact, if CustomerID is also the primary key of the Customer table in the DW, inserting the new row will fail.

So we keep the Business Key, because it is the only link back to the business database, and add a new key that belongs to the data warehouse itself. In data warehouse terminology, the key that uniquely identifies a row in a warehouse table is called the Surrogate Key, and it is usually made the primary key of the DW table.

In this table:

CustomerID - the Business Key. It links the business database and the data warehouse. Note that it should never change, in either the business database or the data warehouse.

DWID - the Surrogate Key. It is usually the primary key of the DW dimension table and is used to join dimension tables and fact tables inside the data warehouse.
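To make the role of the two keys concrete, here is a small Python sketch. Everything in it is assumed for illustration: the in-memory lists, the helper names insert_version and sales_by_city, the CustomerID 1001, and the sales amounts. It only shows that fact rows joined through DWID keep pointing at the city that was valid when the sale happened.

```python
from itertools import count

_next_dwid = count(start=1)  # hypothetical surrogate key generator

dim_customer = []  # stand-in for the DW Customer dimension
fact_sales = []    # stand-in for a fact table keyed by DWID


def insert_version(customer_id, name, city):
    """Add a new dimension row; CustomerID may repeat, DWID never does."""
    row = {"DWID": next(_next_dwid), "CustomerID": customer_id,
           "Name": name, "City": city}
    dim_customer.append(row)
    return row["DWID"]


def sales_by_city(facts, dim):
    """Join facts to the dimension on the surrogate key and sum per city."""
    city_of = {row["DWID"]: row["City"] for row in dim}
    totals = {}
    for fact in facts:
        city = city_of[fact["DWID"]]
        totals[city] = totals.get(city, 0.0) + fact["Amount"]
    return totals


beijing_dwid = insert_version(1001, "BIWORK", "Beijing")
fact_sales.append({"DWID": beijing_dwid, "Amount": 120.0})  # bought in Beijing

sanya_dwid = insert_version(1001, "BIWORK", "Sanya")        # same Business Key
fact_sales.append({"DWID": sanya_dwid, "Amount": 80.0})     # bought in Sanya

totals = sales_by_city(fact_sales, dim_customer)
```

Both dimension rows share CustomerID 1001, yet the earlier sale stays attributed to Beijing because the fact table references DWID rather than CustomerID.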

Why use a surrogate key, and what are the benefits?

  • Suppose our business data comes from several different database systems. When that data is integrated, the same Business Key may appear more than once, and a Surrogate Key solves this problem.
  • A Business Key from the business database may be a long field, such as a GUID or a long identifying string, whereas a Surrogate Key can simply be an integer. Fact tables are large to begin with, so joining on a Surrogate Key is more efficient than joining on a Business Key and also reduces the size of the fact table.
  • Most importantly, as in the example above, a Surrogate Key handles slowly changing dimensions well and lets us keep historical records.

When can we do without a surrogate key? I think this depends on the actual business. For some business tables the Business Key is already an integer, and the table's attributes essentially do not change over time or by region. Country names, area codes, and similar codes rarely change, and even when they do there is no need to keep the history. In such cases the Business Key from the business database can be used directly, with no need for a new Surrogate Key.

With the table structure above, adding a Surrogate Key - DWID - is still not enough, because the data warehouse also needs to know which row is the one currently in effect. We could, of course, find the latest row by DWID order, but comparing CustomerID and then looking for the maximum DWID in every query is too much trouble.
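For reference, the "maximum DWID per CustomerID" lookup mentioned above could look roughly like the following Python sketch. The sample rows (including a second customer, 1002, living in Shanghai) and the helper name latest_versions are assumptions for illustration.

```python
# Assumed sample dimension rows; DWID grows as versions are added.
dim_customer = [
    {"DWID": 1, "CustomerID": 1001, "City": "Beijing"},
    {"DWID": 2, "CustomerID": 1002, "City": "Shanghai"},  # hypothetical customer
    {"DWID": 3, "CustomerID": 1001, "City": "Sanya"},
]


def latest_versions(dim_rows):
    """Return, per CustomerID, the row with the largest DWID."""
    latest = {}
    for row in dim_rows:
        best = latest.get(row["CustomerID"])
        if best is None or row["DWID"] > best["DWID"]:
            latest[row["CustomerID"]] = row
    return latest


latest = latest_versions(dim_customer)
```

Every query that needs the current row has to repeat this grouping step, which is the overhead the paragraph above complains about.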

So we can add a flag column that marks which row is the latest.

Another way is to mark rows with start and end times: the row whose Valid To is NULL is the current one.

Of course, the two approaches can also be combined.
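A minimal Python sketch of a Type 2 change that maintains both markers might look like this. The column names ValidFrom, ValidTo and IsCurrent, the helper apply_type2, the CustomerID 1001, and the two dates are assumptions chosen for the example, with None playing the role of NULL.

```python
from datetime import date

# Assumed starting state: one current row for BIWORK in Beijing.
dim_customer = [
    {"DWID": 1, "CustomerID": 1001, "City": "Beijing",
     "ValidFrom": date(2012, 1, 1), "ValidTo": None, "IsCurrent": True},
]


def apply_type2(dim_rows, customer_id, new_city, change_date):
    """Type 2: close the current row and append a new current row."""
    current = next(
        (r for r in dim_rows
         if r["CustomerID"] == customer_id and r["IsCurrent"]),
        None,
    )
    if current is not None:
        if current["City"] == new_city:
            return current["DWID"]        # nothing changed, keep the row
        current["ValidTo"] = change_date  # close the old version
        current["IsCurrent"] = False
    new_dwid = max((r["DWID"] for r in dim_rows), default=0) + 1
    dim_rows.append({
        "DWID": new_dwid, "CustomerID": customer_id, "City": new_city,
        "ValidFrom": change_date, "ValidTo": None, "IsCurrent": True,
    })
    return new_dwid


new_dwid = apply_type2(dim_customer, 1001, "Sanya", date(2013, 1, 1))
current_rows = [r for r in dim_customer if r["IsCurrent"]]
```

With either marker the current row can be found with a simple filter, and the closed row still records when BIWORK lived in Beijing.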

Another case is a mixture of Type 1 and Type 2. Say the Occupation field changes in the business database, but its history does not need to be kept; then the latest Occupation value can simply overwrite the old one in the data warehouse.

Depending on the actual situation, another practice is to overwrite the value in all rows.
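One possible way to express such a mixture is a per-column policy, sketched below in Python. The SCD_POLICY mapping, the helper apply_changes, the CustomerID 1001, and the new occupation "BI Architect" are all assumptions for illustration. The sketch takes the "overwrite all rows" variant for Type 1 columns and assumes the customer already has a current row.

```python
# Hypothetical policy: City keeps history, Occupation is overwritten.
SCD_POLICY = {"City": "type2", "Occupation": "type1"}

dim_customer = [
    {"DWID": 1, "CustomerID": 1001, "City": "Beijing",
     "Occupation": "BI Developer", "IsCurrent": True},
]


def apply_changes(dim_rows, source_row):
    """Apply Type 1 or Type 2 handling per column, based on SCD_POLICY."""
    versions = [r for r in dim_rows
                if r["CustomerID"] == source_row["CustomerID"]]
    current = next(r for r in versions if r["IsCurrent"])  # assumed to exist
    changed = [c for c in SCD_POLICY if current[c] != source_row[c]]

    # Type 1 columns: overwrite on every version of this customer.
    for col in changed:
        if SCD_POLICY[col] == "type1":
            for row in versions:
                row[col] = source_row[col]

    # Type 2 columns: retire the current row and append a new version.
    type2_cols = [c for c in changed if SCD_POLICY[c] == "type2"]
    if type2_cols:
        current["IsCurrent"] = False
        new_row = dict(current,
                       DWID=max(r["DWID"] for r in dim_rows) + 1,
                       IsCurrent=True)
        for col in type2_cols:
            new_row[col] = source_row[col]
        dim_rows.append(new_row)


apply_changes(dim_customer, {"CustomerID": 1001, "City": "Sanya",
                             "Occupation": "BI Architect"})
```

After the call the Beijing row is kept as history for City, while Occupation reads "BI Architect" on both rows, so no trace of the old occupation remains.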

Slowly Changing Dimension Type 3 (Type 3 SCD)

Type 1 and Type 2 do meet the majority of needs, but there are other solutions as well, such as Type 3 SCD. Type 3 SCD is for cases where only a small amount of history needs to be kept.

For example, add a history column for the field to be tracked, and on each change update both the Current column and the Previous column. This way only the two most recent values are saved. If more fields need to be tracked, it becomes more troublesome, because each one needs its own Current and Previous columns. For this reason Type 3 SCD is less commonly used than Type 1 and Type 2.
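The Current/Previous pair can be sketched in a few lines of Python. The column names CurrentCity and PreviousCity, the helper apply_type3, the CustomerID 1001, and the second move to Shanghai are assumptions added to show what gets lost.

```python
# Assumed single dimension row with a Current/Previous column pair.
dim_customer = {"CustomerID": 1001, "CurrentCity": "Beijing",
                "PreviousCity": None}


def apply_type3(row, new_city):
    """Type 3: shift the current value into the previous column."""
    if row["CurrentCity"] != new_city:
        row["PreviousCity"] = row["CurrentCity"]
        row["CurrentCity"] = new_city
    return row


apply_type3(dim_customer, "Sanya")     # Previous = Beijing, Current = Sanya
apply_type3(dim_customer, "Shanghai")  # hypothetical second move
```

After the second move the row holds Shanghai and Sanya only; Beijing has dropped out of the history, which is the limitation described above.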

 

Summary

  • Type 1 SCD - history is not recorded. Choose Type 1 when no historical data needs to be kept at all. For example, if the name of a country changes, this kind of data basically needs no history, so Type 1 SCD simply overwrites the old country name.
  • Type 2 SCD - add a new row. This is the most commonly used type; in essentially every case that Type 1 SCD does not cover, Type 2 SCD is considered first.
  • Type 3 SCD - add history columns. It does not track the full history, only the previous value. It is usually considered for cases that fall between Type 1 and Type 2, where some history must be recorded but not that much.

PS

Different tools implement SCD in different ways. For example, the SCD component in Microsoft SSIS distinguishes the following:

  • Type 0 - Fixed Attribute: an attribute that does not change.
  • Type 1 - Changing Attribute: a changing attribute whose data is overwritten.
  • Type 2 - Historical Attribute: an attribute whose history is kept.

So the three basic SCD types I have introduced here differ somewhat, in concept and in implementation, from the types a given tool offers. We should not confuse the two; the focus should be on the ideas and solutions behind a specific implementation.

Reprinted from: https://www.cnblogs.com/biwork/p/3363749.html

Origin www.cnblogs.com/guohu/p/11516973.html