When extracting data from an OLTP business database into a data warehouse (DW), especially during the incremental extractions that follow the initial load, one question comes up again and again: some data in the business database has changed, so should those changes be reflected in the data warehouse? Which data in the warehouse should change along with the source, and which should not? And how should the dimension tables in the warehouse be designed to handle these changes?
Changes in a business database are natural and normal. A customer's contact details, such as a phone number, may change when the customer moves, and commodity prices rise and fall over time. In the business database these values are simply modified, and the change takes effect in the running business immediately. Data in a data warehouse has different characteristics: first, it is static historical data; second, it is rarely modified and not deleted; third, it grows at regular intervals. Its main purpose is data analysis. Analysis therefore places requirements on historical data: for some data the history of changes must be preserved, and for other data it is not needed. The question is how to control this.
Suppose the first load brings a batch of business data into the data warehouse, and at that time the business database holds the following customer record.
Customer BIWORK lives in Beijing and currently works as a BI development engineer. Now assume that BIWORK moves from Beijing to Sanya because of Beijing's air quality (PM2.5) and other reasons. This information is then updated in the business database.
What should the data warehouse do the next time this record is extracted from the business database? Suppose the synchronization between the business database and the data warehouse simply applies the update to the existing row in the warehouse. If we then build a report with some simple statistics, every sale to BIWORK in the data warehouse points to BIWORK's new location, Sanya, even though all of BIWORK's earlier purchases were made while BIWORK lived in Beijing. This simple example shows how a change to basic descriptive information can distort summaries and analysis. That said, there are scenarios where overwriting in this way is acceptable.
To solve problems like this, we need to understand a very important concept in data warehousing: slowly changing dimensions (SCD).
1. Slowly changing dimension Type 1 (Type 1 SCD)
In the data warehouse, we can keep the warehouse data always consistent with the business data. The Business Key of the Customer dimension, CustomerID, which comes from the business database, is used to track changes in the business data. Whenever a change occurs, the old value is overwritten.
The DW record is matched on the CustomerID from the business database, the latest City value is fetched, and the DW row is updated directly.
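A minimal sketch of the Type 1 overwrite described above. The dimension is modeled as an in-memory dictionary keyed by the business key; the CustomerID value 1001, the column names, and the function name `apply_type1` are assumptions made for illustration and do not come from the original tables.

```python
# Hypothetical in-memory stand-in for the DW customer dimension,
# keyed by the business key (CustomerID). The value 1001 is made up.
dim_customer = {
    1001: {"CustomerID": 1001, "FullName": "BIWORK",
           "City": "Beijing", "Occupation": "BI Engineer"},
}

def apply_type1(dim, source_row):
    """Type 1 SCD: overwrite the row that matches the business key."""
    key = source_row["CustomerID"]
    if key in dim:
        dim[key].update(source_row)   # old values are lost
    else:
        dim[key] = dict(source_row)   # first load of this customer
    return dim[key]

# The next incremental extract delivers the changed record.
apply_type1(dim_customer, {"CustomerID": 1001, "FullName": "BIWORK",
                           "City": "Sanya", "Occupation": "BI Engineer"})
```

After the call there is still exactly one row for the customer, and nothing in the dimension records that the city was ever Beijing.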
2. Slowly changing dimension Type 2 (Type 2 SCD)
A data warehouse, however, mostly serves aggregation and analysis of relatively static historical data, so it should preserve history from the business systems as far as possible and capture changes faithfully. In the example above, the analysis might need to show that BIWORK's purchases were stable throughout 2012 but dropped from 2013 onward, possibly because of the city: there may be more stores in Beijing than in Sanya. In that case we cannot simply update BIWORK's current city in the data warehouse. Instead, we add a new row that records BIWORK's current location as Sanya.
Simply adding a new row to the DW raises a new problem. The DW identifies a customer by CustomerID, which comes from the business database, where it is unique. Once the DW stores historical versions of the business data, CustomerID is no longer unique within the DW, and other DW tables related to this table cannot tell which version of the Customer to reference. In fact, if CustomerID is also the primary key of the Customer table in the DW, inserting the new row fails outright.
We therefore keep the Business Key, because it is the only link back to the business database, and add a new key that belongs to the data warehouse itself. In data warehouse terminology, the key that uniquely identifies a row in a warehouse table is called the Surrogate Key, and it is usually the primary key of the DW table.
In the table above:
- CustomerID - the Business Key. It connects the business database and the data warehouse. Note that the business key should never change, in either the business database or the data warehouse.
- DWID - the Surrogate Key. It is usually the primary key of the DW dimension table and is used to join dimension tables and fact tables inside the data warehouse.
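The difference between the two keys can be illustrated with a small sketch using Python's built-in `sqlite3` module. The table names `DimCustomerNaive` and `DimCustomer`, the column list, and the CustomerID value 1001 are assumptions for illustration; a real warehouse would use its own DBMS and schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Attempt 1: reuse the business key as the primary key of the dimension.
conn.execute("CREATE TABLE DimCustomerNaive ("
             "CustomerID INTEGER PRIMARY KEY, FullName TEXT, City TEXT)")
conn.execute("INSERT INTO DimCustomerNaive VALUES (1001, 'BIWORK', 'Beijing')")
try:
    conn.execute("INSERT INTO DimCustomerNaive VALUES (1001, 'BIWORK', 'Sanya')")
    naive_insert_failed = False
except sqlite3.IntegrityError:
    naive_insert_failed = True   # second version violates the primary key

# Attempt 2: add a surrogate key (DWID) owned by the warehouse and keep
# CustomerID as an ordinary, non-unique column.
conn.execute("CREATE TABLE DimCustomer ("
             "DWID INTEGER PRIMARY KEY AUTOINCREMENT, "
             "CustomerID INTEGER NOT NULL, FullName TEXT, City TEXT)")
conn.executemany(
    "INSERT INTO DimCustomer (CustomerID, FullName, City) VALUES (?, ?, ?)",
    [(1001, "BIWORK", "Beijing"), (1001, "BIWORK", "Sanya")],
)
rows = conn.execute(
    "SELECT DWID, CustomerID, City FROM DimCustomer ORDER BY DWID").fetchall()
```

With the surrogate key, both versions of the customer coexist, and a fact table can reference either one unambiguously by DWID.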
Why use a surrogate key, and what are the benefits?
- If the business data comes from several different database systems, the same Business Key value may appear more than once when the data is integrated. A Surrogate Key resolves this.
- A Business Key from the business database is often a long field, such as a GUID or a long identifying string, whereas a Surrogate Key can simply be an integer. For a large fact table, joining on the Surrogate Key is more efficient than joining on the Business Key, and it also saves space in the fact table.
- Most importantly, as the example above shows, a Surrogate Key makes it much easier to handle slowly changing dimensions and to maintain historical records.
When can you do without a surrogate key? It depends on the actual business. Some business tables already have an integer Business Key and attributes that essentially do not change over time or by region, such as country names and area codes. Such codes rarely change, and even when they do, no history needs to be kept. In that case the Business Key from the business database can be used directly, without a new Surrogate Key.
With the table structure above, adding the Surrogate Key DWID alone is not enough, because the data warehouse also needs to know which row is currently in effect. We could find the latest row by ordering on DWID, but comparing CustomerID and then looking up the maximum DWID on every query is too cumbersome.
So we can add a flag column that marks the latest version of the data.
Another way is to use time columns: a start time plus a Valid To column, where a NULL Valid To identifies the current row.
Of course, the two approaches can also be combined.
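The combined approach (a current-row flag plus validity dates) might look like the following sketch. The column names `IsCurrent`, `ValidFrom`, and `ValidTo`, the function `apply_type2`, the CustomerID value, and the concrete dates are all assumptions for illustration; only City is tracked to keep the example short.

```python
from datetime import date
from itertools import count

_next_dwid = count(1)   # hypothetical surrogate key generator

def apply_type2(dim, source_row, change_date):
    """Type 2 SCD: expire the current row and append a new version."""
    current = next((r for r in dim
                    if r["CustomerID"] == source_row["CustomerID"]
                    and r["IsCurrent"]), None)
    if current is not None:
        if current["City"] == source_row["City"]:
            return current                  # nothing changed, keep the row
        current["IsCurrent"] = False        # close the old version
        current["ValidTo"] = change_date
    new_row = {
        "DWID": next(_next_dwid),
        "CustomerID": source_row["CustomerID"],
        "City": source_row["City"],
        "ValidFrom": change_date,
        "ValidTo": None,                    # None (NULL) marks the current row
        "IsCurrent": True,
    }
    dim.append(new_row)
    return new_row

dim = []
apply_type2(dim, {"CustomerID": 1001, "City": "Beijing"}, date(2012, 1, 1))
apply_type2(dim, {"CustomerID": 1001, "City": "Sanya"}, date(2013, 1, 1))
current_rows = [r for r in dim if r["IsCurrent"]]
```

Finding the row in effect is now a simple filter on `IsCurrent` (or on `ValidTo` being empty) rather than a search for the maximum DWID per CustomerID.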
Another case mixes Type 1 and Type 2. Suppose the Occupation field changes in the business database but its history does not need to be kept. Then the latest Occupation can simply overwrite the value in the data warehouse.
Depending on the actual situation, another practice is to overwrite the value in every historical row of that customer as well.
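A small sketch of the two overwrite variants for a Type 1 attribute inside a Type 2 dimension. The function name `overwrite_occupation`, the `all_versions` parameter, the sample rows, and the new occupation value "BI Architect" are hypothetical and chosen only for illustration.

```python
import copy

# Hypothetical Type 2 dimension with two versions of the same customer.
dim = [
    {"DWID": 1, "CustomerID": 1001, "City": "Beijing",
     "Occupation": "BI Engineer", "IsCurrent": False},
    {"DWID": 2, "CustomerID": 1001, "City": "Sanya",
     "Occupation": "BI Engineer", "IsCurrent": True},
]

def overwrite_occupation(rows, customer_id, occupation, all_versions=False):
    """Overwrite Occupation in place without creating a new version.

    all_versions=False touches only the current row;
    all_versions=True rewrites every historical row of the customer.
    """
    updated = 0
    for row in rows:
        if row["CustomerID"] != customer_id:
            continue
        if all_versions or row["IsCurrent"]:
            row["Occupation"] = occupation
            updated += 1
    return updated

latest_only = copy.deepcopy(dim)
overwrite_occupation(latest_only, 1001, "BI Architect")

every_version = copy.deepcopy(dim)
overwrite_occupation(every_version, 1001, "BI Architect", all_versions=True)
```

In both variants no new row is added, so the City history kept by Type 2 is unaffected.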
3. Slowly changing dimension Type 3 (Type 3 SCD)
Type 1 and Type 2 do cover most needs, but there are other solutions, such as Type 3 SCD. Type 3 SCD is for cases where only a small amount of history needs to be kept.
For example, for each field whose history is tracked, add a pair of columns, Current and Previous, and update both on every change. This keeps only the two most recent values. The more fields you want to track, the more cumbersome it gets, because each one needs its own Current and Previous columns. Type 3 SCD is therefore less common than Type 1 and Type 2.
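A minimal sketch of the Current/Previous column pair for a single tracked field. The column names `CurrentCity` and `PreviousCity`, the function `apply_type3`, and the third city, Shanghai, are assumptions added for illustration.

```python
def apply_type3(row, new_city):
    """Type 3 SCD: shift the current value into the previous column."""
    if new_city != row["CurrentCity"]:
        row["PreviousCity"] = row["CurrentCity"]
        row["CurrentCity"] = new_city
    return row

customer = {"CustomerID": 1001, "CurrentCity": "Beijing", "PreviousCity": None}

apply_type3(customer, "Sanya")      # Previous = Beijing, Current = Sanya
apply_type3(customer, "Shanghai")   # hypothetical second move
```

After the second change the row holds Shanghai and Sanya, and Beijing is gone, which is exactly the limitation described above: only the two most recent values survive.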
Summary
- Type 1 SCD - no history is recorded. Choose Type 1 when no historical data needs to be kept. For example, if a country name in geographic reference data changes, that data essentially needs no history, so Type 1 SCD simply overwrites the old country name.
- Type 2 SCD - a new row is added. This is the most commonly used type; apart from the Type 1 cases, Type 2 SCD is generally the first choice.
- Type 3 SCD - history columns are added. It does not track the full history, only the previous value. It is worth considering when the need falls between Type 1 and Type 2: some history must be recorded, but not much.
Related articles
- For how to implement SCD in SSIS, see the Microsoft BI SSIS series article on three ways to implement a Slowly Changing Dimension in a data warehouse.
PS
Different tools implement SCD differently. For example, the SCD component in Microsoft SSIS is designed around these options:
- Type 0 - Fixed Attribute: the attribute does not change.
- Type 1 - Changing Attribute: the attribute can change, and the data is overwritten.
- Type 2 - Historical Attribute: an attribute whose history is kept.
So the three basic SCD types introduced here differ somewhat, in concept and in concrete implementation, from what a given tool provides. Do not confuse the two; focus on the ideas and the solution behind a specific implementation.
Reprinted from: https://www.cnblogs.com/biwork/p/3363749.html