Data warehouse modeling and ETL: practical techniques

First, the data warehouse architecture

A data warehouse (DW) stores data in a relational database according to a particular schema so that it can be analyzed and presented from multiple angles and multiple dimensions; its data source is the OLTP system. A data warehouse is detailed, integrated, and subject-oriented, and its purpose is to serve the analytical needs of the OLAP system.

The two common data warehouse architecture models are the star schema (Figure II: pic2.bmp) and the snowflake schema (Figure III: pic3.bmp). As the figures show, in a star schema the fact table sits in the middle, surrounded by dimension tables, like a star; in a snowflake schema the fact table is also in the middle, but each dimension table can in turn be joined to further sub-tables, which expresses the dimension hierarchy more explicitly.

Considering both the analytical needs of the OLAP system and the processing efficiency of the ETL: the star schema is highly aggregated and efficient for analysis, while the snowflake schema has a clearer structure that makes it easier to interact with the OLTP system. Therefore, in real projects we usually combine the star schema and the snowflake schema when designing a data warehouse.
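To make this concrete, here is a minimal star schema sketch in SQL for the "beer sales" subject used later in this article; the table and column names (FactSales, DimTime, DimRegion) are illustrative assumptions, not taken from the figures.

    -- A minimal star schema: one fact table in the middle, dimension tables around it.
    CREATE TABLE DimTime (
        TimeKey   INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
        [Date]    DATE,
        [Year]    INT,
        [Quarter] INT,
        [Month]   INT
    );

    CREATE TABLE DimRegion (
        RegionKey  INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key
        RegionName NVARCHAR(50)
    );

    CREATE TABLE FactSales (
        TimeKey     INT NOT NULL REFERENCES DimTime(TimeKey),
        RegionKey   INT NOT NULL REFERENCES DimRegion(RegionKey),
        SalesAmount DECIMAL(18, 2)                -- the measure
    );

A snowflake variant would further split a dimension into sub-tables, as the product dimension sketch later in this article shows.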

Now, let's take a look at the process of building an enterprise-class data warehouse.

Second, the five-step method for building an enterprise-class data warehouse

(A) Determine the subject

This means determining the subject of the data analysis or of the front-end presentation. For example, "beer sales in a particular region during a particular month" is one subject. A subject reflects the relationship between a particular set of analysis angles (dimensions) and the statistical numeric data (measures), and both must be considered together when the subject is determined.

We can picture a subject as a star: the statistical numeric data (the measures) lives in the fact table at the center of the star, and the analysis angles (the dimensions) are the points of the star; we examine the measures through combinations of dimensions. For the subject "beer sales in a particular region during a particular month", we combine the two dimensions of time and region to examine the measure "sales". Different subjects thus come from different subsets of the data warehouse, which we can call data marts. A data mart reflects one aspect of the data warehouse, and several data marts together make up the data warehouse.

(B) Determine the measures

After the subject has been determined, we consider the indicators to be analyzed, such as "sales amount". They are generally numeric data. We either sum this data, count its occurrences, or take distinct counts, maximums, minimums, and so on; such data is called a measure.

Choosing the appropriate measures for the statistical indicators is essential; complex key performance indicators (KPIs) can then be designed and calculated on top of different measures.
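As a small sketch, assuming a source transaction table named SalesOrders with Amount and CustId columns (illustrative names), the typical ways a measure is produced look like this:

    -- Common aggregations that turn numeric source data into measures.
    SELECT
        SUM(Amount)            AS TotalSales,      -- summed measure
        COUNT(*)               AS OrderCount,      -- count of rows
        COUNT(DISTINCT CustId) AS DistinctBuyers,  -- distinct count
        MAX(Amount)            AS LargestOrder,
        MIN(Amount)            AS SmallestOrder
    FROM SalesOrders;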

(C) Determine the granularity of the fact data

After determining the measures, we have to consider how those measures will be aggregated and summarized across the different dimensions. Because a measure may be aggregated to different degrees, we apply the "minimum granularity principle", that is, we set the granularity of the measure to the finest level we may need.

For example, suppose the finest-grained data currently recorded is per second, that is, the database records one transaction per second. If we can confirm that future analysis will only ever need accuracy to the day, then during the ETL process we can aggregate the data by day, and the granularity of the measure in the data warehouse is "day". Conversely, if we cannot confirm whether future analysis will need second-level accuracy, then following the "minimum granularity principle" we must keep every second of data in the fact table, so that "second-level" analysis remains possible in the future.

When we apply the "minimum granularity principle", we do not need to worry about the analysis efficiency problems that huge volumes of data might bring, because when we later build the multidimensional analysis model (the CUBE), the data will be pre-aggregated, which keeps result generation efficient. The issues around building the multidimensional analysis model (CUBE) will be covered in the next article of this column.
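For instance, if "day" has been confirmed as fine-grained enough, the ETL step can roll the second-level source rows up to one row per day; the SourceSales table and its columns below are assumptions for illustration.

    -- Roll second-level source records up to "day" granularity during ETL.
    SELECT
        CAST(TradeTime AS DATE) AS TradeDate,   -- the chosen granularity: one row per day
        Region,
        SUM(Amount)             AS DailyAmount
    INTO   Staging_DailySales                   -- intermediate table in the staging area
    FROM   SourceSales
    GROUP BY CAST(TradeTime AS DATE), Region;

If second-level detail might still be needed later, skip the GROUP BY and keep every source row instead.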

(D) Determine the dimensions

Dimensions are the angles from which we analyze. For example, if we want to analyze by time, by region, or by product, then time, region, and product are the corresponding dimensions. Based on different dimensions we can see how each measure summarizes, and we can also cross-analyze the measures across several dimensions at once.

Here we must first determine the dimension's hierarchy (Hierarchy) and levels (Level) (Figure IV: pic4.bmp). As the figure shows, in the time dimension we can define a hierarchy of the form "Year - Quarter - Month", in which "year", "quarter", and "month" become the hierarchy's three levels; likewise, when creating the product dimension we can define a "product category - product subcategory - product" hierarchy that contains the three levels "product category", "product subcategory", and "product".

 

So, what form do these dimensions that we analyze with actually take inside the data warehouse?

We can store the three levels as three fields in a single table, as with the time dimension; or we can use three tables that hold the product category, product subcategory, and product data separately, as with the product dimension.

 

In addition, it is worth mentioning that we should make full use of surrogate keys when creating dimension tables. A surrogate key is a numeric ID field (for example, the first field of each table in Figure VI) that uniquely identifies each dimension member. More importantly, during aggregation, matching and comparing numeric fields makes JOINs efficient and aggregation easy. Surrogate keys are also important for slowly changing dimensions: when the original data shares the same business primary key, the surrogate key is what distinguishes new data from historical data.
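As a sketch of the two storage options and of surrogate keys: the DimTime table sketched earlier keeps its three levels as fields of a single table, while the product dimension below is snowflaked into three tables, each with a numeric IDENTITY surrogate key; all names are illustrative.

    -- Snowflaked product dimension: one table per level, each with its own surrogate key.
    CREATE TABLE DimProductCategory (
        CategoryKey  INT IDENTITY(1,1) PRIMARY KEY,      -- surrogate key
        CategoryName NVARCHAR(50)
    );

    CREATE TABLE DimProductSubcategory (
        SubcategoryKey  INT IDENTITY(1,1) PRIMARY KEY,   -- surrogate key
        CategoryKey     INT NOT NULL REFERENCES DimProductCategory(CategoryKey),
        SubcategoryName NVARCHAR(50)
    );

    CREATE TABLE DimProduct (
        ProductKey     INT IDENTITY(1,1) PRIMARY KEY,    -- surrogate key
        SubcategoryKey INT NOT NULL REFERENCES DimProductSubcategory(SubcategoryKey),
        ProductId      NVARCHAR(20),                     -- business key from the OLTP system
        ProductName    NVARCHAR(100)
    );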

Here it is worth discussing how dimension tables change over time, something we encounter often; we call this the slowly changing dimension.

For example, a new product is added, a product's ID number changes, or a product gains a new attribute; in these cases the dimension table has to be modified or have new rows added. This means that in the ETL process we must account for how slowly changing dimensions are handled. There are three types of slowly changing dimension (each is sketched in SQL after the list):

1. The first type of slowly changing dimension: the historical data needs to be corrected. In this case we simply UPDATE the data in the dimension table. For example, a product's ID number is 123; later it turns out the ID number is wrong and should be 456, so during the ETL process we directly change the ID number in the original dimension table row to 456.

2. The second type of slowly changing dimension: the historical data must be kept and the new data must also be kept. Here we expire the original row and insert a new one, using UPDATE plus INSERT. For example, an employee worked in department A in 2005 and transferred to department B in 2006. Statistics for 2005 should attribute that employee to department A, statistics for 2006 should attribute the employee to department B, and newly inserted data should be processed under the new department (department B). One approach is to add a status column to the dimension table and mark the historical row as "expired" and the current row as "current". Another approach is to time-stamp the dimension, that is, store the effective time period as attributes of the historical row and match on that period when joining the source table to generate the fact table; the benefit of this approach is that the effective period of each dimension member is explicit.

3. The third type of slowly changing dimension: the new data changes an attribute of the dimension member. For example, a dimension member gains a new column; the historical data cannot be browsed by it, but the current and future data can. In that case we need to change the dimension table's structure, that is, add a new field, and then use a stored procedure or a view to populate the new dimension attribute in the data that arrives afterwards.
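The three cases above can be sketched in SQL. The tables, columns, and values below (DimProduct, a hypothetical DimEmployee with EffectiveFrom/EffectiveTo/IsCurrent columns, SourceProducts, and the sample employee 'E001') are illustrative assumptions, not taken from the original figures.

    -- Type 1: correct the historical value in place; no history survives.
    UPDATE DimProduct
    SET    ProductId = '456'
    WHERE  ProductId = '123';

    -- Type 2: expire the old row, then insert the new version of the member.
    UPDATE DimEmployee
    SET    EffectiveTo = '2005-12-31', IsCurrent = 0
    WHERE  EmployeeId = 'E001' AND IsCurrent = 1;

    INSERT INTO DimEmployee (EmployeeId, Department, EffectiveFrom, EffectiveTo, IsCurrent)
    VALUES ('E001', 'B', '2006-01-01', '9999-12-31', 1);
    -- When generating the fact table, match the source row's timestamp to the effective period:
    --   ON s.EmployeeId = d.EmployeeId AND s.TradeTime BETWEEN d.EffectiveFrom AND d.EffectiveTo

    -- Type 3 (as described above): add a new attribute that only current and future data carries.
    ALTER TABLE DimProduct ADD PackageType NVARCHAR(30) NULL;  -- historical rows stay NULL
    GO
    UPDATE d
    SET    d.PackageType = s.PackageType
    FROM   DimProduct AS d
    JOIN   SourceProducts AS s ON s.ProductId = d.ProductId;   -- populate from subsequent loads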

(E) Create the fact table

After the fact data and dimensions have been determined, we consider how to load the fact table.

When a company has accumulated a large amount of data and we look at what is inside, we find record after record of production, record after record of transactions... these records are the raw data for the fact table we are going to build, that is, the records about the subject that will go into the fact table.

Our approach is to join the original source table with the dimension tables to generate the fact table (Figure VI: pic6.bmp). Note that the source data may contain nulls (a dirty data source), so outer joins are required; after each join we place the resulting dimension surrogate keys into the fact table. Besides the dimension surrogate keys, the fact table holds only the measures, which come from the original table. The fact table should contain nothing but the dimension surrogate keys and the measures, and no descriptive information, in keeping with the "tall and thin" principle: the fact table should have as many rows as possible (the minimum granularity) and as little descriptive information as possible.
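A sketch of such a load, reusing the illustrative tables from earlier: the LEFT JOINs handle dirty or null source values, and the "-1" keys assume an "unknown" member has been pre-seeded into each dimension.

    -- Generate the fact table: only dimension surrogate keys and measures survive.
    INSERT INTO FactSales (TimeKey, RegionKey, SalesAmount)
    SELECT
        ISNULL(t.TimeKey,   -1),                -- -1 = pre-seeded "unknown" dimension member
        ISNULL(r.RegionKey, -1),
        SUM(s.Amount)                           -- the measure, taken from the source table
    FROM SourceSales AS s
    LEFT JOIN DimTime   AS t ON t.[Date]     = CAST(s.TradeTime AS DATE)
    LEFT JOIN DimRegion AS r ON r.RegionName = s.Region
    GROUP BY ISNULL(t.TimeKey, -1), ISNULL(r.RegionKey, -1);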

 

If you need to allow for future expansion of the fact table, you can add a unique identifier column so that the fact table can later be extended like a snowflaked dimension, but when this is not required it is generally recommended not to do so.

The fact table is the core of the data warehouse and needs careful maintenance. The fact table obtained from the JOINs generally has a very large number of records, so we need to set a composite primary key and indexes on it to ensure data integrity and to optimize query performance for the data warehouse. The fact table and the dimension tables are placed together in the data warehouse; if the front end needs to query the data warehouse directly, we should also build a number of intermediate summary tables or materialized views to make querying easier.
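A sketch of those settings on the illustrative FactSales table, assuming one row per time/region combination:

    -- Composite primary key over the surrogate keys, plus an index for common queries.
    ALTER TABLE FactSales
        ADD CONSTRAINT PK_FactSales PRIMARY KEY (TimeKey, RegionKey);

    CREATE NONCLUSTERED INDEX IX_FactSales_Region
        ON FactSales (RegionKey) INCLUDE (SalesAmount);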

Third, what is ETL

In building a data warehouse, ETL runs through the entire project; it is the lifeblood of the whole data warehouse and covers data cleansing, integration, transformation, loading, and other processes. If the data warehouse is a building, ETL is its foundation. The quality of the data that ETL extracts and integrates directly affects what is finally presented. So ETL plays a key role in a data warehouse project and must be given a very important position.

ETL is the abbreviation of data Extraction, Transformation, and Loading. It means extracting data out of the OLTP systems, transforming it and integrating data from different data sources into consistent data, and then loading it into the data warehouse. For example, the following figure shows the effect of an ETL data transformation (Figure VII: pic7.bmp).

 

In this transformation we have performed three operations: correcting the data format, merging data fields, and calculating new indicators. In the same way, we can refine the data warehouse according to other needs.
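Since Figure VII is not reproduced here, the following is only a guess at what such a transformation might look like in SQL, with made-up staging column names:

    -- The three operations: correct a format, merge fields, compute a new indicator.
    SELECT
        CONVERT(DATE, OrderDateText, 112) AS OrderDate,    -- correct the data format (yyyymmdd text to DATE)
        FirstName + ' ' + LastName        AS CustomerName, -- merge data fields
        Quantity * UnitPrice              AS SalesAmount   -- calculate a new indicator
    FROM Staging_RawOrders;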

In short, through ETL we can generate the data warehouse from the data in the source systems. ETL builds the bridge between the OLTP systems and the OLAP system for us.

 

Fourth, practical techniques

(A) Using a staging area

When building a data warehouse, if the data source sits on one server and the data warehouse on another, and the data source server is frequently accessed by clients while large volumes of data must be constantly updated, you can set up a staging area database (Figure VIII: pic8.bmp). Data is first extracted into the staging area, and all further processing is done on the data there; the benefit of this is that it avoids running sorting and other heavy operations directly against the frequently accessed source OLTP system.

 

For example, we can extract data into the staging area once a day, and then, based on the data in the staging area, perform the transformations and integrate data from different sources into consistent data. The staging area holds the raw extraction tables, the intermediate and temporary tables used for transformation, the ETL log tables, and so on.

(B) Using timestamps

The time dimension is very important to the subject of a fact, because different times carry different statistics, and recording information by time plays a very important role. In ETL the timestamp has special uses: in the slowly changing dimensions mentioned above, we can use timestamps to identify dimension members; when logging operations against the database and the data warehouse, we also use timestamps to identify the records. For example, during data extraction we extract from the OLTP system according to the timestamp: say the job runs at 00:00 and takes the previous day's data, then based on the OLTP system's timestamp we take records from GETDATE minus one day up to GETDATE, and thus obtain the previous day's data.
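A sketch of such a timestamp-driven extract into the staging area, assuming the source rows carry a TradeTime timestamp and the staging table name is illustrative; run at 00:00 it picks up exactly the previous day:

    -- Daily extract at 00:00: take the rows stamped within the previous day.
    INSERT INTO Staging_SourceSales (OrderId, Region, Amount, TradeTime)
    SELECT OrderId, Region, Amount, TradeTime
    FROM   SourceSales
    WHERE  TradeTime >= CAST(DATEADD(DAY, -1, GETDATE()) AS DATE)  -- yesterday 00:00
      AND  TradeTime <  CAST(GETDATE() AS DATE);                   -- today 00:00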

(C) Using log tables

When data is being processed, errors inevitably occur and produce error information; how do we capture that error information and correct it in time? Our approach is to use one or more log tables in which the error information is recorded: for each run we record the number of rows extracted, the number processed successfully, the number that failed, the failed data itself, the processing time, and so on. When a data error occurs it is then easy to find the problem and to correct the bad data or reprocess it.
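A sketch of such a log table and the row written after each run; the columns follow the fields listed above, and the counter variables stand in for values captured during processing.

    -- One row per ETL run: counts, failures, and timing, for troubleshooting.
    CREATE TABLE EtlLog (
        LogId         INT IDENTITY(1,1) PRIMARY KEY,
        TaskName      NVARCHAR(100),
        RowsExtracted INT,
        RowsSucceeded INT,
        RowsFailed    INT,
        ErrorMessage  NVARCHAR(2000) NULL,
        StartTime     DATETIME,
        EndTime       DATETIME
    );

    DECLARE @start DATETIME = GETDATE(), @extracted INT = 0, @failed INT = 0;
    -- ... the extraction and processing steps run here, updating the counters ...
    INSERT INTO EtlLog (TaskName, RowsExtracted, RowsSucceeded, RowsFailed, StartTime, EndTime)
    VALUES ('Load FactSales', @extracted, @extracted - @failed, @failed, @start, GETDATE());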

(D) Using scheduling

Scheduling (Figure IX: pic9.bmp) is required when the data warehouse is updated incrementally, that is, for the incremental update of the fact table. Before setting up a schedule, consider the volume of the fact data and decide how often it changes. For example, if you want to view data by day, it is best to extract by day; if the data volume is small, the data can instead be updated monthly or every six months. If there are slowly changing dimensions, the schedule must also account for when the dimension tables are updated: the dimension tables must be updated before the fact table.

 

Scheduling is a key part of the data warehouse and deserves careful consideration. Once the ETL process has been built, you want it to run on a regular basis, so scheduling is the key step in putting the ETL process into operation. Each scheduled run should not only write the processing log information to the log tables but also send email or raise alerts, which helps the technical staff keep track of the ETL process and improves the accuracy and safety of the data processing.
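One way to express that ordering is a master procedure that a SQL Server Agent job (or any other scheduler) calls nightly; this is only a sketch: the two inner procedures are placeholders, and sp_send_dbmail is just one possible alerting mechanism (it requires Database Mail to be configured).

    -- Master ETL procedure: dimensions first, then the fact table, then logging and alerting.
    CREATE PROCEDURE dbo.usp_NightlyEtl
    AS
    BEGIN
        DECLARE @start DATETIME = GETDATE();
        BEGIN TRY
            EXEC dbo.usp_UpdateDimensions;   -- placeholder: update slowly changing dimensions first
            EXEC dbo.usp_LoadFactSales;      -- placeholder: then incrementally load the fact table

            INSERT INTO EtlLog (TaskName, StartTime, EndTime)
            VALUES ('NightlyEtl', @start, GETDATE());
        END TRY
        BEGIN CATCH
            INSERT INTO EtlLog (TaskName, ErrorMessage, StartTime, EndTime)
            VALUES ('NightlyEtl', ERROR_MESSAGE(), @start, GETDATE());

            EXEC msdb.dbo.sp_send_dbmail     -- alert the technical staff
                 @recipients = 'dw-team@example.com',
                 @subject    = 'Nightly ETL failed',
                 @body       = 'See the EtlLog table for details.';
        END CATCH
    END;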

Fifth, summary

Building an enterprise-class data warehouse takes a simple five-step method; once these five steps are mastered, we can build a powerful data warehouse. However, each step contains deep content worth studying and exploring, especially in real projects, where it all has to be considered carefully. For example, if the data source contains a lot of dirty data, we should first perform data cleansing before building the data warehouse, in order to weed out the unwanted information and dirty data.

ETL is the bridge between the OLTP systems and the OLAP system, and the channel through which data flows from the source systems into the data warehouse. In a data warehouse project it determines the data quality of the whole project, so it cannot be treated carelessly; it must be given an important position, and it will lay a strong foundation for building the data warehouse!

 
