ETL subsystems

  I recently read "Pentaho Kettle Solutions" and went through its material on the ETL subsystems. There is a fairly large amount of information, so here are some brief notes.

  An ETL system has 34 subsystems, divided into four parts: extraction, cleaning and conforming, delivery, and management.

First, extraction

 Subsystem 1: Data profiling system

  Data profiling is the process of collecting statistics and other relevant information from the various data sources, with the aim of analyzing the structure and content of each source.

 Subsystem 2: Change data capture system

  The purpose is to capture data changes in the source systems. Because of data volumes and network latency, once the initial load is complete the data should not be reloaded in full. Instead, changed or updated rows are identified, for example by an increasing timestamp or by comparing snapshots.
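
  As a minimal sketch of the timestamp approach, assuming each source row carries a hypothetical "updated_at" column and the ETL job remembers the time of its last load (neither name comes from the book), the changed rows could be selected like this:

```python
from datetime import datetime


def extract_changes(rows, last_loaded_at):
    """Return only the rows modified after the previous load."""
    return [r for r in rows if r["updated_at"] > last_loaded_at]


# Hypothetical source rows; "updated_at" is an assumed column name.
source_rows = [
    {"id": 1, "name": "Alice", "updated_at": datetime(2019, 8, 1, 9, 0)},
    {"id": 2, "name": "Bob", "updated_at": datetime(2019, 8, 3, 14, 30)},
    {"id": 3, "name": "Carol", "updated_at": datetime(2019, 8, 5, 8, 15)},
]

last_loaded_at = datetime(2019, 8, 2)
changed = extract_changes(source_rows, last_loaded_at)

# The newest timestamp seen becomes the starting point of the next run.
new_watermark = max((r["updated_at"] for r in changed), default=last_loaded_at)
```

  Snapshot comparison would instead diff the current extract against the previous one, which costs more but also works when the source has no reliable timestamp.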

 Subsystem 3: Extraction system

  Extracts data from the different data sources and feeds it into the ETL process.

Second, cleaning and conforming

  Very little data is free of problems, so a few steps are added to clean and correct the data before it is loaded into the data warehouse. In addition, each source system stores data in its own way. For example, some sources encode gender as 0 and 1 while others store "male" and "female"; once loaded into the data warehouse these should follow one unified standard.
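
  A small sketch of such a unification step. The target codes "M", "F" and "U", and the assumption that 1 means male and 0 means female, are purely illustrative; the real meaning of each source code has to be checked against the source system.

```python
# Assumed mapping from source encodings to one warehouse standard.
GENDER_MAP = {
    "0": "F",       # assumption: 0 means female in the numeric source
    "1": "M",       # assumption: 1 means male in the numeric source
    "female": "F",
    "male": "M",
}


def conform_gender(value):
    """Map a source-specific gender value to the unified code, or "U" if unknown."""
    key = str(value).strip().lower()
    return GENDER_MAP.get(key, "U")


cleaned = [conform_gender(v) for v in [0, 1, "male", " Female ", None]]
```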

 Subsystem 4: Data cleaning and data quality processing system

  This process mainly corrects and tidies up the dirty data that enters the ETL process, in order to improve data quality.

 Subsystem 5: Error event handling

  The purpose of error event handling is to record every error that occurs during the ETL process. This makes it easier for administrators to monitor and analyze errors regularly.
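
  One possible shape for this, sketched with an in-memory list standing in for an error event table. The field names and the "parse_amount" step are made up for the example.

```python
from datetime import datetime, timezone

# Stand-in for an error event table in the warehouse.
error_events = []


def record_error(step, row, message):
    """Append one error event so it can be monitored and analyzed later."""
    error_events.append({
        "logged_at": datetime.now(timezone.utc),
        "step": step,
        "row": row,
        "message": message,
    })


def parse_amount(row):
    """Example ETL step: convert the amount, logging instead of failing."""
    try:
        return float(row["amount"])
    except (KeyError, ValueError, TypeError) as exc:
        record_error("parse_amount", row, repr(exc))
        return None


rows = [{"amount": "12.50"}, {"amount": "n/a"}, {}]
amounts = [parse_amount(r) for r in rows]
```

  The bad rows do not stop the load; they end up in the error list together with the step that rejected them.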

 Subsystem 6: Audit dimension

  An audit dimension table is a special kind of dimension table; the fact tables in the data warehouse are linked to it. It holds metadata about changes to the fact table, such as the date and time at which the data was loaded.

 Subsystem 7: Deduplication system

  In most cases, deduplication means removing duplicate data and reconciling data from different systems that overlaps or conflicts.

 Subsystem 8: Data conforming

  The purpose of this step is to make fact data that comes from several business systems follow the same dimensions. For example, suppose company A has a customer service system with its own customer database, and a sales system that also keeps customer data. To bring both systems into the same data warehouse, the customer data of the two systems has to be merged into one unified customer dimension table, and fact data loaded from either system has to point to that same customer dimension table. The most common way to solve this is to keep the natural keys brought from the different source systems in the dimension table. When loading data, these natural keys are used to find the source system's record in the dimension table.
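
  A minimal sketch of the natural-key idea, assuming one customer dimension that stores a key column per source system. The column names "service_id" and "sales_id" and the sample values are hypothetical.

```python
# Unified customer dimension keeping the natural key of each source system.
customer_dim = [
    {"customer_key": 1, "name": "Alice", "service_id": "C-100", "sales_id": "S-9"},
    {"customer_key": 2, "name": "Bob", "service_id": "C-101", "sales_id": "S-4"},
]

# One lookup per source system: natural key -> surrogate key.
lookup = {
    "service": {row["service_id"]: row["customer_key"] for row in customer_dim},
    "sales": {row["sales_id"]: row["customer_key"] for row in customer_dim},
}


def customer_key_for(system, natural_key):
    """Resolve a source system's natural key to the shared surrogate key."""
    return lookup[system][natural_key]


# Facts from both systems end up pointing at the same dimension row.
service_fact_key = customer_key_for("service", "C-100")
sales_fact_key = customer_key_for("sales", "S-9")
```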

Third, data delivery

 Subsystem 9: Slowly changing dimension processing

  When data in a business system changes, the slowly changing dimension process updates the dimension in the data warehouse according to different rules. There are three common types of slowly changing dimension.

  Type 1: no history is kept; the new data overwrites the old data.

  Type 2: multiple records are stored. A new record is added directly while the original record is kept, and dedicated fields are used to tell the versions apart.

  Type 3: history columns are added, so the old and new values are saved in different fields. Only two versions can be kept, which suits dimensions that change no more than twice.
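
  The three types can be sketched on plain dictionaries as follows. This is only an illustration under assumed column names ("valid_from", "valid_to", "previous_...") and is not how Kettle implements its dimension steps.

```python
from datetime import date


def apply_type1(row, changes):
    """Type 1: overwrite the old values, keeping no history."""
    row.update(changes)


def apply_type2(dim_rows, natural_key, changes, change_date):
    """Type 2: close the current version and append a new one."""
    current = next(
        r for r in dim_rows
        if r["natural_key"] == natural_key and r["valid_to"] is None
    )
    current["valid_to"] = change_date
    new_row = {
        **current,
        **changes,
        "surrogate_key": max(r["surrogate_key"] for r in dim_rows) + 1,
        "valid_from": change_date,
        "valid_to": None,
    }
    dim_rows.append(new_row)
    return new_row


def apply_type3(row, column, new_value):
    """Type 3: move the old value to a history column, then overwrite."""
    row["previous_" + column] = row[column]
    row[column] = new_value


customer_dim = [{
    "surrogate_key": 1, "natural_key": "C-100", "city": "Berlin",
    "valid_from": date(2018, 1, 1), "valid_to": None,
}]
apply_type2(customer_dim, "C-100", {"city": "Munich"}, date(2019, 8, 1))
```

  After the type 2 call the dimension holds two rows for the same customer: the closed Berlin version and the open Munich version with a new surrogate key.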

 Subsystem 10: Surrogate key generation system

  A surrogate key identifies a row in a dimension table, and it has to be looked up when loading both dimension tables and fact tables. Surrogate keys are generally generated in one of three ways: 1, taking the current maximum surrogate key in the table and adding 1; 2, using a database sequence; 3, using an auto-increment field.
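
  A rough sketch of the first option, plus an in-memory counter standing in for a database sequence or auto-increment field (a real warehouse would let the database hand out these values):

```python
import itertools


def next_key_from_max(dim_rows):
    """Option 1: current maximum surrogate key plus one (1 for an empty table)."""
    return max((r["surrogate_key"] for r in dim_rows), default=0) + 1


def make_sequence(start=1):
    """Stand-in for options 2 and 3: a counter that hands out increasing keys."""
    counter = itertools.count(start)
    return lambda: next(counter)


dim_rows = [{"surrogate_key": 1}, {"surrogate_key": 2}]
next_key = make_sequence(start=next_key_from_max(dim_rows))
new_keys = [next_key(), next_key()]
```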

 Subsystem 11: Hierarchy building

  A data warehouse also has to consider how hierarchies are built and maintained. Hierarchies let users analyze data at different levels of a dimension. The simplest example is the hierarchy of the time dimension, such as "year - quarter - month - day".

 Subsystem 12: Special dimension generation system

  Besides slowly changing dimensions, a data warehouse built on a dimensional model contains at least one special dimension: the time dimension. There are of course other special dimensions, which are not listed here.
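
  A time dimension is usually generated rather than extracted. A minimal sketch, assuming one row per day, a yyyymmdd integer as the key, and year, quarter and month columns as hierarchy levels (all of these are choices made for the example):

```python
from datetime import date, timedelta


def build_date_dimension(start, end):
    """Generate one dimension row per calendar day from start to end inclusive."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "date_key": day.year * 10000 + day.month * 100 + day.day,
            "date": day,
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,
            "month": day.month,
            "day": day.day,
        })
        day += timedelta(days=1)
    return rows


date_dim = build_date_dimension(date(2019, 1, 1), date(2019, 12, 31))
```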

 Subsystem 13: Fact table loading

  The data has to be prepared before it is loaded into the fact tables of the data warehouse. There are three main types of fact table:

  1, transaction grain fact table: one row per transaction or event, such as a sales record;

  2, periodic snapshot fact table: the fact table does not keep all the data, only the data at fixed time intervals, such as monthly spending records;

  3, accumulating snapshot fact table: when new data arrives, the existing fact table record is updated.
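
  The difference between the three types shows up in how rows are written. The sketch below uses in-memory structures and invented column names, and it builds the monthly snapshot by summing transactions, which is only one way such a snapshot can be produced.

```python
from collections import defaultdict
from datetime import date

# 1. Transaction grain: one row per event, rows are only appended.
sales_fact = []


def load_transaction(row):
    sales_fact.append(row)


# 2. Periodic snapshot: one row per customer and month.
def monthly_snapshot(transactions):
    totals = defaultdict(float)
    for t in transactions:
        totals[(t["customer"], t["date"].year, t["date"].month)] += t["amount"]
    return dict(totals)


# 3. Accumulating snapshot: one row per order, updated as new data arrives.
order_fact = {}


def load_milestone(order_id, milestone, when):
    row = order_fact.setdefault(
        order_id, {"ordered": None, "shipped": None, "delivered": None}
    )
    row[milestone] = when


for t in [
    {"customer": "C-100", "date": date(2019, 8, 1), "amount": 10.0},
    {"customer": "C-100", "date": date(2019, 8, 20), "amount": 5.5},
    {"customer": "C-100", "date": date(2019, 9, 2), "amount": 3.0},
]:
    load_transaction(t)

snapshot = monthly_snapshot(sales_fact)
load_milestone(7, "ordered", date(2019, 8, 1))
load_milestone(7, "shipped", date(2019, 8, 3))
```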

 Subsystem 14: Surrogate key pipeline

  This subsystem is responsible for fetching the correct surrogate keys, which are used when loading the fact table.
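
  In rough terms the pipeline replaces every natural key on an incoming fact row with the surrogate key of the matching dimension row. A sketch, assuming the lookups are already available as dictionaries and that rows with an unknown key are set aside instead of loaded (both are assumptions for the example):

```python
def surrogate_key_pipeline(source_rows, key_lookups):
    """Swap natural keys for surrogate keys; collect rows that cannot be resolved."""
    facts, rejected = [], []
    for row in source_rows:
        fact = {"amount": row["amount"]}
        try:
            for dim, lookup in key_lookups.items():
                fact[dim + "_key"] = lookup[row[dim + "_id"]]
        except KeyError:
            rejected.append(row)
            continue
        facts.append(fact)
    return facts, rejected


# Hypothetical natural key -> surrogate key lookups, one per dimension.
key_lookups = {
    "customer": {"C-100": 1, "C-101": 2},
    "product": {"P-7": 10},
}
source_rows = [
    {"customer_id": "C-100", "product_id": "P-7", "amount": 19.9},
    {"customer_id": "C-999", "product_id": "P-7", "amount": 5.0},
]
facts, rejected = surrogate_key_pipeline(source_rows, key_lookups)
```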

 Subsystem 15: Multi-valued dimension bridge table generation system

  When several items of a dimension table are associated with one fact table row or with another dimension, a bridge table is needed. Take movie tickets and actors: to summarize how much ticket revenue each actor accounts for, a bridge table has to be built between the movie dimension and the actor dimension. The bridge table can also hold a weighting factor for each actor in a movie.
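
  With made-up numbers, the movie and actor example might look like the sketch below. The weights per movie add up to 1, so the revenue spread over the actors still adds up to the total ticket revenue.

```python
from collections import defaultdict

# Ticket revenue per movie (the fact side), illustrative values only.
movie_revenue = {"M1": 1000.0, "M2": 400.0}

# Bridge table rows: (movie, actor, weighting factor).
bridge = [
    ("M1", "A1", 0.5),
    ("M1", "A2", 0.5),
    ("M2", "A1", 1.0),
]

revenue_by_actor = defaultdict(float)
for movie, actor, weight in bridge:
    revenue_by_actor[actor] += movie_revenue[movie] * weight
```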

 Subsystem 16: Late-arriving data processing

  Both fact table data and dimension table data may arrive late. For fact data this is not a big problem: when looking up the dimension, find the surrogate key of the dimension row that was valid at the business time. Late dimension data is a little more troublesome, because the fact table may already have been loaded while the dimension data was not yet current. When the updated dimension data arrives, a record is added to the dimension table, and the newly created dimension surrogate key then has to be used to update the fact rows that already carry a surrogate key. (To be honest, I have not fully understood this section ......)
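
  For the late fact case, the lookup by business time can be sketched as follows, assuming the dimension keeps "valid_from" and "valid_to" columns per version, with None marking the current version (an assumed convention):

```python
from datetime import date


def key_valid_at(dim_rows, natural_key, business_date):
    """Return the surrogate key of the version valid on business_date, if any."""
    for row in dim_rows:
        if (
            row["natural_key"] == natural_key
            and row["valid_from"] <= business_date
            and (row["valid_to"] is None or business_date < row["valid_to"])
        ):
            return row["surrogate_key"]
    return None


customer_dim = [
    {"surrogate_key": 1, "natural_key": "C-100",
     "valid_from": date(2018, 1, 1), "valid_to": date(2019, 8, 1)},
    {"surrogate_key": 2, "natural_key": "C-100",
     "valid_from": date(2019, 8, 1), "valid_to": None},
]

# A fact dated 15 July that only arrives in August still gets the old version.
late_fact_key = key_valid_at(customer_dim, "C-100", date(2019, 7, 15))
```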

 Subsystem 17: Dimension management system

  A central control system used to prepare correct dimensions and publish them to the data warehouse.

 Subsystem 18: Fact table management system

  Responsible for all tasks related to creating, organizing, and managing fact tables.

 Subsystem 19: Aggregate building

  A database used for analysis has performance requirements. There are several ways to meet the demand for speed, and among them aggregate tables give the largest performance gain.
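
  The idea of an aggregate table, sketched in memory: the detailed fact rows are summed once along a few columns, and later queries read the much smaller result. The "amount" measure and the grouping columns are assumptions for the example.

```python
from collections import defaultdict


def build_aggregate(fact_rows, group_by):
    """Pre-compute the total amount for every combination of the group_by columns."""
    totals = defaultdict(float)
    for row in fact_rows:
        totals[tuple(row[col] for col in group_by)] += row["amount"]
    return dict(totals)


sales_fact = [
    {"year": 2019, "month": 7, "product": "A", "amount": 10.0},
    {"year": 2019, "month": 7, "product": "B", "amount": 5.0},
    {"year": 2019, "month": 8, "product": "A", "amount": 2.5},
]

monthly_sales = build_aggregate(sales_fact, ["year", "month"])
```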

 Subsystem 20: OLAP cube build system

  An OLAP database stores data in a special structure, and the data can be pre-aggregated as it is loaded. Some OLAP databases can only be written to and cannot be updated, so the original data has to be cleared before an update.

 Subsystem 21: Data integration management system

  Used to fetch data from the data warehouse and send it to other environments, typically for offline data analysis or for other special purposes, such as sending reports to users.

Fourth, management

 Subsystem 22: Scheduling system

 Subsystem 23: Backup system

 Subsystem 24: Recovery and restart system

 Subsystem 25: Version control system

 Subsystem 26: System for migrating versions from development to test and production environments

 Subsystem 27: Workflow monitoring

 Subsystem 28: Sorting system

 Subsystem 29: Lineage and dependency analysis

 Subsystem 30: Problem reporting system

 Subsystem 31: Parallel / pipeline system

 Subsystem 32: Security system

 Subsystem 33: Compliance reporting system

 Subsystem 34: Metadata repository management system

  

Origin www.cnblogs.com/lyuzt/p/11401349.html