On ETL

Getting to know ETL

Concept

ETL stands for Extract-Transform-Load. Its aim is to integrate dispersed, messy, heterogeneous data so it can serve analysis and decision-making. ETL is an important part of a BI (Business Intelligence) project, accounting for roughly 1/3 of the project's time. The difficulty lies in extracting and transforming the data, which typically ends up stored in the DW (Data Warehouse).

Common implementation methods

  1. Tools (such as Oracle's OWB or SQL Server 2000's DTS): using a tool is certainly convenient, but not flexible.
  2. SQL coding: flexible, but complex.
  3. A combination of SQL and tools: combines the advantages of both.

Data Extraction

First, identify the sources of the data and the characteristics of the data (whether it is structured, how large the volume is, etc.).

Different sources call for corresponding measures:

  1. Data in the same DBMS: use the tools or statements the DBMS itself provides.
  2. Data in different DBMSs: establish a data link through ODBC, or export from one database as txt, json, csv or other files and import them into the other database. This can also be done through an application programming interface (Python, etc.).
  3. txt, json, csv and similar files: load them into the database with the database's import facility or a programming tool.
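As a minimal sketch of the last approach — loading a flat file into a database with a programming tool — the following uses Python's standard library with SQLite. The file name `sales.csv` and its columns (`id`, `name`, `amount`) are assumptions for illustration only:

```python
import csv
import sqlite3

def load_csv_to_db(csv_path, conn):
    """Load a CSV file (assumed columns: id, name, amount) into a staging table."""
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS staging_sales (id INTEGER, name TEXT, amount REAL)"
        )
        with open(csv_path, newline="") as f:
            rows = [(r["id"], r["name"], r["amount"]) for r in csv.DictReader(f)]
        conn.executemany("INSERT INTO staging_sales VALUES (?, ?, ?)", rows)
    return len(rows)

# Usage: write a small sample file, then load it into an in-memory database.
with open("sales.csv", "w", newline="") as f:
    f.write("id,name,amount\n1,acme,10.5\n2,globex,7.25\n")

conn = sqlite3.connect(":memory:")
print(load_csv_to_db("sales.csv", conn))  # 2 rows loaded
```

In a real project the staging table would live in the target DBMS and the connection would come from the appropriate driver; the pattern (parse the file, then batch-insert) stays the same.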

Note: for systems with large data volumes, incremental extraction must be considered.

(1) Full extraction

Full extraction is similar to data migration or replication: the data in the source tables or views is extracted from the database in its entirety and converted into a format the ETL tool can recognize. Full extraction is relatively simple.

(2) Incremental extraction

Incremental extraction collects only the data that has been added or modified in the database tables since the last extraction. In ETL practice it is used far more widely than full extraction. The key to incremental extraction is how to capture the changed data. Capture methods generally face two requirements: accuracy — the system's data changes must be captured accurately at a certain frequency; and performance — the capture must not put too much pressure on the operational system or affect existing services.

Commonly used change-data-capture methods are: timestamps, full-table comparison (e.g., via MD5 checksums), log comparison, triggers, etc.
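The simplest of these, timestamp-based capture, can be sketched as follows. The table name `orders` and its `updated_at` column (maintained by the source application) are assumptions for illustration; SQLite stands in for the source database:

```python
import sqlite3

def extract_incremental(conn, last_run):
    """Timestamp-based change capture: pull only rows modified since the last
    successful extraction (the "watermark"). Assumes the source table keeps an
    application-maintained `updated_at` column."""
    cur = conn.execute(
        "SELECT id, status, updated_at FROM orders WHERE updated_at > ?",
        (last_run,),
    )
    return cur.fetchall()

# Demo with an in-memory source table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "paid",    "2024-01-01T09:00:00"),
     (2, "shipped", "2024-01-02T10:00:00"),
     (3, "paid",    "2024-01-03T11:00:00")],
)

changed = extract_incremental(conn, "2024-01-01T12:00:00")
print(len(changed))  # 2 -- only the rows changed after the watermark
```

After each run the ETL job would persist the new watermark; note the timestamp approach cannot see deletions, which is one reason log comparison or triggers are sometimes preferred.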

Data Transformation

The data warehouse side is divided into two parts: the ODS (Operational Data Store) and the DW (Data Warehouse).

DW

A data warehouse is a subject-oriented store of data that reflects historical changes in the data and is used to support management decisions.

ODS

An operational data store holds data in its current state, providing users with the current status of the business and integrated information for near-real-time operational needs.

The ODS forms a transitional layer between the operational databases and the data warehouse; its physical structure differs from the warehouse's, it can deliver high-performance response times, and its design is a hybrid of the two.

ODS data is the "current value" while the data warehouse holds "historical values"; an ODS generally stores no more than one month of data, whereas a data warehouse keeps ten years or more.

Moving data from the operational systems into the ODS involves cleansing — dirty and incomplete data is filtered out — while the conversion from ODS to DW applies business-rule calculations and aggregations.

1. Data cleansing

  The task of data cleansing is to filter out data that does not meet requirements, hand the filtered results to the responsible business department, and have it confirm whether the data should be discarded or corrected by the business unit and extracted again. Data cleansing is an iterative process of continually discovering and solving problems.

The data that does not meet requirements falls mainly into three categories: incomplete data, erroneous data, and duplicate data.
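A minimal sketch of filtering those three categories might look like the following. The record fields (`id`, `email`, `age`) and the validity rule on `age` are illustrative assumptions, not rules from the article:

```python
def clean(records):
    """Split records into kept rows and rejected rows, tagging each rejection
    with one of the three classic categories: incomplete, erroneous, duplicate.
    Field names and the age-range check are hypothetical examples."""
    seen, kept, rejected = set(), [], []
    for r in records:
        if r.get("id") is None or not r.get("email"):   # incomplete data
            rejected.append((r, "incomplete"))
        elif not (0 < r.get("age", -1) < 130):          # erroneous data
            rejected.append((r, "erroneous"))
        elif r["id"] in seen:                           # duplicate data
            rejected.append((r, "duplicate"))
        else:
            seen.add(r["id"])
            kept.append(r)
    # Rejected rows go back to the business unit for review and correction.
    return kept, rejected

rows = [
    {"id": 1, "email": "a@x.com", "age": 30},
    {"id": 2, "email": "",        "age": 25},   # incomplete
    {"id": 3, "email": "c@x.com", "age": 999},  # erroneous
    {"id": 1, "email": "a@x.com", "age": 30},   # duplicate
]
kept, rejected = clean(rows)
print(len(kept), len(rejected))  # 1 3
```

Keeping the rejection reason alongside each row is what makes the iterative review loop with the business department practical.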

2. Data transformation

  The main tasks of data transformation are converting inconsistent data, converting data granularity, and computing business rules.

  (1) Converting inconsistent data: this is an integration process that unifies the same kind of data from different business systems. For example, the same vendor's code may be XX0001 in the billing system and YY0001 in the CRM; after extraction, both are converted into one unified code.

  (2) Converting data granularity: business systems usually store very detailed data, while the data warehouse is used for analysis and does not need that level of detail. In general, the business-system data is aggregated up to the granularity of the data warehouse.

  (3) Computing business rules: different companies have different business rules and different data metrics, and sometimes these cannot be obtained by simple calculation; such metrics need to be computed during ETL and then stored in the data warehouse for analysis.
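Points (1) and (2) can be sketched together in a few lines. The mapping table, the row shape `(system, vendor_code, date, amount)`, and the unified code `V0001` are all hypothetical illustrations of the vendor-code example above:

```python
from collections import defaultdict

# Hypothetical mapping that unifies vendor codes across source systems,
# mirroring the XX0001 / YY0001 example in the text.
CODE_MAP = {("billing", "XX0001"): "V0001", ("crm", "YY0001"): "V0001"}

def transform(rows):
    """Unify inconsistent vendor codes, then aggregate detailed rows up to
    the warehouse's daily granularity."""
    totals = defaultdict(float)
    for system, code, date, amount in rows:
        vendor = CODE_MAP[(system, code)]   # (1) inconsistent-data conversion
        totals[(vendor, date)] += amount    # (2) granularity conversion
    return dict(totals)

rows = [
    ("billing", "XX0001", "2024-01-01", 10.0),
    ("crm",     "YY0001", "2024-01-01", 5.0),
    ("billing", "XX0001", "2024-01-02", 7.5),
]
print(transform(rows))
# {('V0001', '2024-01-01'): 15.0, ('V0001', '2024-01-02'): 7.5}
```

In practice the mapping table would itself live in the warehouse as a conformed dimension rather than in code.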

3. Data loading

Loading the transformed data into the target database is usually the last step of the ETL process. The best way to load data depends on the type of operation performed and on how much data needs to be loaded. When the target is a relational database, there are generally two loading methods:

(1) Direct SQL statements: insert, update, delete operations.

(2) Bulk loading: utilities such as BCP or BULK INSERT, or a relational database's own bulk-load tool or API.

The first method is used in most cases because the operations are logged and recoverable. Bulk loading, however, is easy to use and highly efficient when loading large volumes of data. Which loading method to use depends on the needs of the business system.

ETL tool selection

How should one choose an ETL tool for data integration? Generally, the following aspects need to be considered:

(1) The degree of platform support.

(2) The degree of data-source support.

(3) Whether extraction and loading performance is high, and whether the impact on the operational system's performance is small.

(4) Whether the data transformation and processing functions are strong.

(5) Whether it has management and scheduling functions.

(6) Whether it has good integration and openness.

This article draws on several other articles; thanks to their authors.

Origin www.cnblogs.com/for-ever-ly/p/10941537.html