On Data Warehouse ETL

One, basic concepts

        ETL stands for the initials of three words: Extract, Transform, Load. ETL is the most important process in a data warehouse, and also the largest part of the workload, usually accounting for about half of the effort of building the entire data warehouse.

  1. Extract: obtain operational data from the source systems;
  2. Transform: convert the data into a structure and format suitable for querying and analysis;
  3. Load: import the transformed data into the final target data warehouse;

        When building a data warehouse, data is integrated from multiple heterogeneous source systems and then placed in a centralized location for data analysis.
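The three steps above can be sketched as a minimal pipeline. This is an illustrative toy, not a real ETL framework: the table names (`src_orders`, `orders_fact`), the columns, and the cleaning rule are all hypothetical, and SQLite stands in for both the source system and the warehouse.

```python
# Minimal ETL sketch: extract from a source table, transform the rows,
# load them into a warehouse table. All names here are hypothetical.
import sqlite3

def extract(src: sqlite3.Connection):
    """Extract: pull operational rows from the source system."""
    return src.execute("SELECT order_id, amount FROM src_orders").fetchall()

def transform(rows):
    """Transform: unify data types and drop rows that fail a basic check."""
    return [(int(oid), float(amt)) for oid, amt in rows if amt is not None]

def load(dwh: sqlite3.Connection, rows):
    """Load: import the transformed rows into the target warehouse table."""
    dwh.executemany(
        "INSERT INTO orders_fact (order_id, amount) VALUES (?, ?)", rows
    )
    dwh.commit()
```

In a real system each step would be a separate job with its own scheduling and error handling; the point here is only the shape of the E → T → L flow.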

Two, E: Extract

        A typical source system is a transactional application. For example, for a sales analysis data warehouse, one source might be an order entry system, which contains all records related to order activity. Many of these records are complex, and determining which data to extract (the target data) can be very difficult. Data is usually not extracted just once; it must be extracted repeatedly at certain intervals, so that all changed data reaches the data warehouse and the data stays current. Once you have a clear picture of the data to be extracted, you can consider which extraction method to use.

        The choice of extraction method depends heavily on business needs, the source systems, and the target warehouse environment. Two principles are generally followed: first, do not add extra logic to the source system; second, do not increase the workload of the source system. In other words, be non-invasive to the source system. Two categories of extraction are described below: logical extraction and physical extraction.

1, logical extraction

        Logical extraction falls into two types: full extraction and incremental extraction.

1.1 Full extraction

        Full extraction pulls all of the data in the source system. The advantage of this approach is that there is no need to track data changes since the last successful extraction, and no extra logical information (such as timestamps) needs to be added to the source system. Typically, the first extraction is a full extraction.

1.2 Incremental extraction

        Incremental extraction pulls only the data for events that occurred after a particular point in time, that is, the data that changed after that point. Source systems often hold very large volumes of data, such as consumer-side (C-end) behavioral records; full extraction would then be very slow, so incremental extraction is a good alternative. To use incremental extraction, you must be able to identify all data that changed after a specific point in time. For example, timestamps carried in the data provided by the source system can serve as that identifier in the extraction logic. Incremental extraction techniques are often called "change data capture", or "CDC" for short. There are four common methods: timestamps, snapshots, triggers, and logs.

  • Timestamp: the source system must carry a corresponding time-sequence identifier on its data;
  • Snapshot: can use the snapshot mechanism built into the OLTP database system, or a custom implementation;
  • Trigger: triggers are a feature of relational databases;
  • Log: can use application logs or system logs; this approach is non-invasive to the source system, but requires additional log parsing;
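The timestamp method above can be sketched as follows. This assumes, hypothetically, that the source table carries an `updated_at` column and that the extraction job persists a "watermark" (the timestamp of the last successful extraction) between runs.

```python
# Timestamp-based incremental extraction (CDC) sketch. The table name
# src_orders and the updated_at column are hypothetical assumptions.
import sqlite3

def extract_incremental(src: sqlite3.Connection, last_extracted_at: str):
    """Pull only rows changed since the last successful extraction."""
    rows = src.execute(
        "SELECT order_id, amount, updated_at FROM src_orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_extracted_at,),
    ).fetchall()
    # The new watermark is the latest timestamp seen, or the old one
    # if nothing changed; the caller persists it for the next run.
    new_watermark = rows[-1][2] if rows else last_extracted_at
    return rows, new_watermark
```

Note the trade-off the bullet list states: this is simple and cheap, but it only works if the source system reliably maintains that time identifier on every change.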

        In many data warehouses, the extraction process does not use any change data capture technique. The incremental process then works as follows: the extraction system copies the entire source table into a staging area in the data warehouse, then compares this table with the data extracted from the source system last time to derive the changed data. Of course, this approach increases the processing burden on the warehouse, especially when the data volume is particularly large.
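The compare-with-last-extract approach just described amounts to diffing two snapshots keyed by primary key. A minimal sketch (the key/value shapes are hypothetical; real staging tables would be diffed in SQL):

```python
# Change detection without CDC: diff the current full extract against the
# previous one. Snapshots are dicts keyed by primary key; values are row data.
def diff_snapshots(previous: dict, current: dict):
    """Compare two keyed snapshots and classify the changes."""
    inserted = {k: v for k, v in current.items() if k not in previous}
    deleted = {k: v for k, v in previous.items() if k not in current}
    updated = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserted, updated, deleted
```

This makes the stated cost concrete: every run must hold and compare two full copies of the table, which is exactly why it scales poorly for very large sources.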

2, physical extraction

        The choice of physical extraction mechanism depends on the logical extraction method selected and on the operations and constraints possible on the source system. There are two physical extraction mechanisms: online extraction and offline extraction.

2.1 Online extraction

        Data is extracted directly from the source system.

2.2 Offline extraction

        Data is not extracted directly from the source system, but from a staging area outside the source system. The staging area may already exist (such as database backup files, redo logs, or archive logs), or it may be created by your own extraction program.

Three, T: Transform

        Data obtained from operational source systems usually requires multiple transformation operations, such as unifying data types, fixing spelling errors, eliminating ambiguous data, and converting to standard formats. An important transformation function is data cleansing, whose goal is to let only "compliant" data enter the target data warehouse. Transformation is the most complex and tedious part of ETL, occupying about 50% of the total ETL time; due to space limitations, it is not described in detail here.
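A tiny cleansing sketch of the operations just listed: unify types, normalize known misspellings, and reject non-compliant rows. The field names and the spelling-fix table are made up for illustration.

```python
# Data-cleansing transform sketch. The columns (amount, city) and the
# misspelling map are hypothetical examples, not a real rule set.
SPELLING_FIXES = {"beijin": "beijing", "shang hai": "shanghai"}

def cleanse(row: dict):
    """Return a cleaned row, or None if the row is not compliant."""
    try:
        amount = float(row["amount"])        # unify data types
    except (TypeError, ValueError, KeyError):
        return None                          # non-compliant: reject the row
    city = str(row.get("city", "")).strip().lower()
    city = SPELLING_FIXES.get(city, city)    # fix known spelling errors
    return {"amount": amount, "city": city}
```

Rejected rows would normally be routed to an error table for review rather than silently dropped.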

Four, L: Load

        The final step of ETL is to load the transformed data into the target data warehouse. Two issues need attention:

1, data load efficiency

        To improve load efficiency, you can start from the following aspects:

  • Ensure sufficient system resources;
  • For massive data volumes, use a high-performance server with dedicated resources, i.e. not shared with other systems;
  • Disable database constraints (uniqueness, not-null, check constraints, etc.) during the load, and re-enable them after the load finishes;
  • Do not use foreign key constraints;
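The disable-constraints-then-reload idea can be sketched as below. SQLite only exposes a foreign-key pragma, so it stands in here for what a real warehouse would do with statements like `ALTER TABLE ... DISABLE CONSTRAINT`; the table name is hypothetical.

```python
# Bulk-load sketch: turn constraint checking off for the load, turn it
# back on afterwards. Uses SQLite's foreign_keys pragma as a stand-in.
import sqlite3

def bulk_load(dwh: sqlite3.Connection, rows):
    dwh.execute("PRAGMA foreign_keys = OFF")  # disable checks for speed
    try:
        dwh.executemany(
            "INSERT INTO orders_fact (order_id, amount) VALUES (?, ?)", rows
        )
        dwh.commit()
    finally:
        dwh.execute("PRAGMA foreign_keys = ON")  # re-enable after the load
```

The `try/finally` matters: constraints must come back on even if the load fails, otherwise later queries run against an unguarded table.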

2, how to re-run the load process after a mid-load failure

        There are generally two situations in which the load process must be executed again.

        One situation: the load fails partway through for some reason, for example because the source table structure does not match the target table, while some tables have already been loaded successfully. With big data, loading only the failed portion is no small challenge. The solution here is to record the failure point and handle the corresponding logic in the load program.
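Recording the failure point can be as simple as checkpointing after each successfully loaded batch, so a re-run resumes where the failure occurred. A sketch (in practice the checkpoint would be persisted to a file or a control table, not kept in memory):

```python
# Restart-after-failure sketch: checkpoint after each loaded batch so a
# re-run skips what already succeeded. The checkpoint dict is a stand-in
# for durable state (a control table or file).
def load_with_checkpoint(batches, load_batch, checkpoint: dict):
    """checkpoint['done'] holds the number of batches already loaded."""
    start = checkpoint.get("done", 0)
    for i in range(start, len(batches)):
        load_batch(batches[i])      # may raise on failure
        checkpoint["done"] = i + 1  # persist durably in real life
    return checkpoint["done"]
```

After fixing the cause of the failure (for example, correcting the mismatched table structure), calling the same function again loads only the remaining batches.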

        The other situation: after a successful load, some data arrives late, bringing updates or new rows. In this case, either delete then insert, or use operations with similar functionality such as replace into or merge into.
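The replace-into idea looks like this in SQLite, whose `INSERT OR REPLACE` plays the role of MySQL's `REPLACE INTO` or Oracle's `MERGE INTO` (it needs a primary key or unique constraint to detect the conflict). Table and column names are hypothetical.

```python
# Late-arriving data sketch: upsert via SQLite's INSERT OR REPLACE.
# Requires order_id to be a PRIMARY KEY (or UNIQUE) so conflicts are detected.
import sqlite3

def upsert(dwh: sqlite3.Connection, rows):
    dwh.executemany(
        "INSERT OR REPLACE INTO orders_fact (order_id, amount) VALUES (?, ?)",
        rows,
    )
    dwh.commit()
```

Existing keys are overwritten with the late values and new keys are inserted, in one pass, which is exactly the delete-then-insert behavior the text describes.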

(A simple diagram of the relationships is attached.)

Origin www.cnblogs.com/SysoCjs/p/11345156.html