CDC + ETL data integration solutions

Inquiries and cooperation are welcome! WeChat: wonter

Glossary:

       CDC stands for Change Data Capture. Once CDC is enabled on a source table, the INSERT, UPDATE, and DELETE activity on that table is recorded in a log table. The CDC capture process writes the changed data to change tables, and the query functions that CDC provides let us retrieve that changed data.

       ETL (Extract-Transform-Load) is the data warehouse technique for loading data from source systems into a data warehouse. It describes the process of extracting data from the source, transforming it, and loading it into the destination. Commonly used tools include Kettle, Flume, and Sqoop.

       Kettle is a Java-based ETL tool with a graphical (GUI) design interface. Jobs run as workflows and can perform simple or complex data extraction, quality checks, data cleansing, data conversion, data filtering, and so on, with relatively stable performance.

       Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transporting massive volumes of log data. Flume supports customizing the data senders in a logging system in order to collect data; it also provides simple data processing and can write to a variety of (customizable) data receivers.

       Sqoop is open-source Apache software used mainly to transfer data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL ...). It suits high-volume data transfer in which a big data cluster communicates directly with a relational database.
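To make the CDC entry in the glossary more concrete, here is a minimal sketch of the change-table idea. It is not the log-based capture a production database provides: it uses triggers in an in-memory SQLite database as a stand-in, and the `staff` table, the `staff_changes` table, and the `changes_since` function are all hypothetical names chosen for illustration.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a source table plus a change table filled by triggers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT);

        -- Change table: one row per captured operation, ordered by seq.
        CREATE TABLE staff_changes (
            seq  INTEGER PRIMARY KEY AUTOINCREMENT,
            op   TEXT NOT NULL,
            id   INTEGER NOT NULL,
            name TEXT
        );

        CREATE TRIGGER staff_ins AFTER INSERT ON staff BEGIN
            INSERT INTO staff_changes (op, id, name)
            VALUES ('INSERT', NEW.id, NEW.name);
        END;

        CREATE TRIGGER staff_upd AFTER UPDATE ON staff BEGIN
            INSERT INTO staff_changes (op, id, name)
            VALUES ('UPDATE', NEW.id, NEW.name);
        END;

        CREATE TRIGGER staff_del AFTER DELETE ON staff BEGIN
            INSERT INTO staff_changes (op, id, name)
            VALUES ('DELETE', OLD.id, OLD.name);
        END;
        """
    )
    return conn


def changes_since(conn: sqlite3.Connection, last_seq: int) -> list:
    """Hypothetical query function: return changes captured after last_seq."""
    cur = conn.execute(
        "SELECT seq, op, id, name FROM staff_changes WHERE seq > ? ORDER BY seq",
        (last_seq,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    db = make_db()
    db.execute("INSERT INTO staff (id, name) VALUES (1, 'Ann')")
    db.execute("UPDATE staff SET name = 'Anna' WHERE id = 1")
    db.execute("DELETE FROM staff WHERE id = 1")
    for change in changes_since(db, 0):
        print(change)
```

A consumer would remember the last `seq` it processed and pass it to `changes_since` on the next poll, so each change is read once.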

Comparison of data integration approaches

       There are two options for data integration:

       The first is interface-based integration through an ESB. Its advantage is high data timeliness; its biggest drawback is that it depends on modifying the business systems' interfaces, which often involves vendors and interface costs. The second is to extract data through ETL and keep it synchronized in real time through CDC. Its advantage is that it does not depend on the business systems: integration only requires access permission to the business systems' databases.

CDC + ETL data integration

 

Step one: use Kettle to extract historical data into the data warehouse intermediate database.

        Kettle's graphical GUI design interface makes this first round, which involves no business-process operations, stable and efficient to implement.

 

Step two: enable the CDC function on the business system's mirror database and synchronize data in real time to the data warehouse intermediate database.

        This technique reads the mirror database's log files, then parses and replays the database operations to capture changes to business data. Log reading, parsing, replay, and related operations all run on the mirror database; the business database incurs only a small amount of I/O overhead, which minimizes the impact on the business system.
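The parse-and-replay idea in step two can be sketched roughly as follows. This is an illustration under stated assumptions, not the behavior of any particular CDC product: `ChangeEvent`, its `lsn` (log sequence number) field, and the `replay` function are hypothetical, and the target is a plain dictionary standing in for the intermediate database.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One parsed log entry (hypothetical shape)."""
    lsn: int                      # position of the entry in the log
    op: str                       # "INSERT", "UPDATE", or "DELETE"
    key: int                      # primary key of the affected row
    row: Optional[dict] = None    # row image for INSERT and UPDATE


def replay(target: dict, events: Iterable[ChangeEvent], applied_lsn: int = 0) -> int:
    """Apply events to target in log order and return the last applied lsn."""
    for ev in sorted(events, key=lambda e: e.lsn):
        if ev.lsn <= applied_lsn:
            continue  # already applied, so replaying the same batch is harmless
        if ev.op in ("INSERT", "UPDATE"):
            target[ev.key] = dict(ev.row or {})
        elif ev.op == "DELETE":
            target.pop(ev.key, None)
        else:
            raise ValueError(f"unknown operation: {ev.op}")
        applied_lsn = ev.lsn
    return applied_lsn


if __name__ == "__main__":
    warehouse = {}
    batch = [
        ChangeEvent(1, "INSERT", 10, {"name": "Ann"}),
        ChangeEvent(2, "UPDATE", 10, {"name": "Anna"}),
        ChangeEvent(3, "INSERT", 11, {"name": "Bob"}),
        ChangeEvent(4, "DELETE", 11),
    ]
    last = replay(warehouse, batch)
    print(warehouse, last)
```

Tracking the last applied position is what lets the replay resume after an interruption without applying a change twice.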

 

Step three: use Sqoop to extract historical data from the intermediate database into Hadoop.

       Yi East uses the mapping relations configured in the data warehouse catalog to generate the cross-database extraction SQL statements automatically.
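One way to picture generating extraction SQL from configured mappings is the sketch below. The `TableMapping` structure, the table and column names, the JDBC URL, and the file paths are all hypothetical; the post does not show its actual mapping format. The Sqoop options used (`--connect`, `--username`, `--password-file`, `--query`, `--target-dir`, `--num-mappers`) and the `$CONDITIONS` placeholder come from Sqoop 1's free-form query import.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMapping:
    """Hypothetical catalog entry: where a table comes from and where it goes."""
    source_table: str
    target_dir: str
    columns: dict  # source column name -> target column name


def build_select(mapping: TableMapping) -> str:
    """Generate the extraction query; Sqoop replaces $CONDITIONS at run time."""
    cols = ", ".join(f"{src} AS {dst}" for src, dst in mapping.columns.items())
    return f"SELECT {cols} FROM {mapping.source_table} WHERE $CONDITIONS"


def build_sqoop_args(mapping: TableMapping, jdbc_url: str,
                     username: str, password_file: str) -> list:
    """Build the argument list for a single-mapper free-form query import."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--query", build_select(mapping),
        "--target-dir", mapping.target_dir,
        "--num-mappers", "1",  # one mapper, so no --split-by column is needed
    ]


if __name__ == "__main__":
    m = TableMapping(
        source_table="hr.employee",
        target_dir="/warehouse/ods/staff",
        columns={"emp_id": "staff_id", "emp_name": "staff_name"},
    )
    # All connection details below are placeholders.
    print(build_sqoop_args(m, "jdbc:mysql://db.example.internal/mid", "etl", "/user/etl/.pw"))
```

Passing the arguments as a list (for example to `subprocess.run`) avoids shell quoting problems with the `$CONDITIONS` token.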

 

 

 

Step four: use the CDC function to synchronize base-table data to Hadoop in real time.

       Base tables are tables such as staff information tables and data dictionary tables, whose contents change through routine maintenance.

Step five: use Flume to extract incremental data from record tables into Hadoop in real time.

       Record tables are tables whose rows carry a timestamp; changed content is appended incrementally, together with the state of the modifying operation.
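The timestamp-based increment described in step five can be sketched as a watermark query. This is not a Flume configuration; it only illustrates the selection logic, using an in-memory SQLite table. The `visit_record` table, its columns, and the `extract_increment` function are hypothetical, and timestamps are assumed to be ISO-8601 strings so that string comparison matches time order.

```python
import sqlite3


def make_record_table() -> sqlite3.Connection:
    """Create a hypothetical append-only record table with a timestamp column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE visit_record ("
        "id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)"
    )
    return conn


def extract_increment(conn: sqlite3.Connection, last_ts: str):
    """Return rows newer than last_ts and the new watermark to remember."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM visit_record "
        "WHERE updated_at > ? ORDER BY updated_at, id",
        (last_ts,),
    ).fetchall()
    new_ts = rows[-1][2] if rows else last_ts
    return rows, new_ts


if __name__ == "__main__":
    db = make_record_table()
    db.executemany(
        "INSERT INTO visit_record VALUES (?, ?, ?)",
        [
            (1, "created", "2020-01-01T08:00:00"),
            (2, "modified", "2020-01-01T09:00:00"),
        ],
    )
    batch, watermark = extract_increment(db, "2020-01-01T08:30:00")
    print(batch, watermark)
```

A strict `>` comparison can skip rows that share the watermark's timestamp but arrive later, so a real pipeline may need a tie-breaker such as the row id.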

Step six: the data middle platform.

       The data lake and data marts provide data services: according to business needs, users select the required data mart fields, an Elasticsearch index is generated from them, and the data interface is generated automatically.
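As a rough illustration of turning selected fields into an index definition, the sketch below builds the JSON body that Elasticsearch accepts when creating an index with explicit mappings. The field names, the `build_index_body` function, and the small whitelist of field types are assumptions for the example; the post does not describe how its platform does this.

```python
import json

# A few common Elasticsearch field types; this whitelist is an assumption
# made for the sketch, not a complete list.
SUPPORTED_TYPES = {"keyword", "text", "integer", "long", "double", "boolean", "date"}


def build_index_body(selected_fields: dict) -> dict:
    """Map the selected field name -> type pairs to an index creation body."""
    properties = {}
    for name, field_type in selected_fields.items():
        if field_type not in SUPPORTED_TYPES:
            raise ValueError(f"unsupported field type for {name}: {field_type}")
        properties[name] = {"type": field_type}
    return {"mappings": {"properties": properties}}


if __name__ == "__main__":
    # Hypothetical fields a user might tick in a data mart.
    body = build_index_body({
        "staff_id": "keyword",
        "department": "keyword",
        "visit_date": "date",
    })
    print(json.dumps(body, indent=2))
```

The resulting dictionary would be sent as the request body when the index is created; the request itself is left out because it depends on the cluster and client in use.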

Data applications

 

 

 

Recommended reading:

Hospital information integration platform (ESB): implementation and construction plan

Hospital information integration platform (ESB): data integration construction plan

How to put ETL technology into practice

 


Origin www.cnblogs.com/Javame/p/12168001.html