A detailed introduction and feature comparison of six mainstream ETL tools

Reprint link: https://cloud.tencent.com/developer/article/1531141

Overview

ETL stands for Extract-Transform-Load: the process of extracting data from a source, transforming it, and loading it into a target system. Enterprise and industrial applications constantly involve data processing, transformation, and migration, so understanding and mastering an ETL tool is essential. I have been using Kettle a lot for data processing recently, so I am writing up this topic as well. Let's compare several mainstream ETL tools.
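To make the three stages concrete, here is a minimal sketch of an ETL run using only the Python standard library. The sample data, the `orders` table, and the cleaning rules are all made up for illustration; none of the tools below works exactly this way.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would be a file or a source database.
RAW_CSV = """id,name,amount
1, alice ,10.50
2,BOB,3
3,carol,
"""

def extract(text):
    """Extract: read CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize names, parse amounts, drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append((int(row["id"]), row["name"].strip().title(), float(amount)))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

def run_etl(conn):
    """Run the three stages in order and return what ended up in the target."""
    load(transform(extract(RAW_CSV)), conn)
    return conn.execute("SELECT id, name, amount FROM orders ORDER BY id").fetchall()

if __name__ == "__main__":
    print(run_etl(sqlite3.connect(":memory:")))
```

Keeping extract, transform, and load as separate functions mirrors how ETL tools separate sources, transformation steps, and targets, so each stage can be swapped out independently.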

1、DataPipeline

DataPipeline is a technology company that provides data infrastructure services for enterprise users. The DataPipeline data quality platform integrates data quality analysis, quality verification, quality monitoring, and other features to ensure the integrity, consistency, accuracy, and uniqueness of data. It is designed to solve the problems of data silos and evolving data definitions.

 


2、Kettle

Kettle is an open source ETL tool developed outside China. It is written in pure Java, runs on Windows, Linux, and Unix, and extracts data efficiently and stably. The name refers to a water kettle: Matt, the project's lead developer, wanted to put all kinds of data into one kettle and then pour it out in a specified format.

The Kettle family currently includes four products: Spoon, Pan, Chef, and Kitchen.

Spoon lets you design ETL transformations (Transformation) through a graphical interface.

Pan lets you run transformations designed in Spoon in batch mode (for example, from a time-based scheduler). Pan is a background program with no graphical interface.

Chef lets you create jobs (Job). A job chains together transformations, other jobs, scripts, and so on, which makes it easier to automate the complex work of updating a data warehouse. Each step in a job is checked to see whether it ran correctly.

Kitchen lets you run jobs designed in Chef in batch mode (for example, from a time-based scheduler). Kitchen is also a background program.
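As a rough sketch of how Pan and Kitchen can be driven from a scheduler, the Python helper below builds the command line for a saved transformation (`.ktr`) or job (`.kjb`). It assumes the Linux launcher scripts `pan.sh` and `kitchen.sh` and their `-file` and `-level` options; the install directory and file paths are placeholders, and the function names are invented for this example.

```python
import subprocess
from pathlib import Path

def build_kettle_command(install_dir, path, level="Basic"):
    """Choose pan.sh for a transformation (.ktr) or kitchen.sh for a job (.kjb)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".ktr":
        script = "pan.sh"
    elif suffix == ".kjb":
        script = "kitchen.sh"
    else:
        raise ValueError(f"expected a .ktr or .kjb file, got: {path}")
    return [str(Path(install_dir) / script), f"-file={path}", f"-level={level}"]

def run_kettle(install_dir, path, level="Basic"):
    """Run the transformation or job and return the process exit code."""
    # Assumption: Kettle is installed under install_dir; a non-zero exit code
    # is treated by the caller as a failed run.
    return subprocess.run(build_kettle_command(install_dir, path, level)).returncode

if __name__ == "__main__":
    # Placeholder paths; only the command is printed, nothing is executed.
    print(build_kettle_command("/opt/data-integration", "/etl/load_orders.ktr"))
    print(build_kettle_command("/opt/data-integration", "/etl/nightly.kjb"))
```

A cron entry or any other scheduler can then call `run_kettle` and alert on a non-zero return value.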

 

3、Talend

Talend is a professional open source integration software company. It provides enterprises with open source middleware solutions so that they can get more value from their applications, systems, and databases. Talend develops its software as open source in a field where traditional software vendors offer closed, proprietary solutions. Talend can work with Hadoop clusters and directly generate MapReduce code for Hadoop to run, which reduces the difficulty and cost of deployment and speeds up analysis. Talend also supports Hadoop 2.0, which can perform concurrent transaction processing.

 


4、Informatica

Informatica is a leading global provider of data management software. It has been named a leader in the following Gartner Magic Quadrants: Magic Quadrant for Data Integration Tools, Magic Quadrant for Data Quality Tools, Magic Quadrant for Metadata Management Solutions, Magic Quadrant for Master Data Management Solutions, and Magic Quadrant for Enterprise Integration Platform as a Service (EiPaaS).

Informatica Enterprise Data Integration includes two major products, Informatica PowerCenter and Informatica PowerExchange. With its high-performance, fully scalable platform, it can handle almost any data integration project or enterprise integration solution.

· Informatica PowerCenter is used to access and integrate almost any business system and data in any format. It can deliver data within the enterprise at any speed and offers high performance, high scalability, and high availability. Informatica PowerCenter comes in four editions: Standard Edition, Real-Time Edition, Advanced Edition, and Cloud Computing Edition. It also provides a number of optional components that extend the core data integration functions of Informatica PowerCenter, including data cleansing and matching, data masking, data validation, Teradata dual load, enterprise grid, metadata exchange, Pushdown Optimization, team development, and unstructured data.

· Informatica PowerExchange is a family of data access products that lets IT organizations access and deliver critical data wherever and whenever it is needed. With this capability, IT organizations can get the most business value out of limited resources and data. Informatica PowerExchange supports a wide variety of data sources and applications, including enterprise applications, databases and data warehouses, mainframes, midrange systems, messaging systems, and technical standards.

 

5、DataX

DataX is an offline data synchronization tool/platform. It provides efficient data synchronization between a wide range of heterogeneous data sources, including MySQL, Oracle, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), and DRDS.

Open source address: https://github.com/alibaba/DataX
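A DataX synchronization is described by a JSON job file that pairs one reader plugin with one writer plugin. The sketch below generates such a file in Python, following the job layout used in the DataX repository's examples (`mysqlreader` as the source, `streamwriter` printing rows to the console). The helper name, connection string, credentials, table, and column names are placeholders for illustration.

```python
import json

def build_datax_job(jdbc_url, table, columns, username, password, channels=1):
    """Build a DataX job description: one reader, one writer, plus speed settings."""
    return {
        "job": {
            # Number of concurrent channels used for the transfer.
            "setting": {"speed": {"channel": channels}},
            "content": [
                {
                    "reader": {
                        "name": "mysqlreader",
                        "parameter": {
                            "username": username,
                            "password": password,
                            "column": columns,
                            "connection": [
                                {"table": [table], "jdbcUrl": [jdbc_url]}
                            ],
                        },
                    },
                    # streamwriter prints rows to stdout, which is handy for a dry run;
                    # replace it with a real writer plugin for an actual target.
                    "writer": {
                        "name": "streamwriter",
                        "parameter": {"print": True},
                    },
                }
            ],
        }
    }

if __name__ == "__main__":
    job = build_datax_job(
        jdbc_url="jdbc:mysql://127.0.0.1:3306/shop",  # placeholder database
        table="orders",
        columns=["id", "name", "amount"],
        username="etl_user",
        password="change-me",
        channels=3,
    )
    with open("job.json", "w", encoding="utf-8") as handle:
        json.dump(job, handle, indent=2)
```

The generated file can then be passed to the launcher script shipped with DataX, for example `python bin/datax.py job.json` from the DataX install directory.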

 

6、Oracle GoldenGate

GoldenGate is log-based structured data replication software. It captures, transforms, and delivers large volumes of transaction data in real time, keeping the source and target databases synchronized with sub-second latency.

On the source side, the extract process reads the contents of the redo log or archive log, and the pump process sends them to the target side over TCP/IP. Finally, the replicat (rep) process on the target side receives the log data, parses it, and applies it to the target database, completing the data synchronization.
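The extract, pump, and apply flow can be illustrated with a toy model in Python. This is only a simulation of the idea of log-based replication, not GoldenGate's API: the log is a plain list, the `scn` field stands in for a log position, the network transfer is an in-memory queue, and the target database is a dictionary.

```python
from collections import deque

# Hypothetical change log; each record has a position ("scn"), an operation, and a row.
REDO_LOG = [
    {"scn": 1, "op": "INSERT", "key": 1, "value": "alice"},
    {"scn": 2, "op": "INSERT", "key": 2, "value": "bob"},
    {"scn": 3, "op": "UPDATE", "key": 1, "value": "alicia"},
    {"scn": 4, "op": "DELETE", "key": 2, "value": None},
]

def extract(redo_log, last_position):
    """Capture only the change records written after the last applied position."""
    return [record for record in redo_log if record["scn"] > last_position]

def pump(records, trail):
    """Ship captured records to the target-side queue (stands in for TCP/IP transfer)."""
    trail.extend(records)

def replicat(trail, target):
    """Apply queued changes to the target in log order; return the last applied position."""
    position = None
    while trail:
        record = trail.popleft()
        if record["op"] == "DELETE":
            target.pop(record["key"], None)
        else:  # INSERT and UPDATE both write the new row value
            target[record["key"]] = record["value"]
        position = record["scn"]
    return position

def synchronize(redo_log, target, last_position=0):
    """Run one capture, ship, and apply cycle; return the new checkpoint position."""
    trail = deque()
    pump(extract(redo_log, last_position), trail)
    applied = replicat(trail, target)
    return last_position if applied is None else applied

if __name__ == "__main__":
    target = {}
    checkpoint = synchronize(REDO_LOG, target)
    print(target, checkpoint)
```

Because each cycle starts from the last checkpoint, running `synchronize` again with the returned position applies only changes that arrived since the previous run, which is the property that lets log-based replication keep up with the source incrementally.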

 

7、Comparison of ETL tools

Organized into a table as follows:

 

 

 


Origin blog.csdn.net/kevin1993best/article/details/105485666