Comparison of big data ETL tools (Sqoop, DataX, Kettle)

Foreword

During an internship I worked on a database migration project. Data warehouse and big data integration work is usually carried out with the help of ETL tools, and the company and its customers use three of them: Sqoop, DataX, and Kettle. This post briefly sorts out these three ETL tools.
An ETL tool covers the process of extracting (extract) the source data, transforming (transform) it, and loading (load) it into the target system.

1. Sqoop

1.1 Introduction

Sqoop (SQL-to-Hadoop) transfers data between SQL databases and Hadoop.
It is an open-source Apache tool for moving data between Hadoop and relational database servers: it can import tables from a relational database (MySQL, Oracle, etc.) into HDFS, and it can also export HDFS data back into a relational database.
Under the hood, a Sqoop command is translated into a MapReduce program. Sqoop operations are divided into import and export, its strategies into table and query, and its modes into incremental and full.

Sqoop supports both full and incremental data import. Incremental import comes in two forms: one based on an incremental key column (Append) and one based on a last-modified time column (LastModified). You can also choose whether the data is imported concurrently.
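As a rough illustration of the Append mode described above, the sketch below builds and runs a Sqoop incremental import from Python. It assumes the sqoop binary is on the PATH; the JDBC URL, credentials, table name, and HDFS paths are placeholder values, not part of the original article.

```python
import subprocess

# Hypothetical connection details -- replace with values from your own environment.
cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host:3306/shop",  # placeholder JDBC URL
    "--username", "etl_user",
    "--password-file", "/user/etl/.db_password",    # keeps the password off the command line
    "--table", "orders",
    "--target-dir", "/warehouse/orders",            # HDFS directory to import into
    "--incremental", "append",                      # Append mode: only rows beyond --last-value
    "--check-column", "order_id",                   # the incremental key column
    "--last-value", "100000",                       # highest value imported so far
    "--num-mappers", "4",                           # degree of parallel import
]
subprocess.run(cmd, check=True)
```

Under the hood this launches the MapReduce job mentioned above; switching --incremental to lastmodified and pointing --check-column at a timestamp column gives the time-based variant.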

1.2 Features

  1. It can import data from a relational database into HDFS, Hive, or HBase, and it can also export data from these Hadoop components back into a relational database.
  2. Sqoop is built on the MapReduce computing framework: it generates a MapReduce job from the given options and runs it on the Hadoop cluster. Import and export work can therefore be spread across multiple nodes, which is more efficient than running several parallel imports or exports on a single node, and it inherits MapReduce's concurrency and fault tolerance. (A sketch of the export direction follows this list.)
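To illustrate the export direction and the parallelism knob mentioned above, here is a similarly hedged sketch; the table, directory, and connection details are again placeholder assumptions.

```python
import subprocess

# Hypothetical export: push data from HDFS back into a relational table.
cmd = [
    "sqoop", "export",
    "--connect", "jdbc:mysql://db-host:3306/shop",  # placeholder JDBC URL
    "--username", "etl_user",
    "--password-file", "/user/etl/.db_password",
    "--table", "orders_summary",                    # the target table must already exist
    "--export-dir", "/warehouse/orders_summary",    # HDFS directory to read from
    "--num-mappers", "8",                           # 8 map tasks write to the database in parallel
]
subprocess.run(cmd, check=True)
```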

2. DataX

2.1 Introduction

DataX is an offline synchronization tool for heterogeneous data sources, open-sourced by Alibaba. It is dedicated to stable and efficient data synchronization between a wide range of heterogeneous sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, FTP, and others.
To solve the synchronization problem among heterogeneous data sources, DataX turns the complex mesh of point-to-point synchronization links into a star-shaped data link: DataX sits in the middle as the transmission carrier connecting the various data sources, so a newly added source can synchronize seamlessly with the existing ones.
As an offline data synchronization framework, DataX itself is built on a Framework + plugin architecture. Reading from and writing to data sources are abstracted into Reader and Writer plugins, which plug into the overall synchronization framework.

  • Reader: the data collection module, responsible for collecting data from the source and handing it to the Framework.
  • Writer: the data writing module, responsible for continuously fetching data from the Framework and writing it to the destination. (A sketch of a complete job configuration follows below.)
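A DataX job is described by a JSON configuration that names exactly one Reader and one Writer plugin. The following sketch builds such a configuration from Python and hands it to DataX's launcher script. mysqlreader and hdfswriter are standard DataX plugins, but the connection details, column list, HDFS paths, and the install location of datax.py are placeholder assumptions.

```python
import json
import subprocess

# Minimal job skeleton: one Reader and one Writer, wired together by the Framework.
job = {
    "job": {
        "setting": {"speed": {"channel": 3}},  # run 3 concurrent channels
        "content": [{
            "reader": {
                "name": "mysqlreader",  # Reader plugin: pulls rows from MySQL
                "parameter": {
                    "username": "etl_user",
                    "password": "***",
                    "column": ["id", "name", "amount"],
                    "connection": [{
                        "table": ["orders"],
                        "jdbcUrl": ["jdbc:mysql://db-host:3306/shop"],
                    }],
                },
            },
            "writer": {
                "name": "hdfswriter",  # Writer plugin: writes files to HDFS
                "parameter": {
                    "defaultFS": "hdfs://namenode:8020",
                    "path": "/warehouse/orders",
                    "fileName": "orders",
                    "fileType": "text",
                    "fieldDelimiter": "\t",
                    "writeMode": "append",
                    "column": [
                        {"name": "id", "type": "BIGINT"},
                        {"name": "name", "type": "STRING"},
                        {"name": "amount", "type": "DOUBLE"},
                    ],
                },
            },
        }],
    }
}

with open("mysql_to_hdfs.json", "w") as f:
    json.dump(job, f, indent=2)

# DataX ships with a Python launcher script; the install path is assumed here.
subprocess.run(["python", "/opt/datax/bin/datax.py", "mysql_to_hdfs.json"], check=True)
```

Because the Reader and Writer only talk to the Framework, swapping either plugin name (and its parameter block) is all it takes to target a different source or destination.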

2.2 Features

  1. Data exchange between heterogeneous databases and file systems.
  2. Built on a Framework + plugin architecture: the Framework handles most of the technical problems of high-speed data exchange, such as buffering, flow control, concurrency, and context loading, and exposes a simple interface to the plugins, so a plugin only has to deal with accessing its particular data system.
  3. The data transfer is completed within a single process and runs entirely in memory.
  4. Strong extensibility: to support a new database or file system, a developer only needs to write a new plugin.

3. Kettle

3.1 Introduction

Kettle is a free, open-source, visual, and powerful ETL tool. Written in pure Java, it runs on all mainstream operating systems, extracts data efficiently and stably, and supports a wide range of data sources such as relational databases, NoSQL stores, and files.
Kettle has since been renamed PDI (Pentaho Data Integration).
Kettle executes at two levels:

  • Transformation: performs the basic transformation of the data.
  • Job: controls the overall workflow.


Put simply, a transformation (Trans) is a single ETL process, while a job (Job) is a collection of transformations. Within a job, transformations and sub-jobs can be orchestrated and run as scheduled tasks.

Core components

  • Spoon: the visual ETL design tool. In Spoon's graphical interface, users create connections between sources and targets and define the transformations and logic for data integration.
  • Pan: a command-line tool for running transformations (see the sketch after this list).
  • Kitchen: a command-line tool for running jobs.
  • Carte: a lightweight web container for setting up a dedicated, remote ETL server.
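To show how Pan and Kitchen are typically driven outside of Spoon, here is a minimal sketch. It assumes a Linux installation of PDI under /opt/data-integration and placeholder .ktr/.kjb files; -file and -level are standard Pan/Kitchen options.

```python
import subprocess

PDI_HOME = "/opt/data-integration"  # assumed install directory

# Run a single transformation (.ktr) with Pan.
subprocess.run(
    [f"{PDI_HOME}/pan.sh", "-file=/etl/clean_orders.ktr", "-level=Basic"],
    check=True,
)

# Run a job (.kjb), i.e. an orchestrated set of transformations, with Kitchen.
subprocess.run(
    [f"{PDI_HOME}/kitchen.sh", "-file=/etl/nightly_load.kjb", "-level=Basic"],
    check=True,
)
```

Scheduling these commands with cron or a workflow scheduler is a common way to automate Kettle-based data integration tasks.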

3.2 Features

  1. Free and open source, and cross-platform (since it is written in pure Java).
  2. Graphical interface design; no code needs to be written.
  3. Two kinds of script files: a trans handles the data transformation, and a job schedules and controls the overall workflow.
  4. Supports job scheduling and monitoring, so data integration tasks can be automated.

4. Tool comparison

DataX and Sqoop

  • Operating mode: DataX runs as a single multi-threaded process; Sqoop runs as MapReduce (MR) jobs.
  • Distributed execution: DataX does not support it; Sqoop does.
  • Flow control: DataX provides flow control; Sqoop does not.
  • Statistics: DataX provides some statistics, though reports need to be customized; Sqoop provides none.
  • Data validation: DataX has it in its core; Sqoop does not, and validating distributed collection is inconvenient.
  • Monitoring: both require customization.
DataX and Kettle

  • Data sources: DataX covers a small number of relational databases plus big-data/NoSQL sources; Kettle covers most relational databases.
  • Underlying architecture: DataX supports both standalone and cluster deployment; Kettle's master-slave structure is not highly available, scales poorly, and has low architectural fault tolerance, so it is not well suited to big data scenarios.
  • CDC mechanism: DataX works as offline batch processing; Kettle relies on timestamps, triggers, and similar mechanisms.
  • Impact on the database: DataX collects data through SQL SELECT and is not intrusive to the data source; Kettle places requirements on the database table structure and is somewhat intrusive.
  • Data cleaning: with DataX you write and invoke cleaning scripts according to your own cleaning rules (a capability provided in DataX 3.0); Kettle models and computes around the data warehouse's requirements, and its cleaning is relatively complex and requires manual programming.
  • Extraction speed: with small data volumes there is little difference; with large volumes DataX is faster than Kettle and puts less pressure on the database.
  • Community activity: DataX is open-sourced by Alibaba with a less active community; Kettle is open-source software with a highly active community.

Summary

  1. DataX and Kettle are both general-purpose data integration tools: they support many data sources and targets and provide powerful data transformation and cleaning capabilities.
  2. The difference between DataX and Kettle lies in who develops and uses them: DataX is widely used inside Alibaba, while Kettle is an independent open-source project.
  3. Sqoop is mainly used for data transfer between Hadoop and relational databases and is suited to large-scale import and export tasks.
