【Big Data】What is Data Integration? (Introduction to SeaTunnel Integration Tool)

1. What is data integration?

Data integration refers to combining data from different data sources to form a unified data set. This process includes collecting data from the various sources and then cleaning, transforming, restructuring, and integrating it so that it can be stored and managed in a unified data warehouse or data lake.

  • Data integration can help businesses better understand and leverage their data, and it facilitates data-driven decision-making and business process optimization. During data integration, issues such as data quality, data security, data formats, and data structures need to be considered, and appropriate techniques and tools should be used to address them, such as ETL (extract, transform, load) tools, data mapping tools, data cleaning tools, and data modeling tools.

  • The tools commonly used for data integration include Sqoop, DataX, and SeaTunnel, the subject of this article. All three are data integration and conversion tools, and you generally only need one of them. Sqoop can be regarded as the first generation, DataX as the second, and SeaTunnel as the third. Sqoop is not used much anymore, while DataX is still widely used. SeaTunnel is an Apache top-level project and the latest generation of data integration tools; if you are interested, follow my articles to learn about it. If you want to learn about Sqoop and DataX, you can refer to the following articles:

  • Big Data Hadoop - Data Synchronization Tool Sqoop

  • Big Data Hadoop - Data Synchronization Tool DataX


2. What is ETL?

ETL was actually mentioned in a previous article; here is just a quick review. In ETL, E stands for Extract (data extraction), T stands for Transform (data transformation), and L stands for Load (data loading).
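As a toy illustration only (the URL, file names, and table are hypothetical, and a MySQL client is assumed to be installed), the three steps might look like this in plain shell:

# E: extract — pull raw data from a source system (hypothetical CSV export)
curl -o users_raw.csv "https://example.com/export/users.csv"

# T: transform — drop the header row and keep only the 1st (name) and 3rd (age) columns
awk -F',' 'NR > 1 {print $1 "," $3}' users_raw.csv > users_clean.csv

# L: load — bulk-load the cleaned file into a hypothetical warehouse table
# (requires the MySQL server and client to allow LOCAL INFILE)
mysql --local-infile=1 -e "LOAD DATA LOCAL INFILE 'users_clean.csv' INTO TABLE dw.users FIELDS TERMINATED BY ','"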


3. Introduction to SeaTunnel

1) Overview

Apache SeaTunnel is a distributed, high-performance, easily extensible data integration platform for massive (offline and real-time) data synchronization and transformation. It can stably and efficiently synchronize tens of billions of records every day and has been used in production by nearly 100 companies.

(Figure: SeaTunnel workflow diagram)

2) The role of SeaTunnel

SeaTunnel focuses on data integration and data synchronization, and mainly aims to solve common problems in the data integration field:

  • Various data sources: There are hundreds of commonly used data sources, often with incompatible versions. As new technologies emerge, more data sources keep appearing, and it is difficult for users to find a tool that fully and quickly supports all of them.

  • Complex synchronization scenarios: Data synchronization needs to support various scenarios such as offline full synchronization, offline incremental synchronization, CDC, real-time synchronization, and whole-database synchronization.

  • High resource requirements: Existing data integration and synchronization tools often require a large number of computing resources or JDBC connections to synchronize massive numbers of small tables in real time, which places a considerable burden on enterprises.

  • Lack of quality control and monitoring: Data integration and synchronization processes often suffer from data loss or duplication. The synchronization process lacks monitoring, making it impossible to get an intuitive view of what is actually happening to the data during a task.

  • Complex technology stack: Enterprises use different technology components, and users need to develop a corresponding synchronization program for each component to complete data integration.

  • Difficulty in management and maintenance: Limited by the different underlying technical components (Flink/Spark), offline synchronization and real-time synchronization are often developed and managed separately, which increases the difficulty of management and maintenance.

3) Features of SeaTunnel

  • Rich and extensible connectors: SeaTunnel provides a connector API that does not depend on a specific execution engine. Connectors (sources, transforms, sinks) developed on this API can run on many different engines, such as the currently supported SeaTunnel Engine, Flink, and Spark.

  • Connector plugins: The plugin design allows users to easily develop their own connectors and integrate them into the SeaTunnel project. SeaTunnel currently supports more than 100 connectors, and the number is still growing; a list of currently supported connectors is available in the documentation.

  • Batch-stream unification: Connectors developed on the SeaTunnel connector API are fully compatible with scenarios such as offline synchronization, real-time synchronization, full synchronization, and incremental synchronization, which greatly reduces the difficulty of managing data integration tasks.
    A distributed snapshot algorithm is supported to ensure data consistency.

  • Multi-engine support: SeaTunnel uses the SeaTunnel Engine by default. It also supports Flink or Spark as the execution engine for connectors, to fit in with an enterprise's existing technology stack, and it supports multiple versions of Spark and Flink.

  • JDBC multiplexing and multi-table log parsing: SeaTunnel supports multi-table or whole-database synchronization, which solves the problem of too many JDBC connections, and supports multi-table or whole-database log reading and parsing, which solves the problem of repeatedly reading and parsing logs in CDC multi-table synchronization scenarios (a hedged config sketch follows this list).

  • High throughput and low latency: SeaTunnel supports parallel reading and writing and provides stable, reliable data synchronization with high throughput and low latency.

  • Complete real-time monitoring: SeaTunnel supports detailed monitoring of each step in the data synchronization process, allowing users to easily see information such as the amount of data read and written, data size, and QPS of a synchronization task.
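To illustrate the multi-table point above, here is a minimal sketch of a CDC source that reads several tables over one log stream instead of one JDBC connection per table. This is an assumption-heavy example: the MySQL-CDC connector and the exact option names (base-url, database-names, table-names) may differ across SeaTunnel versions, so treat it as a shape, not a reference.

# Hypothetical snippet — option names are assumptions; check the connector
# docs for your SeaTunnel version before using.
source {
  MySQL-CDC {
    base-url = "jdbc:mysql://localhost:3306/mydb"
    username = "user"
    password = "pass"
    # several tables parsed from one binlog stream instead of one
    # JDBC connection (and one log reader) per table
    database-names = ["mydb"]
    table-names = ["mydb.orders", "mydb.order_items", "mydb.customers"]
  }
}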

4) SeaTunnel advantages and disadvantages

Advantages

  • Easy to use, flexible configuration, no development required
  • Modular and Pluggable
  • Supports data processing and aggregation using SQL
  • Thanks to its highly encapsulated computing engine architecture, it integrates well with a data middle platform and can provide distributed computing capabilities externally

Disadvantages

  • Spark support covers 2.2.0 – 2.4.8 only; Spark 3.x is not supported
  • Flink support is pinned at 1.9.0, while Flink has iterated to 1.14.x; it is not forward compatible
  • Although a Spark job can be configured quickly, users still need to understand parameter tuning to make jobs run efficiently

5) Core idea

The core of SeaTunnel's design is the use of the "inversion of control" (or "dependency injection") design pattern, which can be summarized in the following two points:

  • The upper layer does not depend on the lower layer; both depend on abstractions;

  • Process code and business logic should be separated. The entire data processing flow can be roughly divided into: input -> transform -> output. More complex data processing is essentially a combination of these behaviors, as sketched in the config below:
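To make the input -> transform -> output idea concrete, here is a minimal config sketch in the same style as the quick-start examples later in this article. The Sql transform block and its option names are assumptions based on the transform-v2 API; verify them against the docs for your version.

env {
  execution.parallelism = 1
  job.mode = "BATCH"
}

# input: generate fake rows into a logical table named "fake"
source {
  FakeSource {
    result_table_name = "fake"
    row.num = 16
    schema = {
      fields {
        name = "string"
        age = "int"
      }
    }
  }
}

# transform: business logic expressed as SQL over the logical table
# (the Sql transform and its option names are assumptions; check the docs)
transform {
  Sql {
    source_table_name = "fake"
    result_table_name = "fake_adults"
    query = "select name, age from fake where age >= 18"
  }
}

# output: print the transformed rows
sink {
  Console {
    source_table_name = "fake_adults"
  }
}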


4. Architecture Evolution

Looking at the evolution of the SeaTunnel architecture, one of the things currently being done is to transform and upgrade the architecture from V1 to V2.

In the V1 version, SeaTunnel is essentially an ETL platform, while V2 is moving toward an ELT approach. The discussion of the overall architecture and design philosophy can be found at https://github.com/apache/incubator-seatunnel/issues/1608; if you are interested, you can read up on the past and present of the SeaTunnel architecture evolution there.

V1 architecture


  • In the V1 architecture, SeaTunnel's connectors and heterogeneous data handling depend strongly on the distributed computing engines. Each computing engine has its own API layer, and connectors also depend on Spark or Flink; the connectors that were developed are essentially Spark connectors and Flink connectors.

  • After data is ingested, it is transformed and then written out. This design philosophy keeps the amount of code small, and many details do not need to be considered, because the open-source Spark and Flink connectors have already solved most of the problems. But that is also a drawback: the strong dependence on the computing engine prevents decoupling, and whenever a computing engine goes through a major version upgrade, a large amount of the underlying code has to be reworked, which is quite difficult.

V2 architecture

Based on these pain points, the V2 version was refactored. First of all, V2 has its own API and its own set of data types, so connectors can be developed without depending on any engine. Every piece of ingested data is a SeaTunnelRow, which is passed through a translation layer that pushes the SeaTunnelRow to the corresponding computing engine.

Finally, to summarize, here is a comparison of the V1 and V2 architectures and of what the upgrade accomplished.

5. Relevant competing products and comparison

(Figure: SeaTunnel Engine performance test results)
The tools compared include DataX, which everyone is familiar with, and Kangaroo Cloud's ChunJun, which may be less familiar (it was called FlinkX before being renamed), as well as StreamPark (formerly StreamX), which recently entered the Apache incubator.

6. SeaTunnel deployment and simple use

1) Install JDK

Download address (you can also download it from the official website):

Link: https://pan.baidu.com/s/1gOFkezOH-OfDcLbUmq6Dhw?pwd=szys
Extraction code: szys

# The JDK package is in the resource bundle linked above; you can also download it from the official website.
tar -xf jdk-8u212-linux-x64.tar.gz

# Append the following to /etc/profile:
echo "export JAVA_HOME=`pwd`/jdk1.8.0_212" >> /etc/profile
echo "export PATH=\$JAVA_HOME/bin:\$PATH" >> /etc/profile
echo "export CLASSPATH=.:\$JAVA_HOME/lib/dt.jar:\$JAVA_HOME/lib/tools.jar" >> /etc/profile

# Reload so the settings take effect
source /etc/profile
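A quick sanity check that the JDK is on the PATH:

# Verify the installation; this should report java version "1.8.0_212"
java -version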

2) Download

export version="2.3.1"
wget "https://archive.apache.org/dist/incubator/seatunnel/${version}/apache-seatunnel-incubating-${version}-bin.tar.gz"
tar -xzvf "apache-seatunnel-incubating-${version}-bin.tar.gz"
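After unpacking, you should have a directory like the following (only the subdirectories used later in this article are listed; the full layout contains more):

apache-seatunnel-incubating-2.3.1/
├── bin/          # launcher scripts (seatunnel.sh, install-plugin.sh, ...)
├── config/       # job config templates and plugin_config
└── connectors/   # connector jars land in connectors/seatunnel/ after install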

3) Install the connector plug-in

Starting from 2.2.0-beta, the binary package no longer bundles connector dependencies by default, so on first use we need to run the following command to install the connectors. (You can also download the connectors manually from the Apache Maven Repository at https://repo.maven.apache.org/maven2/org/apache/seatunnel/ and then move them into the seatunnel subdirectory of the connectors directory.)

# You can edit config/plugin_config to specify which connectors to download;
# they are downloaded into connectors/seatunnel/
cd apache-seatunnel-incubating-${version}
sh bin/install-plugin.sh 2.3.1
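If you only need a few connectors, trimming config/plugin_config speeds up installation considerably. An illustrative excerpt follows; the entry names are assumptions based on the connector-<name> pattern of the jar files (e.g. connector-fake-2.3.1.jar), so check the file shipped with your version for the exact format:

# config/plugin_config — keep only the connectors you need (illustrative)
connector-fake
connector-console
connector-jdbc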

4) Quick start

Edit config/v2.batch.conf.template:

env {
  execution.parallelism = 1
  job.mode = "BATCH"
}

source {
  FakeSource {
    result_table_name = "fake"
    row.num = 16
    schema = {
      fields {
        name = "string"
        age = "int"
      }
    }
  }
}

sink {
  Console {
  }
}

Start the application:

cd "apache-seatunnel-incubating-${version}"
# 连接器:connectors/seatunnel/connector-fake-2.3.1.jar
./bin/seatunnel.sh --config ./config/v2.streaming.conf.template -e local
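If everything is wired up correctly, the Console sink prints the 16 generated rows (random name and age values) to the terminal, which also confirms that the fake connector was installed in the previous step.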


5) Quickly start using Flink

Edit config/v2.streaming.conf.template to define how data is read, processed, and written once SeaTunnel starts. Below is an example configuration file, the same as in the sample application above.

env {
  execution.parallelism = 1
  job.mode = "BATCH"
}

source {
  FakeSource {
    result_table_name = "fake"
    row.num = 16
    schema = {
      fields {
        name = "string"
        age = "int"
      }
    }
  }
}

sink {
  Console {
  }
}

Start the application (Flink versions between 1.15.x and 1.16.x):

cd "apache-seatunnel-incubating-${version}"
./bin/start-seatunnel-flink-15-connector-v2.sh --config ./config/v2.streaming.conf.template
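For older Flink versions, the distribution ships a separate launcher. The script name below follows the 2.3.1 release layout as I understand it; verify under bin/ before relying on it, and note that FLINK_HOME is expected to be configured (typically in config/seatunnel-env.sh) so the script can find your Flink installation:

# For Flink 1.12.x – 1.14.x (script name per the 2.3.1 release; check bin/)
./bin/start-seatunnel-flink-13-connector-v2.sh --config ./config/v2.streaming.conf.template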

This is just the simple example from the official documentation. If you are interested, you can experiment with other data transformation scenarios; the transformation idea is the same as in the tools covered earlier. If you have any questions, leave me a message. I will keep publishing related technical articles, so please be patient, and you can follow my official account [Big Data and Cloud Native Technology Sharing] to join the group or message me privately~


Origin: blog.csdn.net/qq_35745940/article/details/129899167