Data migration tools: these 8 are worth knowing!

Recommended reading:

(1) 5 types of microservice gateways: which one should you choose?

(2) How to optimize exporting millions of rows to Excel?

(3) A look at binlog, MySQL's data backup powerhouse

(4) Message queue principles and selection: Kafka, RocketMQ, RabbitMQ, and ActiveMQ

Foreword

Recently, some friends have asked me which ETL data migration tools they should use.

ETL is short for Extract-Transform-Load, i.e. the process of extracting, transforming, and loading data. In enterprise applications we constantly run into scenarios that require processing, transforming, and migrating data of all kinds.
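To make the three letters concrete, here is a minimal, hypothetical sketch of one ETL pass written in plain Java with JDBC; the connection URLs, table and column names, and the toy currency conversion are all made up for illustration:

```java
import java.sql.*;

// A minimal, illustrative ETL pass: extract rows from a source table,
// transform them in memory, and load them into a target table.
// Connection URLs, table and column names here are placeholders.
public class MiniEtl {
    public static void main(String[] args) throws SQLException {
        try (Connection src = DriverManager.getConnection(
                     "jdbc:mysql://source-host:3306/sales", "user", "pass");
             Connection dst = DriverManager.getConnection(
                     "jdbc:mysql://target-host:3306/dw", "user", "pass");
             Statement extract = src.createStatement();
             ResultSet rs = extract.executeQuery(
                     "SELECT id, amount, currency FROM orders");
             PreparedStatement load = dst.prepareStatement(
                     "INSERT INTO fact_orders (id, amount_usd) VALUES (?, ?)")) {

            while (rs.next()) {
                // Transform: normalize every amount to USD (toy conversion rule).
                double amount = rs.getDouble("amount");
                String currency = rs.getString("currency");
                double usd = "USD".equals(currency) ? amount : amount * 0.14;

                // Load: write the transformed row into the warehouse table.
                load.setLong(1, rs.getLong("id"));
                load.setDouble(2, usd);
                load.addBatch();
            }
            load.executeBatch();
        }
    }
}
```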

Today I've summarized some of the more commonly used ETL data migration tools on the market; I hope you find it helpful.

1. Kettle

Kettle is an open source ETL tool from abroad, written purely in Java. It is portable (no installation needed) and offers efficient, stable data extraction and migration.

Kettle has two kinds of script files: transformations and jobs. A transformation performs the basic data conversion work, while a job controls the overall workflow.
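Because Kettle is itself written in Java, a transformation designed in Spoon can also be executed from Java code. Below is a minimal sketch, assuming the Pentaho Data Integration libraries are on the classpath and that a transformation file named my_migration.ktr exists (the file name is a placeholder):

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

// Runs a transformation (.ktr) designed in Spoon, much like the Pan
// command-line tool does. "my_migration.ktr" is a placeholder file name.
public class RunKettleTransformation {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();                           // initialize the Kettle runtime
        TransMeta meta = new TransMeta("my_migration.ktr"); // load the transformation definition
        Trans trans = new Trans(meta);
        trans.execute(null);                                // no command-line parameters
        trans.waitUntilFinished();                          // block until all steps are done
        if (trans.getErrors() > 0) {
            throw new RuntimeException("Transformation finished with errors");
        }
    }
}
```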

Kettle's Chinese name means exactly that, a kettle: MATT, the project's lead programmer, hoped to be able to pour all kinds of data into one kettle and then have it come out in a specified format.

Kettle is an ETL tool set that lets you manage data from different databases through a graphical environment in which you describe what you want done rather than how to do it. The Kettle family currently includes four products: Spoon, Pan, Chef, and Kitchen.

  • Spoon: lets you design ETL transformations through a graphical interface.

  • Pan: runs transformations designed in Spoon in batch mode (for example from a scheduler). Pan runs in the background and has no graphical interface.

  • Chef: lets you create jobs. By chaining together transformations, other jobs, scripts, and so on, jobs help automate the complex work of updating a data warehouse; each step is checked to see whether it ran correctly.

  • Kitchen: runs jobs designed in Chef in batch mode (for example from a scheduler). Kitchen also runs in the background.

2. DataX

DataX is the open source version of the data integration component of Alibaba Cloud DataWorks, and an offline data synchronization tool/platform used widely across Alibaba Group.

DataX is an offline synchronization tool for heterogeneous data sources, dedicated to stable and efficient data synchronization among heterogeneous sources such as relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP.


Design concept: to solve the problem of synchronizing heterogeneous data sources, DataX turns a complex mesh of point-to-point synchronization links into a star-shaped data topology, with DataX acting as the central transport hub connecting every data source. To bring a new data source online, you only need to connect it to DataX, and it can then synchronize seamlessly with every existing source.

Current usage: DataX is used extensively within Alibaba Group, where it carries all of the group's offline big data synchronization and has been running stably for six years. It currently completes more than 80,000 synchronization jobs every day, moving over 300 TB of data daily.

As an offline data synchronization framework, DataX itself is built on a Framework + plug-in architecture: reading from and writing to a data source is abstracted into Reader and Writer plug-ins that slot into the overall synchronization framework.
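The idea can be pictured with a tiny, purely illustrative sketch. These interfaces are not DataX's real plug-in API; they only show the shape of the Framework + plug-in split, where the framework moves abstract records while each plug-in knows how to talk to one concrete data source:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A conceptual sketch of the "Framework + plug-in" idea behind DataX.
// These interfaces are illustrative only and do NOT match DataX's actual classes.
public class PluginFrameworkSketch {

    interface Reader {
        List<String> readBatch();                 // pull a batch of records; empty when exhausted
    }

    interface Writer {
        void writeBatch(List<String> records);    // push a batch of records into the target
    }

    // The framework never knows what the concrete source or target is;
    // it just shuttles records from any Reader to any Writer.
    static void sync(Reader reader, Writer writer) {
        List<String> batch;
        while (!(batch = reader.readBatch()).isEmpty()) {
            writer.writeBatch(batch);
        }
    }

    public static void main(String[] args) {
        // Toy "source" plug-in: hands out its rows once, then reports it is exhausted.
        List<String> source = new ArrayList<>(Arrays.asList("row-1", "row-2", "row-3"));
        Reader memoryReader = () -> {
            List<String> batch = new ArrayList<>(source);
            source.clear();
            return batch;
        };
        // Toy "target" plug-in: just prints what it receives.
        Writer consoleWriter = records -> records.forEach(System.out::println);

        sync(memoryReader, consoleWriter);
    }
}
```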


The open source DataX 3.0 completes synchronization jobs in single-machine, multi-threaded mode. The DataX job life cycle can be described with a sequence diagram that shows, from the overall architectural design, how the DataX modules relate to one another. The six core strengths of DataX 3.0 are:

  • Reliable data quality monitoring

  • Rich data conversion functions

  • Precise speed control

  • Strong synchronization performance

  • Robust fault-tolerance mechanism

  • Minimalist user experience

3. DataPipeline

DataPipeline uses log-based change data capture (CDC) to acquire incremental data, supports rich, automatic, and accurate semantic mapping between heterogeneous data sources, and handles both real-time and batch data processing.

It can accurately capture incremental data from Oracle, IBM DB2, MySQL, MS SQL Server, PostgreSQL, GoldenDB, TDSQL, OceanBase, and other databases.

The platform advertises six characteristics: complete data, fast transmission, strong collaboration, agility, extreme stability, and easy maintenance.

Beyond traditional relational databases, it also offers broad support for big data platforms, domestic (Chinese) databases, cloud-native databases, APIs, and object storage, with the list still expanding.

DataPipeline's data fusion products aim to deliver enterprise-grade data integration: a single platform on which users manage both real-time synchronization and batch processing tasks across heterogeneous data nodes, with support for real-time stream computing planned for the future.

Its distributed cluster deployment scales both horizontally and vertically to keep data flowing stably and efficiently, letting customers focus on unlocking the value of their data.

Features:

  • Comprehensive data node support: supports relational databases, NoSQL databases, domestic (Chinese) databases, data warehouses, big data platforms, cloud storage, APIs, and other data node types, and allows custom data nodes.

  • High-performance real-time processing: delivers TB-level throughput and second-level, low-latency incremental processing for different data node types, accelerating data flow across enterprise scenarios.

  • Tiered management that cuts costs and boosts efficiency: a layered model of data node registration, data link configuration, data task construction, and system resource allocation shortens the build-out of an enterprise-level platform from three to six months to about one week.

  • No-code agile management: offers more than ten advanced settings in two categories (restrictions and policies), including flexible data object mappings, cutting the delivery time of a data fusion task from two weeks to five minutes.

  • Extreme stability and high reliability: a distributed architecture in which every component supports high availability, with rich fault-tolerance strategies for upstream or downstream schema changes, data errors, network failures, and other surprises, meeting business continuity requirements.

  • Full-link data observability: a four-level monitoring system covering containers, applications, threads, and business, with a panoramic dashboard watching over task stability, plus automated operations and flexible scaling for sensible allocation of system resources.

4. Talend

Talend was the first open source ETL (Extract, Transform, Load) software vendor to target the data integration tool market.

With its dual innovation in technology and business model, Talend offers a new vision of ETL services: it breaks away from traditional closed, proprietary offerings and provides an open, innovative, powerful, and flexible software solution for companies of all sizes.

5. DataStage

DataStage, i.e. IBM WebSphere DataStage, is an integrated tool set that simplifies and automates extracting, transforming, and maintaining data from multiple operational data sources and loading it into a target data mart or data warehouse. It pulls data from sources on multiple platforms across different business systems, converts and cleanses it, and loads it into the various target systems.

Each step can be completed in a graphical tool, and the whole process can also be scheduled flexibly by an external system. Dedicated design tools are provided for defining transformation and cleansing rules, and practical capabilities such as incremental extraction and task scheduling are built in. Simple transformations can be assembled by dragging and dropping on the canvas and calling DataStage's predefined transformation functions; complex transformations can be implemented with scripts or by extending the tool with other languages. DataStage also provides a debugging environment, which greatly improves the efficiency of developing and debugging extraction and transformation programs.

DataStage interface (screenshot). Main characteristics:

  • Metadata support: DataStage manages metadata itself and does not depend on any database.

  • Parameter control: DataStage lets you define parameters for each job, and those parameter names can be referenced inside the job.

  • Data quality: DataStage ships with ProfileStage and QualityStage to help ensure data quality.

  • Custom development: custom extraction and transformation plug-ins are supported. DataStage embeds a BASIC-like language in which batch programs can be written for extra flexibility.

  • Modification and maintenance: everything is done through a graphical interface. The upside is that this is intuitive and foolproof; the downside is that making changes (especially batch changes) is rather cumbersome.

DataStage consists of four major components:

  • Administrator: creates or deletes projects and sets project-wide properties such as permissions.

  • Designer: connects to a specified project to design jobs.

  • Director: runs and monitors jobs, for example setting the schedule for a designed job.

  • Manager: handles job administration such as backing up jobs.

6. Sqoop

Sqoop is a data synchronization tool originally created by Cloudera and now fully open source.

Sqoop is currently the go-to choice for data migration in the Hadoop ecosystem. It is a tool for transferring data between Hadoop and relational databases: it can import data from a relational database (such as MySQL, Oracle, or Postgres) into Hadoop's HDFS, and export HDFS data back into a relational database.

It synchronizes traditional relational databases, file-based databases, and enterprise data warehouses into the Hadoop cluster.

Likewise, data in the Hadoop cluster can be exported back into those relational databases, file-based databases, and enterprise data warehouses.

So how does Sqoop extract data? The steps are as follows (a small example follows the list):

  1. First, Sqoop connects to the RDBMS and extracts the table's metadata.

  2. Using that metadata, the job is split into multiple tasks and distributed across several map tasks.

  3. Each map task then writes its share of the output to files once it completes.
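As a concrete illustration, importing one MySQL table into HDFS split across four map tasks looks roughly like this. The snippet assumes Sqoop 1.4.x is on the classpath and uses its org.apache.sqoop.Sqoop.runTool entry point; in practice most people simply run the equivalent sqoop import command from the shell, and the host, credentials, table, and paths below are placeholders:

```java
import org.apache.sqoop.Sqoop;

// Imports one MySQL table into HDFS, split across 4 map tasks.
// Equivalent to running `sqoop import ...` from the command line.
// Host, database, credentials, table and paths are placeholders.
public class SqoopImportExample {
    public static void main(String[] args) {
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:mysql://db-host:3306/sales",
            "--username", "etl_user",
            "--password", "etl_pass",
            "--table", "orders",
            "--target-dir", "/data/sales/orders",
            "--split-by", "id",        // column used to divide the work among map tasks
            "--num-mappers", "4"       // number of parallel map tasks
        };
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.exit(exitCode);
    }
}
```

The --split-by and --num-mappers options correspond to step 2 above: they decide how the table is carved up among parallel map tasks.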

7. FineDataLink

FineDataLink is a fairly good domestic (Chinese) ETL tool. It is a one-stop data processing platform with efficient data synchronization, covering real-time data transmission, data scheduling, data governance, and other composite scenarios, and providing data aggregation, development, and governance capabilities.

FDL's advantage is that it is low-code: the entire ETL process can be built through simple drag-and-drop interaction. Marketed as China's leading low-code, high-efficiency data integration product, FineDataLink provides one-stop data services for enterprises, integrating data from many sources through fast, efficient connections, and offers a low-code, agile Data API publishing platform to help enterprises break down data silos and increase the value of their data.

8. canal

canal [kə'næl], meaning waterway/pipe/ditch, exists mainly to provide incremental data subscription and consumption based on parsing MySQL's incremental logs (binlog). In the early days, Alibaba's dual data center deployment in Hangzhou and the United States created a business need for cross-data-center synchronization, which was initially met by business-level triggers that captured incremental changes. From 2010 onward, the business gradually moved to parsing database logs to obtain incremental changes for synchronization, and a large number of incremental subscription and consumption use cases grew out of this.

Services built on log-based incremental subscription and consumption include:

  • Database mirroring

  • Real-time database backup

  • Index construction and real-time maintenance (split heterogeneous indexes, inverted indexes, etc.)

  • Business cache refresh

  • Incremental data processing with business logic

canal currently supports MySQL sources running versions 5.1.x, 5.5.x, 5.6.x, 5.7.x, and 8.0.x.

How MySQL master-slave replication works:

  • The MySQL master writes data changes to its binary log (binlog; the records are called binary log events and can be viewed with show binlog events).

  • The MySQL slave copies the master's binary log events into its relay log.

  • The MySQL slave replays the events in the relay log, applying the data changes to its own data.

How canal works (a minimal client sketch follows these steps):

  • canal emulates the MySQL slave's interaction protocol: it pretends to be a MySQL slave and sends the dump protocol to the MySQL master.

  • The MySQL master receives the dump request and starts pushing its binary log to the slave (i.e. canal).

  • canal parses the binary log objects (which arrive as a byte stream).
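On the consuming side, an application reads the parsed changes through canal's client API. Here is a minimal sketch based on the canal client library, assuming a standalone canal server is already running locally on port 11111 with the default example destination; error handling is trimmed for brevity:

```java
import java.net.InetSocketAddress;

import com.alibaba.otter.canal.client.CanalConnector;
import com.alibaba.otter.canal.client.CanalConnectors;
import com.alibaba.otter.canal.protocol.CanalEntry;
import com.alibaba.otter.canal.protocol.Message;

// Connects to a local canal server, subscribes to all tables,
// and prints which tables had row changes. Assumes the default
// "example" destination and port 11111 of a standalone canal server.
public class SimpleCanalClient {
    public static void main(String[] args) throws Exception {
        CanalConnector connector = CanalConnectors.newSingleConnector(
                new InetSocketAddress("127.0.0.1", 11111), "example", "", "");
        connector.connect();
        connector.subscribe(".*\\..*");   // all schemas, all tables
        connector.rollback();             // start from the last acknowledged position

        while (true) {
            Message message = connector.getWithoutAck(100);   // fetch up to 100 entries
            long batchId = message.getId();
            if (batchId == -1 || message.getEntries().isEmpty()) {
                Thread.sleep(1000);       // nothing new yet
            } else {
                for (CanalEntry.Entry entry : message.getEntries()) {
                    if (entry.getEntryType() == CanalEntry.EntryType.ROWDATA) {
                        CanalEntry.RowChange rowChange =
                                CanalEntry.RowChange.parseFrom(entry.getStoreValue());
                        System.out.println(entry.getHeader().getSchemaName() + "."
                                + entry.getHeader().getTableName() + " -> "
                                + rowChange.getEventType());
                    }
                }
            }
            connector.ack(batchId);       // confirm the batch has been processed
        }
    }
}
```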
