DataX in practice on the Youzan big data platform


Hello everyone, I am Mr. Foot (o^^o)

While reading technical articles, I found that the Youzan platform has adopted DataX. That reminded me of Zhibei Data Center, whose data aggregation is built as secondary development on top of DataX Web, so I am drawing on Youzan's experience and design ideas to enrich our own solution.

With a large platform's practice as a reference, we can optimize the Zhibei Data platform more thoughtfully.

The road ahead is long~~~

1. Demand

In the early days of Youzan's big data work, we used Sqoop as the data synchronization tool, which covered the daily need to synchronize data between MySQL and Hive.
As the company's business grew, data synchronization scenarios multiplied, mainly between MySQL, Hive and text files, and Sqoop could no longer fully meet our needs. At the beginning of 2017 we could no longer put up with the pain Sqoop caused us and set out to rebuild our data synchronization tooling. The most painful needs at the time were:

  • Many synchronization failures were caused by changes on the MySQL side. The tool needs to support MySQL read-write splitting and sharded databases and tables, and must tolerate database migration, node downtime, and master-slave switchover.

  • Many failures were caused by table structure changes. The table structure of MySQL or Hive may change, and the tool needs to tolerate most table structure inconsistencies.

  • MySQL reads and writes must not affect online business or trigger MySQL operations alarms; we do not want to be yelled at by the DBAs every day.

  • We want to support more data sources, such as HBase, ES, and text files.

As a data platform administrator, I also hope to collect more operation details to facilitate daily maintenance:

  • Collection of run statistics, such as running time, data volume, and resource consumption
  • Dirty data verification and reporting
  • Run logs connected to the company's log platform for easy monitoring

2. Selection

Based on the data synchronization requirements above, we planned to build on open-source software. The main candidates were DataX and Sqoop. The functional comparison between them is as follows:

The main disadvantage of DataX is that it runs on a single machine, which can be worked around with the scheduling system. In other respects it is better than Sqoop. In the end, we chose to develop on top of DataX.

3. Pre-design

3.1 Operating form

The most important problem to solve when using DataX is distributed deployment and operation. DataX itself runs as a single-process client, so we had to consider how DataX runs would be triggered.
We decided to reuse the existing offline task scheduling system. The scheduling system triggers tasks, and DataX is only responsible for data synchronization. This reuses existing system capabilities and avoids duplicate development.

A DataX client is deployed on every worker server of the data platform. Multiple DataX processes can run on a worker at the same time, all controlled by the scheduling system.
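To make this operating form concrete, here is a minimal Python sketch of how a worker might start several DataX jobs side by side. It assumes DataX is unpacked under /opt/datax and started through its bundled bin/datax.py launcher. The path and the helper names build_command and launch_jobs are illustrative only, not part of DataX or of Youzan's scheduling system.

```python
import subprocess
import sys
from pathlib import Path

# Assumption: DataX is unpacked here on every worker; adjust to the real path.
DATAX_HOME = Path("/opt/datax")


def build_command(job_file, datax_home=DATAX_HOME, python=sys.executable):
    """Return the argv that runs one DataX job as a separate process."""
    return [python, str(Path(datax_home) / "bin" / "datax.py"), str(job_file)]


def launch_jobs(commands):
    """Start every command as its own process and wait for all of them.

    Returns the exit codes in the same order as `commands`.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]
```

In a real deployment the scheduling system, not the worker script, would decide which job files to run and how many processes a worker may hold at once. The sketch only shows that each DataX run is an ordinary child process whose exit code can be handed back to the scheduler.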

3.2 Executor design

To interact with the existing data platform, some custom modifications are required:

  • Status reporting that follows platform rules (start/running/end): progress must be reported while running, and success or failure must be reported at the end
  • Real-time reporting of run logs in the platform's format, for display
  • Parameters for sub-modules such as statistics, verification, and flow control can be passed in from the platform, and their results must be persisted
  • Tolerance of abnormal input, such as MySQL master-slave switchover and table structure changes
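The status-reporting requirement in the list above can be sketched as a small wrapper around the synchronization loop. This is only an illustration under assumptions: the event fields, the Status values, and the send callback (standing in for whatever API the platform exposes) are all hypothetical.

```python
from enum import Enum


class Status(Enum):
    START = "start"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"


class StatusReporter:
    """Builds status events; `send` is a hypothetical hook to the platform."""

    def __init__(self, task_id, send):
        self.task_id = task_id
        self.send = send

    def report(self, status, progress=None, message=None):
        event = {"task_id": self.task_id, "status": status.value}
        if progress is not None:
            event["progress"] = round(progress, 2)
        if message is not None:
            event["message"] = message
        self.send(event)


def run_with_reporting(task_id, batches, sync_batch, send):
    """Run `sync_batch` on every batch, reporting start, progress, and end."""
    reporter = StatusReporter(task_id, send)
    reporter.report(Status.START)
    try:
        for done, batch in enumerate(batches, start=1):
            sync_batch(batch)
            reporter.report(Status.RUNNING, progress=done / len(batches))
    except Exception as exc:
        # Any failure ends the task with an explicit FAILED event.
        reporter.report(Status.FAILED, message=str(exc))
        return False
    reporter.report(Status.SUCCESS)
    return True
```

Passing send in as a plain callable keeps the reporting logic testable without a running platform; for example, `events.append` on a list is enough to inspect the sequence of events.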

3.3 Development Strategy

The overall flow is: pre-step (configuration file conversion, table structure verification) -> (input -> DataX core + business-independent verification -> output) -> post-step (statistics, persistence).

We try to keep DataX focused on data synchronization and free of hidden business logic. Youzan-specific business logic lives outside DataX, and the source code is modified only when the data synchronization process itself cannot meet a need.
Pre-run checks such as table structure, table naming rules, and address translation, as well as persistence of run results, live in the metadata system, while monitoring of the run status lives in the scheduling system.
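As an illustration of the pre-step, the sketch below reconciles the source and target columns and then assembles a DataX job description outside of DataX itself. The reconciliation rule (sync only the columns both tables share) is an assumption for the example, not Youzan's actual rule, and reconcile_columns and build_job are hypothetical helper names. The job layout follows the usual DataX job JSON shape with a mysqlreader; the credentials are placeholders.

```python
def reconcile_columns(source_columns, target_columns):
    """Keep only columns present on both sides, in target order."""
    source = set(source_columns)
    common = [col for col in target_columns if col in source]
    if not common:
        raise ValueError("source and target tables share no columns")
    return common


def build_job(jdbc_url, table, columns, writer, channels=1):
    """Assemble a DataX job description as a plain dict."""
    reader = {
        "name": "mysqlreader",
        "parameter": {
            # Placeholders: real credentials would come from the platform.
            "username": "USER",
            "password": "PASSWORD",
            "column": columns,
            "connection": [{"table": [table], "jdbcUrl": [jdbc_url]}],
        },
    }
    return {
        "job": {
            "setting": {"speed": {"channel": channels}},
            "content": [{"reader": reader, "writer": writer}],
        }
    }
```

The writer is passed in as a ready-made dict because its parameters depend on the target system. The resulting dict can be written to a file with `json.dump` and handed to DataX as the job file.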

4. DataX Web

DataX is a script-driven collection tool, so some readers will ask: is there a visual interface for running collection tasks?

Of course there is.

DataX Web is a distributed data synchronization tool built on top of DataX. It provides a simple, easy-to-use interface, lowers the cost of learning DataX, and shortens task configuration time.

  • Users can create a data synchronization task by selecting a data source on the page.
  • Supports RDBMS, Hive, HBase, ClickHouse, MongoDB and other data sources.
  • RDBMS data sources can create data synchronization tasks in batches.
  • Supports real-time viewing of synchronization progress and logs, and provides a way to terminate a running synchronization.
  • Integrates xxl-job with secondary development, so data can be synchronized incrementally by time or by auto-increment primary key.
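Incremental synchronization by time or by auto-increment primary key usually comes down to filtering the source rows with a watermark saved from the previous run. The sketch below builds such a filter, which could be supplied as the reader's where condition. It is a simplified illustration, not DataX Web's actual implementation; incremental_where and its parameters are hypothetical, and the column name is assumed to come from trusted configuration rather than user input.

```python
from datetime import datetime


def incremental_where(column, last_value, upper_value=None):
    """Build a WHERE fragment selecting rows newer than `last_value`.

    Works for an auto-increment key (int) or a time column (datetime).
    """
    def literal(value):
        if isinstance(value, datetime):
            return "'" + value.strftime("%Y-%m-%d %H:%M:%S") + "'"
        if isinstance(value, int) and not isinstance(value, bool):
            return str(value)
        raise TypeError(f"unsupported watermark type: {type(value).__name__}")

    clause = f"{column} > {literal(last_value)}"
    if upper_value is not None:
        # A closed upper bound keeps consecutive runs from overlapping.
        clause += f" AND {column} <= {literal(upper_value)}"
    return clause
```

After a successful run, the upper bound (or the largest key actually read) would be stored as the next run's last_value.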

It also provides a visual interface for data aggregation in the form of menus.

While building the big data aggregation module, we turned the datax-web source code into a separate microservice, did our secondary development on it, and finally deployed it in containers.

5. Summary

The Youzan platform uses DataX to aggregate data and has taken many measures to optimize it. Nowadays, though, we more often integrate DataX Web and do secondary development on it, which allows distributed deployment.

Zhibei Data Central Platform integrates DataX Web entirely at the source-code level, both front end and back end.

In the future we will develop it further, for example by adding plug-ins and custom crawlers.


Origin blog.csdn.net/shujuelin/article/details/131241302