DataX, Taobao's tool for massive data exchange between heterogeneous data sources: download and usage

Introduction to DataX

DataX is a tool for high-speed data exchange between heterogeneous databases and file systems, enabling data transfer between arbitrary data processing systems (RDBMS, HDFS, local file system).

There are already many mature data import and export tools, but most handle only import or only export, and each supports only one or a few specific types of databases.
This creates a problem: if we have many different types of databases and file systems (MySQL, Oracle, RAC, Hive, and others) and frequently need to move data between them, we may have to develop, maintain, or learn a whole batch of such tools (jdbcdump, dbloader, multithread, getmerge+sqlloader, mysqldumper, and so on), and every additional type of database makes the number of tools we need grow linearly. (When you need to import MySQL data into Oracle, have you ever wished you could break jdbcdump and dbloader in half and bolt the pieces together?) Some of these tools transfer data through files, others through pipes; either way the extra relay of the data adds overhead, and their efficiency varies widely.
Many of these tools also cannot meet common ETL needs such as date format conversion, character conversion, and encoding conversion.

In addition, we sometimes need to export the same data from one database to several databases of different types within a very short time window.

DataX was born to solve these problems.

datax

DataX Features

  • High-speed data exchange between heterogeneous databases and file systems
  • Built on a Framework + plugin architecture: the Framework handles most of the technical concerns of high-speed data exchange, such as buffering, flow control, concurrency, and context loading, and exposes a simple interface to the plug-ins; a plug-in only needs to handle access to its data processing system
  • Operation mode: stand-alone
  • The whole data transfer runs in a single process, entirely in memory, with no disk reads or writes and no IPC
  • Open framework: developers can write a new plug-in in a very short time to support a new database or file system (for details, see the "DataX Plug-in Development Guide")

DataX Structural Pattern (Framework + Plugin)

 

DataX Architecture Patterns

  • Job: a data synchronization job
  • Splitter: the job splitting module, which splits a large job into multiple small tasks that can run concurrently
  • Sub-job: a small task produced by splitting a data synchronization job
  • Reader (Loader): the data reading module, responsible for running the split small tasks and loading data from the source into DataX
  • Storage: the buffer through which the Reader and Writer exchange data
  • Writer (Dumper): the data writing module, responsible for writing data from DataX out to the destination
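
In other words: a Job is split by the Splitter into sub-jobs; for each sub-job the Reader loads data from the source into Storage, and the Writer drains Storage into the destination.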

The DataX framework centrally handles the problems of high-speed data exchange through techniques such as double-buffered queues and thread-pool encapsulation, and provides a simple interface for plug-ins to interact with. Plug-ins come in just two kinds, Reader and Writer, so developing the plug-in you need is very convenient.
For example, to export data from Oracle to MySQL, all you need to do is develop an OracleReader and a MysqlWriter plug-in and assemble them on the framework, and such plug-ins are generally reusable in other data exchange scenarios.

An even bigger bonus is that the following plug-ins have already been developed:

Reader plugin

hdfsreader: supports reading data from the HDFS file system.
mysqlreader: supports reading data from a MySQL database.
sqlserverreader: supports reading data from a SQL Server database.
oraclereader: supports reading data from an Oracle database.
streamreader: supports reading data from a stream (usually used for testing).
httpreader: supports reading data from an HTTP URL.

Writer plugin

hdfswriter: supports writing data to HDFS.
mysqlwriter: supports writing data to MySQL.
oraclewriter: supports writing data to Oracle.
streamwriter: supports writing data to a stream (often used for testing).

You can choose to use or independently develop your own plug-ins as needed (see "DataX Plug-in Development Guide" for details)

Application of DataX in Taobao

After data synchronization tooling was standardized on DataX, users' data transfer speed and memory utilization improved significantly. With a single standardized DataX tool, we can also better handle operation and maintenance tasks that were hard to complete with the previously scattered tools, such as MySQL database splitting and data synchronization monitoring.
The following is a comparison of some of the tools after the replacement:

DataX

 

The following is an example of my configuration and usage after compiling the DataX source code (the download address is at the end of the article):

Environmental requirements:

1. java >= 1.6, python >= 2.6;
2. If you use Oracle, the Oracle client must be installed;
3. If you use HDFS, make sure the hadoop command line is available, and make sure the Hadoop configuration directory is linked to a config directory under the home directory of the user running DataX, i.e. execute in that user's home directory: ln -s hadoop-configure-directory /home/$user/config (see the command sketch after this list);
4. By default DataX installs into /home/taobao/datax; it is best to install as the root user, otherwise you may run into permission problems.
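
A minimal sketch of the environment checks above, assuming the Hadoop installation at /usr/local/hadoop-0.20.2 and the DataX user lxw1234 used later in this article:

# verify runtime versions
java -version          # should report 1.6 or higher
python -V              # should report 2.6 or higher
# link the Hadoop configuration directory into the DataX user's home
ln -s /usr/local/hadoop-0.20.2/conf /home/lxw1234/config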

 

Install:

1. First install the DataX engine:
rpm -ivh t_dp_datax_engine-1.0.0-1.noarch.rpm
After installation, the /home/taobao/datax directory structure is as follows:

datax installation

2. Install the required reader and writer plug-ins. For example, since I need to transfer data between HDFS and MySQL, I install the HDFS and MySQL reader and writer plug-ins:
rpm -ivh t_dp_datax_hdfsreader-1.0.0-1.noarch.rpm
rpm -ivh t_dp_datax_hdfswriter-1.0.0-1.noarch.rpm
rpm -ivh t_dp_datax_mysqlreader-1.0.0-1.noarch.rpm
rpm -ivh t_dp_datax_mysqlwriter-1.0.0-1.noarch.rpm
After successful installation, a new plugins/ directory appears under /home/taobao/datax; inside it are reader and writer directories that hold the read plug-ins and write plug-ins respectively, as shown in the figure:

datax read and write plugin
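
To double-check the installation above, a quick listing like the following (paths from the default installation) should show the installed reader and writer plug-ins:

ls /home/taobao/datax/plugins/reader
ls /home/taobao/datax/plugins/writer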

Configuration (taking MySQL-to-HDFS import as an example)

1. Create a link to the Hadoop configuration file directory (the user executing the task is lxw1234). Switch to the lxw1234 user and execute:
ln -s /usr/local/hadoop-0.20.2/conf /home/lxw1234/config
As shown in the figure:

datax configuration

2. Generate the job configuration file.
Go to /home/taobao/datax/bin/ and execute: ./datax.py -e
The screen displays as shown below:

datax configuration

The available data source types are listed (the hdfs and mysql reader plug-ins were installed earlier, so these two data sources are shown here). Select 1 (mysql), as shown in the figure:

datax configuration

The available data target types are listed (similarly, hdfs and mysql are shown). Select 0 (hdfs), as shown in the figure:

datax configuration

The job configuration file /home/taobao/datax/jobs/mysqlreader_to_hdfswriter_1432867511409.xml is generated.

3. Edit the job configuration file
vi /home/taobao/datax/jobs/mysqlreader_to_hdfswriter_1432867511409.xml

The data source is configured in the <reader></reader> tag, and the configuration information for reading data from mysql needs to be modified here:
<param key="ip" value="127.0.0.1"/>
<param key="port" value="3306"/>
<param key="dbname" value="lxw1234"/>
<param key="username" value="lxw1234"/>
<param key="password" value="lxw1234.com"/>
<param key="sql" value="select job_id,job_create_time,job_last_update_time,job_type from dmp_job_log limit 500"/>

For other reader parameters, please refer to the instructions in the user manual.

The data target is configured in the <writer></writer> tag. Here you need to modify the configuration information for writing data to HDFS:
<param key="hadoop.job.ugi" value="?"/>   // Hadoop authentication configuration; if not needed, it does not have to be configured
<param key="hadoop_conf" value="/home/lxw1234/config/core-site.xml"/>   // Hadoop configuration file
<param key="dir" value="hdfs://namenode:8020/tmp/lxw1234/datax/"/>   // the HDFS directory to write data into
<param key="field_split" value="\001"/>   // column separator for the written files
<param key="file_type" value="TXT"/>   // type of file written to HDFS
<param key="concurrency" value="1"/>   // write concurrency; each concurrent writer produces one file

For other writer parameters, please refer to the instructions in the manual.
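
Putting the two parts together, the edited sections of the job configuration file end up shaped roughly like this (only the modified <reader> and <writer> fragments are shown; any other settings generated by the wizard are left at their defaults):

<reader>
    <param key="ip" value="127.0.0.1"/>
    <param key="port" value="3306"/>
    <param key="dbname" value="lxw1234"/>
    <param key="username" value="lxw1234"/>
    <param key="password" value="lxw1234.com"/>
    <param key="sql" value="select job_id,job_create_time,job_last_update_time,job_type from dmp_job_log limit 500"/>
</reader>
<writer>
    <param key="hadoop.job.ugi" value="?"/>
    <param key="hadoop_conf" value="/home/lxw1234/config/core-site.xml"/>
    <param key="dir" value="hdfs://namenode:8020/tmp/lxw1234/datax/"/>
    <param key="field_split" value="\001"/>
    <param key="file_type" value="TXT"/>
    <param key="concurrency" value="1"/>
</writer>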

Execution:

1. Go to /home/taobao/datax/bin/ and execute:
./datax.py /home/taobao/datax/jobs/mysqlreader_to_hdfswriter_1432867511409.xml

The running result is shown in the figure:

datax execution

After execution, view the generated files on hdfs:

datax execution
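
With the output directory configured above, the files can also be listed from the command line (using the hadoop client required in the environment section; the path is the one set in the writer's dir parameter):

hadoop fs -ls hdfs://namenode:8020/tmp/lxw1234/datax/
hadoop fs -cat hdfs://namenode:8020/tmp/lxw1234/datax/* | head    # columns are separated by \001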

Notes:

1. DataX has many extended features, such as dynamic parameters, dynamic sequences, and read/write concurrency; see the documentation for details;
2. For subsequent jobs, you can simply copy and modify a previous job configuration file and then execute it;
3. The Hadoop jar bundled with the hdfs reader and writer plug-ins is the older hadoop-0.19.2-core.jar (see /home/taobao/datax/plugins/writer/hdfswriter and /home/taobao/datax/plugins/reader/hdfsreader). When using the HDFS plug-ins, copy your own Hadoop jar into the plug-in directories and delete the original hadoop-0.19.2-core.jar; for example, the Hadoop version I use is hadoop-core-0.20.2-cdh3u2.jar, so I copy that jar into the HDFS plug-in directories, as sketched below.
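
A sketch of that jar swap, assuming the CDH3u2 jar mentioned in note 3 sits in the current directory:

# replace the bundled Hadoop jar in both HDFS plug-in directories
for d in /home/taobao/datax/plugins/reader/hdfsreader /home/taobao/datax/plugins/writer/hdfswriter; do
    cp hadoop-core-0.20.2-cdh3u2.jar $d/
    rm $d/hadoop-0.19.2-core.jar
done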
