DataX case practice

1. Use of DataX

1.1. DataX task submission command

The use of DataX is very simple. The user only needs to select the appropriate Reader and Writer according to the source and destination of the data to be synchronized, configure the Reader and Writer in a JSON file, and then execute the following command to submit the data synchronization task.

[song@hadoop102 datax]$ python /opt/model/datax/bin/datax.py /opt/model/datax/job/job.json

1.2. DataX configuration file format

The DataX configuration file template can be viewed with the following command.

[song@hadoop102 datax]$ python bin/datax.py -r mysqlreader -w hdfswriter

The configuration file template is as follows:
The outermost layer of the JSON is a job, which consists of two parts: setting and content. setting configures the job as a whole, and content configures the data source and destination.
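A minimal sketch of this structure (parameters abridged; the complete reader and writer parameters appear in the cases below):

{
    "job": {
        "setting": {
            "speed": {
                "channel": 1
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {}
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {}
                }
            }
        ]
    }
}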
For the specific parameters of Reader and Writer, please refer to the official DataX documentation.

2. DataX use cases

2.1. The case of synchronizing MySQL data to HDFS

2.1.1. Case requirements

Synchronize the base_province table data in the gmall database to the /base_province directory of HDFS

2.1.2. Requirements analysis

To implement this function, MySQLReader and HDFSWriter are needed. MySQLReader has two modes: TableMode and QuerySQLMode. The former uses the table, column, where and other attributes to declare the data to be synchronized; the latter uses a SQL query statement to declare the data to be synchronized. Both modes are demonstrated below.
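As a quick orientation (abridged from the full configurations shown later in this section), the two modes differ only in the reader's parameter block:

TableMode (abridged):
"parameter": {
    "column": ["id", "name"],
    "where": "id>=3",
    "connection": [{"table": ["base_province"], "jdbcUrl": ["jdbc:mysql://..."]}]
}

QuerySQLMode (abridged):
"parameter": {
    "connection": [{"querySql": ["select id,name from base_province where id>=3"], "jdbcUrl": ["jdbc:mysql://..."]}]
}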

2.1.2.1. TableMode of MySQLReader
  1. Data preparation: MySQL data table
/*
 Navicat Premium Data Transfer

 Source Server         : hadoop102
 Source Server Type    : MySQL
 Source Server Version : 80026
 Source Host           : 192.168.10.102:3306
 Source Schema         : gmall

 Target Server Type    : MySQL
 Target Server Version : 80026
 File Encoding         : 65001

 Date: 04/02/2023 13:53:48
*/

SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;

-- ----------------------------
-- Table structure for base_province
-- ----------------------------
DROP TABLE IF EXISTS `base_province`;
CREATE TABLE `base_province`  (
  `id` bigint(0) NULL DEFAULT NULL COMMENT 'id',
  `name` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT '省名称',
  `region_id` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT '大区id',
  `area_code` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT '行政区位码',
  `iso_code` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT '国际编码',
  `iso_3166_2` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'ISO3166编码'
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;

-- ----------------------------
-- Records of base_province
-- ----------------------------
INSERT INTO `base_province` VALUES (1, '北京', '1', '110000', 'CN-11', 'CN-BJ');
INSERT INTO `base_province` VALUES (2, '天津', '1', '120000', 'CN-12', 'CN-TJ');
INSERT INTO `base_province` VALUES (3, '山西', '1', '140000', 'CN-14', 'CN-SX');
INSERT INTO `base_province` VALUES (4, '内蒙古', '1', '150000', 'CN-15', 'CN-NM');
INSERT INTO `base_province` VALUES (5, '河北', '1', '130000', 'CN-13', 'CN-HE');
INSERT INTO `base_province` VALUES (6, '上海', '2', '310000', 'CN-31', 'CN-SH');
INSERT INTO `base_province` VALUES (7, '江苏', '2', '320000', 'CN-32', 'CN-JS');
INSERT INTO `base_province` VALUES (8, '浙江', '2', '330000', 'CN-33', 'CN-ZJ');
INSERT INTO `base_province` VALUES (9, '安徽', '2', '340000', 'CN-34', 'CN-AH');
INSERT INTO `base_province` VALUES (10, '福建', '2', '350000', 'CN-35', 'CN-FJ');
INSERT INTO `base_province` VALUES (11, '江西', '2', '360000', 'CN-36', 'CN-JX');
INSERT INTO `base_province` VALUES (12, '山东', '2', '370000', 'CN-37', 'CN-SD');
INSERT INTO `base_province` VALUES (14, '台湾', '2', '710000', 'CN-71', 'CN-TW');
INSERT INTO `base_province` VALUES (15, '黑龙江', '3', '230000', 'CN-23', 'CN-HL');
INSERT INTO `base_province` VALUES (16, '吉林', '3', '220000', 'CN-22', 'CN-JL');
INSERT INTO `base_province` VALUES (17, '辽宁', '3', '210000', 'CN-21', 'CN-LN');
INSERT INTO `base_province` VALUES (18, '陕西', '7', '610000', 'CN-61', 'CN-SN');
INSERT INTO `base_province` VALUES (19, '甘肃', '7', '620000', 'CN-62', 'CN-GS');
INSERT INTO `base_province` VALUES (20, '青海', '7', '630000', 'CN-63', 'CN-QH');
INSERT INTO `base_province` VALUES (21, '宁夏', '7', '640000', 'CN-64', 'CN-NX');
INSERT INTO `base_province` VALUES (22, '新疆', '7', '650000', 'CN-65', 'CN-XJ');
INSERT INTO `base_province` VALUES (23, '河南', '4', '410000', 'CN-41', 'CN-HA');
INSERT INTO `base_province` VALUES (24, '湖北', '4', '420000', 'CN-42', 'CN-HB');
INSERT INTO `base_province` VALUES (25, '湖南', '4', '430000', 'CN-43', 'CN-HN');
INSERT INTO `base_province` VALUES (26, '广东', '5', '440000', 'CN-44', 'CN-GD');
INSERT INTO `base_province` VALUES (27, '广西', '5', '450000', 'CN-45', 'CN-GX');
INSERT INTO `base_province` VALUES (28, '海南', '5', '460000', 'CN-46', 'CN-HI');
INSERT INTO `base_province` VALUES (29, '香港', '5', '810000', 'CN-91', 'CN-HK');
INSERT INTO `base_province` VALUES (30, '澳门', '5', '820000', 'CN-92', 'CN-MO');
INSERT INTO `base_province` VALUES (31, '四川', '6', '510000', 'CN-51', 'CN-SC');
INSERT INTO `base_province` VALUES (32, '贵州', '6', '520000', 'CN-52', 'CN-GZ');
INSERT INTO `base_province` VALUES (33, '云南', '6', '530000', 'CN-53', 'CN-YN');
INSERT INTO `base_province` VALUES (13, '重庆', '6', '500000', 'CN-50', 'CN-CQ');
INSERT INTO `base_province` VALUES (34, '西藏', '6', '540000', 'CN-54', 'CN-XZ');

SET FOREIGN_KEY_CHECKS = 1;
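As a quick sanity check after importing the dump (run in the mysql client; the expected 34 rows correspond to the INSERT statements above):

mysql> SELECT COUNT(*) FROM gmall.base_province;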

  2. Create the configuration file base_province.json
  3. The content of the configuration file is as follows:
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [
                            "id",
                            "name",
                            "region_id",
                            "area_code",
                            "iso_code",
                            "iso_3166_2"
                        ],
                        "where": "id>=3",
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://hadoop102:3306/gmall"
                                ],
                                "table": [
                                    "base_province"
                                ]
                            }
                        ],
                        "password": "123456",
                        "splitPk": "",
                        "username": "root"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "bigint"
                            },
                            {
                                "name": "name",
                                "type": "string"
                            },
                            {
                                "name": "region_id",
                                "type": "string"
                            },
                            {
                                "name": "area_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_3166_2",
                                "type": "string"
                            }
                        ],
                        "compress": "gzip",
                        "defaultFS": "hdfs://hadoop102:8020",
                        "fieldDelimiter": "\t",
                        "fileName": "base_province",
                        "fileType": "text",
                        "path": "/base_province",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}
  4. Parameter description
    • Reader parameter description: refer to the official mysqlreader documentation
    • Writer parameter description: refer to the official hdfswriter documentation
    • Setting parameter description: refer to the official DataX documentation

Note:
When a value is NULL in MySQL, HDFS Writer by default writes it to HDFS as an empty string (''), while Hive's default null value storage format is "\N". Hive can only recognize a null value in HDFS when it is stored as "\N", and HDFS Writer does not provide a nullFormat parameter. The resulting problem: a NULL in MySQL is stored in HDFS as an empty string, Hive reads that empty string back as an empty string, and the two data formats are inconsistent.

There are two solutions to this problem:

  • Modify the DataX HDFS Writer source code and add logic for a custom null value storage format.
  • Specify the null value storage format as an empty string ('') when creating the table in Hive, for example:
DROP TABLE IF EXISTS base_province;
CREATE EXTERNAL TABLE base_province
(
    `id`         STRING COMMENT '编号',
    `name`       STRING COMMENT '省份名称',
    `region_id`  STRING COMMENT '地区ID',
    `area_code`  STRING COMMENT '地区编码',
    `iso_code`   STRING COMMENT '旧版ISO-3166-2编码,供可视化使用',
    `iso_3166_2` STRING COMMENT '新版IOS-3166-2编码,供可视化使用'
) COMMENT '省份表'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    NULL DEFINED AS ''
    LOCATION '/base_province/';
  5. Submit the task
    • Create the /base_province directory in HDFS before using DataX to synchronize the data, to make sure the target path already exists.
    • Enter the DataX root directory and execute the task. If the task fails at this point, the reason here was the use of MySQL 8.x: replace the jar package under datax->plugins->reader->mysqlreader->libs (mysql-connector-5...) with an 8.x version, then re-execute.
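A concrete run of this step might look as follows (assuming the configuration file was saved under the job directory, as in the later examples):

[song@hadoop102 datax]$ hadoop fs -mkdir -p /base_province
[song@hadoop102 datax]$ python bin/datax.py job/base_province.json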

  6. View the results
  7. View the HDFS files
[song@hadoop102 datax]$ hadoop fs -cat /base_province/* | zcat


2.1.2.2. QuerySQLMode of MySQLReader
  1. Write the configuration file
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://hadoop102:3306/gmall?serverTimezone=UTC&useUnicode=true&characterEncoding=utf8"
                                ],
                                "querySql": [
                                    "select id,name,region_id,area_code,iso_code,iso_3166_2 from base_province where id>=3"
                                ]
                            }
                        ],
                        "password": "000000",
                        "username": "root"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "bigint"
                            },
                            {
                                "name": "name",
                                "type": "string"
                            },
                            {
                                "name": "region_id",
                                "type": "string"
                            },
                            {
                                "name": "area_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_3166_2",
                                "type": "string"
                            }
                        ],
                        "compress": "gzip",
                        "defaultFS": "hdfs://hadoop102:8020",
                        "fieldDelimiter": "\t",
                        "fileName": "base_province",
                        "fileType": "text",
                        "path": "/base_province",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}
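Submitting this job works the same way as before; assuming the file is saved as, for illustration, job/base_province_sql.json:

[song@hadoop102 datax]$ python bin/datax.py job/base_province_sql.json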
  2. View the results

2.2. DataX parameter passing

Offline data synchronization tasks usually need to be executed repeatedly every day, so the target path on HDFS usually contains a date layer to distinguish the data synchronized each day. In other words, the target path of the daily synchronization is not fixed, and the value of the path parameter of HDFS Writer in the DataX configuration file should therefore be dynamic.

To achieve this effect, DataX's parameter-passing feature is needed. In the JSON configuration file, use ${param} to reference a parameter, and pass in the parameter value with -p"-Dparam=value" when submitting the task. A specific example follows.
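In a scheduled daily run, the date would typically be computed by a wrapper script rather than typed by hand; a minimal shell sketch (paths and file name taken from the example below) might look like:

#!/bin/bash
# dt defaults to yesterday's date, or can be passed in as the first argument
dt=${1:-$(date -d '-1 day' +%F)}
hadoop fs -mkdir -p /base_province/$dt
python /opt/model/datax/bin/datax.py -p"-Ddt=$dt" /opt/model/datax/job/base_province_date.json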

  1. Write a configuration file
[song@hadoop102 job]$ vim base_province_date.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://hadoop102:3306/gmall?serverTimezone=UTC&useUnicode=true&characterEncoding=utf8"
                                ],
                                "querySql": [
                                    "select id,name,region_id,area_code,iso_code,iso_3166_2 from base_province where id>=3"
                                ]
                            }
                        ],
                        "password": "000000",
                        "username": "root"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "bigint"
                            },
                            {
                                "name": "name",
                                "type": "string"
                            },
                            {
                                "name": "region_id",
                                "type": "string"
                            },
                            {
                                "name": "area_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_3166_2",
                                "type": "string"
                            }
                        ],
                        "compress": "gzip",
                        "defaultFS": "hdfs://hadoop102:8020",
                        "fieldDelimiter": "\t",
                        "fileName": "base_province",
                        "fileType": "text",
                        "path": "/base_province/${dt}",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}
  2. Create the target path: hadoop fs -mkdir /base_province/2023-02-04
  3. Execute the command: python bin/datax.py -p"-Ddt=2023-02-04" job/base_province_date.json
  4. View the results
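As in the earlier case, the synchronized file can be inspected directly, with the path adjusted to the dated directory:

[song@hadoop102 datax]$ hadoop fs -cat /base_province/2023-02-04/* | zcat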

2.3. The case of synchronizing HDFS data to MySQL

2.3.1. Case requirements

Synchronize the data in the /base_province directory on HDFS to the test_province table in the MySQL gmall database.

2.3.2. Requirements analysis

To implement this function, HDFSReader and MySQLWriter are needed.

  1. Write a configuration file
    Create a configuration file test_province.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "defaultFS": "hdfs://hadoop102:8020",
                        "path": "/base_province",
                        "column": [
                            "*"
                        ],
                        "fileType": "text",
                        "compress": "gzip",
                        "encoding": "UTF-8",
                        "nullFormat": "\\N",
                        "fieldDelimiter": "\t"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "username": "root",
                        "password": "000000",
                        "connection": [
                            {
                                "table": [
                                    "test_province"
                                ],
                                "jdbcUrl": "jdbc:mysql://hadoop102:3306/gmall?serverTimezone=UTC&useUnicode=true&characterEncoding=utf8"
                            }
                        ],
                        "column": [
                            "id",
                            "name",
                            "region_id",
                            "area_code",
                            "iso_code",
                            "iso_3166_2"
                        ],
                        "writeMode": "replace"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}

The test_province table structure is as follows:

/*
 Navicat Premium Data Transfer

 Source Server         : hadoop102
 Source Server Type    : MySQL
 Source Server Version : 80026
 Source Host           : hadoop102:3306
 Source Schema         : gmall

 Target Server Type    : MySQL
 Target Server Version : 80026
 File Encoding         : 65001

 Date: 04/02/2023 15:13:43
*/

SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;

-- ----------------------------
-- Table structure for test_province
-- ----------------------------
DROP TABLE IF EXISTS `test_province`;
CREATE TABLE `test_province`  (
  `id` bigint(0) NOT NULL COMMENT 'id',
  `name` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT '省名称',
  `region_id` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT '大区id',
  `area_code` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT '行政区位码',
  `iso_code` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT '国际编码',
  `iso_3166_2` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'ISO3166编码',
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;

SET FOREIGN_KEY_CHECKS = 1;


  2. Reader parameter description: refer to the official hdfsreader documentation
  3. Writer parameter description: refer to the official mysqlwriter documentation

  4. Before submitting the task, create the gmall.test_province table in MySQL (see the table structure above).

  5. Execute the following command:

[song@hadoop102 datax]$ python bin/datax.py job/test_base_province.json
  6. View the results
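Besides the DataX console summary, the result can also be checked directly in MySQL, for example:

mysql> SELECT * FROM gmall.test_province LIMIT 5;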

Origin: blog.csdn.net/prefect_start/article/details/128881183