DataX usage: synchronizing HDFS data to MySQL, and DataX optimization

4. Use of DataX

4.3 Synchronize HDFS data to MySQL case

  Case requirement: synchronize the data in the /base_province directory on HDFS to the test_province table in the MySQL gmall database.
  Requirement analysis: implementing this requires the HDFSReader and MySQLWriter plugins.

4.3.1 Writing configuration files

4.3.1.1 Create configuration file test_province.json

[summer@hadoop102 job]$ vim test_province.json


4.3.1.2 The content of the configuration file is as follows


{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "defaultFS": "hdfs://hadoop102:8020",
                        "path": "/base_province",
                        "column": [
                            "*"
                        ],
                        "fileType": "text",
                        "compress": "gzip",
                        "encoding": "UTF-8",
                        "nullFormat": "\\N",
                        "fieldDelimiter": "\t"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "username": "root",
                        "password": "******",
                        "connection": [
                            {
                                "table": [
                                    "test_province"
                                ],
                                "jdbcUrl": "jdbc:mysql://hadoop102:3306/gmall?useUnicode=true&characterEncoding=utf-8"
                            }
                        ],
                        "column": [
                            "id",
                            "name",
                            "region_id",
                            "area_code",
                            "iso_code",
                            "iso_3166_2"
                        ],
                        "writeMode": "replace"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}

4.3.2 Configuration file description

4.3.2.1 Reader parameter description

The HDFSReader parameters used in the configuration above are:

Parameter        Description
defaultFS        HDFS NameNode address, e.g. hdfs://hadoop102:8020
path             Path of the files to read on HDFS
column           Columns to read; "*" reads every column as a string
fileType         Format of the source files (text here)
compress         Compression codec of the source files (gzip here)
encoding         File encoding
nullFormat       Text that represents null in the files; Hive writes null as \N, hence "\\N"
fieldDelimiter   Field delimiter of each line ("\t" here)

4.3.2.2 Writer parameter description

The MySQLWriter parameters used in the configuration above are:

Parameter    Description
username     MySQL user name
password     MySQL password
jdbcUrl      JDBC URL of the target database
table        Target table(s)
column       Target columns, in the same order as the fields read by the reader
writeMode    How records are written: insert (insert into), replace (replace into) or update (ON DUPLICATE KEY UPDATE)

There is no one-size-fits-all writeMode; the choice depends on the business scenario. insert is the usual choice for collected (log) data, while ON DUPLICATE KEY UPDATE is typically used for business data coming from MySQL.
insert into: if the table has no primary key, inserting two identical records simply stores both copies; if it does have a primary key, inserting a duplicate record raises an error.
replace into: when there is a primary-key conflict, the entire existing row is deleted and the new row is inserted in its place.
ON DUPLICATE KEY UPDATE: the existing row is not deleted; only the columns whose values have changed are updated.
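
To make the difference concrete, here is a minimal SQL sketch of the three strategies, using a hypothetical table t with a primary key on id (not part of the case above); the behaviour shown is standard MySQL semantics.

-- insert into: with a primary key, a duplicate key raises an error;
-- without one, both identical rows are simply stored.
INSERT INTO t (id, name) VALUES (1, 'Beijing');

-- replace into: on a primary-key conflict the old row is deleted entirely
-- and the new row is inserted in its place.
REPLACE INTO t (id, name) VALUES (1, 'Peking');

-- ON DUPLICATE KEY UPDATE: the existing row is kept and only the listed
-- columns are updated when the key already exists.
INSERT INTO t (id, name) VALUES (1, 'Beijing')
ON DUPLICATE KEY UPDATE name = VALUES(name);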

4.3.3 Submitting tasks

4.3.3.1 Create gmall.test_province table in MySQL


DROP TABLE IF EXISTS `test_province`;
CREATE TABLE `test_province`  (
  `id` bigint(20) NOT NULL,
  `name` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `region_id` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `area_code` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `iso_code` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `iso_3166_2` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;

4.3.3.2 Execute the following command

[summer@hadoop102 datax]$ python bin/datax.py job/test_province.json 


4.3.4 View Results

4.3.4.1 DataX print log


2022-11-01 15:22:03.041 [job-0] INFO  JobContainer - 
Job start time                  : 2022-11-01 15:21:50
Job end time                    : 2022-11-01 15:22:03
Total elapsed time              :                 12s
Average job traffic             :              200B/s
Record write speed              :              9rec/s
Total records read              :                  96
Total read/write failures       :                   0

Because there are three files in the /base_province directory, DataX loops over all of them; each file contains 32 records, so 96 records are read in total.

4.3.4.2 View MySQL target table data

Because writeMode is set to replace, duplicate records are deduplicated on the primary key when they are written to the target table.
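
A quick sanity check (assuming the gmall.test_province table created above): since replace deduplicates on the primary key, if the three source files contain the same 32 provinces, the query below would be expected to return 32 rather than the 96 records read.

SELECT COUNT(*) FROM gmall.test_province;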

5. DataX optimization

5.1 Speed control

  DataX 3.0 provides three throttling modes: channel (concurrency), record rate, and byte rate. You can use them to control the job speed so that the job runs at the fastest synchronization speed the database can bear.
  The key tuning parameters are as follows:

Parameter                               Description
job.setting.speed.channel               Number of concurrent channels
job.setting.speed.record                Global record rate limit for the job (records per second)
job.setting.speed.byte                  Global byte rate limit for the job (bytes per second)
core.transport.channel.speed.record     Per-channel record rate limit, default 10000 (records/s)
core.transport.channel.speed.byte       Per-channel byte rate limit, default 1024*1024 (1 MB/s)

Note:
1. If the global record rate limit is configured, the per-channel record rate limit must also be configured.
2. If the global byte rate limit is configured, the per-channel byte rate limit must also be configured.
3. If both the global record rate limit and the global byte rate limit are configured, the channel concurrency parameter is ignored, because the actual channel concurrency is then derived by calculation.
The formula is:
min(global byte limit / per-channel byte limit, global record limit / per-channel record limit)
Configuration example:

{
    "core": {
        "transport": {
            "channel": {
                "speed": {
                    "byte": 1048576 // per-channel byte limit: 1 MB/s
                }
            }
        }
    },
    "job": {
        "setting": {
            "speed": {
                "byte": 5242880 // total byte limit: 5 MB/s
            }
        },
        ...
    }
}
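
With the limits above, and assuming no record limit is set, the effective channel concurrency works out to 5242880 / 1048576 = 5, so it is the byte limits, rather than an explicit channel setting, that determine the parallelism.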

5.2 Memory adjustment

  When the number of concurrent channels in a DataX job is increased, memory usage rises significantly, because DataX, acting as a data exchange channel, buffers more data in memory: each Channel holds a Buffer for the data passing through it, and some Readers and Writers maintain Buffers of their own. To avoid errors such as OOM, the JVM heap memory needs to be increased.
  A heap of 4 GB or 8 GB is recommended, and can be adjusted according to the actual situation.
  There are two ways to set the JVM -Xms/-Xmx parameters: edit the datax.py script directly, or pass the corresponding parameters when starting the job, as follows:

python datax/bin/datax.py --jvm="-Xms8G -Xmx8G" /path/to/your/job.json
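
If you choose to edit the script instead, the heap options typically live in a JVM default string near the top of bin/datax.py (in recent DataX releases a DEFAULT_JVM variable containing -Xms1g -Xmx1g); changing the -Xms/-Xmx values there applies them to every job without extra command-line flags.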
