4. Use of DataX
4.3 Case: synchronizing HDFS data to MySQL
Case requirements: Synchronize the data in the /base_province directory on HDFS to the test_province table in the MySQL gmall database.
Requirement analysis: implementing this requires the HDFSReader and MySQLWriter plugins.
4.3.1 Writing configuration files
4.3.1.1 Create configuration file test_province.json
[summer@hadoop102 job]$ vim test_province.json
4.3.1.2 The content of the configuration file is as follows
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "defaultFS": "hdfs://hadoop102:8020",
                        "path": "/base_province",
                        "column": [
                            "*"
                        ],
                        "fileType": "text",
                        "compress": "gzip",
                        "encoding": "UTF-8",
                        "nullFormat": "\\N",
                        "fieldDelimiter": "\t"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "username": "root",
                        "password": "******",
                        "connection": [
                            {
                                "table": [
                                    "test_province"
                                ],
                                "jdbcUrl": "jdbc:mysql://hadoop102:3306/gmall?useUnicode=true&characterEncoding=utf-8"
                            }
                        ],
                        "column": [
                            "id",
                            "name",
                            "region_id",
                            "area_code",
                            "iso_code",
                            "iso_3166_2"
                        ],
                        "writeMode": "replace"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}
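A stray trailing comma or unbalanced brace in the job file will make DataX fail at startup, so it can be worth sanity-checking the JSON before submitting. A minimal sketch with Python's standard library (the snippet below holds a fragment with the same shape as the job file in a string so it is self-contained; point json.load at job/test_province.json to check the real file):

```python
import json

# A fragment shaped like the job file above; json.loads raises
# json.JSONDecodeError on syntax errors such as trailing commas.
snippet = '''
{
  "job": {
    "content": [
      {"reader": {"name": "hdfsreader"}, "writer": {"name": "mysqlwriter"}}
    ],
    "setting": {"speed": {"channel": 1}}
  }
}
'''
job = json.loads(snippet)

# A few structural sanity checks on the DataX job layout.
content = job["job"]["content"][0]
assert content["reader"]["name"] == "hdfsreader"
assert content["writer"]["name"] == "mysqlwriter"
print("job file OK")
```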
4.3.2 Configuration file description
4.3.2.1 Reader parameter description
4.3.2.2 Writer parameter description
There is no one-size-fits-all choice of writeMode; it depends on the specific business scenario.
insert into: if the table has a primary key, inserting a duplicate row raises an error; without a primary key, both identical rows are kept.
replace into: when the primary key conflicts, the entire existing row is deleted and the new row is inserted.
ON DUPLICATE KEY UPDATE: the existing row is not deleted; when a value changes, only the listed columns are updated in place.
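The three behaviors can be seen in a small experiment. The text assumes MySQL; the sketch below uses SQLite via Python's sqlite3 module, whose REPLACE INTO and ON CONFLICT ... DO UPDATE have the same delete-then-insert and update-in-place semantics, so treat it as an illustration rather than a MySQL transcript:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE test_province (id INTEGER PRIMARY KEY, name TEXT, area_code TEXT)")
cur.execute("INSERT INTO test_province VALUES (1, 'Beijing', '110000')")

# insert: a duplicate primary key is an error.
try:
    cur.execute("INSERT INTO test_province VALUES (1, 'Beijing', '110000')")
except sqlite3.IntegrityError as e:
    print("insert duplicate ->", e)

# replace: the old row is deleted entirely, then the new row is inserted.
cur.execute("REPLACE INTO test_province VALUES (1, 'Beijing', '110105')")
print(cur.execute("SELECT * FROM test_province").fetchall())  # [(1, 'Beijing', '110105')]

# upsert: the row is kept; only the listed column is updated.
cur.execute("""INSERT INTO test_province VALUES (1, 'Peking', '110000')
               ON CONFLICT(id) DO UPDATE SET name = excluded.name""")
print(cur.execute("SELECT * FROM test_province").fetchall())  # [(1, 'Peking', '110105')]
```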
4.3.3 Submitting tasks
4.3.3.1 Create gmall.test_province table in MySQL
DROP TABLE IF EXISTS `test_province`;
CREATE TABLE `test_province` (
  `id` bigint(20) NOT NULL,
  `name` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `region_id` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `area_code` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `iso_code` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `iso_3166_2` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;
4.3.3.2 Execute the following command
[summer@hadoop102 datax]$ python bin/datax.py job/test_province.json
4.3.4 View Results
4.3.4.1 DataX print log
2022-11-01 15:22:03.041 [job-0] INFO JobContainer -
Task start time     : 2022-11-01 15:21:50
Task end time       : 2022-11-01 15:22:03
Total elapsed time  : 12s
Average traffic     : 200B/s
Record write speed  : 9rec/s
Total records read  : 96
Total failures      : 0
Because the /base_province directory contains three files, the reader loops over all of them; each file holds 32 rows, for a total of 96 rows.
4.3.4.2 View MySQL target table data
Because writeMode is replace, rows with duplicate primary keys are deduplicated in the target table.
5. DataX optimization
5.1 Speed control
DataX 3.0 provides three flow-control modes: channel (concurrency), record rate, and byte rate. You can use them to control the job speed so that it reaches the best synchronization rate within what the database can bear.
The key optimization parameters are as follows:
Parameter | Description |
---|---|
job.setting.speed.channel | Number of concurrent channels |
job.setting.speed.record | Total record-rate limit (records/s) |
job.setting.speed.byte | Total byte-rate limit (bytes/s) |
core.transport.channel.speed.record | Record-rate limit of a single channel; default 10000 records/s |
core.transport.channel.speed.byte | Byte-rate limit of a single channel; default 1024*1024 bytes (1 MB/s) |
Note:
1. If the total record-rate limit is configured, the record-rate limit of a single channel must also be configured.
2. If the total byte-rate limit is configured, the byte-rate limit of a single channel must also be configured.
3. If the total record-rate limit or the total byte-rate limit is configured, the channel concurrency parameter is ignored, because the actual channel concurrency is then obtained by calculation:

min(total byte-rate limit / single-channel byte-rate limit, total record-rate limit / single-channel record-rate limit)

Configuration example:
{
    "core": {
        "transport": {
            "channel": {
                "speed": {
                    "byte": 1048576 // single-channel byte-rate limit: 1 MB/s
                }
            }
        }
    },
    "job": {
        "setting": {
            "speed": {
                "byte": 5242880 // total byte-rate limit: 5 MB/s
            }
        },
        ...
    }
}
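Applying the formula to the example above: only the byte limits are set, so the effective concurrency is 5242880 / 1048576 = 5 channels. A quick sketch of the computation (the helper name and defaults mirror the table above and are made up for illustration):

```python
def actual_channels(total_byte=None, channel_byte=1048576,
                    total_record=None, channel_record=10000):
    """Effective channel concurrency, i.e.
    min(total_byte / channel_byte, total_record / channel_record)
    over whichever total limits are configured."""
    limits = []
    if total_byte is not None:
        limits.append(total_byte // channel_byte)
    if total_record is not None:
        limits.append(total_record // channel_record)
    return min(limits) if limits else None

# The configuration above: total byte limit 5 MB/s, per-channel 1 MB/s.
print(actual_channels(total_byte=5242880))  # 5

# With a total record limit of 30000 records/s as well, min(5, 3) = 3.
print(actual_channels(total_byte=5242880, total_record=30000))  # 3
```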
5.2 Memory adjustment
When the number of concurrent channels in a DataX job is increased, memory usage rises significantly, because DataX, as a data-exchange channel, buffers more data in memory: each Channel holds a Buffer for data in transit, and some Readers and Writers hold Buffers of their own. To avoid OOM and similar errors, the JVM heap needs to be enlarged.
It is recommended to set the heap to 4G or 8G, adjusted to the actual situation.
There are two ways to set the JVM -Xms/-Xmx parameters: modify the datax.py script directly, or pass them on the command line when starting the job, as follows:
python datax/bin/datax.py --jvm="-Xms8G -Xmx8G" /path/to/your/job.json