Background
Previously, we quickly migrated data from MySQL to ClickHouse through zero-code, visual, drag-and-drop configuration. In a real production environment, however, some filtering and conversion is usually needed before the data reaches the target database. For example, after the poetry data was migrated, we found that the original poetry data in MySQL was all in Traditional Chinese. As a result, the charts generated for statistical analysis after migrating directly to ClickHouse were also in Traditional Chinese, which hurt the experience for users who are not familiar with it. Today we use the custom rule capability provided by ETLCloud and call the third-party jar package opencc4j to convert Traditional Chinese to Simplified Chinese. Specifically, we migrate the poetry database from MySQL to ClickHouse, clean and convert the data before it is written to the target database, and convert fields such as title, author, and content from Traditional Chinese to Simplified Chinese.
Dataset description
The poetry table in the MySQL poetry database has the following structure and holds 311828 rows:
CREATE TABLE `poetry` (
`id` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
`title` VARCHAR(150) NOT NULL COLLATE 'utf8mb4_unicode_ci',
`yunlv_rule` TEXT NOT NULL COLLATE 'utf8mb4_unicode_ci',
`author_id` INT(10) UNSIGNED NOT NULL,
`content` TEXT NOT NULL COLLATE 'utf8mb4_unicode_ci',
`dynasty` VARCHAR(10) NOT NULL COMMENT '诗所属朝代(S-宋代, T-唐代)' COLLATE 'utf8mb4_unicode_ci',
`author` VARCHAR(150) NOT NULL COLLATE 'utf8mb4_unicode_ci',
PRIMARY KEY (`id`) USING BTREE
)
COLLATE='utf8mb4_unicode_ci'
ENGINE=InnoDB
AUTO_INCREMENT=311829;
The table creation statement in ClickHouse:
CREATE TABLE poetry.poetry (`id` Int32, `title` String, `yunlv_rule` String, `author_id` Int32, `content` String, `dynasty` String, `author` String) ENGINE = MergeTree() PRIMARY KEY id ORDER BY id SETTINGS index_granularity = 8192
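The ClickHouse table stores the unsigned MySQL columns id and author_id as Int32. The following is a minimal sketch of checking that this narrower signed type is safe for the current data, given that AUTO_INCREMENT=311829 bounds the largest id. ColumnTypeCheck and its methods are hypothetical names used only for illustration. UInt32 appears in the sketch as the range-preserving alternative, not as something the original setup uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: not part of ETLCloud, MySQL, or ClickHouse tooling.
final class ColumnTypeCheck {

    // Illustrative subset of MySQL -> ClickHouse type choices for the poetry table.
    static Map<String, String> suggestedTypes() {
        Map<String, String> types = new LinkedHashMap<>();
        types.put("INT UNSIGNED", "UInt32"); // keeps the full unsigned range
        types.put("VARCHAR", "String");
        types.put("TEXT", "String");
        return types;
    }

    // True when the value fits into a signed 32-bit column such as Int32.
    static boolean fitsInInt32(long value) {
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long largestId = 311_828L;            // bounded by AUTO_INCREMENT=311829
        long unsignedIntMax = 4_294_967_295L; // upper bound of MySQL INT UNSIGNED
        System.out.println("current ids fit Int32: " + fitsInInt32(largestId));
        System.out.println("full unsigned range fits Int32: " + fitsInInt32(unsignedIntMax));
        System.out.println(suggestedTypes());
    }
}
```

For this dataset the Int32 choice is sufficient. The check mainly matters if the same table definition is reused for a source whose ids can grow past the signed 32-bit limit.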
Tool selection
- ClickHouse database
- ETLCloud V2.2, deployed with Docker
- ETLCloud's library table input component, data cleaning conversion component, and DingTalk message component
Note: The community edition is used here, deployed with Docker, which is lightweight and quick to get started with: docker pull ccr.ccs.tencentyun.com/restcloud/restcloud-etl:V2.2
Create applications and processes
Create the application first, because the rules defined later belong to the application, and fill in the basic application configuration.
Next, create a data process and fill in the information.
Custom rules
Before starting the actual data migration, prepare the cleaning and conversion rules. They can then be selected directly when configuring the migration into the database.
Enter the application configuration -> New rule category -> Add custom rule.
Write the rule code that converts Traditional Chinese to Simplified Chinese. The class name is generated automatically. First import the conversion utility class ZhConverterUtil, then call its static method. After writing the code, click "Compile and Save"; if everything is fine, the platform reports that the compilation succeeded.
package cn.restcloud.etl.rule.ext;

import org.apache.commons.lang3.StringUtils;
import org.bson.Document;
import java.sql.Connection;
import cn.restcloud.framework.core.context.*;
import cn.restcloud.etl.base.IETLBaseEvent;
import cn.restcloud.etl.base.IETLBaseProcessEngine;
import cn.restcloud.framework.core.util.*;
import cn.restcloud.framework.core.util.db.rdb.*;
import cn.restcloud.etl.rule.service.ETLProcessRuleUtil;
import java.util.*;
import com.github.houbb.opencc4j.util.ZhConverterUtil;

/**
 * indoc is a map wrapper object whose internal structure is key-value.
 * When called by a Java rule node of a process, return "0" to terminate the process and "1" for success; in that case indoc is the stream data and fieldId is empty.
 * When run as a field-bound rule, fieldId is the ID of the bound field, and each incoming row is passed to this method once as the indoc object.
 * params holds the arguments entered when the rule was bound, as a JSON string.
 * Convert Traditional Chinese to Simplified Chinese
 * 2023-07-07 10:58:21
 * admin
 */
public class ETL_64a77f4d955fc70345c4041a implements IETLBaseEvent {

    @Override
    public String execute(IETLBaseProcessEngine engine, Document modelNodeDoc, Document indoc, String fieldId, String params) throws Exception {
        //List<Document> dataDocs=engine.getData(indoc); // data stream from the previous node (only available when run as a Java rule node)
        Document paramsDoc = ETLProcessRuleUtil.paramsToDocument(params); // convert the rule parameters into a key-value map wrapper
        String paramsValue = DocumentUtil.getString(paramsDoc, "参数id"); // read the custom parameter value entered when the rule was selected ("参数id" is the template's placeholder for the parameter ID)
        String fieldValue = indoc.getString(fieldId); // read the value of the field the rule is bound to
        PrintUtil.o(fieldId + " value read => " + fieldValue); // PrintUtil.o() prints variables to the console log
        // Custom processing of fieldValue: Traditional Chinese to Simplified Chinese
        String result = ZhConverterUtil.toSimple(fieldValue);
        PrintUtil.o("value after conversion => " + result);
        indoc.put(fieldId, result); // overwrite the old field value with the new one
        return "1";
    }
}
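To make the field-bound contract of the rule easier to see outside the platform, here is a minimal sketch that mirrors it with plain JDK types. FieldRuleSketch is a hypothetical stand-in: it uses a Map instead of org.bson.Document and takes the converter as a parameter instead of calling ZhConverterUtil directly, so it runs without ETLCloud or opencc4j on the classpath. The type check is an added assumption, not something the generated rule does.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for the generated rule class, for illustration only.
final class FieldRuleSketch {

    // Converts one bound field of one row in place and returns "1" (success in the rule template).
    static String execute(Map<String, Object> row, String fieldId, UnaryOperator<String> converter) {
        Object value = row.get(fieldId);
        if (value instanceof String) { // assumption: skip missing or non-text values
            row.put(fieldId, converter.apply((String) value));
        }
        return "1";
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("title", "\u9001\u5225"); // a two-character title containing the Traditional form U+5225
        // Toy converter for the demo: maps U+5225 to its Simplified form U+522B.
        // In the real rule this role is played by ZhConverterUtil.toSimple.
        UnaryOperator<String> toy = s -> s.replace('\u5225', '\u522B');
        String status = execute(row, "title", toy);
        System.out.println(status + " " + row.get("title"));
    }
}
```

Because the platform calls the rule once per row and per bound field, the method only needs to touch the field named by fieldId and leave every other field alone.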
Note: We rely on a third-party jar package, opencc4j, to do this work, so how does ETLCloud know how to call the custom utility class? We need to put the third-party jar in the ETLCloud deployment directory: /usr/tomcat/webapps/ROOT/WEB-INF/lib.
[root@etl ~]# docker cp /opt/opencc4j-1.8.1.jar de63b29c71d0:/usr/tomcat/webapps/ROOT/WEB-INF/lib
Successfully copied 513kB to de63b29c71d0:/usr/tomcat/webapps/ROOT/WEB-INF/lib
Then click version update, and the platform prompts the following:
Platform configuration (Successfully registered (0) Java beans, updated (2) Java beans! API upgrade result: updated or registered (0) services, (0) input parameters, (0) output codes from the Jar file!), ETL configuration (Successfully registered (0) Java beans, updated (0) Java beans! API upgrade result: updated or registered (2) services, (0) input parameters, (0) output codes from the Jar file!)
Migration practices
Next, we convert and migrate the poetry data from MySQL to ClickHouse through visual configuration.
Data source configuration
- Configure Source: MySQL
Select MySQL and fill in the IP:port, username, and password.
The connection test succeeds.
- Configure Sink: ClickHouse
Select the ClickHouse poetry database that was the migration target in the previous article as the data source.
Visual configuration process
After creating the process, you can click the "Process Design" button to enter the process visualization configuration page.
- Library table input: MySQL
In the input component on the left, select "Library Table Input", drag it to the central process drawing area, and double-click to enter the configuration stage.
Step 1: Select the MySQL data source we configured and load its existing tables.
Step 2: Generate the SQL statement based on the selected table.
Step 3: The definition of each field can be read from the table, and fields can be added and deleted.
Step 4: A data preview runs automatically based on the SQL statement. This check helps ensure that the subsequent steps execute normally.
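As an illustration of Step 2, here is a minimal sketch of what generating a SELECT statement from a table name and its field list can look like. SelectBuilder is a hypothetical helper; ETLCloud builds its statement internally and may format it differently.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
final class SelectBuilder {

    // Builds a plain SELECT over the given columns, quoting identifiers with MySQL backticks.
    static String build(String table, List<String> columns) {
        if (columns.isEmpty()) {
            throw new IllegalArgumentException("at least one column is required");
        }
        StringBuilder sql = new StringBuilder("SELECT ");
        for (int i = 0; i < columns.size(); i++) {
            if (i > 0) {
                sql.append(", ");
            }
            sql.append('`').append(columns.get(i)).append('`');
        }
        return sql.append(" FROM `").append(table).append('`').toString();
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList(
            "id", "title", "yunlv_rule", "author_id", "content", "dynasty", "author");
        System.out.println(build("poetry", columns));
    }
}
```

Listing the columns explicitly, rather than using SELECT *, keeps the statement aligned with the field definitions edited in Step 3.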
- Data cleaning and conversion: opencc4j converts Traditional Chinese to Simplified Chinese
Before configuring rules for fields, first get familiar with how opencc4j is used in back-end development.
- Introduce dependencies
<!-- opencc4j: conversion between Traditional and Simplified Chinese -->
<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>opencc4j</artifactId>
    <version>1.8.1</version>
</dependency>
- Conversion code
import com.github.houbb.opencc4j.util.ZhConverterUtil;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;
import org.springframework.boot.test.context.SpringBootTest;

@SpringBootTest
class SpringbootOpencc4jApplicationTests {

    // Traditional Chinese to Simplified Chinese
    @Test
    void toSimple() {
        String original = "李白乘舟將欲行,忽聞岸上踏歌聲。|桃花潭水深千尺,不及汪倫送我情。";
        String result = ZhConverterUtil.toSimple(original);
        System.out.println(result);
        Assertions.assertEquals("李白乘舟将欲行,忽闻岸上踏歌声。|桃花潭水深千尺,不及汪伦送我情。", result);
    }

    // Simplified Chinese to Traditional Chinese
    @Test
    void toTraditional() {
        String original = "李白乘舟将欲行,忽闻岸上踏歌声。|桃花潭水深千尺,不及汪伦送我情。";
        String result = ZhConverterUtil.toTraditional(original);
        Assertions.assertEquals("李白乘舟將欲行,忽聞岸上踏歌聲。|桃花潭水深千尺,不及汪倫送我情。", result);
    }
}
In the data transformation component on the left, select "Data Cleaning Transformation", drag it to the central process drawing area, and double-click to enter the configuration stage.
Because the values of the three fields title, content, and author in the source table are in Traditional Chinese, set the custom rule "Convert Traditional Chinese to Simplified Chinese" on these three fields. Then click Save in the next step so that all data records are converted.
- Library table output: ClickHouse
In the output component on the left, select "Library Table Output", drag it to the central process drawing area, and double-click to enter the configuration stage.
Step 1: Select the ClickHouse data source we configured.
Step 2: The definition of each field can be read from the table, and it supports adding, deleting fields and binding rules.
Finally, use process lines to connect the start, library table input, data cleaning conversion, library table output, and end components in turn. This completes the visual configuration of data conversion and migration with custom rules.
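Conceptually, the connected flow reads rows, applies the bound rule to the three text fields of each row, and hands the cleaned rows to the output. The following is a minimal in-memory sketch of that idea. PipelineSketch is a hypothetical model that uses plain collections, not ETLCloud's execution engine, and the converter is injected so that the sketch runs without opencc4j.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical in-memory model of the flow: input -> cleaning/conversion -> output.
final class PipelineSketch {

    // Fields bound to the conversion rule in this article.
    static final List<String> RULE_FIELDS = Arrays.asList("title", "content", "author");

    // Applies the converter to the bound text fields of every row and returns new rows.
    static List<Map<String, Object>> run(List<Map<String, Object>> input,
                                         UnaryOperator<String> converter) {
        List<Map<String, Object>> output = new ArrayList<>();
        for (Map<String, Object> row : input) {
            Map<String, Object> cleaned = new LinkedHashMap<>(row); // copy, so the source row stays untouched
            for (String field : RULE_FIELDS) {
                Object value = cleaned.get(field);
                if (value instanceof String) {
                    cleaned.put(field, converter.apply((String) value));
                }
            }
            output.add(cleaned);
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 1);
        row.put("title", "sample title");
        row.put("dynasty", "T");
        // Toy converter so the demo runs on its own; the real flow would convert with opencc4j.
        System.out.println(run(Collections.singletonList(row), String::toUpperCase));
    }
}
```

Fields that are not bound to the rule, such as id and dynasty, pass through unchanged, which matches how the rule is set only on title, content, and author.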
Run process
Save the process and run the process; you can then view the corresponding process logs and conversion logs, and visually monitor the migration progress.
Problem record
- Error during data conversion process
Problem description: An error was found in the ETLCloud log: Caused by: java.lang.ClassNotFoundException: com.github.houbb.heaven.support.instance.impl.Instances
Problem analysis: When developing with SpringBoot in IDEA with Maven, we declared only one dependency, opencc4j. Looking at the external libraries, however, shows two further dependencies that Maven pulled in transitively: heaven and nlp-common.
Solution: Upload all three jar packages, opencc4j-1.8.1.jar, heaven-0.2.0.jar, and nlp-common-0.0.5.jar, to the ETLCloud directory /usr/tomcat/webapps/ROOT/WEB-INF/lib, update the ETLCloud configuration again, and restart the ETLCloud service.
[root@etl ~]# docker cp /opt/heaven-0.2.0.jar de63b29c71d0:/usr/tomcat/webapps/ROOT/WEB-INF/lib
Successfully copied 304kB to de63b29c71d0:/usr/tomcat/webapps/ROOT/WEB-INF/lib
[root@etl ~]# docker cp /opt/nlp-common-0.0.5.jar de63b29c71d0:/usr/tomcat/webapps/ROOT/WEB-INF/lib
Successfully copied 1.97MB to de63b29c71d0:/usr/tomcat/webapps/ROOT/WEB-INF/lib
Note: The jar packages can be downloaded from the Alibaba Cloud mirror repository: https://developer.aliyun.com/mvn/search, or found in the .m2\repository\com\github\houbb directory of the local development environment.
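A quick way to catch this kind of problem before running a process is to check whether the required classes can be loaded at all. The following is a minimal sketch of such a check. ClasspathCheck is a hypothetical diagnostic, and it is only meaningful when run with the same classpath as the ETL service. The two class names in main are the ones that appear in this article.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical diagnostic, for illustration only.
final class ClasspathCheck {

    // Returns the class names that cannot be loaded, without initializing the ones that can.
    static List<String> missing(List<String> classNames) {
        List<String> notFound = new ArrayList<>();
        ClassLoader loader = ClasspathCheck.class.getClassLoader();
        for (String name : classNames) {
            try {
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                notFound.add(name);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        List<String> required = Arrays.asList(
            "com.github.houbb.opencc4j.util.ZhConverterUtil",
            "com.github.houbb.heaven.support.instance.impl.Instances");
        System.out.println("Missing classes: " + missing(required));
    }
}
```

An empty list means every listed class is visible to the class loader; any name that is printed points to a jar that still has to be copied into the lib directory.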
Summary
The above describes how to use ETLCloud's custom rule feature to clean and convert data, converting table field values from Traditional Chinese to Simplified Chinese. Two points deserve attention:
- Custom rules are attached to a specific application;
- All third-party jar dependencies must be present, including transitive ones.
If you have any questions or any bugs are found, please feel free to contact me.
Your comments and suggestions are welcome!