Using ETLCloud custom rules to call a third-party jar package and convert Traditional Chinese to Simplified Chinese

Background

Previously, we saw how data can be migrated quickly from MySQL to ClickHouse with zero code, using visual drag-and-drop configuration. In a real production environment, however, some filtering and conversion is usually needed before the data is written to the target database. For example, after migrating the poetry data from MySQL to ClickHouse, we found that the original poetry data in MySQL was all in Traditional Chinese. As a result, the charts generated for statistical analysis after a direct migration to ClickHouse were displayed in Traditional Chinese, which hurt the experience for users unfamiliar with it. Today we will use the custom rule capability provided by ETLCloud to call the third-party jar package opencc4j and convert Traditional Chinese to Simplified Chinese. Specifically, we migrate the poetry database from MySQL to ClickHouse and clean and convert the data before it is written, converting the title, author and content fields from Traditional Chinese to Simplified Chinese.

Dataset description

The structure of the poetry table in the MySQL database is as follows; the table holds 311828 rows.

CREATE TABLE `poetry` (
	`id` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
	`title` VARCHAR(150) NOT NULL COLLATE 'utf8mb4_unicode_ci',
	`yunlv_rule` TEXT NOT NULL COLLATE 'utf8mb4_unicode_ci',
	`author_id` INT(10) UNSIGNED NOT NULL,
	`content` TEXT NOT NULL COLLATE 'utf8mb4_unicode_ci',
	`dynasty` VARCHAR(10) NOT NULL COMMENT '诗所属朝代(S-宋代, T-唐代)' COLLATE 'utf8mb4_unicode_ci',
	`author` VARCHAR(150) NOT NULL COLLATE 'utf8mb4_unicode_ci',
	PRIMARY KEY (`id`) USING BTREE
)
COLLATE='utf8mb4_unicode_ci'
ENGINE=InnoDB
AUTO_INCREMENT=311829;

The table creation statement in ClickHouse:

CREATE TABLE poetry.poetry (`id` Int32, `title` String, `yunlv_rule` String, `author_id` Int32, `content` String, `dynasty` String, `author` String) ENGINE = MergeTree() PRIMARY KEY id ORDER BY id SETTINGS index_granularity = 8192

Tool selection

  • ClickHouse database
  • ETLCloud V2.2 deployed with Docker
  • ETLCloud's library table input component, data cleaning conversion component, and DingTalk message component

Note: The community edition is used here, deployed with Docker because it is lightweight and quick to start: docker pull ccr.ccs.tencentyun.com/restcloud/restcloud-etl:V2.2.

Create applications and processes

Create the application first (the rules defined later belong to the application) and fill in the basic application configuration.
2023-07-15-1-CreateApp.jpg
Next, create a data process and fill in the information.
2023-07-15-2-CreateApp.jpg

Custom rules

Before actually starting the data migration, prepare the cleaning and transformation rules. When migrating into the database, you can directly configure and select the defined rules.
2023-07-15-3-RuleCategory.jpg
Enter the application configuration -> New rule category -> Add custom rule.
2023-07-15-4-RuleContent.jpg
Write the rule code that converts Traditional Chinese to Simplified Chinese. The class name is generated automatically. First import the conversion utility class ZhConverterUtil, then simply call its static method. When you are done, click "Compile and Save"; if everything is fine, the platform reports that the compilation succeeded.
2023-07-15-5-RuleCode.jpg

package cn.restcloud.etl.rule.ext;

import org.apache.commons.lang3.StringUtils;
import org.bson.Document;
import java.sql.Connection;
import cn.restcloud.framework.core.context.*;
import cn.restcloud.etl.base.IETLBaseEvent;
import cn.restcloud.etl.base.IETLBaseProcessEngine;
import cn.restcloud.framework.core.util.*;
import cn.restcloud.framework.core.util.db.rdb.*;
import cn.restcloud.etl.rule.service.ETLProcessRuleUtil;
import java.util.*;
import com.github.houbb.opencc4j.util.ZhConverterUtil;

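/**
 * indoc is a map wrapper object whose internal structure is key-value.
 * When called from a Java rule node of a process, return "0" to terminate the process or "1" for success; indoc is then the stream data and fieldId is empty.
 * When run as a field-bound rule, fieldId is the Id of the bound field, and each incoming row is passed to this method as indoc and executed once.
 * params holds the arguments entered when the rule was bound, as a JSON string.
 * Convert Traditional Chinese to Simplified Chinese
 * 2023-07-07 10:58:21
 * admin
 */
public class ETL_64a77f4d955fc70345c4041a implements IETLBaseEvent {

    @Override
    public String execute(IETLBaseProcessEngine engine, Document modelNodeDoc, Document indoc, String fieldId, String params) throws Exception {
        //List<Document> dataDocs=engine.getData(indoc); // data stream passed in from the previous node (only available when run as a Java rule node)
        Document paramsDoc = ETLProcessRuleUtil.paramsToDocument(params); // convert the rule parameters into a key-value map wrapper object
        String paramsValue = DocumentUtil.getString(paramsDoc, "参数id"); // read the custom parameter value entered when the rule was selected ("参数id" means "parameter id")
        String fieldValue = indoc.getString(fieldId); // read the value of the field this rule is bound to
        PrintUtil.o(fieldId + " value read => " + fieldValue); // PrintUtil.o() prints variables to the console log
        // Custom processing of fieldValue: Traditional Chinese -> Simplified Chinese
        String result = ZhConverterUtil.toSimple(fieldValue);
        PrintUtil.o("value after conversion => " + result);
        indoc.put(fieldId, result); // overwrite the old field value with the new one
        return "1";
    }
}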
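
One detail the rule above leaves open: if a row has no value for the bound field, indoc.getString(fieldId) returns null, and that null is handed straight to the converter. The sketch below shows one defensive way to handle this; it is not something ETLCloud requires. SafeConvert and convertOrKeep are hypothetical names, and String::toUpperCase is only a stand-in for ZhConverterUtil::toSimple so that the example runs without the opencc4j jar.

```java
import java.util.function.UnaryOperator;

final class SafeConvert {
    // Applies the converter only to non-empty values; null and "" are returned unchanged.
    static String convertOrKeep(String value, UnaryOperator<String> converter) {
        if (value == null || value.isEmpty()) {
            return value;
        }
        return converter.apply(value);
    }

    public static void main(String[] args) {
        // Stand-in converter (assumption): in the rule this would be the opencc4j call.
        UnaryOperator<String> standIn = String::toUpperCase;
        System.out.println(convertOrKeep("abc", standIn)); // prints ABC
        System.out.println(convertOrKeep(null, standIn));  // prints null
    }
}
```

In the rule itself, the conversion line could then look something like String result = SafeConvert.convertOrKeep(fieldValue, ZhConverterUtil::toSimple); assuming the helper is made available to the rule class.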

Note: We rely on the third-party jar package opencc4j to do this work, so how does ETLCloud know how to call the utility class's methods? We need to place the third-party jar in ETLCloud's deployment directory: /usr/tomcat/webapps/ROOT/WEB-INF/lib.

[root@etl ~]# docker cp /opt/opencc4j-1.8.1.jar de63b29c71d0:/usr/tomcat/webapps/ROOT/WEB-INF/lib
Successfully copied 513kB to de63b29c71d0:/usr/tomcat/webapps/ROOT/WEB-INF/lib

Then click version update, and the platform prompts the following:

Platform configuration (Successfully registered (0) Java beans, updated (2) Java beans! API upgrade results: updated or registered (0) services, (0) input parameters, (0) output codes from the Jar file!), ETL configuration (Successfully registered (0) Java beans, updated (0) Java beans! API upgrade results: updated or registered (2) services, (0) input parameters, (0) output codes from the Jar file!)

Migration practices

Next, we convert and migrate the poetry data from MySQL to ClickHouse through visual configuration.

Data source configuration

  1. Configure Source: MySQL

Select MySQL and fill in the IP:port, username and password.
2023-07-01-2-SourceMySQL.jpg
The connection test succeeds.

  2. Configure Sink: ClickHouse

As the data source, select the ClickHouse poetry database that served as the migration target in the previous articles.

Visual configuration process

After creating the process, you can click the "Process Design" button to enter the process visualization configuration page.

  1. Library table input: MySQL

In the input component on the left, select "Library Table Input", drag it to the central process drawing area, and double-click to enter the configuration stage.

Step 1: Select the MySQL data source we configured and load the existing tables in it.
2023-07-15-6-Source1.jpg
Step 2: Generate the SQL statement from the selected table.
2023-07-15-7-Source2.jpg
Step 3: The definition of each field can be read from the table, and fields can be added and deleted.
2023-07-15-8-Source3.jpg
Step 4: A data preview is run automatically from the SQL statement. This check helps ensure that the subsequent steps execute normally.
2023-07-15-9-Source4.jpg

  2. Data cleaning and conversion: opencc4j converts Traditional Chinese to Simplified Chinese

Before configuring rules for the fields, first get familiar with how opencc4j is used in back-end development.

  • Add the dependency
        <!-- opencc4j: Traditional/Simplified Chinese conversion -->
        <dependency>
            <groupId>com.github.houbb</groupId>
            <artifactId>opencc4j</artifactId>
            <version>1.8.1</version>
        </dependency>
  • Conversion code
import com.github.houbb.opencc4j.util.ZhConverterUtil;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;
import org.springframework.boot.test.context.SpringBootTest;
@SpringBootTest
class SpringbootOpencc4jApplicationTests {

    // Traditional Chinese to Simplified Chinese
    @Test
    void toSimple(){
        String original = "李白乘舟將欲行,忽聞岸上踏歌聲。|桃花潭水深千尺,不及汪倫送我情。";
        String result = ZhConverterUtil.toSimple(original);
        System.out.println(result);
        Assertions.assertEquals("李白乘舟将欲行,忽闻岸上踏歌声。|桃花潭水深千尺,不及汪伦送我情。", result);
    }

    // Simplified Chinese to Traditional Chinese
    @Test
    void toTraditional(){
        String original = "李白乘舟将欲行,忽闻岸上踏歌声。|桃花潭水深千尺,不及汪伦送我情。";
        String result = ZhConverterUtil.toTraditional(original);
        Assertions.assertEquals("李白乘舟將欲行,忽聞岸上踏歌聲。|桃花潭水深千尺,不及汪倫送我情。", result);
    }
}

In the data transformation component on the left, select "Data Cleaning Transformation", drag it to the central process drawing area, and double-click to enter the configuration stage.
2023-07-15-10-Rule.jpg
Because the title, content and author fields in the source table hold Traditional Chinese values, bind the custom rule "Convert Traditional Chinese to Simplified Chinese" to these three fields. Then go to the next step and click Save; the conversion is applied to all data records.
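
As the comment in the rule code explains, a field-bound rule is executed once per incoming row for each field it is bound to. The following is a rough model of that behaviour, not ETLCloud's actual engine code: FieldRuleDemo and applyRule are hypothetical names, a plain Map stands in for the indoc row object, and String::toUpperCase stands in for ZhConverterUtil.toSimple so the sketch runs without the opencc4j jars.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

final class FieldRuleDemo {
    // Applies the rule to each bound field of one row; all other fields are left untouched.
    static void applyRule(Map<String, Object> row, List<String> boundFields, UnaryOperator<String> rule) {
        for (String fieldId : boundFields) {
            Object value = row.get(fieldId);
            if (value instanceof String) { // skips missing fields and non-string values
                row.put(fieldId, rule.apply((String) value));
            }
        }
    }

    public static void main(String[] args) {
        // The three fields the rule is bound to in this article.
        List<String> boundFields = Arrays.asList("title", "content", "author");

        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 1);
        row.put("title", "jing ye si");
        row.put("content", "chuang qian ming yue guang");
        row.put("author", "li bai");
        row.put("yunlv_rule", "left as is");

        // Stand-in rule (assumption): upper-casing instead of the real conversion.
        applyRule(row, boundFields, String::toUpperCase);
        System.out.println(row);
    }
}
```

Only title, content and author change in the printed row; id and yunlv_rule keep their original values, which mirrors binding the rule to just those three fields.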

  3. Library table output: ClickHouse

In the output component on the left, select "Library Table Output", drag it to the central process drawing area, and double-click to enter the configuration stage.

Step 1: Select the ClickHouse data source we configured.
2023-07-15-11-CK1.jpg
Step 2: The definition of each field can be read from the table, and it supports adding, deleting fields and binding rules.
2023-07-15-12-CK2.jpg
Finally, connect the Start, Library Table Input, Data Cleaning Transformation, Library Table Output and End components in order with flow lines. This completes the visual configuration of data conversion and migration using custom rules.
2023-07-15-17-Flow.jpg

Run process

Save and run the process; you can then view the corresponding process logs and conversion logs and monitor the migration progress visually.
2023-07-15-17-Result.jpg

Problem record

  • Error during data conversion process

Problem description: An error was found in the ETLCloud log. Caused by: java.lang.ClassNotFoundException: com.github.houbb.heaven.support.instance.impl.Instances
Problem analysis: When developing with Spring Boot in IDEA using Maven, we declared only one dependency, opencc4j. Inspecting the external libraries, however, shows that it pulls in two further dependencies: heaven and nlp-common.
2023-07-15-13-Jar.jpg
Solution: Upload all three jar packages, opencc4j-1.8.1.jar, heaven-0.2.0.jar and nlp-common-0.0.5.jar, to ETLCloud's /usr/tomcat/webapps/ROOT/WEB-INF/lib directory, update the ETLCloud configuration again, and restart the ETLCloud service.
2023-07-15-14-Jar.jpg

[root@etl ~]# docker cp /opt/heaven-0.2.0.jar de63b29c71d0:/usr/tomcat/webapps/ROOT/WEB-INF/lib
Successfully copied 304kB to de63b29c71d0:/usr/tomcat/webapps/ROOT/WEB-INF/lib
[root@etl ~]# docker cp /opt/nlp-common-0.0.5.jar de63b29c71d0:/usr/tomcat/webapps/ROOT/WEB-INF/lib
Successfully copied 1.97MB to de63b29c71d0:/usr/tomcat/webapps/ROOT/WEB-INF/lib

2023-07-15-15-Update.jpg
Note: The jar packages can be found and downloaded from the Alibaba Cloud mirror repository (https://developer.aliyun.com/mvn/search), or located in the local development environment under .m2\repository\com\github\houbb
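
A ClassNotFoundException like the one above can be caught before the process runs by checking that the required classes can actually be loaded. The sketch below is one possible way to do that, under the assumption that it is executed with the same classpath as the code that calls opencc4j (inside Tomcat, that means inside the web application itself). ClasspathCheck and missingClasses are hypothetical names; the two class names checked in main are taken from the rule code and the error message in this article.

```java
import java.util.ArrayList;
import java.util.List;

final class ClasspathCheck {
    // Returns the subset of the given class names that cannot be loaded.
    static List<String> missingClasses(String... classNames) {
        List<String> missing = new ArrayList<>();
        for (String name : classNames) {
            try {
                // initialize=false: only locate the class, do not run its static initializers
                Class.forName(name, false, ClasspathCheck.class.getClassLoader());
            } catch (ClassNotFoundException | LinkageError e) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingClasses(
                "com.github.houbb.opencc4j.util.ZhConverterUtil",
                "com.github.houbb.heaven.support.instance.impl.Instances");
        System.out.println("Missing classes: " + missing);
    }
}
```

An empty list means both classes were found; any name that is printed points to a jar that still needs to be copied into WEB-INF/lib.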

Summary

This article described how to use ETLCloud's custom rule feature to clean and convert data, turning table field values from Traditional Chinese into Simplified Chinese. Two points are worth noting:

  1. Custom rules are attached to a certain process;
  2. The third-party jar package must be deployed together with all of its dependencies.

If you have any questions or any bugs are found, please feel free to contact me.
Your comments and suggestions are welcome!

Origin blog.csdn.net/u013810234/article/details/132574809