Alibaba Cloud big data tools: migrating RDS data to MaxCompute with dynamic partitioning

Currently, many users store their business data in traditional relational databases, such as Alibaba Cloud's RDS, and run their business reads and writes there. When the data volume becomes very large, a traditional relational database starts to struggle, so the MySQL data is often migrated to the [Big Data Computing Service (MaxCompute, formerly ODPS)](https://www.aliyun.com/product/odps?spm=5176.doc27800.765261.309.dcjpg2), whose storage and computing power is used to run the various queries and calculations, after which the results are returned to RDS.
In general, business data is distinguished by date, and some static data may be distinguished by region. In MaxCompute, data can be stored by partition, which can be simply understood as placing the data in different subdirectories, with the subdirectories named after dates. When migrating RDS data to MaxCompute, many users want partitions to be created automatically, so that data classified by date in RDS is dynamically written into the corresponding MaxCompute partitions. The synchronization tool used here is MaxCompute's companion product, the [Big Data Development Kit](https://data.aliyun.com/product/ide?spm=5176.7741945.765261.313.TQqfkK). The following examples illustrate several ways to partition automatically when synchronizing from RDS to MaxCompute.
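To make the partition concept concrete, here is a minimal MaxCompute SQL sketch (the table sales_record and its columns are invented purely for illustration):

-- A partitioned MaxCompute table: each distinct value of the partition
-- column ds behaves like a separate subdirectory under the table.
CREATE TABLE IF NOT EXISTS sales_record (
    order_id STRING,
    amount DOUBLE
)
PARTITIONED BY (ds STRING);
-- Rows written with ds='20170829' land only in that partition, and a query
-- filtering on ds scans only that partition.
SELECT order_id, amount FROM sales_record WHERE ds = '20170829';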
1. Synchronize RDS data to MaxCompute every day and automatically create a partition based on that day's date.
Here we use the Data Integration feature of the Big Data Development Kit, configured through the interface.
As shown in the figure, set the partition format on the MaxCompute side.
(Screenshot: MaxCompute partition configuration in Data Integration)
Generally, the default here is the system's built-in time parameter ${bdp.system.bizdate}, in the format yyyymmdd. That is, when the scheduled task runs, this partition value is automatically replaced with **the day before the task execution date**. This is convenient because the business data processed in a run is usually the previous day's data; that date is also called the business date.
As shown
(Screenshot: partition value set to ${bdp.system.bizdate})
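For reference, the destination on the MaxCompute side is an ordinary partitioned table. A minimal sketch (the table name emp_test is a placeholder; the columns follow the example table used later in this article) could look like the following, with Data Integration writing each day's rows into the partition date_test=${bdp.system.bizdate}:

-- Destination table sketch: date_test is the partition column that the
-- sync task fills with ${bdp.system.bizdate}, e.g. date_test=20170829.
CREATE TABLE IF NOT EXISTS emp_test (
    date_time STRING,
    name STRING,
    age BIGINT,
    sal DOUBLE
)
PARTITIONED BY (date_test STRING);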
 
If the user wants to use the task's actual run date as the partition value, this parameter needs to be customized. The method is shown in the figure; you can also refer to the documentation:
https://help.aliyun.com/document_detail/30281.html?spm=5176.product30254.6.604.SDunjF
The format of custom parameters is very flexible: the base date is the run date, and the user can freely choose the offset and format. The variable parameter expressions for reference are as follows (a worked example follows the notes below):
N years later: $[add_months(yyyymmdd,12*N)]
N years earlier: $[add_months(yyyymmdd,-12*N)]
N months later: $[add_months(yyyymmdd,N)]
N months earlier: $[add_months(yyyymmdd,-N)]
N weeks later: $[yyyymmdd+7*N]
N weeks earlier: $[yyyymmdd-7*N]
N days later: $[yyyymmdd+N]
N days earlier: $[yyyymmdd-N]
N hours later: $[hh24miss+N/24]
N hours earlier: $[hh24miss-N/24]
N minutes later: $[hh24miss+N/24/60]
N minutes earlier: $[hh24miss-N/24/60]
Notice:
Use square brackets [] to edit the calculation expression of a custom variable parameter, for example key1=$[yyyy-mm-dd].
By default, custom variable parameters are calculated in units of days. For example, $[hh24miss-N/24/60] represents the result of (yyyymmddhh24miss - (N/24/60 * 1 day)), taking the hour, minute, and second in the format hh24miss.
Calculations using add_months are in units of months. For example, $[add_months(yyyymmdd,12*N)-M/24/60] represents the result of (yyyymmddhh24miss - (12*N * 1 month)) - (M/24/60 * 1 day), taking the year, month, and day in the format yyyymmdd.
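As a worked example (the run date of 2017-08-30 used here is only an assumption for illustration), the expressions resolve roughly as follows:
$[yyyymmdd] -> 20170830 (the run date itself)
$[yyyymmdd-1] -> 20170829 (one day earlier, the same value as ${bdp.system.bizdate})
$[add_months(yyyymmdd,-1)] -> 20170730 (one month earlier)
$[hh24miss-1/24] -> the run time minus one hour, in the format hh24miss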
As shown in the figure, after the configuration is complete, let's do a test run and look directly at the log.
(Screenshot: test run log)
In the log, in the MaxCompute information (the log prints the original name ODPS), you can see that the partition date_test=20170829 was substituted automatically.
Now let's check whether the data actually arrived.
(Screenshot: query result showing the synchronized data)
We can see that the data has arrived and a partition value was created automatically. When this task is scheduled, a partition is generated automatically each day, and the RDS data is synchronized every day into the MaxCompute partition created for that date.
2. If the user has a lot of historical data from before the run date, how can it be synchronized and partitioned automatically? The Big Data Development Kit's Operation Center provides a data backfill (patch data) feature.
First, we need to filter the historical data by date on the RDS side. For example, for the historical data of 2017-08-25, I want it to be synchronized automatically into the 20170825 partition in MaxCompute.
A WHERE filter condition can be set on the RDS side, as shown in the figure.
(Screenshot: WHERE filter configured on the RDS reader)
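A minimal sketch of such a filter, assuming the RDS table has a DATETIME column named date_time as in the later examples; the business date parameter is substituted with a different day in each backfill instance:

-- WHERE condition on the RDS (MySQL) reader; ${bdp.system.bizdate} is replaced
-- per instance, e.g. 20170825, so each run picks up exactly one day's rows.
DATE_FORMAT(date_time, '%Y%m%d') = '${bdp.system.bizdate}'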
On the MaxCompute side, the configuration is the same as before.
(Screenshot: MaxCompute writer configuration)
Then be sure to Save and Submit.
After submitting, go to Operation Center - Task Management - Graph Mode - Patch Data.
(Screenshot: Patch Data entry in Operation Center)
Select the date range.
(Screenshot: date range selection)
Submit and run. Multiple synchronization task instances are generated at once and executed in order.
(Screenshot: generated backfill instances)
Looking at the run log, you can see the extraction of the RDS data during the run and the partition that was automatically created in MaxCompute.
(Screenshot: run log showing RDS extraction and the auto-created partition)
Looking at the result and the data written, the partitions were created automatically and the data was synchronized over.
(Screenshot: query result with the auto-created partitions)
3. If the user's data volume is very large, for example a first full load, or the data is not partitioned by date but by province or some other field, then Data Integration cannot create the partitions automatically. In other words, we want to hash on a field in RDS so that rows with the same field value are automatically placed into the MaxCompute partition corresponding to that value.
The synchronization tool itself cannot do this; it is done with SQL inside MaxCompute, using MaxCompute's own dynamic partition feature (you can refer to the relevant article). So we first need to synchronize the full data into a temporary table in MaxCompute.
The process is as follows:
1. First create an SQL script node, used to create the temporary table:

drop table if exists emp_test_new_temp;
-- Temporary table that will hold the full, unpartitioned extract from RDS
CREATE TABLE emp_test_new_temp (
    date_time STRING,
    name STRING,
    age BIGINT,
    sal DOUBLE
);


2. Create the synchronization task node: a simple sync task that synchronizes the full RDS data into the MaxCompute temporary table, without setting any partition.
3. Use SQL to dynamically partition the data into the destination table:
drop table if exists emp_test_new;
-- Create a partitioned ODPS table (the final destination table)
CREATE TABLE emp_test_new (
    date_time STRING,
    name STRING,
    age BIGINT,
    sal DOUBLE
)
PARTITIONED BY (
    date_test STRING
);
-- Run the dynamic partition SQL: partition automatically by the temporary table's
-- date_time field; each distinct value of date_time automatically becomes a partition value.
-- For example, rows whose date_time is 2017-08-25 automatically get a partition
-- date_test=2017-08-25 created in the ODPS partitioned table.
-- Note that the SELECT lists date_time one extra time at the end: that is the column
-- used to create the partitions automatically.
insert overwrite table emp_test_new partition (date_test)
select date_time, name, age, sal, date_time from emp_test_new_temp;
-- After the import is complete, the temporary table can be dropped to save storage.
drop table if exists emp_test_new_temp;
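A quick way to confirm the result of the dynamic partition insert (a minimal sketch):

-- List the partitions created by the dynamic-partition insert and spot-check one.
SHOW PARTITIONS emp_test_new;
SELECT * FROM emp_test_new WHERE date_test = '2017-08-25' LIMIT 10;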

Finally, configure the three nodes into a workflow and execute them in order.

(Screenshot: workflow with the three nodes)

During execution, we focus on the dynamic partitioning process of the last node.

(Screenshot: run log of the dynamic partition node)

Finally, look at the data

(Screenshot: query result of the final partitioned table)

Dynamic partitioning and automatic partition creation are complete. Isn't it neat that rows with the same date go into the same partition? The same holds if the partition field is a province name or any other value.

The Big Data Development Kit can actually automate most of this work, especially data synchronization, migration, and scheduling. The interface-based configuration makes data integration simple, so you no longer have to work overtime writing ETL.

 
