MySQL database data synchronization to MaxCompute

This section describes how to use Data Transmission Service (DTS) to quickly create a real-time synchronization job from a MySQL database to MaxCompute. The job synchronizes data from the online system (MySQL) to the offline system (MaxCompute) in real time and lays the foundation for real-time data analysis.

Data sources

  • Real-time synchronization to MaxCompute from a self-built MySQL database connected to Alibaba Cloud through a leased line
  • Real-time synchronization to MaxCompute from a self-built MySQL database on ECS
  • Real-time synchronization to MaxCompute from an RDS for MySQL instance under the same Alibaba Cloud account
  • Real-time synchronization to MaxCompute from an RDS for MySQL instance under a different Alibaba Cloud account

Synchronization objects

  • Only tables can be synchronized; non-table objects are not supported.

Principle

As shown above, the entire synchronization process is divided into two steps:
(1) Full initialization. This step copies the data that already exists in MySQL into MaxCompute. For each synchronized table, the initialized data is stored in its own full baseline table in MaxCompute. The default name of this table is: source table name_base. For example, for table t1 the full baseline table is stored in MaxCompute as t1_base. The table name prefix can be changed as needed when you configure the task.

(2) Incremental data synchronization. This step synchronizes the incremental data generated in MySQL to MaxCompute in real time and stores it in an incremental log table; each synchronized table has its own incremental log table. During incremental synchronization, multiple records are merged into a file and then written to MaxCompute. The default name of the incremental log table in MaxCompute is: source table name_log. The table name prefix can be changed as needed when you configure the task.
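The naming convention for the two target tables can be sketched in Python. This is only an illustration: the helper `target_table_names` is hypothetical, the table `orders` is made up, and it assumes that a custom prefix simply replaces the source table name in front of the fixed `_base` and `_log` suffixes.

```python
from typing import Optional, Tuple


def target_table_names(source_table: str, prefix: Optional[str] = None) -> Tuple[str, str]:
    """Return (full baseline table, incremental log table) names.

    Assumption: a custom prefix replaces the source table name in front of
    the fixed _base / _log suffixes.
    """
    stem = prefix if prefix else source_table
    return f"{stem}_base", f"{stem}_log"


# Default naming for a hypothetical source table called "orders".
default_names = target_table_names("orders")
# Custom prefix, as used for the jiangliu_test table later in this article.
custom_names = target_table_names("jiangliu_test", prefix="jiangliu_test_20161010")
print(default_names)
print(custom_names)
```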

In addition to the updated data, the incremental log table also stores some metadata. The structure of the incremental log table is defined as follows:

record_id  operation_flag  utc_timestamp  before_flag  after_flag  col1  …  colN
1          I               1476258462     N            Y           1     …  JustInsert
2          U               1476258463     Y            N           1     …  JustInsert
2          U               1476258463     N            Y           1     …  JustUpdate
3          D               1476258464     Y            N           1     …  JustUpdate

Where:
record_id: uniquely identifies an incremental log record and only ever increases. If the change is an update, it is split into two log records, one carrying the old (deleted) values and one carrying the new (inserted) values. Both records have the same record_id.
operation_flag: the operation type of this log record. Possible values:
I: insert operation
D: delete operation
U: update operation

dts_utc_timestamp: the timestamp of the operation, that is, the timestamp of the binlog entry for this change. The timestamp is in UTC.
before_flag: indicates whether the column values carried by this log record are the values before the update. Possible values are Y and N. When the column values are the values before the update, before_flag = Y; when they are the values after the update, before_flag = N.
after_flag: indicates whether the column values carried by this log record are the values after the update. Possible values are Y and N. When the column values are the values before the update, after_flag = N; when they are the values after the update, after_flag = Y.
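As an illustration of these fields, the following Python sketch models one incremental log record. The class `LogRecord` and its methods are hypothetical and not part of DTS; the field name `utc_timestamp` follows the table header, and the timestamp is assumed to be in seconds since the Unix epoch, which matches the sample values.

```python
from datetime import datetime, timezone
from typing import Dict, NamedTuple


class LogRecord(NamedTuple):
    record_id: int
    operation_flag: str      # "I", "U" or "D"
    utc_timestamp: int       # assumed: seconds since the epoch, UTC
    before_flag: str         # "Y" if the columns hold the values before the update
    after_flag: str          # "Y" if the columns hold the values after the update
    columns: Dict[str, object]

    def image(self) -> str:
        """Return "before" or "after" depending on which flag is set."""
        if (self.before_flag, self.after_flag) == ("Y", "N"):
            return "before"
        if (self.before_flag, self.after_flag) == ("N", "Y"):
            return "after"
        raise ValueError("exactly one of before_flag/after_flag must be Y")

    def happened_at(self) -> datetime:
        """Convert the UTC timestamp into a timezone-aware datetime."""
        return datetime.fromtimestamp(self.utc_timestamp, tz=timezone.utc)


# First sample row of the log table above (only col1 and colN are known).
row = LogRecord(1, "I", 1476258462, "N", "Y", {"col1": 1, "colN": "JustInsert"})
print(row.image(), row.happened_at().isoformat())
```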

For each operation type, before_flag and after_flag of the incremental log are defined as follows:

1) Operation type: insert

record_id  operation_flag  utc_timestamp  before_flag  after_flag  col1  …  colN
1          I               1476258462     N            Y           1     …  JustInsert

When the operation type is insert, the column values are the newly inserted values, that is, the values after the update. Therefore before_flag = N and after_flag = Y.

2) Operation type: update

record_id  operation_flag  utc_timestamp  before_flag  after_flag  col1  …  colN
2          U               1476258463     Y            N           1     …  JustInsert
2          U               1476258463     N            Y           1     …  JustUpdate

When the operation type is update, the update is split into two incremental log records. The two records have the same record_id, operation_flag and dts_utc_timestamp.
The first record holds the values before the update, so before_flag = Y and after_flag = N.
The second record holds the values after the update, so before_flag = N and after_flag = Y.

3) Operation type: delete

record_id  operation_flag  dts_utc_timestamp  before_flag  after_flag  col1  …  colN
3          D               1476258464         Y            N           1     …  JustUpdate

When the operation type is delete, the column values are the values of the deleted record, that is, the values before the update. Therefore before_flag = Y and after_flag = N.
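The three cases can be summarized in a short Python sketch that expands one change into the log rows it would produce. The function `to_log_records` is hypothetical and only mirrors the rules described here; it is not how DTS itself is implemented, and the key `utc_timestamp` is taken from the table header.

```python
from typing import Dict, List, Optional

Row = Dict[str, object]


def to_log_records(record_id: int, op: str, ts: int,
                   before: Optional[Row], after: Optional[Row]) -> List[Row]:
    """Expand one change into log rows following the flag rules above."""
    def make(values: Row, before_flag: str, after_flag: str) -> Row:
        return {"record_id": record_id, "operation_flag": op,
                "utc_timestamp": ts, "before_flag": before_flag,
                "after_flag": after_flag, **values}

    if op == "I" and after is not None:
        return [make(after, "N", "Y")]          # new values only
    if op == "D" and before is not None:
        return [make(before, "Y", "N")]         # old values only
    if op == "U" and before is not None and after is not None:
        # An update yields two rows with the same record_id and timestamp.
        return [make(before, "Y", "N"), make(after, "N", "Y")]
    raise ValueError(f"unsupported operation or missing row image: {op!r}")


rows = to_log_records(2, "U", 1476258463,
                      before={"col1": 1, "colN": "JustInsert"},
                      after={"col1": 1, "colN": "JustUpdate"})
for r in rows:
    print(r)
```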

(3) In MySQL->MaxCompute synchronization, each synchronized table produces one full baseline table and one incremental log table in MaxCompute. To obtain the full data of a table at a given point in time, you need to merge the table's full baseline table with its incremental log table. The method is explained in detail later in this article.

The following describes how to configure a real-time synchronization job from MySQL to MaxCompute.

1. Purchase a synchronization link

Log on to the DTS console, open the data synchronization page, and click "create sync" in the upper-right corner of the console to start configuring the job.

Before configuring the link, you need to purchase a synchronization link. Links currently support two billing modes, monthly or yearly subscription and pay-as-you-go; choose the mode that fits your needs.

The parameters to configure on the purchase page include:

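  • Source instance
    The source instance type of the synchronization job. Currently only MySQL is supported.
  • Source region
    For a local self-built MySQL database, select the region of the leased line's access point on Alibaba Cloud. For a self-built MySQL database on ECS, select the region of the ECS instance. For RDS for MySQL, select the region of the RDS instance.
  • Target instance
    The target instance type of the synchronization job. Currently MySQL, MaxCompute (formerly ODPS), AnalyticDB and DataHub are supported. When configuring a MySQL->MaxCompute link, select MaxCompute.
  • Target region
    Select the target region. Only regions where MaxCompute is currently available can be selected.
  • Instance specification
    The instance specification affects the synchronization performance of the link; choose a specification that matches your business performance requirements.

After purchasing the synchronization instance, return to the DTS console and click "Configure synchronization job" to the right of the newly purchased link to start configuring the link.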

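2. Configure the connection information of the synchronization link

This step mainly configures:

  • Synchronization job name
    The job name does not need to be unique; it mainly makes the job easier to identify. Choose a name with business meaning so that the link is easier to find and manage later.

  • Data source connection information

In this step you configure the connection information of the source instance and the project of the target MaxCompute instance. The MaxCompute project must be a resource of the Alibaba Cloud account used to log on to DTS.

The source instance can be a self-built database connected to Alibaba Cloud through a leased line, a self-built database on ECS, or RDS.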

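If the source instance is a self-built database connected to Alibaba Cloud through a leased line, configure the following connection information:

  • Instance type: select the local DB connected to Alibaba Cloud through a leased line
  • Instance region: select the region of the leased line's access point on Alibaba Cloud; for example, if the line connects to Alibaba Cloud in Beijing, select China North 2.
  • Peer VPC: the VPC ID of the VPC on Alibaba Cloud that the leased line connects to
  • Hostname or IP address: the access address of the local MySQL database; this is the address used on the local LAN
  • Port: the listening port of the local MySQL instance
  • Database account: the account used to access the local MySQL instance
  • Database password: the password of the MySQL account specified above

(Figure: connection settings for a local self-built DB)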

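If the source instance is a self-built database on ECS, configure the following connection information:

  • Instance type: select the self-built database on ECS
  • ECS instance ID: the instance ID of the ECS instance
  • Port: the listening port of the local MySQL instance
  • Database account: the account used to access the local MySQL instance
  • Database password: the password of the MySQL account specified above

(Figure: connection settings for a self-built database on ECS)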

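If the source instance is RDS for MySQL, you only need to configure the instance ID of the RDS instance.

(Figure: example of the connection information)

After these settings are complete, click "Authorize whitelist and go to next step".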

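3. Authorize the RDS instance whitelist

This step grants the DTS service account write permission on MaxCompute so that DTS can synchronize data into MaxCompute.

The permissions granted on the project include:
CreateTable
CreateInstance
CreateResource
CreateJob
List
To keep the synchronization job stable, do not revoke the write permission while synchronization is running. After the whitelist is authorized, click Next to go to synchronization account creation.

After authorization completes, you enter synchronization object selection.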

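4. Select the synchronization objects

After the MaxCompute account is authorized, you configure the tables to synchronize and the synchronization initialization.

In this step you configure synchronization initialization and the synchronization tables:
(1) Synchronization initialization
The initialization options are schema initialization and full data initialization.
Schema initialization creates the corresponding tables in MaxCompute for the tables to be synchronized, which completes the table structure definition.
Full data initialization loads the historical data of the tables to be synchronized into MaxCompute.
When configuring the task, it is recommended to select both schema initialization and full data initialization.

(2) Synchronization table selection

You can only select individual tables; you cannot select an entire database directly. For each synchronized table, you can change the table name prefix of its full baseline table and incremental log table in MaxCompute. To do so, click the Edit button next to the object in the selected-objects list on the right to open the edit page.

After the synchronization objects are configured, you enter synchronization initialization configuration.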

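5. Precheck

After all the options above are configured, the job goes through a precheck before it starts. For the specific check items, see the "Precheck content" section at the end of this article.
After the synchronization job is configured, DTS runs a precheck of the restrictions. Once the precheck passes, click the OK button to start the synchronization job.

After the job starts, it appears in the synchronization job list. A newly started job is in the initializing state. How long initialization takes depends on the volume of data in the synchronization objects of the source instance. When initialization completes, the link enters the synchronizing state, and only then is the link between the source and target instances truly established.

Once the job is in the synchronizing state, you can query in MaxCompute the full baseline table and the incremental log table that correspond to the earlier configuration:
(Figure: table information)

This completes the configuration of the RDS->MaxCompute real-time data synchronization job.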

This section describes how to obtain the full data of a synchronized table at a given time from the full baseline table and the incremental log table in MaxCompute.
DTS provides this full-data merge capability, implemented with MaxCompute SQL.

To obtain the full data at time t, use MaxCompute SQL to merge the full baseline table with the incremental log table. The MaxCompute SQL is written as follows:

The variables in the code above are as follows:
1) result_storage_table: the table that stores the merged full-data result set
2) col1, col2, colN: the names of the columns in the synchronized table
3) primary_key_column: the name of the primary key column of the synchronized table
4) table_log: the incremental log table
5) table_base: the full baseline table
6) timestmap: the point in time for which the full data is to be merged

For example, for the jiangliu_test table configured in the task above, jiangliu_test_20161010_base is the full baseline table of jiangliu_test, and jiangliu_test_20161010_log is its incremental log table.

The structure of the jiangliu_test table is defined as:
(Figure: table structure)

The MaxCompute SQL that queries the full data of the jiangliu_test table at timestamp 1476263486 is then as follows:

You can also add a full-data merge node in the big data development suite ahead of the subsequent computation and analysis jobs. Once the full-data merge completes, the downstream computation and analysis nodes are scheduled automatically. You can also configure a scheduling period to analyze the offline data periodically.

This completes the configuration of MySQL-to-MaxCompute data synchronization and the full-data merge.

Origin www.cnblogs.com/shishitongbu/p/11019534.html