Build a real-time data warehouse based on Alibaba Cloud Flink+Hologres

Abstract: This article was written by Zhang Gaodi, senior R&D engineer for Alibaba Cloud Hologres, and Zhang Yingnan, technical content engineer for Alibaba Cloud Flink. It introduces how to build a real-time data warehouse with Realtime Compute for Apache Flink and the Hologres real-time data warehouse.


Background Information

With the digitization of society, enterprises have an increasingly strong demand for data timeliness. Beyond the traditional offline scenarios designed for massive data processing, a growing number of businesses need real-time processing, real-time storage, and real-time analysis. The methodology for building a traditional offline data warehouse is relatively mature: warehouse layering (ODS -> DWD -> DWS -> ADS) is achieved through scheduled batch jobs. For real-time data warehouses, however, a clear methodology is still lacking. The Streaming Warehouse concept enables real-time data to flow efficiently between warehouse layers, which solves the layering problem for real-time data warehouses.

Solution architecture

Realtime Compute for Apache Flink is a powerful stream processing engine that can process massive amounts of real-time data efficiently. Hologres is a one-stop real-time data warehouse that supports real-time writes and updates, and data is queryable immediately after it is written. Hologres is deeply integrated with Flink to provide an integrated real-time data warehouse solution. This article's Flink + Hologres solution for building a real-time data warehouse works as follows:

1. Flink writes the data source into Hologres to form the ODS layer.

2. Flink subscribes to the Binlog of the ODS layer for processing, forming the DWD layer and writing it to Hologres again.

3. Flink subscribes to the Binlog of the DWD layer, forms the DWS layer through calculation, and writes it to Hologres again.

4. Finally, Hologres provides external application queries.

[Figure: Flink + Hologres real-time data warehouse architecture]

This solution has the following advantages:

  • Each layer of data in Hologres supports efficient updates and corrections, and is queryable immediately after writing. This solves the problem that middle-layer data in traditional real-time data warehouse solutions is hard to query, update, and correct.

  • Each layer of data in Hologres can serve external applications independently, so data is efficiently reused and the goal of layered warehouse reuse is truly achieved.

  • The model is unified and the architecture is simplified. The real-time ETL logic is implemented in Flink SQL, and the ODS, DWD, and DWS layers are all stored in Hologres, which reduces architectural complexity and improves data processing efficiency.

This solution relies on three core capabilities of Hologres, detailed below.

  • Binlog: Hologres provides binlog capabilities, so it can act as the upstream of stream processing and drive Flink to perform real-time computation. For details, see Subscribe to Hologres Binlog [1].

  • Row-column coexistence: Hologres supports a storage format in which a table keeps both a row-store copy and a column-store copy of the data, and the two copies are strongly consistent. This ensures that a middle-layer table can be used as a Flink source table, as a Flink dimension table for primary-key point lookups and dimension joins, and can also be queried by other applications (OLAP, online serving, and so on). For details, see Table Storage Formats: Column Store, Row Store, and Row-Column Coexistence [2].

  • Strong resource isolation: When the load on a Hologres instance is high, queries against the middle layer may be affected. Hologres supports strong resource isolation through read/write splitting with primary and secondary instances (shared storage) [3] or the compute group instance architecture [4], which ensures that Flink pulling data from the Hologres binlog does not affect online services.
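To make the table-level capabilities concrete, the following Hologres SQL is a minimal sketch of how a table with row-column coexistence and binlog enabled might be created directly in Hologres (the table name and values are illustrative; in this article's solution the equivalent properties are set through the Flink catalog instead, as shown later).

BEGIN;
CREATE TABLE example_dwd_table (
  order_id bigint NOT NULL,
  order_fee numeric(20,2),
  PRIMARY KEY (order_id)
);
-- Keep both a row-store and a column-store copy of the data (row-column coexistence).
CALL set_table_property('example_dwd_table', 'orientation', 'row,column');
-- Enable binlog on the table and retain binlog records for 3 days (in seconds).
CALL set_table_property('example_dwd_table', 'binlog.level', 'replica');
CALL set_table_property('example_dwd_table', 'binlog.ttl', '259200');
COMMIT;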

Practical scenario

This article takes an e-commerce platform as an example. By building a real-time data warehouse, data can be processed and cleaned in real time and exposed to upper-layer applications for query, so that real-time data is layered and reused to support multiple business scenarios for each business party, such as report queries (transaction dashboards), behavioral data analysis (user profile tags), and personalized recommendations.


1. Build the ODS layer: real-time warehousing of business databases

MySQL has three business tables: orders (order table), orders_pay (order payment table), and product_catalog (product category dictionary table). These three tables are synchronized to Hologres in real time through Flink to form the ODS layer.


2. Build the DWD layer: real-time topic wide table

The order table, the product category dictionary table, and the order payment table are joined and widened in real time to generate the DWD-layer wide table.


3. Build the DWS layer: real-time metric calculation

The binlog of the wide table is consumed in real time, and the corresponding DWS-layer metric tables are aggregated in an event-driven manner.



Prerequisites

  • An exclusive general-purpose Hologres instance has been purchased. For details, see Purchase Hologres [5].

After purchasing the instance, you need to create the order_dw database and a user with admin permissions on it. It is recommended to create the database with the simple permission model; for details, see Use the Simple Permission Model [6] and Manage Databases [7]. A minimal SQL sketch of these Hologres-side steps follows the sample data at the end of this section.

Note:

- In Hologres V1.3, after creating the database you need to execute the create extension hg_binlog command to enable the binlog extension.

- Hologres V2.0 and later enable the binlog extension by default; no manual action is needed.

  • Fully managed Flink has been activated. For details, see Activate Fully Managed Flink [8].

Note: The fully managed Flink workspace must be in the same VPC and the same availability zone as the Hologres instance.

  • The MySQL CDC data source has been prepared. The DDL statements for creating the three business tables in the order_dw database and the statements for inserting sample data are as follows.

CREATE TABLE `orders` (
  order_id bigint not null primary key,
  user_id varchar(50) not null,
  shop_id bigint not null,
  product_id bigint not null,
  buy_fee numeric(20,2) not null,   
  create_time timestamp not null,
  update_time timestamp not null default now(),
  state int not null 
);




CREATE TABLE `orders_pay` (
  pay_id bigint not null primary key,
  order_id bigint not null,
  pay_platform int not null, 
  create_time timestamp not null
);




CREATE TABLE `product_catalog` (
  product_id bigint not null primary key,
  catalog_name varchar(50) not null
);


-- Prepare sample data
INSERT INTO product_catalog VALUES(1, 'phone_aaa'),(2, 'phone_bbb'),(3, 'phone_ccc'),(4, 'phone_ddd'),(5, 'phone_eee');


INSERT INTO orders VALUES
(100001, 'user_001', 12345, 1, 5000.05, '2023-02-15 16:40:56', '2023-02-15 18:42:56', 1),
(100002, 'user_002', 12346, 2, 4000.04, '2023-02-15 15:40:56', '2023-02-15 18:42:56', 1),
(100003, 'user_003', 12347, 3, 3000.03, '2023-02-15 14:40:56', '2023-02-15 18:42:56', 1),
(100004, 'user_001', 12347, 4, 2000.02, '2023-02-15 13:40:56', '2023-02-15 18:42:56', 1),
(100005, 'user_002', 12348, 5, 1000.01, '2023-02-15 12:40:56', '2023-02-15 18:42:56', 1),
(100006, 'user_001', 12348, 1, 1000.01, '2023-02-15 11:40:56', '2023-02-15 18:42:56', 1),
(100007, 'user_003', 12347, 4, 2000.02, '2023-02-15 10:40:56', '2023-02-15 18:42:56', 1);


INSERT INTO orders_pay VALUES
(2001, 100001, 1, '2023-02-15 17:40:56'),
(2002, 100002, 1, '2023-02-15 17:40:56'),
(2003, 100003, 0, '2023-02-15 17:40:56'),
(2004, 100004, 0, '2023-02-15 17:40:56'),
(2005, 100005, 0, '2023-02-15 18:40:56'),
(2006, 100006, 0, '2023-02-15 18:40:56'),
(2007, 100007, 0, '2023-02-15 18:40:56');
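A minimal sketch of the Hologres-side prerequisite described above, assuming you are connected as a user with admin rights on the instance (user creation and simple-permission-model grants are omitted because they depend on your account IDs):

-- Create the tutorial database on the Hologres instance.
CREATE DATABASE order_dw;

-- Hologres V1.3 only: connect to order_dw and enable the binlog extension.
-- On Hologres V2.0 and later this is enabled by default and can be skipped.
CREATE EXTENSION hg_binlog;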

Usage restrictions

  • Only the real-time computing engine VVR 6.0.7 and above supports this real-time data warehouse solution.

  • Only Hologres versions 1.3 and above support this real-time data warehouse solution.


Build a real-time data warehouse

Manage metadata

1. Create Hologres Catalog.

On the real-time computing console [9], create a new SQL job named test. Copy the following code to the SQL editor of the test job, modify the target parameter values, then select the code snippet and click Run to the left of the code line.

CREATE CATALOG dw WITH (
  'type' = 'hologres',
  'endpoint' = '<ENDPOINT>', 
  'username' = '<USERNAME>',
  'password' = '<PASSWORD>',
  'dbname' = 'order_dw',
  'binlog' = 'true', -- When creating the catalog, you can set WITH parameters supported by source, dimension, and result tables; they are then applied by default to tables used through this catalog.
  'sdkMode' = 'jdbc', -- The jdbc mode is recommended.
  'cdcmode' = 'true',
  'connectionpoolname' = 'the_conn_pool',
  'ignoredelete' = 'true',  -- Required for wide-table merge, to prevent retractions.
  'partial-insert.enabled' = 'true', -- Required for wide-table merge, to enable partial column updates.
  'mutateType' = 'insertOrUpdate', -- Required for wide-table merge, to enable partial column updates.
  'table_property.binlog.level' = 'replica', -- Persistent Hologres table properties can also be passed when creating the catalog; tables created afterwards then have binlog enabled by default.
  'table_property.binlog.ttl' = '259200'
);

You need to modify the following parameter values to your actual Hologres service information.

  • endpoint: The endpoint of the Hologres instance. For details, see Instance Configurations [10].

  • username: The AccessKey ID of your Alibaba Cloud account. The user corresponding to the configured AccessKey must be able to access all Hologres databases. For Hologres database permissions, see Overview of the Hologres Permission Model [11].

  • password: The AccessKey Secret of your Alibaba Cloud account.

Note: When creating a catalog, you can set the default WITH parameters of source tables, dimension tables, and result tables, and you can also set default properties of the Hologres physical tables, such as the parameters prefixed with table_property above. For details, see Manage Hologres Catalogs [12] and Hologres WITH Parameters [13].
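Parameters set on the catalog act as defaults for tables accessed through it; they can still be overridden per query with dynamic table options (SQL hints), which the data exploration section later in this article also relies on. A minimal, illustrative sketch:

-- Override the catalog default 'binlog' = 'true' for a single query,
-- reading the table as a bounded batch source instead of a binlog stream.
SELECT * FROM dw.order_dw.orders /*+ OPTIONS('binlog' = 'false') */ LIMIT 10;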

2. Create MySQL Catalog

On the real-time computing console [9], copy the following code to the SQL editor of the test job, modify the target parameter values, then select the code snippet and click Run to the left of the code line.

CREATE CATALOG mysqlcatalog WITH(
  'type' = 'mysql',
  'hostname' = '<hostname>',
  'port' = '<port>',
  'username' = '<username>',
  'password' = '<password>',
  'default-database' = 'order_dw'
);

You need to modify the following parameter values to your actual MySQL service information.

  • hostname: The IP address or hostname of the MySQL database.

  • port: The port number of the MySQL database service. The default value is 3306.

  • username: The username of the MySQL database service.

  • password: The password of the MySQL database service.
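Before building the layers, an optional sanity check is to confirm in the SQL editor that both catalogs resolve. A minimal sketch (the exact SHOW syntax may vary slightly across VVR versions):

-- List the registered catalogs; dw and mysqlcatalog should both appear.
SHOW CATALOGS;

-- List the tables of the source database; orders, orders_pay, and product_catalog should appear.
SHOW TABLES FROM mysqlcatalog.order_dw;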

Build the ODS layer: real-time warehousing of business databases

Based on the catalog and the CREATE DATABASE AS (CDAS) statement [14], the ODS layer can be built in one step. The ODS layer generally does not serve OLAP queries or key-value point lookups directly; it mainly acts as the event source that drives the streaming jobs, so enabling binlog on its tables is sufficient.

1. Create a CDAS synchronization job ODS.

    a. On the real-time computing console  [9] , create a new SQL stream job named ODS, and copy the following code to the SQL editor.

CREATE DATABASE IF NOT EXISTS dw.order_dw   -- Because table_property.binlog.level was set when creating the catalog, all tables created through CDAS have binlog enabled.
AS DATABASE mysqlcatalog.order_dw INCLUDING all tables -- You can select which upstream tables need to be synchronized into the warehouse.
/*+ OPTIONS('server-id'='8001-8004') */ ;   -- Specify the server-id range for the mysql-cdc source tables.

    b. Click Deploy in the upper right corner to deploy the job.

    c. Click Job Operation and Maintenance in the left navigation bar, then click Start in the Actions column of the ODS job you just deployed to start it.

2. View the data of the three tables synchronized from MySQL to Hologres.

After connecting to the Hologres instance on the HoloWeb development page and logging in to the target database, execute the following commands in the SQL editor.

-- Query data in orders.
SELECT * FROM orders;


-- Query data in orders_pay.
SELECT * FROM orders_pay;


-- Query data in product_catalog.
SELECT * FROM product_catalog;




Building the DWD layer: real-time topic wide table

1. Create the DWD-layer wide table dwd_orders in Hologres through the Flink catalog.

On the real-time computing console [9], copy the following code to the SQL editor of the test job, select the target snippet, and click Run to the left of the code line.

-- The wide-table columns must be nullable, because different streams write to the same result table and any column may be null for a given row.
CREATE TABLE dw.order_dw.dwd_orders (
  order_id bigint not null primary key,
  order_user_id string,
  order_shop_id bigint,
  order_product_id bigint,
  order_product_catalog_name string,
  order_fee numeric(20,2),
  order_create_time timestamp,
  order_update_time timestamp,
  order_state int,
  pay_id bigint,
  pay_platform int comment 'platform 0: phone, 1: pc', -- Comments can be set when creating tables through the catalog.
  pay_create_time timestamp
);


-- Hologres physical table properties can be modified through the catalog.
ALTER TABLE dw.order_dw.dwd_orders SET (
  'table_property.binlog.ttl' = '604800' -- Extend the binlog TTL to one week.
);

2. Consume the binlog of the orders and orders_pay tables in the ODS layer in real time.

On the real-time computing console [9], create a new SQL job named DWD, copy the following code to the SQL editor, then deploy and start the job. In this job, the orders table is joined with the product_catalog table as a dimension table, and the results are written into the dwd_orders table, widening the data in real time.

BEGIN STATEMENT SET;


INSERT INTO dw.order_dw.dwd_orders 
 (
   order_id,
   order_user_id,
   order_shop_id,
   order_product_id,
   order_fee,
   order_create_time,
   order_update_time,
   order_state,
   order_product_catalog_name
 ) SELECT o.*, dim.catalog_name 
   FROM dw.order_dw.orders as o
   LEFT JOIN dw.order_dw.product_catalog FOR SYSTEM_TIME AS OF proctime() AS dim
   ON o.product_id = dim.product_id;


INSERT INTO dw.order_dw.dwd_orders 
  (pay_id, order_id, pay_platform, pay_create_time)
   SELECT * FROM dw.order_dw.orders_pay;


END;

3. View the data in the wide table dwd_orders.

After connecting to the Hologres instance on the HoloWeb development page and logging in to the target database, execute the following command in the SQL editor.

SELECT * FROM dwd_orders;



Building the DWS layer: real-time metric calculation

1. Create the DWS-layer aggregate tables dws_users and dws_shops in Hologres through the Flink catalog.

On the real-time computing console [9], copy the following code to the SQL editor of the test job, select the target snippet, and click Run to the left of the code line.

-- User-dimension aggregate metric table.
CREATE TABLE dw.order_dw.dws_users (
  user_id string not null,
  ds string not null,
  paied_buy_fee_sum numeric(20,2) not null, -- Total amount paid by the user on the day.
  primary key(user_id,ds)  NOT ENFORCED
);


-- Shop-dimension aggregate metric table.
CREATE TABLE dw.order_dw.dws_shops (
  shop_id bigint not null,
  ds string not null,
  paied_buy_fee_sum numeric(20,2) not null, -- Total amount paid at the shop on the day.
  primary key(shop_id,ds)  NOT ENFORCED
);

2. Consume the DWD-layer wide table dw.order_dw.dwd_orders in real time, perform the aggregations in Flink, and write the results into the DWS tables in Hologres.

On the real-time computing console [9], create a new SQL streaming job named DWS, copy the following code to the SQL editor, then deploy and start the job.

BEGIN STATEMENT SET;


INSERT INTO dw.order_dw.dws_users
  SELECT 
    order_user_id,
    DATE_FORMAT (pay_create_time, 'yyyyMMdd') as ds,
    SUM (order_fee)
    FROM dw.order_dw.dwd_orders c
    WHERE pay_id IS NOT NULL AND order_fee IS NOT NULL -- Both the order stream and the payment stream have been written into the wide table.
    GROUP BY order_user_id, DATE_FORMAT (pay_create_time, 'yyyyMMdd');


INSERT INTO dw.order_dw.dws_shops
  SELECT 
    order_shop_id,
    DATE_FORMAT (pay_create_time, 'yyyyMMdd') as ds,
    SUM (order_fee)
   FROM dw.order_dw.dwd_orders c
   WHERE pay_id IS NOT NULL AND order_fee IS NOT NULL -- Both the order stream and the payment stream have been written into the wide table.
   GROUP BY order_shop_id, DATE_FORMAT (pay_create_time, 'yyyyMMdd');
END;

3. View the aggregation results of the DWS layer; they are updated in real time as upstream data changes. (A quick way to verify this is sketched after the query results below.)

After connecting to the Hologres instance on the HoloWeb development page and logging in to the target database, execute the following commands in the SQL editor.

  • Query the results of the dws_users table.

SELECT * FROM dws_users;



  • Query the results of the dws_shops table.

SELECT * FROM dws_shops;


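As mentioned above, one optional way to verify that the DWS layer updates in real time is to insert a new order and its payment into MySQL and then re-run the queries; the per-day sums should change within seconds. A sketch with illustrative values:

-- Run against MySQL: a new paid order for user_001 at shop 12345.
INSERT INTO orders VALUES
(100008, 'user_001', 12345, 2, 6000.02, '2023-02-15 19:40:56', '2023-02-15 19:42:56', 1);
INSERT INTO orders_pay VALUES
(2008, 100008, 1, '2023-02-15 19:45:56');

-- Then re-query in HoloWeb: paied_buy_fee_sum for (user_001, 20230215)
-- and for (12345, 20230215) should increase by 6000.02.
SELECT * FROM dws_users WHERE user_id = 'user_001';
SELECT * FROM dws_shops WHERE shop_id = 12345;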


Data exploration

Because every layer of data in this solution is persisted, intermediate results are easy to explore, whether for ad-hoc business analysis or to verify the correctness of the final results.

  • Streaming mode exploration

    a. Create and start the data exploration streaming job.

On the real-time computing console [9], create a new SQL job named Data-exploration, copy the following code to the SQL editor, then deploy and start the job.

-- Streaming-mode exploration: printing to a print sink shows how the data changes over time.
CREATE TEMPORARY TABLE print_sink(
  order_id bigint not null primary key,
  order_user_id string,
  order_shop_id bigint,
  order_product_id bigint,
  order_product_catalog_name string,
  order_fee numeric(20,2),
  order_create_time timestamp,
  order_update_time timestamp,
  order_state int,
  pay_id bigint,
  pay_platform int,
  pay_create_time timestamp
) WITH (
  'connector' = 'print'
);


INSERT INTO print_sink SELECT *
FROM dw.order_dw.dwd_orders /*+ OPTIONS('startTime'='2023-02-15 12:00:00') */ -- startTime is the time when the binlog records were generated.
WHERE order_user_id = 'user_001';

    b. View the data exploration results.

On the job operation and maintenance details page, click the name of the target job. On the Job Exploration tab, click the Run Log tab on the left, then click the Path, ID entry on the Running Task Managers tab. On the Stdout page, search for log entries related to user_001.


  • Batch mode exploration

On the real-time computing console [9], create a SQL job, copy the following code to the SQL editor, and click Debug. For details, see Job Debugging [15].

Batch mode exploration returns the final state of the data at the current moment. The debugging results in the Flink job development interface are shown in the figure below.

SELECT *
FROM dw.order_dw.dwd_orders /*+ OPTIONS('binlog'='false') */ 
WHERE order_user_id = 'user_001' and order_create_time > '2023-02-15 12:00:00'; -- Batch mode supports filter pushdown, which improves batch job execution efficiency.


[Figure: batch mode debugging results in the Flink development console]


Use the real-time data warehouse

The previous sections showed that, using the Flink catalog alone, a layered real-time data warehouse based on Flink and Hologres can be built entirely on the Flink side. This section shows some simple application scenarios once the warehouse is built.

Key-Value service

Query the DWS-layer aggregate metric tables by primary key; this supports up to millions of requests per second (RPS).

The following example, run on the HoloWeb development page, queries the amount spent by a specified user on a specified date.

-- holo sql
SELECT * FROM dws_users WHERE user_id ='user_001' AND ds = '20230215';




Detailed query

Perform OLAP analysis on the DWD-layer wide table.

The following example, run on the HoloWeb development page, queries the details of orders paid by a customer on a specific payment platform in February 2023.

-- holo sql
SELECT * FROM dwd_orders
WHERE order_create_time >= '2023-02-01 00:00:00'  and order_create_time < '2023-03-01 00:00:00'
AND order_user_id = 'user_001'
AND pay_platform = 0
ORDER BY order_create_time LIMIT 100;



Real-time reporting

Build real-time reports on the DWD-layer wide table, with second-level response.

The following example, run on the HoloWeb development page, queries the total number of orders and the total order amount for each category in February 2023.

-- holo sql
SELECT
  TO_CHAR(order_create_time, 'YYYYMMDD'),
  order_product_catalog_name,
  COUNT(*),
  SUM(order_fee)
FROM
  dwd_orders
WHERE
  order_create_time >= '2023-02-01 00:00:00'  and order_create_time < '2023-03-01 00:00:00'
GROUP BY
  1, 2
ORDER BY
  1, 2;


References

[1] Subscribe to Hologres Binlog: https://help.aliyun.com/zh/hologres/user-guide/subscribe-to-hologres-binary-logs

[2] Table Storage Formats: Column Store, Row Store, and Row-Column Coexistence: https://help.aliyun.com/zh/hologres/user-guide/storage-models-of-tables

[3] Read/Write Splitting with Primary and Secondary Instances (Shared Storage): https://help.aliyun.com/zh/hologres/user-guide/configure-multi-instance-high-availability-deployment

[4] Compute Group Instance Architecture: https://help.aliyun.com/zh/hologres/user-guide/architecture-of-virtual-warehouses

[5] Purchase Hologres: https://help.aliyun.com/zh/hologres/getting-started/purchase-a-hologres-instance

[6] Use the Simple Permission Model: https://help.aliyun.com/zh/hologres/user-guide/use-the-spm

[7] Manage Databases: https://help.aliyun.com/zh/hologres/user-guide/manage-databases

[8] Activate Fully Managed Flink: https://help.aliyun.com/zh/flink/getting-started/activate-fully-managed-flink

[9] Real-time computing console: https://realtime-compute.console.aliyun.com/regions/cn-shanghai#/region/cn-hangzhou/resource/all/dashboard/serverless/asi

[10] Instance Configurations: https://help.aliyun.com/zh/hologres/user-guide/instance-configurations?spm=a2c4g.11186623.0.0.7d706105cYpaCU

[11] Overview of the Hologres Permission Model: https://help.aliyun.com/zh/hologres/user-guide/overview#concept-2021277

[12] Manage Hologres Catalogs: https://help.aliyun.com/zh/flink/user-guide/manage-hologres-catalogs

[13] Hologres WITH Parameters: https://help.aliyun.com/zh/flink/developer-reference/hologres-connector

[14] CREATE DATABASE AS (CDAS) Statement: https://help.aliyun.com/zh/flink/developer-reference/create-database-as-statement?spm=a2c4g.11186623.0.0.7d706105cYpaCU

[15] Job Debugging: https://help.aliyun.com/zh/flink/user-guide/debug-a-deployment

