Practice of Streaming Data Warehouse Based on Apache Paimon


Summary

This article describes how Haicheng Bangda, a supply chain logistics service provider, used Apache Paimon to build a streaming data warehouse as part of its digital transformation. It also provides a practical production guide for Kubernetes environments, designed to help readers get started with Paimon quickly.

  • Company business introduction

  • Pain points and selection of big data technologies

  • Production Practice

  • Troubleshooting Analysis

  • Future plans

01

Company business introduction

Haicheng Bangda Group focuses on supply chain logistics, providing customers with end-to-end, one-stop intelligent supply chain logistics services through its international logistics platform. The group has more than 2,000 employees, an annual turnover of more than 12 billion yuan, a network covering more than 200 ports worldwide, and more than 80 branches and subsidiaries at home and abroad, helping Chinese companies connect with the world.

Business background:

As the company's scale and business complexity keep growing, the operations and process management department needs to monitor business operations in real time in order to optimize resources and improve processes, ensuring that business processes remain stable and efficient.

The operations and process management department supervises the execution of the company's business processes, including order volume by region and business unit for sea, air, and rail transport, order volume of major customers, airline order volume, the number of commissions handled at customs, warehousing, land transport, and operation sites, and each region's and business unit's actual income and expenditure for the day. By monitoring and analyzing these processes, the company can identify potential problems and bottlenecks and propose improvements to raise operational efficiency.

Data warehouse batch processing architecture:


Real-time data warehouse architecture:


Today the system has to collect real-time data directly from the production systems, but multiple data sources must be joined in queries, and FanRuan (FineReport) reports handle multiple data sources poorly and cannot re-aggregate across them. Scheduled queries against the production systems also put pressure on the production databases and affect their stable operation. We therefore need a data warehouse that supports streaming processing through Flink CDC: it must collect real-time data from multiple sources, support complex joined SQL queries, machine learning, and other workloads on top of that data, and avoid ad-hoc queries against the production systems, thereby reducing the pressure on them and keeping them stable.

02

Pain points and selection of big data technologies

Since the Haicheng Bangda big data team was established, it has relied on efficient O&M tools and platforms to allocate staff efficiently and reduce repetitive manual work.

With offline batch processing already supporting the group's basic cockpit and management reports, the operations and management department requested real-time statistics on order volume and operational order volume, and the finance department requested a real-time view of cash flow. Against this background, a stream-batch unified solution based on big data became imperative.

Although the big data department has already used Apache Doris to build an integrated lakehouse with unified storage and computing (and previously published an article on this in the Doris community), several problems remain: streaming data storage cannot be reused, intermediate-layer data cannot be queried, and real-time aggregation cannot be performed.

Ordered by when they appeared, the common architecture solutions of recent years are as follows:

Hadoop architecture:

This is the dividing line between traditional data warehouses and Internet data warehouses. In the early days of the Internet, requirements on data analysis were not high: the goal was mainly low-timeliness reports to support decision-making, and the corresponding offline data analysis solutions emerged.

Advantages: rich data type support, massive-scale computation, low hardware requirements, fault tolerance; suitable when timeliness requirements are low.

Disadvantages: no real-time support; complex operation and maintenance; the query optimizer is weaker than MPP engines, so responses are slow.

Selection basis: no real-time support; complex O&M, which does not fit the lean-staffing principle; poor performance.

Lambda architecture:

The Lambda architecture is a real-time big data processing framework proposed by Nathan Marz, the author of Storm. Marz developed the well-known real-time processing framework Storm while working at Twitter, and the Lambda architecture distills his years of experience with distributed big data systems.

[Figure: Lambda architecture diagram]

Data processing is divided into three layers: the Batch layer, the Speed layer, and the Serving layer:

The Batch layer mainly processes offline data and finally provides view services to the business;

The Speed layer mainly processes real-time incremental data and finally provides view services to the business;

The Serving layer responds to user requests, aggregates the offline and incremental data, and finally provides the combined service;

Advantages: offline and real-time computation are separated into two dedicated frameworks, and the architecture is stable.

Disadvantages: it is difficult to keep offline and real-time data consistent, operators must maintain two frameworks and a three-layer architecture, and developers must write three sets of code.

Selection basis: data consistency is hard to control; the O&M and development workload is heavy, which does not fit the lean-staffing principle.

Kappa architecture:

The Kappa architecture uses a single stream processing architecture for both offline and real-time data, solving everything with real-time streams and aiming to provide fast and reliable query results. With a single technology stack it suits a wide variety of workloads, including continuous data pipelines, real-time data processing, machine learning models, real-time data analytics, IoT systems, and many other use cases.

It is usually implemented with a stream processing engine such as Apache Flink, Apache Storm, Amazon Kinesis, or Apache Kafka, designed to process large data streams and provide fast and reliable query results.

[Figure: Kappa architecture diagram]

Advantages: a single stream processing framework.

Disadvantages: although the architecture is simpler than Lambda, setting up and maintaining the streaming framework is relatively complicated, and it cannot truly process offline data; storing large volumes of data in the streaming platform is expensive.

Selection basis: offline data processing capability needs to be retained to control costs.

Iceberg:

We also investigated Iceberg. Its snapshot feature can achieve stream-batch unification to some extent, but the intermediate layers of a Kafka-based real-time pipeline still cannot be queried or reused, and it depends heavily on Kafka: intermediate results have to be written to the Iceberg table through Kafka, which increases the system's complexity and makes it harder to maintain.

Selection basis: we have not built a Kafka-based real-time architecture, and intermediate data cannot be queried or reused.

Streaming data warehouse (a continuation of the Kappa architecture):

The Haicheng Bangda big data team has participated in building a streaming data warehouse since Flink Table Store (FTS) 0.3.0, aiming to further reduce the complexity of the data processing stack and streamline staffing. In the early stage the goal was simply to follow the trend, keep learning, and move toward cutting-edge technology; the team agreed that if there were pits they would step on them, crossing the river by feeling for the stones. Fortunately, after several version iterations and with efficient cooperation from the community, the initial problems have gradually been resolved.

The streaming data warehouse architecture is as follows:


It continues the strengths of the Kappa architecture: a single stream processing stack. With Paimon underneath, data can be queried at every point of the link and the layered warehouse structure can be reused, while offline and real-time processing capabilities are both covered, reducing wasted storage and computation.

03

Production Practice

This solution runs Flink in Application mode on a Kubernetes cluster. Flink CDC ingests the relational database data of the business systems in real time, the Flink + Paimon streaming data warehouse jobs are submitted through the StreamPark platform, and the Trino engine connects to FineReport to serve reports and developer queries. Paimon's underlying storage supports the S3 protocol; since the company's big data services run on Alibaba Cloud, object storage OSS is used as the data file system.

This forms a full-link pipeline that flows in real time, is queryable at every layer, and is layered and reusable.

Architecture diagram:


The main component versions are as follows:

flink-1.16.0-scala-2.12

paimon-flink-1.16-0.4-20230424.001927-40.jar

apache-streampark_2.12-2.0.0

kubernetes v1.18.3

Environment construction

Download flink-1.16.0-scala-2.12.tar.gz: download the corresponding installation package from the Flink official website to the StreamPark server.

# Extract the archive
tar zxvf flink-1.16.0-scala-2.12.tar.gz

# Edit the flink-conf configuration file and start the cluster
# vim flink-1.16.0-scala-2.12/conf/flink-conf.yaml, modify as follows
jobmanager.rpc.address: localhost
jobmanager.rpc.port: 6123
jobmanager.bind-host: localhost
jobmanager.memory.process.size: 4096m
taskmanager.bind-host: localhost
taskmanager.host: localhost
taskmanager.memory.process.size: 4096m
taskmanager.numberOfTaskSlots: 4
parallelism.default: 4
akka.ask.timeout: 100s
web.timeout: 1000000

# checkpoints && savepoints
state.checkpoints.dir: file:///opt/flink/checkpoints
state.savepoints.dir: file:///opt/flink/savepoints
execution.checkpointing.interval: 2min
# When a job is manually cancelled/suspended, its checkpoint state is retained
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
state.backend: rocksdb
# Number of completed checkpoints to retain
state.checkpoints.num-retained: 2000
state.backend.incremental: true
execution.checkpointing.checkpoints-after-tasks-finish.enabled: true

# OSS
fs.oss.endpoint: oss-cn-zhangjiakou-internal.aliyuncs.com
fs.oss.accessKeyId: xxxxxxxxxxxxxxxxxxxxxxx
fs.oss.accessKeySecret: xxxxxxxxxxxxxxxxxxxxxxx
fs.oss.impl: org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem

jobmanager.execution.failover-strategy: region
rest.port: 8081
rest.address: localhost

It is recommended to add FLINK_HOME locally to facilitate local troubleshooting before using k8s

vim /etc/profile

# FLINK
export FLINK_HOME=/data/src/flink-1.16.0-scala-2.12
export PATH=$PATH:$FLINK_HOME/bin

source /etc/profile
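With FLINK_HOME on the PATH, the local standalone cluster can be started to verify the configuration before moving to k8s. A minimal sketch using the standard Flink scripts (paths assume the layout above):

# Start a local standalone cluster with the conf edited above
$FLINK_HOME/bin/start-cluster.sh

# List running jobs to confirm the cluster responds (web UI is on rest.port 8081)
$FLINK_HOME/bin/flink list

# Stop the local cluster once troubleshooting is done
$FLINK_HOME/bin/stop-cluster.sh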

Add flink conf to streampark


Build the Flink 1.16.0 base image: pull the corresponding version of the image from Docker Hub.

# Pull the image
docker pull flink:1.16.0-scala_2.12-java8

# Tag the image
docker tag flink:1.16.0-scala_2.12-java8 registry-vpc.cn-zhangjiakou.aliyuncs.com/xxxxx/flink:1.16.0-scala_2.12-java8

# Push it to the company registry
docker push registry-vpc.cn-zhangjiakou.aliyuncs.com/xxxxx/flink:1.16.0-scala_2.12-java8

Create a Dockerfile and a target directory, and place the flink-oss-fs-hadoop JAR package in that directory. Shaded Hadoop OSS file system jar download address: https://repository.apache.org/snapshots/org/apache/paimon/paimon-oss/

.

├── Dockerfile

└── target

  └── flink-oss-fs-hadoop-1.16.0.jar

touch Dockerfile
mkdir target

# vim Dockerfile
FROM registry-vpc.cn-zhangjiakou.aliyuncs.com/xxxxx/flink:1.16.0-scala_2.12-java8
RUN mkdir /opt/flink/plugins/oss-fs-hadoop
COPY target/flink-oss-fs-hadoop-1.16.0.jar /opt/flink/plugins/oss-fs-hadoop

# Build the base image
docker build -t flink-table-store:v1.16.0 .
docker tag flink-table-store:v1.16.0 registry-vpc.cn-zhangjiakou.aliyuncs.com/xxxxx/flink-table-store:v1.16.0
docker push registry-vpc.cn-zhangjiakou.aliyuncs.com/xxxxx/flink-table-store:v1.16.0

Prepare the paimon jar package

You can download the corresponding version from the Apache Snapshots Repository; note that it must match the Flink major version.
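For reference, the snapshot jar named above can be pulled straight from the Apache snapshots repository. The path below is an assumption derived from the artifact coordinates (org.apache.paimon:paimon-flink-1.16:0.4-SNAPSHOT); check the repository listing for the current timestamped file name:

# Hypothetical download path, derived from the jar name used in this article
wget https://repository.apache.org/content/repositories/snapshots/org/apache/paimon/paimon-flink-1.16/0.4-SNAPSHOT/paimon-flink-1.16-0.4-20230424.001927-40.jar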

Use the streampark platform to submit paimon tasks

Prerequisites:

Kubernetes client connection configuration

Kubernetes RBAC configuration

Container image registry configuration (this case uses the free personal edition of Alibaba Cloud Container Registry, ACR)

Create a pvc resource to mount checkpoint/savepoint

Kubernetes client connection configuration:

Copy ~/.kube/config from the k8s master node to the StreamPark server, then run the following command on the StreamPark server; if the k8s cluster information is displayed, authorization and network connectivity are verified.

kubectl cluster-info

Kubernetes RBAC configuration

Create streamx namespace

kubectl create ns streamx

Create a clusterrolebinding resource using the default account

kubectl create clusterrolebinding flink-role-binding-default --clusterrole=edit --serviceaccount=streamx:default

Container image registry configuration

This case uses Alibaba Cloud Container Registry (ACR); a self-hosted registry such as Harbor can be used instead.

Create a namespace named streampark in ACR (its security setting needs to be private).


Configure the image registry in StreamPark; the images built for tasks will be pushed to this registry.


Create a k8s secret for pulling images from ACR; streamparksecret here is a custom secret name.

kubectl create secret docker-registry streamparksecret --docker-server=registry-vpc.cn-zhangjiakou.aliyuncs.com --docker-username=xxxxxx --docker-password=xxxxxx -n streamx

Create a pvc resource to mount checkpoint/savepoint

Kubernetes persistence is based on Alibaba Cloud OSS object storage.

OSS CSI plugin:

The OSS CSI plugin can be used to simplify storage management: PVs are created through CSI configuration, while PVCs and pods are defined as usual. Reference yaml files: https://bondextest.oss-cn-zhangjiakou.aliyuncs.com/ossyaml.zip

Configuration requirements:

- Create a service account with the required RBAC permissions

Reference: https://github.com/kubernetes-sigs/alibaba-cloud-csi-driver/blob/master/docs/oss.md

kubectl apply -f rbac.yaml

- Deploy the OSS CSI plugin

kubectl apply -f oss-plugin.yaml

- Create the checkpoint & savepoint PVs

kubectl apply -f checkpoints_pv.yaml
kubectl apply -f savepoints_pv.yaml

- Create the checkpoint & savepoint PVCs (a sketch of these yaml files follows below)

kubectl apply -f checkpoints_pvc.yaml
kubectl apply -f savepoints_pvc.yaml
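For reference, a minimal sketch of what a checkpoint PV/PVC pair could look like. The driver name and volumeAttributes keys are assumptions based on the alibaba-cloud-csi-driver documentation linked above; bucket, path and credentials are placeholders, and the authoritative definitions are in the yaml package referenced earlier.

# checkpoints_pv.yaml (illustrative sketch only, values are placeholders)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: flink-checkpoints-csi-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com   # OSS CSI driver name per the linked docs
    volumeHandle: flink-checkpoints-csi-pv
    volumeAttributes:
      bucket: "your-oss-bucket"
      url: "oss-cn-zhangjiakou-internal.aliyuncs.com"
      path: "/flink/checkpoints"
      akId: "xxxxxx"
      akSecret: "xxxxxx"
---
# checkpoints_pvc.yaml (claimName matches the pod template shown later)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: flink-checkpoints-csi-pvc
  namespace: streamx
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 20Gi
  volumeName: flink-checkpoints-csi-pv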

After configuring the dependent environment, we start using Paimon to develop the layered streaming data warehouse.

Case:

Real-time order statistics for sea and air freight

Task submission:

Initialize paimon catalog configuration


SET 'execution.runtime-mode' = 'streaming';
SET 'table.exec.sink.upsert-materialize' = 'none';
SET 'sql-client.execution.result-mode' = 'tableau';

-- Create and use the FTS catalog; the underlying storage is Alibaba Cloud OSS
CREATE CATALOG `table_store` WITH (
    'type' = 'paimon',
    'warehouse' = 'oss://xxxxx/xxxxx' -- custom OSS storage path
);

USE CATALOG `table_store`;
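The layered databases referenced in the rest of this article (ods, dim, dwd, dwm, dws) have to exist inside the catalog before their tables are created; a minimal sketch, assuming they are not created elsewhere:

-- Create the warehouse layers inside the paimon catalog (names follow the layers used below)
CREATE DATABASE IF NOT EXISTS ods;
CREATE DATABASE IF NOT EXISTS dim;
CREATE DATABASE IF NOT EXISTS dwd;
CREATE DATABASE IF NOT EXISTS dwm;
CREATE DATABASE IF NOT EXISTS dws;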

A single task extracts table data from three databases (PostgreSQL, MySQL, and SQL Server) and writes it to Paimon.


Development Mode: Flink SQL
Execution Mode: kubernetes application
Flink Version: flink-1.16.0-scala-2.12
Kubernetes Namespace: streamx
Kubernetes ClusterId: (you can customize the task name)
Flink Base Docker Image: registry-vpc.cn-zhangjiakou.aliyuncs.com/xxxxx/flink-table-store:v1.16.0 (the base image pushed to the Alibaba Cloud image registry)
Rest-Service Exposed Type: NodePort

paimon basic dependency package:

paimon-flink-1.16-0.4-20230424.001927-40.jar

flink-shaded-hadoop-2-uber-2.8.3-10.0.jar

Flink CDC dependency package download address:

https://github.com/ververica/flink-cdc-connectors/releases/tag/release-2.2.0

pod template

apiVersion: v1
kind: Pod
metadata:
  name: pod-template
spec:
  containers:
    - name: flink-main-container
      volumeMounts:
        - name: flink-checkpoints-csi-pvc
          mountPath: /opt/flink/checkpoints
        - name: flink-savepoints-csi-pvc
          mountPath: /opt/flink/savepoints
  volumes:
    - name: flink-checkpoints-csi-pvc
      persistentVolumeClaim:
        claimName: flink-checkpoints-csi-pvc
    - name: flink-savepoints-csi-pvc
      persistentVolumeClaim:
        claimName: flink-savepoints-csi-pvc
  imagePullSecrets:
    - name: streamparksecret

Flink SQL:

1. Build the mapping between the source tables and the ods tables in Paimon; here each source table maps one-to-one to its target table.

-- PostgreSQL database example


CREATE TEMPORARY TABLE `shy_doc_hdworkdochd` (


`doccode` varchar(50) not null COMMENT '主键',


`businessmodel` varchar(450) COMMENT '业务模式',


`businesstype` varchar(450)  COMMENT '业务性质',


`transporttype` varchar(50) COMMENT '运输类型',


......


`bookingguid` varchar(50) COMMENT '操作编号',


PRIMARY KEY (`doccode`) NOT ENFORCED


) WITH (


'connector' = 'postgres-cdc',


'hostname' = 'database server IP address',
'port' = 'port number',
'username' = 'username',
'password' = 'password',
'database-name' = 'database name',


'schema-name' = 'dev',


'decoding.plugin.name' = 'wal2json',


'table-name' = 'doc_hdworkdochd',


'debezium.slot.name' = 'hdworkdochdslotname03'


);


CREATE TEMPORARY TABLE `shy_base_enterprise` (


`entguid` varchar(50) not null COMMENT '主键',


`entorgcode` varchar(450) COMMENT '客户编号',


`entnature` varchar(450)  COMMENT '客户类型',


`entfullname` varchar(50) COMMENT '客户名称',


PRIMARY KEY (`entguid`,`entorgcode`) NOT ENFORCED


) WITH (


'connector' = 'postgres-cdc',


'hostname' = 'database server IP address',
'port' = 'port number',
'username' = 'username',
'password' = 'password',
'database-name' = 'database name',


'schema-name' = 'dev',


'decoding.plugin.name' = 'wal2json',


'table-name' = 'base_enterprise',


'debezium.snapshot.mode' = 'never', -- incremental sync only (omit this property for full + incremental)


'debezium.slot.name' = 'base_enterprise_slotname03'


);


-- Create the corresponding target table in the paimon ods layer based on the source table structure


CREATE TABLE IF NOT EXISTS ods.`ods_shy_jh_doc_hdworkdochd` (


`o_year` BIGINT NOT NULL COMMENT '分区字段',


`create_date` timestamp NOT NULL COMMENT '创建时间',


PRIMARY KEY (`o_year`, `doccode`) NOT ENFORCED


) PARTITIONED BY (`o_year`)


WITH (


'changelog-producer.compaction-interval' = '2m'


) LIKE `shy_doc_hdworkdochd` (EXCLUDING CONSTRAINTS EXCLUDING OPTIONS);


CREATE TABLE IF NOT EXISTS ods.`ods_shy_base_enterprise` (


`create_date` timestamp NOT NULL COMMENT '创建时间',


PRIMARY KEY (`entguid`,`entorgcode`) NOT ENFORCED


)


WITH (


'changelog-producer.compaction-interval' = '2m'


) LIKE `shy_base_enterprise` (EXCLUDING CONSTRAINTS EXCLUDING OPTIONS);


-- Set the job name; the job writes the source table data into the corresponding paimon table in real time


SET 'pipeline.name' = 'ods_doc_hdworkdochd';


INSERT INTO


ods.`ods_shy_jh_doc_hdworkdochd`


SELECT


*


,YEAR(`docdate`) AS `o_year`


,TO_TIMESTAMP(CONVERT_TZ(cast(CURRENT_TIMESTAMP as varchar), 'UTC', 'Asia/Shanghai')) AS `create_date`


FROM


`shy_doc_hdworkdochd` where `docdate` is not null and `docdate` > '2023-01-01';


SET 'pipeline.name' = 'ods_shy_base_enterprise';


INSERT INTO


ods.`ods_shy_base_enterprise`


SELECT


*


,TO_TIMESTAMP(CONVERT_TZ(cast(CURRENT_TIMESTAMP as varchar), 'UTC', 'Asia/Shanghai')) AS `create_date`


FROM


`shy_base_enterprise` where entorgcode is not null and entorgcode <> '';


-- MySQL database example


CREATE TEMPORARY TABLE `doc_order` (


`id` BIGINT NOT NULL COMMENT '主键',


`order_no` varchar(50) NOT NULL COMMENT '订单号',


`business_no` varchar(50) COMMENT 'OMS服务号',


......


`is_deleted` int COMMENT '是否作废',


PRIMARY KEY (`id`) NOT ENFORCED


) WITH (


'connector' = 'mysql-cdc',


'hostname' = 'database server address',
'port' = 'port number',
'username' = 'username',
'password' = 'password',
'database-name' = 'database name',


'table-name' = 'doc_order'


);


-- Create the corresponding target table in the paimon ods layer based on the source table structure


CREATE TABLE IF NOT EXISTS ods.`ods_bondexsea_doc_order` (


`o_year` BIGINT NOT NULL COMMENT '分区字段',


`create_date` timestamp NOT NULL COMMENT '创建时间',


PRIMARY KEY (`o_year`, `id`) NOT ENFORCED


) PARTITIONED BY (`o_year`)


WITH (


'changelog-producer.compaction-interval' = '2m'


) LIKE `doc_order` (EXCLUDING CONSTRAINTS EXCLUDING OPTIONS);


-- Set the job name; the job writes the source table data into the corresponding paimon table in real time


SET 'pipeline.name' = 'ods_bondexsea_doc_order';


INSERT INTO


ods.`ods_bondexsea_doc_order`


SELECT


*


,YEAR(`gmt_create`) AS `o_year`


,TO_TIMESTAMP(CONVERT_TZ(cast(CURRENT_TIMESTAMP as varchar), 'UTC', 'Asia/Shanghai')) AS `create_date`


FROM `doc_order` where gmt_create > '2023-01-01';


-- SQL Server database example


CREATE TEMPORARY TABLE `OrderHAWB` (


`HBLIndex` varchar(50) NOT NULL COMMENT '主键',


`CustomerNo` varchar(50) COMMENT '客户编号',


......


`CreateOPDate` timestamp COMMENT '制单日期',


PRIMARY KEY (`HBLIndex`) NOT ENFORCED


) WITH (


'connector' = 'sqlserver-cdc',


'hostname' = 'database server address',
'port' = 'port number',
'username' = 'username',
'password' = 'password',
'database-name' = 'database name',


'schema-name' = 'dbo',


-- 'debezium.snapshot.mode' = 'initial' -- extract both full and incremental data


'scan.startup.mode' = 'latest-offset', -- extract incremental data only


'table-name' = 'OrderHAWB'


);


-- Create the corresponding target table in the paimon ods layer based on the source table structure


CREATE TABLE IF NOT EXISTS ods.`ods_airsea_airfreight_orderhawb` (


`o_year` BIGINT NOT NULL COMMENT '分区字段',


`create_date` timestamp NOT NULL COMMENT '创建时间',


PRIMARY KEY (`o_year`, `HBLIndex`) NOT ENFORCED


) PARTITIONED BY (`o_year`)


WITH (


'changelog-producer.compaction-interval' = '2m'


) LIKE `OrderHAWB` (EXCLUDING CONSTRAINTS EXCLUDING OPTIONS);


-- Set the job name; the job writes the source table data into the corresponding paimon table in real time


SET 'pipeline.name' = 'ods_airsea_airfreight_orderhawb';


INSERT INTO


ods.`ods_airsea_airfreight_orderhawb`


SELECT


RTRIM(`HBLIndex`) as `HBLIndex`


......


,`CreateOPDate`


,YEAR(`CreateOPDate`) AS `o_year`


,TO_TIMESTAMP(CONVERT_TZ(cast(CURRENT_TIMESTAMP as varchar), 'UTC', 'Asia/Shanghai')) AS `create_date`


FROM `OrderHAWB` where CreateOPDate > '2023-01-01';

The effect of writing business table data to paimon ods table in real time is as follows:

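To spot-check what has landed in the ods layer outside the streaming job, a quick batch query in the SQL client works; a minimal sketch using one of the tables defined above:

-- Switch the session to batch mode and count the synchronized rows per partition
SET 'execution.runtime-mode' = 'batch';

SELECT o_year, COUNT(*) AS cnt
FROM ods.ods_bondexsea_doc_order
GROUP BY o_year;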

2. Write the ods-layer table data into the dwd layer. In essence this merges the related ods-layer business tables into dwd. The main processing here is the count_order field: because the source tables contain both logical and physical deletions, a plain count() would give wrong results, so the order count is computed with a sum aggregation instead. Each reference_no contributes a count_order of 1; if the record is logically voided it is set to 0 in SQL (a sketch follows below), while physical deletions are handled automatically by Paimon.

For dimension tables we directly reuse the dim tables already maintained in Doris; their update frequency is low, so they were not re-implemented in Paimon.
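A minimal sketch of how the count_order flag described above can be derived in the dwd SELECT. The column names come from the DDL in this article (is_deleted in the MySQL source, count_order in the wide table), but the exact voiding condition is an assumption:

-- Illustrative only: every order contributes 1, a logically voided order contributes 0,
-- so the downstream sum() aggregation yields the correct order count
SELECT
  order_no AS reference_no,
  CASE WHEN is_deleted = 1 THEN 0 ELSE 1 END AS count_order
FROM ods.ods_bondexsea_doc_order;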

-- Create the wide table in the paimon dwd layer


CREATE TABLE IF NOT EXISTS dwd.`dwd_business_order` (


`reference_no` varchar(50) NOT NULL COMMENT '委托单号主键',


`bondex_shy_flag` varchar(8) NOT NULL COMMENT '区分',


`is_server_item` int NOT NULL COMMENT '是否已经关联订单',


`order_type_name` varchar(50) NOT NULL COMMENT '业务分类',


`consignor_date` DATE COMMENT '统计日期',


`consignor_code` varchar(50) COMMENT '客户编号',


`consignor_name` varchar(160) COMMENT '客户名称',


`sales_code` varchar(32) NOT NULL COMMENT '销售编号',


`sales_name` varchar(200) NOT NULL COMMENT '销售名称',


`delivery_center_op_id` varchar(32) NOT NULL COMMENT '交付编号',


`delivery_center_op_name` varchar(200) NOT NULL COMMENT '交付名称',


`pol_code` varchar(100) NOT NULL COMMENT '起运港代码',


`pot_code` varchar(100) NOT NULL COMMENT '中转港代码',


`port_of_dest_code` varchar(100) NOT NULL  COMMENT '目的港代码',


`is_delete` int not NULL COMMENT '是否作废',


`order_status` varchar(8) NOT NULL COMMENT '订单状态',


`count_order` BIGINT not NULL COMMENT '订单数',


`o_year` BIGINT NOT NULL COMMENT '分区字段',


`create_date` timestamp NOT NULL COMMENT '创建时间',


PRIMARY KEY (`o_year`,`reference_no`,`bondex_shy_flag`) NOT ENFORCED


) PARTITIONED BY (`o_year`)


WITH (


-- 2 buckets per partition


'bucket' = '2',


'changelog-producer' = 'full-compaction',


'snapshot.time-retained' = '2h',


'changelog-producer.compaction-interval' = '2m'


);


-- Set the job name; merge the related ods-layer business tables into the dwd layer


SET 'pipeline.name' = 'dwd_business_order';


INSERT INTO


dwd.`dwd_business_order`


SELECT


o.doccode,


......,


YEAR (o.docdate) AS o_year


,TO_TIMESTAMP(CONVERT_TZ(cast(CURRENT_TIMESTAMP as varchar), 'UTC', 'Asia/Shanghai')) AS `create_date`


FROM


ods.ods_shy_jh_doc_hdworkdochd o


INNER JOIN ods.ods_shy_base_enterprise en ON o.businessguid = en.entguid


LEFT JOIN dim.dim_hhl_user_code sales ON o.salesguid = sales.USER_GUID


LEFT JOIN dim.dim_hhl_user_code op ON o.bookingguid = op.USER_GUID


UNION ALL


SELECT


business_no,


......,


YEAR ( gmt_create ) AS o_year


,TO_TIMESTAMP(CONVERT_TZ(cast(CURRENT_TIMESTAMP as varchar), 'UTC', 'Asia/Shanghai')) AS `create_date`


FROM


ods.ods_bondexsea_doc_order


UNION ALL


SELECT


  HBLIndex,


 ......,


  YEAR ( CreateOPDate ) AS o_year


,TO_TIMESTAMP(CONVERT_TZ(cast(CURRENT_TIMESTAMP as varchar), 'UTC', 'Asia/Shanghai')) AS `create_date`


FROM


  ods.`ods_airsea_airfreight_orderhawb`


;

In the Flink UI you can see the ods data being joined in real time by Paimon and written into the dwd_business_order table:


3. Lightly aggregate the dwd-layer data and write it into the dwm layer.

In the dwm.`dwm_business_order_count` table, the aggregated fields are summed by primary key; the sum_orderCount field holds the aggregated result. Paimon automatically handles physically deleted data in the sum.

-- Create the dwm-layer light aggregation table; aggregate order counts by date, salesperson, operator, business category, customer, port of loading, and port of destination


CREATE TABLE IF NOT EXISTS dwm.`dwm_business_order_count` (


`l_year` BIGINT NOT NULL COMMENT '统计年',


`l_month` BIGINT NOT NULL COMMENT '统计月',


`l_date` DATE NOT NULL  COMMENT '统计日期',


`bondex_shy_flag` varchar(8) NOT NULL COMMENT '区分',


`order_type_name` varchar(50) NOT NULL COMMENT '业务分类',


`is_server_item` int NOT NULL COMMENT '是否已经关联订单',


`customer_code` varchar(50) NOT NULL COMMENT '客户编号',


`sales_code` varchar(50) NOT NULL COMMENT '销售编号',


`delivery_center_op_id` varchar(50) NOT NULL COMMENT '交付编号',


`pol_code` varchar(100) NOT NULL COMMENT '起运港代码',


`pot_code` varchar(100) NOT NULL COMMENT '中转港代码',


`port_of_dest_code` varchar(100) NOT NULL COMMENT '目的港代码',


`customer_name` varchar(200) NOT NULL COMMENT '客户名称',


`sales_name` varchar(200) NOT NULL COMMENT '销售名称',


`delivery_center_op_name` varchar(200) NOT NULL COMMENT '交付名称',


`sum_orderCount` BIGINT NOT NULL COMMENT '订单数',


`create_date` timestamp NOT NULL COMMENT '创建时间',


PRIMARY KEY (`l_year`, `l_month`,`l_date`,`order_type_name`,`bondex_shy_flag`,`is_server_item`,`customer_code`,`sales_code`,`delivery_center_op_id`,`pol_code`,`pot_code`,`port_of_dest_code`) NOT ENFORCED


) WITH (


'changelog-producer' = 'full-compaction',


  'changelog-producer.compaction-interval' = '2m',


  'merge-engine' = 'aggregation', -- use the aggregation merge engine to compute sum


  'fields.sum_orderCount.aggregate-function' = 'sum',


  'fields.create_date.ignore-retract'='true',


  'fields.sales_name.ignore-retract'='true',


  'fields.customer_name.ignore-retract'='true',


  'snapshot.time-retained' = '2h',


'fields.delivery_center_op_name.ignore-retract'='true'


);


-- Set the job name


SET 'pipeline.name' = 'dwm_business_order_count';


INSERT INTO


dwm.`dwm_business_order_count`


SELECT


YEAR(o.`consignor_date`) AS `l_year`


,MONTH(o.`consignor_date`) AS `l_month`


......,


,TO_TIMESTAMP(CONVERT_TZ(cast(CURRENT_TIMESTAMP as varchar), 'UTC', 'Asia/Shanghai')) AS create_date


FROM


dwd.`dwd_business_order` o


;

The Flink UI shows dwd_business_order data being aggregated and written into dwm_business_order_count:


4. Aggregate the dwm-layer data into the dws layer; the dws layer summarizes over fewer dimensions.

-- Create a table aggregating the day's order count by operator and business type
CREATE TABLE IF NOT EXISTS dws.`dws_business_order_count_op` (
  `l_year` BIGINT NOT NULL COMMENT '统计年',
  `l_month` BIGINT NOT NULL COMMENT '统计月',
  `l_date` DATE NOT NULL  COMMENT '统计日期',
  `order_type_name` varchar(50) NOT NULL COMMENT '业务分类',
  `delivery_center_op_id` varchar(50) NOT NULL COMMENT '交付编号',
  `delivery_center_op_name` varchar(200) NOT NULL COMMENT '交付名称',
  `sum_orderCount` BIGINT NOT NULL COMMENT '订单数',
  `create_date` timestamp NOT NULL COMMENT '创建时间',
  PRIMARY KEY (`l_year`, `l_month`,`l_date`,`order_type_name`,`delivery_center_op_id`) NOT ENFORCED
) WITH (
  'merge-engine' = 'aggregation', -- use the aggregation merge engine to compute sum
  'fields.sum_orderCount.aggregate-function' = 'sum',
  'fields.create_date.ignore-retract'='true',
  'snapshot.time-retained' = '2h',
  'fields.delivery_center_op_name.ignore-retract'='true'
);
-- Set the job name
SET 'pipeline.name' = 'dws_business_order_count_op';
INSERT INTO
  dws.`dws_business_order_count_op`
SELECT
  o.`l_year`
  ,o.`l_month`
  ,o.`l_date`
  ,o.`order_type_name`
  ,o.`delivery_center_op_id`
  ,o.`delivery_center_op_name`
  ,o.`sum_orderCount`
  ,TO_TIMESTAMP(CONVERT_TZ(cast(CURRENT_TIMESTAMP as varchar), 'UTC', 'Asia/Shanghai')) AS create_date
FROM
  dwm.`dwm_business_order_count` o
;

The Flink UI shows the dwm_business_order_count data being aggregated and written into dws_business_order_count_op:


Overall Data Flow Example


[Figures: sample records at each stage: source table, paimon-ods, paimon-dwd, paimon-dwm, paimon-dws]

A special reminder: when extracting from SQL Server, a very large source table will be locked during a large full extraction. Use incremental extraction whenever the business allows. For a full extraction from SQL Server, you can take a detour: first import the full data from SQL Server into MySQL, load it from MySQL into paimon-ods, and then use SQL Server only for incremental extraction.

04

Troubleshooting Analysis

1. Inaccurate calculation of aggregated data

Scenario: SQL Server CDC collects data into a Paimon table.

Description:

dwd table:

'changelog-producer' = 'input'

ads table:

'merge-engine' = 'aggregation', -- use aggregation to calculate sum

'fields.sum_amount.aggregate-function' = 'sum'

If the ADS-layer aggregation table uses agg sum while the dwd data stream produces only UPDATE_AFTER messages without UPDATE_BEFORE, the aggregation goes wrong. For example, if the upstream source table updates a value from 10 to 30, the dwd-layer data becomes 30 and the ADS aggregation should also become 30; but because the change arrives as an append, the aggregated result becomes 10 + 30 = 40, which is wrong.

Solution:

By specifying 'changelog-producer' = 'full-compaction', Table Store will compare the results between full compactions and produce the differences as changelog. The latency of changelog is affected by the frequency of full compactions.

By specifying changelog-producer.compaction-interval table property (default value 30min), users can define the maximum interval between two full compactions to ensure latency. This table property does not affect normal compactions and they may still be performed once in a while by writers to reduce reader costs.

This solves the problem above, but a new one arises: the default changelog-producer.compaction-interval is 30 minutes, which means an upstream change takes up to 30 minutes to become visible to ADS queries. In production we found that lowering the compaction interval to 1 or 2 minutes makes the inaccurate ADS-layer aggregation disappear.

'changelog-producer.compaction-interval' = '2m'

It is necessary to configure table.exec.sink.upsert-materialize=none when writing to the Flink Table Store to avoid the Upsert flow, so as to ensure that the complete changelog can be saved in the Flink Table Store and prepare for subsequent stream read operations.

set 'table.exec.sink.upsert-materialize' = 'none'
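Putting the two fixes together, a minimal sketch of the relevant session setting and table options; the table and field names here are illustrative, not the production DDL:

-- Session level: keep the complete changelog, no upsert materialization
SET 'table.exec.sink.upsert-materialize' = 'none';

-- dwd table: produce a changelog via full compaction at a short interval
CREATE TABLE IF NOT EXISTS dwd.dwd_example (
  id BIGINT,
  amount BIGINT,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'changelog-producer' = 'full-compaction',
  'changelog-producer.compaction-interval' = '2m'
);

-- ads table: the sum aggregation now receives correct retractions from the dwd changelog
CREATE TABLE IF NOT EXISTS ads.ads_example (
  id BIGINT,
  sum_amount BIGINT,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'merge-engine' = 'aggregation',
  'fields.sum_amount.aggregate-function' = 'sum'
);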

2. Identical sequence.field values cause the dwd detail wide table to miss updates

Scenario: MySQL CDC collects data into a Paimon table.

Description:

An UPDATE is executed on the MySQL source. After the modification succeeds, the dwd_orders table is synchronized correctly:


However, the dwd_enriched_orders table is not synchronized; starting a streaming read to inspect the data shows that no data is flowing:


Solution:

The investigation found that the cause was the parameter 'sequence.field' = 'o_orderdate' (o_orderdate is used to generate the sequence id, and when records with the same primary key are merged, the record with the larger sequence id wins). Because o_orderdate does not change when the price is modified, the sequence.field values are identical and the merge order becomes non-deterministic: ROW1 and ROW2 carry the same o_orderdate, so one of them is picked at random on update. The parameter can simply be removed; without it, a sequence number is generated automatically following the input order, which does not affect the synchronization result.
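For illustration, the difference is only in the table options; a minimal sketch with placeholder table and column names:

-- Problematic: o_orderdate does not change on an update, the sequence ids tie,
-- and the merge picks an arbitrary row
CREATE TABLE IF NOT EXISTS dwd.dwd_enriched_orders_bad (
  order_id BIGINT,
  price DECIMAL(10, 2),
  o_orderdate DATE,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'sequence.field' = 'o_orderdate'
);

-- Fixed: drop sequence.field and let Paimon order records by input sequence
CREATE TABLE IF NOT EXISTS dwd.dwd_enriched_orders (
  order_id BIGINT,
  price DECIMAL(10, 2),
  o_orderdate DATE,
  PRIMARY KEY (order_id) NOT ENFORCED
);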

3. Aggregate function 'last_non_null_value' does not support retraction

Error message: Caused by: java.lang.UnsupportedOperationException: Aggregate function 'last_non_null_value' does not support retraction, If you allow this function to ignore retraction messages, you can configure 'fields.${field_name}.ignore-retract'='true'.

An explanation can be found in the official documentation:

Only sum supports retraction (UPDATE_BEFORE and DELETE), others aggregate functions do not support retraction.

In other words, apart from sum, the other aggregate functions do not support retraction. To avoid the error when DELETE and UPDATE_BEFORE messages arrive, configure 'fields.${field_name}.ignore-retract'='true' on the affected fields so that retraction messages are ignored:

WITH (
    'merge-engine' = 'aggregation', -- use aggregation to calculate sum
    'fields.sum_orderCount.aggregate-function' = 'sum',
    'fields.create_date.ignore-retract' = 'true' -- the create_date field
);

4. Paimon task interrupted and failed

The task was interrupted abnormally and the pod died.

The Loki logs show akka.pattern.AskTimeoutException: Ask timed out on


java.util.concurrent.TimeoutException: Invocation of [RemoteRpcInvocation(JobMasterGateway.updateTaskExecutionState(TaskExecutionState))] at recipient [akka.tcp://[email protected]:6123/user/rpc/jobmanager_2] timed out. This is usually caused by: 1) Akka failed sending the message silently, due to problems like oversized payload or serialization failures. In that case, you should find detailed error information in the logs. 2) The recipient needs more time for responding, due to problems like slow machines or network jitters. In that case, you can try to increase akka.ask.timeout.

The preliminary judgment is that the akka timeout is triggered by one of the two reasons above, so we adjust the cluster's akka timeout configuration and either split large tasks or increase resources.

Let's first look at how to modify the parameters:

akka.ask.timeout (default: 10s): Timeout used for all futures and blocking Akka calls. If Flink fails due to timeouts then you should try to increase this value. Timeouts can be caused by slow machines or a congested network. The timeout value requires a time-unit specifier (ms/s/min/h/d).

web.timeout (default: 600000): Timeout for asynchronous operations by the web monitor in milliseconds.

Add the following parameters at the end of conf/flink-conf.yaml

akka.ask.timeout: 100s

web.timeout: 1000000

Then manually refresh flink-conf.yaml in streampark to verify whether the parameters are synchronized successfully.

5. Snapshot: no such file or directory

Checkpoints were found to be failing:


The log at the corresponding point in time shows that the snapshot file is missing; the task still shows as running, but the MySQL source data can no longer be written into the paimon ods table.


The root cause of the checkpoint failures: the job is computation-heavy and CPU-intensive, so the TaskManager threads spend all their time in processElement and never get around to performing the checkpoint.

The reason the snapshot cannot be read: Flink cluster resources are insufficient, the Writer and the Committer compete with each other, and an expired, incomplete snapshot is read during full compaction. The community has since fixed this problem:

https://github.com/apache/incubator-paimon/pull/1308


The fix for the checkpoint failures is to increase parallelism and to give the deployment more TaskManager slots and JobManager CPU:

-D kubernetes.jobmanager.cpu=0.8

-D kubernetes.jobmanager.cpu.limit-factor=1

-D taskmanager.numberOfTaskSlots=8

-D jobmanager.adaptive-batch-scheduler.default-source-parallelism=2


In complex real-time tasks, resources can be increased by modifying dynamic parameters.

05

Future plans

  • The self-built data platform bondata is integrating Paimon's metadata, a data index system, data lineage, one-click pipelines and other capabilities to form Haicheng Bangda's data assets, on top of which one-stop data governance will be carried out

  • Next, Doris will be connected through the Trino catalog to provide a single service over both offline and real-time data

  • The Doris + Paimon architecture will be adopted to keep advancing the construction of the group's internal stream-batch unified data warehouse
