Flink SQL CDC goes live in production

01 Project background

The project I am currently working on is a data-intensive and compute-intensive project that is important to the company. It needs to provide efficient and accurate OLAP services as well as flexible, real-time reports. Business data is stored in MySQL and synchronized to the reporting database through master-slave replication. As a group-level company, our data has grown rapidly, and there are now many large tables with tens of millions or even hundreds of millions of rows. To produce complex reports across many dimensions, some of these large tables still need to be joined, so the scale of computation is alarming and the system often cannot respond to requests in time.

With the ever-growing volume of data and the increasing demand for real-time analysis, there was an urgent need to introduce stream computing and make the system real time. It was in this context that our story with Flink SQL CDC began.

02 Solution

In response to the problems of the existing platform, we proposed a solution to make the report data real time, implemented mainly with Flink SQL CDC + Elasticsearch. Flink SQL supports data synchronization in CDC mode: it collects, pre-computes, and synchronizes the full and incremental data in MySQL to Elasticsearch in real time, and Elasticsearch serves as our engine for real-time reports and ad-hoc analysis. The overall architecture of the project is as follows:

 

The specific idea of the real-time report implementation is to use Flink CDC to first read the full snapshot of the data. After the full snapshot has been synchronized, Flink CDC seamlessly switches to MySQL's binlog position and continues consuming the incremental change data, guaranteeing that not a single record is consumed more than once or missed. The full and incremental data read from the bill and order tables is joined with the subject dimension table to enrich the records and pre-aggregated, and the aggregated results are written to Elasticsearch. The front-end pages only need to query Elasticsearch, either with exact matches (terms) or with aggregations (agg) for high-dimensional statistics, to obtain the report data for multiple service centers.

From the overall architecture we can see that Flink SQL and its CDC feature play the core role in our design. We chose Flink SQL CDC over the traditional Canal + Kafka architecture mainly because it has fewer dependent components, low maintenance cost, works out of the box, and is easy to operate. Specifically, Flink SQL CDC integrates collection, computation, and transmission in one tool. The advantages that attracted us are:

① Fewer components to maintain and a simpler pipeline;

② Lower end-to-end latency;

③ Lower maintenance and development costs;

④ Exactly-once reads and computation (we are an accounting system, so data consistency is very important);

⑤ No intermediate landing of data, which reduces storage costs;

⑥ Support for reading both the full snapshot and the incremental stream;

For an introduction and tutorial on Flink SQL CDC, you can watch the related videos released by the Apache Flink community:

https://www.bilibili.com/video/BV1zt4y1D7kt/

The project uses the mysql-cdc connector provided by flink-cdc-connectors, a Flink source that supports reading a MySQL database in full and then incrementally. Before scanning the whole table it first acquires a global read lock, records the binlog position at that moment, and then releases the global read lock. It then scans the whole table, and once the full snapshot has been read it continues reading the incremental change records from the previously recorded binlog position. The read lock is therefore very lightweight and held only for a very short time, so it has little impact on online business. For more information, please refer to the flink-cdc-connectors project: https://github.com/ververica/flink-cdc-connectors.

03 Production environment and current status

In the production environment we built a distributed Hadoop + Flink + Elasticsearch environment. Jobs run in Flink on YARN per-job mode, use RocksDB as the state backend, and persist checkpoints to HDFS, whose fault tolerance ensures that no checkpoint data is lost. We submit jobs with the SQL Client, and all jobs are written in pure SQL, without a single line of Java code.
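For reference, a minimal sketch of the relevant flink-conf.yaml entries is shown below; the HDFS path is a placeholder rather than our actual value.

# flink-conf.yaml (excerpt): state backend and checkpoint persistence
state.backend: rocksdb                             # use RocksDB as the state backend
state.checkpoints.dir: hdfs:///flink/checkpoints   # persist checkpoint data to HDFS
execution.target: yarn-per-job                     # run each job as its own YARN application (per-job mode)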

At present, three Flink CDC based jobs have been launched and have been running stably for two weeks. The order receipt and bill receipt data generated by the business is aggregated and written to Elasticsearch in real time, and the output is accurate. We are now also using Flink SQL CDC to make other reports real time, replacing the old business system and making the system's data more timely.

04 Implementation

① Go to flink/bin and start the SQL CLI client with ./sql-client.sh embedded, as sketched below. 
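A minimal sketch of this step, assuming the current directory is the Flink installation directory:

cd flink/bin
./sql-client.sh embedded   # start the Flink SQL CLI in embedded mode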

② Use DDL to create the Flink source and sink tables. The tables created here do not have to contain the same number of fields, in the same order, as the MySQL tables; you only need to select the fields the business requires from the MySQL table, and their types must match.

-- Create the bill receipt source table in Flink
CREATE TABLE bill_info (
  billCode STRING,
  serviceCode STRING,
  accountPeriod STRING,
  subjectName STRING,
  subjectCode STRING,
  occurDate TIMESTAMP,
  amt DECIMAL(11,2),
  status STRING,
  proc_time AS PROCTIME()      -- processing-time attribute, required when joining a dimension table
) WITH (
  'connector' = 'mysql-cdc',   -- connector type
  'hostname' = '******',       -- MySQL host
  'port' = '3307',             -- MySQL port
  'username' = '******',       -- MySQL user
  'password' = '******',       -- MySQL password
  'database-name' = 'cdc',     -- database name
  'table-name' = '***'         -- table name
);

-- Create the order receipt source table in Flink
CREATE TABLE order_info (
  orderCode STRING,
  serviceCode STRING,
  accountPeriod STRING,
  subjectName STRING,
  subjectCode STRING,
  occurDate TIMESTAMP,
  amt DECIMAL(11,2),
  status STRING,
  proc_time AS PROCTIME()      -- processing-time attribute, required when joining a dimension table
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = '******',
  'port' = '3307',
  'username' = '******',
  'password' = '******',
  'database-name' = 'cdc',
  'table-name' = '***'
);

-- Create the subject dimension table
CREATE TABLE subject_info (
  code VARCHAR(32) NOT NULL,
  name VARCHAR(64) NOT NULL,
  PRIMARY KEY (code) NOT ENFORCED   -- declare the primary key
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://xxxx:xxxx/spd?useSSL=false&autoReconnect=true',
  'driver' = 'com.mysql.cj.jdbc.Driver',
  'table-name' = '***',
  'username' = '******',
  'password' = '******',
  'lookup.cache.max-rows' = '3000',
  'lookup.cache.ttl' = '10s',
  'lookup.max-retries' = '3'
);

-- Create the income distribution result table and write the results to Elasticsearch
CREATE TABLE income_distribution (
  serviceCode STRING,
  accountPeriod STRING,
  subjectCode STRING,
  subjectName STRING,
  amt DECIMAL(13,2),
  PRIMARY KEY (serviceCode, accountPeriod, subjectCode) NOT ENFORCED
) WITH (
  'connector' = 'elasticsearch-7',
  'hosts' = 'http://xxxx:9200',
  'index' = 'income_distribution',
  'sink.bulk-flush.backoff.strategy' = 'EXPONENTIAL'
);

The DDL above creates the bill receipt source table, the order receipt source table, the subject dimension table, and the Elasticsearch result table. After the tables are created, Flink does not immediately start synchronizing the MySQL data; synchronization only starts when an INSERT job is submitted, and Flink itself does not store the data. Our first job computes the income distribution. Its data comes from the two MySQL tables bill_info and order_info; both the bill receipt table and the order receipt table need to join the dimension table to obtain the latest Chinese name of the subject, then group by service center, account period, subject code, and subject name to compute the sum of the amount received. The DML for the income distribution is as follows:

INSERT INTO income_distribution
SELECT t1.serviceCode, t1.accountPeriod, t1.subjectCode, t1.subjectName, SUM(amt) AS amt
FROM (
  SELECT b.serviceCode, b.accountPeriod, b.subjectCode, s.name AS subjectName, SUM(amt) AS amt
  FROM bill_info AS b
  JOIN subject_info FOR SYSTEM_TIME AS OF b.proc_time s ON b.subjectCode = s.code
  GROUP BY b.serviceCode, b.accountPeriod, b.subjectCode, s.name
  UNION ALL
  SELECT b.serviceCode, b.accountPeriod, b.subjectCode, s.name AS subjectName, SUM(amt) AS amt
  FROM order_info AS b
  JOIN subject_info FOR SYSTEM_TIME AS OF b.proc_time s ON b.subjectCode = s.code
  GROUP BY b.serviceCode, b.accountPeriod, b.subjectCode, s.name
) AS t1
GROUP BY t1.serviceCode, t1.accountPeriod, t1.subjectCode, t1.subjectName;

 

A dimension table JOIN in Flink SQL is written differently from a dual-stream JOIN. For a dimension table you need to add a processing-time attribute on the Flink source table (proc_time AS PROCTIME()) and use the FOR SYSTEM_TIME AS OF syntax to query the temporal table when joining, which means the join always looks up the latest version of the dimension table. For the usage of dimension table joins, please refer to: https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/table/streaming/joins.html.

③ After executing the above statements in the SQL Client, YARN creates a Flink cluster to run the job. Users can view all submitted jobs on Hadoop and open Flink's Web UI to see the job details. The following shows the status of all the Hadoop jobs.

 

④ After the job is submitted, Flink SQL CDC starts scanning the specified MySQL tables. During this period Flink also takes checkpoints, so the failed-checkpoint tolerance and restart strategy need to be configured as described in pitfall 4 below. Once the data is read into Flink, the job logic is computed in a streaming fashion and the aggregated results are written to Elasticsearch (the sink side) in real time. This is equivalent to using Flink to maintain a real-time materialized view over the MySQL tables and storing that view in Elasticsearch. Running GET /income_distribution/_search { "query": { "match_all": {} } } in Elasticsearch shows the income distribution output, as shown in the figure below:

 

From the results in the figure, it can be seen that the aggregation results are calculated in real time and written into Elasticsearch.

05 Pitfalls and lessons learned

1. The Flink jobs originally ran in standalone session mode, and submitting multiple Flink jobs caused jobs to fail with errors.

  • Reason: In standalone session mode, the tasks of multiple jobs share the same JVM, which can cause instability. In addition, the logs of multiple jobs end up mixed together in one TaskManager, which makes troubleshooting harder.

  • Solution: Start the jobs in YARN per-job mode, which provides better isolation.

2. Running SELECT on the Elasticsearch table reports the following error:

 

  • Reason: The Elasticsearch connector currently only supports sink, not source, so you cannot SELECT from an Elasticsearch table.

3. After modifying the default parallelism in flink-conf.yaml, the Web UI still shows the job's parallelism as 1; the change does not take effect.

 

  • Solution: When using the SQL Client, the parallelism configured in sql-client-defaults.yaml has higher priority. Either change the parallelism there, or delete the parallelism entry from sql-client-defaults.yaml so that flink-conf.yaml takes effect; the latter is recommended. The relevant section of sql-client-defaults.yaml is sketched below.
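For illustration, the relevant part of sql-client-defaults.yaml looks roughly like the following (the values are only an example); deleting the parallelism line lets parallelism.default from flink-conf.yaml take effect.

# sql-client-defaults.yaml (excerpt)
execution:
  type: streaming
  parallelism: 1        # overrides parallelism.default from flink-conf.yaml; delete this line to fall back to flink-conf.yaml
  max-parallelism: 128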

 

4. While the Flink job is scanning the full MySQL table, checkpoints time out and the job fails over, as shown in the following figure:

 

 

  • Reason: Flink CDC needs hours to scan the full table (our receipt table has tens of millions of rows, and the scan is also slowed down by back pressure from the downstream aggregation), and during the full table scan there is no offset to record, which means no checkpoint can be taken. However, the Flink framework triggers checkpoints at a fixed interval regardless, so the mysql-cdc source uses a somewhat tricky approach: while the full table is being scanned, any in-flight checkpoint simply waits and eventually times out. A timed-out checkpoint is still counted as a failed checkpoint, which under the default configuration triggers Flink's failover mechanism, and the default failover mechanism does not restart the job. This is what causes the phenomenon above.

  • Solution: Configure the number of tolerable failed checkpoints and the restart strategy in flink-conf.yaml, as follows:

execution.checkpointing.interval: 10min                       # checkpoint interval
execution.checkpointing.tolerable-failed-checkpoints: 100     # number of tolerable failed checkpoints
restart-strategy: fixed-delay                                 # restart strategy
restart-strategy.fixed-delay.attempts: 2147483647             # number of restart attempts

The Flink community also has an issue (FLINK-18578) to let a source actively decline checkpoints; once that mechanism is available, this problem can be solved more gracefully.

5. How does Flink enable YARN's per-job mode?

  • Solution: Configure execution.target: yarn-per-job in flink-conf.yaml.

6. After creating a table in the SQL Client, the table cannot be queried when the SQL Client is started from another node.

  • Reason: The SQL Client's default catalog is in-memory rather than persistent, so this is expected behavior; the catalog is empty every time the client starts.

7. Elasticsearch reports the following error when the job is running:

 

Caused by: org.apache.flink.elasticsearch7.shaded.org.elasticsearch.ElasticsearchException: Elasticsearch exception [type=illegal_argument_exception, reason=mapper [amt] cannot be changed from type [long] to [float]]

 

  • Reason: The amt column in the database is of type decimal, and the amt field in the DDL for the Elasticsearch sink is also decimal. If the first amt value written to Elasticsearch happens to be an integer, such as 10, it is written as a long, and the Elasticsearch client auto-creates the index with the amt field mapped as long. When a later record has a non-integer amt such as 10.1, writing it to Elasticsearch then fails with a type mismatch.

 

  • Solution: Create the Elasticsearch index and mapping manually and map the decimal fields as scaled_float; the field type in the DDL can remain decimal. A sketch of such a mapping is shown below.
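As an illustrative sketch of such a mapping (the keyword types and the scaling_factor of 100, i.e. two decimal places, are our assumptions rather than values prescribed by the connector):

PUT /income_distribution
{
  "mappings": {
    "properties": {
      "serviceCode":   { "type": "keyword" },
      "accountPeriod": { "type": "keyword" },
      "subjectCode":   { "type": "keyword" },
      "subjectName":   { "type": "keyword" },
      "amt":           { "type": "scaled_float", "scaling_factor": 100 }
    }
  }
}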

8. The mysql-cdc source reports the following error while the job is running:

 

  • Reason: Another table in the same database had its columns altered; the CDC source received the ALTER DDL statement, failed to parse it, and threw the exception.

  • Solution: This problem has been fixed in the latest version of flink-cdc-connectors (unparsable DDL statements are now skipped). Upgrade the connector jar to version 1.1.0, flink-sql-connector-mysql-cdc-1.1.0.jar, and replace the old jar under flink/lib.

9. The full table scan stage is slow, and the Web UI shows the following:

 

  • Reason: A slow full table scan is not necessarily a problem of the CDC source; it may be that a downstream node processes data too slowly and back-pressures the source.

 

  • Solution: Using the Web UI's back pressure tool, we found that the bottleneck was mainly the aggregation node. By configuring the MiniBatch-related parameters in sql-client-defaults.yaml and enabling the distinct aggregation optimization (our aggregation uses count distinct), the scan efficiency of the job improved greatly, from the original 10 hours to 1 hour. For the performance tuning parameters, please refer to: https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/table/tuning/streaming_aggregation_optimization.html.

configuration:
  table.exec.mini-batch.enabled: true
  table.exec.mini-batch.allow-latency: 2s
  table.exec.mini-batch.size: 5000
  table.optimizer.distinct-agg.split.enabled: true

10. While the CDC source was scanning the MySQL table, data could not be inserted into that table.

  • Reason: The MySQL user being used was not granted the RELOAD privilege, so the global read lock (FLUSH TABLES WITH READ LOCK) could not be acquired and the CDC source fell back to a table-level read lock. A table-level read lock is only released after the full table scan finishes, which is why the lock was held for so long and blocked other business writes.

  • Solution: Grant the RELOAD privilege to the MySQL user (see the sketch below). For the full list of required privileges, see the documentation: https://github.com/ververica/flink-cdc-connectors/wiki/mysql-cdc-connector#setup-mysql-server. If RELOAD cannot be granted for some reason, you can instead set 'debezium.snapshot.locking.mode' = 'none' to avoid all locking, but note that this is only safe if the table schema does not change during the snapshot.
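A sketch of the grant statement; the user name and host are placeholders, and the privilege list follows the connector documentation linked above:

-- Grant the privileges required by the mysql-cdc connector (RELOAD enables the lightweight global read lock)
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'flink_user'@'%';
FLUSH PRIVILEGES;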

11. When multiple jobs share the same source table and the server id is not changed, data is lost while reading.

  • Reason: MySQL binlog synchronization works by having the CDC source pretend to be a slave of the MySQL cluster (using the configured server id as its unique id) and then pull binlog data from MySQL. If multiple slaves in the same MySQL cluster use the same id, the pulled data gets mixed up.

  • Solution: By default a server id is generated randomly, which risks collisions, so it is recommended to override the server id per query with a dynamic table option (table hint), as follows:

SELECT * FROM bill_info /*+ OPTIONS('server-id'='123456') */;

12. When submitting the job, YARN accepted the application but the job did not start:

 

 

  • Reason: The queue's resource limit for ApplicationMasters was exceeded. The default maximum AM memory is 30 GB (total cluster memory) * 0.1 = 3 GB, and each JobManager requests 2 GB, so when the second job was submitted there were not enough AM resources.

  • Solution: Increase the AM resource limit by configuring yarn.scheduler.capacity.maximum-am-resource-percent in capacity-scheduler.xml. It is the fraction of total resources that ApplicationMasters may use; the default is 0.1, and we changed it to 0.3 (tune it according to the servers' capacity). A sketch of the change is shown below.
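A sketch of the change in capacity-scheduler.xml; 0.3 is the value we used and should be tuned to the cluster:

<!-- capacity-scheduler.xml (excerpt): share of cluster resources that ApplicationMasters may use -->
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.3</value>
</property>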

13. The AM process cannot start and keeps getting killed.

 

  • Reason: The log showed "386.9 MB of 1 GB physical memory used; 2.1 GB of 2.1 GB virtual memory used". The default physical memory is 1 GB, allocated dynamically, of which 386.9 MB was used. Virtual memory is physical memory x 2.1, i.e. 1 GB x 2.1 ≈ 2.1 GB, and that 2.1 GB of virtual memory was exhausted; when virtual memory runs out, the AM container is killed.

  • Solution: There are two options: either increase the value of yarn.nodemanager.vmem-pmem-ratio, or set yarn.nodemanager.vmem-check-enabled=false to turn off the virtual memory check (sketched below). Reference: https://blog.csdn.net/lzxlfly/article/details/89175452.
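A sketch of the two options in yarn-site.xml; only one of them is needed, and the ratio of 4 is an illustrative value (the default is 2.1):

<!-- Option 1: raise the virtual-to-physical memory ratio -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>
<!-- Option 2: disable the virtual memory check entirely -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>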

06 Summary

To improve the availability and real-timeliness of the report service, we initially adopted a Canal + Kafka + Flink solution, but found that it required writing a lot of Java code, handling the conversion between DataStream and Table, and managing the binlog position ourselves, which made development fairly difficult. In addition, keeping the two extra components Kafka and Canal running stably is no small burden for our small team. Since our company already had Flink-based jobs running online, adopting Flink SQL CDC was a natural choice. With the Flink SQL CDC based solution we only need to write SQL, without a single line of Java code, to build the real-time pipeline and compute the real-time reports. It is very simple and easy to use for us, and the stability and performance in production have also satisfied us.

 

We are vigorously promoting Flink SQL CDC within the company and are also working on converting several other real-time pipelines. Many thanks to the open source community for providing us with such a powerful tool. We also hope Flink CDC becomes even more powerful and supports more databases and features. Thanks again to Teacher Yunxie for supporting our project going live!
