Apache Flink 1.16 Feature Overview


1. Overview

Compared with Flink 1.15, Flink 1.16 maintains a high level of activity in terms of commits, issues, and contributors. The most notable difference is that most of the features and code in Flink 1.16 were contributed mainly by Chinese developers.
Many thanks to the more than 240 Chinese contributors for their contributions to Flink 1.16. Next, let's take a detailed look at the improvements in Flink 1.16 from three aspects.

2. Continuing to lead in stream processing

As the de facto standard among stream computing engines, Flink has continued to make many improvements and explorations in stream processing in the 1.16 release.

State is a very important concept in Flink: it is what allows Flink to guarantee end-to-end exactly-once semantics in stream processing. After years of continuous development, Flink 1.15 introduced the Changelog State Backend to address the periodic CPU and bandwidth jitter that RocksDB suffers from when the TaskManager has to perform compaction and checkpoint uploads at the same time.

Its basic principle is that every state operation performed by the TaskManager is double-written: the data is written to the local state table as before, and at the same time it is appended to a local changelog in an append-only fashion. The contents of this changelog are periodically uploaded to a remote DFS.

Because the changelog is uploaded periodically and incrementally, the amount of data that still has to be persisted at checkpoint time is small, so checkpoints complete faster and job failover is accelerated.

In addition, it alleviates the CPU and bandwidth jitter problem during checkpointing. Thanks to the faster checkpoints and lower latency, end-to-end data freshness improves and can be kept within minutes.

This feature is fully production-ready in Flink 1.16, and it makes the whole cluster more stable.
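Below is a minimal sketch of how this could be enabled for a PyFlink job, assuming the configuration keys as I recall them from the Flink documentation (`state.backend.changelog.enabled`, `state.backend.changelog.storage`, `dstl.dfs.base-path`); the DFS path is a hypothetical placeholder, and in a real deployment these keys would normally be set in flink-conf.yaml.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
conf = t_env.get_config().get_configuration()

# Double-write every state change to a local changelog in addition to the state table.
conf.set_string("state.backend.changelog.enabled", "true")
# Persist the changelog segments to a DFS so checkpoints only need to reference them.
conf.set_string("state.backend.changelog.storage", "filesystem")
conf.set_string("dstl.dfs.base-path", "hdfs:///flink/changelog")  # hypothetical path
```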
We have also made a lot of improvements to RocksDB in Flink 1.16. We introduced a rescaling benchmark, so you can observe how long RocksDB rescaling takes and where the time is spent.

In addition, we improved RocksDB rescaling itself, greatly increasing its performance. For a WordCount job, rescaling became 3 to 4 times faster.

Beyond the rescaling improvements, we also improved the existing state metrics and monitoring. First, RocksDB's own log is now redirected into the Flink log. Second, RocksDB's database-level metrics are exposed through the Flink metric system, so users can inspect the status of RocksDB via Flink metrics.
So far we have covered the improvements to State and RocksDB in Flink 1.16. In addition, we have also made many improvements to checkpointing.

Unaligned Checkpoint was introduced in Flink 1.11 and became production-ready in Flink 1.13. Since then, many companies have adopted it in their production environments, found some problems along the way, made improvements, and contributed them back to the community.

Let's take a brief look at some of the improvements made to Unaligned Checkpoint.

The first is support for overdraft buffers. This feature addresses a problem that can arise with Unaligned Checkpoint because Flink's execution is based on a mailbox processing model: checkpoints are performed in the task's main thread, and while the main thread is executing the process function's logic, the records emitted by the process function must be written to downstream output buffers, which first have to be requested.

When back pressure is severe, requesting such a buffer can take a long time, and the main thread gets stuck inside the process function logic. Before Flink 1.16, the community had already made a related improvement: the main thread pre-acquires an output buffer before entering the process function, and only then enters the process function's logic, to avoid getting stuck inside it.

This solves some cases, but not all of them. If the input or output records are very large, or the process function is a flatmap function that emits multiple records, one buffer is not sufficient, and the main thread can still get stuck inside the process function.

Flink 1.16 introduces overdraft buffers: if the TaskManager still has spare network memory, the task can temporarily "overdraw" additional buffers from it. With these overdraft buffers, the main thread no longer gets stuck inside the process function, can exit it normally, and can then receive the unaligned checkpoint barrier.

Earlier, Unaligned Checkpoint introduced a timeout-based alignment mechanism: if a task's input channels have received a checkpoint barrier but alignment is not reached within the configured time, the task switches to an unaligned checkpoint. However, if the barrier was stuck in the output buffer queue, the downstream task would still perform an aligned checkpoint. Flink 1.16 also handles the case where the barrier is stuck in the output queue.

With these two improvements, Unaligned Checkpoint has become considerably more stable.
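As a rough illustration, the settings involved might be configured as in the sketch below. The checkpointing keys are standard; `taskmanager.network.memory.max-overdraft-buffers-per-gate` is my recollection of the 1.16 option name for overdraft buffers and is a cluster-level setting that would normally go into flink-conf.yaml, so treat it as an assumption to verify.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
conf = t_env.get_config().get_configuration()

conf.set_string("execution.checkpointing.interval", "1min")
# Start aligned; let a task switch to unaligned if alignment takes longer than 30s.
conf.set_string("execution.checkpointing.unaligned", "true")
conf.set_string("execution.checkpointing.aligned-checkpoint-timeout", "30s")
# Extra buffers a task may "overdraw" per gate under back pressure (assumed key).
conf.set_string("taskmanager.network.memory.max-overdraft-buffers-per-gate", "10")
```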
In Flink 1.16, we have enhanced dimension table (lookup) joins.

  1. We have introduced a caching mechanism to improve the query performance of dimension tables.

  2. We introduced an asynchronous query mechanism to improve the overall throughput.

  3. We introduced a retry mechanism, mainly to deal with external systems that are updated late, which would otherwise lead to incorrect results and stability problems when querying the dimension table.

Through the above improvements, our dimension table query ability has been greatly improved.
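The sketch below shows how these capabilities might be used together, assuming an already registered `orders` source table with a processing-time attribute `proc_time`. The connector, table names, cache options, and the LOOKUP hint keys (from FLIP-234) are illustrative assumptions; check the 1.16 documentation for the exact options supported by your connector.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Dimension table with the unified lookup cache enabled (hypothetical JDBC source).
t_env.execute_sql("""
    CREATE TABLE dim_users (
        user_id BIGINT,
        user_name STRING
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:mysql://example-host:3306/db',
        'table-name' = 'users',
        'lookup.cache' = 'PARTIAL',
        'lookup.partial-cache.max-rows' = '10000',
        'lookup.partial-cache.expire-after-write' = '5min'
    )
""")

# Lookup join with async execution and retry when the lookup misses.
result = t_env.sql_query("""
    SELECT /*+ LOOKUP('table'='dim_users', 'async'='true',
                      'retry-predicate'='lookup_miss',
                      'retry-strategy'='fixed_delay',
                      'fixed-delay'='10s', 'max-attempts'='3') */
           o.order_id, d.user_name
    FROM orders AS o
    JOIN dim_users FOR SYSTEM_TIME AS OF o.proc_time AS d
      ON o.user_id = d.user_id
""")
```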
In Flink 1.16, we support more DDL statements. For example, CREATE FUNCTION ... USING JAR supports dynamically loading the user's JAR, which makes it easier for platform teams to manage user UDFs.

Second, we support CREATE TABLE AS SELECT (CTAS), allowing users to easily create a new table from an existing one.

Finally, ANALYZE TABLE is a new statement that helps users generate statistics for a table. The optimizer can use these statistics to produce a better execution plan and improve the performance of the whole job.
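A hedged sketch of the three statements, run through a TableEnvironment; the JAR path, class name, and table names are hypothetical placeholders.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

# 1. Register a UDF whose implementation JAR is loaded dynamically.
t_env.execute_sql(
    "CREATE FUNCTION my_upper AS 'com.example.MyUpper' USING JAR '/path/to/udf.jar'"
)

# 2. CTAS: create a new table directly from a query over an existing table.
t_env.execute_sql(
    "CREATE TABLE orders_daily AS "
    "SELECT order_date, COUNT(*) AS cnt FROM orders GROUP BY order_date"
)

# 3. Generate statistics the optimizer can use to produce a better plan.
t_env.execute_sql("ANALYZE TABLE orders_daily COMPUTE STATISTICS")
```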
Beyond that, we have made many other optimizations for streaming. Here are a few of the more important ones.

We addressed non-determinism problems in stream processing. They mainly come from two sources: non-deterministic dimension table lookups, and non-deterministic user-defined functions.

  1. Flink 1.16 provides a fairly complete, systematic solution. First, it can automatically detect non-determinism problems in SQL. Second, the engine resolves the non-determinism of dimension table lookups for the user. Finally, documentation is provided so that users can better discover and solve non-determinism problems in their own jobs.
  2. We finally added Protobuf format support: in Flink 1.16, users can work with data in the Protobuf (PB) format (a hedged example follows this list).
  3. We introduced a configurable RateLimitingStrategy for the Async Sink. Previously the strategy was hard-coded and could not be configured; in Flink 1.16 it is pluggable, so users can implement their own strategy according to how they want to react to network congestion.
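For item 2, a table declaration using the new Protobuf format might look like the sketch below. The Kafka topic, broker address, and message class are hypothetical, and the exact format option names should be verified against the 1.16 format documentation.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE pb_events (
        user_id BIGINT,
        event_type STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'events',
        'properties.bootstrap.servers' = 'broker:9092',
        'format' = 'protobuf',
        'protobuf.message-class-name' = 'com.example.Event'
    )
""")
```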

3. More stable, easier-to-use, high-performance batch processing

We just talked about Flink's improvements in stream processing. Flink is not only a stream computing engine, but a unified stream and batch computing engine, so we have also done a lot of work on batch processing. The goal in Flink 1.16 is to make batch computing more stable and achieve high performance in real applications.

In terms of ease of use: in the existing batch ecosystem, many user jobs still run on Hive. In Flink 1.16 we want Hive SQL to be migratable to Flink at a very low cost. Flink 1.16's Hive compatibility has reached 94%, and if ACID operations are excluded it reaches 97%. At the same time, with the Hive Catalog, Hive SQL can run federated queries on the Flink engine.
Flink 1.16 also introduces a very important component, the SQL Gateway. Through the SQL Gateway and its HiveServer2 support, mainstream tools in the Hive ecosystem can connect to the Flink ecosystem naturally.

The Flink 1.16 SQL Gateway supports multi-tenancy and is compatible with the HiveServer2 protocol, and therefore with the ecosystem built around HiveServer2. Together with HiveServer2, the entire Hive ecosystem can be migrated to Flink with little effort.
Next, let's look at the optimizations the Flink engine itself makes for batch. First, scheduling-related optimizations. Flink batch jobs often suffer from the following problem: because some hotspot machines have busy IO or high CPU load, the tasks running on them drag down the end-to-end execution time of the whole batch job.

To solve this problem, Flink 1.16 introduced Speculative Execution. The basic idea is that, in each stage, tasks whose execution time is much longer than that of the other tasks in the same stage are identified as slow tasks, and the machines they run on are identified as hotspot machines.

Once a hotspot machine has been identified, in order to reduce the execution time of the whole job, backup tasks for the slow tasks running on it are launched on other, non-hot machines. This reduces the total execution time of the job.
Next, let's take a brief look at the details. First, there is a component called the SlowTaskDetector, which periodically checks for slow tasks and the hotspot machines they run on.

It reports this information to the SpeculativeScheduler, which passes it on to the BlocklistHandler, and the BlocklistHandler then adds these machines to a blocklist.

Backup tasks for the slow tasks on blocklisted machines are then scheduled to other, non-hot machines in the cluster, so the slow tasks and their backups run at the same time. Whichever instance finishes first has its result accepted, and its output is used as the input of downstream operators; the unfinished instances are canceled.
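A configuration sketch for turning this on in 1.16 might look like the following. The option names are my recollection of the 1.16 release (speculative execution builds on the adaptive batch scheduler, and several of these keys were renamed in later releases), so treat every key as an assumption and confirm it against the 1.16 configuration reference; they are cluster-level settings that would normally live in flink-conf.yaml.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())
conf = t_env.get_config().get_configuration()

# Speculative execution in 1.16 requires the adaptive batch scheduler (assumed keys).
conf.set_string("jobmanager.scheduler", "AdaptiveBatch")
conf.set_string("jobmanager.adaptive-batch-scheduler.speculative.enabled", "true")
# Thresholds used by the slow task detector, relative to the stage's baseline time.
conf.set_string("slow-task-detector.execution-time.baseline-ratio", "0.75")
conf.set_string("slow-task-detector.execution-time.baseline-multiplier", "1.5")
```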
For Speculative Execution we also added REST API and Web UI support. In the Web UI you can see which slow tasks were canceled and which tasks are their backups, and you can see in real time which TaskManager machines are currently on the blocklist.

As for follow-up work on Speculative Execution: first, sinks are not yet supported. Second, the existing detection strategy is still relatively crude; it does not account for tasks that are slow because of data skew, which can cause machines that are actually fine to be marked as hotspots. This is work planned for Flink 1.17 and later.
Next, the work we did on shuffle. As we all know, Flink has two shuffle strategies: Pipelined Shuffle and Blocking Shuffle.

With Pipelined Shuffle, upstream and downstream tasks are scheduled and run at the same time. Data is transferred directly from memory to memory without being spilled to disk, so performance is better. The disadvantage is that it uses more resources, because upstream and downstream tasks must be scheduled simultaneously; when resources are tight, the job may be hard to start, or may even deadlock because of its resource requirements.

With Blocking Shuffle, in each stage the task writes its results to disk, and the downstream tasks then read the data back from disk. The advantage is that, in theory, a single slot is enough to run the entire batch job. The shortcoming is equally obvious: because each stage has to write its result data to disk and the next stage has to read it back, performance is worse.
Based on this, Flink 1.16 proposes a new shuffle strategy, Hybrid Shuffle. Its purpose is to combine the advantages of the two strategies above: when resources are abundant, it gains the performance benefits of Pipelined Shuffle; when resources are insufficient, it falls back to the low resource requirements of Blocking Shuffle. The switching between the two is adaptive, and that is the basic idea behind Hybrid Shuffle.
Hybrid Shuffle has two data spilling strategies: full spilling and selective spilling.

Selective spilling writes less data to disk, so overall performance is higher. Full spilling gives better performance during failover, because all intermediate data is available on disk. Users can choose whichever Hybrid Shuffle spilling strategy suits their scenario.
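A minimal sketch of selecting Hybrid Shuffle for a batch job; the `execution.batch-shuffle-mode` values below match the 1.16 documentation as I recall it, so verify them before relying on them.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())
conf = t_env.get_config().get_configuration()

# Full spilling: all shuffle data is also written to disk, better failover behaviour.
conf.set_string("execution.batch-shuffle-mode", "ALL_EXCHANGES_HYBRID_FULL")
# Selective spilling: write less to disk, higher overall performance.
# conf.set_string("execution.batch-shuffle-mode", "ALL_EXCHANGES_HYBRID_SELECTIVE")
```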

In terms of performance, Hybrid Shuffle in Flink 1.16 reduces TPC-DS execution time by 7.2% compared with Blocking Shuffle; with the broadcast optimization added, TPC-DS execution time is 17% lower than with Blocking Shuffle.

Follow-up work on Hybrid Shuffle includes performance optimization for broadcast data, and compatibility with the Adaptive Batch Scheduler introduced in Flink 1.15 as well as with Speculative Execution.
We have many other batch optimizations. Here we briefly list some of the more important ones.

  1. First, we support Dynamic Partition Pruning. Prior to 1.16, partition pruning was done statically during planning. In batch processing, however, runtime statistics can be used to prune partitions more effectively. This gives Flink 1.16 a 30% performance improvement on TPC-DS.
  2. We introduce Adaptive Hash Join, an adaptive strategy that uses runtime information to fall back from Hash Join to Sort Merge Join when needed, improving the stability of joins.
  3. We made further improvements to Blocking Shuffle for batch, introducing more compression algorithms (LZO and ZSTD) to give users more options for trading off data size against CPU consumption.
  4. We optimized the existing Blocking Shuffle implementation. Adaptive buffer allocation, sequential IO reads, and result partition sharing bring a 7% performance improvement on TPC-DS.
  5. We support join hints in batch SQL. Join hints allow users to manually intervene in the join strategy, so that more efficient execution plans are generated and the performance of the whole job improves (see the sketch after this list).
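For item 5, a hedged sketch of a join hint forcing a broadcast hash join of a small dimension table; the table names are hypothetical, and BROADCAST, SHUFFLE_HASH, SHUFFLE_MERGE, and NEST_LOOP are the hint names I recall from the 1.16 docs.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

# Broadcast the (small) dim_users table to every task that joins fact_orders.
result = t_env.sql_query("""
    SELECT /*+ BROADCAST(dim_users) */
           fact_orders.order_id, dim_users.user_name
    FROM fact_orders
    JOIN dim_users
      ON fact_orders.user_id = dim_users.user_id
""")
```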

4. A thriving ecosystem

Next, let's introduce some of the thriving ecosystem components. PyFlink is a very important one in our ecosystem, and it has been developing continuously from Flink 1.9 through Flink 1.16.

  1. The coverage of the Python API has reached more than 95%. On the one hand, we optimized the built-in window support: custom windows were already supported in Flink 1.15, but implementing a custom window was still costly and hard for users. On the other hand, Flink 1.16 introduces support for side outputs, broadcast state, and so on.
  2. PyFlink supports all built-in connectors and formats, expanding its ability to interface with various systems.
  3. PyFlink supports Apple M1 machines and Python 3.9, which lowers the barrier to entry for users. At the same time, Python 3.6 has been deprecated; its support will be removed in Flink 1.17, and support for Python 3.10 will be introduced.
  4. PyFlink now has its own user website, with installation tutorials for the various execution environments and QuickStart examples that can be run online in hosted notebooks. We have also collected frequently asked questions and end-to-end scenario walkthroughs for PyFlink. All of this is designed to lower the barrier to entry for new users.

In terms of performance, we introduced Thread Mode in PyFlink 1.15. The biggest difference between Thread Mode and Process Mode is that it removes the inter-process communication between the Python process and the Java process; inter-process communication incurs serialization/deserialization overhead, which Thread Mode no longer has.

In Flink 1.16, Thread Mode is fully supported. Compared with Process Mode, its performance is better and its end-to-end latency is lower.

In a JSON processing scenario, the end-to-end latency of Thread Mode is only 1/500 of that of Process Mode. In common, typical scenarios, the performance of Thread Mode is basically on par with Java.
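Switching a PyFlink job to Thread Mode is a single configuration option; the sketch below shows the documented `python.execution-mode` setting, with a purely illustrative UDF.

```python
from pyflink.table import DataTypes, EnvironmentSettings, TableEnvironment
from pyflink.table.udf import udf

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
# Run Python UDFs in the same process as the Java operator instead of a separate
# Python worker process, avoiding the inter-process serialization round-trip.
t_env.get_config().get_configuration().set_string("python.execution-mode", "thread")

@udf(result_type=DataTypes.BIGINT())
def double_it(x):
    return x * 2
```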
It can be seen that in Flink 1.16, PyFlink has reached full production readiness in terms of both functionality and performance. In addition, CEP is also an important part of the Flink ecosystem, and we extended its functionality in Flink 1.16.

  1. We support CEP in batch SQL.
  2. We extended the time-interval support: previously only the interval between the first and last events of a pattern could be constrained, and now intervals between individual events can be defined as well.

So far we have covered the important ecosystem components inside Flink itself. Outside the Flink project, there are also some important ecosystem projects, such as Flink Table Store, Flink CDC, Flink ML, and Feathub.

  1. Flink Table Store works together with the Changelog State Backend to achieve end-to-end data freshness within minutes.
  2. We made improvements to data correctness, so that joins and aggregations on ingested CDC data work more smoothly.
  3. DataStream now supports a cache function, so that Flink ML achieves higher performance when implementing its built-in operators.
  4. Feathub was open-sourced a short while ago. It relies on PyFlink as its compute engine, and with the performance improvements in PyFlink, Python functions in Feathub now perform close to Java functions, so there is no longer a disadvantage.
