Flink SQL performance optimization: detailed explanation of multiple input

Improving execution efficiency has always been a goal that Flink pursues. In most jobs, especially batch jobs, the cost of transferring data between tasks over the network (called a data shuffle) is relatively high. Normally, a record transmitted over the network must go through serialization, disk reads and writes, socket reads and writes, and deserialization before it reaches the downstream task from the upstream one; transmitting the same record in memory, by contrast, only takes a few CPU cycles to pass an eight-byte pointer.

Early versions of Flink already used the operator chaining mechanism to merge adjacent single-input operators with the same parallelism into one task, eliminating unnecessary network transmission between single-input operators. However, redundant data shuffles also occur between multi-input operators such as join, and the shuffle between source nodes, which carry the largest shuffle volume, and multi-input operators cannot be optimized by operator chaining.

In Flink 1.12, we introduced the multiple input operator and source chaining optimizations for scenarios that operator chaining cannot currently cover. These optimizations eliminate most of the redundant shuffles in Flink jobs and further improve their execution efficiency. This article takes a SQL job as an example to introduce the above optimizations and shows the results achieved by Flink 1.12 on the TPC-DS benchmark.

Optimization case analysis: order volume statistics

We take TPC-DS q96 as an example to explain in detail how redundant shuffles are eliminated. This SQL query filters and counts orders that meet specific conditions through multiple joins.

select count(*)
from store_sales, household_demographics, time_dim, store
where ss_sold_time_sk = time_dim.t_time_sk
    and ss_hdemo_sk = household_demographics.hd_demo_sk
    and ss_store_sk = s_store_sk
    and time_dim.t_hour = 8
    and time_dim.t_minute >= 30
    and household_demographics.hd_dep_count = 5
    and store.s_store_name = 'ese'

Figure 1 - Initial execution plan

Where do redundant shuffles come from?

Because some operators have requirements on the distribution of their input data (for example, a hash join operator requires that records whose join keys hash to the same value be processed by the same parallel instance), data may need to be repartitioned as it passes between operators. Similar to the shuffle in map-reduce, a Flink shuffle organizes the intermediate results produced by upstream tasks and sends them on demand to the downstream tasks that need them. In some cases, however, the upstream output already satisfies the required distribution (for example, several consecutive hash join operators on the same join key). Re-organizing the data is then no longer necessary, and the shuffle becomes a redundant shuffle, shown as a forward shuffle in the execution plan.
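The distribution requirement above can be illustrated with a small sketch (plain Python, purely conceptual, not Flink code): if records are partitioned by the hash of the join key, a second hash shuffle on the same key would assign every record to the parallel instance it is already on, which is exactly why that shuffle is redundant.

```python
# Illustration only: hash partitioning as performed by a hash shuffle.
# Records with the same join key must land on the same parallel instance.

PARALLELISM = 4

def partition(key, parallelism=PARALLELISM):
    """Assign a record to a parallel instance by the hash of its key."""
    return hash(key) % parallelism

records = [("k%d" % i, i) for i in range(100)]

# First hash shuffle: distribute records by join key.
first = {p: [r for r in records if partition(r[0]) == p]
         for p in range(PARALLELISM)}

# A second shuffle on the *same* key assigns every record to the
# partition it is already in -- the shuffle moves nothing.
for p, rs in first.items():
    assert all(partition(k) == p for k, _ in rs)
print("a second shuffle on the same key moves no records")
```

This is the situation the optimizer detects: the data is already where the next hash join needs it.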

The hash join operators in Figure 1 are a special kind of operator called broadcast hash join. Take store_sales join time_dim as an example: since the time_dim table is very small, a broadcast shuffle sends the full data of that table to every parallel instance of the hash join, so any parallel instance can accept any data from the store_sales table without affecting the correctness of the join result, while improving the execution efficiency of the hash join. As a result, the network transmission from the store_sales table to the join operator also becomes a redundant shuffle. Similarly, the shuffles between the joins are unnecessary.

Figure 2 - Redundant shuffles (marked with red boxes)
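The broadcast behaviour just described can be sketched as follows (plain Python, not Flink internals): every parallel instance holds a full copy of the small build side, so the large probe side can be split arbitrarily and still join correctly.

```python
# Illustration only: a broadcast hash join across parallel instances.
PARALLELISM = 3

# Small build side (e.g. time_dim) is broadcast: each instance gets a full copy.
small = {1: "08:30", 2: "08:45"}                      # key -> payload
broadcast_copies = [dict(small) for _ in range(PARALLELISM)]

# Large probe side (e.g. store_sales) is split arbitrarily -- no shuffle needed.
large = [(1, "sale-a"), (2, "sale-b"), (1, "sale-c"), (3, "sale-d")]
chunks = [large[i::PARALLELISM] for i in range(PARALLELISM)]

# Each instance joins its arbitrary chunk against its full copy of the
# small side; no row of the large side had to move to a specific instance.
result = []
for inst in range(PARALLELISM):
    table = broadcast_copies[inst]
    for key, payload in chunks[inst]:
        if key in table:
            result.append((key, payload, table[key]))
```

Because correctness no longer depends on where each store_sales row lands, the forward shuffle feeding the join carries no benefit.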

Besides hash join and broadcast hash join, there are many other scenarios that produce redundant shuffles, such as a hash aggregate and a hash join with the same group key and join key, or multiple hash aggregates whose group keys have a containment relationship; we will not expand on them here.

Can Operator Chaining solve it?

Readers with some understanding of Flink's optimizations may know that, to eliminate unnecessary forward shuffles, Flink introduced the operator chaining mechanism early on. This mechanism merges adjacent single-input operators with the same parallelism into one task and runs them in the same thread. Operator chaining is already in effect in Figure 1: without it, the operators separated by "->" in the names of the three source nodes of the broadcast shuffles would be split into multiple different tasks, causing redundant data shuffles. Figure 3 shows the execution plan with operator chaining disabled.

Figure 3 - Execution plan with operator chaining disabled

Reducing data transmission between TMs over the network and through files by merging operator chains into a single task is a very effective optimization: it reduces thread switching, message serialization and deserialization, and data exchange through buffers, improving overall throughput while reducing latency. However, operator chaining imposes very strict restrictions on which operators it can merge. One of them is that "the in-degree of the downstream operator is 1", that is, the downstream operator may have only one input. This excludes multi-input operators such as join.
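A simplified sketch of the kind of check operator chaining performs (hypothetical, not Flink's actual implementation; real chaining also considers chaining strategy, slot sharing, and more) makes the restriction concrete: the edge must be forward, the parallelism must match, and the downstream in-degree must be exactly one.

```python
# Hypothetical, simplified chaining check -- not Flink's real logic.
def can_chain(edge, upstream_parallelism, downstream_parallelism,
              downstream_in_degree):
    """Return True if the two operators joined by `edge` may be chained."""
    return (
        edge == "forward"                                  # no repartitioning
        and upstream_parallelism == downstream_parallelism # same parallelism
        and downstream_in_degree == 1                      # excludes joins
    )

print(can_chain("forward", 4, 4, 1))   # a single-input, map-like operator
print(can_chain("forward", 4, 4, 2))   # a two-input join: rejected
```

The second call is exactly the join case: even with a forward edge and matching parallelism, the in-degree rule blocks chaining.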

The solution for multi-input operators: the multiple input operator

If we imitate the optimization idea of operator chaining and introduce a new mechanism that satisfies the following conditions:

  1. the mechanism can combine multi-input operators;
  2. the mechanism supports multiple inputs (providing input to the operators being combined);

then we can put multi-input operators connected by forward shuffles into one task for execution, thereby eliminating unnecessary shuffles. The Flink community noticed the shortcomings of operator chaining long ago: Flink 1.11 introduced MultipleInputTransformation and the corresponding MultipleInputStreamTask at the streaming API layer, which satisfy condition 2 above. On that basis, Flink 1.12 implements a new operator at the SQL layer that satisfies condition 1: the multiple input operator. For details, see the FLIP document [1].

The multiple input operator is a pluggable optimization at the table layer. It runs as the last step of table layer optimization, traversing the generated execution plan and merging adjacent operators that are not blocked by an exchange into one multiple input operator. Figure 4 shows how this optimization changes the original SQL optimization steps.

Figure 4 - Optimization steps after adding the multiple input operator

Readers may wonder: why not modify the existing operator chaining instead of starting from scratch? In fact, besides doing the work of operator chaining, the multiple input operator also needs to sort its inputs by priority. This is because some multi-input operators (such as hash join and nested loop join) impose strict ordering constraints on their inputs; if the input priorities are not sorted properly, a deadlock may occur. Since input priority information is described only in the operators of the table layer, it is more natural to introduce this optimization mechanism at the table layer.
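Why input ordering matters can be modelled with a small sketch (plain Python, a hypothetical simplification): a hash join must fully consume its build side before it can probe, so inputs are read in priority order; reading the probe side while the build side is unfinished could stall the pipeline.

```python
# Hypothetical model: each input carries a priority; a lower value means
# it must be fully consumed earlier. A hash join's build input gets
# priority 0, its probe input priority 1.

def input_reading_order(inputs):
    """Sort input names by priority so build sides drain before probes."""
    return [name for name, priority in sorted(inputs, key=lambda x: x[1])]

hash_join_inputs = [("probe: store_sales", 1), ("build: time_dim", 0)]
print(input_reading_order(hash_join_inputs))
# build side first, probe side after
```

In the real operator this ordering is derived from the table-layer plan, which is precisely why the optimization lives at that layer.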

It is worth noting that, unlike operator chaining, which manages multiple operators, the multiple input operator is itself a single large operator whose internal workings are a black box to the outside. Its internal structure is fully reflected in its operator name: when running a job that contains a multiple input operator, readers can see from the operator name which operators are combined into it, and in what topology.

Figure 5 shows the operator topology after multiple input optimization and a see-through view of the multiple input operator. After the redundant shuffles between the three hash join operators in the figure are removed, they can run in one task; since operator chaining cannot handle this multi-input situation, they are placed into a multiple input operator for execution. The multiple input operator manages the input order of each operator and the calling relationships between the operators.

Figure 5 - Operator topology after multiple input optimization
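The "black box" behaviour described above can be modelled roughly like this (plain Python, purely illustrative, not Flink code): from the outside there is one operator with N inputs and a single entry point; internally it routes each record from a given input index to the right inner operator.

```python
# Purely illustrative model of a multiple input operator: one outer
# operator dispatches records from each input index to an inner
# operator, while the internals stay invisible to the framework.

class MultipleInputOperator:
    def __init__(self):
        # input index -> inner processing function
        # (stand-ins for the chained hash join operators)
        self.routes = {
            0: lambda r: ("join1", r),
            1: lambda r: ("join2", r),
            2: lambda r: ("join3", r),
        }
        self.emitted = []

    def process_element(self, input_index, record):
        """The framework only sees this single entry point."""
        self.emitted.append(self.routes[input_index](record))

op = MultipleInputOperator()
op.process_element(0, "a")
op.process_element(2, "b")
print(op.emitted)
```

The framework schedules one task; which inner operator handles which input is entirely the operator's own business.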

The construction and runtime of the multiple input operator are rather complicated. Readers interested in these details can refer to the design document [2].

Sources can't be left out: source chaining

After the multiple input operator optimization, the execution plan in Figure 1 is optimized into Figure 6; Figure 3, after operator chaining, becomes the execution graph corresponding to Figure 6.

Figure 6 - Execution plan after multiple input operator optimization

The forward shuffle produced from the store_sales table in Figure 6 (shown in the red box) indicates that there is still room for optimization. As mentioned earlier, in most jobs the data produced directly by a source has not yet been filtered or processed by operators such as join, so its shuffle volume is the largest. Taking TPC-DS q96 on 10T of data as an example, without further optimization the task containing the store_sales source table would transmit 1.03T of data over the network, while after a single join the data volume drops sharply to 16.5G. If we can omit the forward shuffle of the source table, the overall execution efficiency of the job takes a big step forward.

Unfortunately, the multiple input operator cannot cover the source shuffle scenario, because a source, unlike any other operator, has no input. For this reason, Flink 1.12 adds the source chaining function to operator chaining, which merges sources not blocked by a shuffle into the operator chain, eliminating the forward shuffle between a source and its downstream operators.

Currently only FLIP-27 sources and the multiple input operator can use the source chaining function, but this is enough for the optimization scenario in this article.

Combining the multiple input operator and source chaining, Figure 7 shows the final execution plan of the optimization case in this article.

Figure 7 - Optimized execution plan

TPC-DS test results

The multiple input operator and source chaining have a significant optimization effect on most jobs, especially batch jobs. We used the TPC-DS benchmark to test the overall performance of Flink 1.12. Compared with the total time of 12267s published for Flink 1.10, Flink 1.12 took only 8708s in total, shortening the running time by nearly 30%!
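The quoted speedup follows directly from the two totals:

```python
# Published TPC-DS totals from the text above.
flink_110_total_s = 12267
flink_112_total_s = 8708

reduction = (flink_110_total_s - flink_112_total_s) / flink_110_total_s
print(f"{reduction:.1%}")  # about 29%, i.e. nearly 30%
```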

Figure 8 - Comparison of total TPC-DS running time

Figure 9 - Running time comparison for selected TPC-DS queries

Future plans

From the TPC-DS test results we can see that source chaining + multiple input brings a large performance improvement. The overall framework is now complete, and commonly used batch operators already support the derivation logic for eliminating redundant exchanges. We will support more batch operators and more refined derivation algorithms in the future.

Although the data shuffle of a streaming job does not need to write data to disk as a batch job does, the performance gained by turning network transmission into in-memory transmission is still considerable, so source chaining + multiple input is a much anticipated optimization for streaming jobs as well. Supporting this optimization on streaming jobs still requires a lot of work: for example, the derivation logic for eliminating redundant exchanges is not yet supported for streaming operators, and some operators need to be refactored to remove the requirement that input data be binary. This is why the optimization has not yet been introduced for streaming jobs in Flink 1.12. We will gradually complete this work in subsequent versions, and we hope more community members will join us to land more optimizations as soon as possible.

Authors: He Xiaoling, Weng Caizhi

Original link

This article is the original content of Alibaba Cloud and may not be reproduced without permission.

 


Origin: blog.csdn.net/weixin_43970890/article/details/114086774