Flink 1.17 Tutorial: Operator Chain

Operator Chain

In Apache Flink, operator chaining is an optimization that links multiple operators together into a single chain. The chain combines several operations into one task, which reduces communication overhead, improves execution efficiency, and lowers resource usage.

In layman's terms, an operator chain bundles several small operations together so that they run as one unit instead of separately.

Application scenarios:

  1. Reduce communication overhead: Operators in the same chain run in the same thread, so records pass between them without serialization or network transfer. This matters most for operators that exchange data frequently, where it greatly reduces the cost of moving data from one operator to the next.
  2. Improve execution efficiency: Because chained operators share one thread, there is no thread switching between them, and records are handed over in memory.
  3. Reduce resource usage: Chaining reduces the number of tasks, and therefore the computing resources required. Operators in one chain share the same thread and task resources, which lowers overall resource usage.

In summary, operator chaining gives Flink an optimized execution mode that reduces communication overhead, improves execution efficiency, and lowers resource usage. It suits pipelines whose operators depend on each other and exchange data frequently, and jobs that need better overall efficiency with lower resource consumption.
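The difference between a chained and an unchained handoff can be sketched in plain Java. This is only a toy model, not Flink code: the class, the two "operators", and the handoff counter are all hypothetical names made up for illustration, and encoding a string to bytes stands in for the real serialization and network transfer.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

/** Toy model only: none of these names are Flink APIs. */
final class ChainingCostSketch {

    /** Counts how many records crossed a simulated task boundary. */
    static int handoffs = 0;

    // Two toy "operators".
    static final Function<String, String> TRIM = String::trim;
    static final Function<String, Integer> LENGTH = String::length;

    /** Chained: the second operator is a direct method call in the same thread. */
    static int runChained(String record) {
        return TRIM.andThen(LENGTH).apply(record);
    }

    /** Unchained: the record is encoded to bytes and decoded again between operators. */
    static int runUnchained(String record) {
        String intermediate = TRIM.apply(record);
        byte[] wire = intermediate.getBytes(StandardCharsets.UTF_8); // stands in for serialization
        handoffs++;
        String received = new String(wire, StandardCharsets.UTF_8);  // stands in for deserialization
        return LENGTH.apply(received);
    }

    public static void main(String[] args) {
        System.out.println(runChained("  hello flink  "));
        System.out.println(runUnchained("  hello flink  "));
        System.out.println("handoffs = " + handoffs);
    }
}
```

Both paths compute the same result; the unchained path simply pays for one extra encode and decode step per record, which is the cost that chaining removes.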

Merging operators into a chain

In Flink, operators that are connected one-to-one and have the same parallelism can be linked directly into one "big" task, so that each original operator becomes a part of that task. Each task is executed by one thread. This technique is called an "Operator Chain".
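The two conditions can be written as a small predicate. The sketch below is a simplified model with hypothetical names (`Exchange`, `canChain`); it is not how Flink implements the check internally, which also takes further settings into account.

```java
/** Toy model only: these types are hypothetical, not Flink classes. */
final class ChainConditionSketch {

    /** How records travel between two adjacent operators. */
    enum Exchange { ONE_TO_ONE, REPARTITIONING }

    /** Two adjacent operators can be merged only when both conditions hold. */
    static boolean canChain(int upstreamParallelism, int downstreamParallelism, Exchange exchange) {
        return exchange == Exchange.ONE_TO_ONE
                && upstreamParallelism == downstreamParallelism;
    }

    public static void main(String[] args) {
        System.out.println(canChain(2, 2, Exchange.ONE_TO_ONE));     // both conditions hold
        System.out.println(canChain(2, 4, Exchange.ONE_TO_ONE));     // parallelism differs
        System.out.println(canChain(2, 2, Exchange.REPARTITIONING)); // not one-to-one
    }
}
```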

Operator Chain Demo

1. The transmission relationship between operators:
   1) One-to-one
   2) Repartitioning
2. Conditions for chaining operators together:
   1) One-to-one
   2) The same parallelism
3. APIs related to operator chains:
   1) Disable operator chaining globally: env.disableOperatorChaining();
   2) Exclude an operator from chaining: operatorA.disableChaining(); operator A is not chained with the operators before or after it
   3) Start a new chain at an operator: operatorA.startNewChain(); operator A is not chained with the operators before it, and chaining proceeds normally from A onward
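To see how these three APIs change the resulting chains, here is a plain-Java sketch that groups a linear pipeline into chains. It is a toy model under stated assumptions: `Op`, `plan`, and `wordCount` are hypothetical names, and only the strategy names ALWAYS, HEAD, and NEVER mirror Flink's `ChainingStrategy` enum, with HEAD standing for `startNewChain()` and NEVER for `disableChaining()`. The pipeline follows the WordCount demo in this section, where `keyBy` is the only repartitioning step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of chain formation; not Flink's actual planner. */
final class ChainPlannerSketch {

    enum Strategy { ALWAYS, HEAD, NEVER }

    /** An operator, plus whether the edge from its predecessor is one-to-one. */
    static final class Op {
        final String name;
        final int parallelism;
        final boolean oneToOneFromPrevious;
        final Strategy strategy;

        Op(String name, int parallelism, boolean oneToOneFromPrevious, Strategy strategy) {
            this.name = name;
            this.parallelism = parallelism;
            this.oneToOneFromPrevious = oneToOneFromPrevious;
            this.strategy = strategy;
        }
    }

    /** Groups a linear pipeline into chains; chainingEnabled=false models the global switch. */
    static List<List<String>> plan(List<Op> pipeline, boolean chainingEnabled) {
        List<List<String>> chains = new ArrayList<>();
        Op previous = null;
        for (Op op : pipeline) {
            boolean chainable = previous != null
                    && chainingEnabled
                    && op.oneToOneFromPrevious
                    && op.parallelism == previous.parallelism
                    && op.strategy == Strategy.ALWAYS      // HEAD and NEVER refuse to join the previous chain
                    && previous.strategy != Strategy.NEVER; // NEVER also refuses followers
            if (chainable) {
                chains.get(chains.size() - 1).add(op.name);
            } else {
                chains.add(new ArrayList<>(Arrays.asList(op.name)));
            }
            previous = op;
        }
        return chains;
    }

    /** The WordCount pipeline with a configurable strategy on flatMap; parallelism is 1 throughout. */
    static List<Op> wordCount(Strategy flatMapStrategy) {
        return Arrays.asList(
                new Op("source", 1, false, Strategy.ALWAYS),
                new Op("flatMap", 1, true, flatMapStrategy),
                new Op("map", 1, true, Strategy.ALWAYS),
                new Op("sum", 1, false, Strategy.ALWAYS), // keyBy repartitions before sum
                new Op("print", 1, true, Strategy.ALWAYS));
    }

    public static void main(String[] args) {
        System.out.println(plan(wordCount(Strategy.ALWAYS), true));  // default chaining
        System.out.println(plan(wordCount(Strategy.HEAD), true));    // flatMap.startNewChain()
        System.out.println(plan(wordCount(Strategy.NEVER), true));   // flatMap.disableChaining()
        System.out.println(plan(wordCount(Strategy.ALWAYS), false)); // env.disableOperatorChaining()
    }
}
```

In this model, `startNewChain()` on flatMap splits it from the source but still lets map join it, while `disableChaining()` isolates flatMap on both sides. The web UI of a real job is the place to confirm what Flink actually builds.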

package com.atguigu.wc;
 
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
 
/**
 * TODO WordCount with the DataStream API: read from a socket (unbounded stream)
 *
 * @author
 * @version 1.0
 */
public class OperatorChainDemo {

    public static void main(String[] args) throws Exception {

        // TODO 1. Create the execution environment
        // StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // With this environment the web UI is also available when running in IDEA; generally used for local testing
        // Requires the flink-runtime-web dependency
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());

        // When running in IDEA without setting the parallelism, it defaults to the number of threads on the machine
        env.setParallelism(1);

        // Disable operator chaining globally
        // env.disableOperatorChaining();
 
        // TODO 2. Read data: socket
        DataStreamSource<String> socketDS = env.socketTextStream("hadoop102", 7777);

        // TODO 3. Process the data: split, transform, group, aggregate
        SingleOutputStreamOperator<Tuple2<String, Integer>> sum = socketDS
                //.disableChaining()
                .flatMap(
                        (String value, Collector<String> out) -> {
                            String[] words = value.split(" ");
                            for (String word : words) {
                                out.collect(word);
                            }
                        }
                )
                // flatMap starts a new chain, so it is not chained with the socket source
                .startNewChain()
                //.disableChaining()
                .returns(Types.STRING)
                .map(word -> Tuple2.of(word, 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .keyBy(value -> value.f0)
                .sum(1);
 
        // TODO 4. Output
        sum.print();

        // TODO 5. Execute
        env.execute();
    }
}
 
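/**
 1. The transmission relationship between operators:
     One-to-one
     Repartitioning
 2. Conditions for chaining operators together:
    1) One-to-one
    2) The same parallelism
 3. APIs related to operator chains:
    1) Disable operator chaining globally: env.disableOperatorChaining();
    2) Exclude an operator from chaining: operatorA.disableChaining(); operator A is not chained with the operators before or after it
    3) Start a new chain at an operator: operatorA.startNewChain(); operator A is not chained with the operators before it, and chaining proceeds normally from A onward
 */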

Origin blog.csdn.net/a772304419/article/details/132626480