Shang Silicon Valley Big Data Flink1.17 Practical Tutorial - Note 03 [Flink Runtime Architecture]

  1. Shang Silicon Valley Big Data Flink1.17 Practical Tutorial - Note 01 [Flink Overview, Flink Quick Start]
  2. Shang Silicon Valley Big Data Flink1.17 Practical Tutorial - Note 02 [Flink Deployment]
  3. Shang Silicon Valley Big Data Flink1.17 Practical Tutorial - Note 03 [Flink Runtime Architecture]
  4. Shang Silicon Valley Big Data Flink1.17 Practical Tutorial - Note 04【】
  5. Shang Silicon Valley Big Data Flink1.17 Practical Tutorial - Note 05【】
  6. Shang Silicon Valley Big Data Flink1.17 Practical Tutorial - Note 06【】
  7. Shang Silicon Valley Big Data Flink1.17 Practical Tutorial - Note 07【】
  8. Shang Silicon Valley Big Data Flink1.17 Practical Tutorial - Note 08【】

Table of contents

Basics

Chapter 04 - Flink Runtime Architecture

P023【023_Flink runtime architecture_system architecture】07:13

P024【024_Flink runtime architecture_core concept_parallelism】06:45

P025【025_Flink runtime architecture_core concept_parallelism setting & priority】18:40

P026【026_Flink runtime architecture_core concept_operator chain】08:34

P027【027_Flink runtime architecture_core concept_operator chain demonstration】17:11

P028【028_Flink runtime architecture_core concept_task slot】09:52

P029【029_Flink runtime architecture_core concept_sharing group of task slots】07:59

P030【030_Flink runtime architecture_core concept_relationship between slot and parallelism & demonstration】21:27

P031【031_Flink runtime architecture_submission process_Standalone session mode & four graphs】09:49

P032【032_Flink runtime architecture_submission process_Yarn application mode】05:18


Basics

Chapter 04 - Flink Runtime Architecture

P023【023_Flink runtime architecture_system architecture】07:13

Flink runtime architecture - Standalone session mode as an example

P024【024_Flink runtime architecture_core concept_parallelism】06:45

  • The number of subtasks of a particular operator is called its parallelism. A data flow that contains parallel subtasks is a parallel data flow, and it requires multiple stream partitions to distribute the parallel tasks. In general, the parallelism of a streaming program can be taken to be the maximum parallelism among all of its operators; within one program, different operators may have different parallelisms.
  • For example: as shown in the figure above, the data flow contains four operators: source, map, window, and sink. The sink operator has parallelism 1 and the other operators have parallelism 2, so the parallelism of this stream processing program is 2 (a minimal code sketch follows below).
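
A minimal sketch of this idea, assuming a parallel fromSequence source and trivial map/print logic (the class name and pipeline are illustrative, not from the course, and the window step from the figure is omitted for brevity):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromSequence(1, 100).setParallelism(2)   // source: parallelism 2
                .map(x -> x * 2)                     // map: parallelism 2
                .returns(Types.LONG)
                .setParallelism(2)
                .print().setParallelism(1);          // sink: parallelism 1

        // The job parallelism is the maximum over all operators, i.e. 2.
        env.execute();
    }
}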

P025【025_Flink runtime architecture_core concept_parallelism setting & priority】18:40

4.2.1 Parallelism

2) Parallelism setting

In Flink, parallelism can be set in several different ways, and these settings differ in scope and priority.

(1) Set in code

In our code, we can simply call the setParallelism() method after the operator to set the parallelism of the current operator:

stream.map(word -> Tuple2.of(word, 1L)).setParallelism(2);

The degree of parallelism set in this way is only valid for the current operator.

In addition, we can also directly call the setParallelism() method of the execution environment to set the degree of parallelism globally:

env.setParallelism(2);

This makes 2 the default parallelism for all operators in the code. We generally do not set a global parallelism in the program, because hard-coding it there makes dynamic scaling impossible.

Note that keyBy is not an operator, so parallelism cannot be set on keyBy.
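
What this means in code, as a minimal sketch (assuming a wordAndOne stream of Tuple2 elements, as in the WordCount example): keyBy returns a KeyedStream, which has no setParallelism() method, so the parallelism is set on the aggregation operator that follows it:

// keyBy is not an operator, so it has no setParallelism();
// the parallelism applies to the downstream sum operator instead.
wordAndOne.keyBy(value -> value.f0)
        .sum(1).setParallelism(3);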

(2) Set when submitting the application

When submitting an application using the flink run command, you can add the -p parameter to specify the parallelism of the current application execution. Its function is similar to the global settings of the execution environment:

bin/flink run -p 2 -c com.atguigu.wc.SocketStreamWordCount ./FlinkTutorial-1.0-SNAPSHOT.jar

If we submit the job through the Web UI instead, we can specify the parallelism directly in the corresponding input box.

package com.atguigu.wc;

import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

/**
 * TODO DataStream implementation of WordCount: read from a socket (unbounded stream)
 *
 * @author
 * @version 1.0
 */
public class WordCountStreamUnboundedDemo {
    public static void main(String[] args) throws Exception {
        // TODO 1. Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // When running in IDEA you can also see the Web UI; generally used for local testing
        // Requires the flink-runtime-web dependency
        // When running in IDEA without specifying a parallelism, the default is the number of CPU threads of the machine
        // StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());
        env.setParallelism(3);

        // TODO 2. Read data from a socket
        DataStreamSource<String> socketDS = env.socketTextStream("hadoop102", 7777);

        // TODO 3. Process the data: split, transform, keyBy, aggregate
        SingleOutputStreamOperator<Tuple2<String, Integer>> sum = socketDS
                .flatMap(
                        (String value, Collector<Tuple2<String, Integer>> out) -> {
                            String[] words = value.split(" ");
                            for (String word : words) {
                                out.collect(Tuple2.of(word, 1));
                            }
                        }
                )
                .setParallelism(2)
                .returns(Types.TUPLE(Types.STRING,Types.INT))
                // .returns(new TypeHint<Tuple2<String, Integer>>() {})
                .keyBy(value -> value.f0)
                .sum(1);

        // TODO 4. Print the output
        sum.print();

        // TODO 5. Execute
        env.execute();
    }
}

/**
 Parallelism priority:
    code: operator-level > code: env-level > specified at submission > configuration file
 */

Parallelism priority: operator-level setting in code > env-level (global) setting in code > -p option at submission > configuration file.
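
A minimal sketch of this priority rule, assuming a run with no -p option and no configuration-file override: the operator-level setting takes precedence over the env-level default:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2);                               // env-level default: 2

env.socketTextStream("hadoop102", 7777)
        .map(String::toUpperCase).setParallelism(3)  // operator-level: 3, overrides the env default
        .print();                                    // no operator-level setting: falls back to 2

env.execute();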

P026【026_Flink runtime architecture_core concept_operator chain】08:34

4.2.2 Operator Chain

2) Merge operator chain

In Flink, one-to-one operator operations with the same parallelism can be linked directly together to form a single "large" task, so that each original operator becomes part of that task, as shown below. Each task is executed by one thread. This technique is called "operator chaining".

P027【027_Flink runtime architecture_core concept_operator chain demonstration】17:11

package com.atguigu.wc;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

/**
 * TODO DataStream implementation of WordCount: read from a socket (unbounded stream)
 *
 * @author
 * @version 1.0
 */
public class OperatorChainDemo {
    public static void main(String[] args) throws Exception {
        // TODO 1. Create the execution environment
        // StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // When running in IDEA you can also see the Web UI; generally used for local testing
        // Requires the flink-runtime-web dependency
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());

        // When running in IDEA without specifying a parallelism, the default is the number of CPU threads of the machine
        env.setParallelism(1);

        // Disable operator chaining globally
        //env.disableOperatorChaining();

        // TODO 2. Read data from a socket
        DataStreamSource<String> socketDS = env.socketTextStream("hadoop102", 7777);

        // TODO 3. Process the data: split, transform, keyBy, aggregate
        SingleOutputStreamOperator<Tuple2<String,Integer>> sum = socketDS
				//.disableChaining()
                .flatMap(
                        (String value, Collector<String> out) -> {
                            String[] words = value.split(" ");
                            for (String word : words) {
                                out.collect(word);
                            }
                        }
                )
                .startNewChain()
				//.disableChaining()
                .returns(Types.STRING)
                .map(word -> Tuple2.of(word, 1))
                .returns(Types.TUPLE(Types.STRING,Types.INT))
                .keyBy(value -> value.f0)
                .sum(1);

        // TODO 4. Print the output
        sum.print();

        // TODO 5. Execute
        env.execute();
    }
}

/**
 1. Data-transfer relationships between operators:
     one-to-one (forwarding)
     redistributing (repartitioning)

 2. Conditions for chaining operators together:
    1) one-to-one transfer
    2) the same parallelism

 3. Operator-chain APIs:
    1) Disable chaining globally: env.disableOperatorChaining();
    2) Exclude one operator from chaining: operatorA.disableChaining(); A will chain with neither the operator before it nor the one after it
    3) Start a new chain at an operator: operatorA.startNewChain(); A does not chain with its predecessor, and normal chaining resumes from A onward
 */

P028【028_Flink runtime architecture_core concept_task slot】09:52

4.2.3 Task Slots

P029【029_Flink runtime architecture_core concept_sharing group of task slots】07:59

3) Sharing of task slots by tasks

By default, Flink allows subtasks to share slots. If we keep the parallelism of the sink task at 1 and set the global parallelism to 6 when submitting the job, then the first two task nodes will each have 6 parallel subtasks, and the entire stream processing program will have 13 subtasks. As shown in the figure above, as long as they belong to the same job, parallel subtasks of different task nodes (operators) can run in the same slot. So for the first task node, source→map, its 6 parallel subtasks must be spread across different slots, but the parallel subtasks of the second task node, keyBy/window/apply, can share slots with those of the first.

package com.atguigu.wc;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

/**
 * TODO DataStream implementation of WordCount: read from a socket (unbounded stream)
 *
 * @author
 * @version 1.0
 */
public class SlotSharingGroupDemo {
    public static void main(String[] args) throws Exception {
        // TODO 1. Create the execution environment
        // StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // When running in IDEA you can also see the Web UI; generally used for local testing
        // Requires the flink-runtime-web dependency
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());

        // When running in IDEA without specifying a parallelism, the default is the number of CPU threads of the machine
        env.setParallelism(1);

        // TODO 2. Read data from a socket
        DataStreamSource<String> socketDS = env.socketTextStream("hadoop102", 7777);

        // TODO 3. Process the data: split, transform, keyBy, aggregate
        SingleOutputStreamOperator<Tuple2<String,Integer>> sum = socketDS
                .flatMap(
                        (String value, Collector<String> out) -> {
                            String[] words = value.split(" ");
                            for (String word : words) {
                                out.collect(word);
                            }
                        }
                )
                .returns(Types.STRING)
                .map(word -> Tuple2.of(word, 1)).slotSharingGroup("aaa")
                .returns(Types.TUPLE(Types.STRING,Types.INT))
                .keyBy(value -> value.f0)
                .sum(1);


        // TODO 4. Print the output
        sum.print();

        // TODO 5. Execute
        env.execute();
    }
}

/**
 1. Slot characteristics:
    1) Memory is evenly divided and isolated per slot; CPU is not isolated
    2) Slots can be shared:
          within the same job, subtasks of different operators may share the same slot and run at the same time,
          provided they belong to the same slot sharing group (the default group is "default")

 2. Relationship between the number of slots and parallelism:
    1) Slots are a static concept: the upper bound on concurrency.
       Parallelism is a dynamic concept: how many slots the running job actually occupies.

    2) Requirement: number of slots >= job parallelism (the maximum parallelism among the operators), otherwise the job cannot run.
       TODO Note: in YARN mode, TaskManagers are requested dynamically
         --> TODO number of TMs requested = job parallelism / slots per TM, rounded up
       e.g. session mode: 0 TaskManagers and 0 slots at the beginning
         --> submit a job with parallelism 10
            --> 10/3, rounded up: request 4 TMs
            --> 10 slots used, 2 slots remaining
 */

P030【030_Flink runtime architecture_core concept_relationship between slot and parallelism & demonstration】21:27

4.2.4 Relationship between task slots and parallelism

Task slots and parallelism are both related to the parallel execution of a program, but they are completely different concepts. Simply put, a task slot is a static concept that describes a TaskManager's capacity for concurrent execution and is configured with the parameter taskmanager.numberOfTaskSlots; parallelism is a dynamic concept that describes the actual concurrency a TaskManager uses when running the program, and its default is configured with the parameter parallelism.default.
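
A minimal sketch of where these two parameters can be set from code, assuming a local environment built from a Configuration object (on a cluster they would normally go into flink-conf.yaml instead; the class name is illustrative):

import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.CoreOptions;
import org.apache.flink.configuration.TaskManagerOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SlotConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // taskmanager.numberOfTaskSlots: static concurrency capacity per TaskManager
        conf.set(TaskManagerOptions.NUM_TASK_SLOTS, 3);
        // parallelism.default: default parallelism actually used at runtime
        conf.set(CoreOptions.DEFAULT_PARALLELISM, 2);

        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment(conf);

        env.socketTextStream("hadoop102", 7777).print();
        env.execute();
    }
}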

The relationship between the number of slots and parallelism:
    1) Slots are a static concept, indicating the upper limit on concurrency.
       Parallelism is a dynamic concept, indicating how many slots the running job actually occupies.

    2) Requirement: the number of slots >= the job parallelism (the maximum parallelism among the operators), otherwise the job cannot run.
       Note: in YARN mode, TaskManagers are requested dynamically:
         --> number of TMs requested = job parallelism / slots per TM, rounded up.
       For example, in session mode there are 0 TaskManagers and 0 slots at the beginning
         --> submit a job with parallelism 10
            --> 10/3, rounded up: request 4 TMs
            --> 10 slots used, 2 slots remaining (see the sketch below)
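
The same calculation as a minimal code sketch (the variable names are illustrative, not Flink API):

// TaskManagers requested = ceil(job parallelism / slots per TM), via integer ceiling division
int jobParallelism = 10;
int slotsPerTaskManager = 3;  // taskmanager.numberOfTaskSlots
int taskManagersRequested = (jobParallelism + slotsPerTaskManager - 1) / slotsPerTaskManager; // = 4
int idleSlots = taskManagersRequested * slotsPerTaskManager - jobParallelism;                 // = 2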

P031【031_Flink runtime architecture_submission process_Standalone session mode & four graphs】09:49

4.3 Job submission process

4.3.1 Standalone session mode job submission process

4.3.2 Logical stream graph / job graph / execution graph / physical graph

Logical stream graph (StreamGraph) → Job graph (JobGraph) → Execution graph (ExecutionGraph) → Physical graph (Physical Graph).
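
As a minimal sketch, the first of these graphs can be inspected directly from user code: StreamExecutionEnvironment.getExecutionPlan() returns the StreamGraph as a JSON string, while the later graphs are produced by the client and the JobManager. This snippet assumes the WordCount pipeline from earlier:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.socketTextStream("hadoop102", 7777)
        .map(word -> Tuple2.of(word, 1))
        .returns(Types.TUPLE(Types.STRING, Types.INT))
        .keyBy(value -> value.f0)
        .sum(1)
        .print();

// Prints the logical stream graph (StreamGraph) as JSON;
// note: this consumes the transformations, so call it instead of execute() here.
System.out.println(env.getExecutionPlan());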

P032【032_Flink runtime architecture_submission process_Yarn application mode】05:18

4.3.3 Yarn application mode job submission process
