Flink: task chains, task assignment, and parallelism

How Flink distributes parallel tasks


In Flink, each TaskManager is a JVM process that may execute one or more subtasks in separate threads.
To control how many tasks a TaskManager accepts, a TaskManager is divided into task slots (each TaskManager has at least one slot).

Slots mainly isolate memory; CPU is shared between slots. On a 4-core machine with enough memory, the number of slots could therefore be set to 8, so that up to eight tasks run at once. The usual recommendation is one slot per CPU core.
[Figure: job graph with three tasks: source/map, keyBy/window/apply, and sink]
In this figure, source and map are chained into one task with parallelism 6.
keyBy, window and apply are chained into one task with parallelism 6.
The sink has parallelism 1.
That is 13 subtasks in total,
but the job does not need 13 slots to satisfy this parallelism.
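
The slot arithmetic above can be sketched in plain Scala, without any Flink dependency. The `Task` case class and the helper names are hypothetical and exist only for illustration. The sketch assumes all tasks share slots (Flink's default), so one slot can hold one subtask of each task and the job needs as many slots as its largest parallelism.

```scala
object SlotMath {
  // Hypothetical model of the job in the figure; this is not a Flink API.
  final case class Task(name: String, parallelism: Int)

  val tasks: Seq[Task] = Seq(
    Task("source -> map", 6),
    Task("keyBy -> window -> apply", 6),
    Task("sink", 1)
  )

  // Total number of subtasks across all tasks.
  def totalSubtasks(ts: Seq[Task]): Int = ts.map(_.parallelism).sum

  // Assumption: every task is in the same slot sharing group, so a slot
  // can hold one subtask of each task and the largest parallelism decides.
  def requiredSlots(ts: Seq[Task]): Int = ts.map(_.parallelism).max

  def main(args: Array[String]): Unit = {
    println(totalSubtasks(tasks)) // 13
    println(requiredSlots(tasks)) // 6
  }
}
```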

Different operators differ in computational cost.
We can call cheap operators such as source, map and sink non-resource-intensive operators, and computationally heavy operators such as aggregate, reduce, sum and window resource-intensive operators.

If both kinds of operator were treated with equal priority and spread evenly over the slots, then with data arriving from the source at a steady rate, the slots running complex operators would be busy all the time while the slots running simple operators would be mostly idle.

Flink therefore allows non-resource-intensive and resource-intensive operators to be assigned to the same slot, so that load is balanced across slots and no slot is constantly overloaded while another sits idle.

A task with parallelism 6 is split into six parallel subtasks. These six subtasks cannot be placed in the same slot; a slot holds at most one subtask of a given task. So if your cluster has only six slots, you cannot set an operator's parallelism above 6.

Flink can also place non-resource-intensive and resource-intensive operators in different slots. This requires setting slot sharing groups: put the non-resource-intensive operators in one sharing group and the resource-intensive operators in another, and the two kinds of operator will no longer share slots. By default all operators belong to the same sharing group and share all slots.
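
A minimal sketch of slot sharing groups with the Scala DataStream API, assuming the Flink streaming Scala dependency is on the classpath. The group names `light` and `heavy`, the host, and the port are hypothetical. An operator inherits the group of its input unless a group is set explicitly, so the group is set once at the start of each section.

```scala
import org.apache.flink.streaming.api.scala._

object SlotSharingGroups {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    env
      .socketTextStream("localhost", 7777) // hypothetical host and port
      .slotSharingGroup("light")           // cheap operators: source, flatMap, map
      .flatMap(_.split(" "))
      .map((_, 1))
      .keyBy(_._1)
      .sum(1)
      .slotSharingGroup("heavy")           // costly aggregation and everything after it
      .print()

    env.execute("slot sharing group sketch")
  }
}
```

With separate groups, subtasks of the two groups never share a slot, so the job needs more slots: the largest parallelism in `light` plus the largest parallelism in `heavy`.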

By default, Flink allows subtasks to share a slot even if they are subtasks of different tasks, as long as they belong to the same job. As a result, one slot may hold an entire pipeline of the job.
A task slot is a static concept: it refers to the concurrent execution capacity that a TaskManager has.

Let's look at a few examples.
[Figure: examples of subtasks assigned to task slots]

Parallelism can be divided into two kinds:

Data parallelism
Several source subtasks pull data in parallel, and several map subtasks process data in parallel.
Task (computation) parallelism
While source pulls in new data, map processes the data that source pulled earlier,
so the two tasks execute in parallel.

The number of subtasks of a particular operator is called its parallelism.
In general, the parallelism of a stream can be taken to be the largest parallelism among all of its operators.

When a Flink program is run from IntelliJ IDEA (a local environment), the default parallelism is the number of CPU cores of the machine running the program.

Parallelism can be set for each operator individually.

.map((_, 1)).setParallelism(2)

Parallelism can also be set globally.

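val env = ExecutionEnvironment.getExecutionEnvironment
env.setParallelism(2) // returns Unit in the Scala API, so it cannot be chained into the val
With a global setting, operators that do not support parallel execution, for example env.readTextFile(inputpath), may report an error.
Adjust the parallelism of the source and sink as the situation requires.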

Parallelism can be configured in three places:

Flink configuration file
Code
Job submission time

Priority:

Code > Submit > Configuration file

If parallelism is set in code, the code value is used; if it is not set in code, the submission-time value is used; if neither is set, the value from the configuration file applies.
Within code, a setting on an individual operator takes precedence over the global setting.
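
The precedence rule can be modelled in a few lines of plain Scala. This is only an illustration of the ordering described above, not Flink's internal logic; the function and parameter names are hypothetical.

```scala
object ParallelismPriority {
  // None means "not set at this level". All names are hypothetical.
  def effectiveParallelism(
      operatorLevel: Option[Int],    // .setParallelism(n) on a single operator
      environmentLevel: Option[Int], // env.setParallelism(n) in code
      submitLevel: Option[Int],      // value given at job submission time
      configFileDefault: Int         // default from the Flink configuration file
  ): Int =
    operatorLevel
      .orElse(environmentLevel)
      .orElse(submitLevel)
      .getOrElse(configFileDefault)

  def main(args: Array[String]): Unit = {
    println(effectiveParallelism(Some(4), Some(2), Some(8), 1)) // 4
    println(effectiveParallelism(None, Some(2), Some(8), 1))    // 2
    println(effectiveParallelism(None, None, Some(8), 1))       // 8
    println(effectiveParallelism(None, None, None, 1))          // 1
  }
}
```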

Slot sharing groups can be used to spread tasks as evenly as possible across the cluster.

Task chains
With parallelism set sensibly, chaining operators into one task:

Reduces the cost of local communication
Reduces serialization and deserialization

Multiple operators are merged into one task, and the original operators become parts of its subtasks.

Operators must meet the following conditions to be chained:

The operators have the same parallelism (the same number of partitions)
The operators are connected one-to-one


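One-to-one: the stream preserves the partitioning and the order of elements (for example between source and map). This means a subtask of the map operator sees the same number of elements, in the same order, as the corresponding subtask of the source operator produced. Operators such as map, filter and flatMap are all one-to-one.

Redistributing: the partitioning of the stream changes. Each operator subtask sends data to different target tasks depending on the chosen transformation. For example, keyBy repartitions by the hash of the key, broadcast sends every element to all downstream subtasks, and rebalance redistributes elements round-robin. All of these trigger a redistribute step, which is similar to a shuffle in Spark.

Data passed between operators with different parallelism is repartitioned, and Redistributing operators also cause repartitioning.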
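
The two chaining conditions can be expressed as a small predicate in plain Scala. The types and names are hypothetical, and the sketch models only the two conditions named in this article; it is not Flink's actual chaining check.

```scala
object ChainingRule {
  // How data moves from an upstream operator to a downstream one.
  sealed trait Exchange
  case object OneToOne extends Exchange       // e.g. source -> map
  case object Redistributing extends Exchange // e.g. keyBy, broadcast, rebalance

  // Hypothetical operator description; not a Flink class.
  final case class Op(name: String, parallelism: Int)

  // Chain only when parallelism matches and the connection is one-to-one.
  def canChain(up: Op, down: Op, exchange: Exchange): Boolean =
    up.parallelism == down.parallelism && exchange == OneToOne

  def main(args: Array[String]): Unit = {
    println(canChain(Op("flatMap", 2), Op("map", 2), OneToOne))   // true
    println(canChain(Op("source", 1), Op("flatMap", 2), OneToOne)) // false
    println(canChain(Op("map", 2), Op("sum", 2), Redistributing)) // false
  }
}
```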

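Example

The default parallelism in the configuration file is 2, and the job is submitted with parallelism 2
socket source: parallelism can only be 1
flatMap, filter, map: all have parallelism 2 and are one-to-one, so they are merged into a task chain
keyBy: Redistributing, repartitions by hash
sum, print: parallelism 2, one-to-one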
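
A job matching this example could look like the sketch below, assuming the Flink streaming Scala dependency is available. The host, port, and job name are hypothetical, and parallelism is set in code here only so the sketch is self-contained.

```scala
import org.apache.flink.streaming.api.scala._

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(2) // global parallelism for operators that support it

    val counts = env
      .socketTextStream("localhost", 7777) // non-parallel source, runs with parallelism 1
      .flatMap(_.split(" "))               // one-to-one, parallelism 2
      .filter(_.nonEmpty)                  // one-to-one, parallelism 2
      .map((_, 1))                         // one-to-one, parallelism 2
      .keyBy(_._1)                         // redistributing: hash repartition by word
      .sum(1)                              // parallelism 2

    counts.print()                         // one-to-one with sum, parallelism 2
    env.execute("socket word count")
  }
}
```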

The execution graph is as follows:
[Figure: execution graph of the example job]
Chaining can of course also be disabled.

Exclude a single operator from chaining:

.flatMap(_.split(" ")).disableChaining()

Start a new task chain from a single operator:

.startNewChain()

Disable chaining globally:

env.disableOperatorChaining()

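Below is the execution graph of a job with chaining disabled globally; it is the previous example with only the global setting added.
[Figure: execution graph with operator chaining disabled globally]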

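Setting operator parallelism

A file source needs parallelism 1 to preserve the order of the data
A sink that writes to a single file needs parallelism 1
A socket source can only have parallelism 1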

yagch
