The relationship between Flink slots and Kafka topic partitions

Today someone in the group asked about the relationship between slots and Kafka topic partitions (below, "topic" always means a Kafka topic). This post tidies up my answer.

First, to be clear: the number of Flink Task Manager slots is not directly related to the number of topic partitions. What the question is really asking about is the relationship between a job's parallelism and the number of slots.

Maximum parallelism = number of slots

There are two reasons: parallel subtasks of the same operator cannot run in the same slot, while subtasks of different operators can share a slot. So the maximum parallelism equals the number of slots.

This gives an indirect relationship between the number of slots and the number of topic partitions: we usually set the parallelism of the source operator (and of the operators after it) according to the number of Kafka partitions, and the maximum operator parallelism determines the number of slots (from which, given the slots per TM, we can work back to the number of TMs).
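The arithmetic above can be sketched in a few lines of Java. This is only a back-of-the-envelope illustration: the class and method names (SlotMath, requiredSlots, requiredTaskManagers) are made up for this post and are not Flink APIs, and it assumes all operators stay in the default slot sharing group.

```java
/** Rough slot arithmetic; names are illustrative only, not part of Flink. */
final class SlotMath {

    /**
     * Assuming default slot sharing, a job needs as many slots as the
     * highest parallelism among its operators.
     */
    static int requiredSlots(int... operatorParallelisms) {
        int max = 0;
        for (int p : operatorParallelisms) {
            max = Math.max(max, p);
        }
        return max;
    }

    /** Task Managers needed to provide that many slots (ceiling division). */
    static int requiredTaskManagers(int requiredSlots, int slotsPerTaskManager) {
        if (slotsPerTaskManager <= 0) {
            throw new IllegalArgumentException("slotsPerTaskManager must be positive");
        }
        return (requiredSlots + slotsPerTaskManager - 1) / slotsPerTaskManager;
    }

    public static void main(String[] args) {
        // Source with parallelism 9, one more operator with 9, sink with 1.
        int slots = requiredSlots(9, 9, 1);
        System.out.println("slots needed: " + slots);
        System.out.println("TMs needed at 3 slots each: " + requiredTaskManagers(slots, 3));
    }
}
```

If operators are placed in separate slot sharing groups, the slot count would be the sum over the groups rather than a single maximum, so treat this only as the default case.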

Here are the figures from the official website:

Description:

Figure 1: 3 Task Managers with 3 slots each, 9 slots in total

Figure 2: Example 1, the WordCount case with parallelism 1. The operators are chained together and occupy only one slot

Figure 3: Example 2, the WordCount case with parallelism 2, occupying 2 slots. There are three ways to set the parallelism:

flink-conf.yaml parameter parallelism.default: 2
flink run -p 2    # pass the -p parameter when submitting the job
env.setParallelism(2)

Figure 4: Example 3, the WordCount case with parallelism 9, occupying 9 slots

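Figure 5: Example 4, the WordCount case with source parallelism 9 and sink parallelism 1, occupying 9 slots (the sink shares a slot with one of the source subtasks)

Let's look at a concrete job:

The topic we want to read has 2 partitions, and we set the parallelism of the source operator to 2, so we need at least 2 slots. With each Task Manager configured with 2 slots, a single TM is enough for the job to run (ignoring other operators).

Key code: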

env.setParallelism(2)
env.addSource(source).addSink(sink)

Submitted to YARN:

 

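The above explains the relationship between operator parallelism and the number of TM slots.

Next, let's look at the relationship between the number of Kafka partitions and the parallelism of the source operator.

Without changing the Kafka consumer's partition assignment strategy, the source behaves differently depending on how its parallelism compares with the number of topic partitions:

1. source parallelism = number of topic partitions: the ideal case, each parallel subtask reads one partition

2. source parallelism < number of topic partitions: some subtasks read more than one partition. For details see the post: Flink reading Kafka data, partition allocation

3. source parallelism > number of topic partitions: some subtasks receive no data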
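The three cases can be made concrete with a small Java sketch. It assumes a plain round-robin spread (partition id modulo source parallelism); this is a simplification chosen for illustration, not the connector's actual assignment code, and the class name PartitionSpread is made up for this post.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified round-robin spread of partitions over source subtasks (illustrative only). */
final class PartitionSpread {

    /** Returns, for each source subtask index, the partition ids it would read. */
    static List<List<Integer>> assign(int numPartitions, int sourceParallelism) {
        if (sourceParallelism <= 0) {
            throw new IllegalArgumentException("sourceParallelism must be positive");
        }
        List<List<Integer>> perSubtask = new ArrayList<>();
        for (int i = 0; i < sourceParallelism; i++) {
            perSubtask.add(new ArrayList<>());
        }
        // Assumption: partition p goes to subtask p % parallelism.
        for (int p = 0; p < numPartitions; p++) {
            perSubtask.get(p % sourceParallelism).add(p);
        }
        return perSubtask;
    }

    public static void main(String[] args) {
        System.out.println("2 partitions, parallelism 2: " + assign(2, 2));
        System.out.println("3 partitions, parallelism 2: " + assign(3, 2));
        System.out.println("2 partitions, parallelism 3: " + assign(2, 3));
    }
}
```

In the third call one subtask ends up with an empty list, which is the "subtask with no data" situation from case 3. The real connector may start the rotation at a different subtask, so which subtask is idle can differ from this sketch.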

 

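To sum up: the number of slots has no direct relationship with the number of topic partitions. Kafka is the most common source, and the number of partitions of the Kafka topic is usually used as the Flink source parallelism, which is usually also the maximum parallelism of the Flink job, which in turn is usually the number of slots. This creates the illusion that the number of slots and the number of topic partitions are directly related.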

 

Note: the number of slots per Task Manager is configured in flink-conf.yaml with this parameter:

# The number of task slots that each TaskManager offers. Each slot runs one parallel pipeline.

taskmanager.numberOfTaskSlots: 2 # default is 1

Official slot configuration docs: https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/config.html#configuring-taskmanager-processing-slots (the recommended slot count there assumes only one job is running; the actual setting depends on your situation)

 

 

Follow the Flink菜鸟 WeChat official account, which posts Flink (development) articles from time to time


Origin www.cnblogs.com/Springmoon-venn/p/12023350.html