Flink fundamentals: parallelism and operator chains

Parallelism 

Operator chain


Parallelism 

        The number of subtasks of a particular operator is called its parallelism. A data stream whose operators run as several parallel subtasks is a parallel data stream, and it is split into multiple stream partitions so that the work can be distributed across those subtasks. Different operators in the same program may have different degrees of parallelism, so the parallelism of a stream program as a whole is generally taken to be the largest parallelism among its operators.

        As shown in the figure, the current data stream has four operators: Source, map(), keyBy()/window()/apply(), and Sink. Every operator except the Sink has a parallelism of 2, and the Sink has a parallelism of 1. The whole program therefore contains 2 + 2 + 2 + 1 = 7 subtasks and needs at least 2 partitions to execute in parallel, so we say that the parallelism of this stream processing program is 2.
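The arithmetic above can be checked with a small sketch. This is plain Scala, not Flink API: the `Op` case class and the `SubtaskMath` object are hypothetical names introduced only to model the figure's four operators.

```scala
// Hypothetical model: an operator is just a name plus its parallelism.
final case class Op(name: String, parallelism: Int)

object SubtaskMath {
  // The four operators from the figure described above.
  val figureJob: Seq[Op] = Seq(
    Op("Source", 2),
    Op("map()", 2),
    Op("keyBy()/window()/apply()", 2),
    Op("Sink", 1)
  )

  // Each operator contributes one subtask per unit of parallelism.
  def totalSubtasks(ops: Seq[Op]): Int = ops.map(_.parallelism).sum

  // The program's parallelism is the largest parallelism among its operators.
  def jobParallelism(ops: Seq[Op]): Int = ops.map(_.parallelism).max

  def main(args: Array[String]): Unit = {
    println(s"subtasks = ${totalSubtasks(figureJob)}")      // 2 + 2 + 2 + 1 = 7
    println(s"parallelism = ${jobParallelism(figureJob)}")  // 2
  }
}
```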

        Setting parallelism in the code 

        In the code, we can call the setParallelism() method on an operator to set that operator's parallelism. A value set this way applies only to that operator.

stream.map((_,1)).setParallelism(2)

        We can also call the setParallelism() method of the execution environment to set the parallelism globally:

env.setParallelism(2)

        With this setting, every operator in the code has a default parallelism of 2. We generally do not set the global parallelism in the program, because hard-coding it there prevents the job from being scaled dynamically.

Note that because keyBy() does not return an operator (it only determines how the stream is partitioned), parallelism cannot be set on keyBy().

        Setting parallelism when submitting a job 

        When submitting a job with the flink run command, you can add the -p parameter to specify the parallelism for that application. The effect is similar to the global setting on the execution environment:

bin/flink run -p 2 -c com.atguigu.wc.StreamWordCount ./FlinkTutorial-1.0-SNAPSHOT.jar

        Setting parallelism in the configuration file

        The default parallelism can be changed directly in the cluster configuration file flink-conf.yaml:

parallelism.default: 2

        This setting applies to all jobs submitted to the cluster, and its initial value is 1. Setting the parallelism in the code or with the -p parameter at submission is optional: when no parallelism is specified, the cluster default from the configuration file is used.
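The fallback behaviour can be sketched as a chain of optional settings. This is plain Scala, not Flink API, and `EffectiveParallelism.resolve` is a hypothetical helper. The order it assumes (operator setting first, then execution environment, then the -p parameter, then parallelism.default) follows the precedence described in Flink's documentation; the text above only states the final fallback to the cluster default.

```scala
object EffectiveParallelism {
  // Hypothetical helper: pick the first setting that is present.
  // Assumed order: operator > execution environment > -p > parallelism.default.
  def resolve(
      operatorLevel: Option[Int],    // setParallelism() on an operator
      environmentLevel: Option[Int], // env.setParallelism()
      clientLevel: Option[Int],      // flink run -p
      clusterDefault: Int = 1        // parallelism.default in flink-conf.yaml
  ): Int =
    operatorLevel
      .orElse(environmentLevel)
      .orElse(clientLevel)
      .getOrElse(clusterDefault)

  def main(args: Array[String]): Unit = {
    println(resolve(None, None, None))          // nothing set: cluster default
    println(resolve(None, None, Some(2)))       // only -p given
    println(resolve(Some(4), Some(3), Some(2))) // operator setting wins
  }
}
```

Modelling each level as an `Option[Int]` keeps "not set" distinct from any real value, which is why the cluster default is the only parameter that is a plain `Int`.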

Operator chain 

        In Flink, operators that are connected one-to-one and have the same parallelism can be linked directly into a single larger task, so that each original operator becomes one part of that task, as shown in the figure. Each task is executed by a single thread. This technique is called an operator chain.
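The two conditions named above can be written as a small predicate. This is a simplified sketch in plain Scala, not Flink's internal check, which takes further conditions into account. `Exchange`, `Node`, and `ChainRule` are hypothetical names used only for illustration.

```scala
// Hypothetical description of how one operator hands data to the next.
sealed trait Exchange
case object Forward extends Exchange      // one-to-one, no repartitioning
case object Redistribute extends Exchange // e.g. the shuffle after keyBy()

final case class Node(name: String, parallelism: Int)

object ChainRule {
  // Simplified rule from the text: one-to-one connection and equal parallelism.
  def canChain(upstream: Node, downstream: Node, exchange: Exchange): Boolean =
    exchange == Forward && upstream.parallelism == downstream.parallelism

  def main(args: Array[String]): Unit = {
    val source = Node("Source", 2)
    val map    = Node("map()", 2)
    val window = Node("keyBy()/window()/apply()", 2)
    val sink   = Node("Sink", 1)

    println(canChain(source, map, Forward))      // true: one-to-one, both 2
    println(canChain(map, window, Redistribute)) // false: keyBy() repartitions
    println(canChain(window, sink, Forward))     // false: parallelism 2 vs 1
  }
}
```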

        By default, Flink chains operators together whenever these conditions are met. To disable chaining or to control where a chain begins, we can configure specific operators in the code:

// Disable operator chaining for this operator
.map((_,1)).disableChaining()
// Start a new chain beginning with this operator
.map((_,1)).startNewChain()

Origin blog.csdn.net/dafsq/article/details/129674343