Flink WordCount: a detailed look at task parallelism in stream processing

Disclaimer: This is an original article by the blogger, released under the CC 4.0 BY-SA license. Please attach the original source link and this statement when reproducing.
Original link: https://blog.csdn.net/qq_40713537/article/details/102696646
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala._


object StreamWordCount {
    def main(args: Array[String]): Unit = {

        // Create the streaming execution environment
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        
        // Read the hostname and port from the command-line arguments
        val paramTool: ParameterTool = ParameterTool.fromArgs(args)
        val host = paramTool.get("host")
        val port = paramTool.getInt("port")

        val dataStream = env.socketTextStream(host, port)
        
        // Process each incoming line
        val wordCountDataStream = dataStream.flatMap(_.split(" "))
            .filter(_.nonEmpty)
            .startNewChain() /* Start a new task chain here. Without this call, operators connected in forward (one-to-one) mode with the same parallelism are chained into a single task. */
            .map((_, 1))
            .keyBy(0)
            .sum(1)
        
        wordCountDataStream.print()
            .setParallelism(1) // Set the parallelism of the print sink to 1
        
        // Launch the job
        env.execute("stream word count job")
    }
}
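Parallelism can also be set in code rather than only in the Flink configuration file. Below is a minimal sketch reusing the job skeleton above; the specific parallelism values are illustrative assumptions, not requirements:

```scala
import org.apache.flink.streaming.api.scala._

object ParallelismDemo {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Job-level default; overrides parallelism.default from flink-conf.yaml
        env.setParallelism(2)

        env.socketTextStream("localhost", 7777)
            .flatMap(_.split(" "))
            .filter(_.nonEmpty)
            .map((_, 1)).setParallelism(4) // operator-level; overrides the job default
            .keyBy(0)
            .sum(1)
            .print().setParallelism(1)     // sink-level parallelism

        env.execute("parallelism demo")
    }
}
```

Operator-level settings take precedence over the job-level default, which in turn overrides the cluster-wide parallelism.default.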

Flink configuration parameter settings (in flink-conf.yaml):

# Number of task slots offered by each TaskManager
taskmanager.numberOfTaskSlots: 2

# Default parallelism for jobs; the maximum usable parallelism is the total number of slots across all TaskManagers
parallelism.default: 1
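For example, assuming a hypothetical standalone cluster of 3 TaskManagers, the settings above would give:

```
# 3 TaskManagers × 2 slots each = 6 slots in total
# => maximum parallelism any job can request = 6
taskmanager.numberOfTaskSlots: 2
parallelism.default: 1
```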

As we can see, after adding startNewChain() behind the filter, operators that originally had the same parallelism and a forward (one-to-one) connection are split into two task chains in the execution plan. (When it is not set, Flink by default chains together all operators that are connected one-to-one and share the same parallelism.)

If you remove .startNewChain() after the filter, operators connected forward (one-to-one) with the same parallelism are chained back into a single task.

There are two common ways to control task chaining:

1. Disable chaining for all tasks of the job at the environment level:

  val env = StreamExecutionEnvironment.getExecutionEnvironment

  env.disableOperatorChaining()

2. Start a new chain at a specific operator, as shown above:

.filter(_.nonEmpty).startNewChain()
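Besides startNewChain(), which only breaks the chain to the operator's predecessor, the DataStream API also provides disableChaining() on a single operator, which keeps that operator out of any chain (neither predecessor nor successor). A short sketch:

```scala
// The filter runs as its own task, chained with neither neighbor
val wordCountDataStream = dataStream.flatMap(_.split(" "))
    .filter(_.nonEmpty).disableChaining()
    .map((_, 1))
    .keyBy(0)
    .sum(1)
```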

When Flink reads a file, according to the official documentation: "Under the hood, Flink splits the file reading process into two sub-tasks, namely directory monitoring and data reading. Each of these sub-tasks is implemented by a separate entity. Monitoring is implemented by a single, non-parallel (parallelism = 1) task." So the monitoring part of a file source can only run with parallelism 1, and the socket source is likewise non-parallel. The print sink in this job also runs with parallelism 1, since we set it explicitly. The execution plan for the whole WordCount job (after removing startNewChain) is therefore as shown below:

Parallelism of 1: (figure omitted)

Parallelism of 2: (figure omitted)

Parallelism of 3: (figure omitted)
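Relatedly, because the socket source is non-parallel, explicitly requesting a higher parallelism on it is rejected by Flink. A sketch (the exact exception message may vary by version):

```scala
val source = env.socketTextStream(host, port)
// Non-parallel sources must keep parallelism 1; the following line
// would throw an exception at job construction time:
// source.setParallelism(2)
```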

Note: The wording here may be rough; this article is written mainly for my own understanding and learning.
