import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala._

object StreamWordCount {
  def main(args: Array[String]): Unit = {
    // Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Read hostname and port from the program arguments (--host ... --port ...)
    val paramTool: ParameterTool = ParameterTool.fromArgs(args)
    val host = paramTool.get("host")
    val port = paramTool.getInt("port")
    val dataStream = env.socketTextStream(host, port)
    // Process each record
    val wordCountDataStream = dataStream.flatMap(_.split(" "))
      .filter(_.nonEmpty)
      // Start a new task chain beginning with this operator. With a forward
      // (one-to-one) connection and equal parallelism, tasks are normally
      // merged into one chain unless this is set.
      .startNewChain()
      .map((_, 1))
      .keyBy(0)
      .sum(1)
    wordCountDataStream.print()
      .setParallelism(1) // Set the parallelism of the output (sink) task
    // Start the executor
    env.execute("stream word count job")
  }
}
Flink configuration settings (values set in code override those in the configuration file):
# Number of task slots offered by each TaskManager
taskmanager.numberOfTaskSlots: 2
# Default parallelism of a program; the maximum parallelism is the total number of slots across all TaskManagers
parallelism.default: 1
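A minimal sketch of how a parallelism set in code takes precedence over parallelism.default from the configuration file, assuming the Flink Scala DataStream API is on the classpath. The object name, the sample input and the job name are made up for illustration.

```scala
import org.apache.flink.streaming.api.scala._

object ParallelismOverrideSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Job-level setting: takes precedence over parallelism.default
    // from the configuration file.
    env.setParallelism(2)
    println(s"job parallelism = ${env.getParallelism}")

    env.fromElements("a b", "c d") // sample input, not from the original job
      .flatMap(_.split(" "))
      // Operator-level setting: takes precedence over the job-level value.
      .map((_, 1)).setParallelism(1)
      .print()

    env.execute("parallelism override sketch")
  }
}
```

The order of precedence is operator level, then environment level, then the configuration file default.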
As we can see, once startNewChain() is set on the filter, the part of the execution plan that previously had the same parallelism and forward (one-to-one) connections is split into two task chains. (When it is not set, Flink by default merges all one-to-one tasks with the same parallelism into a single chain.)
If .startNewChain() is removed after the filter, tasks that are connected in forward (one-to-one) mode and have the same parallelism are always merged.
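The merging rule described above can be sketched as a small, Flink-free model. This is only a simplified illustration under stated assumptions, not Flink's actual implementation: the names Op, forwardInput, startNewChain and chains are hypothetical, the pipeline is assumed to be linear, and only the conditions mentioned in the text (forward connection, equal parallelism, explicit chain break, chaining enabled) are modelled.

```scala
object ChainModel {
  // One operator in a linear pipeline, described from its input side.
  // Hypothetical type, used only to mirror the rules in the text.
  final case class Op(
      name: String,
      parallelism: Int,
      forwardInput: Boolean = true,  // false after keyBy, which redistributes data
      startNewChain: Boolean = false // models .startNewChain() on this operator
  )

  // Group a linear pipeline into task chains (lists of operator names).
  def chains(ops: List[Op], chainingEnabled: Boolean = true): List[List[String]] = {
    // Accumulator: newest chain first, newest operator first inside each chain.
    val grouped = ops.foldLeft(List.empty[List[Op]]) {
      case (Nil, op) => List(List(op))
      case (current :: done, op) =>
        val prev = current.head
        val canChain = chainingEnabled && op.forwardInput &&
          !op.startNewChain && op.parallelism == prev.parallelism
        if (canChain) (op :: current) :: done
        else List(op) :: current :: done
    }
    grouped.reverse.map(_.reverse.map(_.name))
  }

  // The word count pipeline with job parallelism p; source and print stay at 1.
  def wordCount(p: Int, newChainAtFilter: Boolean): List[Op] = List(
    Op("source", 1),
    Op("flatMap", p),
    Op("filter", p, startNewChain = newChainAtFilter),
    Op("map", p),
    Op("sum", p, forwardInput = false),
    Op("print", 1)
  )

  def main(args: Array[String]): Unit = {
    println(chains(wordCount(2, newChainAtFilter = true)))
    println(chains(wordCount(2, newChainAtFilter = false)))
    println(chains(wordCount(1, newChainAtFilter = false)))
  }
}
```

In this model, with parallelism 2 and a new chain started at the filter, flatMap stands alone and filter is chained with map; without it, flatMap, filter and map form one chain. With parallelism 1 the source joins that chain as well, and sum is chained with print.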
There are two common ways to prevent tasks from being merged into a task chain:
1. Disable task chaining for every task in the job by setting it on the environment:
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.disableOperatorChaining()
2. Disable merging at a single step, as in the code above, by starting a new task chain at that operator:
.filter(_.nonEmpty).startNewChain()
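For reference, a minimal sketch that puts both settings into one small job, assuming the Flink Scala DataStream API is on the classpath. The object name, the sample input and the job name are made up for illustration. It also shows disableChaining(), a related per-operator call that the text above does not cover.

```scala
import org.apache.flink.streaming.api.scala._

object ChainControlSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Option 1 (job-wide): uncomment to stop every operator from being chained.
    // env.disableOperatorChaining()

    env.fromElements("hello flink", "hello chain") // sample input, not from the original job
      .flatMap(_.split(" "))
      // Option 2: the filter begins a new chain, so it is not merged with flatMap.
      .filter(_.nonEmpty).startNewChain()
      // Related call: this map is not chained to the operator before or after it.
      .map((_, 1)).disableChaining()
      .keyBy(_._1)
      .sum(1)
      .print()

    env.execute("chain control sketch")
  }
}
```

The job-wide switch is convenient for debugging, while the per-operator calls leave the rest of the pipeline chained.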
When Flink reads a file, the official documentation says: "Under the hood, Flink splits the file reading process into two sub-tasks, namely directory monitoring and data reading. Each of these sub-tasks is implemented by a separate entity. Monitoring is implemented by a single, non-parallel (parallelism = 1) task." The monitoring task of a file source therefore always has parallelism 1, and in the same way the socket source can only have parallelism 1. The print task has parallelism 1 here because the code calls setParallelism(1) on it. The execution plan of the whole WordCount job thus looks as follows (with startNewChain removed):
Parallelism of 1:
Parallelism of 2:
Parallelism of 3:
Note: The wording here may be rough. These are only my own notes for understanding and study, so please be kind if you disagree.