Introduction to Flink Application Development

Flink programs follow a common programming pattern: DataStream API and DataSet API programs share substantially the same structure. The following word-frequency program over a text file illustrates the typical program flow.

package com.realtime.flink.streaming
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment, _}

object WordCount {
    def main(args: Array[String]) {
        val params = ParameterTool.fromArgs(args)
        // Step 1: set up the execution environment
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        // Step 2: specify the data source and start reading data
        val text = env.readTextFile("file:///path/file")
        // Step 3: specify the transformation logic on the data set
        val counts: DataStream[(String, Int)] = text
            .flatMap(_.toLowerCase.split(" "))
            .filter(_.nonEmpty)
            .map((_, 1))
            .keyBy(0)
            .sum(1)
        // Step 4: specify where the results are written
        if (params.has("output")) {
            counts.writeAsText(params.get("output"))
        } else {
            println("Printing result to stdout. Use --output to specify output path.")
            counts.print()
        }
        // Step 5: name the job and trigger the streaming execution
        env.execute("Streaming WordCount")
    }
}

The whole Flink program is divided into five steps:

1. Set up the Flink execution environment

The execution environment determines the type of the application: StreamExecutionEnvironment is the environment for stream processing, while ExecutionEnvironment is the environment for batch data processing.

The environment can be obtained in three ways:

  • Stream processing:

    // Set up the Flink execution environment: a local environment when started locally, a cluster environment when started on a cluster
    StreamExecutionEnvironment.getExecutionEnvironment
    // Create a local execution environment with the given parallelism
    StreamExecutionEnvironment.createLocalEnvironment(5)
    // Specify the remote JobManager IP and RPC port, the parallelism, and the jar containing the program and its dependencies
    StreamExecutionEnvironment.createRemoteEnvironment("JobManagerHost", 6021, 5, "/user/application.jar")

    The third way opens an RPC connection from the local code directly to the JobManager of a remote cluster and specifies the jar of the program to run, which is copied to the JobManager node. The Flink application then runs in that remote environment, with the local program acting as the client.

  • Batch:

    // Set up the Flink execution environment: a local environment when started locally, a cluster environment when started on a cluster
    ExecutionEnvironment.getExecutionEnvironment
    // Create a local execution environment with the given parallelism
    ExecutionEnvironment.createLocalEnvironment(5)
    // Specify the remote JobManager IP and RPC port, the parallelism, and the jar containing the program and its dependencies
    ExecutionEnvironment.createRemoteEnvironment("JobManagerHost", 6021, 5, "/user/application.jar")

Note that Flink applications developed in different languages must import the execution environment from the corresponding language-specific package.
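For example, the Scala and Java APIs provide the stream execution environment under different packages (package paths as found in the Flink 1.x releases):

// Scala API
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
// Java API
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment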

2. Initialize the data

  • After the execution environment is created, the data needs to be initialized through the different data access interfaces the environment provides, converting external data into a DataStream or DataSet data set.

  • Flink provides a variety of connectors for reading external data, including both batch and real-time connectors, and can also be connected to other third-party systems to access external data directly.

  • The following code reads data from the path file://pathfile via the readTextFile() method and converts it into a DataStream data set.

val text: DataStream[String] = env.readTextFile("file://pathfile")

This reads the file into a DataStream[String] data set, completing the conversion from a local file to a distributed data set.
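Besides readTextFile(), the execution environment offers other basic access interfaces; a small sketch of a few of them (the element values, host, and port are placeholders):

// create a DataStream from a fixed set of elements
val elements: DataStream[String] = env.fromElements("hello", "flink")
// create a DataStream from a Scala collection
val fromList: DataStream[String] = env.fromCollection(List("hello", "flink"))
// read text data from a socket
val socketStream: DataStream[String] = env.socketTextStream("localhost", 9999)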

3. Perform transformations

Transformations on a data set are carried out by different operators. Each operator implements its data processing logic internally, and that logic is defined by implementing a Function interface.

The DataStream API and the DataSet API provide many transformation operators, such as map, flatMap, filter, and keyBy. The user only needs to define the logic executed by each operator and apply it to the data set through the operator interface.

 val counts: DataStream[(String, Int)] = text
     .flatMap(_.toLowerCase.split(" "))  // perform the flatMap operation
     .filter(_.nonEmpty)                 // filter out empty fields
     .map((_, 1))                        // perform the map transformation into key-value pairs
     .keyBy(0)                           // repartition the data by the given key
     .sum(1)                             // perform the sum operation

The computation logic of a Flink Function can be defined in the following ways:

1. By creating a class that implements the Function interface

 import org.apache.flink.api.common.functions.MapFunction

 // implement the MapFunction interface
 class MyMapFunction extends MapFunction[String, String] {
     override def map(t: String): String = {
         t.toUpperCase()
     }
 }

 val dataStream: DataStream[String] = env.fromElements("hello", "flink")
 // pass the MyMapFunction implementation into the operator
 dataStream.map(new MyMapFunction)

This completes the data processing step of converting the strings in the data set to uppercase.

2. By creating an anonymous class that implements the Function interface

 val dataStream: DataStream[String] = env.fromElements("hello", "flink")
 // define the computation logic of the map function by creating an anonymous MapFunction implementation
 dataStream.map(new MapFunction[String, String] {
     // convert the input string to uppercase
     override def map(t: String): String = {
         t.toUpperCase()
     }
 })

3. By implementing the RichFunction interface

For more advanced data processing scenarios, Flink provides the RichFunction interface. It offers open() and close() methods, as well as getRuntimeContext() and setRuntimeContext() for accessing internal state, cached data, and other runtime information. Analogous to MapFunction, RichFunction also has a RichMapFunction subclass.

 // define an anonymous class implementing the RichMapFunction interface to convert strings to integers
 dataStream.map(new RichMapFunction[String, Int] {
     // convert the input string to an integer
     override def map(in: String): Int = in.toInt
 })
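For instance, open() and getRuntimeContext() can be used to set up resources before any record is processed; a minimal sketch (the println bodies are placeholder logic):

 import org.apache.flink.api.common.functions.RichMapFunction
 import org.apache.flink.configuration.Configuration

 class MyRichMapFunction extends RichMapFunction[String, String] {
     // called once before the first record, e.g. to open connections or caches
     override def open(parameters: Configuration): Unit = {
         println(s"opening subtask ${getRuntimeContext.getIndexOfThisSubtask}")
     }
     override def map(in: String): String = in.toUpperCase
     // called once after the last record, e.g. to release resources
     override def close(): Unit = println("closing")
 }

 dataStream.map(new MyRichMapFunction)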

4. Specify the partition key

Some operators require a key to be specified for the transformation; common examples are join, coGroup, and groupBy. The DataStream or DataSet must be converted into the corresponding KeyedStream or GroupedDataSet, primarily to route data with the same key into the same pipeline.

1. By field position

// aggregation with the DataStream API
val dataStream: DataStream[(String, Int)] = env.fromElements(("a", 1), ("c", 2))
// repartition by the first field, then sum the second field
val result = dataStream.keyBy(0).sum(1)

// aggregation with the DataSet API
val dataSet = env.fromElements(("a", 1), ("c", 2))
// repartition the data by the first field
val groupedDataSet: GroupedDataSet[(String, Int)] = dataSet.groupBy(0)
// take the maximum of the second field for each key
groupedDataSet.max(1)

2. By field name

To key by field name, the data type in the DataStream must be a Tuple class or a POJO class.

val personDataSet = env.fromElements(new Person("Alex", 18), new Person("Peter", 43))
// use the field name "name" as the groupBy field
personDataSet.groupBy("name").max(1)

If the data type is a Tuple class, field names start from _1, while positional indexes start from 0.

val personDataStream = env.fromElements(("Alex", 18), ("Peter", 43))
// specify the first field by name
personDataStream.keyBy("_1")

// specify the first field by position
personDataStream.keyBy(0)

For nested, complex data structures:

class NestedClass(
    var id: Int,
    var tuples: (Long, Long, String)) {
    def this() {
        this(0, (0L, 0L, " "))
    }
}

class ComplexClass(var nested: NestedClass, var tag: String) {
    def this() {
        this(null, " ")
    }
}

By "nested" get the whole NestedClass objects for all fields, call id field "tag" get CompelexClass the tag field, calls "nested.id" get NestedClass, call the "nested.tuples._1" get NestedClass tuple in the first field Ganso

3. By key selector

Define a KeySelector and override its getKey method; here the key is obtained from the name field of the Person object.

import org.apache.flink.api.java.functions.KeySelector

case class Person(name: String, age: Int)
val person = env.fromElements(Person("hello", 1), Person("Flink", 3))
// extract the name field of the Person object as the key
val keyed: KeyedStream[Person, String] = person.keyBy(new KeySelector[Person, String]() {
    override def getKey(person: Person): String = person.name
})
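In the Scala API, the same thing is usually written more compactly with a lambda key selector:

// equivalent lambda form of the key selector above
val keyedByName: KeyedStream[Person, String] = person.keyBy(_.name)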

5. Output the results

After the data is transformed, the result is typically output to an external system or printed to the console. Besides the basic output methods, Flink also defines many connectors in the DataSink operator class; users can add a custom output operator by calling addSink() and thereby write data to an external system.

// write the results to a file
counts.writeAsText("file://path/to/savefile")
// print the results to the console
counts.print()
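A minimal sketch of a custom sink added via addSink(), as mentioned above (the invoke() body is a stand-in for writing to a real external system):

import org.apache.flink.streaming.api.functions.sink.SinkFunction

counts.addSink(new SinkFunction[(String, Int)] {
    // invoked once per record; a real implementation would write to the external system here
    override def invoke(value: (String, Int)): Unit = {
        println(s"sink received: $value")
    }
})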

Triggering the program

Once all the operators expressing the computation logic are defined, the execute() method of the ExecutionEnvironment needs to be called to trigger program execution. execute() returns a result of type JobExecutionResult, which contains the execution time of the program, accumulators, and other metrics.

Note: DataStream streaming applications must call the execute() method explicitly, otherwise the application will not run. For the DataSet API, however, output operators already contain a call to execute() internally, so it must not be called explicitly again, otherwise an exception will occur.
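A short sketch of the DataSet case, where the output operator itself triggers execution (element values invented for illustration):

// DataSet API: print() already triggers the job, no explicit execute() needed
val batchEnv = ExecutionEnvironment.getExecutionEnvironment
val batchData = batchEnv.fromElements(("a", 1), ("c", 2))
batchData.print()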

// call the execute() method of the StreamExecutionEnvironment to run the streaming application
env.execute("App Name")
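The returned JobExecutionResult can be inspected once the job finishes; a minimal sketch (the accumulator name "wordCounter" is hypothetical and would have to be registered by the program):

val result = env.execute("App Name")
// net runtime of the job in milliseconds
println(result.getNetRuntime)
// fetch a named accumulator registered by the program
val counter: java.lang.Long = result.getAccumulatorResult("wordCounter")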

Summary

This article walked through the five steps of developing a Flink application: obtaining the execution environment, initializing the data, performing transformations, specifying the partition key, and outputting the results and triggering the program, along with the implementation details of each step and the programming model behind them.
