Flink Transformation Operators

 Basic transformation operators

        map (mapping)

        filter

        flatMap (flat mapping)

 Aggregation operators

        keyBy (partition by key)

        Simple aggregation

        reduce (reduction aggregation)

Introduction to UDFs

        Function classes

        Rich function classes


       After a data source reads data in, we can use various transformation operators to convert one or more DataStreams into new DataStreams. The core of a Flink program is really its transformations, which define the business logic of the processing.

 Basic transformation operators

        map (mapping)

        map() is used to transform each element of a data stream into a new element, producing a new data stream.

        We only need to call the map() method on a DataStream to perform the transformation. The parameter passed in is an implementation of the MapFunction interface (or an equivalent anonymous function); the return value is still a DataStream, but the generic type (the element type of the stream) may change.

import org.apache.flink.api.common.functions.MapFunction
import org.apache.flink.streaming.api.scala._

case class Event(user: String, url: String, timestamp: Long)

object test {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // set the global parallelism
    env.setParallelism(1)
    // create a stream of Event case-class elements
    val stream: DataStream[Event] = env.fromElements(
      Event("张三", "01", 1000L),
      Event("李四", "04", 2000L),
      Event("王五", "01", 6000L),
      Event("赵六", "03", 1000L)
    )

    // transformation requirement: extract the user name
    // approach 1: anonymous function
    stream.map(_.user).print("approach 1")
    // approach 2: implement the MapFunction interface
    stream.map(new UserEX).print("approach 2")

    // execute the job
    env.execute()
  }

  // class implementing the MapFunction interface
  class UserEX extends MapFunction[Event, String] {
    override def map(t: Event): String = t.user
  }
}

        filter

        The filter() transformation, as the name implies, filters the data stream: a Boolean condition is evaluated for each element, and elements for which it returns true are passed downstream, while elements for which it returns false are filtered out.

        The data type of the stream produced by filter() is the same as that of the original stream. The parameter passed to filter() must implement the FilterFunction interface, whose filter() method acts as the conditional expression returning a Boolean.

import org.apache.flink.api.common.functions.FilterFunction
import org.apache.flink.streaming.api.scala._

case class Event(user: String, url: String, timestamp: Long)

object f5 {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // set the global parallelism
    env.setParallelism(1)
    // create a stream of Event case-class elements
    val stream: DataStream[Event] = env.fromElements(
      Event("张三", "01", 1000L),
      Event("李四", "04", 2000L),
      Event("王五", "01", 6000L),
      Event("赵六", "03", 1000L)
    )

    // filter out the click events of user 王五
    // approach 1: anonymous function
    stream.filter(_.user == "王五").print("approach 1")
    // approach 2: implement the FilterFunction interface
    stream.filter(new UserF).print("approach 2")

    // execute the job
    env.execute()
  }

  // class implementing the FilterFunction interface
  class UserF extends FilterFunction[Event] {
    override def filter(t: Event): Boolean = t.user == "王五"
  }
}

        flatMap (flat mapping)

        The flatMap() operation, also called flat mapping, splits whole elements (often of a collection type) in a data stream into individual elements. Consuming one element can produce zero, one, or more output elements. flatMap() can be thought of as the combination of a "flatten" step and a "map" step: the data is first broken apart according to some rule, and the resulting pieces are then transformed.

        Like map(), flatMap() can take either a lambda expression or an implementation of the FlatMapFunction interface. The output type depends on the logic of the function that is passed in; it can be the same as the input type or different.

import org.apache.flink.api.common.functions.FlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

case class Event(user: String, url: String, timestamp: Long)

object f5 {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // set the global parallelism
    env.setParallelism(1)
    // create a stream of Event case-class elements
    val stream: DataStream[Event] = env.fromElements(
      Event("张三", "01", 1000L),
      Event("李四", "04", 2000L),
      Event("王五", "01", 6000L),
      Event("赵六", "03", 1000L)
    )

    // test a flexible output form: if the click comes from user 李四, output 李四
    stream.flatMap(new UserF).print()

    // execute the job
    env.execute()
  }

  // class implementing the FlatMapFunction interface
  class UserF extends FlatMapFunction[Event, String] {
    override def flatMap(t: Event, collector: Collector[String]): Unit = {
      // if the current event is a click by 李四, emit the user name; otherwise emit nothing
      if (t.user == "李四") {
        collector.collect(t.user)
      }
    }
  }
}
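
The example above emits either zero or one element per input. To illustrate the "zero or more" behaviour described earlier, here is a minimal sketch (an illustrative addition, not part of the original example, reusing the stream and imports defined above): it emits both the user and the url for 张三's clicks, only the user for 李四's, and nothing for anyone else.

    // illustrative sketch: flatMap emitting a varying number of elements per input
    stream.flatMap(new FlatMapFunction[Event, String] {
      override def flatMap(t: Event, out: Collector[String]): Unit = {
        if (t.user == "张三") {
          // two output elements for 张三: the user and the url
          out.collect(t.user)
          out.collect(t.url)
        } else if (t.user == "李四") {
          // one output element for 李四
          out.collect(t.user)
        }
        // zero output elements for everyone else
      }
    }).print("flexible")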

 Aggregation operators

        keyBy (partition by key)

        In Flink, a DataStream has no API for direct aggregation. To aggregate massive amounts of data we must partition it and process the partitions in parallel, which improves efficiency. So in Flink, aggregation requires partitioning first, and that is what keyBy() does.

        keyBy() is therefore the operator that must be used before any aggregation. By specifying a key, keyBy() logically divides a stream into different partitions. These partitions are simply the parallel subtasks, each of which runs in a task slot.

        Based on the key, the records in the stream are distributed to different partitions; all records with the same key are sent to the same partition, so the subsequent operator processes them in the same parallel subtask.

import org.apache.flink.api.java.functions.KeySelector
import org.apache.flink.streaming.api.scala._

case class Event(user: String, url: String, timestamp: Long)

object f6 {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // set the global parallelism
    env.setParallelism(1)
    // create a stream of Event case-class elements
    val stream: DataStream[Event] = env.fromElements(
      Event("张三", "01", 1000L),
      Event("李四", "04", 2000L),
      Event("王五", "01", 6000L),
      Event("赵六", "03", 1000L)
    )
    // partition by key
    // approach 1: implement the KeySelector interface
    stream.keyBy(new MyKeyb).print()
    // approach 2: anonymous function
    stream.keyBy(k => k.user)

    // execute the job
    env.execute()
  }

  // class implementing the KeySelector interface
  class MyKeyb extends KeySelector[Event, String] {
    override def getKey(in: Event): String = in.user
  }
}

        Simple aggregation

        With a KeyedStream partitioned by key, we can perform aggregation operations on it. Flink provides some of the most basic and simplest aggregation APIs, mainly the following:

  • sum(): sums up the specified field over the input stream.
  • min(): computes the minimum value of the specified field over the input stream.
  • max(): computes the maximum value of the specified field over the input stream.
  • minBy(): like min(), computes the minimum of the specified field over the input stream. The difference is that min() only computes the minimum of the specified field and keeps the other fields from the first record it saw, while minBy() returns the entire record that contains the minimum value of that field (see the sketch after the example below).
  • maxBy(): like max(), computes the maximum of the specified field over the input stream. The difference between max() and maxBy() is exactly the same as between min() and minBy().
import org.apache.flink.api.java.functions.KeySelector
import org.apache.flink.streaming.api.scala._

case class Event(user: String, url: String, timestamp: Long)

object f6 {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // set the global parallelism
    env.setParallelism(1)
    // create a stream of Event case-class elements
    val stream: DataStream[Event] = env.fromElements(
      Event("张三", "01", 1000L),
      Event("张三", "01", 6000L),
      Event("李四", "04", 2000L),
      Event("李四", "05", 6000L),
      Event("王五", "01", 1000L),
      Event("赵六", "03", 1000L)
    )
    // group by the user field and compute the largest timestamp per user
    stream.keyBy(new MyKeyb).maxBy("timestamp").print()

    // execute the job
    env.execute()
  }

  // class implementing the KeySelector interface
  class MyKeyb extends KeySelector[Event, String] {
    override def getKey(in: Event): String = in.user
  }
}
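
To make the min()/minBy() distinction from the list above concrete, here is a small sketch (an illustrative addition using the same Event case class as the examples above). With two records for the same user, min("timestamp") keeps the url of the first record and only replaces the timestamp with the minimum, while minBy("timestamp") returns the whole record that actually contains the minimum timestamp.

import org.apache.flink.streaming.api.scala._

object MinVsMinBy {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val stream: DataStream[Event] = env.fromElements(
      Event("张三", "01", 6000L),
      Event("张三", "02", 1000L)
    )

    // min(): after the second record the output is Event(张三, 01, 1000)
    //        -- url "01" comes from the first record, only the timestamp field is aggregated
    stream.keyBy(_.user).min("timestamp").print("min")

    // minBy(): after the second record the output is Event(张三, 02, 1000)
    //          -- the complete record holding the minimum timestamp
    stream.keyBy(_.user).minBy("timestamp").print("minBy")

    env.execute()
  }
}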

        reduce (reduction aggregation)

Similar to the simple aggregations, the reduce() operation also converts a KeyedStream back to a DataStream. It does not change the element type of the stream, so the output type is the same as the input type.

         When calling the reduce() method of a KeyedStream, we need to pass in a parameter that implements the ReduceFunction interface. The interface declares a single reduce() method that merges two elements of the same type into one; sketched in Scala (the actual definition is a Java interface), it looks roughly like this:
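
// a Scala-style sketch of Flink's ReduceFunction (the actual definition is a
// Java interface in org.apache.flink.api.common.functions): reduce() combines
// two values of type T into one value of the same type, so the type never changes
trait ReduceFunction[T] extends Serializable {
  @throws[Exception]
  def reduce(value1: T, value2: T): T
}

The example below implements this interface to count each user's clicks and then pick out the currently most active user: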

import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.streaming.api.scala._

case class Event(user: String, url: String, timestamp: Long)

object f6 {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // set the global parallelism
    env.setParallelism(1)
    // create a stream of Event case-class elements
    val stream: DataStream[Event] = env.fromElements(
      Event("张三", "01", 1000L),
      Event("张三", "01", 6000L),
      Event("李四", "04", 2000L),
      Event("李四", "05", 6000L),
      Event("王五", "01", 1000L),
      Event("赵六", "03", 1000L)
    )
    // reduction aggregation: count each user's clicks and extract the currently most active user
    stream.map(d => (d.user, 1L))
      .keyBy(_._1)
      .reduce(new MyS)                             // count the clicks of each user
      .keyBy(d => true)                            // send all records to one group via a constant key
      .reduce((a, b) => if (b._2 > a._2) b else a) // keep the currently most active user
      .print()                                     // print the result

    // execute the job
    env.execute()
  }

  // class implementing the ReduceFunction interface
  class MyS extends ReduceFunction[(String, Long)] {
    override def reduce(t: (String, Long), t1: (String, Long)): (String, Long) = (t._1, t._2 + t1._2)
  }
}

Introduction to UDFs

        Function classes

For most operations, we need to pass in a user-defined function (UDF) that implements the interface of the corresponding operation to define the processing logic. Flink exposes all of these UDF interfaces, implemented as interfaces or abstract classes, such as MapFunction, FilterFunction, and ReduceFunction. The simplest and most direct way to use them is to write a custom function class that implements the corresponding interface.

The following example implements the FilterFunction interface to keep the events whose url contains "01":

import org.apache.flink.api.common.functions.FilterFunction
import org.apache.flink.streaming.api.scala._

case class Event(user: String, url: String, timestamp: Long)

object f7 {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // set the global parallelism
    env.setParallelism(1)
    // create a stream of Event case-class elements
    val stream: DataStream[Event] = env.fromElements(
      Event("张三", "01", 1000L),
      Event("张三", "01", 6000L),
      Event("李四", "04", 2000L),
      Event("李四", "05", 6000L),
      Event("王五", "01", 1000L),
      Event("赵六", "03", 1000L)
    )

    // test custom UDF function classes
    // 1. use a custom function class to select events whose url contains the keyword "01"
    stream.filter(new MyFFun).print()

    // 2. use an anonymous class to select events whose url contains the keyword "04"
    stream.filter(new FilterFunction[Event] {
      override def filter(t: Event): Boolean = t.url.contains("04")
    }).print()

    // 3. use an anonymous function to select events whose url contains the keyword "05"
    stream.filter(d => d.url.contains("05"))

    // execute the job
    env.execute()
  }

  // class implementing the FilterFunction interface
  class MyFFun extends FilterFunction[Event] {
    override def filter(t: Event): Boolean = t.url.contains("01")
  }
}

        Rich function classes

        "Rich function classes" are another kind of function interface provided by the DataStream API, and every Flink function class has a Rich version. Rich function classes generally appear in the form of abstract classes, for example RichMapFunction, RichFilterFunction, RichReduceFunction, and so on.

        The main difference from regular function classes is that rich function classes can access the context of the runtime environment and have lifecycle methods, so they can implement more complex functionality. The typical lifecycle methods are:

  • open() is the initialization method of a rich function; it opens the lifecycle of the operator. It is called before the operator's actual working methods, such as map() or filter(), are invoked. Therefore one-off work such as creating file IO streams, creating database connections, or reading configuration files is best done in open().
  • close() is the last method called in the lifecycle, similar to a destructor, and is generally used for cleanup work.
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._

case class Event(user: String, url: String, timestamp: Long)

object f8 {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // set the global parallelism
    env.setParallelism(1)
    // create a stream of Event case-class elements
    val stream: DataStream[Event] = env.fromElements(
      Event("张三", "01", 1000L),
      Event("张三", "01", 6000L),
      Event("李四", "04", 2000L),
      Event("李四", "05", 6000L),
      Event("王五", "01", 1000L),
      Event("赵六", "03", 1000L)
    )

    // define a custom RichMapFunction to test the features of rich function classes
    stream.map(new MyRichMap).print()

    // execute the job
    env.execute()
  }

  // class extending the RichMapFunction abstract class
  class MyRichMap extends RichMapFunction[Event, Long] {

    override def open(parameters: Configuration): Unit = {
      println("task with subtask index " + getRuntimeContext.getIndexOfThisSubtask + " starts")
    }

    override def map(in: Event): Long = in.timestamp // output the timestamp as a Long

    override def close(): Unit = {
      println("task with subtask index " + getRuntimeContext.getIndexOfThisSubtask + " ends")
    }
  }
}

         Typical use case

        A common scenario for rich function classes is managing a connection to an external system: the connection is opened once in open(), used inside the working method, and closed in close(), rather than being created for every record. The template below outlines this pattern:

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

class MyFlatMap[IN, OUT] extends RichFlatMapFunction[IN, OUT] {

  override def open(parameters: Configuration): Unit = {
    // do some initialization work,
    // for example establish a connection to a MySQL database
  }

  override def flatMap(value: IN, out: Collector[OUT]): Unit = {
    // read from / write to the database
  }

  override def close(): Unit = {
    // cleanup work, e.g. close the connection to the MySQL database
  }
}
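
As a concrete but hypothetical illustration of the template above, the sketch below wires the lifecycle methods to a JDBC connection using the standard java.sql API; the class name, connection URL, user, and password are placeholders, the Event case class and imports are the same as in the earlier examples, and the flatMap() body only hints at where the database access would go.

class ClickToMySql extends RichFlatMapFunction[Event, String] {

  // the connection is created once per parallel subtask, not once per record
  var conn: java.sql.Connection = _

  override def open(parameters: Configuration): Unit = {
    // placeholder URL and credentials -- adjust to the real database
    conn = java.sql.DriverManager.getConnection(
      "jdbc:mysql://localhost:3306/test", "user", "password")
  }

  override def flatMap(value: Event, out: Collector[String]): Unit = {
    // read from / write to the database through conn, then emit results downstream
    out.collect(value.user)
  }

  override def close(): Unit = {
    // close the connection when the subtask shuts down
    if (conn != null) conn.close()
  }
}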

 
