Flink from entry to real fragrance (22. The last part of the basics, various UDF functions)

The Flink Table API and SQL ship with a batch of built-in functions for data transformation; they are among the most commonly used and important tools in day-to-day development.

Many functions are supported in SQL, and both the Table API and SQL implement them. The commonly used functions are essentially all covered, so in general you do not need to write your own.

Comparison operators familiar from SQL, such as =, <>, >, >=, <=, IS, IS NOT, BETWEEN, EXISTS, IN and so on, are basically all covered.

Logical: OR, AND, IS FALSE
Arithmetic: +, -, *, /, POWER, ABS
String: ||, UPPER, LOWER, LTRIM
Aggregate: COUNT(*), COUNT(1), AVG, SUM, MAX, MIN, RANK
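As a quick illustration of combining several of these built-ins in one query, here is a minimal sketch (assuming a table named sensor with fields id and temperature is already registered in a tableEnv, as in the examples later in this post):

val builtinDemo = tableEnv.sqlQuery(
  """
    |select id, upper(id), abs(temperature), temperature * 2
    |from sensor
    |where temperature between 0 and 100 and id is not null
    |""".stripMargin)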

The complete list is on the official website and can be used directly: https://ci.apache.org/projects/flink/flink-docs-stable/zh/dev/table/functions/systemFunctions.html

However, in some special scenarios the built-in functions may not meet the need, and we have to write our own. For this, Flink provides user-defined functions (UDFs).

User-Defined Functions (UDF)

User-defined functions (UDFs) are an important feature that significantly extends the expressive power of queries.
In most cases, a user-defined function must be registered before it can be used in a query. It is registered in the TableEnvironment by calling its registerFunction() method. When a user-defined function is registered, it is inserted into the function catalog of the TableEnvironment, so that the Table API or SQL parser can recognize and interpret it correctly.
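A minimal sketch of that registration flow (Flink 1.10/1.11-style API; MyUpper is a hypothetical function used only to illustrate the mechanics, and tableEnv plus a sensor table are assumed to exist as in the examples below):

import org.apache.flink.table.functions.ScalarFunction

// hypothetical scalar function, used only to show registration
class MyUpper extends ScalarFunction {
  def eval(s: String): String = s.toUpperCase
}

// insert it into the TableEnvironment's function catalog...
tableEnv.registerFunction("myUpper", new MyUpper)
// ...after which both SQL and the Table API can resolve it by name
val registered = tableEnv.sqlQuery("select myUpper(id) from sensor")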
Flink provides several types of user-defined functions, covered one by one below.

Scalar Functions

A scalar function takes one or more fields and returns a single value, similar to a map operation.
A user-defined scalar function maps zero, one, or more scalar values to a new scalar value.
To define a scalar function, extend the base class ScalarFunction in org.apache.flink.table.functions and implement one or more evaluation (eval) methods.
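Because eval methods are resolved by reflection rather than declared in the base class, they can be overloaded for different argument types. A minimal sketch (the ToCelsius class is a hypothetical illustration, not part of the original example):

import org.apache.flink.table.functions.ScalarFunction

// hypothetical converter, shown only to illustrate overloaded eval methods
class ToCelsius extends ScalarFunction {
  def eval(fahrenheit: Double): Double = (fahrenheit - 32) / 1.8
  def eval(fahrenheit: String): Double = eval(fahrenheit.toDouble) // same logic, string input
}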

Here is a concrete example, implemented with the Table API and with SQL respectively:

package com.mafei.udftest

import com.mafei.sinktest.SensorReadingTest5
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.EnvironmentSettings
import org.apache.flink.table.api.scala._
import org.apache.flink.table.functions.ScalarFunction
import org.apache.flink.types.Row

object ScalarFunctionTest {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1) // set parallelism to 1

    // use processing time as the stream's time characteristic
    env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)

    //    val inputStream = env.readTextFile("/opt/java2020_study/maven/flink1/src/main/resources/sensor.txt")
    val inputStream = env.readTextFile("D:\\java2020_study\\maven\\flink1\\src\\main\\resources\\sensor.txt")
    // first convert the raw text into the case class type
    val dataStream = inputStream
      .map(data => {
        val arr = data.split(",") // split each line on "," to get the fields
        SensorReadingTest5(arr(0), arr(1).toLong, arr(2).toDouble) // build a sensor reading; toLong/toDouble are needed because split yields strings
      })

    // build the environment settings (optional)
    val settings = EnvironmentSettings.newInstance()
      .useBlinkPlanner() // Flink 1.10 defaulted to useOldPlanner; 1.11 switched the default to the Blink planner
      .inStreamingMode()
      .build()

    // create the flink table environment
    val tableEnv = StreamTableEnvironment.create(env, settings)

    // convert the stream into a table
    val sensorTables = tableEnv.fromDataStream(dataStream, 'id,'timestamp, 'temperature, 'tp.proctime as 'ts)

    // to inspect the intermediate result, print the table directly
//    sensorTables.toAppendStream[Row].print("sensorTables: ")

    // call the custom UDF to run a hash over the id
    // 1. Table API implementation
    // first create an instance of the function
    val hashCode = new HashCode(1)

    val resultTable = sensorTables
      .select('id,'ts,hashCode('id))

//    resultTable.toAppendStream[Row].print("resultTable: ")
    /** Output:
     * resultTable: > sensor1,2020-12-13T13:53:57.630,1980364880
        resultTable: > sensor2,2020-12-13T13:53:57.632,1980364881
        resultTable: > sensor3,2020-12-13T13:53:57.632,1980364882
        resultTable: > sensor4,2020-12-13T13:53:57.632,1980364883
        resultTable: > sensor4,2020-12-13T13:53:57.632,1980364883
        resultTable: > sensor4,2020-12-13T13:53:57.633,1980364883
     */

    //2. SQL implementation: the UDF must be registered in the table environment first

    tableEnv.createTemporaryView("sensor",sensorTables)
    tableEnv.registerFunction("hashCode", hashCode)
    val sqlResultTable = tableEnv.sqlQuery("select id, ts, hashCode(id) from sensor")

    sqlResultTable.toRetractStream[Row].print("sqlResultTable")

    env.execute()

  }

}

// a custom scalar function
class HashCode(factor: Int) extends ScalarFunction {
  def eval(s: String): Int = {
    s.hashCode * factor - 11111
  }
}

Code structure and running effect: (screenshot)

Table Functions

If a scalar function maps one input row to one value, a table function maps one input row to a whole table: one-to-many, similar to a flatMap (explode-style) operation.

Again, implement it with both the Table API and SQL:

package com.mafei.udftest

import com.mafei.sinktest.SensorReadingTest5
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.EnvironmentSettings
import org.apache.flink.table.api.scala._
import org.apache.flink.table.functions.{ScalarFunction, TableFunction}
import org.apache.flink.types.Row

object TableFunctionTest {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1) // set parallelism to 1

    // use processing time as the stream's time characteristic
    env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)

    //    val inputStream = env.readTextFile("/opt/java2020_study/maven/flink1/src/main/resources/sensor.txt")
    val inputStream = env.readTextFile("D:\\java2020_study\\maven\\flink1\\src\\main\\resources\\sensor.txt")
    // first convert the raw text into the case class type
    val dataStream = inputStream
      .map(data => {
        val arr = data.split(",") // split each line on "," to get the fields
        SensorReadingTest5(arr(0), arr(1).toLong, arr(2).toDouble) // build a sensor reading; toLong/toDouble are needed because split yields strings
      })

    // build the environment settings (optional)
    val settings = EnvironmentSettings.newInstance()
      .useBlinkPlanner() // Flink 1.10 defaulted to useOldPlanner; 1.11 switched the default to the Blink planner
      .inStreamingMode()
      .build()

    // create the flink table environment
    val tableEnv = StreamTableEnvironment.create(env, settings)

    // convert the stream into a table
    val sensorTables = tableEnv.fromDataStream(dataStream, 'id,'timestamp, 'temperature, 'tp.proctime as 'ts)

    // to inspect the intermediate result, print the table directly
//    sensorTables.toAppendStream[Row].print("sensorTables: ")

    // call the custom UDF: instantiate it first, using "_" as the separator
    val split = new Split("_")
    val resultTable = sensorTables
      .joinLateral(split('id) as ('word, 'length)) // lateral join keyed on id; the function emits a tuple, whose fields are named word and length here
      .select('id,'ts,'word,'length)

//    resultTable.toRetractStream[Row].print("resultTable")
    /** Output:
     * resultTable> (true,sensor1,2020-12-13T14:43:01.121,sensor1,7)
        resultTable> (true,sensor2,2020-12-13T14:43:01.124,sensor2,7)
        resultTable> (true,sensor3,2020-12-13T14:43:01.125,sensor3,7)
        resultTable> (true,sensor4,2020-12-13T14:43:01.125,sensor4,7)
        resultTable> (true,sensor4,2020-12-13T14:43:01.125,sensor4,7)
        resultTable> (true,sensor4,2020-12-13T14:43:01.126,sensor4,7)

     */

    //2. SQL implementation
    tableEnv.createTemporaryView("sensor", sensorTables)
    tableEnv.registerFunction("split", split)
    val sqlResultTables = tableEnv.sqlQuery(
      """
        |select
        |id,ts,word,length
        |from sensor,lateral table( split(id)) as splitid(word,length)
        |""".stripMargin)

    sqlResultTables.toRetractStream[Row].print("sqlResultTables")

    env.execute()

  }

}

// a custom table function
// it splits the input string on the separator passed to the constructor and emits a (String, Int) tuple per piece
class Split(separator: String) extends TableFunction[(String,Int)]{

  def eval(str: String): Unit = {
    str.split(separator).foreach(
      word => collect((word, word.length))
    )
  }
}
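The example above uses an inner joinLateral, which drops input rows for which the table function emits nothing. If those rows should be kept, the Table API also offers leftOuterJoinLateral, and SQL supports LEFT JOIN LATERAL TABLE(...) ON TRUE. A minimal sketch reusing the split function (assuming the same sensorTables, tableEnv, and registered function as above):

// Table API variant: keep left rows even when split() emits no rows
val outerResult = sensorTables
  .leftOuterJoinLateral(split('id) as ('word, 'length))
  .select('id, 'ts, 'word, 'length)

// SQL counterpart; the ON TRUE is required by the syntax
val outerSql = tableEnv.sqlQuery(
  """
    |select id, ts, word, length
    |from sensor
    |left join lateral table(split(id)) as splitid(word, length) on true
    |""".stripMargin)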

Code structure and running effect: (screenshot)

Aggregate Functions

User-Defined Aggregate Functions (UDAGGs) can aggregate the data in a table into a scalar value

For example, to calculate the average temperature of each sensor, implement it with the Table API and SQL in a new AggregateFunctionTest.scala:

package com.mafei.udftest

import com.mafei.sinktest.SensorReadingTest5
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.EnvironmentSettings
import org.apache.flink.table.api.scala._
import org.apache.flink.table.functions.AggregateFunction
import org.apache.flink.types.Row

object AggregateFunctionTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1) // set parallelism to 1

    // use processing time as the stream's time characteristic
    env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)

    //    val inputStream = env.readTextFile("/opt/java2020_study/maven/flink1/src/main/resources/sensor.txt")
    val inputStream = env.readTextFile("D:\\java2020_study\\maven\\flink1\\src\\main\\resources\\sensor.txt")
    // first convert the raw text into the case class type
    val dataStream = inputStream
      .map(data => {
        val arr = data.split(",") // split each line on "," to get the fields
        SensorReadingTest5(arr(0), arr(1).toLong, arr(2).toDouble) // build a sensor reading; toLong/toDouble are needed because split yields strings
      })

    // build the environment settings (optional)
    val settings = EnvironmentSettings.newInstance()
      .useBlinkPlanner() // Flink 1.10 defaulted to useOldPlanner; 1.11 switched the default to the Blink planner
      .inStreamingMode()
      .build()

    // create the flink table environment
    val tableEnv = StreamTableEnvironment.create(env, settings)

    // convert the stream into a table
    val sensorTables = tableEnv.fromDataStream(dataStream, 'id,'timestamp, 'temperature, 'tp.proctime as 'ts)

    // Table API implementation:
    val avgTemp = new AggTemp()
    val resultTable = sensorTables
      .groupBy('id)
      .aggregate(avgTemp('temperature) as 'tempAvg)
      .select('id,'tempAvg)

    resultTable.toRetractStream[Row].print("resultTable")

    // SQL implementation

    // register the table
    tableEnv.createTemporaryView("sensor", sensorTables)

    // register the function
    tableEnv.registerFunction("avgTemp", avgTemp)
    val sqlResult = tableEnv.sqlQuery(
      """
        |select id,avgTemp(temperature) as tempAvg
        |from sensor
        |group by id
        |""".stripMargin
    )

    sqlResult.toRetractStream[Row].print("sqlResult")

    env.execute()
  }

}

// a class that stores the aggregation state; if you don't define one, the second type parameter
// of AggregateFunction would just be (Double, Int) -- the temperature total and the reading count
class AggTempAcc{
  var sum: Double = 0.0
  var count: Int = 0
}
// a custom aggregate function computing the average temperature per sensor, with (tempSum, tempCount) as its state
// the first type parameter (Double) is the final result type -- an average here, hence Double
// the second is the intermediate state type: an average needs the running total of all temperatures
// plus how many readings there were, i.e. (Double, Int)
// as noted above, passing (Double, Int) instead of AggTempAcc would work the same way
class AggTemp extends AggregateFunction[Double,AggTempAcc]{
  override def getValue(acc: AggTempAcc): Double = acc.sum / acc.count

//  override def createAccumulator(): (Double, Int) = (0.0,0)
  override def createAccumulator(): AggTempAcc = new AggTempAcc

  // we must also implement the concrete computation method, accumulate, which holds the actual aggregation logic
  def accumulate(acc: AggTempAcc, temp: Double): Unit = {
    acc.sum += temp
    acc.count += 1
  }

}
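Besides the required methods (createAccumulator, accumulate, getValue), an AggregateFunction can declare further optional methods. One of them, retract(), removes a value from the accumulator again and is required by some query shapes, such as aggregations over bounded OVER windows. A minimal sketch for the class above (not part of the original post):

// optional retract method: undoes an earlier accumulate call
def retract(acc: AggTempAcc, temp: Double): Unit = {
  acc.sum -= temp
  acc.count -= 1
}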


Table Aggregate Functions

User-Defined Table Aggregate Functions (UDTAGGs) can aggregate the data in a table into a result table with multiple rows and multiple columns.
They are implemented by extending the TableAggregateFunction abstract class; both the input and the output are tables. A typical application scenario is a top-N query (say, a top 10), where several rows must be emitted per group.

The methods that a TableAggregateFunction must implement:
---- createAccumulator()
---- accumulate()
---- emitValue()

How a TableAggregateFunction works:

  • First, it also needs an accumulator (Accumulator), the data structure that holds the intermediate results of the aggregation. An empty accumulator is created by calling the createAccumulator() method.
  • Then the function's accumulate() method is called for each input row to update the accumulator.
  • After all rows have been processed, the function's emitValue() method is called to compute and return the final result.

For example, use a table aggregate function to implement a top-N (here top 2) scenario over all sensors:

package com.mafei.udftest

import com.mafei.sinktest.SensorReadingTest5
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.EnvironmentSettings
import org.apache.flink.table.api.scala._
import org.apache.flink.table.functions.TableAggregateFunction
import org.apache.flink.types.Row
import org.apache.flink.util.Collector

object TableAggregateFunctionTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1) // set parallelism to 1

    // use processing time as the stream's time characteristic
    env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)

        val inputStream = env.readTextFile("/opt/java2020_study/maven/flink1/src/main/resources/sensor.txt")
//    val inputStream = env.readTextFile("D:\\java2020_study\\maven\\flink1\\src\\main\\resources\\sensor.txt")
    // first convert the raw text into the case class type
    val dataStream = inputStream
      .map(data => {
        val arr = data.split(",") // split each line on "," to get the fields
        SensorReadingTest5(arr(0), arr(1).toLong, arr(2).toDouble) // build a sensor reading; toLong/toDouble are needed because split yields strings
      })

    // build the environment settings (optional)
    val settings = EnvironmentSettings.newInstance()
      .useBlinkPlanner() // Flink 1.10 defaulted to useOldPlanner; 1.11 switched the default to the Blink planner
      .inStreamingMode()
      .build()

    // create the flink table environment
    val tableEnv = StreamTableEnvironment.create(env, settings)

    // convert the stream into a table
    val sensorTables = tableEnv.fromDataStream(dataStream, 'id,'timestamp, 'temperature, 'tp.proctime as 'ts)

    // 1. Table API implementation
    val top2Temp = new Top2Temp()
    val resultTable = sensorTables
      .groupBy('id)
      .flatAggregate(top2Temp('temperature) as ('temp, 'rank))
      .select('id,'temp,'rank)

//    resultTable.toAppendStream[Row].print()   // the table aggregate retracts and updates earlier results, so toAppendStream cannot be used here

    resultTable.toRetractStream[Row].print("table aggregate")
    /**
     * Output:
     * (true,sensor1,1.0,1)
        (true,sensor1,-1.7976931348623157E308,2)
        (true,sensor2,42.0,1)
        (true,sensor2,-1.7976931348623157E308,2)
        (true,sensor3,43.0,1)
        (true,sensor3,-1.7976931348623157E308,2)
        (true,sensor4,40.1,1)
        (true,sensor4,-1.7976931348623157E308,2)
        (false,sensor4,40.1,1)
        (false,sensor4,-1.7976931348623157E308,2)
        (true,sensor4,40.1,1)
        (true,sensor4,20.0,2)
        (false,sensor4,40.1,1)
        (false,sensor4,20.0,2)
        (true,sensor4,40.2,1)
        (true,sensor4,40.1,2)
     */

    env.execute("表聚合函数-取每个传感器top2")

  }

}

// define the accumulator structure that holds the two highest temperatures
class Top2TempAcc{
  var highestTemp: Double = Double.MinValue
  var secondHighestTemp: Double = Double.MinValue
}

// a custom table aggregate function that extracts the two highest of all temperature values, emitting (temp, rank)
class Top2Temp extends TableAggregateFunction[(Double,Int),Top2TempAcc]{
  override def createAccumulator(): Top2TempAcc = new Top2TempAcc()

  // implement the accumulate method that computes the aggregation
  // the first parameter is the accumulator; the second is the value being aggregated -- here only the temperature (Double) is needed
  def accumulate(acc: Top2TempAcc, temp: Double): Unit = {
    // compare the incoming temperature with the values saved in the accumulator state
    // first, check whether it beats the current maximum
    if(temp > acc.highestTemp){
      // higher than the current maximum: it takes first place and the old maximum moves down to second
      acc.secondHighestTemp = acc.highestTemp
      acc.highestTemp = temp
    }
    else if(temp > acc.secondHighestTemp){
      // lower than the maximum but higher than the second highest: replace the second highest
      acc.secondHighestTemp = temp
    }

  }

  // also implement the method that emits the result, called once all rows in the table have been processed
  def emitValue(acc: Top2TempAcc,out: Collector[(Double, Int)]): Unit ={
    out.collect((acc.highestTemp,1))
    out.collect((acc.secondHighestTemp,2))
  }
}

Contents of sensor.txt:
sensor1,1603766281,1
sensor2,1603766282,42
sensor3,1603766283,43
sensor4,1603766240,40.1
sensor4,1603766284,20
sensor4,1603766249,40.2

Code structure and running effect diagram: (screenshot)

Origin: blog.51cto.com/mapengfei/2572888