Flink's Table API and SQL ship with a set of built-in functions for transforming data. They are among the most frequently used features in day-to-day development.
Both the Table API and SQL implement these functions, and together they cover most common needs, so you rarely have to write your own.
Comparison operators, as in SQL: =, <>, >, >=, <=, is, is not, BETWEEN, EXISTS, IN, and so on
Logical: or, and, is FALSE
Arithmetic: +, -, *, /, POWER, ABS
String: ||, upper, lower, LTRIM
Aggregate: count(*), count(1), avg, sum, max, min, rank
The official documentation has the complete list, which you can consult directly: https://ci.apache.org/projects/flink/flink-docs-stable/zh/dev/table/functions/systemFunctions.html
In some special scenarios, however, the built-in functions are not enough and you have to write your own. For these cases Flink supports user-defined functions (UDFs).
User Defined Function (UDF)
User-defined functions (UDFs) are an important feature that significantly extends the expressiveness of queries.
In most cases, a user-defined function must be registered before it can be used in a query.
You register a function by calling registerFunction() on the TableEnvironment. Registration inserts the function into the TableEnvironment's function catalog, so that the Table API or SQL parser can recognize and interpret it correctly.
Flink supports several kinds of user-defined functions. This article covers scalar, table, aggregate, and table aggregate functions.
Scalar Functions
A scalar function takes one or more fields and returns a single value, similar to a map operation.
A user-defined scalar function maps zero, one, or more scalar values to a new scalar value.
To define a scalar function, extend the base class ScalarFunction in org.apache.flink.table.functions and implement one or more evaluation (eval) methods.
Example implementation, using both the Table API and SQL:
package com.mafei.udftest

import com.mafei.sinktest.SensorReadingTest5
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.EnvironmentSettings
import org.apache.flink.table.api.scala._
import org.apache.flink.table.functions.ScalarFunction
import org.apache.flink.types.Row

object ScalarFunctionTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1) // run with a parallelism of 1
    // use processing time as the stream time characteristic
    env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
    // val inputStream = env.readTextFile("/opt/java2020_study/maven/flink1/src/main/resources/sensor.txt")
    val inputStream = env.readTextFile("D:\\java2020_study\\maven\\flink1\\src\\main\\resources\\sensor.txt")

    // convert each line to the case class first
    val dataStream = inputStream
      .map(data => {
        val arr = data.split(",") // split the line on commas
        // build a sensor reading; toLong and toDouble are needed because split returns strings
        SensorReadingTest5(arr(0), arr(1).toLong, arr(2).toDouble)
      })

    // environment settings (optional)
    val settings = EnvironmentSettings.newInstance()
      .useBlinkPlanner() // Flink 1.10 defaulted to the old planner; 1.11 switched to the Blink planner
      .inStreamingMode()
      .build()

    // create the Flink table environment
    val tableEnv = StreamTableEnvironment.create(env, settings)

    // convert the stream to a table
    val sensorTables = tableEnv.fromDataStream(dataStream, 'id, 'timestamp, 'temperature, 'tp.proctime as 'ts)
    // to inspect the table, print it directly
    // sensorTables.toAppendStream[Row].print("sensorTables: ")

    // call the custom UDF to hash the id
    // 1. Table API: create an instance of the function first
    val hashCode = new HashCode(1)
    val resultTable = sensorTables
      .select('id, 'ts, hashCode('id))
    // resultTable.toAppendStream[Row].print("resultTable: ")
    /** Output:
     * resultTable: > sensor1,2020-12-13T13:53:57.630,1980364880
     * resultTable: > sensor2,2020-12-13T13:53:57.632,1980364881
     * resultTable: > sensor3,2020-12-13T13:53:57.632,1980364882
     * resultTable: > sensor4,2020-12-13T13:53:57.632,1980364883
     * resultTable: > sensor4,2020-12-13T13:53:57.632,1980364883
     * resultTable: > sensor4,2020-12-13T13:53:57.633,1980364883
     */

    // 2. SQL: the UDF must be registered in the environment first
    tableEnv.createTemporaryView("sensor", sensorTables)
    tableEnv.registerFunction("hashCode", hashCode)
    val sqlResultTable = tableEnv.sqlQuery("select id, ts, hashCode(id) from sensor")
    sqlResultTable.toRetractStream[Row].print("sqlResultTable")

    env.execute()
  }
}

// custom scalar function
class HashCode(factor: Int) extends ScalarFunction {
  def eval(s: String): Int = {
    s.hashCode * factor - 11111
  }
}
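To see what the eval method computes without starting a Flink job, the same logic can be exercised in plain Scala. This is only a sketch: HashCodeSketch and HashCodeSketchDemo are hypothetical names, the class deliberately does not extend ScalarFunction so that it runs without Flink on the classpath, and the second eval overload is an added illustration of "one or more eval methods", not part of the original example.

```scala
// Plain-Scala stand-in for the scalar UDF (hypothetical name, no Flink dependency).
class HashCodeSketch(factor: Int) {
  // One input value mapped to one output value, same formula as HashCode.eval.
  def eval(s: String): Int = s.hashCode * factor - 11111

  // Illustrative second overload: two input values mapped to one output value.
  def eval(s: String, salt: Int): Int = eval(s) + salt
}

object HashCodeSketchDemo {
  def main(args: Array[String]): Unit = {
    val f = new HashCodeSketch(1)
    // Prints 1980364880, the value the sample output lists for sensor1.
    println(f.eval("sensor1"))
    println(f.eval("sensor1", 5))
  }
}
```

Because the function is a pure mapping from input values to one output value, it can be unit-tested like this before it is registered with a TableEnvironment.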
Table Functions
A scalar function takes one row and returns one value. A table function takes one row and returns a table, that is, a one-to-many mapping, similar to an explode function.
Here it is implemented with both the Table API and SQL:
package com.mafei.udftest

import com.mafei.sinktest.SensorReadingTest5
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.EnvironmentSettings
import org.apache.flink.table.api.scala._
import org.apache.flink.table.functions.TableFunction
import org.apache.flink.types.Row

object TableFunctionTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1) // run with a parallelism of 1
    // use processing time as the stream time characteristic
    env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
    // val inputStream = env.readTextFile("/opt/java2020_study/maven/flink1/src/main/resources/sensor.txt")
    val inputStream = env.readTextFile("D:\\java2020_study\\maven\\flink1\\src\\main\\resources\\sensor.txt")

    // convert each line to the case class first
    val dataStream = inputStream
      .map(data => {
        val arr = data.split(",") // split the line on commas
        // build a sensor reading; toLong and toDouble are needed because split returns strings
        SensorReadingTest5(arr(0), arr(1).toLong, arr(2).toDouble)
      })

    // environment settings (optional)
    val settings = EnvironmentSettings.newInstance()
      .useBlinkPlanner() // Flink 1.10 defaulted to the old planner; 1.11 switched to the Blink planner
      .inStreamingMode()
      .build()

    // create the Flink table environment
    val tableEnv = StreamTableEnvironment.create(env, settings)

    // convert the stream to a table
    val sensorTables = tableEnv.fromDataStream(dataStream, 'id, 'timestamp, 'temperature, 'tp.proctime as 'ts)
    // to inspect the table, print it directly
    // sensorTables.toAppendStream[Row].print("sensorTables: ")

    // 1. Table API: instantiate the UDF with "_" as the separator
    val split = new Split("_")
    val resultTable = sensorTables
      // lateral join against the function applied to id; name the tuple fields word and length
      .joinLateral(split('id) as ('word, 'length))
      .select('id, 'ts, 'word, 'length)
    // resultTable.toRetractStream[Row].print("resultTable")
    /** Output:
     * resultTable> (true,sensor1,2020-12-13T14:43:01.121,sensor1,7)
     * resultTable> (true,sensor2,2020-12-13T14:43:01.124,sensor2,7)
     * resultTable> (true,sensor3,2020-12-13T14:43:01.125,sensor3,7)
     * resultTable> (true,sensor4,2020-12-13T14:43:01.125,sensor4,7)
     * resultTable> (true,sensor4,2020-12-13T14:43:01.125,sensor4,7)
     * resultTable> (true,sensor4,2020-12-13T14:43:01.126,sensor4,7)
     */

    // 2. SQL
    tableEnv.createTemporaryView("sensor", sensorTables)
    tableEnv.registerFunction("split", split)
    val sqlResultTables = tableEnv.sqlQuery(
      """
        |select
        |id,ts,word,length
        |from sensor,lateral table( split(id)) as splitid(word,length)
        |""".stripMargin)
    sqlResultTables.toRetractStream[Row].print("sqlResultTables")

    env.execute()
  }
}

// custom table function
// splits the input on the given separator and emits one (String, Int) tuple per piece
class Split(separator: String) extends TableFunction[(String, Int)] {
  def eval(str: String): Unit = {
    str.split(separator).foreach(
      word => collect((word, word.length))
    )
  }
}
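The one-to-many behaviour of the table function is easier to see with an input that actually contains the separator, which the sample ids do not. The following plain-Scala sketch mirrors the eval logic of Split. SplitSketch, SplitSketchDemo, and rowsFor are hypothetical names, and the emit callback is an assumed stand-in for the collect() method that TableFunction provides inside Flink.

```scala
import scala.collection.mutable.ArrayBuffer

// Plain-Scala stand-in for the table UDF (hypothetical name, no Flink dependency).
// `emit` plays the role of TableFunction.collect: it is called once per output row.
class SplitSketch(separator: String) {
  def eval(str: String, emit: ((String, Int)) => Unit): Unit =
    str.split(separator).foreach(word => emit((word, word.length)))
}

object SplitSketchDemo {
  // Applies the function to one input value and gathers every emitted row.
  def rowsFor(input: String, separator: String): List[(String, Int)] = {
    val out = ArrayBuffer.empty[(String, Int)]
    new SplitSketch(separator).eval(input, row => { out += row; () })
    out.toList
  }

  def main(args: Array[String]): Unit = {
    println(rowsFor("sensor_1", "_")) // two rows for one input value
    println(rowsFor("sensor1", "_"))  // no separator present: a single row
  }
}
```

An id such as sensor1 contains no underscore, which is why each input row in the sample output produces exactly one row with length 7.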
Aggregate Functions
User-defined aggregate functions (UDAGGs) aggregate the data in a table into a scalar value.
For example, to compute the average temperature of each sensor, implement it with both the Table API and SQL in a new file, AggregateFunctionTest.scala:
package com.mafei.udftest

import com.mafei.sinktest.SensorReadingTest5
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.EnvironmentSettings
import org.apache.flink.table.api.scala._
import org.apache.flink.table.functions.AggregateFunction
import org.apache.flink.types.Row

object AggregateFunctionTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1) // run with a parallelism of 1
    // use processing time as the stream time characteristic
    env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
    // val inputStream = env.readTextFile("/opt/java2020_study/maven/flink1/src/main/resources/sensor.txt")
    val inputStream = env.readTextFile("D:\\java2020_study\\maven\\flink1\\src\\main\\resources\\sensor.txt")

    // convert each line to the case class first
    val dataStream = inputStream
      .map(data => {
        val arr = data.split(",") // split the line on commas
        // build a sensor reading; toLong and toDouble are needed because split returns strings
        SensorReadingTest5(arr(0), arr(1).toLong, arr(2).toDouble)
      })

    // environment settings (optional)
    val settings = EnvironmentSettings.newInstance()
      .useBlinkPlanner() // Flink 1.10 defaulted to the old planner; 1.11 switched to the Blink planner
      .inStreamingMode()
      .build()

    // create the Flink table environment
    val tableEnv = StreamTableEnvironment.create(env, settings)

    // convert the stream to a table
    val sensorTables = tableEnv.fromDataStream(dataStream, 'id, 'timestamp, 'temperature, 'tp.proctime as 'ts)

    // 1. Table API
    val avgTemp = new AggTemp()
    val resultTable = sensorTables
      .groupBy('id)
      .aggregate(avgTemp('temperature) as 'tempAvg)
      .select('id, 'tempAvg)
    resultTable.toRetractStream[Row].print("resultTable")

    // 2. SQL
    // register the table
    tableEnv.createTemporaryView("sensor", sensorTables)
    // register the function
    tableEnv.registerFunction("avgTemp", avgTemp)
    val sqlResult = tableEnv.sqlQuery(
      """
        |select id,avgTemp(temperature) as tempAvg
        |from sensor
        |group by id
        |""".stripMargin
    )
    sqlResult.toRetractStream[Row].print("sqlResult")

    env.execute()
  }
}

// Class that holds the aggregation state. Without it, the second type parameter of
// AggregateFunction would be (Double, Int): the temperature sum and the number of readings.
class AggTempAcc {
  var sum: Double = 0.0
  var count: Int = 0
}

// Custom aggregate function that computes the average temperature per sensor,
// keeping (tempSum, tempCount) as state.
// The first type parameter, Double, is the final result type: an average.
// The second type parameter is the intermediate state. An average needs the sum of all
// temperatures and how many there were, so either AggTempAcc or (Double, Int) works.
class AggTemp extends AggregateFunction[Double, AggTempAcc] {
  override def getValue(acc: AggTempAcc): Double = acc.sum / acc.count

  // override def createAccumulator(): (Double, Int) = (0.0, 0)
  override def createAccumulator(): AggTempAcc = new AggTempAcc

  // accumulate holds the actual aggregation logic. It is not declared in the base class
  // (Flink looks it up by name), so there is no override keyword.
  def accumulate(acc: AggTempAcc, temp: Double): Unit = {
    acc.sum += temp
    acc.count += 1
  }
}
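The order in which the three methods of AggTemp are called can be traced in plain Scala: create an accumulator, call accumulate once per input row, then read the result with getValue. This is a sketch under stated assumptions, not Flink's runtime: AvgAccSketch, AvgSketch, and AvgSketchDemo are hypothetical names, and the loop in average only imitates what the framework does for one group.

```scala
// Accumulator: running sum and number of readings (hypothetical name).
class AvgAccSketch {
  var sum: Double = 0.0
  var count: Int = 0
}

// Plain-Scala stand-in for the aggregate UDF (no Flink dependency).
class AvgSketch {
  def createAccumulator(): AvgAccSketch = new AvgAccSketch

  // Called once per input row of a group.
  def accumulate(acc: AvgAccSketch, temp: Double): Unit = {
    acc.sum += temp
    acc.count += 1
  }

  // Called to read the current aggregate out of the accumulator.
  def getValue(acc: AvgAccSketch): Double = acc.sum / acc.count
}

object AvgSketchDemo {
  // Imitates the call sequence for a single group of readings.
  def average(temps: Seq[Double]): Double = {
    val f = new AvgSketch
    val acc = f.createAccumulator()
    temps.foreach(t => f.accumulate(acc, t))
    f.getValue(acc)
  }

  def main(args: Array[String]): Unit = {
    // The three sensor4 readings from sensor.txt.
    println(average(Seq(40.1, 20.0, 40.2)))
  }
}
```

In a streaming job the result is read after every update rather than once at the end, which is why the example prints the aggregate as a retract stream.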
Table Aggregate Functions
User-defined table aggregate functions (UDTAGGs) aggregate the data in a table into a result table with multiple rows and multiple columns.
A user-defined table aggregate function is implemented by extending the TableAggregateFunction abstract class. Both its input and its output are tables. It suits scenarios such as top 10, where multiple rows have to be output.
The methods that a TableAggregateFunction must implement:
---- createAccumulator()
---- accumulate()
---- emitValue()
How a TableAggregateFunction works:
- First, it needs an accumulator, a data structure that holds the intermediate results of the aggregation. An empty accumulator is created by calling the createAccumulator() method.
- Next, the accumulate() method of the function is called for each input row to update the accumulator.
- After all rows have been processed, the emitValue() method of the function is called to compute and return the final result.
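The three steps above can be traced in plain Scala for a top 2 aggregation. This is a sketch under stated assumptions: Top2AccSketch, Top2Sketch, and Top2SketchDemo are hypothetical names, nothing here depends on Flink, and the emit callback stands in for the Collector that Flink passes to emitValue.

```scala
import scala.collection.mutable.ArrayBuffer

// Accumulator: the two highest values seen so far (hypothetical name).
class Top2AccSketch {
  var highest: Double = Double.MinValue
  var second: Double = Double.MinValue
}

// Plain-Scala stand-in for a table aggregate UDF (no Flink dependency).
class Top2Sketch {
  // Step 1: create an empty accumulator.
  def createAccumulator(): Top2AccSketch = new Top2AccSketch

  // Step 2: update the accumulator for each input row.
  def accumulate(acc: Top2AccSketch, temp: Double): Unit = {
    if (temp > acc.highest) {
      acc.second = acc.highest
      acc.highest = temp
    } else if (temp > acc.second) {
      acc.second = temp
    }
  }

  // Step 3: emit several (value, rank) rows from one accumulator.
  def emitValue(acc: Top2AccSketch, emit: ((Double, Int)) => Unit): Unit = {
    emit((acc.highest, 1))
    emit((acc.second, 2))
  }
}

object Top2SketchDemo {
  // Imitates the call sequence for a single group of readings.
  def top2(temps: Seq[Double]): List[(Double, Int)] = {
    val f = new Top2Sketch
    val acc = f.createAccumulator()
    temps.foreach(t => f.accumulate(acc, t))
    val out = ArrayBuffer.empty[(Double, Int)]
    f.emitValue(acc, row => { out += row; () })
    out.toList
  }

  def main(args: Array[String]): Unit = {
    println(top2(Seq(40.1, 20.0, 40.2))) // List((40.2,1), (40.1,2))
  }
}
```

With only one reading in a group, the second slot still holds its initial Double.MinValue, so a row with that placeholder value is emitted for rank 2.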
For example, use a table aggregate function to implement a top-N scenario, here the two highest temperatures of each sensor:
package com.mafei.udftest

import com.mafei.sinktest.SensorReadingTest5
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.EnvironmentSettings
import org.apache.flink.table.api.scala._
import org.apache.flink.table.functions.TableAggregateFunction
import org.apache.flink.types.Row
import org.apache.flink.util.Collector

object TableAggregateFunctionTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1) // run with a parallelism of 1
    // use processing time as the stream time characteristic
    env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
    val inputStream = env.readTextFile("/opt/java2020_study/maven/flink1/src/main/resources/sensor.txt")
    // val inputStream = env.readTextFile("D:\\java2020_study\\maven\\flink1\\src\\main\\resources\\sensor.txt")

    // convert each line to the case class first
    val dataStream = inputStream
      .map(data => {
        val arr = data.split(",") // split the line on commas
        // build a sensor reading; toLong and toDouble are needed because split returns strings
        SensorReadingTest5(arr(0), arr(1).toLong, arr(2).toDouble)
      })

    // environment settings (optional)
    val settings = EnvironmentSettings.newInstance()
      .useBlinkPlanner() // Flink 1.10 defaulted to the old planner; 1.11 switched to the Blink planner
      .inStreamingMode()
      .build()

    // create the Flink table environment
    val tableEnv = StreamTableEnvironment.create(env, settings)

    // convert the stream to a table
    val sensorTables = tableEnv.fromDataStream(dataStream, 'id, 'timestamp, 'temperature, 'tp.proctime as 'ts)

    // 1. Table API
    val top2Temp = new Top2Temp()
    val resultTable = sensorTables
      .groupBy('id)
      .flatAggregate(top2Temp('temperature) as ('temp, 'rank))
      .select('id, 'temp, 'rank)
    // the table aggregation updates earlier results, so toAppendStream cannot be used
    // resultTable.toAppendStream[Row].print()
    resultTable.toRetractStream[Row].print("table aggregate")
    /**
     * Output:
     * (true,sensor1,1.0,1)
     * (true,sensor1,-1.7976931348623157E308,2)
     * (true,sensor2,42.0,1)
     * (true,sensor2,-1.7976931348623157E308,2)
     * (true,sensor3,43.0,1)
     * (true,sensor3,-1.7976931348623157E308,2)
     * (true,sensor4,40.1,1)
     * (true,sensor4,-1.7976931348623157E308,2)
     * (false,sensor4,40.1,1)
     * (false,sensor4,-1.7976931348623157E308,2)
     * (true,sensor4,40.1,1)
     * (true,sensor4,20.0,2)
     * (false,sensor4,40.1,1)
     * (false,sensor4,20.0,2)
     * (true,sensor4,40.2,1)
     * (true,sensor4,40.1,2)
     */

    env.execute("Table aggregate function: top 2 per sensor")
  }
}

// accumulator holding the two highest temperatures seen so far
class Top2TempAcc {
  var highestTemp: Double = Double.MinValue
  var secondHighestTemp: Double = Double.MinValue
}

// custom table aggregate function: extracts the two highest temperatures and outputs (temp, rank)
class Top2Temp extends TableAggregateFunction[(Double, Int), Top2TempAcc] {
  override def createAccumulator(): Top2TempAcc = new Top2TempAcc()

  // accumulate computes the aggregate.
  // The first parameter is the accumulator; the second is the value being aggregated,
  // here just the temperature (Double).
  def accumulate(acc: Top2TempAcc, temp: Double): Unit = {
    // compare the current temperature with the values kept in the state;
    // first check whether it is higher than the current maximum
    if (temp > acc.highestTemp) {
      // higher than the highest: it becomes first, and the old first moves to second
      acc.secondHighestTemp = acc.highestTemp
      acc.highestTemp = temp
    }
    else if (temp > acc.secondHighestTemp) {
      // lower than the highest but higher than the second: replace the second
      acc.secondHighestTemp = temp
    }
  }

  // emits the result rows; called once the rows of the table have been processed
  def emitValue(acc: Top2TempAcc, out: Collector[(Double, Int)]): Unit = {
    out.collect((acc.highestTemp, 1))
    out.collect((acc.secondHighestTemp, 2))
  }
}
Contents of sensor.txt:
sensor1,1603766281,1
sensor2,1603766282,42
sensor3,1603766283,43
sensor4,1603766240,40.1
sensor4,1603766284,20
sensor4,1603766249,40.2