40,000 Characters! One Article Is All You Need to Master Flink Table

Tools and software versions: IDEA as the IDE, Flink 1.10.2, Kafka 2.0.0, Scala 2.11

This article is best suited to readers who already have some Flink fundamentals.

  • Apache Flink introduction, architecture, principles, and implementation: covered in a companion article (linked from the original post)


1 Dependencies needed to create the Flink Table execution environment

	<!-- Adjust the version numbers to match your own environment -->
	<dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-scala_2.11</artifactId>
      <version>1.10.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-streaming-scala_2.11</artifactId>
      <version>1.10.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-streaming-java_2.11</artifactId>
      <version>1.10.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-statebackend-rocksdb_2.11</artifactId>
      <version>1.10.2</version>
    </dependency>
    <dependency>
		<groupId>org.apache.flink</groupId>
		<artifactId>flink-table-planner_2.11</artifactId>
		<version>1.10.2</version>
	</dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-table-planner-blink_2.11</artifactId>
      <version>1.10.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-table-api-scala-bridge_2.11</artifactId>
      <version>1.10.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-csv</artifactId>
      <version>1.10.2</version>
    </dependency>

2 Ways to create a Flink Table execution environment

	// Create the Flink streaming environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
	/**
	 * The following ways of creating a table environment give essentially the same result,
	 * but on newer versions the Blink planner alone is recommended.
	 */

	// 1.0 When you are just getting started, simply create the environment from env
	val tableEnv = StreamTableEnvironment.create(env)

	// 1.1 Streaming mode based on the old planner
	val settings = EnvironmentSettings.newInstance()
		.useOldPlanner()
		.inStreamingMode()
		.build()
	val oldStreamTableEnv = StreamTableEnvironment.create(env, settings)

	// 1.2 Batch mode based on the old planner
	val batchEnv = ExecutionEnvironment.getExecutionEnvironment
	val oldBatchTableEnv = BatchTableEnvironment.create(batchEnv)

	// 1.3 Streaming mode based on the Blink planner
	val blinkStreamSettings = EnvironmentSettings.newInstance()
		.useBlinkPlanner()
		.inStreamingMode()
		.build()
	val blinkStreamTableEnv = StreamTableEnvironment.create(env, blinkStreamSettings)

	// 1.4 Batch mode based on the Blink planner
	val blinkBatchSettings = EnvironmentSettings.newInstance()
		.useBlinkPlanner()
		.inBatchMode()
		.build()
	val blinkBatchTableEnv = TableEnvironment.create(blinkBatchSettings)

3 The concept of a table (Table)

  • A TableEnvironment can register catalogs (Catalog), and tables can be registered within a catalog
  • A table (Table) is specified by an "identifier" (identifier) made up of three parts: the catalog name, the database (database) name, and the object name
  • Tables can be regular tables or virtual tables (views, View)
  • A regular table (Table) typically describes external data, such as a file, a database table, or data from a message queue; it can also be converted directly from a DataStream
  • A view (View) can be created from an existing table, usually as the result set of a Table API or SQL query
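
How an identifier is resolved can be illustrated with a small sketch (assuming the default in-memory catalog; default_catalog and default_database are Flink's default names):

	// Switch the current catalog and database; unqualified identifiers are then resolved against them
	tableEnv.useCatalog("default_catalog")
	tableEnv.useDatabase("default_database")
	// "MyTable" now refers to default_catalog.default_database.MyTable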

The standard pattern for creating a table

  • A TableEnvironment can call the .connect() method to connect to an external system, and then call the .createTemporaryTable() method to register the table in the catalog
	tableEnv
		.connect(...) // Define where the table's data comes from and connect to the external system
		.withFormat(...) // Define the data format
		.withSchema(...) // Define the table schema
		.createTemporaryTable("MyTable") // Create the temporary table

4 Update modes

  • For streaming queries, you need to declare how to convert between the table and the external connector

  • The kind of messages exchanged with the external system is specified by the update mode (Update Mode)

Writing a Table to a local file only supports Append mode; for retract or upsert mode, you have to connect to an external system that supports them

1. Append mode

  • The table is changed only by insert operations, and only insert (insert) messages are exchanged with the external connector

2. Retract mode

  • The table exchanges add (Add) and retract (Retract) messages with the external connector

  • An insert (Insert) is encoded as an Add message; a delete (Delete) as a Retract message; an update (Update) as a Retract message for the previous row followed by an Add message for the new row

3. Upsert mode

  • Updates and inserts are both encoded as Upsert messages; deletes are encoded as Delete messages
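
When a table is registered through a connector descriptor, the update mode can be declared on the descriptor; a minimal sketch (the Elasticsearch sink later in this article uses .inUpsertMode() in exactly this way):

	tableEnv
		.connect(...) // external system, e.g. Kafka or Elasticsearch
		.withFormat(...)
		.withSchema(...)
		.inAppendMode() // or .inRetractMode() / .inUpsertMode(), if the connector supports it
		.createTemporaryTable("MyTable")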

5 Converting a Table into a DataStream

  • A table can be converted into a DataStream or DataSet, so that custom stream or batch programs can continue working with the result of a Table API or SQL query
  • When converting a table into a DataStream or DataSet, you need to specify the target data type, i.e. the type each row of the table is converted into
  • As the result of a streaming query, the table is updated dynamically
  • There are two conversion modes: append (Append) mode and retract (Retract) mode

Example: append mode (Append Mode)

  • Used when the table is changed only by insert (Insert) operations
val resultStream: DataStream[Row] = tableEnv.toAppendStream[Row](resultTable)

Example: retract mode (Retract Mode)

  • Can be used in any scenario. It is somewhat similar to the Retract update mode and only involves Insert and Delete operations.
  • Each record gains an extra Boolean flag (the first field of the result) that indicates whether the row is newly added data (Insert, true) or retracted/deleted data (Delete, false)
val aggResultStream: DataStream[(Boolean, (String, Long))] = tableEnv.toRetractStream[(String, Long)](aggResultTable)

6 Viewing the execution plan

  • The Table API provides a mechanism to explain the logic of computing a table and the optimized query plan
  • The execution plan can be obtained with the TableEnvironment.explain(table) method or TableEnvironment.explain(), which returns a string describing three plans:
    ➢ the unoptimized logical query plan
    ➢ the optimized logical query plan
    ➢ the physical execution plan
// Generate the execution plan
val explanation: String = tableEnv.explain(resultTable)
// Print the execution plan
println(explanation)

7 Differences between stream processing and relational algebra

(figure omitted)

8 Dynamic tables (Dynamic Tables)

  • Dynamic tables are the core concept behind Flink's Table API and SQL support for streaming data
  • Unlike the static tables that represent batch data, dynamic tables change over time
    ➢ Continuous queries (Continuous Query)
  • A dynamic table can be queried just like a static batch table; querying a dynamic table produces a continuous query (Continuous Query)
  • A continuous query never terminates and produces another dynamic table
  • The query continuously updates its dynamic result table to reflect the changes on its dynamic input table

1. Dynamic tables and continuous queries

(figure omitted)

  • Processing a streaming table query works as follows:
    1. The stream is converted into a dynamic table
    2. A continuous query is evaluated on the dynamic table, producing a new dynamic table
    3. The resulting dynamic table is converted back into a stream
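
A compact sketch of these three steps, using the WaterSensor stream from the later examples (the field names here are illustrative):

	// 1. The stream becomes a dynamic table
	val sensorTable: Table = tableEnv.fromDataStream(dataStream, 'id, 'vc)
	// 2. A continuous query on the dynamic table produces a new, continuously updated dynamic table
	val counts: Table = sensorTable.groupBy('id).select('id, 'id.count as 'cnt)
	// 3. The result table is converted back into a stream (a retract stream, since the result keeps updating)
	counts.toRetractStream[(String, Long)].print()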

2. Converting a stream into a dynamic table

(figure omitted)

  • To run a relational query over a stream, the stream must first be converted into a table
  • Conceptually, every record of the stream is interpreted as an insert (Insert) modification on the result table

3. Continuous queries

(figure omitted)

  • A continuous query processes the dynamic table and produces a new dynamic table as its result

4. Converting a dynamic table into a DataStream

(figure omitted)

  • Like a regular database table, a dynamic table is continuously modified by insert (Insert), update (Update), and delete (Delete) changes
  • When a dynamic table is converted into a stream or written to an external system, these changes need to be encoded

a. Append-only stream

  • A dynamic table that is modified only by insert (Insert) changes can be converted directly into an append-only stream

b. Retract stream

  • A retract stream contains two kinds of messages: add (Add) messages and retract (Retract) messages

c. Upsert stream

  • An upsert stream also contains two kinds of messages: Upsert messages and delete (Delete) messages.

9 Time attributes (Time Attributes)

  • Time-based operations (such as window operations in the Table API and SQL) need information about the time semantics and the source of the time values
  • A Table can provide a logical time field that is used to indicate time and to access the corresponding timestamps in table programs
  • A time attribute can be part of every table schema. Once defined, it can be referenced as a field and used in time-based operations
  • A time attribute behaves like a regular timestamp: it can be accessed and used in calculations

Defining processing time (Processing Time)

  • With processing-time semantics, the table program produces results based on the machine's local time. It is the simplest notion of time: it requires neither timestamp extraction nor watermark generation
  • When defining the schema, a processing-time field can be declared with proctime and a field name
  • The proctime attribute can only extend the physical schema with an additional logical field, so it must be defined at the end of the schema
val sensorTable = tableEnv.fromDataStream(dataStream, 'id, 'temperature, 'timestamp, 'pt.proctime)
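
Processing time can also be declared in a connector Schema or in DDL; a brief, hedged sketch following the descriptor and DDL styles used later in this article:

	// In a Schema descriptor: mark an appended field as processing time
	new Schema()
	  .field("id", DataTypes.STRING())
	  .field("vc", DataTypes.DOUBLE())
	  .field("pt", DataTypes.TIMESTAMP(3)).proctime()

In DDL (Blink planner), a computed column can use the PROCTIME() function, e.g. pt AS PROCTIME().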

Defining event time (Event Time)

  • With event-time semantics, the table program produces results based on the time contained in each record, which gives correct results even with out-of-order or late events.
  • To handle out-of-order events and to distinguish on-time from late events in the stream, Flink needs to extract timestamps from the event data and use them to advance event time
  • There are likewise three ways to define event time, demonstrated by the examples below

Example: specifying event time when converting a DataStream into a table

  • Test data
ws_001,1577844001,24.0
ws_001,1577844002,23.0
ws_001,1577844003,23.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
ws_002,1577844025,43.0
ws_002,1577844045,43.0
ws_001,1577844090,23.0
ws_002,1577844055,43.0
ws_003,1577844060,32.0
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{DataTypes, EnvironmentSettings, Table, Tumble}
import org.apache.flink.table.api.scala._
import org.apache.flink.table.descriptors.{Csv, FileSystem, Rowtime, Schema}
import org.apache.flink.table.types.DataType
import org.apache.flink.types.Row

// Custom data type
case class WaterSensor(id: String, ts: Long, vc: Double)

object Def_Time {

  def main(args: Array[String]): Unit = {

    // Create the stream execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    // Use event-time semantics
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val settings: EnvironmentSettings = EnvironmentSettings.newInstance()
      .useBlinkPlanner()
      .inStreamingMode()
      .build()
    // Create the table environment
    val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env,settings)

    //******************** Specify the time attribute when converting the DataStream into a table ********************
    // Read the local data file
    val datas: DataStream[String] = env.readTextFile("in/StringToClass.txt")
    // Convert to the custom type and assign watermarks with a 1-second bounded delay
    val dataStream: DataStream[WaterSensor] = datas.filter(data => {
      data.split(",").length == 3
    }).map(data => {
      val strings: Array[String] = data.split(",")
      WaterSensor(strings(0), strings(1).toLong, strings(2).toDouble)
    }).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
      override def extractTimestamp(t: WaterSensor): Long = t.ts * 1000L
    })
    
    // Convert the stream into a table, marking ts as the event-time (rowtime) field
    val table_DataStream: Table = tableEnv.fromDataStream(dataStream, 'id, 'ts.rowtime, 'vc)

    // Print the table schema
    table_DataStream.printSchema()

    // Print the result
    table_DataStream.toAppendStream[Row].print()

    // Execute
    env.execute()
  }
}

Example: specifying event time in the Table Schema

  • Test data
ws_001,1577844001,24.0
ws_001,1577844002,23.0
ws_001,1577844003,23.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
ws_002,1577844025,43.0
ws_002,1577844045,43.0
ws_001,1577844090,23.0
ws_002,1577844055,43.0
ws_003,1577844060,32.0
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{DataTypes, EnvironmentSettings, Table, Tumble}
import org.apache.flink.table.api.scala._
import org.apache.flink.table.descriptors.{Csv, FileSystem, Rowtime, Schema}
import org.apache.flink.table.types.DataType
import org.apache.flink.types.Row

// Custom data type
case class WaterSensor(id: String, ts: Long, vc: Double)

object Def_Time {

  def main(args: Array[String]): Unit = {

    // Create the stream execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    // Use event-time semantics
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val settings: EnvironmentSettings = EnvironmentSettings.newInstance()
      .useBlinkPlanner()
      .inStreamingMode()
      .build()
    // Create the table environment
    val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env,settings)
    
    //******************************** Specify the time attribute in the Table Schema ********************************
    tableEnv.connect(new FileSystem().path("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt"))
        .withFormat(new Csv)
        .withSchema(new Schema()
          .field("id", DataTypes.STRING())
          .field("ts", DataTypes.BIGINT())
          .rowtime(
            new Rowtime()
              // The field that supplies the timestamps
              .timestampsFromField("ts")
              // Bounded out-of-orderness of 1000 milliseconds
              .watermarksPeriodicBounded(1000)
          )
          .field("vc", DataTypes.DOUBLE())
        ).createTemporaryTable("timewindow_to_schema")

    val table_Schema: Table = tableEnv.from("timewindow_to_schema")
    
    // Print the table schema
    table_Schema.printSchema()

    // Print the result
    table_Schema.toAppendStream[Row].print()

    // Execute
    env.execute()
  }
}

Example: defining event time in the table DDL

  • Test data
ws_001,1577844001,24.0
ws_001,1577844002,23.0
ws_001,1577844003,23.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
ws_002,1577844025,43.0
ws_002,1577844045,43.0
ws_001,1577844090,23.0
ws_002,1577844055,43.0
ws_003,1577844060,32.0
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{DataTypes, EnvironmentSettings, Table, Tumble}
import org.apache.flink.table.api.scala._
import org.apache.flink.table.descriptors.{Csv, FileSystem, Rowtime, Schema}
import org.apache.flink.table.types.DataType
import org.apache.flink.types.Row

// Custom data type
case class WaterSensor(id: String, ts: Long, vc: Double)

object Def_Time {

  def main(args: Array[String]): Unit = {

    // Create the stream execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    // Use event-time semantics
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val settings: EnvironmentSettings = EnvironmentSettings.newInstance()
      .useBlinkPlanner()
      .inStreamingMode()
      .build()
    // Create the table environment
    val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env,settings)
    
    //******************************** Define the time attribute in the table DDL ********************************
    val tableDDL:String=
      """
        |create table table_DDL(
        | id varchar(20) not null,
        | ts bigint,
        | vc double,
        | rt AS TO_TIMESTAMP(FROM_UNIXTIME(ts)),
        | watermark for rt as rt - interval '1' second
        | ) with (
        | 'connector.type' = 'filesystem',
        | 'connector.path' = 'D:\ideaProject\FlinkDemo\in\StringToClass.txt',
        | 'format.type' = 'csv'
        | )
      """.stripMargin
	
	// Register the table by executing the DDL
    tableEnv.sqlUpdate(tableDDL)
    
    val table_DDL: Table = tableEnv.from("table_DDL")
    
    // Print the table schema
    table_DDL.printSchema()

    // Print the result
    table_DDL.toAppendStream[Row].print()

    // Execute
    env.execute()
  }
}

10 Windows

  • Time semantics only become useful in combination with window operations
  • In the Table API and SQL there are two main kinds of windows

1. Group Windows

  • Group windows aggregate rows into finite groups based on time or row-count intervals and evaluate an aggregate function once per group

  • Group windows are defined with the window(w: GroupWindow) clause and must be given an alias with an as clause.

  • To group the table by window, the window alias must be referenced in the groupBy clause, just like a regular grouping field

val table = input
.window([w:GroupWindow] as 'w) // Define the window with alias w
.groupBy('w, 'a) // Group by field a and window w
.select('a, 'b.sum) // Aggregate
  • The Table API provides a set of predefined Window classes with specific semantics, which are translated into window operations on the underlying DataStream or DataSet

Tumbling windows

Tumbling windows are defined with the Tumble class

  • Tumbling window over event time
// Tumbling Event-time Window: a tumbling window over event time
.window(Tumble over 10.minutes on 'rowtime as 'w)
  • Tumbling window over processing time
// Tumbling Processing-time Window: a tumbling window over processing time
.window(Tumble over 10.minutes on 'proctime as 'w)
  • Tumbling window over a row count
// Tumbling Row-count Window: a tumbling window over a row count
.window(Tumble over 10.rows on 'proctime as 'w)

Sliding windows

Sliding windows are defined with the Slide class

  • Sliding window over event time
// Sliding Event-time Window: a sliding window over event time
.window(Slide over 10.minutes every 5.minutes on 'rowtime as 'w)
  • Sliding window over processing time
// Sliding Processing-time Window: a sliding window over processing time
.window(Slide over 10.minutes every 5.minutes on 'proctime as 'w)
  • Sliding window over a row count
// Sliding Row-count Window: a sliding window over a row count
.window(Slide over 10.rows every 5.rows on 'proctime as 'w)

Session windows

Session windows are defined with the Session class

  • Session window over event time
// Session Event-time Window: a session window over event time
.window(Session withGap 10.minutes on 'rowtime as 'w)
  • Session window over processing time
// Session Processing-time Window: a session window over processing time
.window(Session withGap 10.minutes on 'proctime as 'w)

Example: tumbling window

  • Test data
ws_001,1577844001,24.0
ws_001,1577844002,23.0
ws_001,1577844003,23.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
ws_002,1577844025,43.0
ws_002,1577844045,43.0
ws_001,1577844090,23.0
ws_002,1577844055,43.0
ws_003,1577844060,32.0
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{EnvironmentSettings, Over, Table, Tumble}
import org.apache.flink.table.api.scala._
import org.apache.flink.types.Row

// Custom data type
case class WaterSensor(id: String, ts: Long, vc: Double)
object TimeAndWindow {

  def main(args: Array[String]): Unit = {
      val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

      env.setParallelism(1)

      env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

      val settings: EnvironmentSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()

      val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env,settings)

      val datas: DataStream[String] = env.readTextFile("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt")

      val dataStream: DataStream[WaterSensor] = datas.filter(x => {
        x.split(",").length == 3
      }).map(data => {
        val strings: Array[String] = data.split(",")
        WaterSensor(strings(0), strings(1).toLong, strings(2).toDouble)
      }).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
        override def extractTimestamp(t: WaterSensor): Long = t.ts * 1000L
      })

      val table: Table = tableEnv.fromDataStream(dataStream,'id,'ts.rowtime,'vc)

      //********************************* Group window ****************************************************
      val table_gw: Table = table
        .window(Tumble over 10.seconds on 'ts as 'tw) // Define the tumbling window
        .groupBy('id, 'tw)
        .select('id, 'id.count, 'tw.end)

      table_gw.toAppendStream[Row].print()

      // Execute
      env.execute()
  }
}

2. Group Windows (SQL)

  • Group windows are defined in the GROUP BY clause of a SQL query
    ➢ TUMBLE(time_attr, interval)
  • Defines a tumbling window; the first argument is the time attribute, the second is the window length
    ➢ HOP(time_attr, interval, interval)
  • Defines a sliding (hopping) window; the first argument is the time attribute, the second is the slide step, the third is the window length
    ➢ SESSION(time_attr, interval)
  • Defines a session window; the first argument is the time attribute, the second is the session gap
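
Only TUMBLE is demonstrated in the example below; for reference, a hedged sketch of the HOP and SESSION forms against the same table_view view used there:

	// Sliding (hopping) window: 5-second slide, 10-second size
	val table_sql_hop: Table = tableEnv.sqlQuery(
	  """
	    select id, count(id), hop_end(ts, interval '5' second, interval '10' second)
	    from table_view
	    group by id, hop(ts, interval '5' second, interval '10' second)
	  """.stripMargin)

	// Session window with a 10-second gap
	val table_sql_session: Table = tableEnv.sqlQuery(
	  """
	    select id, count(id), session_end(ts, interval '10' second)
	    from table_view
	    group by id, session(ts, interval '10' second)
	  """.stripMargin)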

Example: tumbling window (SQL)

  • Test data
ws_001,1577844001,24.0
ws_001,1577844002,23.0
ws_001,1577844003,23.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
ws_002,1577844025,43.0
ws_002,1577844045,43.0
ws_001,1577844090,23.0
ws_002,1577844055,43.0
ws_003,1577844060,32.0
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{EnvironmentSettings, Over, Table, Tumble}
import org.apache.flink.table.api.scala._
import org.apache.flink.types.Row

// Custom data type
case class WaterSensor(id: String, ts: Long, vc: Double)
object TimeAndWindow {

  def main(args: Array[String]): Unit = {
      val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

      env.setParallelism(1)

      env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

      val settings: EnvironmentSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()

      val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env,settings)

      val datas: DataStream[String] = env.readTextFile("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt")

      val dataStream: DataStream[WaterSensor] = datas.filter(x => {
        x.split(",").length == 3
      }).map(data => {
        val strings: Array[String] = data.split(",")
        WaterSensor(strings(0), strings(1).toLong, strings(2).toDouble)
      }).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
        override def extractTimestamp(t: WaterSensor): Long = t.ts * 1000L
      })

      val table: Table = tableEnv.fromDataStream(dataStream,'id,'ts.rowtime,'vc)

      //********************************* Group window (SQL) ****************************************************
      tableEnv.createTemporaryView("table_view", table) // Register the view
      val table_sql_gw: Table = tableEnv.sqlQuery(
        """
          select id,count(id),sum(vc), tumble_end(ts,interval '10' second) from table_view
          group by id,tumble(ts,interval '10' second)
        """.stripMargin)

      // Print the result
      table_sql_gw.toAppendStream[Row].print()

      // Execute
      env.execute()
  }
}

3. Over Windows

  • Over window aggregates are a standard SQL feature (the OVER clause) and can be defined in the SELECT clause of a query
  • An over window aggregate computes, for every input row, an aggregate over a range of neighboring rows
  • Over windows are defined with the window(w: OverWindow*) clause and referenced by alias in the select() method
val table = input
	.window([w: OverWindow] as 'w)
	.select('a, 'b.sum over 'w, 'c.min over 'w)
  • The Table API provides the Over class to configure the properties of an over window

Unbounded Over windows

Over windows can be defined over event time or processing time, with a range given as a time interval or a row count
Unbounded over windows are specified with constants

  • Event-time over window
// Unbounded event-time over window
.window(Over partitionBy 'a orderBy 'rowtime preceding UNBOUNDED_RANGE as 'w)
  • Processing-time over window
// Unbounded processing-time over window
.window(Over partitionBy 'a orderBy 'proctime preceding UNBOUNDED_RANGE as 'w)
  • Event-time row-count over window
// Unbounded event-time row-count over window
.window(Over partitionBy 'a orderBy 'rowtime preceding UNBOUNDED_ROW as 'w)
  • Processing-time row-count over window
// Unbounded processing-time row-count over window
.window(Over partitionBy 'a orderBy 'proctime preceding UNBOUNDED_ROW as 'w)

Bounded Over windows

Bounded over windows are specified with the size of the interval

  • Bounded event-time over window
// Bounded event-time over window
.window(Over partitionBy 'a orderBy 'rowtime preceding 1.minutes as 'w)
  • Bounded processing-time over window
// Bounded processing-time over window
.window(Over partitionBy 'a orderBy 'proctime preceding 1.minutes as 'w)
  • Bounded event-time row-count over window
// Bounded event-time row-count over window
.window(Over partitionBy 'a orderBy 'rowtime preceding 10.rows as 'w)
  • Bounded processing-time row-count over window
// Bounded processing-time row-count over window
.window(Over partitionBy 'a orderBy 'proctime preceding 10.rows as 'w)

Example: Over window (Table API)

  • Test data
ws_001,1577844001,24.0
ws_001,1577844002,23.0
ws_001,1577844003,23.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
ws_002,1577844025,43.0
ws_002,1577844045,43.0
ws_001,1577844090,23.0
ws_002,1577844055,43.0
ws_003,1577844060,32.0
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{EnvironmentSettings, Over, Table, Tumble}
import org.apache.flink.table.api.scala._
import org.apache.flink.types.Row

// Custom data type
case class WaterSensor(id: String, ts: Long, vc: Double)
object TimeAndWindow {

  def main(args: Array[String]): Unit = {
      val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

      env.setParallelism(1)

      env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

      val settings: EnvironmentSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()

      val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env,settings)

      val datas: DataStream[String] = env.readTextFile("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt")

      val dataStream: DataStream[WaterSensor] = datas.filter(x => {
        x.split(",").length == 3
      }).map(data => {
        val strings: Array[String] = data.split(",")
        WaterSensor(strings(0), strings(1).toLong, strings(2).toDouble)
      }).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
        override def extractTimestamp(t: WaterSensor): Long = t.ts * 1000L
      })

      val table: Table = tableEnv.fromDataStream(dataStream,'id,'ts.rowtime,'vc)

      //********************************* Over window ****************************************************
      val table_ow: Table = table.window(Over partitionBy 'id orderBy 'ts preceding 2.rows as 'ow)
        .select('id, 'id.count over 'ow, 'vc.sum over 'ow) // The aggregates are evaluated over the window 'ow

      // Print the result
      table_ow.toAppendStream[Row].print()

      // Execute
      env.execute()
  }
}

4. Over Windows (SQL)

  • When aggregating with OVER, all aggregates must be defined over the same window, i.e. with the same partitioning, ordering, and range
  • Currently only ranges that precede and end at the current row are supported
  • ORDER BY must be specified on a single time attribute
SELECT COUNT(amount) OVER (
PARTITION BY user
ORDER BY proctime
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
FROM Orders

Example: Over window (SQL)

  • Test data
ws_001,1577844001,24.0
ws_001,1577844002,23.0
ws_001,1577844003,23.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
ws_002,1577844025,43.0
ws_002,1577844045,43.0
ws_001,1577844090,23.0
ws_002,1577844055,43.0
ws_003,1577844060,32.0
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{EnvironmentSettings, Over, Table, Tumble}
import org.apache.flink.table.api.scala._
import org.apache.flink.types.Row

// Custom data type
case class WaterSensor(id: String, ts: Long, vc: Double)
object TimeAndWindow {

  def main(args: Array[String]): Unit = {
      val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

      env.setParallelism(1)

      env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

      val settings: EnvironmentSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()

      val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env,settings)

      val datas: DataStream[String] = env.readTextFile("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt")

      val dataStream: DataStream[WaterSensor] = datas.filter(x => {
        x.split(",").length == 3
      }).map(data => {
        val strings: Array[String] = data.split(",")
        WaterSensor(strings(0), strings(1).toLong, strings(2).toDouble)
      }).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
        override def extractTimestamp(t: WaterSensor): Long = t.ts * 1000L
      })

      val table: Table = tableEnv.fromDataStream(dataStream,'id,'ts.rowtime,'vc)

      //********************************* Over window (SQL) ****************************************************
      // Register the view so it can be referenced in the SQL query
      tableEnv.createTemporaryView("table_view", table)
      val table_sql_ow = tableEnv.sqlQuery(
        """
          |select id,count(id) over ow ,sum(vc) over ow from table_view
          |window ow as(
          |partition by id
          |order by ts
          |rows between 2 preceding and current row
          |)
        """.stripMargin)

      table_sql_ow.toAppendStream[Row].print()
	
	  // Execute
      env.execute()
  }
}

11 Functions (Functions)

  • Flink's Table API and SQL provide users with a set of built-in functions for data transformations
  • Many of the functions supported in SQL have implementations in both the Table API and SQL

Comparison functions

SQL:
value1 = value2
value1 > value2

Table API:
ANY1 === ANY2
ANY1 > ANY2

Logical functions

SQL:
boolean1 OR boolean2
boolean IS FALSE
NOT boolean

Table API:
BOOLEAN1 || BOOLEAN2
BOOLEAN.isFalse
!BOOLEAN

Arithmetic functions

SQL:
numeric1 + numeric2
POWER(numeric1, numeric2)

Table API:
NUMERIC1 + NUMERIC2
NUMERIC1.power(NUMERIC2)

String functions

SQL:
string1 || string2
UPPER(string)
CHAR_LENGTH(string)

Table API:
STRING1 + STRING2
STRING.upperCase()
STRING.charLength()

Temporal functions

SQL:
DATE string
TIMESTAMP string
CURRENT_TIME
INTERVAL string range

Table API:
STRING.toDate
STRING.toTimestamp
currentTime()
NUMERIC.days
NUMERIC.minutes

Aggregate functions

SQL:
COUNT(*)
SUM(expression)
RANK()
ROW_NUMBER()

Table API:
FIELD.count
FIELD.sum
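
A short sketch combining a few of these built-in functions on the WaterSensor table used throughout this article (Table API and the equivalent SQL):

	// Table API: a string function applied to every row
	val upperTable: Table = table.select('id.upperCase() as 'upperId, 'vc)

	// Table API: aggregate functions after a groupBy
	val aggTable: Table = table.groupBy('id).select('id, 'id.count as 'cnt, 'vc.sum as 'sumVc)

	// SQL: the same aggregation with the SQL forms of the functions
	val aggSql: Table = tableEnv.sqlQuery(
	  "select id, count(id), sum(vc) from table_view group by id")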

12 User-defined functions (UDF)

  • User-defined functions (User-defined Functions, UDF) are an important feature; they significantly extend the expressiveness of queries
  • In most cases, a user-defined function must be registered before it can be used in a query
  • Functions are registered in the TableEnvironment by calling the registerFunction() method. When a user-defined function is registered, it is inserted into the TableEnvironment's function catalog, so that the Table API or SQL parser can recognize and interpret it correctly

Scalar functions (Scalar Functions)

  • A user-defined scalar function maps zero, one, or more scalar values to a new scalar value
  • To define a scalar function, extend the base class ScalarFunction in org.apache.flink.table.functions and implement one or more evaluation (eval) methods
  • The behavior of a scalar function is determined by its evaluation methods, which must be declared public and named eval
  • A scalar function is one-in, one-out, similar to map
class HashCode(factor: Int) extends ScalarFunction {
	def eval(s: String): Int = {
		s.hashCode * factor
}}

Example: Scalar Functions

Test data

ws_001,1577844001,24.0
ws_001,1577844002,23.0
ws_001,1577844003,23.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
ws_002,1577844025,43.0
ws_002,1577844045,43.0
ws_001,1577844090,23.0
ws_002,1577844055,43.0
ws_003,1577844060,32.0

Implementation

import TableAPI.WaterSensor
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{EnvironmentSettings, Table}
import org.apache.flink.table.api.scala._
import org.apache.flink.table.functions.ScalarFunction
import org.apache.flink.types.Row

// Custom data type
case class WaterSensor(id: String, ts: Long, vc: Double)
object Test_utf {

  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    env.setParallelism(1)

    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val settings: EnvironmentSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()

    val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env,settings)

    val datas: DataStream[String] = env.readTextFile("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt")

    val dataStream: DataStream[WaterSensor] = datas.filter(x => {
      x.split(",").length == 3
    }).map(data => {
      val strings: Array[String] = data.split(",")
      WaterSensor(strings(0), strings(1).toLong, strings(2).toDouble)
    }).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
      override def extractTimestamp(t: WaterSensor): Long = t.ts * 1000L
    })

    val table: Table = tableEnv.fromDataStream(dataStream,'id,'ts.rowtime,'vc)

    // Instantiate the UDF object
    val hashCode = new HashCode(10)

    //**************************** Table API: using the custom ScalarFunction ***************************
    val table_tf: Table = table.select('id,hashCode('id))

    table_tf.toAppendStream[Row].print("table_tf")


    //******************************** SQL: using the custom ScalarFunction ******************************
    // Register the view
    tableEnv.createTemporaryView("table_view",table)
    // Register the UDF
    tableEnv.registerFunction("hashCode",hashCode)
    // Use the UDF
    val table_sql: Table = tableEnv.sqlQuery(
      """
        select id,hashCode(id) from table_view
      """.stripMargin)

    table_sql.toAppendStream[Row].print("table_sql")

    env.execute()

  }

}

// Custom UDF: extend ScalarFunction
// ScalarFunction is one-in, one-out, similar to map
class HashCode(factor: Int) extends ScalarFunction {

  // An eval method must be implemented, and it must be public
  def eval(s: String): Int = {
    // The actual logic
    s.hashCode * factor - 10000
  }
}

Table functions (Table Functions)

  • A user-defined table function also takes zero, one, or more scalar values as input; unlike a scalar function, it can return any number of rows as output rather than a single value
  • To define a table function, extend the base class TableFunction in org.apache.flink.table.functions and implement one or more evaluation methods
  • The behavior of a table function is determined by its evaluation methods, which must be public and named eval
class Split(separator: String) extends TableFunction[(String, Int)] {
	def eval(str: String): Unit = {
		str.split(separator).foreach(word => collect((word, word.length)))
}}

Example: Table Functions

Test data

ws_001,1577844001,24.0
ws_001,1577844002,23.0
ws_001,1577844003,23.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
ws_002,1577844025,43.0
ws_002,1577844045,43.0
ws_001,1577844090,23.0
ws_002,1577844055,43.0
ws_003,1577844060,32.0

Implementation

import TableAPI.WaterSensor
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{EnvironmentSettings, Table}
import org.apache.flink.table.api.scala._
import org.apache.flink.table.functions.TableFunction
import org.apache.flink.types.Row

// Custom data type
case class WaterSensor(id: String, ts: Long, vc: Double)
object UDF_table {

  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    env.setParallelism(1)

    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val settings: EnvironmentSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()

    val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env,settings)

    val datas: DataStream[String] = env.readTextFile("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt")

    val dataStream: DataStream[WaterSensor] = datas.filter(x => {
      x.split(",").length == 3
    }).map(data => {
      val strings: Array[String] = data.split(",")
      WaterSensor(strings(0), strings(1).toLong, strings(2).toDouble)
    }).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
      override def extractTimestamp(t: WaterSensor): Long = t.ts * 1000L
    })

    val table: Table = tableEnv.fromDataStream(dataStream,'id,'ts.rowtime,'vc)

    //********************************** Table API: using the custom TableFunction **********************************
    // Create an instance of the table function
    val spilt = new Spilt("_")
    val table_tf: Table = table
      // joinLateral correlates each input row with the rows produced by the UDF
      .joinLateral(spilt('id) as ('word,'length))
      // The fields produced by the UDF can be referenced directly in select
      .select('id,'ts,'vc,'word,'length)

    // Convert to an append stream and print
    table_tf.toAppendStream[Row].print("table_tf")

    //********************************** SQL: using the custom TableFunction **********************************
    // Register the view
    tableEnv.createTemporaryView("table_view",table)
    // Register the UDF
    tableEnv.registerFunction("split",spilt)
    // Use the UDF
    // lateral table() joins the rows produced by the UDF and names the fields it returns
    val table_sql: Table = tableEnv.sqlQuery(
      """
        select id, ts, vc, word, length from table_view,
        lateral table( split(id) ) as splitId(word,length)
      """.stripMargin)

    // Convert to an append stream and print
    table_sql.toAppendStream[Row].print("table_sql")

    env.execute()


  }

}

// Custom table function: extend TableFunction and declare the type of the rows it emits
// TableFunction is one-in, many-out, similar to flatMap
class Spilt(separator: String) extends TableFunction[(String, Int)] {

  // Implement the eval method and declare the input type
  def eval(s: String): Unit = {
    // The actual logic
    // Example: split the incoming string and iterate over the tokens
    s.split(separator).foreach(word => {
      // Emit the token together with its length via collect
      collect((word, word.length))
    })
  }
}

Aggregate functions (Aggregate Functions)

  • A user-defined aggregate function (User-Defined Aggregate Functions, UDAGGs) aggregates the data of a table into a single scalar value

  • A user-defined aggregate function is implemented by extending the AggregateFunction abstract class

  • AggregateFunction requires the following methods to be implemented:
    createAccumulator()
    accumulate()
    getValue()

  • AggregateFunction works as follows:
    a. It first needs an accumulator (Accumulator), the data structure that holds the intermediate result of the aggregation; an empty accumulator is created by calling createAccumulator()
    b. The function's accumulate() method is then called for every input row to update the accumulator
    c. Once all rows have been processed, the function's getValue() method is called to compute and return the final result

Example: Aggregate Functions
Test data

ws_001,1577844001,24.0
ws_001,1577844002,23.0
ws_001,1577844003,23.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
ws_002,1577844025,43.0
ws_002,1577844045,43.0
ws_001,1577844090,23.0
ws_002,1577844055,43.0
ws_003,1577844060,32.0

Implementation

import TableAPI.WaterSensor
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{EnvironmentSettings, Table}
import org.apache.flink.table.api.scala._
import org.apache.flink.table.functions.AggregateFunction
import org.apache.flink.types.Row

// Custom data type
case class WaterSensor(id: String, ts: Long, vc: Double)
object UDF_AggregateFunctions {

  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    env.setParallelism(1)

    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val settings: EnvironmentSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()

    val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env,settings)

    val datas: DataStream[String] = env.readTextFile("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt")

    val dataStream: DataStream[WaterSensor] = datas.filter(x => {
      x.split(",").length == 3
    }).map(data => {
      val strings: Array[String] = data.split(",")
      WaterSensor(strings(0), strings(1).toLong, strings(2).toDouble)
    }).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
      override def extractTimestamp(t: WaterSensor): Long = t.ts * 1000L
    })

    val table: Table = tableEnv.fromDataStream(dataStream,'id,'ts.rowtime,'vc)

    val avgTemp = new AvgTemp()

    //**************************** Table API: using the custom AggregateFunction ***************************
    val table_avg: Table = table
      .groupBy('id)
      // Apply the custom aggregate function and rename its result
      .aggregate(avgTemp('vc) as 'avgTemp)
      .select('id, 'avgTemp)

    table_avg.toRetractStream[Row].print("table api->")

    //******************************** SQL: using the custom AggregateFunction ******************************
    tableEnv.createTemporaryView("table_view",table)
    // Register the UDF
    tableEnv.registerFunction("avgTemp",avgTemp)
    // Use the UDF
    val table_sql: Table = tableEnv.sqlQuery(
      """
        select id, avgTemp(vc) from table_view group by id
      """.stripMargin)

    table_sql.toRetractStream[Row].print("table sql->")

    env.execute()


  }
}

// The accumulator type
class AvgTempAcc {
  var sum: Double = 0.0
  var count: Int = 0
}

// Custom aggregate function: extend AggregateFunction and declare the result and accumulator types
// AggregateFunction is many-in, one-out, like a built-in aggregate
class AvgTemp extends AggregateFunction[Double, AvgTempAcc] {

  // Compute the average from the accumulated sum and count
  override def getValue(accumulator: AvgTempAcc): Double = accumulator.sum / accumulator.count

  // Create an empty accumulator
  override def createAccumulator(): AvgTempAcc = new AvgTempAcc

  // accumulate is called for every input row to update the accumulator
  def accumulate(accumulator: AvgTempAcc, temp: Double): Unit = {
    accumulator.sum += temp
    accumulator.count += 1
  }
}
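
The example above implements only the three required methods. Depending on where the aggregate is used (for instance in aggregates over bounded OVER windows or in session windows), Flink may also require optional methods such as retract() and merge(); a hedged sketch of how they could look inside the AvgTemp class above:

	// Optional: retract an input value (needed e.g. for aggregates over bounded OVER windows)
	def retract(accumulator: AvgTempAcc, temp: Double): Unit = {
	  accumulator.sum -= temp
	  accumulator.count -= 1
	}

	// Optional: merge several accumulators into one (needed e.g. for session window aggregates)
	def merge(accumulator: AvgTempAcc, its: java.lang.Iterable[AvgTempAcc]): Unit = {
	  val iter = its.iterator()
	  while (iter.hasNext) {
	    val acc = iter.next()
	    accumulator.sum += acc.sum
	    accumulator.count += acc.count
	  }
	}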

Table aggregate functions (Table Aggregate Functions)

  • A user-defined table aggregate function (User-Defined Table Aggregate Functions, UDTAGGs) aggregates the data of a table into a result table with multiple rows and columns
  • A user-defined table aggregate function is implemented by extending the TableAggregateFunction abstract class

Example: Table Aggregate Functions
Test data

ws_001,1577844001,24.0
ws_001,1577844002,23.0
ws_001,1577844003,23.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
ws_002,1577844025,43.0
ws_002,1577844045,43.0
ws_001,1577844090,23.0
ws_002,1577844055,43.0
ws_003,1577844060,32.0

Implementation

import TableAPI.WaterSensor
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{EnvironmentSettings, Table}
import org.apache.flink.table.api.scala._
import org.apache.flink.table.functions.TableAggregateFunction
import org.apache.flink.types.Row
import org.apache.flink.util.Collector

// Custom data type
case class WaterSensor(id: String, ts: Long, vc: Double)
object UDF_TableAggregateFunctions {

  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    env.setParallelism(1)

    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val settings: EnvironmentSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()

    val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env,settings)

    val datas: DataStream[String] = env.readTextFile("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt")

    val dataStream: DataStream[WaterSensor] = datas.filter(x => {
      x.split(",").length == 3
    }).map(data => {
      val strings: Array[String] = data.split(",")
      WaterSensor(strings(0), strings(1).toLong, strings(2).toDouble)
    }).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
      override def extractTimestamp(t: WaterSensor): Long = t.ts * 1000L
    })

    val table: Table = tableEnv.fromDataStream(dataStream,'id,'ts.rowtime,'vc)

    //********************************** Table API: using the TableAggregateFunction **********************************
    // Instantiate the table aggregate function
    val top2Temp = new Top2Temp
    val table_api: Table = table
      .groupBy('id)
      // Call the table aggregate function with its input field and name the fields it returns
      .flatAggregate(top2Temp('vc) as('temp, 'rank))
      .select('id, 'temp, 'rank)

    table_api.toRetractStream[Row].print("table_api")

    env.execute()
	}
}

// A class representing the state of the aggregate function
class Top2TempAcc {
  var highestTemp: Double = Double.MinValue
  var secondHighestTemp: Double = Double.MinValue
}

// Custom table aggregate function
class Top2Temp extends TableAggregateFunction[(Double, Int), Top2TempAcc] {

  // Create an empty state instance
  override def createAccumulator(): Top2TempAcc = new Top2TempAcc

  // accumulate updates the aggregate state for each incoming value
  def accumulate(acc: Top2TempAcc, temp: Double): Unit = {
    // Is the current temperature higher than the stored maximum?
    if (temp > acc.highestTemp) {
      // Demote the old maximum to second place (the old second place is dropped)
      acc.secondHighestTemp = acc.highestTemp
      // The new value becomes the maximum
      acc.highestTemp = temp
    } else if (temp > acc.secondHighestTemp) {
      // If the value only beats the second place, replace the second place
      acc.secondHighestTemp = temp
    }
  }

  // Emit the result rows; the method must be named emitValue
  def emitValue(acc: Top2TempAcc, out: Collector[(Double, Int)]): Unit = {
    out.collect((acc.highestTemp, 1))
    out.collect((acc.secondHighestTemp, 2))
  }
}

Hands-on: reading a local data file with Flink Table

In older Flink versions the OldCsv descriptor can be used directly to parse the data, but in newer versions a separate dependency has to be added for the format type.
Example: the CSV format
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-csv</artifactId>
<version>1.10.2</version>
</dependency>

Test data

ws_001,1577844001,35.0
ws_002,1577844015,43.0
ws_003,1577844020,72.0
ws_001,1577844001,45.0
ws_002,1577844015,73.0

Implementation

import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api._
import org.apache.flink.table.api.scala._
import org.apache.flink.table.descriptors._
import org.apache.flink.table.factories.TableSourceFactory

object TableApiTest {

  def main(args: Array[String]): Unit = {

    // Create the stream processing environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // Create the table environment
    val tableEnv = StreamTableEnvironment.create(env)

    // Read the local CSV file and register it as a table
    tableEnv.connect(new FileSystem().path("D:\\...\\InputTableCsv.txt"))
      .withFormat(new Csv) // Input data format
      .withSchema(new Schema() // Field definitions
      	  .field("id",DataTypes.STRING()) // Column 1: id, String
          .field("ts",DataTypes.BIGINT()) // Column 2: ts, Long
          .field("vc",DataTypes.DOUBLE()) // Column 3: vc, Double
      ).createTemporaryTable("inputTable") // Register a table holding the input data

    // Create a Table object from the registered table name
    val table: Table = tableEnv.from("inputTable")
    // Print the Table as a stream (toAppendStream requires import org.apache.flink.table.api.scala._)
    table.toAppendStream[(String,Long,Double)].print()

    // Execute
    env.execute()
  }
}

Result

(screenshot omitted)

Hands-on: reading data from a Kafka topic with Flink Table

Add the dependency

	<!-- Adjust to the version you are using -->
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-connector-kafka_2.11</artifactId>
      <version>1.10.2</version>
    </dependency>

Create the Kafka topic

kafka-topics.sh --create --zookeeper 192.168.**.**:2181 --topic tableTest --partitions 1 --replication-factor 1

Create a console producer

kafka-console-producer.sh --topic tableTest --broker-list 192.168.**.**:9092

Implementation

import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api._
import org.apache.flink.table.api.scala._
import org.apache.flink.table.descriptors._
import org.apache.flink.table.factories.TableSourceFactory

object TableApiTest {

  def main(args: Array[String]): Unit = {

    // Create the stream processing environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // Create the table environment
    val tableEnv = StreamTableEnvironment.create(env)

    // Read data from Kafka
    tableEnv.connect(new Kafka()
        .version("universal") // Kafka version: "universal" uses the generic connector, compatible with 0.10 and later
        .topic("tableTest") // The topic to consume
        .property("zookeeper.connect","192.168.95.99:2181") // ZooKeeper address
        .property("bootstrap.servers","192.168.95.99:9092") // Kafka broker address
    ).withFormat(new Csv()) // Data format
        .withSchema(new Schema() // Field definitions
            .field("id",DataTypes.STRING())
            .field("ts",DataTypes.BIGINT())
            .field("vc",DataTypes.DOUBLE())
        ).createTemporaryTable("kafkaInputTable") // Register the table


    // Create a Table object from the registered table name
    val table: Table = tableEnv.from("kafkaInputTable")
    // Print the Table as a stream (toAppendStream requires import org.apache.flink.table.api.scala._)
    table.toAppendStream[(String,Long,Double)].print()

    env.execute()
  }
}

Data entered in the producer

(screenshot omitted)

Data received after the program starts

(screenshot omitted)

Hands-on: converting a DataStream into a Table

Test data

ws_001,1577844001,35.0
ws_002,1577844015,43.0
ws_003,1577844020,72.0
ws_001,1577844001,45.0
ws_002,1577844015,73.0

Implementation

import Source.WaterSensor
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.Table
import org.apache.flink.table.api.scala._

// Case class for the records
case class WaterSensor(id: String, ts: Long, vc: Double)
object Example {

  def main(args: Array[String]): Unit = {

    // Create the stream execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

	// Read the local data file into a DataStream
    val dataStream: DataStream[String] = env.readTextFile("D:\\...\\inputTableCsv.txt")

    // Convert the input lines into WaterSensor records
    val dataStream2: DataStream[WaterSensor] = dataStream.filter(data => {
      val strings: Array[String] = data.split(",")
      strings.length == 3
    }).map(data => {
      val strings: Array[String] = data.split(",")
      WaterSensor(strings(0), strings(1).toLong, strings(2).toDouble)
    })

    // First create the table environment
    val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env)

    // Option 1: create a Table object from the stream
    val dataTable: Table = tableEnv.fromDataStream(dataStream2)
    // Transform it with the Table API
    val resultTable: Table = dataTable.select("id,vc").filter("id='ws_003'")

    // Option 2: analyze the data directly with SQL
    // Register a view
    tableEnv.createTemporaryView("dataTable",resultTable)
    val sqlTable: Table = tableEnv.sqlQuery("select id,vc from dataTable where id='ws_003'")

    // Convert the Table API result into a stream and print it
    resultTable.toAppendStream[(String,Double)].print("table api")

    // Convert the SQL result into a stream and print it
    sqlTable.toAppendStream[(String,Double)].print("sql")
    
    env.execute()
  }
}

Console output

(screenshot omitted)

Hands-on: writing Table data to a local file

Example code

import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.{DataTypes, EnvironmentSettings, Table}
import org.apache.flink.table.api.scala._
import org.apache.flink.table.descriptors.{Csv, FileSystem, Schema}

// Case class for the records
case class WaterSensor(id: String, ts: Long, vc: Double)
object TableOutCsv {

  def main(args: Array[String]): Unit = {

    // Create the stream execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    // Create the table environment
    val settings: EnvironmentSettings = EnvironmentSettings.newInstance()
      .useBlinkPlanner()
      .inStreamingMode()
      .build()
    val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env, settings)
    // Read the local data file and convert it to WaterSensor records
    val dataStream: DataStream[WaterSensor] = env.readTextFile("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt")
      .map(a => {
        val strings: Array[String] = a.split(",")
        WaterSensor(strings(0),strings(1).toLong,strings(2).toDouble)
      })
    // Create a Table object from the stream
    val dataTable: Table = tableEnv.fromDataStream(dataStream)
    // Query with the Table API
    val dataTable2: Table = dataTable.select("id,vc")

    // Register the output (sink) table
    val outputPath:String="D:\\ideaProject\\FlinkDemo\\in\\outputTable.txt"

    tableEnv.connect(new FileSystem().path(outputPath))
      .withFormat(new Csv())
      .withSchema(
      new Schema()
        .field("id",DataTypes.STRING())
        .field("vc",DataTypes.DOUBLE())
    ).createTemporaryTable("outputTable")

    // Insert the query result into the output table
    dataTable2.insertInto("outputTable")
    // Execute
    env.execute()
  }
}

Test data

ws_001,1577844001,24.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0

Before running

(screenshot omitted)

After running

(screenshot omitted)

Note: CsvTableSink only implements BatchTableSink (a batch table sink) and AppendStreamTableSink (an append-only table sink), so a result that involves aggregations or other updating operations cannot be written to a local file through CsvTableSink
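
As a hedged workaround sketch: an aggregated (updating) result can first be converted into a retract stream and then written with a regular DataStream sink (the output path and field names here are illustrative):

	// An updating result (e.g. a grouped count) cannot go through the append-only CSV sink;
	// convert it to a retract stream and write it with a DataStream sink instead
	val aggTable: Table = dataTable.groupBy('id).select('id, 'vc.count as 'cnt)
	aggTable.toRetractStream[(String, Long)]
	  .filter(_._1) // keep only the "add" messages (true flag); every update is appended, so the file records the change history
	  .map(_._2)
	  .writeAsCsv("D:\\ideaProject\\FlinkDemo\\in\\aggOutput.csv")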

Hands-on: writing Table data to Kafka

Example code

import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.{DataTypes, EnvironmentSettings, Table}
import org.apache.flink.table.api.scala._
import org.apache.flink.table.descriptors.{Csv, Kafka, Schema}

object TableOutKafka {

  def main(args: Array[String]): Unit = {

    // Create the stream execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    // Create the table environment
    val tableStream: EnvironmentSettings = EnvironmentSettings.newInstance()
      .useBlinkPlanner()
      .inStreamingMode()
      .build()
    val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env, tableStream)

    // Register the Kafka source table
    tableEnv.connect(new Kafka()
      .version("universal")
      .topic("inTable") //指定Topic
      .property("zookeeper.connect","192.168.**.**:2181")
      .property("bootstrap.servers","192.168.**.**:9092")
    ).withFormat(new Csv())
      .withSchema(new Schema()
      .field("id",DataTypes.STRING())
      .field("ts",DataTypes.BIGINT())
      .field("vc",DataTypes.DOUBLE())
    ).createTemporaryTable("kafkaInputTable")

    // Create a Table object from the registered table name
    val table: Table = tableEnv.from("kafkaInputTable")
    // Query
    val resultTable: Table = table.select('id,'vc).filter('id === "ws_003")

    // Register the Kafka sink table
    tableEnv.connect(new Kafka()
      .version("universal")
      .topic("outTable")
      .property("zookeeper.connect","192.168.95.99:2181")
      .property("bootstrap.servers","192.168.95.99:9092")
    ).withFormat(new Csv()) // RFC-compliant CSV format
      .withSchema(new Schema()
      .field("id",DataTypes.STRING())
      .field("vc",DataTypes.DOUBLE())
    ).createTemporaryTable("kafkaOutputTable")

    // Insert the query result into the sink table
    resultTable.insertInto("kafkaOutputTable")
    
    env.execute()
  }
}

Create the matching Kafka producer and consumer

  • Producer: kafka-console-producer.sh --topic inTable --broker-list 192.168.**.**:9092
  • Consumer: kafka-console-consumer.sh --topic outTable --bootstrap-server 192.168.**.**:9092

(screenshot omitted)

Start the program

  • Enter test data in the producer: ws_003,1577844020,32.0

(screenshot omitted)

This gives a Kafka-in, Kafka-out pipeline. Note, however, that KafkaTableSink also only implements the append mode

Hands-on: writing Table data to Elasticsearch (ES)

Add the dependency

	<!-- Adjust to the version you are using -->
	<dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-connector-elasticsearch-base_2.11</artifactId>
      <version>1.10.2</version>
    </dependency>

Test data

ws_001,1577844001,24.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0

Example code

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.{DataTypes, EnvironmentSettings, Table}
import org.apache.flink.table.api.scala._
import org.apache.flink.table.descriptors._


object TableOutES {

  def main(args: Array[String]): Unit = {

    // Create the stream execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    // Create the table environment
    val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env)
    // Read the CSV file and register it as a table
    tableEnv.connect(new FileSystem().path("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt"))
      .withFormat(new Csv) // Input data format
      .withSchema(new Schema() // Field definitions
        .field("id",DataTypes.STRING()) // Column 1: id, String
        .field("ts",DataTypes.BIGINT()) // Column 2: ts, Long
        .field("vc",DataTypes.DOUBLE()) // Column 3: vc, Double
      ).createTemporaryTable("inputTable") // Register a table holding the input data

    // Create a Table object from the registered table name
    val tableData: Table = tableEnv.from("inputTable")
    // Run an aggregating Table API query
    val dataTable2: Table = tableData.groupBy('id).select('id,'id.count as 'count)

    // Configure the Elasticsearch connector and register the sink table
    tableEnv.connect(new Elasticsearch()
      .version("6")
      .host("192.168.**.**",9200,"http")
      .index("sensor")
      .documentType("temperature")
    ).inUpsertMode() // Elasticsearch supports upsert mode
      .withFormat(new Json()) // Requires the flink-json dependency
      .withSchema(new Schema()
        .field("id",DataTypes.STRING())
        .field("count",DataTypes.BIGINT())
      ).createTemporaryTable("esOutputTable")

    // Insert the aggregated data into the table registered for Elasticsearch
    dataTable2.insertInto("esOutputTable")

  env.execute("es out put test")
}
}

Check Elasticsearch

  • curl "192.168.**.**:9200/_cat/indices?v"

(screenshot omitted)
Check the data

  • curl "192.168.**.**:9200/sensor/_search?pretty"

(screenshot omitted)

Hands-on: writing Table data to MySQL

Add the dependency

	<!-- Adjust to the version you are using -->
	<dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-jdbc_2.11</artifactId>
      <version>1.10.2</version>
    </dependency>

Test data

ws_001,1577844001,24.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0

Implementation

import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api._
import org.apache.flink.table.api.scala._
import org.apache.flink.table.descriptors.{Csv, FileSystem, Schema}

object TableOutMySQL {

  def main(args: Array[String]): Unit = {

    // Create the stream execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    // Create the table environment
    val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env)
    // Read the CSV file and register it as a table
    tableEnv.connect(new FileSystem().path("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt"))
      .withFormat(new Csv) // Input data format
      .withSchema(new Schema() // Field definitions
        .field("id",DataTypes.STRING()) // Column 1: id, String
        .field("ts",DataTypes.BIGINT()) // Column 2: ts, Long
        .field("vc",DataTypes.DOUBLE()) // Column 3: vc, Double
      ).createTemporaryTable("inputTable") // Register a table holding the input data

    // Create a Table object from the registered table name
    val tableData: Table = tableEnv.from("inputTable")
    // Run an aggregating Table API query
    val dataTable2: Table = tableData.groupBy('id).select('id,'id.count as 'count)

    // DDL describing the JDBC sink table
    // OutputToMySQLTable is the name registered in the catalog
    // flink_to_mysql is the target table in MySQL
    val sinkDDL=
      """
        |create table OutputToMySQLTable(
        | id varchar(20) not null,
        | cnt bigint not null
        | ) with (
        | 'connector.type' = 'jdbc',
        | 'connector.url' = 'jdbc:mysql://192.168.95.99:3306/test',
        | 'connector.table' = 'flink_to_mysql',
        | 'connector.driver' = 'com.mysql.jdbc.Driver',
        | 'connector.username' = 'root',
        | 'connector.password' = 'root123'
        | )
      """.stripMargin
		
	// Execute the DDL to register the sink table
    tableEnv.sqlUpdate(sinkDDL)

	// Insert the query result into MySQL
    dataTable2.insertInto("OutputToMySQLTable")

    env.execute()
  }


}

Create the table in MySQL

  • Create the table: create table flink_to_mysql(id varchar(20),cnt bigint);
  • Check the table structure: desc flink_to_mysql;

(screenshot omitted)
Start the program

Result

  • Check the data: select * from flink_to_mysql;
    (screenshot omitted)


Reposted from blog.csdn.net/weixin_38468167/article/details/112348368