Flink's Table API and SQL

The Table API is a general-purpose relational API for stream processing and batch processing: the same Table API query can run on streaming or batch input without any modification. The Table API is a superset of the SQL language and is designed specifically for Apache Flink. It is a language-integrated API for Scala and Java: unlike regular SQL, where the query is specified as a string, a Table API query is defined in a language-embedded style in Java or Scala, with IDE support such as auto-completion and syntax checking.
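
To get a feel for the difference in style, the two snippets below express the same query once with the Table API and once as SQL. This is a minimal sketch: tableEnv and sensorTable are placeholders for an existing table environment and a table that has already been registered under the name "sensorTable".

// Table API: the query is composed from method calls and field expressions
val apiResult: Table = sensorTable
  .filter('id === "sensor_1")
  .select('id, 'temperature)

// SQL: the same query is passed to the planner as a string
val sqlResult: Table = tableEnv.sqlQuery(
  "select id, temperature from sensorTable where id = 'sensor_1'")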

1. Required POM dependencies

<dependency>
	<groupId>org.apache.flink</groupId>
	<artifactId>flink-table-planner_2.12</artifactId>
	<version>1.10.1</version>
</dependency>
<dependency>
	<groupId>org.apache.flink</groupId>
	<artifactId>flink-table-api-scala-bridge_2.12</artifactId>
	<version>1.10.1</version>
</dependency>

2. A simple understanding of the Table API

// Imports used by the examples in this post (Flink 1.10.1, old planner)
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.{EnvironmentSettings, Table}
import org.apache.flink.table.api.scala._

def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(1)
  val inputStream = env.readTextFile("..\\sensor.txt")
  val dataStream = inputStream
    .map(data => {
      // parse each CSV line into a SensorReading
      val dataArray = data.split(",")
      SensorReading(dataArray(0).trim, dataArray(1).trim.toLong,
        dataArray(2).trim.toDouble)
    })
  // create a table environment on top of the stream environment
  val settings: EnvironmentSettings =
    EnvironmentSettings.newInstance().useOldPlanner().inStreamingMode().build()
  val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env, settings)
  // create a table from the stream
  val dataTable: Table = tableEnv.fromDataStream(dataStream)
  // select specific data from the table
  val selectedTable: Table = dataTable.select('id, 'temperature)
    .filter("id = 'sensor_1'")
  val selectedStream: DataStream[(String, Double)] = selectedTable
    .toAppendStream[(String, Double)]
  selectedStream.print()
  env.execute("table test")
}
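
The SensorReading type used in the map step is not defined in the post; a minimal case class matching the three parsed fields would be:

case class SensorReading(id: String, timestamp: Long, temperature: Double)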

2.1 Dynamic table

If the data type in the stream is a case class, a table can be generated directly from the structure of the case class:

tableEnv.fromDataStream(dataStream)

Or the fields can be named individually, in the order in which they appear:

tableEnv.fromDataStream(dataStream, 'id, 'timestamp, ...)

The resulting dynamic table can finally be converted back into a stream for output:

table.toAppendStream[(String, String)]
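
Putting the three snippets of this section together, a minimal round trip from a stream of SensorReading values to a table and back to a stream might look like this (variable names are illustrative):

// stream -> table, naming the fields explicitly
val sensorTable: Table = tableEnv.fromDataStream(dataStream, 'id, 'timestamp, 'temperature)
// table -> append-only stream of (id, temperature) pairs
val outStream: DataStream[(String, Double)] = sensorTable
  .select('id, 'temperature)
  .toAppendStream[(String, Double)]
outStream.print()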

2.2 Field

A single quote in front of a name marks it as a field expression, for example 'name, 'id, 'amount, etc.
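
These field expressions also support renaming with as; a small sketch, reusing the sensorTable from the snippet above (the alias 'temp is an illustrative name, not from the original post):

val renamed: Table = sensorTable.select('id, 'temperature as 'temp)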

3. Window aggregation operations in the Table API

3.1 Learning the Table API through an example

// Count the number of temperature readings per sensor in 10-second windows
// Additional imports used here (on top of those in the first example):
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.Tumble

def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(1)
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
  val inputStream = env.readTextFile("..\\sensor.txt")
  val dataStream = inputStream
    .map(data => {
      val dataArray = data.split(",")
      SensorReading(dataArray(0).trim, dataArray(1).trim.toLong,
        dataArray(2).trim.toDouble)
    })
    // assign event-time timestamps with a 1-second bounded-out-of-orderness watermark
    .assignTimestampsAndWatermarks(
      new BoundedOutOfOrdernessTimestampExtractor[SensorReading](Time.seconds(1)) {
        override def extractTimestamp(element: SensorReading): Long =
          element.timestamp * 1000L
      })
  // create a table environment on top of the stream environment
  val settings: EnvironmentSettings =
    EnvironmentSettings.newInstance().useOldPlanner().inStreamingMode().build()
  val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env, settings)
  // create a table from the stream, defining the fields and marking 'ts as the event-time attribute
  val dataTable: Table = tableEnv.fromDataStream(dataStream, 'id, 'temperature, 'ts.rowtime)
  // aggregate over 10-second tumbling event-time windows
  val resultTable: Table = dataTable
    .window( Tumble over 10.seconds on 'ts as 'tw )
    .groupBy('id, 'tw)
    .select('id, 'id.count)
  val selectedStream: DataStream[(Boolean, (String, Long))] = resultTable
    .toRetractStream[(String, Long)]
  selectedStream.print()
  env.execute("table window test")
}

3.2 About groupBy

(1) If the query uses groupBy, the table can only be converted to a stream with toRetractStream:

val dataStream: DataStream[(Boolean, (String, Long))] = table
  .toRetractStream[(String, Long)]

(2) In the stream produced by toRetractStream, the first Boolean field flags each record: true marks the latest data (an insert), false marks retracted, outdated data (a delete):

val dataStream: DataStream[(Boolean, (String, Long))] = table
  .toRetractStream[(String, Long)]
dataStream.filter(_._1).print()
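
To see why the flag matters: for a non-windowed aggregation such as count(id), every incoming record first retracts the previous result and then inserts the updated one, so the raw retract stream for a single sensor would interleave entries roughly like this (hypothetical values, for illustration only):

// (true,  (sensor_1, 1))   first reading: insert count 1
// (false, (sensor_1, 1))   second reading arrives: retract the old count
// (true,  (sensor_1, 2))   ...and insert the new count
// filter(_._1) keeps only the inserted (latest) rows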

(3) If the query uses a time window, the window alias must appear in the groupBy clause:

val resultTable: Table = dataTable
  .window( Tumble over 10.seconds on 'ts as 'tw )
  .groupBy('id, 'tw)
  .select('id, 'id.count)

3.3 About time windows

(1) To use a time window, the time attribute must be declared in advance. For processing time, it can simply be appended when creating the dynamic table:

val dataTable: Table = tableEnv.fromDataStream(dataStream, 'id, 'temperature, 'ps.proctime)

(2) For event time, the rowtime attribute is declared when creating the dynamic table:

val dataTable: Table = tableEnv.fromDataStream(dataStream, 'id, 'temperature, 'ts.rowtime)

(3) A tumbling window is expressed as Tumble over <size> on <time attribute>, e.g. Tumble over 10000.millis on 'ts (10000.millis is equivalent to the 10.seconds used below):

val resultTable: Table = dataTable
  .window( Tumble over 10.seconds on 'ts as 'tw )
  .groupBy('id, 'tw)
  .select('id, 'id.count)
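
Tumbling windows are the only kind used in this post, but the same .window(...) call also accepts sliding windows via the Slide class, which lives next to Tumble in org.apache.flink.table.api. A hedged sketch under the same field definitions:

val slidingResult: Table = dataTable
  .window( Slide over 10.seconds every 5.seconds on 'ts as 'sw )
  .groupBy('id, 'sw)
  .select('id, 'id.count)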

4. How to write SQL

// Count the number of temperature readings per sensor in 10-second windows, written directly in SQL
// (imports are the same as in the example in section 3.1)
def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(1)
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
  val inputStream = env.readTextFile("..\\sensor.txt")
  val dataStream = inputStream
    .map(data => {
      val dataArray = data.split(",")
      SensorReading(dataArray(0).trim, dataArray(1).trim.toLong,
        dataArray(2).trim.toDouble)
    })
    .assignTimestampsAndWatermarks(
      new BoundedOutOfOrdernessTimestampExtractor[SensorReading](Time.seconds(1)) {
        override def extractTimestamp(element: SensorReading): Long =
          element.timestamp * 1000L
      })
  // create a table environment on top of the stream environment
  val settings: EnvironmentSettings =
    EnvironmentSettings.newInstance().useOldPlanner().inStreamingMode().build()
  val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env, settings)
  // create a table from the stream, defining the fields and marking 'ts as the event-time attribute
  val dataTable: Table = tableEnv.fromDataStream(dataStream, 'id, 'temperature, 'ts.rowtime)
  // write the windowed aggregation directly as SQL
  val resultSqlTable: Table = tableEnv.sqlQuery("select id, count(id) from "
    + dataTable + " group by id, tumble(ts, interval '10' second)")
  val selectedStream: DataStream[(Boolean, (String, Long))] =
    resultSqlTable.toRetractStream[(String, Long)]
  selectedStream.print()
  env.execute("table window test")
}
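
As a variation not shown in the original post, the table can also be registered under a name and queried by that name; the tumble_start / tumble_end functions then expose the window boundaries in the result. A sketch assuming the same dataTable with event-time attribute ts:

tableEnv.createTemporaryView("sensor", dataTable)
val windowedSqlTable: Table = tableEnv.sqlQuery(
  """select id, count(id) as cnt,
    |       tumble_start(ts, interval '10' second) as w_start,
    |       tumble_end(ts, interval '10' second) as w_end
    |from sensor
    |group by id, tumble(ts, interval '10' second)""".stripMargin)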

Origin: blog.csdn.net/weixin_43520450/article/details/108689967