Flink Table API and SQL programming notes

The Flink API is divided into four layers; these notes mainly cover how to use the Table API.

The Table API is a relational API shared by stream processing and batch processing: a Table API query can run on streaming or batch input without any modification. The Table API is a superset of the SQL language and is specially designed to work with Apache Flink. It is a language-integrated API for Scala and Java: instead of specifying queries as SQL strings, queries are defined in a language-embedded style in Java or Scala, with IDE support such as autocompletion and syntax checking. The dependency that needs to be added to the pom is as follows:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table_2.12</artifactId>
    <version>1.7.2</version>
</dependency>

Table API & SQL

Table API: WordCount Case

tab.groupBy("word").select("word,count(1) as count")

SQL: WordCount Case

SELECT word,COUNT(*) AS cnt FROM MyTable GROUP BY word
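
For the SQL variant to run, the table must first be registered with the table environment. A minimal sketch, assuming a Table named tab (as in the Table API case above) and a StreamTableEnvironment tEnv:

tEnv.registerTable("MyTable", tab); // register an existing Table under the name MyTable
Table cnt = tEnv.sqlQuery("SELECT word, COUNT(*) AS cnt FROM MyTable GROUP BY word");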

[1] Declarative: users only describe what to do, not how to do it;
[2] High performance: queries are optimized, so better execution performance can be obtained; like SQL, there is an optimizer underneath;
[3] Stream-batch unification: the same statistical logic can run in either streaming mode or batch mode (see the batch sketch after this list);
[4] Stable standard: the semantics follow the SQL standard and rarely change, so there is no need to worry about API compatibility when the underlying engine is upgraded or modified;
[5] Easy to understand: clear semantics, what you see is what you get.
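
To illustrate point [3], here is a minimal batch-mode sketch (hypothetical in-line data, Flink 1.7-style API) of the same word-count logic. Only the environment classes change (org.apache.flink.api.java.ExecutionEnvironment and org.apache.flink.table.api.java.BatchTableEnvironment); the groupBy/select logic is identical to the streaming case:

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);
Table words = tEnv.fromDataSet(env.fromElements("hello", "world", "hello"), "word");
Table counts = words.groupBy("word").select("word, word.count as cnt"); // same Table API logic as on a stream
tEnv.toDataSet(counts, Row.class).print();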

Table API features

The Table API makes multi-statement data processing easy to write.

// For example, filter rows with a < 10 and insert them into table xxx
table.filter("a < 10").insertInto("xxx")
// Filter rows with a > 10 and insert them into table yyy
table.filter("a > 10").insertInto("yyy")

The Table API is one of Flink's own APIs and makes it easier to extend standard SQL (when, and only when, that is needed).

Table API Programming

WordCount programming example

package org.apache.flink.table.api.example.stream;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.table.descriptors.FileSystem;
import org.apache.flink.table.descriptors.OldCsv;
import org.apache.flink.table.descriptors.Schema;
import org.apache.flink.types.Row;

public class JavaStreamWordCount {

    public static void main(String[] args) throws Exception {
        // Get the execution environments
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);
        // Path to the input file
        String path = JavaStreamWordCount.class.getClassLoader().getResource("words.txt").getPath();
        // Specify the file format and line delimiter; the schema has a single String column "word"
        tEnv.connect(new FileSystem().path(path))
            .withFormat(new OldCsv().field("word", Types.STRING).lineDelimiter("\n"))
            .withSchema(new Schema().field("word", Types.STRING))
            .inAppendMode()
            .registerTableSource("fileSource"); // register the source with the environment
        // Obtain the table via scan, then apply table operations
        Table result = tEnv.scan("fileSource")
            .groupBy("word")
            .select("word, count(1) as count");
        // Output the table
        tEnv.toRetractStream(result, Row.class).print();
        // Execute
        env.execute();
    }
}

How to define a Table

A Table is obtained from the TableEnvironment by calling scan: Table myTable = tableEnvironment.scan("myTable"). The myTable being scanned must have been registered beforehand, so the question becomes: in which ways can a Table be registered?
【1】Table descriptor: as in the WordCount example above, connect to an external system such as a file system or Kafka, and also specify the format and the Schema.

tEnv.connect(new FileSystem().path(path))
            .withFormat(new OldCsv().field("word", Types.STRING).lineDelimiter("\n"))
            .withSchema(new Schema().field("word", Types.STRING))
            .inAppendMode()
            .registerTableSource("fileSource");//将source注册到env中

【2】Custom Table source: implement your own TableSource and then register it.

TableSource csvSource = new CsvTableSource(path, new String[]{"word"}, new TypeInformation[]{Types.STRING});
tEnv.registerTableSource("sourceTable2", csvSource);

【3】Register a DataStream: for example, register the String-typed DataStream below as table myTable3, whose schema consists of a single column named word.

DataStream<String> stream = ...
// register the DataStream as table "myTable3" with
// fields "word"
tableEnv.registerDataStream("myTable3", stream, "word");

Dynamic tables

If the data in the stream is a case class, a table can be generated directly from the case class structure:

tableEnv.fromDataStream(ecommerceLogDstream)

Alternatively, the fields can be named individually in field order; a leading single quote marks a field name:

tableEnv.fromDataStream(ecommerceLogDstream,'mid,'uid ......)

Finally, the dynamic table can be converted back into a stream for output. If the result is not produced by simple inserts, use toRetractStream instead of toAppendStream:

table.toAppendStream[(String,String)]

How to output a table

Once we have obtained a result Table, we call insertInto to write it into a target table: resultTable.insertInto("TargetTable");

【1】Table descriptor: similar to registering a source, except that a sink is registered at the end for output. For example, the following writes to targetTable; the main difference is the final registerTableSink call.

tEnv.connect(new FileSystem().path(path))
    .withFormat(new OldCsv().field("word", Types.STRING).lineDelimiter("\n"))
    .withSchema(new Schema().field("word", Types.STRING))
    .registerTableSink("targetTable");

【2】Custom Table sink: implement your own TableSink and register it, for example as sinkTable2.

TableSink csvSink = new CsvTableSink(path, new String[]{"word"}, new TypeInformation[]{Types.STRING});
tEnv.registerTableSink("sinkTable2", csvSink);

【3】Output as a DataStream: for example, the code below produces a retract stream of Tuple2, where the Boolean flag tells whether the row is an add or a delete. If groupBy is used, the table can only be converted to a stream with toRetractStream. In the resulting stream, a Boolean of true marks new data (Insert) and false marks retracted old data (Delete). If a time-windowed API is used, the window field must appear in the groupBy.

// emit the result table to a DataStream
DataStream<Tuple2<Boolean, Row>> stream = tableEnv.toRetractStream(resultTable, Row.class);
stream.filter(t -> t.f0).print();
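
A minimal end-to-end sketch (hypothetical in-line data, reusing the imports from the JavaStreamWordCount example) that makes the true/false flags visible:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);
tEnv.registerDataStream("Words", env.fromElements("hello", "world", "hello"), "word");
Table counts = tEnv.sqlQuery("SELECT word, COUNT(*) AS cnt FROM Words GROUP BY word");
tEnv.toRetractStream(counts, Row.class).print();
env.execute();
// The printed records would look roughly like (order across keys may vary):
// (true,hello,1)  (true,world,1)  (false,hello,1)  (true,hello,2)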

Case code:

package com.zzx.flink

import java.util.Properties

import com.alibaba.fastjson.JSON
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011
import org.apache.flink.table.api.scala._
import org.apache.flink.table.api.{Table, TableEnvironment}

object FlinkTableAndSql {

  def main(args: Array[String]): Unit = {
    // Execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // Use EventTime as the time characteristic
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    // Read data; MyKafkaConsumer is a custom Kafka utility class, pass in the topic
    val dstream: DataStream[String] = env.addSource(MyKafkaConsumer.getConsumer("FLINKTABLE&SQL"))

    // Convert each JSON string into a SensorReding object
    // (SensorReding is a user-defined case class with fields such as mid, uid, ch and timestamp)
    /* requires the following dependency:
      <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.36</version>
      </dependency> */
    val ecommerceLogDstream: DataStream[SensorReding] = dstream.map {
      jsonString => JSON.parseObject(jsonString, classOf[SensorReding])
    }

    // Tell Flink how to extract the event time and generate watermarks
    val ecommerceLogWithEventTimeDStream: DataStream[SensorReding] =
      ecommerceLogDstream.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor[SensorReding](Time.seconds(0)) {
          override def extractTimestamp(t: SensorReding): Long = t.timestamp
        })
    // Set parallelism
    ecommerceLogDstream.setParallelism(1)

    // Create the Table execution environment
    val tableEnv: StreamTableEnvironment = TableEnvironment.getTableEnvironment(env)
    val ecommerceTable: Table = tableEnv.fromDataStream(ecommerceLogWithEventTimeDStream, 'mid, 'uid, 'ch, 'ts.rowtime)

    // Table API: count the records of each channel every 10 seconds
    // groupBy over a tumbling window, using event time to drive the window
    val resultTable: Table = ecommerceTable
      .window(Tumble over 10000.millis on 'ts as 'tt)
      .groupBy('ch, 'tt)
      .select('ch, 'ch.count)

    // The same query via SQL ("xxx" is a placeholder for the registered table name)
    val ecommerceTalbe: String = "xxx"
    val resultSQLTable: Table = tableEnv.sqlQuery(
      "select ch, count(ch) from " + ecommerceTalbe + " group by ch, TUMBLE(ts, INTERVAL '10' SECOND)")

    // Convert the Table into a stream for output
    //val appstoreDStream: DataStream[(String,String,Long)] = appstoreTable.toAppendStream[(String,String,Long)]
    val resultDStream: DataStream[(Boolean, (String, Long))] = resultSQLTable.toRetractStream[(String, Long)]
    // Keep only the insert (true) records
    resultDStream.filter(_._1)
    env.execute()
  }
}
object MyKafkaConsumer {

  def getConsumer(sourceTopic: String): FlinkKafkaConsumer011[String] = {
    val bootstrapServers = "hadoop1:9092"
    // Configuration needed by the Kafka consumer
    val props = new Properties
    // Kafka broker addresses; not every broker has to be listed
    props.put("bootstrap.servers", bootstrapServers)
    // Consumer group
    props.put("group.id", "test")
    // Whether to auto-commit offsets
    props.put("enable.auto.commit", "true")
    // Auto-commit interval
    props.put("auto.commit.interval.ms", "1000")
    // Key deserializer class
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    // Value deserializer class
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    // Reading from Kafka needs a SourceFunction; FlinkKafkaConsumer011 provides one out of the box
    val consumer = new FlinkKafkaConsumer011[String](sourceTopic, new SimpleStringSchema, props)
    consumer
  }
}

About time windows

【1】When using a time window, the time field must be declared in advance. For processing time, it can simply be appended when creating the dynamic table, as with ps.proctime below.

val ecommerceLogTable: Table = tableEnv
    .fromDataStream( ecommerceLogWithEtDstream,
        'mid, 'uid, 'appid, 'area, 'os, 'ps.proctime )

【2】For event time, the time field must likewise be declared when creating the dynamic table, as with ts.rowtime below.

val ecommerceLogTable: Table = tableEnv
    .fromDataStream( ecommerceLogWithEtDstream,
        'mid,'uid,'appid,'area,'os,'ts.rowtime)

【3】A tumbling window can be expressed with Tumble over 10000.millis on 'ts:

val table: Table = ecommerceLogTable.filter("ch = 'appstore'")
    .window(Tumble over 10000.millis on 'ts as 'tt)
    .groupBy('ch,'tt)
    .select("ch,ch.count")

How to query a table

Operations such as groupBy return intermediate types like GroupedTable, which only allow a limited set of follow-up calls; these constraints exist so that only valid call chains can be written, helping you use the API correctly.
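For example, groupBy does not return a Table but a GroupedTable, which only exposes select. A sketch, assuming a registered Orders table with columns a and b and a table environment tableEnv:

Table orders = tableEnv.scan("Orders");
GroupedTable grouped = orders.groupBy("a");           // not a Table yet: only select(...) is allowed here
Table result = grouped.select("a, b.sum as total");   // select turns it back into a Table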

Table API operation classification

1. Operations aligned with SQL, such as select, as, filter, etc.;
2. Operations that improve the ease of use of the Table API:
- Columns Operation (ease of use): suppose a table has 100 columns and we need to drop a single column. Instead of listing the 99 columns to keep in a select, the column operators below let us remove the unneeded columns directly with dropColumns. They are all operators on Table.

Operators and examples:
- AddColumns: Table orders = tableEnv.scan("Orders"); Table result = orders.addColumns("concat(c, 'sunny') as desc"); adds new columns; the new column names must not already exist.
- AddOrReplaceColumns: Table orders = tableEnv.scan("Orders"); Table result = orders.addOrReplaceColumns("concat(c, 'sunny') as desc"); adds columns, overwriting an existing column with the same name.
- DropColumns: Table orders = tableEnv.scan("Orders"); Table result = orders.dropColumns("b, c"); drops the listed columns.
- RenameColumns: Table orders = tableEnv.scan("Orders"); Table result = orders.renameColumns("b as b2, c as c2"); renames columns.
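
A combined sketch of these operators (they appear in the Flink 1.9 Table API; assumes an Orders table with columns a, b, c and a table environment tableEnv):

Table orders = tableEnv.scan("Orders");
Table result = orders
    .addColumns("concat(c, 'sunny') as c2")   // add a derived column
    .dropColumns("b")                         // drop an unneeded column
    .renameColumns("a as id");                // rename a column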

- Columns Function (ease of use): suppose a table has many columns and we need to select, say, columns 20 to 80. How do we express that concisely? Column functions behave like functions and can be used wherever columns are selected, e.g. table.select(withColumns('a, 1 to 10)), in groupBy, and so on.

Syntax and description:
- withColumns(...): select the specified columns
- withoutColumns(...): select all columns except the specified ones

Column operation syntax: as follows, where each higher-level rule is built from the lower-level ones.

columnOperation:
    withColumns(columnExprs) / withoutColumns(columnExprs)  # accepts multiple columnExpr arguments
columnExprs:
    columnExpr [, columnExpr]*  # falls into the three cases below
columnExpr:
    columnRef | columnIndex to columnIndex | columnName to columnName  # 1. column reference  2. index range  3. name range
columnRef:
    columnName(a field name that exists in the table) | columnIndex(a positive integer starting at 1)

Example: withColumns(a, b, 2 to 10, w to z)

Row-based operations / Map operation (ease of use):

// Method signature: takes a scalarFunction argument and returns a Table
def map(scalarFunction: Expression): Table

class MyMap extends ScalarFunction {

    var param: String = ""

    // the eval method receives the inputs
    def eval([user defined inputs]): Row = {
        val result = new Row(3)
        // Business processing based on the data and parameters, returning the final result
        result
    }

    // Declare the type of the result, e.g. here a Row with three columns
    override def getResultType(signature: Array[Class[_]]):
    TypeInformation[_] = {
        Types.ROW(Types.STRING, Types.INT, Types.LONG)
    }
}

// Use fun('e) to produce a Row, name its fields a, b, c, then select columns a and c
val res = tab
    .map(fun('e)).as('a, 'b, 'c)
    .select('a, 'c)

// Benefit: when there are many columns and each of them needs a result from a UDF
table.select(udf1(), udf2(), udf3())
VS
table.map(udf())
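
A concrete version of the MyMap sketch above, written as a plain ScalarFunction in Java (hypothetical ParseLine function; it assumes the usual imports: org.apache.flink.table.functions.ScalarFunction, org.apache.flink.api.common.typeinfo.Types and TypeInformation, org.apache.flink.types.Row). Registered this way it can also be used from select or SQL in Flink 1.7:

public class ParseLine extends ScalarFunction {
    // split "name,count,timestamp" into three typed fields
    public Row eval(String line) {
        String[] parts = line.split(",");
        Row row = new Row(3);
        row.setField(0, parts[0]);
        row.setField(1, Integer.valueOf(parts[1]));
        row.setField(2, Long.valueOf(parts[2]));
        return row;
    }

    // declare the result type: a Row with three columns
    @Override
    public TypeInformation<?> getResultType(Class<?>[] signature) {
        return Types.ROW(Types.STRING, Types.INT, Types.LONG);
    }
}

// tEnv.registerFunction("parseLine", new ParseLine());
// Table parsed = tEnv.scan("fileSource").select("parseLine(word)");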

Map is a one-input, one-output ease-of-use operation.
FlatMap operation:

// Method signature: takes a tableFunction and returns a Table
def flatMap(tableFunction: Expression): Table

// Example TableFunction implementation: it returns a User, a POJO type that Flink can automatically recognize
case class User(name: String, age: Int)
class MyFlatMap extends TableFunction[User] {

    def eval([user defined inputs]): Unit = {
        for (..) {
            collect(User(name, age))
        }
    }
}

// Usage
val res = tab
    .flatMap(fun('e, 'f)).as('name, 'age)
    .select('name, 'age)

// Benefit
table.joinLateral(udtf) VS table.flatMap(udtf())
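
A concrete table function in the same spirit, written in Java against the Flink 1.7 API (hypothetical Split function; assumes org.apache.flink.table.functions.TableFunction, org.apache.flink.types.Row and org.apache.flink.api.common.typeinfo.Types). Registered functions like this are what joinLateral / LATERAL TABLE consume:

public class Split extends TableFunction<Row> {
    // one input line, many output rows: (word, length)
    public void eval(String line) {
        for (String word : line.split(" ")) {
            Row row = new Row(2);
            row.setField(0, word);
            row.setField(1, word.length());
            collect(row);
        }
    }

    @Override
    public TypeInformation<Row> getResultType() {
        return Types.ROW(Types.STRING, Types.INT);
    }
}

// tEnv.registerFunction("split", new Split());
// SQL usage: SELECT word, len FROM MyTable, LATERAL TABLE(split(line)) AS T(word, len)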

FlatMap turns one input row into multiple output rows:
FlatAggregate operation

// Method signature: takes a tableAggregateFunction, very similar to an AggregateFunction
def flatAggregate(tableAggregateFunction: Expression): FlatAggregateTable
class FlatAggregateTable(table: Table, groupKey: Seq[Expression], tableAggFun: Expression)

class TopNAcc {
    var data: MapView[JInt, JLong] = _ // (rank -> value)
    ...
}
class TopN(n: Int) extends TableAggregateFunction[(Int, Long), TopNAcc] {

    def accumulate(acc: TopNAcc, [user defined inputs]) {
        ...
    }
    // can take multiple columns and emit multiple output rows
    def emitValue(acc: TopNAcc, out: Collector[(Int, Long)]): Unit = {
        ...
    }
    ...retract/merge
}

// Usage
val res = tab
    .groupBy('a)
    .flatAggregate(
        flatAggFunc('e, 'f) as ('a, 'b, 'c))
    .select('a, 'c)

// Benefit: adds a kind of aggregation that can output multiple rows
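
A concrete Top2 sketch of the TopN pattern above, in Java (TableAggregateFunction and flatAggregate were introduced in Flink 1.9, so this assumes 1.9+; uses org.apache.flink.table.functions.TableAggregateFunction, org.apache.flink.api.java.tuple.Tuple2 and org.apache.flink.util.Collector):

public static class Top2Acc {
    public Integer first = Integer.MIN_VALUE;
    public Integer second = Integer.MIN_VALUE;
}

public static class Top2 extends TableAggregateFunction<Tuple2<Integer, Integer>, Top2Acc> {

    @Override
    public Top2Acc createAccumulator() {
        return new Top2Acc();
    }

    // keep the two largest values seen so far
    public void accumulate(Top2Acc acc, Integer value) {
        if (value > acc.first) {
            acc.second = acc.first;
            acc.first = value;
        } else if (value > acc.second) {
            acc.second = value;
        }
    }

    // emit (value, rank) pairs when the result is materialized
    public void emitValue(Top2Acc acc, Collector<Tuple2<Integer, Integer>> out) {
        if (acc.first != Integer.MIN_VALUE) out.collect(Tuple2.of(acc.first, 1));
        if (acc.second != Integer.MIN_VALUE) out.collect(Tuple2.of(acc.second, 2));
    }
}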

FlatAggregate takes multiple input rows and emits multiple output rows:
Aggregate vs FlatAggregate: compare the two using Max (an aggregate) and Top2 (a table aggregate). Suppose the input table has three columns (ID, NAME, PRICE), and we want both the maximum PRICE and the top-2 PRICEs. Max (the blue path) first creates an accumulator and then calls accumulate on it for each row: when 6 arrives the accumulator holds 6; when 3 arrives it is not larger than 6, so the accumulator stays at 6; and so on. The final result is 8. Top2 (the red path) also first creates an accumulator and then calls accumulate for each row: when 6 arrives it holds 6; when 3 arrives there is room for two elements, so 3 is kept as well; when 5 arrives it is compared with the smaller of the two and 3 is evicted; and so on. The final result is 8 and 6. To summarize: Aggregate (Max) emits a single value per group, while FlatAggregate (Top2) can emit multiple rows per group.



Origin blog.csdn.net/zhengzhaoyang122/article/details/135119090