The Flink API is divided into four layers. These notes mainly cover the use of the Table API.
The Table API is a relational API shared by stream processing and batch processing: a Table API query can run on streaming input or batch input without any modification. The Table API is a superset of the SQL language and is designed specifically for Apache Flink. It is a language-integrated API for Scala and Java: instead of specifying queries as strings, as is common with SQL, Table API queries are defined in a language-embedded style in Java or Scala, with IDE support such as autocompletion and syntax validation. The dependency to add to the pom is as follows:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table_2.12</artifactId>
<version>1.7.2</version>
</dependency>
Table API & SQL
Table API WordCount example:
tab.groupBy("word").select("word,count(1) as count")
SQL WordCount example:
SELECT word,COUNT(*) AS cnt FROM MyTable GROUP BY word
[1] Declarative: users only care about what to do, not how to do it;
[2] High performance: queries are optimized and can achieve better execution performance, because there is an optimizer underneath, just as there is for SQL;
[3] Unified stream and batch: the same statistical logic can run in streaming mode or in batch mode;
[4] Stable standard: the semantics follow the SQL standard and do not change easily, so API compatibility is not a concern during upgrades or other underlying changes;
[5] Easy to understand: clear semantics, what you see is what you get.
Table API features
The Table API makes multi-statement data processing easier to write.
// For example, filter the rows with a < 10 and insert them into table xxx
table.filter("a < 10").insertInto("xxx");
// Filter the rows with a > 10 and insert them into table yyy
table.filter("a > 10").insertInto("yyy");
The Table API is one of Flink's own APIs, which makes it easier to extend standard SQL (if and only if that is needed).
[Figure: relationship between the Table API and SQL]
Table API Programming
WordCount programming example:
package org.apache.flink.table.api.example.stream;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.table.descriptors.FileSystem;
import org.apache.flink.table.descriptors.OldCsv;
import org.apache.flink.table.descriptors.Schema;
import org.apache.flink.types.Row;

public class JavaStreamWordCount {
    public static void main(String[] args) throws Exception {
        // Get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);
        // Path of the input file
        String path = JavaStreamWordCount.class.getClassLoader().getResource("words.txt").getPath();
        // Specify the file format, the delimiter and the schema; there is a single column of type String
        tEnv.connect(new FileSystem().path(path))
            .withFormat(new OldCsv().field("word", Types.STRING).lineDelimiter("\n"))
            .withSchema(new Schema().field("word", Types.STRING))
            .inAppendMode()
            .registerTableSource("fileSource"); // register the source in the table environment
        // Obtain the table through scan, then apply table operations to it
        Table result = tEnv.scan("fileSource")
            .groupBy("word")
            .select("word, count(1) as count");
        // Print the table
        tEnv.toRetractStream(result, Row.class).print();
        // Execute
        env.execute();
    }
}
How to define a Table
Table myTable = tableEnvironment.scan("myTable")
A Table is always obtained from the environment through scan, and myTable here is a table that we registered beforehand. The question is then which ways there are to register a Table.
【1】Table descriptor: as in the WordCount above, specify a connector such as a file system (fs) or Kafka, together with a format and a Schema.
tEnv.connect(new FileSystem().path(path))
    .withFormat(new OldCsv().field("word", Types.STRING).lineDelimiter("\n"))
    .withSchema(new Schema().field("word", Types.STRING))
    .inAppendMode()
    .registerTableSource("fileSource"); // register the source in the table environment
【2】Custom TableSource: create your own TableSource and then register it.
TableSource csvSource = new CsvTableSource(path, new String[]{"word"}, new TypeInformation[]{Types.STRING});
tEnv.registerTableSource("sourceTable2", csvSource);
【3】Register a DataStream: for example, the DataStream of type String below is registered as myTable3, and its schema has a single column named word.
DataStream<String> stream = ...
// register the DataStream as table "myTable3" with field "word"
tableEnv.registerDataStream("myTable3", stream, "word");
Dynamic table
If the data type in the stream is a case class, the table can be generated directly from the structure of the case class:
tableEnv.fromDataStream(ecommerceLogDstream)
Alternatively, name the fields individually, in field order, prefixing each field name with a single quote:
tableEnv.fromDataStream(ecommerceLogDstream,'mid,'uid ......)
The final dynamic table can be converted to a stream for output. toAppendStream only works for simple inserts (append-only results); otherwise use toRetractStream.
table.toAppendStream[(String,String)]
How to output a table
Once we have a result table (of type Table), we call insertInto with the target table: resultTable.insertInto("TargetTable");
【1】Table descriptor: similar to registering a source, except that a sink is registered for the output. For example, the following writes to targetTable; the main difference is the last call.
tEnv.connect(new FileSystem().path(path))
    .withFormat(new OldCsv().field("word", Types.STRING).lineDelimiter("\n"))
    .withSchema(new Schema().field("word", Types.STRING))
    .registerTableSink("targetTable");
【2】Custom TableSink: create your own sink and register it as sinkTable2.
TableSink csvSink = new CsvTableSink(path, new String[]{"word"}, new TypeInformation[]{Types.STRING});
tEnv.registerTableSink("sinkTable2", csvSink);
【3】Output a DataStream: for example, the code below generates a RetractStream whose elements are Tuple2<Boolean, Row>. The Boolean tells whether the row is an add record or a delete record. If groupBy is used, the table can only be converted to a stream with toRetractStream. In the resulting stream, a first field of true marks the latest data (Insert), and false marks expired old data (Delete). If an API with a time window is used, the window field must appear in groupBy.
// emit the result table to a DataStream
DataStream<Tuple2<Boolean, Row>> stream = tableEnv.toRetractStream(resultTable, Row.class);
// keep only the insert records and print them
stream.filter(t -> t.f0).print();
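The add/delete flag of a retract stream is easier to see on a tiny example. The following is a plain-Java simulation, not Flink code: the class RetractSketch, its method and the input words are all hypothetical and only illustrate the semantics described above, under the assumption that an updated grouped count first retracts the old row and then emits the new one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of retract-stream semantics for a grouped count.
// This is NOT Flink code; RetractSketch and its method are hypothetical names.
class RetractSketch {

    // Produces "(flag,word,count)" messages: an update first retracts the
    // previous result (false) and then adds the new result (true).
    static List<String> countWithRetractions(List<String> words) {
        Map<String, Long> counts = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (String word : words) {
            Long old = counts.get(word);
            if (old != null) {
                out.add("(false," + word + "," + old + ")"); // delete the stale row
            }
            long updated = (old == null ? 0L : old) + 1;
            counts.put(word, updated);
            out.add("(true," + word + "," + updated + ")"); // insert the latest row
        }
        return out;
    }

    public static void main(String[] args) {
        // Input words are made up for illustration.
        for (String message : countWithRetractions(List.of("flink", "table", "flink"))) {
            System.out.println(message);
        }
    }
}
```

Filtering on the flag, as the snippet above does, keeps only the rows marked true and drops the retractions.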
Example code:
package com.zzx.flink

import java.util.Properties
import com.alibaba.fastjson.JSON
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011
import org.apache.flink.table.api.scala._
import org.apache.flink.table.api.{Table, TableEnvironment}

object FlinkTableAndSql {
  def main(args: Array[String]): Unit = {
    // Execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // Set the time characteristic to EventTime
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    // Read the data; MyKafkaConsumer is a custom Kafka utility class that takes the topic
    val dstream: DataStream[String] = env.addSource(MyKafkaConsumer.getConsumer("FLINKTABLE&SQL"))
    // Convert each string into an object
    val ecommerceLogDstream: DataStream[SensorReding] = dstream.map {
      /* Requires the following dependency:
      <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.36</version>
      </dependency> */
      // Convert the String into a SensorReding
      jsonString => JSON.parseObject(jsonString, classOf[SensorReding])
    }
    // Tell Flink how to extract the watermark and the event time
    val ecommerceLogWithEventTimeDStream: DataStream[SensorReding] = ecommerceLogDstream.assignTimestampsAndWatermarks(
      new BoundedOutOfOrdernessTimestampExtractor[SensorReding](Time.seconds(0)) {
        override def extractTimestamp(t: SensorReding): Long = {
          t.timestamp
        }
      })
    // Set the parallelism
    ecommerceLogDstream.setParallelism(1)
    // Create the Table execution environment
    val tableEnv: StreamTableEnvironment = TableEnvironment.getTableEnvironment(env)
    // Create the dynamic table from the stream (fromDataStream, not fromTableSource)
    var ecommerceTable: Table = tableEnv.fromDataStream(ecommerceLogWithEventTimeDStream, 'mid, 'uid, 'ch, 'ts.rowtime)
    // Query with the Table API:
    // count the records per channel every 10 seconds,
    // grouping by a tumbling window whose boundaries are determined by event time
    val resultTable: Table = ecommerceTable.window(Tumble over 10000.millis on 'ts as 'tt).groupBy('ch, 'tt).select('ch, 'ch.count)
    // Name of the registered table to query (placeholder)
    var tableName: String = "xxx"
    // Query with SQL (note the space before "group by" and the closing parenthesis)
    val resultSQLTable: Table = tableEnv.sqlQuery("select ch, count(ch) from " + tableName + " group by ch, TUMBLE(ts, INTERVAL '10' SECOND)")
    // Convert the Table into a stream for output
    //val appstoreDStream: DataStream[(String,String,Long)] = appstoreTable.toAppendStream[(String,String,Long)]
    val resultDStream: DataStream[(Boolean, (String, Long))] = resultSQLTable.toRetractStream[(String, Long)]
    // Keep only the insert records and print them
    resultDStream.filter(_._1).print()
    env.execute()
  }
}
object MyKafkaConsumer {
  def getConsumer(sourceTopic: String): FlinkKafkaConsumer011[String] = {
    val bootstrapServers = "hadoop1:9092"
    // Configuration required by the Kafka consumer
    val props = new Properties
    // Address of the Kafka service; not every broker needs to be listed
    props.put("bootstrap.servers", bootstrapServers)
    // Consumer group
    props.put("group.id", "test")
    // Whether offsets are committed automatically
    props.put("enable.auto.commit", "true")
    // Interval between automatic offset commits
    props.put("auto.commit.interval.ms", "1000")
    // Deserializer class for keys
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    // Deserializer class for values
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    // Reading from Kafka requires a SourceFunction implementation; Flink provides one for us
    val consumer = new FlinkKafkaConsumer011[String](sourceTopic, new SimpleStringSchema, props)
    consumer
  }
}
About time windows
【1】When using a time window, the time field must be declared in advance. For processing time, the field can be appended directly when the dynamic table is created, as with 'ps.proctime below.
val ecommerceLogTable: Table = tableEnv
  .fromDataStream(ecommerceLogWithEtDstream,
    'mid, 'uid, 'appid, 'area, 'os, 'ps.proctime)
【2】For event time, the field must be declared when the dynamic table is created, as with 'ts.rowtime below.
val ecommerceLogTable: Table = tableEnv
.fromDataStream( ecommerceLogWithEtDstream,
'mid,'uid,'appid,'area,'os,'ts.rowtime)
【3】A tumbling window can be expressed with Tumble over 10000.millis on:
val table: Table = ecommerceLogTable.filter("ch = 'appstore'")
.window(Tumble over 10000.millis on 'ts as 'tt)
.groupBy('ch,'tt)
.select("ch,ch.count")
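To make the effect of a 10000 ms tumbling window concrete, here is a plain-Java sketch (not Flink code) of how an event timestamp maps to a fixed, non-overlapping window and how a per-channel count per window is formed. The class TumbleSketch, its methods and the sample events are hypothetical; the sketch assumes non-negative timestamps and windows aligned to the epoch.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of tumbling-window assignment; NOT Flink code.
// TumbleSketch and its methods are hypothetical names.
class TumbleSketch {

    // Start of the tumbling window that contains the timestamp
    // (assumes non-negative timestamps and epoch-aligned windows).
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - (timestampMillis % sizeMillis);
    }

    // Counts events per "channel@windowStart" key, which mirrors
    // grouping by the channel and by the window.
    static Map<String, Long> countPerChannelAndWindow(String[] channels, long[] timestamps, long sizeMillis) {
        Map<String, Long> counts = new TreeMap<>();
        for (int i = 0; i < channels.length; i++) {
            String key = channels[i] + "@" + windowStart(timestamps[i], sizeMillis);
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample events are made up for illustration.
        String[] channels = {"appstore", "appstore", "web", "appstore"};
        long[] timestamps = {1000L, 9999L, 12000L, 10000L};
        System.out.println(countPerChannelAndWindow(channels, timestamps, 10000L));
    }
}
```

An event at 9999 ms falls into the window starting at 0, while an event at exactly 10000 ms opens the next window, which is why the window must be part of the grouping key.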
How to query a table
Intermediate types such as GroupedTable exist to add constraints, so that only correct API calls can be written.
Table API operation classification
1. Operations aligned with SQL: select, as, filter, etc.;
2. Operations that improve the ease of use of the Table API:
- Columns Operation
Ease of use: suppose there is a table with 100 columns and we need to remove one of them. Which operation is required? The third API below (dropColumns) does it for you: we take all the columns of the table and then remove the unnecessary ones with dropColumns. These are mainly operators on Table.
Operator | Example |
---|---|
addColumns | Table orders = tableEnv.scan("Orders"); Table result = orders.addColumns("concat(c, 'sunny') as desc"); Adds a new column; the new column name must not duplicate an existing one. |
addOrReplaceColumns | Table orders = tableEnv.scan("Orders"); Table result = orders.addOrReplaceColumns("concat(c, 'sunny') as desc"); Adds a column, overwriting it if it already exists. |
dropColumns | Table orders = tableEnv.scan("Orders"); Table result = orders.dropColumns("b, c"); Drops columns. |
renameColumns | Table orders = tableEnv.scan("Orders"); Table result = orders.renameColumns("b as b2, c as c2"); Renames columns. |
- Columns Function
Ease of use: suppose there is a table and we need to get columns 20 to 80. How do we get them? A column function behaves like a function and can be used anywhere columns are selected, such as Table.select(withColumns(a, 1 to 10)), groupBy, etc.
Syntax | Description |
---|---|
withColumns(…) | Selects the columns you specify |
withoutColumns(…) | Selects all columns except the ones you specify |
Column operation syntax (proposed): in the grammar below, each level contains the level beneath it.
columnOperation:
    withColumns(columnExprs) / withoutColumns(columnExprs)   # accepts multiple columnExpr arguments
columnExprs:
    columnExpr [, columnExpr]*   # each columnExpr takes one of the three forms below
columnExpr:
    columnRef | columnIndex to columnIndex | columnName to columnName   # 1. column reference 2. index range 3. name range
columnRef:
    columnName (a field name that exists in the table) | columnIndex (a positive integer starting at 1)
Example: withColumns(a, b, 2 to 10, w to z)
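The grammar above can be illustrated with a small plain-Java sketch that expands column references, index ranges and name ranges against a list of column names. This is not how Flink implements column functions; ColumnRangeSketch, its methods and the sample column names are hypothetical, and the sketch assumes that duplicates are dropped and that no column has a purely numeric name.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration of the column-expression grammar; NOT Flink code.
// ColumnRangeSketch and its methods are hypothetical names.
class ColumnRangeSketch {

    // Resolves a columnRef (a name or a 1-based index) to a 0-based position.
    static int position(List<String> columns, String ref) {
        if (ref.matches("\\d+")) {
            return Integer.parseInt(ref) - 1;
        }
        int index = columns.indexOf(ref);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown column: " + ref);
        }
        return index;
    }

    // Expands expressions such as "a", "2 to 4" or "w to z" into column names,
    // keeping first-seen order and dropping duplicates (an assumption of this sketch).
    static List<String> withColumns(List<String> columns, String... exprs) {
        Set<String> selected = new LinkedHashSet<>();
        for (String expr : exprs) {
            String[] parts = expr.trim().split("\\s+to\\s+");
            int from = position(columns, parts[0]);
            int to = position(columns, parts[parts.length - 1]);
            for (int i = from; i <= to; i++) {
                selected.add(columns.get(i));
            }
        }
        return new ArrayList<>(selected);
    }

    // Returns every column that withColumns would NOT select.
    static List<String> withoutColumns(List<String> columns, String... exprs) {
        List<String> rest = new ArrayList<>(columns);
        rest.removeAll(withColumns(columns, exprs));
        return rest;
    }

    public static void main(String[] args) {
        List<String> columns = List.of("a", "b", "c", "d", "e"); // sample schema
        System.out.println(withColumns(columns, "a", "3 to 4"));
        System.out.println(withoutColumns(columns, "b to d"));
    }
}
```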
- Row-based operation
Map operation ease of use:
// Method signature: takes a scalarFunction argument and returns a Table
def map(scalarFunction: Expression): Table

class MyMap extends ScalarFunction {
  var param: String = ""
  // The eval method receives the inputs
  def eval([user defined inputs]): Row = {
    val result = new Row(3)
    // Business processing based on the data and parameters; returns the final result
    result
  }
  // Specify the result type; here it is a Row with three columns
  override def getResultType(signature: Array[Class[_]]): TypeInformation[_] = {
    Types.ROW(Types.STRING, Types.INT, Types.LONG)
  }
}

// Use fun('e) to get a Row, name its fields a, b, c, then select columns a and c
val res = tab
  .map(fun('e)).as('a, 'b, 'c)
  .select('a, 'c)

// Benefit: useful when there are many columns and each of them needs to return a result
table.select(udf1(), udf2(), udf3()….)
VS
table.map(udf())
Map takes one input row and produces one output row.
FlatMap operation:
// Method signature: takes a tableFunction
def flatMap(tableFunction: Expression): Table

// Example tableFunction implementation; it returns a User, a POJO type that Flink can recognize automatically
case class User(name: String, age: Int)
class MyFlatMap extends TableFunction[User] {
  def eval([user defined inputs]): Unit = {
    for(..){
      collect(User(name, age))
    }
  }
}

// Usage
val res = tab
  .flatMap(fun('e,'f)).as('name, 'age)
  .select('name, 'age)

// Benefit
table.joinLateral(udtf) VS table.flatMap(udtf())
FlatMap takes one input row and produces multiple output rows.
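The one-to-one versus one-to-many distinction between Map and FlatMap is the same one found in ordinary collection APIs. The sketch below uses java.util.stream rather than the Table API, so it only illustrates the row cardinality; MapVsFlatMapSketch, its methods and the sample rows are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrates row cardinality with Java streams; NOT Flink Table API code.
// MapVsFlatMapSketch and its methods are hypothetical names.
class MapVsFlatMapSketch {

    // map: exactly one output per input row (here, the length of the row).
    static List<Integer> mapLengths(List<String> rows) {
        return rows.stream().map(String::length).collect(Collectors.toList());
    }

    // flatMap: zero or more outputs per input row (here, one output per word).
    static List<String> flatMapWords(List<String> rows) {
        return rows.stream()
                .flatMap(row -> Arrays.stream(row.split(" ")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> rows = List.of("hello flink", "table api"); // sample rows
        System.out.println(mapLengths(rows));   // two inputs, two outputs
        System.out.println(flatMapWords(rows)); // two inputs, four outputs
    }
}
```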
FlatAggregate operation
// Method signature: takes a tableAggregateFunction, which is very similar to an AggregateFunction
def flatAggregate(tableAggregateFunction: Expression): FlatAggregateTable
class FlatAggregateTable(table: Table, groupKey: Seq[Expression], tableAggFun: Expression)

class TopNAcc {
  var data: MapView[JInt, JLong] = _ // (rank -> value)
  ...
}

class TopN(n: Int) extends TableAggregateFunction[(Int, Long), TopNAcc] {
  def accumulate(acc: TopNAcc, [user defined inputs]) {
    ...
  }
  // Can emit multiple columns and multiple rows
  def emitValue(acc: TopNAcc, out: Collector[(Int, Long)]): Unit = {
    ...
  }
  ...retract/merge
}

// Usage
val res = tab
  .groupBy('a)
  .flatAggregate(
    flatAggFunc('e,'f) as ('a, 'b, 'c))
  .select('a, 'c)

// Benefit: adds a new kind of aggregation that can output multiple rows
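The contrast between an ordinary aggregate, which emits a single value, and a table aggregate, which can emit several rows, can be sketched in plain Java with a Max accumulator and a Top2 accumulator. This is not Flink's AggregateFunction or TableAggregateFunction API; MaxVsTop2Sketch, its accumulator classes and the sample prices are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java accumulators contrasting Max (one output) with Top2 (several outputs).
// NOT Flink code; all names here are hypothetical.
class MaxVsTop2Sketch {

    static class MaxAcc {
        long max = Long.MIN_VALUE;
    }

    static class Top2Acc {
        long first = Long.MIN_VALUE;
        long second = Long.MIN_VALUE;
    }

    // Max keeps only the largest value seen so far.
    static void accumulate(MaxAcc acc, long price) {
        if (price > acc.max) {
            acc.max = price;
        }
    }

    // Top2 keeps the two largest values and evicts the smaller one when needed.
    static void accumulate(Top2Acc acc, long price) {
        if (price > acc.first) {
            acc.second = acc.first;
            acc.first = price;
        } else if (price > acc.second) {
            acc.second = price;
        }
    }

    // Aggregate: many input rows, exactly one output value.
    static long max(long[] prices) {
        MaxAcc acc = new MaxAcc();
        for (long price : prices) {
            accumulate(acc, price);
        }
        return acc.max;
    }

    // Table aggregate: many input rows, up to two output rows.
    static List<Long> top2(long[] prices) {
        Top2Acc acc = new Top2Acc();
        for (long price : prices) {
            accumulate(acc, price);
        }
        List<Long> out = new ArrayList<>();
        if (acc.first != Long.MIN_VALUE) out.add(acc.first);
        if (acc.second != Long.MIN_VALUE) out.add(acc.second);
        return out;
    }

    public static void main(String[] args) {
        long[] prices = {6, 3, 5, 8}; // sample prices, made up for illustration
        System.out.println(max(prices));
        System.out.println(top2(prices));
    }
}
```

With the sample prices 6, 3, 5 and 8, the Max accumulator ends at 8, while the Top2 accumulator ends holding 8 and 6.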
FlatAggregate takes multiple input rows and produces multiple output rows.
The difference between Aggregate and FlatAggregate: compare Aggregate and FlatAggregate by using Max and Top2. There is an input table with three columns (ID, NAME, PRICE), and we compute the maximum Price and the Top2 Price. The Max operation is the blue line in the figure: first create the accumulator, then call accumulate on it. For example, when 6 arrives the maximum is 6, when 3 arrives the maximum is still 6, and so on. We end up with a result of 8. The TOP2 operation is the red line: first create the accumulator, then call accumulate on it. For example, when 6 arrives it is stored; when 3 arrives it is stored as well, because the accumulator holds two elements; when 5 arrives it is compared with the smallest element and 3 is evicted, and so on. We end up with 8 and 6. Summary: