Detailed analysis of Flink SQL: a must-know for big data

Contents

1. Flink SQL common operators

2. Flink SQL practical case


Flink SQL is a development language conforming to standard SQL semantics, designed by the Flink real-time computing team to simplify the computing model and lower the barrier to using real-time computation. Starting in 2015, Alibaba began investigating open source stream computing engines and eventually decided to build its new generation of computing engine on top of Flink, optimizing and improving Flink's shortcomings; the final code was open-sourced in early 2019 and is known as Blink. One of Blink's most significant contributions to the original Flink is the implementation of Flink SQL.

Flink SQL is a user-facing API layer. Traditional stream computing systems such as Storm and Spark Streaming provide Function or DataStream APIs, and users write business logic in Java or Scala. Although this approach is flexible, it has shortcomings: there is a certain learning threshold, tuning is difficult, and the APIs introduce many incompatibilities as versions are updated.

In this context, SQL has undoubtedly become our best choice. We chose SQL as the core API because it has several very important characteristics:

  • SQL is a declarative, set-based language: users only need to express their needs clearly, without specifying how to compute them;

  • SQL can be optimized: built-in query optimizers translate a SQL statement into an optimal execution plan;

  • SQL is easy to understand: people across different industries and fields know it, so the learning cost is low;

  • SQL is very stable: over the more than 30-year history of databases, SQL itself has changed little;

  • SQL unifies streams and batches: Flink's underlying runtime is itself a unified stream/batch engine, and SQL achieves that unification at the API layer.

1. Flink SQL common operators

SELECT

SELECT is used to select data from a DataSet/DataStream and to pick out (project) specific columns.

Example:

SELECT * FROM Table; -- get all columns of the table

SELECT name, age FROM Table; -- get the name and age columns of the table

Functions and aliases can also be used in SELECT statements, such as the WordCount we mentioned above:

SELECT word, COUNT(word) FROM table GROUP BY word;

WHERE

WHERE is used to filter data from a dataset/stream. Used together with SELECT, it splits a relation horizontally based on certain conditions, i.e. it selects the records that satisfy the conditions.

Example:

SELECT name, age FROM Table WHERE name LIKE '%小明%';

SELECT * FROM Table WHERE age = 20;

WHERE filters the original data. In the WHERE condition, Flink SQL supports combining comparison expressions such as =, <, >, <>, >=, <= with AND, OR and other operators; the rows that satisfy the filter condition are selected. WHERE can also be combined with IN and NOT IN. For example:

SELECT name, age
FROM Table
WHERE name IN (SELECT name FROM Table2)

DISTINCT

DISTINCT deduplicates the dataset/stream based on the columns in the SELECT result.

Example:

SELECT DISTINCT name FROM Table;

For streaming queries, the state required to compute the query result may grow without bound, so users need to control the scope of the query state to keep it from becoming too large.
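As a minimal sketch (assuming a Flink 1.x setup like the streaming example later in this article, where the StreamTableEnvironment exposes a StreamQueryConfig), the idle state retention of such continuous queries can be bounded roughly like this; the exact API may differ between versions:

import org.apache.flink.api.common.time.Time
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.TableEnvironment

// Sketch: bound how long idle state of continuous queries (e.g. SELECT DISTINCT)
// is kept before it may be cleaned up.
val env = StreamExecutionEnvironment.getExecutionEnvironment
val tableEnv = TableEnvironment.getTableEnvironment(env)
val queryConfig = tableEnv.queryConfig
// keep idle state for at least 12 hours, allow cleanup after at most 24 hours
queryConfig.withIdleStateRetentionTime(Time.hours(12), Time.hours(24))
// the config is then passed when emitting the query result, e.g.
// tableEnv.toRetractStream[Row](someTable, queryConfig)   // someTable is a placeholder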

GROUP BY

GROUP BY groups the data. For example, we may need to calculate each student's total score from a score table.

Example:

SELECT name, SUM(score) as TotalScore FROM Table GROUP BY name;

UNION and UNION ALL

UNION combines two result sets; the two result sets must have exactly the same fields, including field types and field order. Unlike UNION ALL, UNION deduplicates the resulting data.

Example:

SELECT * FROM T1 UNION SELECT * FROM T2;

SELECT * FROM T1 UNION ALL SELECT * FROM T2;

JOIN

JOIN is used to combine data from two tables to form a result table. The JOIN types supported by Flink include:

JOIN - INNER JOIN

LEFT JOIN - LEFT OUTER JOIN

RIGHT JOIN - RIGHT OUTER JOIN

FULL JOIN - FULL OUTER JOIN

The semantics of JOIN here are consistent with the JOIN semantics we use in relational databases.

Example:

JOIN (Associate the order table data with the product table)

SELECT * FROM Orders INNER JOIN Product ON Orders.productId = Product.id

The difference between LEFT JOIN and JOIN is that when the right table has no matching row for a row in the left table, the corresponding fields from the right side are filled with NULL in the output. RIGHT JOIN is equivalent to a LEFT JOIN with the left and right tables swapped. FULL JOIN is roughly equivalent to taking a LEFT JOIN and a RIGHT JOIN and combining their results with UNION ALL.

Example:

SELECT * FROM Orders LEFT JOIN Product ON Orders.productId = Product.id
SELECT * FROM Orders RIGHT JOIN Product ON Orders.productId = Product.id
SELECT * FROM Orders FULL OUTER JOIN Product ON Orders.productId = Product.id

Group Window

According to how window data is divided, Apache Flink currently supports the following three types of bounded window:

Tumble, tumbling window: the window has a fixed size and window data does not overlap;

Hop, sliding window: the window has a fixed size and a fixed slide interval at which new windows are created; window data can overlap;

Session, session window: the window has no fixed size; windows are split according to the activity of the data, and window data does not overlap.

Tumble Window

A Tumble (tumbling) window has a fixed size, and window data does not overlap.

The syntax of the tumbling window is as follows:

SELECT
    [gk],
    [TUMBLE_START(timeCol, size)],
    [TUMBLE_END(timeCol, size)],
    agg1(col1),
    ...
    aggn(colN)
FROM Tab1
GROUP BY [gk], TUMBLE(timeCol, size)

where:

[gk] is the optional grouping field, i.e. it determines whether to aggregate by a field;

TUMBLE_START represents the window start time;

TUMBLE_END represents the window end time;

timeCol is the time field in the stream table;

size indicates the size of the window, e.g. in seconds, minutes, hours or days.

For example, if we want to aggregate each user's orders per day (here summing the order amount), grouped by user:

SELECT user,
      TUMBLE_START(rowtime, INTERVAL '1' DAY) as wStart,
      SUM(amount)
FROM Orders
GROUP BY TUMBLE(rowtime, INTERVAL '1' DAY), user;

Hop Window

A Hop (sliding) window is similar to a tumbling window in that the window has a fixed size. Unlike a tumbling window, a sliding window controls how often new windows are created through the slide parameter, so when the slide value is smaller than the window size, multiple sliding windows overlap.

The corresponding syntax of Hop sliding window is as follows:

SELECT
    [gk],
    [HOP_START(timeCol, slide, size)],
    [HOP_END(timeCol, slide, size)],
    agg1(col1),
    ...
    aggN(colN)
FROM Tab1
GROUP BY [gk], HOP(timeCol, slide, size)

The meaning of each field is similar to the Tumble window:

[gk] is the optional grouping field, i.e. it determines whether to aggregate by a field;

HOP_START indicates the window start time;

HOP_END indicates the window end time;

timeCol represents the time field in the stream table;

slide indicates the slide interval of the window;

size indicates the size of the entire window, e.g. in seconds, minutes, hours or days.

As an example, we want to calculate, every hour, the sales of each product over the past 24 hours:

SELECT product,
      SUM(amount)
FROM Orders
GROUP BY HOP(rowtime, INTERVAL '1' HOUR, INTERVAL '1' DAY), product

Session Window

Session time windows do not have a fixed duration; their bounds are defined by an inactivity interval, i.e. a session window closes if no event occurs within the defined gap.

The corresponding syntax of the Session window is as follows:

SELECT
    [gk],
    SESSION_START(timeCol, gap) AS winStart,
    SESSION_END(timeCol, gap) AS winEnd,
    agg1(col1),
     ...
    aggn(colN)
FROM Tab1
GROUP BY [gk], SESSION(timeCol, gap)

[gk] is the optional grouping field, i.e. it determines whether to aggregate by a field;

SESSION_START represents the window start time;

SESSION_END indicates the window end time;

timeCol represents the time field in the stream table;

gap indicates the length of the inactivity period of the window data.

For example, we need to calculate each user's order volume per session, where a session ends after 12 hours of inactivity:

SELECT user,
      SESSION_START(rowtime, INTERVAL '12' HOUR) AS sStart,
      SESSION_ROWTIME(rowtime, INTERVAL '12' HOUR) AS sEnd,
      SUM(amount)
FROM Orders
GROUP BY SESSION(rowtime, INTERVAL '12' HOUR), user

The Table API and SQL are bundled in the flink-table Maven artifact. The following dependencies must be added to your project to use the Table API and SQL:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>

In addition, you need to add a dependency for Flink's Scala batch or streaming API. For batch queries you need to add:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-scala_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>
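
For streaming queries, the streaming Scala API is needed instead. As a sketch, assuming the same Scala 2.11 build as above, the dependency typically looks like this:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>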

2. Flink SQL practical case

1) Batch data SQL

Usage:

  1. Build the Table runtime environment

  2. Register the DataSet as a table

  3. Use the sqlQuery method of the Table runtime environment to execute SQL statements

Example: use Flink SQL to compute the total amount, maximum amount, minimum amount, and total number of each user's consumption orders.

order id | username | order date       | order amount
1        | Zhangsan | 2018-10-20 15:30 | 358.5

Test data (order ID, username, order date, order amount):

Order(1, "zhangsan", "2018-10-20 15:30", 358.5),
Order(2, "zhangsan", "2018-10-20 16:30", 131.5),
Order(3, "lisi", "2018-10-20 16:30", 127.5),
Order(4, "lisi", "2018-10-20 16:30", 328.5),
Order(5, "lisi", "2018-10-20 16:30", 432.5),
Order(6, "zhaoliu", "2018-10-20 22:30", 451.0),
Order(7, "zhaoliu", "2018-10-20 22:30", 362.0),
Order(8, "zhaoliu", "2018-10-20 22:30", 364.0),
Order(9, "zhaoliu", "2018-10-20 22:30", 341.0)

Steps:

  1. Get a batch runtime environment

  2. Get a Table runtime environment

  3. Create a case class Order to map the data (order ID, username, order date, order amount)

  4. Create a DataSet source based on the local Order collection

  5. Register the DataSet as a table using the Table runtime environment

  6. Use SQL statements to manipulate data (to count the total amount, maximum amount, minimum amount, and total number of orders of user consumption orders)

  7. Convert Table to DataSet using TableEnv.toDataSet

  8. Print the result for testing

Sample code:

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.table.api.{Table, TableEnvironment}
import org.apache.flink.table.api.scala.BatchTableEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.types.Row
/**
 * Use Flink SQL to compute the total amount, maximum amount, minimum amount and
 * total count of each user's consumption orders.
 */
object BatchFlinkSqlDemo {
  //3. Create a case class Order to map the data (order id, user name, order date, order amount)
  case class Order(id:Int, userName:String, createTime:String, money:Double)
  def main(args: Array[String]): Unit = {
    /**
     * Implementation steps:
     * 1. Get a batch execution environment
     * 2. Get a Table environment
     * 3. Create a case class Order to map the data (order id, user name, order date, order amount)
     * 4. Create a DataSet source from a local Order collection
     * 5. Register the DataSet as a table in the Table environment
     * 6. Use a SQL statement to query the data (total amount, maximum amount, minimum amount
     *    and total count of each user's orders)
     * 7. Convert the Table to a DataSet with TableEnv.toDataSet
     * 8. Print the result
     */
    //1. Get a batch execution environment
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2. Get a Table environment
    val tabEnv: BatchTableEnvironment = TableEnvironment.getTableEnvironment(env)
    //4. Create a DataSet source from a local Order collection
    val orderDataSet: DataSet[Order] = env.fromElements(
      Order(1, "zhangsan", "2018-10-20 15:30", 358.5),
      Order(2, "zhangsan", "2018-10-20 16:30", 131.5),
      Order(3, "lisi", "2018-10-20 16:30", 127.5),
      Order(4, "lisi", "2018-10-20 16:30", 328.5),
      Order(5, "lisi", "2018-10-20 16:30", 432.5),
      Order(6, "zhaoliu", "2018-10-20 22:30", 451.0),
      Order(7, "zhaoliu", "2018-10-20 22:30", 362.0),
      Order(8, "zhaoliu", "2018-10-20 22:30", 364.0),
      Order(9, "zhaoliu", "2018-10-20 22:30", 341.0)
    )

    //5. Register the DataSet as a table in the Table environment
    tabEnv.registerDataSet("t_order", orderDataSet)

    //6. Use a SQL statement to query the data
    //   (total amount, maximum amount, minimum amount and total count of each user's orders)
    val sql =
      """
        | select
        |   userName,
        |   sum(money) totalMoney,
        |   max(money) maxMoney,
        |   min(money) minMoney,
        |   count(1) totalCount
        |  from t_order
        |  group by userName
        |""".stripMargin  // in Scala, stripMargin uses "|" as the default margin character

    //7. Convert the Table to a DataSet with TableEnv.toDataSet
    val table: Table = tabEnv.sqlQuery(sql)
    table.printSchema()
    tabEnv.toDataSet[Row](table).print()
  }
}

2) Streaming data SQL

SQL is also supported in stream processing, but the following points need to be noted:

  1. To use streaming SQL, you must first assign timestamps and watermarks

  2. When registering the table with registerDataStream, use ' to specify the fields

  3. When registering, a rowtime field must be specified, otherwise windows cannot be used in the SQL

  4. The implicits of import org.apache.flink.table.api.scala._ must be imported

  5. Use tumble(time column name, interval 'time' second) to define the window in SQL

Example: use Flink SQL to count, over 5-second windows, each user's total number of orders, maximum order amount, and minimum order amount.

Steps:

  1. Get the stream processing runtime environment

  2. Get Table Runtime Environment

  3. Set the stream time characteristic to EventTime

  4. Create an order case class Order with four fields (order ID, user ID, order amount, timestamp)

  5. Create a custom data source
    • Generate 1000 orders using a for loop

    • Randomly generated order ID (UUID)

    • Randomly generated user ID (0-2)

    • Randomly generated order amount (0-100)

    • Timestamp is the current system time

    • Generate an order every 1 second

  6. Add watermarks, allowing 2 seconds of lateness

  7. Import the implicits of org.apache.flink.table.api.scala._

  8. Register the table with registerDataStream, specifying each field and also the rowtime field

  9. Write the SQL statement that counts each user's total number of orders, maximum amount and minimum amount; when grouping, use tumble(time column, interval 'window time' second) to create the window

  10. Execute the SQL statement using tableEnv.sqlQuery

  11. Convert the SQL execution result into a DataStream and print it

  12. Start the streaming job

Sample code:

import java.util.UUID
import java.util.concurrent.TimeUnit
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.table.api.{Table, TableEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.types.Row
import scala.util.Random
/**
 * Requirement:
 *  Use Flink SQL to compute, per 5 seconds, each user's total order count,
 *  maximum order amount and minimum order amount.
 *
 *  Note: "timestamp" is a reserved keyword and cannot be used as a field name.
 */
object StreamFlinkSqlDemo {
    /**
     *  1. Get the stream execution environment
     *  2. Get the Table environment
     *  3. Set the time characteristic to EventTime
     *  4. Create a case class Order with four fields (order ID, user ID, order amount, timestamp)
     *  5. Create a custom data source
     *     - generate 1000 orders with a for loop
     *     - random order ID (UUID)
     *     - random user ID (0-2)
     *     - random order amount (0-100)
     *     - timestamp is the current system time
     *     - emit one order every second
     *  6. Assign watermarks, allowing 2 seconds of lateness
     *  7. Import the implicits of org.apache.flink.table.api.scala._
     *  8. Register the table with registerDataStream, specifying the fields and the rowtime field
     *  9. Write the SQL that computes each user's order count, maximum amount and minimum amount;
     *     when grouping, use tumble(<time column>, interval '<window size>' second) to create the window
     * 10. Execute the SQL with tableEnv.sqlQuery
     * 11. Convert the SQL result into a DataStream and print it
     * 12. Start the streaming job
     */
    // 3. Create a case class `Order` with four fields (order ID, user ID, order amount, timestamp)
    case class Order(orderId:String, userId:Int, money:Long, createTime:Long)
    def main(args: Array[String]): Unit = {
      // 1. Create the stream execution environment
      val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
      // 2. Set the time characteristic to `EventTime`
      env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
      // Get the Table environment
      val tableEnv = TableEnvironment.getTableEnvironment(env)
      // 4. Create a custom data source
      val orderDataStream = env.addSource(new RichSourceFunction[Order] {
        var isRunning = true
        override def run(ctx: SourceFunction.SourceContext[Order]): Unit = {
          // - random order ID (UUID)
          // - random user ID (0-2)
          // - random order amount (0-100)
          // - timestamp is the current system time
          // - emit one order every second
          for (i <- 0 until 1000 if isRunning) {
            val order = Order(UUID.randomUUID().toString, Random.nextInt(3), Random.nextInt(101),
              System.currentTimeMillis())
            TimeUnit.SECONDS.sleep(1)
            ctx.collect(order)
          }
        }
        override def cancel(): Unit = { isRunning = false }
      })
      // 5. Assign watermarks, allowing 2 seconds of lateness
      val watermarkDataStream = orderDataStream.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor[Order](Time.seconds(2)) {
          override def extractTimestamp(element: Order): Long = {
            val eventTime = element.createTime
            eventTime
          }
        }
      )
      // 6. Import the implicits of `org.apache.flink.table.api.scala._`
      // 7. Register the table with `registerDataStream`, specifying the fields and the rowtime field
      import org.apache.flink.table.api.scala._
      tableEnv.registerDataStream("t_order", watermarkDataStream, 'orderId, 'userId, 'money, 'createTime.rowtime)
      // 8. Write the SQL that computes each user's order count, maximum amount and minimum amount
      // - when grouping, use `tumble(<time column>, interval '<window size>' second)` to create the window
      val sql =
      """
        |select
        | userId,
        | count(1) as totalCount,
        | max(money) as maxMoney,
        | min(money) as minMoney
        | from
        | t_order
        | group by
        | tumble(createTime, interval '5' second),
        | userId
      """.stripMargin
      // 9. Execute the SQL with `tableEnv.sqlQuery`
      val table: Table = tableEnv.sqlQuery(sql)

      // 10. Convert the SQL result into a DataStream and print it
      table.toRetractStream[Row].print()
      env.execute("StreamSQLApp")
    }
}

Origin blog.csdn.net/helloHbulie/article/details/121161552