Flink from entry to proficiency series (9)

11.7. Functions

Functions in Flink SQL fall into two categories. The first is built-in system functions, which can be called directly by name and cover common transformations, such as the COUNT(), CHAR_LENGTH(), and UPPER() we have used before. The second is user-defined functions (UDFs), which must be registered in the table environment before they can be used.

11.7.1. System functions

System functions, also called built-in functions, are functional modules pre-implemented in the system; we can call them directly by their fixed names to perform the desired transformations.

Flink SQL provides a large number of system functions and supports almost all operations in standard SQL, which makes it very convenient to write stream processing programs in SQL. The system functions in Flink SQL fall mainly into two categories: scalar functions and aggregate functions.

11.7.1.1. Scalar Functions

A "scalar" is a quantity that has only a value and no direction, so a scalar function is one that transforms its input data and returns a single value. Here the input generally corresponds to one or more fields of a row in the table, so the operation is somewhat like the map transformation operator in stream processing. Functions that take no input parameters and directly return a single result are also scalar functions.

Scalar functions are the most common and simplest kind of system function, and there are a great many of them; most are also defined in standard SQL. Here we only list a few functions for common types and give a brief overview; the complete function list is available on the official website. A small combined query follows the list below.

  • Comparison Functions
    A comparison function is really a comparison expression, used to judge the relationship between two values and return a Boolean value. The expression can connect two values with symbols such as <, >, and =, or it can be a judgment defined by keywords. For example:
    • value1 = value2 judges whether two values are equal;
    • value1 <> value2 judges whether two values are not equal;
    • value IS NOT NULL judges that value is not null.
  • Logical Functions
    A logical function is a logical expression: Boolean values connected with AND, OR, and NOT, or truth judgments made with IS and IS NOT; it returns a Boolean value. For example:
    • boolean1 OR boolean2 takes the logical OR of boolean1 and boolean2;
    • boolean IS FALSE judges whether the Boolean value boolean is false;
    • NOT boolean takes the logical NOT of boolean.
  • Arithmetic Functions
    Functions that perform arithmetic calculations, including operations connected with arithmetic operators and more complex mathematical functions. For example:
    • numeric1 + numeric2 adds two numbers;
    • POWER(numeric1, numeric2) raises numeric1 to the power of numeric2;
    • RAND() returns a pseudo-random double in the interval (0.0, 1.0).
  • String Functions
    Functions for string manipulation. For example:
    • string1 || string2 concatenates two strings;
    • UPPER(string) converts the string to all uppercase;
    • CHAR_LENGTH(string) returns the length of the string.
  • Temporal Functions
    Functions that perform time-related operations. For example:
    • DATE string parses the string in the format "yyyy-MM-dd" and returns a SQL date;
    • TIMESTAMP string parses the string in the format "yyyy-MM-dd HH:mm:ss[.SSS]" and returns a SQL timestamp;
    • CURRENT_TIME returns the current time in the local time zone as a SQL time (equivalent to LOCALTIME);
    • INTERVAL string range returns a time interval. string is a value; range can be a unit such as DAY or MINUTE, or a compound unit such as DAY TO HOUR or YEAR TO MONTH. For example, "2 years and 10 months" can be written as INTERVAL '2-10' YEAR TO MONTH.
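
As a small combined sketch, several of these scalar functions can appear in one query. The table clicks and its fields user_name and url below are hypothetical, used only for illustration:

-- hypothetical table clicks(user_name STRING, url STRING)
SELECT
    UPPER(user_name) AS user_upper,          -- string function
    CHAR_LENGTH(url) AS url_len,             -- string function
    POWER(CHAR_LENGTH(url), 2) AS len_sq,    -- arithmetic function
    CURRENT_TIME AS query_time               -- temporal function
FROM clicks
WHERE user_name IS NOT NULL                  -- comparison function
    AND CHAR_LENGTH(url) > 0;                -- comparison joined with logical AND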

11.7.1.2. Aggregate Functions

An aggregate function takes multiple rows of a table as input, aggregates over extracted fields, and returns a single aggregated value as the result. Aggregate functions are used very widely: whether in group aggregation, window aggregation, or over (OVER) aggregation, the aggregation over the data can be defined with the same functions.
The aggregate functions common in standard SQL are all supported by Flink SQL, and the set is still being extended to provide more powerful capabilities for stream processing applications. For example (a short grouped-query sketch follows the list):

  • COUNT(*) returns the total number of rows;
  • SUM([ ALL | DISTINCT ] expression) sums a field. By default the keyword ALL is implied, meaning all rows are summed; if DISTINCT is specified, the data is deduplicated and each value is added only once;
  • RANK() returns the rank of the current value within a set of values;
  • ROW_NUMBER() returns the row number of the current value after sorting a set of values. Its role is similar to RANK(); both are generally used in an OVER window.
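
As a short sketch of how these combine in a grouped query, assuming a hypothetical table Orders(user_name, amount):

-- hypothetical table Orders(user_name STRING, amount BIGINT)
SELECT
    user_name,
    COUNT(*) AS order_cnt,                  -- number of rows per user
    SUM(amount) AS total_amount,            -- sum over all rows
    SUM(DISTINCT amount) AS distinct_sum    -- each distinct value added only once
FROM Orders
GROUP BY user_name;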

11.7.2. User-defined functions (UDF)

Flink's Table API and SQL provide interfaces for several kinds of custom functions, defined as abstract classes. Current UDFs mainly fall into the following categories:

  • Scalar Functions: convert an input scalar value into a new scalar value;
  • Table Functions: convert a scalar value into one or more rows of data, that is, expand it into a table;
  • Aggregate Functions: convert scalar values from multiple rows of data into a new scalar value;
  • Table Aggregate Functions: convert scalar values from multiple rows of data into one or more new rows of data.
11.7.2.1. The overall calling process

To use a custom function in code, we first need to implement the corresponding UDF abstract class, then register the function in the table environment, and after that it can be called in the Table API and SQL.

  1. Register the function

To register a function, call the createTemporarySystemFunction() method of the table environment, passing in the name to register it under and the Class object of the UDF class:

// register the function
tableEnv.createTemporarySystemFunction("MyFunction", MyFunction.class);

Here our custom UDF class is called MyFunction; it should be a concrete implementation of one of the four UDF abstract classes above, and it is registered in the environment under the name MyFunction.

The createTemporarySystemFunction() method creates a "temporary system function", so the name MyFunction is global and the function can be used like a built-in system function. Alternatively, we can use the createTemporaryFunction() method; a function registered this way depends on the current catalog and database, so it is not a system function but a "catalog function", and its full name includes the catalog and database it belongs to. In general, we simply use createTemporarySystemFunction() to register a UDF as a system function.
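
For comparison, registering a catalog function only differs in the method called on the table environment; a minimal sketch (MyFunction stands for any concrete UDF class):

// register a catalog function, scoped to the current catalog and database
tableEnv.createTemporaryFunction("MyFunction", MyFunction.class);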

  2. Call the function in the Table API
    In the Table API, custom functions are invoked with the call() method:
tableEnv.from("MyTable").select(call("MyFunction", $("myField")));

The call() method here takes two kinds of arguments: the first is the registered function name MyFunction, and the rest are the arguments of the function call itself. Here we specify that when MyFunction is called, the argument passed in is the myField field.

In addition, in the Table API the UDF can be called "inline" without registering the function first:

tableEnv.from("MyTable").select(call(SubstringFunction.class, $("myField")));

The only difference is that the first argument of call() is no longer the registered function name but the Class object of the function class itself.

  3. Call the function in SQL

Once the function is registered as a system function, calling it in SQL is exactly the same as calling a built-in system function:

tableEnv.sqlQuery("SELECT MyFunction(myField) FROM MyTable");

As can be seen, calling functions in SQL is more convenient, so we will continue to use SQL as the example when introducing UDF usage below.

11.7.2.2. Scalar Functions

A custom scalar function converts zero, one, or more scalar values into one scalar value. Its input corresponds to fields in a row of data, and its output is a single value, so in terms of the correspondence between rows in the input and output tables, a scalar function is a "one-to-one" transformation.

To implement a custom scalar function, we define a class that extends the abstract class ScalarFunction and implement an evaluation method named eval(). The behavior of the scalar function is determined by this evaluation method, which must be declared public and must be named eval. The eval method can be overloaded multiple times, and any data type can be used for its parameters and return value.

It is worth noting that eval() is not declared in the ScalarFunction abstract class, so we cannot override it in the usual sense; rather, the Table API framework requires by convention that the evaluation method be named eval().
ScalarFunction and all the other UDF interfaces live in org.apache.flink.table.functions. Let's look at a concrete example: we implement a custom hash function HashFunction that returns the hash value of the object passed in.

public static class HashFunction extends ScalarFunction {
    // accept input of any type and return an INT
    public int eval(@DataTypeHint(inputGroup = InputGroup.ANY) Object o) {
        return o.hashCode();
    }
}

// register the function
tableEnv.createTemporarySystemFunction("HashFunction", HashFunction.class);

// call the registered function in SQL
tableEnv.sqlQuery("SELECT HashFunction(myField) FROM MyTable");

Here we define a ScalarFunction and implement its eval() evaluation method: it accepts an object of any type and returns an Int hash value. The concrete hash computation is omitted; we simply call the object's own hashCode() method.

Note also that because the Table API needs to extract type information from the evaluation method's parameters when resolving the function, we annotate the input parameter with @DataTypeHint(inputGroup = InputGroup.ANY) to indicate that eval accepts a parameter of any type.
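
To illustrate the overloading mentioned above, a scalar function may declare several eval() methods with different parameter lists. The StringLengthFunction below is a hypothetical sketch, not part of the original example:

public static class StringLengthFunction extends ScalarFunction {
    // overload 1: length of a single string (null is treated as 0)
    public int eval(String s) {
        return s == null ? 0 : s.length();
    }

    // overload 2: combined length of two strings
    public int eval(String s1, String s2) {
        return eval(s1) + eval(s2);
    }
}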

11.7.2.3. Table Functions

Like a scalar function, a table function can take zero, one, or more scalar values as input arguments; the difference is that it can return any number of rows. Multiple rows of data effectively form a table, so a "table function" can be thought of as a function that returns a table: a "one-to-many" transformation.

Similarly, to implement a custom table function we define a class that extends the abstract class TableFunction and implement an evaluation method named eval. Unlike a scalar function, the TableFunction class itself has a generic parameter T, which is the type of data the table function returns; and eval() has no return type and contains no return statement; instead, the rows to emit are sent by calling the collect() method.

This resembles FlatMapFunction and ProcessFunction in the DataStream API: their flatMap and processElement methods likewise return nothing and emit data downstream via out.collect().

To call a table function in SQL, we use LATERAL TABLE() to generate an expanded "lateral table" and then join it with the original table. The join can be a plain cross join, simply separating the two tables with a comma after FROM, or a LEFT JOIN with ON TRUE as the condition.

Below is a concrete example of a table function. We implement a SplitFunction that splits a string and converts it into (word, length) tuples.

// note the type annotation: the output is of type Row, containing two fields: word and length
@FunctionHint(output = @DataTypeHint("ROW<word STRING, length INT>"))
public static class SplitFunction extends TableFunction<Row> {
    public void eval(String str) {
        for (String s : str.split(" ")) {
            // emit one row with the collect() method
            collect(Row.of(s, s.length()));
        }
    }
}

// register the function
tableEnv.createTemporarySystemFunction("SplitFunction", SplitFunction.class);

// call the registered function in SQL
// 1. cross join
tableEnv.sqlQuery(
    "SELECT myField, word, length " +
    "FROM MyTable, LATERAL TABLE(SplitFunction(myField))");
// 2. left join with an ON TRUE condition
tableEnv.sqlQuery(
    "SELECT myField, word, length " +
    "FROM MyTable " +
    "LEFT JOIN LATERAL TABLE(SplitFunction(myField)) ON TRUE");
// rename the fields of the lateral table
tableEnv.sqlQuery(
    "SELECT myField, newWord, newLength " +
    "FROM MyTable " +
    "LEFT JOIN LATERAL TABLE(SplitFunction(myField)) AS T(newWord, newLength) ON TRUE");

Here we declare the output type of the table function directly as ROW, which is the data type of the resulting lateral table; each input row is expanded into one row per split word. We call the function in SQL in two ways, with a cross join and with a left join, and we can also rename the fields of the lateral table.
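
For reference, the same table function can also be called inline in the Table API with the joinLateral() and leftOuterJoinLateral() methods; a sketch, assuming the SplitFunction class above:

// cross join with the lateral table in the Table API
tableEnv.from("MyTable")
    .joinLateral(call(SplitFunction.class, $("myField")))
    .select($("myField"), $("word"), $("length"));

// left outer join with the lateral table
tableEnv.from("MyTable")
    .leftOuterJoinLateral(call(SplitFunction.class, $("myField")))
    .select($("myField"), $("word"), $("length"));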

11.7.2.4. Aggregate Functions

A user-defined aggregate function (UDAGG, User Defined AGGregate function) aggregates one or more rows of data (that is, a table) into a single scalar value: a standard "many-to-one" transformation. We have met aggregate functions many times already; SUM(), MAX(), MIN(), AVG(), and COUNT() are all common built-in system aggregate functions. When a requirement cannot be met by calling the system functions directly, we have to implement it with a custom aggregate function.

A custom aggregate function extends the abstract class AggregateFunction, which has two generic parameters <T, ACC>: T is the type of the aggregation result and ACC is the type of the intermediate aggregation state. An aggregate function in Flink SQL works as follows:

  • First, it needs an accumulator to store the intermediate result of the aggregation. This is very similar to the AggregateFunction in the DataStream API; the accumulator can be thought of as aggregation state. An empty accumulator is created by calling the createAccumulator() method.
  • For each input row, the accumulate() method is called to update the accumulator; this is the core of the aggregation.
  • When all the data has been processed, the final result is computed and returned by calling the getValue() method.

Therefore, each AggregateFunction must implement the following methods:

  • createAccumulator()
    This is the method that creates the accumulator. It takes no parameters, and its return type is the accumulator type ACC.

  • accumulate()
    This is the core method of the aggregate computation, called once for every input row. Its first parameter is fixed: the current accumulator, of type ACC, representing the intermediate state of the aggregation; the following parameters are the arguments passed when the aggregate function is called, and there can be several of them with different types. This method updates the aggregation state, so it has no return type. Note that accumulate(), like the evaluation method eval() earlier, is required by the underlying architecture: it must be public and must be named accumulate, and it cannot be overridden in the usual sense but has to be implemented by hand.

  • getValue()
    This is the method that produces the final result. Its input parameter is an accumulator of type ACC, and its output type is T. For complex types, Flink's type inference may not produce the correct result, so AggregateFunction can also declare the accumulator type and the result type explicitly via the getAccumulatorType() and getResultType() methods.

Besides the methods above, there are several optional methods. Some of them make queries more efficient, and some must be implemented in particular scenarios. For example, if the aggregation is used in a session window, the merge() method must be implemented; it defines how accumulators are merged and is also useful for optimizing some other scenarios. And if the aggregate function is used in an OVER window aggregation, the retract() method must be implemented so that data can be retracted.

The resetAccumulator() method resets the accumulator, which is useful in some batch scenarios. All methods of AggregateFunction must be public, must not be static, and must use exactly the names listed above. createAccumulator, getValue, getResultType, and getAccumulatorType are declared in the abstract class AggregateFunction and can be overridden; the others are methods agreed upon with the underlying architecture.

For example, suppose we want to compute a weighted average score for each student from a score table ScoreTable. To compute a weighted average, two values should be extracted from each input row as parameters: the score to be aggregated and its weight. During aggregation, the accumulator needs to hold the current weighted sum and the current count of data. This could be represented as a two-tuple, or we can define a separate class WeightedAvgAccumulator with sum and count attributes and use an instance of it as the accumulator. The code is as follows:

// accumulator type definition
public static class WeightedAvgAccumulator {
    public long sum = 0;    // weighted sum
    public int count = 0;   // count of data
}

// custom aggregate function: outputs the average as a Long, accumulator type is WeightedAvgAccumulator
public static class WeightedAvg extends AggregateFunction<Long, WeightedAvgAccumulator> {

    @Override
    public WeightedAvgAccumulator createAccumulator() {
        return new WeightedAvgAccumulator();    // create the accumulator
    }

    @Override
    public Long getValue(WeightedAvgAccumulator acc) {
        if (acc.count == 0) {
            return null;    // guard against division by zero
        } else {
            return acc.sum / acc.count;    // compute and return the average
        }
    }

    // accumulation method, called for every input row
    public void accumulate(WeightedAvgAccumulator acc, Long iValue, Integer iWeight) {
        acc.sum += iValue * iWeight;
        acc.count += iWeight;
    }
}

// register the custom aggregate function
tableEnv.createTemporarySystemFunction("WeightedAvg", WeightedAvg.class);

// call the function to compute the weighted average
Table result = tableEnv.sqlQuery(
    "SELECT student, WeightedAvg(score, weight) FROM ScoreTable GROUP BY student");

The accumulate() method of this aggregate function has three parameters: the first is the accumulator of type WeightedAvgAccumulator; the other two are the fields passed in when the function is called, namely the value to aggregate, iValue, and its weight, iWeight.
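
As mentioned above, merge() and retract() are optional methods of an AggregateFunction. They are not part of this example; the following bodies are only a sketch of what they could look like for WeightedAvg, derived from the meaning of the accumulator fields:

// merge(): fold other accumulators into acc, e.g. required for session window aggregation
public void merge(WeightedAvgAccumulator acc, Iterable<WeightedAvgAccumulator> it) {
    for (WeightedAvgAccumulator other : it) {
        acc.sum += other.sum;
        acc.count += other.count;
    }
}

// retract(): withdraw a previously accumulated row, e.g. required for OVER window aggregation
public void retract(WeightedAvgAccumulator acc, Long iValue, Integer iWeight) {
    acc.sum -= iValue * iWeight;
    acc.count -= iWeight;
}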

11.7.2.5. Table Aggregate Functions

A custom table aggregate function extends the abstract class TableAggregateFunction. Its structure and principle are very similar to AggregateFunction: it also has two generic parameters <T, ACC> and uses an accumulator of type ACC to store the intermediate aggregation result. The three methods that must be implemented for an aggregate function must also be implemented here:

  • createAccumulator()
    The method that creates the accumulator, used in the same way as in AggregateFunction.
  • accumulate()
    The core aggregation method, used in the same way as in AggregateFunction.
  • emitValue()
    The method that emits the final result after all input rows have been processed. It corresponds to getValue() in AggregateFunction; the difference is that emitValue has no return type and takes two parameters: the first is the accumulator of type ACC, and the second is a collector for the output data, out, of type Collector<T>. So the table aggregate function does not return its output directly; it calls out.collect(), and by calling it multiple times it can emit multiple rows, very much like a table function. Also, emitValue() is not declared in the abstract class, so it cannot be overridden in the usual sense and must be implemented by hand.

What a table aggregate function produces is a table; in a continuous query in stream processing, this table should in principle be recomputed and re-emitted for every update. If, after each input record, only one or a few rows of the result table change, then recomputing and emitting the whole table is clearly not efficient. To improve performance, TableAggregateFunction also offers an emitUpdateWithRetract() method, which updates incrementally by "retracting" old data and emitting new data whenever the result table changes. If both emitValue() and emitUpdateWithRetract() are defined, emitUpdateWithRetract() takes precedence when updating.

Table aggregate functions are relatively complex; a typical application scenario is Top N queries. For example, suppose we want to select the top two values of a group of numbers: the simplest TOP-2 query. Since there is no ready-made system function for this, we can implement it with a custom table aggregate function. The accumulator should hold the largest two values seen so far; whenever a new record arrives, accumulate() compares and updates them, and finally emitValue() calls out.collect() twice to emit the top two values. The code is as follows:

// accumulator type definition, holding the largest and second-largest values
public static class Top2Accumulator {
    public Integer first;
    public Integer second;
}

// custom table aggregate function: finds the largest two values in a group,
// returning (value, rank) tuples
public static class Top2 extends TableAggregateFunction<Tuple2<Integer, Integer>, Top2Accumulator> {

    @Override
    public Top2Accumulator createAccumulator() {
        Top2Accumulator acc = new Top2Accumulator();
        acc.first = Integer.MIN_VALUE;   // initialize to the minimum value for easy comparison
        acc.second = Integer.MIN_VALUE;
        return acc;
    }

    // called once per record; decides whether to update the accumulator
    public void accumulate(Top2Accumulator acc, Integer value) {
        if (value > acc.first) {
            acc.second = acc.first;
            acc.first = value;
        } else if (value > acc.second) {
            acc.second = value;
        }
    }

    // emit (value, rank) tuples, outputting two rows
    public void emitValue(Top2Accumulator acc, Collector<Tuple2<Integer, Integer>> out) {
        if (acc.first != Integer.MIN_VALUE) {
            out.collect(Tuple2.of(acc.first, 1));
        }
        if (acc.second != Integer.MIN_VALUE) {
            out.collect(Tuple2.of(acc.second, 2));
        }
    }
}

Currently there is no way to call a table aggregate function directly in SQL, so it has to be called through the Table API:

// register the table aggregate function
tableEnv.createTemporarySystemFunction("Top2", Top2.class);

// call the function in the Table API
tableEnv.from("MyTable")
    .groupBy($("myField"))
    .flatAggregate(call("Top2", $("value")).as("value", "rank"))
    .select($("myField"), $("value"), $("rank"));

Here we use the flatAggregate() method, the dedicated interface for calling table aggregate functions. The data in MyTable is grouped and aggregated by the myField field, the two largest values of value are kept, the two fields of the aggregation result are renamed to value and rank, and select() then extracts them.
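
Finally, the emitUpdateWithRetract() method mentioned earlier can make the Top-2 function emit incremental updates instead of the whole result. The variant below is only a sketch following the pattern in the Flink documentation; the extra old* fields in the accumulator are additions assumed here so that previously emitted rows can be retracted:

// accumulator extended with the previously emitted values, so they can be retracted
public static class Top2RetractAccumulator {
    public Integer first = Integer.MIN_VALUE;
    public Integer second = Integer.MIN_VALUE;
    public Integer oldFirst = Integer.MIN_VALUE;
    public Integer oldSecond = Integer.MIN_VALUE;
}

public static class Top2WithRetract
        extends TableAggregateFunction<Tuple2<Integer, Integer>, Top2RetractAccumulator> {

    @Override
    public Top2RetractAccumulator createAccumulator() {
        return new Top2RetractAccumulator();
    }

    public void accumulate(Top2RetractAccumulator acc, Integer value) {
        if (value > acc.first) {
            acc.second = acc.first;
            acc.first = value;
        } else if (value > acc.second) {
            acc.second = value;
        }
    }

    // emit only the rows that changed: retract the old value, then collect the new one
    public void emitUpdateWithRetract(Top2RetractAccumulator acc,
                                      RetractableCollector<Tuple2<Integer, Integer>> out) {
        if (!acc.first.equals(acc.oldFirst)) {
            if (acc.oldFirst != Integer.MIN_VALUE) {
                out.retract(Tuple2.of(acc.oldFirst, 1));
            }
            out.collect(Tuple2.of(acc.first, 1));
            acc.oldFirst = acc.first;
        }
        if (!acc.second.equals(acc.oldSecond)) {
            if (acc.oldSecond != Integer.MIN_VALUE) {
                out.retract(Tuple2.of(acc.oldSecond, 2));
            }
            out.collect(Tuple2.of(acc.second, 2));
            acc.oldSecond = acc.second;
        }
    }
}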

11.9. Connecting to external systems

11.9.1. Kafka

The Kafka SQL connector can read data from a Kafka topic into a table, and can also write table data to a Kafka topic. In other words, if we specify the connector as Kafka when creating a table, the table can be used both as an input table and as an output table.

11.9.1.1. Introducing dependencies

To use the Kafka connector in a Flink program, the following dependencies need to be introduced:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
</dependency>

The Flink Kafka connector introduced here is the same one used with the DataStream API earlier. To use the Kafka connector in the SQL client, you also need to download the corresponding jar package and put it in the lib directory. In addition, Flink provides a series of "formats" for its connectors, such as CSV, JSON, Avro, and Parquet. A format defines how the underlying binary data is converted to and from the columns of a table; it is effectively the table's serialization tool. For Kafka, the major formats CSV, JSON, and Avro are supported, and depending on the format configured in the Kafka connector, we may need to add the corresponding dependency. Taking CSV as an example:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-csv</artifactId>
    <version>${flink.version}</version>
</dependency>

Since the SQL client has built-in support for CSV and JSON, they need no extra dependency there; for formats without built-in support (such as Avro), you still need to download the corresponding jar package.

11.9.1.2. Create a table connected to Kafka

To create a table connected to Kafka, specify the connector as Kafka in the WITH clause of the CREATE TABLE DDL and define the necessary configuration parameters. Here is a concrete example:

CREATE TABLE KafkaTable (
`user` STRING,
 `url` STRING,
 `ts` TIMESTAMP(3) METADATA FROM 'timestamp'
) WITH (
 'connector' = 'kafka',
 'topic' = 'events',
 'properties.bootstrap.servers' = 'localhost:9092',
 'properties.group.id' = 'testGroup',
 'scan.startup.mode' = 'earliest-offset',
 'format' = 'csv'
)

This defines the topic, the Kafka server, the consumer group ID, the consumer startup mode, and the format for the Kafka connector. Note the field ts in KafkaTable: its declaration uses METADATA FROM, which marks it as a "metadata column" populated from the Kafka connector's metadata key "timestamp". This timestamp is the one carried by the Kafka record itself; we simply extract it as metadata into a new field ts.
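
As a sketch, other metadata of a Kafka record can be exposed in the same way. The column names below are arbitrary, VIRTUAL marks read-only columns that are not written back, and this extended table is an illustration rather than part of the original example:

CREATE TABLE KafkaTableWithMeta (
 `user` STRING,
 `url` STRING,
 `ts` TIMESTAMP(3) METADATA FROM 'timestamp',
 `part` INT METADATA FROM 'partition' VIRTUAL, -- Kafka partition of the record
 `offset` BIGINT METADATA VIRTUAL              -- offset within the partition
) WITH (
 'connector' = 'kafka',
 'topic' = 'events',
 'properties.bootstrap.servers' = 'localhost:9092',
 'scan.startup.mode' = 'earliest-offset',
 'format' = 'csv'
)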

11.9.1.3. Upsert Kafka

Normally Kafka is a message queue that preserves the order of the data; both reads and writes should be streaming data, corresponding to append-only tables. If we try to write a result table that contains update operations (for example, the result of a grouped aggregation) to Kafka, an exception is thrown, because Kafka cannot recognize retract or upsert messages.

To solve this problem, Flink provides a dedicated "Upsert Kafka" connector, which supports reading and writing data to Kafka topics in upsert fashion. Specifically, the Upsert Kafka connector processes changelog streams. Used as a TableSource, the connector interprets each (key, value) record read from the topic as an UPDATE of the current value for that key: it finds the row in the dynamic table corresponding to the key and updates its value to the latest one; because this is an upsert, if no row with that key exists, an INSERT is performed instead. In addition, if the value is null, the connector interprets the record as a DELETE of the row corresponding to the key.

Used as a TableSink, the Upsert Kafka connector converts the changelog stream of an updating result table as follows: insert (INSERT) and update-after (UPDATE_AFTER) data correspond to add messages and are written to the Kafka topic normally; delete (DELETE) and update-before data correspond to retract messages, and a record with a null value is written to Kafka. Since Flink partitions the data by the value of the key, updates and deletes for the same key are guaranteed to land in the same partition.
Here is an example of creating and using an Upsert Kafka table:

CREATE TABLE pageviews_per_region (
 user_region STRING,
 pv BIGINT,
 uv BIGINT,
 PRIMARY KEY (user_region) NOT ENFORCED
) WITH (
 'connector' = 'upsert-kafka',
 'topic' = 'pageviews_per_region',
 'properties.bootstrap.servers' = '...',
 'key.format' = 'avro',
 'value.format' = 'avro'
);
CREATE TABLE pageviews (
 user_id BIGINT,
 page_id BIGINT,
 viewtime TIMESTAMP,
 user_region STRING,
 WATERMARK FOR viewtime AS viewtime - INTERVAL '2' SECOND
) WITH (
 'connector' = 'kafka',
 'topic' = 'pageviews',
 'properties.bootstrap.servers' = '...',
 'format' = 'json'
);
-- compute pv and uv and insert them into the upsert-kafka table
INSERT INTO pageviews_per_region
SELECT
 user_region,
 COUNT(*),
 COUNT(DISTINCT user_id)
FROM pageviews
GROUP BY user_region;

Here we read data from the Kafka table pageviews and count each region's PV (total views) and UV (distinct users). This is a grouped-aggregation update query, so the result table is continuously updated.

To write the result table to the Kafka topic pageviews_per_region, we define an Upsert Kafka table, which requires a PRIMARY KEY among its fields and the serialization formats of the key and value in the WITH clause.

11.9.2. File system

Another very common kind of external system is the file system. Flink provides a file system connector that supports reading and writing data from local or distributed file systems; it is built into Flink, so no extra dependency is needed.
Here is an example of connecting to a file system:

CREATE TABLE MyTable (
 column_name1 INT,
 column_name2 STRING,
 ...
 part_name1 INT,
 part_name2 STRING
) PARTITIONED BY (part_name1, part_name2) WITH (
 'connector' = 'filesystem', -- connector type
 'path' = '...',             -- file path
 'format' = '...'            -- file format
)

Here PARTITIONED BY is used before WITH to partition the data; the file system connector supports access to partitioned files. A filled-in sketch follows.
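
For instance, a filled-in version might look like the following; the path and fields here are made up for illustration:

CREATE TABLE DailyEvents (
 user_name STRING,
 url STRING,
 dt STRING -- partition field, e.g. '2023-01-01'
) PARTITIONED BY (dt) WITH (
 'connector' = 'filesystem',
 'path' = 'file:///tmp/daily-events',
 'format' = 'csv'
)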

11.9.3. JDBC

Relational tables are where SQL was originally applied, so we naturally also want to read and write table data directly to relational databases. The JDBC connector provided by Flink can read from and write to any relational database with a JDBC driver, such as MySQL, PostgreSQL, or Derby.
When writing data to a database as a TableSink, the mode of operation depends on whether the CREATE TABLE DDL defines a primary key. If there is a primary key, the JDBC connector runs in upsert mode and can send UPDATE and DELETE operations to the external database according to the specified key; if no primary key is defined, it runs in append mode and does not support update or delete operations.

11.9.3.1. Introducing dependencies

To use the JDBC connector in a Flink program, the following dependencies need to be introduced:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-jdbc_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
</dependency>

In addition, to connect to a specific database we also need to add the corresponding driver dependency, for example for MySQL:

<dependency>
 <groupId>mysql</groupId>
 <artifactId>mysql-connector-java</artifactId>
 <version>5.1.38</version>
</dependency>

The driver version introduced here is 5.1.38, readers can choose according to their own MySQL version.

11.9.3.2. Create JDBC table

Creating a JDBC table is similar to the Upsert Kafka table above. Here is a concrete example:

-- create a table connected to MySQL
CREATE TABLE MyTable (
 id BIGINT,
 name STRING,
 age INT,
 status BOOLEAN,
 PRIMARY KEY (id) NOT ENFORCED
) WITH (
 'connector' = 'jdbc',
 'url' = 'jdbc:mysql://localhost:3306/mydatabase',
 'table-name' = 'users'
);
-- write the data of another table T into MyTable
INSERT INTO MyTable
SELECT id, name, age, status FROM T;

Since the DDL defines a primary key, data is written to MySQL in upsert mode; the connection to MySQL is defined via the url in the WITH clause. Note that the actual table written to in MySQL is users; MyTable is only the name registered in the Flink table environment.

11.9.4. Elasticsearch

As a distributed search and analytics engine, Elasticsearch appears in many big-data scenarios. The Elasticsearch SQL connector provided by Flink can only be used as a TableSink and can write table data into an Elasticsearch index. Its usage is very similar to the JDBC connector: the write mode is likewise determined by whether the CREATE TABLE DDL defines a primary key.

11.9.4.1. Introducing dependencies

To use the Elasticsearch connector in a Flink program, the corresponding dependency must be added. The dependency depends on the version of the Elasticsearch server; for 6.x it is:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-elasticsearch6_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
</dependency>

For Elasticsearch 7 and above, the dependency is:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-elasticsearch7_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
</dependency>

11.9.4.2. Create a table connected to Elasticsearch

The method of creating an Elasticsearch table is basically the same as that of a JDBC table. Here is a concrete example:

-- create a table connected to Elasticsearch
CREATE TABLE MyTable (
 user_id STRING,
 user_name STRING,
 uv BIGINT,
 pv BIGINT,
 PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
 'connector' = 'elasticsearch-7',
 'hosts' = 'http://localhost:9200',
 'index' = 'users'
);

The primary key is defined here, so data will be written to Elasticsearch in Upsert mode.

11.9.5. HBase

As a high-performance, scalable, distributed columnar store, HBase is a very important tool in big-data analytics. The HBase connector provided by Flink supports reading and writing to HBase clusters.

In stream processing scenarios, when the connector writes data to HBase as a TableSink, it always uses upsert mode; in other words, HBase requires the connector to send the changelog based on a declared primary key. Therefore, in the CREATE TABLE DDL we must define a row key (rowkey) field and declare it as the primary key; if no primary key is declared with a PRIMARY KEY clause, the connector takes rowkey as the primary key by default.

11.9.5.1. Introducing dependencies

To use the HBase connector in a Flink program, the corresponding dependency must be added. Currently Flink only provides connector support for HBase 1.4.x and 2.2.x, and the dependency also depends on the concrete HBase version. For 1.4 it is:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-hbase-1.4_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
</dependency>

For HBase 2.2, the dependency is:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-hbase-2.2_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
</dependency>

11.9.5.2. Create a table connected to HBase

Since HBase is not a relational database, converting it to a table in Flink SQL is a little more involved. In a DDL-created HBase table, every column family must be declared as a ROW type and occupies one field of the table; the columns (column qualifiers) within a family correspond to the nested fields of the ROW. We do not need to declare every family and qualifier that exists in HBase, only the ones used in the query. Besides the ROW-typed fields (corresponding to HBase families), the table should have exactly one field of an atomic type, which is recognized as the HBase row key; this field can have any name and does not have to be called rowkey.

Here is a concrete example:

-- create a table connected to HBase
CREATE TABLE MyTable (
rowkey INT,
family1 ROW<q1 INT>,
family2 ROW<q2 STRING, q3 BIGINT>,
family3 ROW<q4 DOUBLE, q5 BOOLEAN, q6 STRING>,
PRIMARY KEY (rowkey) NOT ENFORCED
) WITH (
'connector' = 'hbase-1.4',
'table-name' = 'mytable',
'zookeeper.quorum' = 'localhost:2181'
);

-- suppose the schema of table T is [rowkey, f1q1, f2q2, f2q3, f3q4, f3q5, f3q6]
INSERT INTO MyTable
SELECT rowkey, ROW(f1q1), ROW(f2q2, f2q3), ROW(f3q4, f3q5, f3q6) FROM T;

Here we take the data from another table T, use the ROW() function to construct the corresponding column families, and write the result into the HBase table named mytable.
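
When reading from this table, the nested fields of each declared column family can be accessed with dot notation; a small query sketch:

-- read the row key plus selected qualifiers from the declared families
SELECT rowkey, family1.q1, family2.q3 FROM MyTable;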
