Flink UDF

This article mainly covers three kinds of UDFs:

  • ScalarFunction

  • TableFunction

  • AggregateFunction

    User-defined functions are an important feature because they greatly extend the expressiveness of queries. Besides introducing these three kinds of UDFs, this article closes with a Redis UDF example that interacts with an external data source.

Registering user-defined functions

   In most scenarios, a user-defined function must be registered before it can be used. Only the Scala Table API allows a UDF to be used without registration.

   Registration is done by calling the registerFunction() method of the TableEnvironment. Once a UDF is registered successfully, it is inserted into the function catalog of the TableEnvironment, so that both the Table API and SQL can resolve it.

1. Scalar Functions

   A scalar function is a function that returns a single value. It maps zero, one, or more scalar values to a new scalar value.

   To implement a custom scalar function you need to extend ScalarFunction and implement one or more evaluation methods. The behavior of the scalar function is defined by these evaluation methods, which must be declared public and named eval. The parameter types and return types of the evaluation methods determine the parameter types and return type of the scalar function. Evaluation methods can be overloaded by implementing multiple eval methods, and they also support variable arguments, e.g. eval(String... strs).
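As a minimal sketch of overloading and variable arguments (not part of the original article; the class name ConcatFunction is purely illustrative):

import org.apache.flink.table.functions.ScalarFunction;

public class ConcatFunction extends ScalarFunction {
    // overloaded eval: concatenates two strings
    public String eval(String a, String b) {
        return a + b;
    }

    // overloaded eval with variable arguments: concatenates any number of integers
    public String eval(Integer... values) {
        StringBuilder sb = new StringBuilder();
        for (Integer v : values) {
            sb.append(v);
        }
        return sb.toString();
    }
}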

A complete scalar function example follows; its eval method computes a hash code.

public class HashCode extends ScalarFunction {
    private int factor = 12;

    public HashCode(int factor) {
        this.factor = factor;
    }

    public int eval(String s) {
        return s.hashCode() * factor;
    }
}

BatchTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);

// register the function
tableEnv.registerFunction("hashCode", new HashCode(10));

// use the function in Java Table API
myTable.select("string, string.hashCode(), hashCode(string)");

// use the function in SQL API
tableEnv.sqlQuery("SELECT string, HASHCODE(string) FROM MyTable");

   By default, the result type of an evaluation method is determined by Flink's type extraction facilities. This is sufficient for basic types and simple POJOs, but it can fail for more complex, custom, or composite types. In that case, the TypeInformation of the result type has to be specified manually by overriding ScalarFunction#getResultType().

   The following example overrides ScalarFunction#getResultType() so that the long return value is interpreted as Types.TIMESTAMP during code generation.

public static class TimestampModifier extends ScalarFunction {
    public long eval(long t) {
        return t % 1000;
    }

    public TypeInformation<?> getResultType(Class<?>[] signature) {
        return Types.TIMESTAMP;
    }
}

2. Table Functions

   Similar to a scalar function, a table function takes zero, one, or more parameters as input. The difference is that it can return an arbitrary number of rows rather than a single value, and each returned row may contain one or more columns.

   To define a custom table function, you need to extend TableFunction and implement one or more evaluation methods. The behavior of the table function is defined by these evaluation methods, which must be public and named eval. A TableFunction can be overloaded with multiple eval methods, and variable arguments are supported as well, e.g. eval(String... strs). The parameter types of the evaluation methods determine the input types of the table function, while the return type of the table is determined by the generic type parameter of TableFunction. Evaluation methods emit output rows by calling collect(T).

   In the Table API, a table function is used with .join(Expression) or .leftOuterJoin(Expression) in Scala, and with .join(String) or .leftOuterJoin(String) in Java.

  • The join operator (cross) joins each row of the outer table (the table on the left of the operator) with all rows produced by the table function (on the right of the operator).
  • The leftOuterJoin operator (cross) joins each row of the outer table (the table on the left of the operator) with all rows produced by the table function (on the right of the operator), and preserves outer rows for which the table function returns an empty table.

In SQL, the syntax differs slightly:

  • For a cross join, use LATERAL TABLE(<TableFunction>).
  • For a left outer join, add an ON TRUE join condition.

The following example shows how to define and use a table function.

// The generic type "Tuple2<String, Integer>" determines the schema of the returned table as (String, Integer).

public class Split extends TableFunction<Tuple2<String, Integer>> {
    private String separator = " ";

    public Split(String separator) {
        this.separator = separator;
    }

    public void eval(String str) {
        for (String s : str.split(separator)) {
            // use collect(...) to emit a row
            collect(new Tuple2<String, Integer>(s, s.length()));
        }
    }
}

BatchTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);
Table myTable = ...    // table schema: [a: String]

// Register the function.
tableEnv.registerFunction("split", new Split("#"));

// Use the table function in the Java Table API. "as" specifies the field names of the table.
myTable.join("split(a) as (word, length)").select("a, word, length");
myTable.leftOuterJoin("split(a) as (word, length)").select("a, word, length");

// Use the table function in SQL with LATERAL and TABLE keywords.
// CROSS JOIN a table function (equivalent to "join" in Table API).
tableEnv.sqlQuery("SELECT a, word, length FROM MyTable, LATERAL TABLE(split(a)) as T(word, length)");

// LEFT JOIN a table function (equivalent to "leftOuterJoin" in Table API).
tableEnv.sqlQuery("SELECT a, word, length FROM MyTable LEFT JOIN LATERAL TABLE(split(a)) as T(word, length) ON TRUE");

   Note that POJO types do not have a deterministic field order. Therefore, you cannot rename the fields of a POJO returned by a table function using AS.
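For illustration only (not from the original article; the class names WordInfo and PojoSplit are hypothetical), a table function returning a POJO might look like this; its fields are addressed by their POJO names instead of being renamed with AS:

// Hypothetical POJO result type; its public fields become the columns "word" and "length".
public static class WordInfo {
    public String word;
    public int length;
}

public static class PojoSplit extends TableFunction<WordInfo> {
    public void eval(String str) {
        for (String s : str.split(" ")) {
            WordInfo info = new WordInfo();
            info.word = s;
            info.length = s.length();
            collect(info);
        }
    }
}

// join without renaming; select the POJO fields by their own names
tableEnv.registerFunction("pojoSplit", new PojoSplit());
myTable.join("pojoSplit(a)").select("a, word, length");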

   By default, the result type of a TableFunction is determined by Flink's type extraction facilities. This works for basic types and simple POJOs, but can fail for more complex, custom, or composite types. In that case, the TypeInformation of the result type has to be specified manually by overriding TableFunction#getResultType().

In the following example, the return type of the table is set to RowTypeInfo(String, Integer) by overriding the TableFunction#getResultType() method.

public class CustomTypeSplit extends TableFunction<Row> {
    public void eval(String str) {
        for (String s : str.split(" ")) {
            Row row = new Row(2);
            row.setField(0, s);
            row.setField(1, s.length());
            collect(row);
        }
    }

    @Override
    public TypeInformation<Row> getResultType() {
        return Types.ROW(Types.STRING(), Types.INT());
    }
}

3. Aggregation Functions

   A user-defined aggregate function aggregates a table (one or more rows, each with one or more attributes) into a scalar value.
[Figure: a beverages table with five rows, each containing a name and a price]
The figure above shows a beverages table with five rows of data; the task is to find the highest price among all beverages.

   An aggregate function must extend AggregateFunction. It works as follows:

  • First, it needs an accumulator, the data structure that holds the intermediate result of the aggregation. An empty accumulator is created by calling the AggregateFunction's createAccumulator() method.

  • Subsequently, the accumulate() method is called for each input row to update the accumulator. Once all rows have been processed, the getValue() method is called to compute and return the final result.

For every AggregateFunction, the following three methods are mandatory (a minimal sketch follows this list):

  • createAccumulator()
  • accumulate()
  • getValue()
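As a minimal sketch of this workflow (not from the original article; the names MaxPrice and MaxPriceAccum are hypothetical), the following aggregate function computes the highest price from the beverages example above:

import org.apache.flink.table.functions.AggregateFunction;

// Accumulator holding the intermediate maximum price.
public static class MaxPriceAccum {
    public double max = 0;   // prices are assumed to be non-negative
}

// Hypothetical aggregate function returning the highest price of all rows.
public static class MaxPrice extends AggregateFunction<Double, MaxPriceAccum> {

    @Override
    public MaxPriceAccum createAccumulator() {
        return new MaxPriceAccum();
    }

    // called once per input row to update the accumulator
    public void accumulate(MaxPriceAccum acc, double price) {
        if (price > acc.max) {
            acc.max = price;
        }
    }

    @Override
    public Double getValue(MaxPriceAccum acc) {
        return acc.max;
    }
}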

   Flink's type extraction facilities may fail to identify complex data types, e.g. types that are not basic types or simple POJOs. Therefore, similar to ScalarFunction and TableFunction, AggregateFunction provides methods to specify the TypeInformation of the result type, via AggregateFunction#getResultType(), and of the accumulator type, via AggregateFunction#getAccumulatorType().
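For illustration (a sketch only, continuing the hypothetical MaxPrice function above and assuming the Types helper used elsewhere in this article), both methods could be overridden like this:

// added inside the hypothetical MaxPrice class
@Override
public TypeInformation<Double> getResultType() {
    return Types.DOUBLE();
}

@Override
public TypeInformation<MaxPriceAccum> getAccumulatorType() {
    return TypeInformation.of(MaxPriceAccum.class);
}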

   Besides the methods above, there are some optional methods. Some of them make queries run more efficiently, while others are mandatory in certain scenarios. For example, merge() is required in the context of session group windows: when a row is observed that connects two session windows, the accumulators of the two sessions need to be joined.

Depending on the use case, the following AggregateFunction methods may need to be implemented:

  • retract(): required for aggregations over bounded OVER windows.
  • merge(): required for many batch aggregations and for session window aggregations.
  • resetAccumulator(): required for most batch aggregations.

All methods of an AggregateFunction must be declared public and must not be static. A user-defined aggregate function has to extend org.apache.flink.table.functions.AggregateFunction and implement one or more accumulate methods. The accumulate method can be overloaded for different data types and supports variable arguments.

   To compute a weighted average, the accumulator needs to store the weighted sum and the count of all accumulated data. In this example, a class WeightedAvgAccum is defined as the accumulator. Although retract(), merge(), and resetAccumulator() are not required for many types of aggregations, they are implemented here as well.

/**
* Accumulator for WeightedAvg.
*/
public static class WeightedAvgAccum {
    public long sum = 0;
    public int count = 0;
}

/**
 * Weighted Average user-defined aggregate function.
 */
public static class WeightedAvg extends AggregateFunction<Long, WeightedAvgAccum> {

    @Override
    public WeightedAvgAccum createAccumulator() {
        return new WeightedAvgAccum();
    }

    @Override
    public Long getValue(WeightedAvgAccum acc) {
        if (acc.count == 0) {
            return null;
        } else {
            return acc.sum / acc.count;
        }
    }

    public void accumulate(WeightedAvgAccum acc, long iValue, int iWeight) {
        acc.sum += iValue * iWeight;
        acc.count += iWeight;
    }

    public void retract(WeightedAvgAccum acc, long iValue, int iWeight) {
        acc.sum -= iValue * iWeight;
        acc.count -= iWeight;
    }

    public void merge(WeightedAvgAccum acc, Iterable<WeightedAvgAccum> it) {
        Iterator<WeightedAvgAccum> iter = it.iterator();
        while (iter.hasNext()) {
            WeightedAvgAccum a = iter.next();
            acc.count += a.count;
            acc.sum += a.sum;
        }
    }

    public void resetAccumulator(WeightedAvgAccum acc) {
        acc.count = 0;
        acc.sum = 0L;
    }
}

// register function
StreamTableEnvironment tEnv = ...
tEnv.registerFunction("wAvg", new WeightedAvg());

// use function
tEnv.sqlQuery("SELECT user, wAvg(points, level) AS avgPoints FROM userScores GROUP BY user");

4. UDF best practices

4.1 Table API and SQL

   The internal code generator tries to work with primitive values as much as possible. A user-defined function can introduce considerable overhead through object creation, casting, and (un)boxing. It is therefore strongly recommended to declare parameter and result types as primitive types rather than their boxed classes. Types.DATE and Types.TIME can be represented as int, and Types.TIMESTAMP can be represented as long.
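As a minimal sketch of this advice (not from the original article; the class name AddMillis is illustrative), prefer primitive signatures over boxed ones:

import org.apache.flink.table.functions.ScalarFunction;

public static class AddMillis extends ScalarFunction {

    // recommended: primitive parameter and result types avoid (un)boxing and object creation
    public long eval(long timestamp, long millis) {
        return timestamp + millis;
    }

    // not recommended: boxed types introduce extra overhead
    // public Long eval(Long timestamp, Long millis) { ... }
}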

   It is recommended to write user-defined functions in Java rather than Scala, because some Scala types may not be compatible with Flink's type extractor.

4.2 Integrating UDFs with the runtime

   Sometimes a UDF needs to obtain global runtime information, or do some setup and cleanup before the actual work, for example opening and closing database connections. UDFs provide open() and close() methods that can be overridden; they serve a purpose similar to the corresponding methods of RichFunction in the DataSet and DataStream APIs.

   The open() method is called once before the evaluation methods are called for the first time. The close() method is called after the last call to the evaluation methods. The open() method provides a FunctionContext, which contains information about the context in which the UDF is executed, such as the metric group, distributed cache files, and global job parameters.

   The following information can be obtained by calling the corresponding methods of FunctionContext (a short sketch follows the list):

  • getMetricGroup(): the metric group of the parallel subtask;
  • getCachedFile(name): the local copy of a distributed cache file;
  • getJobParameter(name, defaultValue): the global job parameter for the given key;
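As an illustrative sketch (not from the original article; the metric name, cache file name, and parameter key below are hypothetical), a UDF's open() method might use FunctionContext like this:

import org.apache.flink.metrics.Counter;
import org.apache.flink.table.functions.FunctionContext;
import org.apache.flink.table.functions.ScalarFunction;

import java.io.File;

public class ContextAwareFunction extends ScalarFunction {
    private transient Counter evalCounter;

    @Override
    public void open(FunctionContext context) throws Exception {
        // metric group of this parallel subtask; "evalCalls" is a hypothetical metric name
        evalCounter = context.getMetricGroup().counter("evalCalls");

        // local copy of a distributed cache file registered under the hypothetical name "dictionary"
        File dictionary = context.getCachedFile("dictionary");

        // global job parameter with a default value
        String mode = context.getJobParameter("job.mode", "default");
    }

    public String eval(String s) {
        evalCounter.inc();
        return s.trim();
    }
}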

   The following example uses FunctionContext in a scalar function to obtain global job parameters. It reads the Redis configuration, opens a Redis connection, and then interacts with Redis in the evaluation method.

import org.apache.flink.table.functions.FunctionContext;
import org.apache.flink.table.functions.ScalarFunction;
import redis.clients.jedis.Jedis;
public class HashCode extends ScalarFunction {
    private int factor = 12;
    private Jedis jedis = null;

    public HashCode() {
        super();
    }

    public HashCode(int factor) {
        this.factor = factor;
    }

    @Override
    public void open(FunctionContext context) throws Exception {
        super.open(context);
        // read the Redis configuration from the global job parameters and open the connection
        String redisHost = context.getJobParameter("redis.host", "localhost");
        int redisPort = Integer.valueOf(context.getJobParameter("redis.port", "6379"));
        jedis = new Jedis(redisHost, redisPort);
    }

    @Override
    public void close() throws Exception {
        super.close();
        jedis.close();
    }

    public int eval(int s) {
        s = s % 3;
        if (s == 2) {
            return Integer.valueOf(jedis.get(String.valueOf(s)));
        } else {
            return 0;
        }
    }
}

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);

// set the global job parameters
Map<String, String> hashmap = new HashMap<>();
hashmap.put("redis.host", "localhost");
hashmap.put("redis.port", "6379");
ParameterTool parameter = ParameterTool.fromMap(hashmap);
env.getConfig().setGlobalJobParameters(parameter);

// register the function
tableEnv.registerFunction("hashCode", new HashCode());

// use the function in Java Table API
myTable.select("string, string.hashCode(), hashCode(string)");

// use the function in SQL
tableEnv.sqlQuery("SELECT string, HASHCODE(string) FROM MyTable");



Author: Albert_Yume
Link: https://www.jianshu.com/p/5dc2cab91c78
