Case analysis | Applying custom functions in Spark and Hive

Abstract: Spark currently supports three types of custom functions: UDF, UDTF, and UDAF.

1. Introduction

Spark currently supports three types of custom functions: UDF, UDTF, and UDAF. A UDF takes one row as input and returns one result (one-to-one); for example, a function that takes an IP address and returns the corresponding province. A UDTF takes one row and returns multiple rows (one-to-many). It is a Hive concept: Spark SQL itself has no UDTF, but flatMap in Spark achieves the same effect. A UDAF takes multiple rows and returns one row (many-to-one). It is used for aggregation, typically together with groupBy, in the same way as count and sum. Those aggregate functions come with Spark; a custom UDAF covers more complex aggregation logic and is correspondingly more complicated to write.
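The three shapes can be illustrated without Spark at all. The following is a minimal plain-Java analogy, not Spark or Hive API: the class name, the method names, and the IP prefixes are all hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java analogy for the three function shapes; not Spark or Hive API.
class FunctionShapes {
    // UDF shape: one input row, one output value.
    // The prefixes below are made up; a real function would query an IP database.
    static String ipToProvince(String ip) {
        if (ip.startsWith("10.1.")) return "Guangdong";
        if (ip.startsWith("10.2.")) return "Zhejiang";
        return "unknown";
    }

    // UDTF shape: one input row, many output rows (this is what flatMap does).
    static List<String> explodeTags(List<String> rows) {
        return rows.stream()
                .flatMap(row -> Arrays.stream(row.split(",")))
                .collect(Collectors.toList());
    }

    // UDAF shape: many input rows, one output value.
    static long sum(List<Long> rows) {
        return rows.stream().mapToLong(Long::longValue).sum();
    }
}
```

Here explodeTags turns the two rows "a,b" and "c" into the three rows "a", "b", and "c", while sum collapses any number of rows into a single value.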

Under the hood, Spark wraps each function in a CatalogFunction structure, in which FunctionIdentifier holds basic information such as the function name and FunctionResource holds the resource type (jar or file) and its path. Spark's SessionCatalog provides the interfaces for registering, dropping, and looking up functions. When Spark receives a SQL request that calls a function, it finds the jar path and class name in the cached CatalogFunction information, the JVM loads the jar dynamically, and the function is executed through reflection on the class name.
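The register, look up, load, and reflect sequence can be sketched in plain Java. This is only an illustration under stated assumptions: FunctionEntry, FunctionRegistry, and UpperCaseFunction are hypothetical stand-ins, not Spark's CatalogFunction or SessionCatalog, and the example passes no jar URLs so that it runs on its own.

```java
import java.io.IOException;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the metadata kept per function.
class FunctionEntry {
    final String name;       // function name (the role of FunctionIdentifier)
    final String className;  // class that implements the function
    final URL[] resources;   // jar locations (the role of FunctionResource)

    FunctionEntry(String name, String className, URL[] resources) {
        this.name = name;
        this.className = className;
        this.resources = resources;
    }
}

// Hypothetical registry mimicking register / drop / lookup on a catalog.
class FunctionRegistry {
    private final Map<String, FunctionEntry> cache = new HashMap<>();

    void register(FunctionEntry entry) { cache.put(entry.name, entry); }

    void drop(String name) { cache.remove(name); }

    // Look up the cached entry, load its class, and call evaluate reflectively.
    Object call(String name, String arg) {
        FunctionEntry entry = cache.get(name);
        if (entry == null) {
            throw new IllegalArgumentException("Undefined function: " + name);
        }
        // A class loader over the registered jars; with no jars it simply
        // delegates to the parent loader.
        try (URLClassLoader loader = new URLClassLoader(
                entry.resources, FunctionRegistry.class.getClassLoader())) {
            Class<?> clazz = Class.forName(entry.className, true, loader);
            Object instance = clazz.getDeclaredConstructor().newInstance();
            Method evaluate = clazz.getMethod("evaluate", String.class);
            return evaluate.invoke(instance, arg);
        } catch (ReflectiveOperationException | IOException e) {
            throw new IllegalStateException("Cannot run function " + name, e);
        }
    }
}

// Example function class; in practice it would live in the registered jar.
class UpperCaseFunction {
    public String evaluate(String input) { return input.toUpperCase(); }
}
```

Registering an entry whose className is UpperCaseFunction's class name and then calling it with "spark" returns "SPARK"; calling a dropped or unknown name raises an IllegalArgumentException.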

Figure 1. CatalogFunction structure

Figure 2. Function registration and loading logic

HiveSessionCatalog extends Spark's SessionCatalog and decorates Spark's base functionality to adapt it to Hive, including function handling. HiveSimpleUDF corresponds to UDF, HiveGenericUDF corresponds to GenericUDF, HiveUDAFFunction corresponds to AbstractGenericUDAFResolver and UDAF, and HiveGenericUDTF corresponds to GenericUDTF.

Figure 3. Logic of decorating Spark functions for Hive

2. UDF

UDF is the most commonly used type of function and is relatively simple to use. There are two kinds: for simple data types, extend the UDF class; for complex data types such as Map, List, and Struct, extend the GenericUDF class.

When implementing a UDF for simple types, you can define several methods named evaluate, choosing the parameter and return types as needed. By default the UDF class uses DefaultUDFMethodResolver as its method resolver, which finds the method by reflection, based on the types of the arguments the user passes and the hard-coded name evaluate. Users can also supply their own resolver.
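The lookup described above can be approximated in plain Java. The sketch below is a simplification and not Hive's DefaultUDFMethodResolver: LengthOrDouble and EvaluateResolver are hypothetical names, the sample class does not extend Hive's UDF, matching is done on Java classes rather than Hive type information, and arguments are assumed to be non-null.

```java
import java.lang.reflect.Method;

// Hypothetical UDF-style class with overloaded evaluate methods.
// It only mimics the naming convention; it does not extend Hive's UDF.
class LengthOrDouble {
    public Integer evaluate(String s) { return s == null ? null : s.length(); }
    public Integer evaluate(Integer n) { return n == null ? null : n * 2; }
    public String evaluate(String a, String b) { return a + b; }
}

// Sketch of a resolver that picks the evaluate overload matching the arguments.
class EvaluateResolver {
    static final String METHOD_NAME = "evaluate"; // the hard-coded name

    static Method resolve(Class<?> udfClass, Class<?>... argTypes) {
        for (Method m : udfClass.getMethods()) {
            if (!m.getName().equals(METHOD_NAME)) continue;
            Class<?>[] params = m.getParameterTypes();
            if (params.length != argTypes.length) continue;
            boolean matches = true;
            for (int i = 0; i < params.length; i++) {
                if (!params[i].isAssignableFrom(argTypes[i])) {
                    matches = false;
                    break;
                }
            }
            if (matches) return m;
        }
        throw new IllegalArgumentException("No matching evaluate method");
    }

    // Derive the argument types from the actual values, then invoke the match.
    static Object invoke(Object udf, Object... args) {
        Class<?>[] argTypes = new Class<?>[args.length];
        for (int i = 0; i < args.length; i++) {
            argTypes[i] = args[i].getClass(); // assumes non-null arguments
        }
        try {
            return resolve(udf.getClass(), argTypes).invoke(udf, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this sketch, invoking the sample class with a single String selects the length overload, an Integer selects the doubling overload, and two Strings select the concatenating overload.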

Figure 4. Simple example of a custom UDF

Figure 5. Default UDF method resolver

3. UDAF

UDAF is an aggregate function. There are currently three main ways to implement one. The first is to implement the UDAF interface, the older and simpler approach, which is now deprecated. The second is to implement UserDefinedAggregateFunction, currently the more common approach, in which the aggregation is implemented stage by stage through the interface's methods. The third is to implement AbstractGenericUDAFResolver, which is slightly more complicated than UserDefinedAggregateFunction because an Evaluator (such as GenericUDAFEvaluator) must also be implemented; in that case the UDAF's processing logic lives mainly in the Evaluator.

A UserDefinedAggregateFunction defines the input and output data structures and implements four steps: initializing the buffer (initialize), aggregating a single input row into the buffer (update), merging buffers (merge), and computing the final result (evaluate).
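The four stages can be sketched in plain Java with an average as the aggregate. This is an illustration only: AverageAggregate and its Buffer are hypothetical, the class does not extend Spark's UserDefinedAggregateFunction, and the schema definitions for input, buffer, and output are left out.

```java
import java.util.List;

// Plain-Java sketch of the four aggregation stages. It mirrors the method
// names in the text but does not extend UserDefinedAggregateFunction.
class AverageAggregate {
    // Aggregation buffer: running sum and row count.
    static class Buffer {
        double sum;
        long count;
    }

    // initialize: produce an empty buffer.
    Buffer initialize() { return new Buffer(); }

    // update: fold one input row into a buffer.
    void update(Buffer buffer, double value) {
        buffer.sum += value;
        buffer.count += 1;
    }

    // merge: combine two partial buffers, for example from two partitions.
    void merge(Buffer target, Buffer other) {
        target.sum += other.sum;
        target.count += other.count;
    }

    // evaluate: compute the final result from the merged buffer.
    Double evaluate(Buffer buffer) {
        return buffer.count == 0 ? null : buffer.sum / buffer.count;
    }

    // Helper that aggregates one partition of rows into a partial buffer.
    Buffer aggregatePartition(List<Double> rows) {
        Buffer buffer = initialize();
        for (double row : rows) {
            update(buffer, row);
        }
        return buffer;
    }
}
```

The split between update and merge is what allows each partition to be aggregated independently: the partial buffers carry the sum and the count rather than a partial average, so merging them stays exact.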

Figure 6. Simple example of custom UDAF

4. UDTF

Put simply, a UDTF is a function that turns one row into multiple rows. It can generate multiple rows and multiple columns, which is why it is also called a table-generating function. The current way to write one is to extend GenericUDTF and implement two main methods: initialize, which validates the arguments and defines the output columns, and process, which accepts one row of data and splits it.
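The division of work between initialize and process can be sketched in plain Java. The class below is hypothetical and does not extend GenericUDTF; the input format "a:1;b:2", the column names key and value, and the list used as a collector are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of a table-generating function. It mirrors the
// initialize/process split but does not extend GenericUDTF.
class SplitPairsFunction {
    private final List<String[]> output = new ArrayList<>();

    // initialize: validate the arguments and declare the output columns.
    String[] initialize(int argumentCount) {
        if (argumentCount != 1) {
            throw new IllegalArgumentException("Expected exactly one argument");
        }
        return new String[] {"key", "value"};
    }

    // process: take one input row such as "a:1;b:2" and emit one row per pair.
    void process(String row) {
        for (String pair : row.split(";")) {
            String[] parts = pair.split(":", 2);
            if (parts.length == 2) {
                forward(parts);
            }
        }
    }

    // forward: hand one generated row to the collector (a list in this sketch).
    private void forward(String[] generatedRow) { output.add(generatedRow); }

    List<String[]> rows() { return output; }
}
```

Processing the single row "a:1;b:2" yields two output rows with two columns each, which is the one-to-many, multi-column behaviour described above.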

Figure 7. Simple example of custom UDTF

 


Origin blog.csdn.net/devcloud/article/details/108594535