Hive user-defined functions
I. Introduction
1.1 Introduction
Hive ships with built-in functions such as max and min, but they do not cover every business requirement. When the built-in functions cannot meet your processing needs, you can extend Hive with a user-defined function (UDF).
User-defined functions fall into three categories:
(1) UDF (User-Defined Function)
One row in, one row out.
(2) UDAF (User-Defined Aggregation Function)
Many rows in, one row out; an aggregate function, similar to count/max/min.
(3) UDTF (User-Defined Table-Generating Function)
One row in, many rows out; a table-generating function, such as lateral view explode().
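The three shapes can be illustrated without any Hive classes. The following is only a plain-Java analogy: the class FunctionShapes and its methods are hypothetical names, and real Hive functions work on rows rather than on Java collections.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java analogy only: no Hive classes are used here.
class FunctionShapes {
    // UDF shape: one value in, one value out (null passes through).
    static String lower(String s) {
        return s == null ? null : s.toLowerCase();
    }

    // UDAF shape: many values in, one value out (like max).
    static int max(List<Integer> values) {
        int best = Integer.MIN_VALUE;
        for (int v : values) {
            best = Math.max(best, v);
        }
        return best;
    }

    // UDTF shape: one value in, many rows out (like explode on a delimited string).
    static List<String> explode(String csv) {
        return Arrays.asList(csv.split(","));
    }

    public static void main(String[] args) {
        System.out.println(lower("aBcDeF"));
        System.out.println(max(Arrays.asList(3, 9, 4)));
        System.out.println(explode("a,b,c"));
    }
}
```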
1.2 Preparation
Create a Maven project and add the following dependencies:
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.6.0-cdh5.14.2</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>1.1.0-cdh5.14.2</version>
</dependency>
</dependencies>
Add the build plugins used for packaging:
<build>
<pluginManagement><!-- lock down plugins versions to avoid using Maven defaults (may be moved to parent pom) -->
<plugins>
<!-- clean lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#clean_Lifecycle -->
<plugin>
<artifactId>maven-clean-plugin</artifactId>
<version>3.1.0</version>
</plugin>
<!-- default lifecycle, jar packaging: see https://maven.apache.org/ref/current/maven-core/default-bindings.html#Plugin_bindings_for_jar_packaging -->
<plugin>
<artifactId>maven-resources-plugin</artifactId>
<version>3.0.2</version>
</plugin>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.0</version>
</plugin>
<plugin>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.22.1</version>
</plugin>
<plugin>
<artifactId>maven-jar-plugin</artifactId>
<version>3.0.2</version>
</plugin>
<plugin>
<artifactId>maven-install-plugin</artifactId>
<version>2.5.2</version>
</plugin>
<plugin>
<artifactId>maven-deploy-plugin</artifactId>
<version>2.8.2</version>
</plugin>
<!-- site lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#site_Lifecycle -->
<plugin>
<artifactId>maven-site-plugin</artifactId>
<version>3.7.1</version>
</plugin>
<plugin>
<artifactId>maven-project-info-reports-plugin</artifactId>
<version>3.0.0</version>
</plugin>
</plugins>
</pluginManagement>
</build>
II. Classification
2.1 UDF
2.1.1 Programming steps
A UDF must have a return type. It may return null, but the return type cannot be void.
- Extend org.apache.hadoop.hive.ql.exec.UDF
- Implement the evaluate method; evaluate can be overloaded
- Create the function in the Hive command-line window
a) Add the jar
add jar linux_jar_path;
b) Create the function
create [temporary] function [dbname.]function_name AS class_name;
- Call the function
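Overloading works because Hive looks up evaluate by reflection, based on the argument types of the call. The sketch below imitates that lookup in plain Java. OverloadedEvaluate and call are hypothetical names, the class does not extend Hive's UDF, and Hive's real resolver also converts between types, which is omitted here.

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for a UDF class; it does not extend Hive's UDF.
class OverloadedEvaluate {
    // Overload for string input: lowercase the value.
    public String evaluate(String s) {
        return s == null ? null : s.toLowerCase();
    }

    // Overload for integer input: absolute value.
    public Integer evaluate(Integer n) {
        return n == null ? null : Math.abs(n);
    }

    // Pick the evaluate overload whose parameter type equals the argument's class.
    // The argument must be non-null in this sketch, since its class drives the lookup.
    static Object call(Object udf, Object arg) {
        try {
            Method m = udf.getClass().getMethod("evaluate", arg.getClass());
            return m.invoke(udf, arg);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("no matching evaluate method", e);
        }
    }

    public static void main(String[] args) {
        OverloadedEvaluate udf = new OverloadedEvaluate();
        System.out.println(call(udf, "aBcDeF"));
        System.out.println(call(udf, -5));
    }
}
```

Each overload returns a value, in line with the rule that evaluate must not be void.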
2.1.2 Example
This example writes a function that converts the input string to lowercase.
(1) Create a class
package cn.kgc.hive.func;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class MyUDF extends UDF {
    // Implement the evaluate method
    public Text evaluate(Text s) {
        // A null input yields a null result
        if (s == null) {
            return null;
        }
        // Convert the string s to lowercase
        return new Text(s.toString().toLowerCase());
    }
}
(2) Package the project as a jar
(3) Rename the jar and copy it to a local directory on the Linux host
(4) Add the jar to Hive's classpath
0: jdbc:hive2://single:10000> add jar /root/jar/low.jar;
No rows affected (0.029 seconds)
(5) Create a temporary function bound to the Java class
0: jdbc:hive2://single:10000> create temporary function myLow as "cn.kgc.hive.func.MyUDF";
INFO : Compiling command(queryId=root_20201215173535_6269f4db-af00-4053-9c3a-b0e27c03157a): create temporary function myLow as "cn.kgc.hive.func.MyUDF"
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=root_20201215173535_6269f4db-af00-4053-9c3a-b0e27c03157a); Time taken: 0.001 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=root_20201215173535_6269f4db-af00-4053-9c3a-b0e27c03157a): create temporary function myLow as "cn.kgc.hive.func.MyUDF"
INFO : Starting task [Stage-0:FUNC] in serial mode
INFO : Completed executing command(queryId=root_20201215173535_6269f4db-af00-4053-9c3a-b0e27c03157a); Time taken: 0.001 seconds
INFO : OK
No rows affected (0.015 seconds)
(6) Use the custom function
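Steps (4) to (6) can also be driven from Java over JDBC instead of beeline. The following is a minimal sketch, assuming the Hive JDBC driver is on the classpath and HiveServer2 is reachable at the URL used in the session above. The class UdfJdbcDemo, its method names, and the user name root with an empty password are assumptions made for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Arrays;
import java.util.List;

class UdfJdbcDemo {
    // Statements for steps (4) and (5), in the order Hive needs them.
    static List<String> setupStatements(String jarPath, String functionName, String className) {
        return Arrays.asList(
                "add jar " + jarPath,
                "create temporary function " + functionName + " as '" + className + "'");
    }

    // Registers the function in this session, then calls it once (step 6).
    static String callMyLow(Connection conn, String input) throws SQLException {
        try (Statement st = conn.createStatement()) {
            for (String sql : setupStatements("/root/jar/low.jar", "myLow", "cn.kgc.hive.func.MyUDF")) {
                st.execute(sql);
            }
        }
        try (PreparedStatement ps = conn.prepareStatement("select myLow(?)")) {
            ps.setString(1, input);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length == 0) {
            // Without a JDBC URL, only show what would be sent to Hive.
            setupStatements("/root/jar/low.jar", "myLow", "cn.kgc.hive.func.MyUDF")
                    .forEach(System.out::println);
            return;
        }
        // Example argument: jdbc:hive2://single:10000 (user and password are assumptions).
        try (Connection conn = DriverManager.getConnection(args[0], "root", "")) {
            System.out.println(callMyLow(conn, "aBcDeF"));
        }
    }
}
```

Because the function is temporary, the registration and the query have to run on the same connection.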
2.2 UDAF
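A UDAF reduces many rows to one value, and the work may be split across several tasks whose partial results are merged. Hive's classic UDAF evaluator follows an init, iterate, terminatePartial, merge, terminate lifecycle. The sketch below imitates that lifecycle in plain Java for a max function; MaxEvaluatorSketch and maxAcross are hypothetical names and no Hive classes are used, so this is not a deployable UDAF.

```java
// Plain-Java imitation of the evaluator lifecycle; no Hive classes are used.
class MaxEvaluatorSketch {
    private Integer max;

    // Reset the state before a new group of rows.
    void init() {
        max = null;
    }

    // Consume one input row; null values are ignored.
    boolean iterate(Integer value) {
        if (value != null && (max == null || value > max)) {
            max = value;
        }
        return true;
    }

    // Hand the partial result of one task to the merging side.
    Integer terminatePartial() {
        return max;
    }

    // Fold in a partial result produced by another task.
    boolean merge(Integer partial) {
        return iterate(partial);
    }

    // Produce the final value for the group.
    Integer terminate() {
        return max;
    }

    // Simulates several tasks, each aggregating its own split, then one merge step.
    static Integer maxAcross(int[]... splits) {
        MaxEvaluatorSketch reducer = new MaxEvaluatorSketch();
        reducer.init();
        for (int[] split : splits) {
            MaxEvaluatorSketch mapper = new MaxEvaluatorSketch();
            mapper.init();
            for (int v : split) {
                mapper.iterate(v);
            }
            reducer.merge(mapper.terminatePartial());
        }
        return reducer.terminate();
    }

    public static void main(String[] args) {
        System.out.println(maxAcross(new int[] {3, 9}, new int[] {4}));
    }
}
```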
2.3 UDTF
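A UDTF turns one input row into zero or more output rows. In Hive's GenericUDTF, process receives a row and forward emits each output row. The sketch below imitates only that process/forward pattern in plain Java; SplitUdtfSketch, its collector, and the comma delimiter are assumptions for illustration, and a real UDTF also has to declare its output columns in initialize.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Plain-Java imitation of the process/forward pattern; no Hive classes are used.
class SplitUdtfSketch {
    private final Consumer<String[]> collector;

    SplitUdtfSketch(Consumer<String[]> collector) {
        this.collector = collector;
    }

    // Called once per input row; emits one output row per comma-separated part.
    void process(String line) {
        if (line == null) {
            return; // a null input produces no output rows
        }
        for (String part : line.split(",")) {
            forward(new String[] {part.trim()});
        }
    }

    // Emit one output row to whoever collects the results.
    void forward(String[] row) {
        collector.accept(row);
    }

    // Convenience wrapper: collect the single column of every emitted row.
    static List<String> explode(String line) {
        List<String> out = new ArrayList<>();
        new SplitUdtfSketch(row -> out.add(row[0])).process(line);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(explode("a, b,c"));
    }
}
```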
III. Extensions
3.1 Ways to add a jar
(1) Run add jar inside the Hive session. The jar is registered only for the current session and is dropped when Hive exits, so the command has to be run again every time Hive is started.
add jar /home/hadoop/DefTextInputFormat.jar;
Added [/home/hadoop/DefTextInputFormat.jar] to class path
Added resources: [/home/hadoop/DefTextInputFormat.jar]
(2) Set hive.aux.jars.path in hive-site.xml
With this method the jar does not have to be added by hand each time Hive starts; it only needs the following configuration entry:
<property>
<name>hive.aux.jars.path</name>
<value>file:///home/hadoop/DefTextInputFormat.jar,file:///jarpath/test.jar</value>
</property>
(3) Create an auxlib folder
Create a folder named auxlib under the root directory ${HIVE_HOME}, then put the custom jar file into this folder.
This method is quick and convenient, but it only takes effect for the Hive installation that owns the folder; it cannot be applied from a client's own environment.
3.2 Custom temporary and permanent functions
(1) Temporary function: valid only for the current session (the terminal window)
Once the Hive client exits, the custom temporary function is deleted.
- Add the jar
hive> add jar /root/jar/low.jar;
- Create the custom temporary function
hive> create temporary function myLow as "cn.kgc.hive.func.MyUDF";
- Use the custom temporary function
hive> select myLow("aBcDeF");
- The function can still be used normally after switching to another database
- Delete the custom temporary function
hive> drop temporary function myLow;
(2) Permanent function
- Upload to HDFS first
[root@single jar]# hdfs dfs -put low.jar /UDF
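- Create the permanent function
create function myLow as 'cn.kgc.hive.func.MyUDF' using jar 'hdfs:///UDF/low.jar';
- Unlike a temporary function, a permanent function belongs to the database in which it was created. An unqualified call from another database fails; there the function has to be called with its database name, in the form dbname.myLow('aB').
- Test
- Test after switching to another database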
0: jdbc:hive2://single:10000> use demo;
0: jdbc:hive2://single:10000> select myLow('aB');
Error: Error while compiling statement: FAILED: SemanticException [Error 10011]: Line 1:7 Invalid function 'myLow' (state=42000,code=10011)
- Delete the custom permanent function
0: jdbc:hive2://single:10000> drop function myLow;