UDF? UDAF? UDTF? You will understand after reading it!

I. Introduction

1.1 Introduction

Hive ships with built-in functions such as max and min, but they may not cover every special business need. When the built-in functions provided by Hive cannot meet your processing requirements, you can easily extend Hive with user-defined functions (UDFs).

User-defined functions fall into the following three categories:

(1) UDF (User-Defined Function)
One row in, one row out.
(2) UDAF (User-Defined Aggregation Function)
Many rows in, one row out; an aggregate function, similar to count/max/min.
(3) UDTF (User-Defined Table-Generating Function)
One row in, many rows out; a table-generating function, e.g. lateral view explode().
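The three categories map directly onto familiar built-ins. A HiveQL sketch, assuming a hypothetical table `t` with a string column `s` and an array column `arr`:

```sql
-- UDF: one row in, one row out
SELECT lower(s) FROM t;

-- UDAF: many rows in, one row out
SELECT count(*), max(s) FROM t;

-- UDTF: one row in, many rows out
SELECT word FROM t LATERAL VIEW explode(arr) tmp AS word;
```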

1.2 Preparation

Create a Maven project and add the following dependencies:

  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.6.0-cdh5.14.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>1.1.0-cdh5.14.2</version>
    </dependency>
  </dependencies>

Add the build plugin configuration:

<build>
    <pluginManagement><!-- lock down plugins versions to avoid using Maven defaults (may be moved to parent pom) -->
      <plugins>
        <!-- clean lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#clean_Lifecycle -->
        <plugin>
          <artifactId>maven-clean-plugin</artifactId>
          <version>3.1.0</version>
        </plugin>
        <!-- default lifecycle, jar packaging: see https://maven.apache.org/ref/current/maven-core/default-bindings.html#Plugin_bindings_for_jar_packaging -->
        <plugin>
          <artifactId>maven-resources-plugin</artifactId>
          <version>3.0.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>3.8.0</version>
        </plugin>
        <plugin>
          <artifactId>maven-surefire-plugin</artifactId>
          <version>2.22.1</version>
        </plugin>
        <plugin>
          <artifactId>maven-jar-plugin</artifactId>
          <version>3.0.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-install-plugin</artifactId>
          <version>2.5.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-deploy-plugin</artifactId>
          <version>2.8.2</version>
        </plugin>
        <!-- site lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#site_Lifecycle -->
        <plugin>
          <artifactId>maven-site-plugin</artifactId>
          <version>3.7.1</version>
        </plugin>
        <plugin>
          <artifactId>maven-project-info-reports-plugin</artifactId>
          <version>3.0.0</version>
        </plugin>
      </plugins>
    </pluginManagement>
</build>

II. Classification

2.1 UDF

2.1.1 Programming steps

A UDF must have a return type. It may return null, but the return type cannot be void.

  1. Extend org.apache.hadoop.hive.ql.exec.UDF
  2. Implement the evaluate method; evaluate supports overloading
  3. Create the function in the Hive command-line window
    a) add the jar
    add jar linux_jar_path;
    b) create the function
    create [temporary] function [dbname.]function_name as class_name;
  4. Call the function
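Put together, a typical session looks like this (the jar path, function name, class name, and table name are placeholders, not from the example below):

```sql
add jar /path/to/myudf.jar;
create temporary function my_func as 'com.example.MyUDF';
select my_func(col) from some_table;
```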

2.1.2 Example

Here we write a function that converts an input string to lowercase.
(1) Create the class

package cn.kgc.hive.func;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class MyUDF extends UDF {

    // Implement the evaluate method
    public Text evaluate(Text s) {
        if (s == null) {
            return null;
        }
        // Convert the string s to lowercase
        return new Text(s.toString().toLowerCase());
    }
}
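The evaluate logic above can be unit-tested without a Hive deployment. A minimal plain-Java sketch, using String in place of Hadoop's Text wrapper (the class name here is illustrative, not part of Hive's API):

```java
// Plain-Java mirror of MyUDF.evaluate for local testing.
// Uses String instead of org.apache.hadoop.io.Text, so no Hive/Hadoop jars are needed.
public class MyLowerLogic {

    // Null-safe lowercase conversion, same contract as the UDF above:
    // a null input yields null rather than an exception.
    public static String evaluate(String s) {
        if (s == null) {
            return null;
        }
        return s.toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(evaluate("aBcDeF")); // prints "abcdef"
    }
}
```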

(2) Package it into a jar
(3) Rename the jar and copy it to a local Linux directory
(4) Add the jar package to the classpath of hive

0: jdbc:hive2://single:10000> add jar /root/jar/low.jar;
No rows affected (0.029 seconds)

(5) Create a temporary function to associate with the developed java class

0: jdbc:hive2://single:10000> create temporary function myLow as "cn.kgc.hive.func.MyUDF";
INFO  : Compiling command(queryId=root_20201215173535_6269f4db-af00-4053-9c3a-b0e27c03157a): create temporary function myLow as "cn.kgc.hive.func.MyUDF"
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=root_20201215173535_6269f4db-af00-4053-9c3a-b0e27c03157a); Time taken: 0.001 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=root_20201215173535_6269f4db-af00-4053-9c3a-b0e27c03157a): create temporary function myLow as "cn.kgc.hive.func.MyUDF"
INFO  : Starting task [Stage-0:FUNC] in serial mode
INFO  : Completed executing command(queryId=root_20201215173535_6269f4db-af00-4053-9c3a-b0e27c03157a); Time taken: 0.001 seconds
INFO  : OK
No rows affected (0.015 seconds)

(6) Use the custom function
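For example, continuing with the myLow function registered above:

```sql
select myLow('aBcDeF');
-- returns: abcdef
```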


2.2 UDAF
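A UDAF in Hive follows an init / iterate / terminatePartial / merge / terminate lifecycle (the legacy org.apache.hadoop.hive.ql.exec.UDAF interface, or the newer GenericUDAFEvaluator). A dependency-free Java sketch of that lifecycle for a max aggregator — the class name is illustrative, and this is not Hive's actual API:

```java
// Dependency-free sketch of the UDAF lifecycle Hive drives for each group:
// init() -> iterate(row) ... -> merge(partial) ... -> terminate().
public class MaxAggregatorSketch {

    private Integer max; // partial aggregation state

    // Called once per group before any rows are fed in.
    public void init() {
        max = null;
    }

    // Called once per input row (the "many in" side of many-in-one-out).
    public boolean iterate(Integer value) {
        if (value != null && (max == null || value > max)) {
            max = value;
        }
        return true;
    }

    // Called to combine partial results produced by other tasks.
    public boolean merge(Integer partial) {
        return iterate(partial);
    }

    // Called once per group to produce the single output value.
    public Integer terminate() {
        return max;
    }

    public static void main(String[] args) {
        MaxAggregatorSketch agg = new MaxAggregatorSketch();
        agg.init();
        agg.iterate(3);
        agg.iterate(7);
        agg.iterate(5);
        System.out.println(agg.terminate()); // prints 7
    }
}
```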


2.3 UDTF
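A UDTF (e.g. the built-in explode) turns one input row into many output rows; Hive's GenericUDTF does this via initialize / process / forward / close. A dependency-free Java sketch of the one-in-many-out idea (class and method names are illustrative, not Hive's API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Dependency-free sketch of the one-in-many-out behavior of a UDTF:
// for each input row, process() emits zero or more output rows, mirroring
// how GenericUDTF.process() calls forward() once per emitted row.
public class SplitExplodeSketch {

    // Splits one delimited string into many rows, conceptually like
    // `lateral view explode(split(col, ','))` in HiveQL.
    public static List<String> process(String row, String delimiter) {
        List<String> out = new ArrayList<>();
        if (row == null) {
            return out; // no rows forwarded for a null input
        }
        out.addAll(Arrays.asList(row.split(delimiter)));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(process("a,b,c", ",")); // prints [a, b, c]
    }
}
```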


III. Extensions

3.1 Ways to add the jar package

(1) With this method, the jar must be re-added every time Hive starts, and it becomes invalid after exiting Hive. Run inside Hive:

add jar /home/hadoop/DefTextInputFormat.jar;
Added [/home/hdfs/DefTextInputFormat.jar] to class path
Added resources: [/home/hdfs/DefTextInputFormat.jar]

(2) Configure hive.aux.jars.path in hive-site.xml

With this method there is no need to run the add jar command on every Hive startup; it only requires a configuration entry:

<property>
  <name>hive.aux.jars.path</name>
  <value>file:///home/hadoop/DefTextInputFormat.jar,file:///jarpath/test.jar</value>
</property>

(3) Create a folder

Create a folder named auxlib under ${HIVE_HOME} and put the custom jar files into it.
This method is convenient and fast, but it does not take effect for client operating environments.


3.2 Custom temporary/permanent functions

(1) Temporary functions are valid only for the current session

Once the Hive client exits, the custom temporary function is dropped.

  1. Add jar package
    hive> add jar /root/jar/low.jar;
  2. Create a custom temporary function
    hive> create temporary function myLow as "cn.kgc.hive.func.MyUDF";
  3. Use custom temporary function
    hive> select myLow("aBcDeF");
  4. Switch to another database; the function can still be used normally
  5. Delete custom temporary function
    hive> drop temporary function myLow;

(2) Permanent function

  1. Upload the jar to HDFS first
[root@single jar]# hdfs dfs -put low.jar /UDF
  2. Create the permanent function
create function myLow as 'cn.kgc.hive.func.MyUDF' using jar 'hdfs:///UDF/low.jar';
  3. Unlike custom temporary functions, a permanent function can only be used in the database it was created in
  4. Test the function
  5. Switch databases and test again
0: jdbc:hive2://single:10000> use demo;
0: jdbc:hive2://single:10000> select myLow('aB');
Error: Error while compiling statement: FAILED: SemanticException [Error 10011]: Line 1:7 Invalid function 'myLow' (state=42000,code=10011)
  6. Delete the custom permanent function
0: jdbc:hive2://single:10000> drop function myLow;


Origin blog.csdn.net/weixin_48482704/article/details/111190847