Getting Started with Hive UDFs (User-Defined Functions)

1. Introduction

Hive has three types of UDFs: (plain) UDFs, user-defined aggregate functions (UDAFs), and user-defined table-generating functions (UDTFs).

  • UDF: operates on a single data row and produces a single output row. Most functions, such as the math and string functions, fall into this category.
  • UDAF: accepts multiple input rows and produces one output row, e.g. the COUNT and MAX functions.
  • UDTF: operates on a single data row and generates multiple output rows (i.e., a table), e.g. the explode function used with LATERAL VIEW.
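To make the UDTF case concrete, here is a minimal HiveQL sketch. The table t and its columns (id, tags) are hypothetical, used only for illustration:

```sql
-- Assume a hypothetical table t(id INT, tags ARRAY<STRING>).
-- explode() turns each element of the array into its own output row,
-- and LATERAL VIEW joins those rows back to the source row.
SELECT id, tag
FROM t
LATERAL VIEW explode(tags) tmp AS tag;
-- One input row (1, array('x', 'y', 'z')) yields three output rows:
-- (1, 'x'), (1, 'y'), (1, 'z')
```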

2. Writing a UDF

Before developing a Hive UDF, we need to add one dependency: the hive-exec jar.

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>2.1.1</version>
    <scope>provided</scope>
</dependency>

The default scope is compile, which means the project needs this jar on the classpath during compilation, testing, and running. Here the scope is set to provided, meaning the jar is only required during the compile and test phases: at runtime Hive already supplies it, so there is no need to ship it again.

Next, we need to implement the UDF interface. Hive currently offers two main interfaces for writing UDFs: UDF and GenericUDF.

  • UDF: a relatively simple interface; the base class to extend is org.apache.hadoop.hive.ql.exec.UDF.
  • GenericUDF: more complex, mainly intended to give better control over type checking. The base class to extend is org.apache.hadoop.hive.ql.udf.generic.GenericUDF.
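For comparison, here is a minimal GenericUDF sketch (a hypothetical GenericStrip class, not part of the original example). It needs hive-exec on the classpath, so it is shown only to illustrate the shape of the interface: initialize does the type checking, evaluate does the work, and getDisplayString renders the call in explain plans.

```java
package com.scb.dss.udf;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class GenericStrip extends GenericUDF {

    // Called once per query: validate argument count/types and declare the return type.
    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length != 1) {
            throw new UDFArgumentException("strip() takes exactly one argument");
        }
        return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    }

    // Called once per row.
    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        Object value = arguments[0].get();
        return value == null ? null : value.toString().trim();
    }

    // How the call is rendered in EXPLAIN output.
    @Override
    public String getDisplayString(String[] children) {
        return "strip(" + children[0] + ")";
    }
}
```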

Next, let's look at a simple UDF implementation class, Strip:

package com.scb.dss.udf;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;

@Description(name = "strip",
        value = "_FUNC_(str) - Removes the leading and trailing space characters from str.")
public class Strip extends UDF {

    // Remove leading and trailing whitespace from str
    public String evaluate(String str) {
        if (str == null) {
            return null;
        }
        return StringUtils.strip(str);
    }

    // Remove any of the characters in the set stripChars from the beginning and end of str
    public String evaluate(String str, String stripChars) {
        if (str == null) {
            return null;
        }
        return StringUtils.strip(str, stripChars);
    }
}

The Strip class has two evaluate methods. The first removes the spaces at the beginning and end of str, while the second removes any characters belonging to the set stripChars from the beginning and end of str.
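To illustrate the semantics without the commons-lang dependency, here is a plain-Java sketch of the same stripping logic (the class StripSketch and its helper are hypothetical, written only to mirror the behavior of StringUtils.strip):

```java
public class StripSketch {

    // Strip any character contained in stripChars from both ends of str;
    // a null stripChars means "strip whitespace", mirroring StringUtils.strip.
    static String strip(String str, String stripChars) {
        if (str == null) {
            return null;
        }
        int start = 0;
        int end = str.length();
        while (start < end && shouldStrip(stripChars, str.charAt(start))) {
            start++;
        }
        while (end > start && shouldStrip(stripChars, str.charAt(end - 1))) {
            end--;
        }
        return str.substring(start, end);
    }

    private static boolean shouldStrip(String stripChars, char c) {
        return stripChars == null ? Character.isWhitespace(c) : stripChars.indexOf(c) >= 0;
    }

    public static void main(String[] args) {
        System.out.println("[" + strip("  a b c  ", null) + "]");  // prints [a b c]
        System.out.println("[" + strip(" a b c a", "a ") + "]");   // prints [b c]
    }
}
```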

The @Description annotation documents the UDF; the documentation can later be viewed with the desc function <UDF> command. The annotation has three attributes: name, value, and extended.

  • name: the name of the function
  • value: describes what the function does; _FUNC_ is a macro that desc replaces with the function's actual name
  • extended: mainly used to give usage examples for the function

Other points to note:

  • UDF names are not case sensitive.
  • Hive supports the use of Java primitive types (as well as types such as java.util.Map and java.util.List) in UDFs.
  • Hive also supports the Hadoop writable types, such as Text. Using the Hadoop types is recommended, since they can take advantage of object reuse, which improves efficiency and saves resources.

Next, write a simple unit test:

package com.scb.dss.udf;

import org.junit.Assert;
import org.junit.Test;

public class StripTest {
    private Strip strip = new Strip();

    @Test
    public void evaluate() {
        System.out.println(strip.evaluate(" a b c "));
        Assert.assertEquals("a b c", strip.evaluate(" a b c "));
        System.out.println(strip.evaluate(" a b c a", "a"));
        Assert.assertEquals(" a b c ", strip.evaluate(" a b c a", "a"));
        System.out.println(strip.evaluate(" a b c a", "a "));
        Assert.assertEquals("b c", strip.evaluate(" a b c a", "a "));
    }
}

3. Deploying the UDF

1. Package the project

mvn clean package

2. Upload the Jar package to HDFS

hdfs dfs -put hive-udf.jar /user/hive/

3. Connect to Hive through Beeline

beeline -u jdbc:hive2://host:10000/default -n username -p 'password'

4. Create the function

create function strip as 'com.scb.dss.udf.Strip' using jar 'hdfs:///user/hive/hive-udf.jar';

5. Use the function

Use desc function <UDF> to view the UDF's description.

select strip('0a0', '0'); -- strips the leading and trailing '0'; returns 'a'

4. Others

  1. Drop the function: drop function <udf>
  2. Create a temporary function (visible only to the current session): create temporary function <udf> as <udf.class.path>


Origin blog.csdn.net/qq_37771475/article/details/121636984