Hive Functions 08: User-Defined Functions (UDF, UDAF, UDTF)

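I. Built-in Functions

A few commonly used commands:

-- (1) List the built-in functions
hive> show functions;
-- (2) Show the usage of a built-in function
hive> desc function upper;
-- (3) Show the detailed usage of a built-in function
hive> desc function extended upper;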

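II. User-Defined Functions

1. Overview

Hive ships with a number of built-in functions, such as max/min, but the set is limited. When the built-in functions cannot meet your business needs, you can extend Hive with user-defined functions (UDF: user defined function). Official documentation:
https://cwiki.apache.org/confluence/display/Hive/HivePlugins

Note: using a custom function in a SQL statement may trigger an out-of-memory error. This happens when the data volume is large and skewed, so the task exceeds the default memory allocation. In that case you can set the memory manually with set odps.sql.udf.joiner.jvm.memory=xxxx; (the odps. prefix marks this as a MaxCompute/ODPS parameter rather than an Apache Hive one, so it applies only when running on MaxCompute).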

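2. Categories

User-defined functions fall into three categories:

(1) UDF: User Defined Function

One row in, one row out, for example upper and substr.

(2) UDAF: User Defined Aggregation Function

Aggregate function: many rows in, one row out.
Similar to count/max/min.

(3) UDTF: User Defined Table-Generating Function

One row in, many rows out.
For example explode(), typically used together with lateral view.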
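The three input/output shapes above can be sketched in plain Java. This is only an analogy: the class FunctionShapes and its methods upperLike, maxLike, and explodeLike are hypothetical names, and none of them uses the Hive API.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java analogy only: these are NOT Hive classes and use no Hive API.
class FunctionShapes {

    // UDF shape: one value in, one value out (like upper).
    static String upperLike(String s) {
        return s == null ? null : s.toUpperCase();
    }

    // UDAF shape: many values in, one value out (like max).
    static Integer maxLike(List<Integer> column) {
        Integer best = null;
        for (Integer v : column) {
            if (v != null && (best == null || v > best)) {
                best = v;
            }
        }
        return best; // null for an empty or all-null column
    }

    // UDTF shape: one value in, many rows out (like explode on an array).
    static List<String> explodeLike(String csv) {
        return Arrays.asList(csv.split(","));
    }

    public static void main(String[] args) {
        System.out.println(upperLike("hive"));
        System.out.println(maxLike(Arrays.asList(3, 9, 4)));
        System.out.println(explodeLike("a,b,c"));
    }
}
```

In real Hive code each shape is written against a different base class, so this sketch only shows how many values go in and come out, not how to implement a UDAF or UDTF.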

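3. Steps

(1) Extend org.apache.hadoop.hive.ql.exec.UDF.
(2) Implement an evaluate method; evaluate can be overloaded.
(3) Create the function in the Hive command-line window:

a) Add the jar
add jar linux_jar_path
b) Create the function
create [temporary] function [dbname.]function_name AS class_name;

(4) Drop the function in the Hive command-line window:

drop [temporary] function [if exists] [dbname.]function_name;

Note: a UDF must have a return type. It may return null, but the return type cannot be void.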
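The rules on evaluate (overloading is allowed, null may be returned, void is not a valid return type) can be illustrated with a minimal sketch. TrimLike and EvaluateCheck are hypothetical names, and TrimLike deliberately does not extend Hive's UDF base class so that the example compiles without hive-exec on the classpath. In a real UDF the class would extend org.apache.hadoop.hive.ql.exec.UDF, as in step (1).

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for a UDF class; it does not extend Hive's UDF base
// class so that the sketch compiles without hive-exec.
class TrimLike {
    // Overload 1: trim surrounding whitespace.
    public String evaluate(String s) {
        return s == null ? null : s.trim();
    }

    // Overload 2: trim, then keep at most maxLen characters.
    public String evaluate(String s, int maxLen) {
        if (s == null || maxLen < 0) {
            return null; // returning null is allowed; a void return type is not
        }
        String t = s.trim();
        return t.length() <= maxLen ? t : t.substring(0, maxLen);
    }
}

class EvaluateCheck {
    // Counts the evaluate overloads of a class and rejects any that return void.
    static int countEvaluateOverloads(Class<?> cls) {
        int count = 0;
        for (Method m : cls.getDeclaredMethods()) {
            if (m.getName().equals("evaluate")) {
                if (m.getReturnType() == void.class) {
                    throw new IllegalStateException("evaluate must not return void: " + m);
                }
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countEvaluateOverloads(TrimLike.class));
        System.out.println(new TrimLike().evaluate("  hello world  ", 5));
    }
}
```

The reflection check is a local sanity test one might run before packaging; it is not something Hive requires.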

III. Examples

Case 1

(1) Create a Maven project named Hive
(2) Add the dependencies

<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.2.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.3</version>
</dependency>

(3) Create a class

package com.atguigu.hive;
import org.apache.hadoop.hive.ql.exec.UDF;
public class Lower extends UDF {
    // Returns the lower-cased input, or null when the input is null.
    public String evaluate(final String s) {
        if (s == null) {
            return null;
        }
        return s.toLowerCase();
    }
}

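(4) Package the project as a jar and upload it to the server at /opt/module/jars/udf.jar
Before packaging, check which JDK version the project uses and whether it matches the JDK used by the Hadoop cluster. A mismatch can cause problems, so to be safe use the same version on both sides.

Note: if the jar is uploaded to the $HIVE_HOME/lib/ directory, the add jar command below is not needed.
(5) Add the jar to Hive's classpath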

hive (default)> add jar /opt/module/jars/udf.jar;

(6) Create a temporary function bound to the compiled Java class

hive (default)> create temporary function udf_lower as "com.atguigu.hive.Lower";

(7) The custom function can now be used in HQL

hive (default)> select ename, udf_lower(ename) lowername from emp;
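Step (4) above warns about a JDK mismatch between the project and the cluster. One way to see which Java level a compiled class targets is to read the major version from the class-file header. The sketch below is a standalone helper, not part of Hive. The class name ClassVersionCheck is hypothetical, and it assumes you pass the path of a .class file extracted from the jar.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

class ClassVersionCheck {
    // Reads the class-file major version from a stream positioned at the
    // start of a .class file (for example 52 = Java 8, 51 = Java 7).
    static int majorVersion(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        if (data.readInt() != 0xCAFEBABE) {
            throw new IOException("not a class file (bad magic number)");
        }
        data.readUnsignedShort();        // minor version, ignored here
        return data.readUnsignedShort(); // major version
    }

    public static void main(String[] args) throws IOException {
        System.out.println("running on Java " + System.getProperty("java.version"));
        if (args.length != 1) {
            System.out.println("usage: java ClassVersionCheck <path-to-.class-file>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println("class-file major version: " + majorVersion(in));
        }
    }
}
```

If the major version is higher than what the cluster's JDK supports, the class will fail to load there, which is one concrete form of the mismatch described in step (4).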

Case 2

Step 1: Create the Hive project
Create a Maven project with the following dependencies:

<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.2.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.3</version>
</dependency>

Step 2: Write the UDF code

package UDF;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class LowerUDF extends UDF{
    /**
     * 1. Implement one or more methods named "evaluate" which will be called by Hive.
     *
     * 2. "evaluate" should never be a void method. However it can return "null" if needed.
     */
    public Text evaluate(Text str){
        // input parameter validate
        if(null == str){
            return null ;
        }

        // validate
        if(StringUtils.isBlank(str.toString())){
            return null ;
        }

        // lower
        return new Text(str.toString().toLowerCase()) ;
    }

    public static void main(String[] args) {
        System.out.println(new LowerUDF().evaluate(new Text("BBB")));
    }

}

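Step 3: Package
Before packaging, check which JDK version the project uses and whether it matches the JDK used by the Hadoop cluster. A mismatch can cause problems, so to be safe use the same version on both sides. Build the jar with Maven, then locate the jar file that Maven produced.

Step 4: Register the UDF

First upload the jar produced in Step 3 to the server, then put it on HDFS: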

hadoop fs -put Hive_UDF_demo-1.0-SNAPSHOT.jar /user/hue/weisc/

hive> add jar hdfs://hadoop01:8020/user/hue/weisc/Hive_UDF_demo-1.0-SNAPSHOT.jar;
converting to local hdfs://hadoop01:8020/user/hue/weisc/Hive_UDF_demo-1.0-SNAPSHOT.jar
Added [/tmp/c1473e1f-6985-4699-b956-95962489759c_resources/Hive_UDF_demo-1.0-SNAPSHOT.jar] to class path
Added resources: [hdfs://hadoop01:8020/user/hue/weisc/Hive_UDF_demo-1.0-SNAPSHOT.jar]
hive> create temporary function lower_udf as 'UDF.LowerUDF';
OK
Time taken: 0.049 seconds

Step 5: Test

hive> create table b (id int ,name string);
OK
Time taken: 0.29 seconds
hive> insert into b values(1,'WWWWAA');
Query ID = hue_20180814080214_8592e828-113f-4126-9959-06066d28b7d9
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.
Status: Running (Executing on YARN cluster with App id application_1533807733727_0109)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 6.35 s     
--------------------------------------------------------------------------------
Loading data to table test1.b
Table test1.b stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
OK
Time taken: 21.383 seconds
hive> select lower_udf(name) from b;
OK
wwwwaa
Time taken: 0.187 seconds, Fetched: 1 row(s)

清平の乐
