Hive Custom Functions: UDF, UDAF, UDTF

1. What is a UDF

UDF is short for User-Defined Function. Besides the functions Hive provides natively, we can define our own functions when the built-ins do not meet our needs.

Besides UDFs, we can also define aggregate functions (UDAF) and table-generating functions (UDTF).

 

 

2. How to Create a UDF

2.1 Write a Java class extending UDF or GenericUDF

If the return type is simple, extending UDF and implementing an evaluate method is enough; for more complex types, extend GenericUDF and implement the initialize, getDisplayString, and evaluate methods.

public class UDFStripDoubleQuotes extends UDF {

    private static final String DOUBLE_QUOTES = "\"";

    private static final String BLANK_SYMBOL = "";

    public Text evaluate(Text text) throws UDFArgumentException {
        if (null == text || BLANK_SYMBOL.equals(text.toString())) {
            throw new UDFArgumentException("The function STRIP_DOUBLE_QUOTES(s) takes exactly 1 argument.");
        }

        String temp = text.toString().trim();
        if (temp.startsWith(DOUBLE_QUOTES) || temp.endsWith(DOUBLE_QUOTES)) {
            temp = temp.replace(DOUBLE_QUOTES, BLANK_SYMBOL);
        }
        return new Text(temp);
    }
}
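The core string logic can be sketched in plain Java, independent of the Hive Text wrapper (a standalone illustration for testing, not part of the UDF itself). Note that String.replace removes every double quote in the value, not just the outer pair, and single quotes are left untouched:

```java
public class StripDoubleQuotesDemo {

    // Same logic as evaluate(), but on plain Strings for easy testing
    static String stripDoubleQuotes(String text) {
        String temp = text.trim();
        if (temp.startsWith("\"") || temp.endsWith("\"")) {
            // replace() removes ALL double quotes, not just the outer pair
            temp = temp.replace("\"", "");
        }
        return temp;
    }

    public static void main(String[] args) {
        System.out.println(stripDoubleQuotes("\"NEW YORK\""));  // NEW YORK
        System.out.println(stripDoubleQuotes("'CHICAGO'"));     // 'CHICAGO' (single quotes untouched)
    }
}
```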

2.2 Compile the Java class and package it into a jar

2.3 Add the jar in Hive

hive (hadoop)> add jar /opt/data/UDFStripDoubleQuotes.jar;

Added /opt/data/UDFStripDoubleQuotes.jar to class path

Added resource: /opt/data/UDFStripDoubleQuotes.jar

 

2.4 Create temporary and permanent functions

2.4.1 Create a temporary function

Syntax:

CREATE TEMPORARY FUNCTION strip_double_quotes
AS 'com.hive.udf.UDFStripDoubleQuotes';


Example:

create temporary function EncryptByMD5 as 'MD5.EncryptByMD5' using jar 'hdfs:///user/xx/hiveUDF/EncryptByMD5.jar';


2.4.2 Create a permanent function

CREATE FUNCTION [db_name.]function_name AS class_name

[USING JAR|FILE|ARCHIVE 'file_uri' [, JAR|FILE|ARCHIVE 'file_uri'] ];

In plain terms: upload the jar to HDFS, specify which database the function belongs to, and point the URI at the HDFS location where the jar lives.

For example:

CREATE FUNCTION db_name.function_name AS 'com.hive.udf.LowerAndUpperUDF' USING JAR 'hdfs:/var/hive/udf/lowerOrUpper.jar';

Example (a permanent function must be qualified with the Hive database name):

create function xx.EncryptByMD5 as 'MD5.EncryptByMD5' using jar 'hdfs:///user/xx/hiveUDF/EncryptByMD5.jar';

(Note: if the function is created without a database prefix, it is registered in the current database. With a prefix, it can be called unqualified only from within that database; calling it from another database raises an invalid function error unless you qualify the name.)

2.5 Test

Prepare test data: /opt/data/quotes.txt

"10"    "ACCOUNTING"    "NEW YORK"

"20"    "RESEARCH"      "DALLAS"

"30"    "SALES" 'CHICAGO'

"40"    "OPERATIONS"    'BOSTON'

 

On the Hive side:

CREATE TABLE t_dept LIKE dept;

LOAD DATA LOCAL INPATH '/opt/data/quotes.txt' INTO TABLE t_dept;

 

SELECT strip_double_quotes(dname) name, strip_double_quotes(loc) loc FROM t_dept;

Result (single-quoted values pass through unchanged, since the function only strips double quotes):

name          loc

ACCOUNTING    NEW YORK

RESEARCH      DALLAS

SALES         'CHICAGO'

OPERATIONS    'BOSTON'

 

3. How to Create a UDAF

Extend AbstractGenericUDAFResolver and implement an inner GenericUDAFEvaluator. The GenericUDAFEvaluator runs different methods at different stages of the job; Hive uses GenericUDAFEvaluator.Mode to determine which stage the job is in.

So which stages are there?

PARTIAL1: from original data to a partial aggregation; calls iterate and terminatePartial (map input to map output)

PARTIAL2: from partial aggregations to a partial aggregation; calls merge and terminatePartial (map output to reduce input, i.e. the combiner stage)

FINAL: from partial aggregations to the full aggregation; calls merge and terminate (reduce input to output)

COMPLETE: from original data to the full aggregation; calls iterate and terminate; map-only, no reduce stage
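The stage flow above can be simulated in plain Java, outside of Hive (the names here are purely illustrative): each mapper folds its rows into a partial sum (iterate plus terminatePartial), and the reducer merges those partials into the final result (merge plus terminate):

```java
import java.util.Arrays;
import java.util.List;

public class UdafStageDemo {

    // Map side, PARTIAL1: iterate over raw rows, then emit a partial sum
    static double partial(List<Double> rows) {
        double sum = 0;
        for (double row : rows) {
            sum += row;  // iterate()
        }
        return sum;      // terminatePartial()
    }

    // Reduce side, FINAL: merge the partials, then emit the final result
    static double finalResult(List<Double> partials) {
        double sum = 0;
        for (double p : partials) {
            sum += p;    // merge()
        }
        return sum;      // terminate()
    }

    public static void main(String[] args) {
        double p1 = partial(Arrays.asList(1.0, 2.0));   // mapper 1
        double p2 = partial(Arrays.asList(3.0, 4.0));   // mapper 2
        System.out.println(finalResult(Arrays.asList(p1, p2)));  // 10.0
    }
}
```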

A few caveats:

If the volume of data to aggregate is large, watch memory usage; out-of-memory errors are easy to hit.

Reuse objects wherever possible and avoid allocating new ones, to reduce JVM garbage-collection pressure.

 

public class UDAFAdd extends AbstractGenericUDAFResolver {

    static final Log LOG = LogFactory.getLog(UDAFAdd.class.getName());

    @Override
    public GenericUDAFEvaluator getEvaluator(TypeInfo[] arguments) throws SemanticException {
        // check argument count
        if (arguments.length != 1) {
            throw new UDFArgumentException("Exactly one argument is expected.");
        }
        // the argument must be a primitive type
        if (arguments[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
            throw new UDFArgumentException("Argument is not expected.");
        }

        switch (((PrimitiveTypeInfo) arguments[0]).getPrimitiveCategory()) {
            case BYTE:
            case SHORT:
            case INT:
            case LONG:
                return new UDAFAddLong();
            case FLOAT:
            case DOUBLE:
                return new UDAFAddDouble();
            default:
                throw new UDFArgumentException("Only numeric type arguments are accepted but "
                        + arguments[0].getTypeName() + " is passed.");
        }
    }

      

    public static class UDAFAddDouble extends GenericUDAFEvaluator {
        private PrimitiveObjectInspector inputOI;
        private DoubleWritable result;

        // init is invoked at each stage of the job
        @Override
        public ObjectInspector init(Mode mode, ObjectInspector[] arguments) throws HiveException {
            super.init(mode, arguments);
            // initialize the result holder
            result = new DoubleWritable(0);
            inputOI = (PrimitiveObjectInspector) arguments[0];
            return PrimitiveObjectInspectorFactory.writableDoubleObjectInspector;
        }

        /**
         * Stores the running aggregation state during the aggregation process.
         * @author nickyzhang
         */
        static class AddDoubleAgg extends AbstractAggregationBuffer {
            boolean empty;
            double sum;

            @Override
            public int estimate() {
                return JavaDataModel.PRIMITIVES1 + JavaDataModel.PRIMITIVES2;
            }
        }

        /** Get a new aggregation buffer. */
        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            AddDoubleAgg addDoubleAgg = new AddDoubleAgg();
            reset(addDoubleAgg);
            return addDoubleAgg;
        }

        /** Reset the aggregation, so the same buffer can be reused. */
        @Override
        public void reset(AggregationBuffer agg) throws HiveException {
            AddDoubleAgg addDoubleAgg = (AddDoubleAgg) agg;
            addDoubleAgg.empty = true;
            addDoubleAgg.sum = 0;
        }

        /** Iterate over the original input rows. */
        @Override
        public void iterate(AggregationBuffer agg, Object[] arguments) throws HiveException {
            if (arguments.length != 1) {
                throw new UDFArgumentException("Just one argument expected!");
            }
            this.merge(agg, arguments[0]);
        }

        /** Emit the partial aggregation result. */
        @Override
        public Object terminatePartial(AggregationBuffer agg) throws HiveException {
            return terminate(agg);
        }

        /** Merge a partial result (combiner or reducer side). */
        @Override
        public void merge(AggregationBuffer agg, Object partial) throws HiveException {
            if (partial == null) {
                return;
            }
            AddDoubleAgg addDoubleAgg = (AddDoubleAgg) agg;
            addDoubleAgg.empty = false;
            addDoubleAgg.sum += PrimitiveObjectInspectorUtils.getDouble(partial, inputOI);
        }

        /** Emit the final aggregation result. */
        @Override
        public Object terminate(AggregationBuffer agg) throws HiveException {
            AddDoubleAgg addDoubleAgg = (AddDoubleAgg) agg;
            if (addDoubleAgg.empty) {
                return null;
            }
            result.set(addDoubleAgg.sum);
            return result;
        }
    }

      

    public static class UDAFAddLong extends GenericUDAFEvaluator {
        private PrimitiveObjectInspector inputOI;
        private LongWritable result;

        // init is invoked at each stage of the job
        @Override
        public ObjectInspector init(Mode mode, ObjectInspector[] arguments) throws HiveException {
            super.init(mode, arguments);
            // initialize the result holder
            result = new LongWritable(0);
            inputOI = (PrimitiveObjectInspector) arguments[0];
            return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
        }

        /**
         * Stores the running aggregation state during the aggregation process.
         * @author nickyzhang
         */
        static class AddLongAgg extends AbstractAggregationBuffer {
            boolean empty;
            long sum;

            @Override
            public int estimate() {
                return JavaDataModel.PRIMITIVES1 + JavaDataModel.PRIMITIVES2;
            }
        }

        /** Get a new aggregation buffer. */
        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            AddLongAgg addLongAgg = new AddLongAgg();
            reset(addLongAgg);
            return addLongAgg;
        }

        /** Reset the aggregation, so the same buffer can be reused. */
        @Override
        public void reset(AggregationBuffer agg) throws HiveException {
            AddLongAgg addLongAgg = (AddLongAgg) agg;
            addLongAgg.empty = true;
            addLongAgg.sum = 0;
        }

        /** Iterate over the original input rows. */
        @Override
        public void iterate(AggregationBuffer agg, Object[] arguments) throws HiveException {
            if (arguments.length != 1) {
                throw new UDFArgumentException("Just one argument expected!");
            }
            this.merge(agg, arguments[0]);
        }

        /** Emit the partial aggregation result. */
        @Override
        public Object terminatePartial(AggregationBuffer agg) throws HiveException {
            return terminate(agg);
        }

        /** Merge a partial result (combiner or reducer side). */
        @Override
        public void merge(AggregationBuffer agg, Object partial) throws HiveException {
            if (partial == null) {
                return;
            }
            AddLongAgg addLongAgg = (AddLongAgg) agg;
            addLongAgg.empty = false;
            addLongAgg.sum += PrimitiveObjectInspectorUtils.getLong(partial, inputOI);
        }

        /** Emit the final aggregation result. */
        @Override
        public Object terminate(AggregationBuffer agg) throws HiveException {
            AddLongAgg addLongAgg = (AddLongAgg) agg;
            if (addLongAgg.empty) {
                return null;
            }
            result.set(addLongAgg.sum);
            return result;
        }
    }

}

 

4. How to Create a UDTF

UDTFs are generally used for parsing work, for example parsing a URL and extracting its components. Extend GenericUDTF.

public class UDTFEmail extends GenericUDTF {

    @Override
    public StructObjectInspector initialize(StructObjectInspector inspector) throws UDFArgumentException {
        if (inspector == null) {
            throw new UDFArgumentException("arguments is null");
        }
        List<? extends StructField> args = inspector.getAllStructFieldRefs();
        if (CollectionUtils.isEmpty(args) || args.size() != 1) {
            throw new UDFArgumentException("This UDTF takes exactly one argument");
        }

        // output columns: name, email
        List<String> fields = new ArrayList<String>();
        fields.add("name");
        fields.add("email");

        List<ObjectInspector> fieldOIList = new ArrayList<ObjectInspector>();
        fieldOIList.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldOIList.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fields, fieldOIList);
    }

    @Override
    public void process(Object[] args) throws HiveException {
        if (ArrayUtils.isEmpty(args) || args.length != 1) {
            return;
        }
        String name = args[0].toString();
        String email = name + "@163.com";
        super.forward(new String[] {name, email});
    }

    @Override
    public void close() throws HiveException {
        super.forward(new String[] {"complete", "finish"});
    }
}
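The one-row-in, many-rows-out behavior that forward() provides can be sketched in plain Java without any Hive dependencies (the names and the comma-splitting input are purely illustrative): each call with one input value may emit any number of output rows.

```java
import java.util.ArrayList;
import java.util.List;

public class UdtfForwardDemo {

    // Mimics process() + forward(): one input value can emit several output rows
    static List<String[]> process(String names) {
        List<String[]> rows = new ArrayList<>();
        for (String name : names.split(",")) {
            rows.add(new String[] {name, name + "@163.com"});  // one forward() per row
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : process("tom,jerry")) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```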

 

5. Differences between UDF, UDAF, and UDTF

UDF: one row in, one row out.

UDAF: many rows in, one row out; generally used for aggregation.

UDTF: one row in, many rows out.


Reposted from blog.csdn.net/gyxinguan/article/details/79270824