Getting started with Hive UDAF development, understand it at a glance!

I had been running a standalone script for some data processing, but the input was too large and the process kept getting killed by the system for using too much memory, so I decided to do the aggregation in Hive instead. I took the opportunity to learn how to write a UDAF, and I'm writing down the pits I stepped in, hoping to save you some detours! OK... here we go.

We hear about UDFs all the time, but what is a UDAF? It is a user-defined aggregate function, the custom counterpart of Hive's built-in aggregates such as count, sum, max, min, avg, and so on. The built-in functions can't cover every complex statistic we need, so sometimes we have to implement our own.

There are two ways to implement one: a simple way and a generic way. The simple way is said to have performance problems, so here we only look at the generic implementation.

Implementing a Generic UDAF has two parts:

  1. resolver
  2. evaluator

These correspond to two abstract classes:

import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;

The resolver is mainly responsible for parameter checking and operator overloading: depending on the input parameter types, we can select the appropriate evaluator.

The evaluator is where the main logic is implemented, in the form of a static inner class.

#!Java
public class GenericUDAFHistogramNumeric extends AbstractGenericUDAFResolver {
  static final Log LOG = LogFactory.getLog(GenericUDAFHistogramNumeric.class.getName());

  @Override
  public GenericUDAFEvaluator getEvaluator(GenericUDAFParameterInfo info) throws SemanticException {
    // parameter checks

    return new GenericUDAFHistogramNumericEvaluator();
  }

  /**
   * This static inner class is where we write our own logic; rename it as needed.
   * This follows the histogram example from the official documentation.
   */
  public static class GenericUDAFHistogramNumericEvaluator extends GenericUDAFEvaluator {
    // UDAF logic
  }
}

A quick word about the function in this example: Hive's built-in histogram_numeric builds a histogram. For example, to build a histogram of age with 30 bins: SELECT histogram_numeric(age, 30) FROM employees;

Now let's continue with the example:

#!Java
  /**
   * Note: in newer Hive versions the signature of this method has changed
   * and takes TypeInfo[] parameters directly.
   */
  public GenericUDAFEvaluator getEvaluator(GenericUDAFParameterInfo info) throws SemanticException {
    TypeInfo[] parameters = info.getParameters();
    if (parameters.length != 2) {
      throw new UDFArgumentTypeException(parameters.length - 1,
          "Please specify exactly two arguments.");
    }

    // check the type of the first parameter; throw if it is not a primitive type
    if (parameters[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
      throw new UDFArgumentTypeException(0,
          "Only primitive type arguments are accepted but "
          + parameters[0].getTypeName() + " was passed as parameter 1.");
    }
    switch (((PrimitiveTypeInfo) parameters[0]).getPrimitiveCategory()) {
    case BYTE:
    case SHORT:
    case INT:
    case LONG:
    case FLOAT:
    case DOUBLE:
      break;
    case STRING:
    case BOOLEAN:
    default:
      throw new UDFArgumentTypeException(0,
          "Only numeric type arguments are accepted but "
          + parameters[0].getTypeName() + " was passed as parameter 1.");
    }

    // check the type of the second parameter (the number of histogram bins)
    if (parameters[1].getCategory() != ObjectInspector.Category.PRIMITIVE) {
      throw new UDFArgumentTypeException(1,
          "Only primitive type arguments are accepted but "
          + parameters[1].getTypeName() + " was passed as parameter 2.");
    }
    // throw if it is not an integer
    if (((PrimitiveTypeInfo) parameters[1]).getPrimitiveCategory()
        != PrimitiveObjectInspector.PrimitiveCategory.INT) {
      throw new UDFArgumentTypeException(1,
          "Only an integer argument is accepted as parameter 2, but "
          + parameters[1].getTypeName() + " was passed instead.");
    }
    // return the corresponding evaluator class
    return new GenericUDAFHistogramNumericEvaluator();
  }
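For completeness, the code above relies on a few more classes than the two imports shown earlier. Roughly, the usual imports are the following (package paths taken from the Hive codebase; double-check them against the Hive version you build against):

#!Java
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFParameterInfo;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector;
import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;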

Next, let's look at the evaluator:

#!Java
  public static class GenericUDAFHistogramNumericEvaluator extends GenericUDAFEvaluator {
 
    // For PARTIAL1 and COMPLETE: ObjectInspectors for the original data; these two are used for type conversion
    private PrimitiveObjectInspector inputOI;
    private PrimitiveObjectInspector nbinsOI;
 
    // For PARTIAL2 and FINAL: ObjectInspectors for partial aggregations (list of doubles)
    private StandardListObjectInspector loi;
 
 
    @Override
    public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
      super.init(m, parameters);
      // return type goes here
    }
 
    @Override
    public Object terminatePartial(AggregationBuffer agg) throws HiveException {
      // return value goes here
    }
 
    @Override
    public Object terminate(AggregationBuffer agg) throws HiveException {
      // final return value goes here
    }
 
    @Override
    public void merge(AggregationBuffer agg, Object partial) throws HiveException {
    }
 
    @Override
    public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
    }
 
    // Aggregation buffer definition and manipulation methods
    static class StdAgg implements AggregationBuffer {
    };
 
    @Override
    public AggregationBuffer getNewAggregationBuffer() throws HiveException {
    }
 
    @Override
    public void reset(AggregationBuffer agg) throws HiveException {
    }   
  }
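To make the skeleton a bit more concrete, here is a rough sketch of what init() typically looks like for this histogram example, reconstructed along the lines of the official documentation. I'm assuming here that both the partial and the final result are serialized as a list of doubles; check the Hive wiki for the exact official code.

#!Java
    @Override
    public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
      super.init(m, parameters);

      if (m == Mode.PARTIAL1 || m == Mode.COMPLETE) {
        // original rows come in: remember the inspectors for the value column and the bin count
        inputOI = (PrimitiveObjectInspector) parameters[0];
        nbinsOI = (PrimitiveObjectInspector) parameters[1];
      } else {
        // partial aggregations come in: a single list-of-doubles argument
        loi = (StandardListObjectInspector) parameters[0];
      }

      // output (both partial and final in this sketch): a list of doubles
      return ObjectInspectorFactory.getStandardListObjectInspector(
          PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);
    }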

To understand this class, we first need to understand a few things. Anyone who has written Hadoop MapReduce knows that a job is divided into map, combine, and reduce phases. The map phase applies a function to every record of the input to build key-value pairs for later aggregation; the combine phase runs on the mapper side and does a local aggregation, taking the same kind of input as reduce (it is essentially a reduce run on the mapper end); the intermediate results are then passed to the reduce function for the final aggregation. With this process in mind, the evaluator's methods basically correspond to these stages.

The methods and what they do:

- init: initialization function (called for each Mode; returns the ObjectInspector of the output)
- getNewAggregationBuffer: creates a buffer object used to hold the intermediate aggregation result
- iterate: processes the input rows one by one and accumulates the result into the buffer
- terminatePartial: called when the map phase finishes, to persist the data in the buffer. The return type here only supports Java primitive types, primitive wrappers, Hadoop Writables, Lists, and Maps; do not use a custom type
- merge: receives the value returned by terminatePartial and merges the partial aggregation results
- terminate: returns the final result; the last step of the computation, such as dividing to get an average, can be done here
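To see how these methods fit together, here is a minimal, filled-in evaluator for a much simpler aggregation than the histogram: summing a DOUBLE column. This is my own sketch modeled on the pattern of Hive's built-in sum; the class and field names (SumDoubleEvaluator, SumAgg) are made up for illustration, and the class is shown top-level for brevity even though it would normally be a static inner class of the resolver.

#!Java
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;

public class SumDoubleEvaluator extends GenericUDAFEvaluator {

  // PARTIAL1 / COMPLETE: inspector for the original input column
  private PrimitiveObjectInspector inputOI;
  // PARTIAL2 / FINAL: inspector for the partial aggregation (a single double)
  private PrimitiveObjectInspector partialOI;

  @Override
  public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
    super.init(m, parameters);
    if (m == Mode.PARTIAL1 || m == Mode.COMPLETE) {
      inputOI = (PrimitiveObjectInspector) parameters[0];
    } else {
      partialOI = (PrimitiveObjectInspector) parameters[0];
    }
    // both the partial and the final result are a double
    return PrimitiveObjectInspectorFactory.writableDoubleObjectInspector;
  }

  // the aggregation buffer just carries a running sum
  static class SumAgg implements AggregationBuffer {
    boolean empty;
    double sum;
  }

  @Override
  public AggregationBuffer getNewAggregationBuffer() throws HiveException {
    SumAgg agg = new SumAgg();
    reset(agg);
    return agg;
  }

  @Override
  public void reset(AggregationBuffer agg) throws HiveException {
    ((SumAgg) agg).empty = true;
    ((SumAgg) agg).sum = 0;
  }

  @Override
  public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
    if (parameters[0] != null) {
      SumAgg a = (SumAgg) agg;
      a.sum += PrimitiveObjectInspectorUtils.getDouble(parameters[0], inputOI);
      a.empty = false;
    }
  }

  @Override
  public Object terminatePartial(AggregationBuffer agg) throws HiveException {
    // the partial result has the same shape as the final one here
    return terminate(agg);
  }

  @Override
  public void merge(AggregationBuffer agg, Object partial) throws HiveException {
    if (partial != null) {
      SumAgg a = (SumAgg) agg;
      a.sum += PrimitiveObjectInspectorUtils.getDouble(partial, partialOI);
      a.empty = false;
    }
  }

  @Override
  public Object terminate(AggregationBuffer agg) throws HiveException {
    SumAgg a = (SumAgg) agg;
    return a.empty ? null : new DoubleWritable(a.sum);
  }
}

In a real UDAF the buffer usually holds more state (for the histogram it holds the list of bins) and terminatePartial() returns something richer such as a list of doubles, but the life cycle of the methods is the same.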

In Hive, an enum class Mode represents the different stages. Roughly, PARTIAL1 corresponds to the map side, PARTIAL2 to the combiner, FINAL to the reduce side, and COMPLETE to an aggregation that happens entirely in one stage:

#!Java
  /**
   * Mode.
   * The official comments here are already quite detailed ^_^
   */
  public static enum Mode {
    /**
     * PARTIAL1: from original data to partial aggregation data: iterate() and
     * terminatePartial() will be called.
     */
    PARTIAL1,
    /**
     * PARTIAL2: from partial aggregation data to partial aggregation data:
     * merge() and terminatePartial() will be called.
     */
    PARTIAL2,
    /**
     * FINAL: from partial aggregation to full aggregation: merge() and
     * terminate() will be called.
     */
    FINAL,
    /**
     * COMPLETE: from original data directly to full aggregation: iterate() and
     * terminate() will be called.
     */
    COMPLETE
  };

OK... After you finish writing it, package it into a jar, then create a temporary function and you can use it:

add jar hiveUDF.jar;
create temporary function test_udf as 'com.test.xxxx';
select test_udf(a,b) from table2 group by xxx;

Well, that's it for a first pass. When I wrote mine I used Java data types almost everywhere, which produced all sorts of type conversion errors; next I'm going to look at the Hadoop built-in (Writable) types. Hope this helps!
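A small illustration of what bit me (my own sketch, not from the original post): whatever terminate()/terminatePartial() returns has to match the ObjectInspector declared in init(), and mixing the two families is what produces the conversion errors. The class name ReturnTypePairings below is hypothetical.

#!Java
import org.apache.hadoop.hive.serde2.io.DoubleWritable;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class ReturnTypePairings {
  // consistent pairings: the OI declared in init() must match what terminate() returns;
  // declaring the writable OI and then returning a java.lang.Double (or vice versa) causes cast errors
  static final ObjectInspector JAVA_OI     = PrimitiveObjectInspectorFactory.javaDoubleObjectInspector;
  static final Object JAVA_RESULT          = Double.valueOf(3.14);      // pairs with JAVA_OI
  static final ObjectInspector WRITABLE_OI = PrimitiveObjectInspectorFactory.writableDoubleObjectInspector;
  static final Object WRITABLE_RESULT      = new DoubleWritable(3.14);  // pairs with WRITABLE_OI
}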

Origin www.cnblogs.com/jeason1991/p/10986716.html