I had a stand-alone script doing some data processing, but the input data grew so large that the memory-hungry process kept getting killed, so I decided to do the aggregation in Hive instead. That gave me a chance to learn how to write a UDAF, so I'm writing down the pitfalls I hit along the way, hoping to save you some detours! Ok... here we go.
We often hear about UDFs, so what is a UDAF? A UDAF is a user-defined aggregate function; aggregate functions are things like Hive's built-in count, sum, max, min, and avg. But the built-in functions can't cover every complex statistic we need, so sometimes we have to implement our own.
There are two ways to implement one: a simple way and a generic way. The simple way is said to have performance problems, so we'll only look at the generic implementation.
Implementing a Generic UDAF has two parts:

- resolver
- evaluator

They correspond to two abstract classes:
import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
The resolver is mainly responsible for parameter checking and function overloading: based on the input parameter types, it selects the appropriate evaluator.
The evaluator is where the main logic is implemented, in the form of a static inner class.
#!Java
public class GenericUDAFHistogramNumeric extends AbstractGenericUDAFResolver {
    static final Log LOG = LogFactory.getLog(GenericUDAFHistogramNumeric.class.getName());

    @Override
    public GenericUDAFEvaluator getEvaluator(GenericUDAFParameterInfo info) throws SemanticException {
        // parameter checking
        return new GenericUDAFHistogramNumericEvaluator();
    }

    /**
     * This static inner class is where we write our own logic; rename it as
     * needed. This is the histogram example from the official documentation.
     */
    public static class GenericUDAFHistogramNumericEvaluator extends GenericUDAFEvaluator {
        // UDAF logic
    }
}
A word about the function used in this example: Hive's histogram_numeric builds a histogram. For instance, to build a 30-bucket histogram of ages: `SELECT histogram_numeric(age, 30) FROM employees;`
Let's continue with the example.
#!Java
/**
 * In newer versions this method's signature has changed: it takes
 * TypeInfo[] parameters directly.
 */
public GenericUDAFEvaluator getEvaluator(GenericUDAFParameterInfo info) throws SemanticException {
    TypeInfo[] parameters = info.getParameters();
    if (parameters.length != 2) {
        throw new UDFArgumentTypeException(parameters.length - 1,
            "Please specify exactly two arguments.");
    }

    // Check the first parameter's type; throw if it is not a primitive (base) type.
    if (parameters[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
        throw new UDFArgumentTypeException(0,
            "Only primitive type arguments are accepted but "
            + parameters[0].getTypeName() + " was passed as parameter 1.");
    }
    switch (((PrimitiveTypeInfo) parameters[0]).getPrimitiveCategory()) {
    case BYTE:
    case SHORT:
    case INT:
    case LONG:
    case FLOAT:
    case DOUBLE:
        break;
    case STRING:
    case BOOLEAN:
    default:
        throw new UDFArgumentTypeException(0,
            "Only numeric type arguments are accepted but "
            + parameters[0].getTypeName() + " was passed as parameter 1.");
    }

    // Check the second parameter's type: the number of histogram buckets,
    // which is required to be an integer here.
    if (parameters[1].getCategory() != ObjectInspector.Category.PRIMITIVE) {
        throw new UDFArgumentTypeException(1,
            "Only primitive type arguments are accepted but "
            + parameters[1].getTypeName() + " was passed as parameter 2.");
    }
    // Throw if it is not an integer.
    if (((PrimitiveTypeInfo) parameters[1]).getPrimitiveCategory()
            != PrimitiveObjectInspector.PrimitiveCategory.INT) {
        throw new UDFArgumentTypeException(1,
            "Only an integer argument is accepted as parameter 2, but "
            + parameters[1].getTypeName() + " was passed instead.");
    }

    // Return the corresponding evaluator.
    return new GenericUDAFHistogramNumericEvaluator();
}
Next, let's look at the evaluator.
#!Java
public static class GenericUDAFHistogramNumericEvaluator extends GenericUDAFEvaluator {
    // For PARTIAL1 and COMPLETE: ObjectInspectors for original data.
    // These two are used for type conversion.
    private PrimitiveObjectInspector inputOI;
    private PrimitiveObjectInspector nbinsOI;

    // For PARTIAL2 and FINAL: ObjectInspectors for partial aggregations (list of doubles)
    private StandardListObjectInspector loi;

    @Override
    public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
        super.init(m, parameters);
        // return type goes here
    }

    @Override
    public Object terminatePartial(AggregationBuffer agg) throws HiveException {
        // return value goes here
    }

    @Override
    public Object terminate(AggregationBuffer agg) throws HiveException {
        // final return value goes here
    }

    @Override
    public void merge(AggregationBuffer agg, Object partial) throws HiveException {
    }

    @Override
    public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
    }

    // Aggregation buffer definition and manipulation methods
    static class StdAgg implements AggregationBuffer {
    };

    @Override
    public AggregationBuffer getNewAggregationBuffer() throws HiveException {
    }

    @Override
    public void reset(AggregationBuffer agg) throws HiveException {
    }
}
To understand this class, we first need a bit of background. Anyone who has written Hadoop MapReduce knows that a job is divided into map, combine, and reduce stages. The map stage applies a function to each input record, building key-value pairs for later aggregation. The combine stage aggregates locally on the mapper side and passes the intermediate aggregates on to reduce; its input is the same as the reduce function's, so it is essentially a reduce run on the mapper side. With that process in mind, the evaluator's methods basically correspond to these stages:
| method | effect |
| --- | --- |
| init | Initialization |
| getNewAggregationBuffer | Creates a buffer object that records the intermediate aggregation result |
| iterate | Processes input rows one by one, storing the result in the buffer |
| terminatePartial | Called when the map phase completes; persists the data in the buffer. The return type here only supports Java primitive types, primitive wrappers, Hadoop Writables, Lists, and Maps; do not use a custom type |
| merge | Receives the result returned by terminatePartial and merges the partial aggregation results |
| terminate | Returns the final result; a last computation can happen here, such as calculating an average |
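The lifecycle in the table above can be sketched in plain Java, with no Hive dependencies, using average as the aggregate. The class, buffer, and method names below mirror the Hive evaluator's but are purely illustrative; this is a simulation of the call sequence, not Hive code:

```java
// Plain-Java sketch of the UDAF lifecycle for an average:
// iterate() fills per-mapper buffers, terminatePartial() emits {sum, count},
// merge() combines partials on the reducer, terminate() yields the result.
public class AvgUdafSketch {

    // Stands in for AggregationBuffer: holds the running partial state.
    static class AvgBuffer {
        double sum;
        long count;
    }

    // getNewAggregationBuffer(): fresh, empty state for one group.
    static AvgBuffer getNewAggregationBuffer() {
        return new AvgBuffer();
    }

    // iterate(): consume one original row (map side; PARTIAL1/COMPLETE).
    static void iterate(AvgBuffer agg, double value) {
        agg.sum += value;
        agg.count++;
    }

    // terminatePartial(): serialize the buffer to a simple transferable
    // form (a double[] here; Hive would use Writables, Lists, or Maps).
    static double[] terminatePartial(AvgBuffer agg) {
        return new double[] { agg.sum, agg.count };
    }

    // merge(): fold one partial result into the buffer (PARTIAL2/FINAL).
    static void merge(AvgBuffer agg, double[] partial) {
        agg.sum += partial[0];
        agg.count += (long) partial[1];
    }

    // terminate(): final evaluation (FINAL/COMPLETE).
    static double terminate(AvgBuffer agg) {
        return agg.count == 0 ? 0.0 : agg.sum / agg.count;
    }

    public static void main(String[] args) {
        // Simulate two mappers, over {1, 2, 3} and {4, 5}.
        AvgBuffer m1 = getNewAggregationBuffer();
        iterate(m1, 1); iterate(m1, 2); iterate(m1, 3);
        AvgBuffer m2 = getNewAggregationBuffer();
        iterate(m2, 4); iterate(m2, 5);

        // The reducer merges the two partials and evaluates.
        AvgBuffer reducer = getNewAggregationBuffer();
        merge(reducer, terminatePartial(m1));
        merge(reducer, terminatePartial(m2));
        System.out.println(terminate(reducer)); // prints 3.0 (15 / 5)
    }
}
```

Note how merge() consumes exactly what terminatePartial() produces; keeping that pair's types consistent is the contract that trips up most first UDAFs.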
In Hive, an enum class, Mode, represents the different stages:
#!Java
/**
 * Mode.
 * The official comments are quite detailed ^_^
 */
public static enum Mode {
    /**
     * PARTIAL1: from original data to partial aggregation data: iterate() and
     * terminatePartial() will be called.
     */
    PARTIAL1,
    /**
     * PARTIAL2: from partial aggregation data to partial aggregation data:
     * merge() and terminatePartial() will be called.
     */
    PARTIAL2,
    /**
     * FINAL: from partial aggregation to full aggregation: merge() and
     * terminate() will be called.
     */
    FINAL,
    /**
     * COMPLETE: from original data directly to full aggregation: iterate() and
     * terminate() will be called.
     */
    COMPLETE
};
Ok... once it's written, package it into a jar, then create a temporary function and use it:
add jar hiveUDF.jar;
create temporary function test_udf as 'com.test.xxxx';
select test_udf(a, b) from table2 group by xxx;
Well, that's my first one. When I wrote it I used Java data types everywhere, which produced all kinds of type-conversion errors; next I'm going to look into Hadoop's built-in types. Hope this helps!