Hive: Writing Custom UDF and UDTF Functions

1. User-defined function UDF

A User-Defined Function (UDF) is a powerful mechanism that lets users extend HiveQL by writing their own functions in Java. Once a UDF is registered in a session (interactively or from a script), it can be used just like a built-in function, and can even provide online help. Hive supports several kinds of user-defined functions, each performing a specific category of transformation on its input data.

UDF function characteristic: one row in, one row out. In short, one in, one out.
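Before looking at the full UDF below, the one-in/one-out contract can be illustrated with a plain Java method (a minimal sketch only — no Hive classes involved; the `evaluate` name simply mirrors the Hive convention):

```java
// Sketch of the UDF contract: each input value maps to exactly one output value.
public class OneInOneOut {
    // Hypothetical evaluate-style method: trims and upper-cases one value.
    static String evaluate(String value) {
        return value == null ? null : value.trim().toUpperCase();
    }

    public static void main(String[] args) {
        // Each input row yields exactly one output row.
        String[] rows = {" hive ", "udf"};
        for (String row : rows) {
            System.out.println(evaluate(row));
        }
    }
}
```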

A UDF that parses the common (public) fields of a log line:

Write the UDF class

Add the following dependencies to the pom.xml file:

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>${hive.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${hadoop.version}</version>
</dependency>

Code

To write a UDF, inherit from the UDF class and implement an evaluate() method. During query execution, the class is instantiated once for each use of the function in the query, and evaluate() is called once per input row; its return value is handed back to Hive. evaluate() can also be overloaded, and Hive selects the matching signature just as Java method overloading does. Note that evaluate() must be public for Hive to find it.

package UDF;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.json.JSONException;
import org.json.JSONObject;

public class BaseFieidUDF extends UDF {
    public static void main(String[] args) throws JSONException {
        String line = "1583776223469|{\"cm\":{\"ln\":\"-48.5\",\"sv\":\"V2.5.7\",\"os\":\"8.0.9\",\"g\":\"[email protected]\",\"mid\":\"0\",\"nw\":\"4G\",\"l\":\"pt\",\"vc\":\"3\",\"hw\":\"750*1134\",\"ar\":\"MX\",\"uid\":\"0\",\"t\":\"1583707297317\",\"la\":\"-52.9\",\"md\":\"sumsung-18\",\"vn\":\"1.2.4\",\"ba\":\"Sumsung\",\"sr\":\"V\"},\"ap\":\"app\",\"et\":[{\"ett\":\"1583705574227\",\"en\":\"display\",\"kv\":{\"goodsid\":\"0\",\"action\":\"1\",\"extend1\":\"1\",\"place\":\"0\",\"category\":\"63\"}},{\"ett\":\"1583760986259\",\"en\":\"loading\",\"kv\":{\"extend2\":\"\",\"loading_time\":\"4\",\"action\":\"3\",\"extend1\":\"\",\"type\":\"3\",\"type1\":\"\",\"loading_way\":\"1\"}},{\"ett\":\"1583746639124\",\"en\":\"ad\",\"kv\":{\"activityId\":\"1\",\"displayMills\":\"111839\",\"entry\":\"1\",\"action\":\"5\",\"contentType\":\"0\"}},{\"ett\":\"1583758016208\",\"en\":\"notification\",\"kv\":{\"ap_time\":\"1583694079866\",\"action\":\"1\",\"type\":\"3\",\"content\":\"\"}},{\"ett\":\"1583699890760\",\"en\":\"favorites\",\"kv\":{\"course_id\":4,\"id\":0,\"add_time\":\"1583730648134\",\"userid\":7}}]}";

        String serverTime = new BaseFieidUDF().evaluate(line, "st");

        System.out.println(serverTime);
    }

    public String evaluate(String line, String key) throws JSONException {

        String[] log = line.split("\\|");

        // Validate the input; this part is relatively involved
        if (log.length != 2 || StringUtils.isBlank(log[1])){
            return "";
        }

        // Reaching this point means the data is valid: the split produced two parts and the JSON part is not blank
        JSONObject baseJson = new JSONObject(log[1].trim());

        String result = "";

        // key "st": the server time (server_time); other keys include mid, l, os, ...
        if ("st".equals(key)) {
            result = log[0].trim();
        }else if ("et".equals(key)){
            // Get the event array
            if (baseJson.has("et")){
                result = baseJson.getString("et");
            }
        }else {
            // Get cm: the object holding the individual common key/value pairs
            /*
                {"ln":"-106.3","sv":"V2.7.0","os":"8.1.2","g":"[email protected]","mid":"1","nw":"WIFI",
                "l":"es","vc":"8","hw":"1080*1920","ar":"MX","uid":"1","t":"1603997770291","la":"-39.8",
                "md":"sumsung-16","vn":"1.3.2","ba":"Sumsung","sr":"B"}
             */
            JSONObject cm = baseJson.getJSONObject("cm");

            // Look up the value for this key among the common fields in cm
            if (cm.has(key)){
                result = cm.getString(key);
            }
        }
        return result;
    }
}
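With the class compiled and packaged into a jar, it can be registered in a Hive session and then called like a built-in function. A minimal sketch (the jar path, the function name `base_analizer`, and the table name `ods_event_log` are illustrative assumptions, not from the original):

```sql
-- Add the packaged jar to the session (path is hypothetical)
ADD JAR /opt/jars/hive-udf-1.0.jar;

-- Map a temporary function name to the UDF class
CREATE TEMPORARY FUNCTION base_analizer AS 'UDF.BaseFieidUDF';

-- One row in, one row out: extract the server time and the mid common field
SELECT base_analizer(line, 'st'), base_analizer(line, 'mid') FROM ods_event_log;
```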

2. Custom UDTF function

UDTF function characteristic: multiple rows in, multiple rows out. In short, many in, many out.

  • Inherit org.apache.hadoop.hive.ql.udf.generic.GenericUDTF and implement its three methods: initialize, process, and close.

  • Hive first calls the initialize method, which returns the structure of the rows the UDTF will emit (the number of fields and their types).

  • After initialization, the process method is called; the real processing happens there. Each call to forward() inside process emits one row. If a row has multiple columns, put the column values into an array and pass that array to forward().

  • Finally, the close() method is called to perform any cleanup that is needed.

A UDTF that parses out the individual events:

Code

package UDTF;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.json.JSONArray;
import org.json.JSONException;

import java.util.ArrayList;

public class EventJsonUDTF extends GenericUDTF {
    //In this method we declare the names and types of the output columns
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        ArrayList<String> fieldNames = new ArrayList<String>();
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<>();

        fieldNames.add("event_name");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldNames.add("event_json");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames,fieldOIs);

    }

    //One input record can produce several output rows
    @Override
    public void process(Object[] objects) throws HiveException {
        //Get the incoming "et" array string: objects[0].toString() ----> [{},{},{},{}]
        String input = objects[0].toString();

        //If the incoming data is blank, return without emitting anything, filtering the record out
        if (StringUtils.isBlank(input)){
            return;
        }else {
            try {
                //Parse the event array; each element is one event (e.g. ad/favorites)
                JSONArray ja = new JSONArray(input);

                //Iterate over every event
                for (int i = 0; i < ja.length(); i++){
                    //Each element is one event, e.g.: {"ett":"1604021380867","en":"display","kv":{"goodsid":"0","action":"2","extend1":"2","place":"4","category":"53"}}
                    String[] result = new String[2];

                    try {
                        //Extract the name of each event (e.g. ad/favorites)
                        result[0] = ja.getJSONObject(i).getString("en");

                        //Extract the whole event object as a string
                        result[1] = ja.getString(i);
                    }catch (JSONException e){
                        continue;
                    }
                    //Emit the result row
                    forward(result);
                }
            }catch (JSONException e){
                e.printStackTrace();
            }
        }
    }

    //Called when there are no more records to process; used for cleanup or to emit extra output
    @Override
    public void close() throws HiveException {

    }

}
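As with the UDF, the UDTF must be packaged and registered before use; it is then typically applied with LATERAL VIEW so that each input row can expand into several output rows. A minimal sketch (the jar path, the function name `flat_analizer`, and the table/column names are illustrative assumptions; `et` is assumed to be a string column holding the JSON event array):

```sql
-- Register the UDTF (jar path is hypothetical)
ADD JAR /opt/jars/hive-udtf-1.0.jar;
CREATE TEMPORARY FUNCTION flat_analizer AS 'UDTF.EventJsonUDTF';

-- Many rows out per row in: explode the event-array column
SELECT event_name, event_json
FROM ods_event_log
LATERAL VIEW flat_analizer(et) tmp_flat AS event_name, event_json;
```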

 

Origin blog.csdn.net/Poolweet_/article/details/109453718