Hive (7) Hive functions and UDF functions

7. Functions

7.1 Hive function classification

  • Classified by input and output:
    Standard function: takes one or more columns from a single row as input and returns a single value.
    Aggregate function: takes zero or more columns from multiple rows as input and returns a single value.
    Table-generating function: takes zero or more inputs and returns multiple columns or rows.
  • Classified by implementation:
    Built-in functions
    Custom functions:
    UDF: custom standard function
    UDAF: custom aggregate function
    UDTF: custom table-generating function

7.2 Built-in functions

Hive provides a large number of built-in functions for developers to use.

  • Standard functions
    Character functions
    Type conversion functions
    Mathematical functions
    Date functions
    Collection functions
    Conditional functions
  • Aggregate functions
  • Table-generating functions

7.2.1 Character functions

| Return value | Function | Description |
| --- | --- | --- |
| string | concat(string\|binary A, string\|binary B...) | Concatenates the strings or binary bytes in order |
| int | instr(string str, string substr) | Returns the position of the substring substr in the string str |
| int | length(string A) | Returns the length of the string |
| int | locate(string substr, string str[, int pos]) | Returns the position of the first occurrence of substr in str after position pos |
| string | lower(string A) / upper(string A) | Converts all letters of string A to lowercase / uppercase |
| string | regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT) | Replaces each part of the string that matches the regular expression PATTERN with REPLACEMENT |
| array | split(string str, string pat) | Splits the string str around matches of the regular expression pat |
| string | substr(string\|binary A, int start, int len) / substring(string\|binary A, int start, int len) | Returns the substring of A of length len starting at position start |
| string | trim(string A) | Removes leading and trailing spaces from string A |
| map | str_to_map(text[, delimiter1, delimiter2]) | Converts the string into a map using the specified delimiters |
| binary | encode(string src, string charset) | Encodes a string into a binary value using the character set charset |
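As a quick illustration, several of the string functions above can be combined in one query (the literal values are made up for demonstration; recent Hive versions allow SELECT without a FROM clause):

```sql
SELECT concat('hive', '-', 'udf'),             -- 'hive-udf'
       instr('hadoop', 'do'),                  -- 3
       locate('o', 'hadoop', 5),               -- 5
       split('a,b,c', ','),                    -- ["a","b","c"]
       regexp_replace('foo2bar', '[0-9]', ''), -- 'foobar'
       trim('  hive  ');                       -- 'hive'
```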

7.2.2 Type conversion functions and mathematical functions

| Return value | Type conversion function | Description |
| --- | --- | --- |
| "type" | cast(expr as \<type\>) | Converts expr to the given type, e.g. cast('1' as BIGINT) converts the string '1' to a BIGINT |
| binary | binary(string\|binary) | Converts the input value to binary |

| Return value | Mathematical function | Description |
| --- | --- | --- |
| DOUBLE | round(DOUBLE a) | Returns a rounded to the nearest integer (a BIGINT value) |
| DOUBLE | round(DOUBLE a, INT d) | Returns a rounded to d decimal places |
| BIGINT | floor(DOUBLE a) | Rounds down, e.g. 6.10 -> 6, -3.4 -> -4 |
| DOUBLE | rand(INT seed) | Returns a random DOUBLE; seed is the random seed |
| DOUBLE | power(DOUBLE a, DOUBLE p) | Returns a raised to the power p |
| DOUBLE | abs(DOUBLE a) | Returns the absolute value of a |
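A short sketch of how the conversion and math functions behave, with illustrative literals only:

```sql
SELECT cast('1' AS BIGINT),  -- 1
       round(3.1415, 2),     -- 3.14
       floor(-3.4),          -- -4
       power(2, 10),         -- 1024
       abs(-7);              -- 7
```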

7.2.3 Date functions

| Return value | Function | Description |
| --- | --- | --- |
| string | from_unixtime(bigint unixtime[, string format]) | Converts a Unix timestamp to the given format |
| bigint | unix_timestamp() | Returns the current Unix timestamp in the local time zone |
| bigint | unix_timestamp(string date) | Converts a time string in yyyy-MM-dd HH:mm:ss format to a Unix timestamp |
| string | to_date(string timestamp) | Returns the date part of a time string |
| int | year(string date) / month / day / hour / minute / second / weekofyear | Returns the year / month / day / hour / minute / second / week of the time string |
| int | datediff(string enddate, string startdate) | Returns the number of days from startdate to enddate |
| string | date_add(string startdate, int days) | Adds days to startdate |
| string | date_sub(string startdate, int days) | Subtracts days from startdate |
| date | current_date | Returns the current date |
| timestamp | current_timestamp | Returns the current timestamp |
| string | date_format(date/timestamp/string ts, string fmt) | Returns the date/time in the specified format, e.g. date_format('2016-06-22', 'MM-dd') = '06-22' |
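For example (the dates here are arbitrary sample values):

```sql
SELECT to_date('2016-06-22 10:30:00'),        -- '2016-06-22'
       year('2016-06-22'),                    -- 2016
       datediff('2016-06-22', '2016-06-20'),  -- 2
       date_add('2016-06-22', 7),             -- '2016-06-29'
       date_format('2016-06-22', 'MM-dd');    -- '06-22'
```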

7.2.4 Collection functions

| Return value | Function | Description |
| --- | --- | --- |
| int | size(Map<K.V>) | Returns the number of key-value pairs in the map |
| int | size(Array<T>) | Returns the length of the array |
| array<K> | map_keys(Map<K.V>) | Returns all keys in the map |
| array<V> | map_values(Map<K.V>) | Returns all values in the map |
| boolean | array_contains(Array<T>, value) | Returns true if the array contains value, otherwise false |
| array | sort_array(Array<T>) | Sorts the array in ascending order |
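These functions work on array and map literals as well as on complex-typed columns; a small illustrative query:

```sql
SELECT size(array(1, 2, 3)),               -- 3
       map_keys(map('a', 1, 'b', 2)),      -- ["a","b"]
       array_contains(array(1, 2, 3), 2),  -- true
       sort_array(array(3, 1, 2));         -- [1,2,3]
```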

7.2.5 Conditional functions

| Return value | Function | Description |
| --- | --- | --- |
| T | if(boolean testCondition, T valueTrue, T valueFalseOrNull) | Returns valueTrue if testCondition is true, otherwise valueFalseOrNull |
| T | nvl(T value, T default_value) | Returns default_value if value is NULL, otherwise value |
| T | COALESCE(T v1, T v2, ...) | Returns the first non-NULL value, or NULL if all values are NULL |
| T | CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END | Returns c if a = b, e if a = d, otherwise f |
| T | CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END | Returns b if a is true, d if c is true, otherwise e |
| boolean | isnull(a) | Returns true if a is NULL, otherwise false |
| boolean | isnotnull(a) | Returns true if a is not NULL, otherwise false |
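A brief illustration of the conditional functions (the values are made up):

```sql
SELECT if(1 = 1, 'yes', 'no'),                 -- 'yes'
       nvl(NULL, 'default'),                   -- 'default'
       coalesce(NULL, NULL, 'first'),          -- 'first'
       CASE WHEN 2 > 1 THEN 'b' ELSE 'e' END,  -- 'b'
       isnull(NULL);                           -- true
```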

7.2.6 Aggregate functions and table-generating functions

  • Aggregate functions
    count, sum, max, min, avg, var_samp, etc.
  • Table-generating functions: the output can be used as a table

| Return value | Function | Description |
| --- | --- | --- |
| N rows | explode(array) | Generates one row for each element of the array |
| N rows | explode(MAP) | Generates one row per key-value pair of the map, with one column for the key and another for the value |
| N rows | posexplode(ARRAY) | Like explode, but also returns each element's position in the array |
| N rows | stack(INT n, v_1, v_2, ..., v_k) | Breaks the k values into n rows of k/n columns each; n must be a constant |
| tuple | json_tuple(jsonStr, k1, k2, ...) | Extracts multiple keys from a JSON string and returns them as a tuple; unlike get_json_object, it can fetch several values in one call |
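Table-generating functions are usually combined with LATERAL VIEW so their output can be joined back to the source row. A sketch, where the table t and its array column tags are hypothetical:

```sql
-- Standalone: produces three rows, one per array element
SELECT explode(array('a', 'b', 'c'));

-- With LATERAL VIEW: one output row per (id, tag) pair
SELECT id, tag
FROM t LATERAL VIEW explode(tags) tmp AS tag;
```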

7.3 Custom UDF functions

Hive UDF development workflow:

  • Extend the UDF class or the GenericUDF class
  • Override the evaluate() method and implement the function logic
  • Compile and package the code as a jar file
  • Copy the jar to the correct HDFS path
  • Create a temporary/permanent function from the jar
  • Call the function

1. Create a Maven project named Hive

https://blog.csdn.net/zmzdmx/article/details/108401283

2. Add the dependencies

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.6.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>1.2.1</version>
</dependency>

3. Create a class

package cn.kgc.kb09;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class TestUDF extends UDF {

    // Converts the input string to upper case; returns null for null input
    public Text evaluate(Text str) {
        if (null == str) {
            return null;
        }
        return new Text(str.toString().toUpperCase());
    }

    // Quick local check of the evaluate() logic
    public static void main(String[] args) {
        TestUDF tu = new TestUDF();
        Text rst = tu.evaluate(new Text());
        System.out.println(rst);
    }
}

Method 1 (create a temporary function):

4. Package the code as a jar and upload it to the server, e.g. /opt/testudf.jar

5. Add the jar to Hive's classpath (a Linux path):

add jar /opt/testudf.jar;

6. Create a temporary function associated with the compiled Java class:

create temporary function mylower as "cn.kgc.kb09.TestUDF";

Method 2 (create a permanent function):

4. From the Linux command line, upload the jar to HDFS with the hdfs command, then create the function:

hdfs dfs -put <local jar path> <HDFS path>
create function <function name> as '<fully qualified class name>' using jar '<HDFS path of the jar>';

7. The custom function can now be used in HQL:

select ename, mylower(ename) lowername from emp;

Note

  • If the following error is reported, fix it from the Linux command line.

Error:

java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask
	at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:294)
	at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)
	at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)
	at org.apache.zeppelin.jdbc.JDBCInterpreter.executeSql(JDBCInterpreter.java:736)
	at org.apache.zeppelin.jdbc.JDBCInterpreter.interpret(JDBCInterpreter.java:819)
	at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:103)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:632)
	at org.apache.zeppelin.scheduler.Job.run(Job.java:188)
	at org.apache.zeppelin.scheduler.ParallelScheduler$JobRunner.run(ParallelScheduler.java:162)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Run the following command; if that does not help, exit Hive and run it again:

zip -d testUdf.jar 'META-INF/*.SF' 'META-INF/*.RSA' 'META-INF/*SF'

  • Temporary functions can be used across databases; permanent functions must be referenced as dbname.functionname.

Origin blog.csdn.net/zmzdmx/article/details/108739886