7. Function
7.1 Hive function classification
- Classified by input and output:
Standard function: takes one or more columns from a single row as input and returns a single value.
Aggregate function: takes zero to many columns across multiple rows as input and returns a single value.
Table-generating function: takes zero or more inputs and returns multiple rows or multiple columns.
- Classified by implementation:
Built-in functions
Custom functions:
UDF: custom standard function
UDAF: custom aggregate function
UDTF: custom table-generating function
7.2 Built-in functions
Hive provides a large number of built-in functions for developers to use.
- Standard functions
Character functions
Type conversion functions
Mathematical functions
Date functions
Collection functions
Conditional functions
- Aggregate functions
- Table-generating functions
7.2.1 Character functions
return value | function | description |
---|---|---|
string | concat(string\|binary A, string\|binary B, …) | Concatenates binary byte codes or strings in order |
int | instr(string str, string substr) | Find the position of the substring substr in the string str |
int | length(string A) | Returns the length of the string |
int | locate(string substr, string str[, int pos]) | Find the position of the first occurrence of the string substr after the pos position in the string str |
string | lower(string A) /upper(string A) | Convert all letters of string A to lowercase/uppercase letters |
string | regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT) | Replace the qualified part of the string with the string specified by REPLACEMENT according to the regular expression PATTERN |
array | split(string str, string pat) | Split the string str according to the regular expression pat |
string | substr(string\|binary A, int start, int len), substring(string\|binary A, int start, int len) | Returns the substring of A of length len starting at position start |
string | trim(string A) | Remove the spaces before and after the string A |
map | str_to_map(text[, delimiter1, delimiter2]) | Convert the string str into Map according to the specified separator |
binary | encode(string src, string charset) | Use the specified character set charset to encode a string into a binary value |
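Several of the character functions above can be combined in one query. A minimal sketch (the table `emp` and column `ename` are hypothetical):

```sql
SELECT
  concat(ename, '_x')                 AS tagged,      -- string concatenation
  length(ename)                       AS name_len,    -- string length
  upper(trim(ename))                  AS clean_name,  -- strip spaces, then upper-case
  split('a,b,c', ',')                 AS parts,       -- ["a","b","c"]
  substr('hello', 2, 3)               AS mid,         -- "ell" (positions are 1-based)
  str_to_map('k1:v1,k2:v2', ',', ':') AS m            -- {"k1":"v1","k2":"v2"}
FROM emp;
```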
7.2.2 Type conversion functions and mathematical functions
return value | Type conversion function | description |
---|---|---|
"type" | cast(expr as &lt;type&gt;) | Converts expr to the given type, e.g. cast("1" as BIGINT) converts the string "1" to a BIGINT |
binary | binary(string|binary) | Convert the entered value to binary |
return value | Mathematical function | description |
---|---|---|
DOUBLE | round(DOUBLE a) | Returns the rounded BIGINT value of a |
DOUBLE | round(DOUBLE a, INT d) | Returns a rounded to d decimal places |
BIGINT | floor(DOUBLE a) | Rounds down, e.g. floor(6.10) = 6, floor(-3.4) = -4 |
DOUBLE | rand(INT seed) | Returns a DOUBLE random number, seed is a random factor |
DOUBLE | power(DOUBLE a, DOUBLE p) | Calculate the p power of a |
DOUBLE | abs(DOUBLE a) | Calculate the absolute value of a |
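A quick sketch of the conversion and math functions in a single query (Hive allows SELECT without a FROM clause in recent versions):

```sql
SELECT
  cast('1' as BIGINT) AS n,    -- string "1" -> BIGINT 1
  round(3.14159, 2)   AS pi2,  -- 3.14
  floor(6.10)         AS f1,   -- 6
  floor(-3.4)         AS f2,   -- -4
  power(2, 10)        AS p,    -- 1024.0
  abs(-5)             AS a;    -- 5
```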
7.2.3 Date functions
return value | function | description |
---|---|---|
string | from_unixtime(bigint unixtime[, string format]) | Converts a Unix timestamp to a string in the given format |
int | unix_timestamp() | Returns the current Unix timestamp in the local time zone |
bigint | unix_timestamp(string date) | Converts a time string in the format yyyy-MM-dd HH:mm:ss into a Unix timestamp |
string | to_date(string timestamp) | Returns the date part of a time string |
int | year(string date) / month / day / hour / minute / second / weekofyear | Returns the year (or month/day/hour/minute/second/week of year) part of a time string |
int | datediff(string enddate, string startdate) | Returns the number of days from startdate to enddate |
string | date_add(string startdate, int days) | Returns the date days days after startdate |
string | date_sub(string startdate, int days) | Returns the date days days before startdate |
date | current_date | Returns the date of the current time |
timestamp | current_timestamp | Returns the current timestamp |
string | date_format(date/timestamp/string ts, string fmt) | Return the time date in the specified format, such as: date_format("2016-06-22","MM-dd")=06-22 |
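The date functions above can be exercised together; a sketch (from_unixtime depends on the session time zone, so the first result may vary):

```sql
SELECT
  from_unixtime(0, 'yyyy-MM-dd')       AS epoch_day,  -- 1970-01-01 in UTC sessions
  to_date('2016-06-22 10:30:00')       AS d,          -- 2016-06-22
  year('2016-06-22')                   AS y,          -- 2016
  datediff('2016-06-22', '2016-06-20') AS diff,       -- 2
  date_add('2016-06-22', 7)            AS next_week,  -- 2016-06-29
  date_format('2016-06-22', 'MM-dd')   AS md;         -- 06-22
```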
7.2.4 Collection functions
return value | function | description |
---|---|---|
int | size(Map<K,V>) | Returns the number of key-value pairs in the map |
int | size(Array<T>) | Returns the length of the array |
array | map_keys(Map<K,V>) | Returns all keys in the map |
array | map_values(Map<K,V>) | Returns all values in the map |
boolean | array_contains(Array<T>, value) | Returns true if the array contains value, otherwise false |
array | sort_array(Array<T>) | Sorts the array in ascending order |
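These collection functions are easy to try with the map() and array() constructors; a sketch:

```sql
SELECT
  size(map('a', 1, 'b', 2))         AS n_pairs,  -- 2
  size(array(10, 20, 30))           AS n_elems,  -- 3
  map_keys(map('a', 1, 'b', 2))     AS ks,       -- ["a","b"]
  array_contains(array(1, 2, 3), 2) AS has_two,  -- true
  sort_array(array(3, 1, 2))        AS sorted;   -- [1,2,3]
```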
7.2.5 Conditional functions
return value | function | description |
---|---|---|
T | if(boolean testCondition, T valueTrue, T valueFalseOrNull) | Returns valueTrue if testCondition is true, otherwise valueFalseOrNull |
T | nvl(T value, T default_value) | Returns default_value if value is NULL, otherwise value |
T | COALESCE(T v1, T v2, …) | Returns the first non-NULL value, or NULL if all values are NULL |
T | CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END | Returns c if a = b, e if a = d, otherwise f |
T | CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END | Returns b if a is true, d if c is true, otherwise e |
boolean | isnull(a) | Returns true if a is NULL, otherwise false |
boolean | isnotnull(a) | Returns true if a is not NULL, otherwise false |
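A sketch exercising the conditional functions in one query:

```sql
SELECT
  if(1 = 1, 'yes', 'no')        AS r1,  -- yes
  nvl(NULL, 'default')          AS r2,  -- default
  coalesce(NULL, NULL, 'first') AS r3,  -- first
  CASE 2 WHEN 1 THEN 'one' WHEN 2 THEN 'two' ELSE 'other' END AS r4,  -- two
  isnull(NULL)                  AS r5;  -- true
```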
7.2.6 Aggregate functions and table-generating functions
- Aggregate functions
count, sum, max, min, avg, var_samp, etc.
- Table-generating functions: the output can be used as a table
return value | function | description |
---|---|---|
N rows | explode(ARRAY) | Generates one row containing each element of the array |
N rows | explode(MAP) | Generates one row per key-value pair, with one column for the key and another for the value |
N rows | posexplode(ARRAY) | Like explode, but also returns each element's position in the array |
N rows | stack(INT n, v_1, v_2, …, v_k) | Turns k values into n rows of k/n columns each; n must be a constant |
tuple | json_tuple(jsonStr, k1, k2, …) | Extracts multiple keys from a JSON string and returns them as a tuple; unlike get_json_object, it can fetch several keys in one call |
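Table-generating functions are most often paired with LATERAL VIEW, which joins the generated rows back to the source row. A sketch (the table `emp` and its array column `hobbies` are hypothetical):

```sql
-- explode alone: one output row per array element
SELECT explode(array('a', 'b', 'c')) AS item;

-- LATERAL VIEW: keep columns of the original row alongside each element
SELECT e.ename, t.hobby
FROM emp e
LATERAL VIEW explode(e.hobbies) t AS hobby;
```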
7.3 Custom UDF functions
Hive UDF development workflow:
- Extend the UDF class or the GenericUDF class
- Override the evaluate() method and implement the function logic
- Compile and package as a jar file
- Copy the jar to the correct HDFS path
- Create a temporary/permanent function from the jar
- Call the function
1. Create a Maven project named Hive
https://blog.csdn.net/zmzdmx/article/details/108401283
2. Add the dependencies
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>1.2.1</version>
</dependency>
3. Create a class
package cn.kgc.kb09;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
// A simple UDF that converts a string to upper case
public class TestUDF extends UDF {
    public Text evaluate(Text str) {
        if (null == str) {
            return null;
        }
        return new Text(str.toString().toUpperCase());
    }
    // Quick local test, no Hive cluster needed
    public static void main(String[] args) {
        TestUDF tu = new TestUDF();
        Text rst = tu.evaluate(new Text("hello"));
        System.out.println(rst); // prints HELLO
    }
}
Method 1 (create a temporary function):
4. Package the code as a jar and upload it to the server, e.g. /opt/testudf.jar
5. Add the jar to Hive's classpath (a Linux path):
add jar /opt/testudf.jar;
6. Create a temporary function associated with the compiled Java class:
create temporary function mylower as "cn.kgc.kb09.TestUDF";
Method 2 (create a permanent function):
4. Upload the jar to an HDFS path with the hdfs command on the Linux command line:
hdfs dfs -put <local path> <hdfs path>
5. Create the function, pointing at the class and the jar's HDFS location:
create function <function name> as '<fully qualified class name>' using jar '<hdfs path of the jar>';
7. The custom function can now be used in HQL:
select ename, mylower(ename) lowername from emp;
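Putting the permanent-function steps together, a concrete sketch (the HDFS path /hive/jars/ and the function name myupper are assumptions for illustration):

```sql
-- after:  hdfs dfs -put /opt/testudf.jar /hive/jars/
create function myupper as 'cn.kgc.kb09.TestUDF'
  using jar 'hdfs:///hive/jars/testudf.jar';

select ename, myupper(ename) uppername from emp;
```

Unlike a temporary function, this registration is stored in the metastore and survives across sessions.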
Note
- If the following error occurs, run the command shown below on Linux:
java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:294)
at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)
at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)
at org.apache.zeppelin.jdbc.JDBCInterpreter.executeSql(JDBCInterpreter.java:736)
at org.apache.zeppelin.jdbc.JDBCInterpreter.interpret(JDBCInterpreter.java:819)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:103)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:632)
at org.apache.zeppelin.scheduler.Job.run(Job.java:188)
at org.apache.zeppelin.scheduler.ParallelScheduler$JobRunner.run(ParallelScheduler.java:162)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Run the following command to strip the signature files from the jar (if it does not take effect, exit Hive and run it again):
zip -d testUdf.jar 'META-INF/*.SF' 'META-INF/*.RSA' 'META-INF/*SF'
- A temporary function can be used across databases; a permanent function must be referenced as database_name.function_name.