5. Hive parameter configuration and use of functions and operators

1. Hive client and attribute configuration

1.1 CLIs and Commands

1.1.1 Hive CLI

$HIVE_HOME/bin/hive is the first-generation client, a shell utility. It has two main functions:
1. Run Hive queries in interactive or batch mode.
2. Start Hive-related services.
You can run "hive -H" or "hive --help" to view the command-line options

-e <quoted-query-string>      execute the SQL statement given after -e, then exit
-f <filename>                 execute the SQL file given after -f, then exit
-H, --help                    print help information
--hiveconf <property=value>   set a configuration property
-S, --silent                  silent mode
-v, --verbose                 verbose mode, echo the executed SQL to the console
--service service_name        start a Hive-related service

Function 1: batch mode
When bin/hive is run with the -e or -f option, it executes SQL commands in batch mode.
Batch processing here means one-shot execution: the client exits as soon as the statements finish.

# -e
$HIVE_HOME/bin/hive -e 'show databases'

# -f
cd ~
# edit a SQL file containing valid, correct SQL statements
vim hive.sql
show databases;
# execute a SQL file loaded from the local disk of the machine where the client runs
$HIVE_HOME/bin/hive -f /root/hive.sql
# a SQL file can also be loaded from other file systems and executed
$HIVE_HOME/bin/hive -f hdfs://<namenode>:<port>/hive-script.sql
$HIVE_HOME/bin/hive -f s3://mys3bucket/s2-script.sql

# -i: run an initialization script before entering interactive mode
$HIVE_HOME/bin/hive -i /home/my/hive-init.sql

# use silent mode to dump query results into a file
$HIVE_HOME/bin/hive -S -e 'select * from student' > a.txt

Function 2: interactive shell mode
In interactive mode the client stays connected to the Hive service until the user exits the client manually.

/export/server/hive/bin/hive

hive> show databases;

Function 3: start Hive services
for example, starting the metastore service and the hiveserver2 service

# --hiveconf
$HIVE_HOME/bin/hive --hiveconf hive.root.logger=DEBUG,console

# --service
$HIVE_HOME/bin/hive --service metastore
$HIVE_HOME/bin/hive --service hiveserver2

1.1.2 Beeline CLI

$HIVE_HOME/bin/beeline is called the second-generation client. It is a JDBC client with better performance and improved security compared with the first-generation client.
In embedded mode, it runs an embedded Hive.
In remote mode, beeline connects over Thrift to a separate HiveServer2 service.
Beeline supports many parameters, which can be looked up in the official documentation.

The common usage is as follows: with the hiveserver2 service already started, use beeline to connect to the HS2 service remotely.

Start beeline
/export/server/hive/bin/beeline

beeline> ! connect jdbc:hive2://node1:10000

1.2 Configuration Properties

overview

  • In addition to the default attribute configuration, Hive also supports users to modify the configuration when using
  • Before modifying the Hive configuration, as a user, you need to master two things
    • Which properties support user modification, and what each property does
    • Which methods are supported for modification, whether it is temporary or permanent
  • Hive configuration properties are actually managed in the HiveConf.java class, you can refer to the file for a list of configuration properties available in the current version
  • Starting from Hive 0.14.0, the configuration template file hive-default.xml.template is generated directly from HiveConf.java
  • For detailed configuration parameters, please refer to Hive official website configuration parameters

Configuration methods
Method 1: hive-site.xml
In the $HIVE_HOME/conf path you can add a hive-site.xml file containing the configuration properties you need to define or override. This configuration file affects any service started from, and any client using, this Hive installation.

Method 2: –hiveconf command line parameter

  • hiveconf is a command-line parameter used to specify configuration properties when starting the Hive CLI or Beeline CLI
  • Configuration set this way takes effect for the entire session and becomes invalid when the session ends
  • For example, when starting a Hive service you can raise the log level through the hiveconf parameter to see more startup details

Method 3: set command

  • Use the set command in the Hive CLI or Beeline to set configuration parameters; the settings apply to all SQL statements after the set command and are also session-level
  • This is also the most commonly used way to set configuration parameters in daily development
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;

Method 4: Service-specific configuration files
hivemetastore-site.xml, hiveserver2-site.xml

  • Hive Metastore loads the available hive-site.xml and hivemetastore-site.xml configuration files
  • HiveServer2 loads the available hive-site.xml and hiveserver2-site.xml
    If HiveServer2 uses an embedded metastore, it also loads hivemetastore-site.xml

Summary

  • Configuration priority: set command > hiveconf parameter > hive-site.xml
  • A set declaration overrides the hiveconf command-line parameter, and the command-line parameter overrides the hive-site.xml configuration file
  • In daily development, unless a core property needs to be changed globally, it is recommended to use the set command
  • Hive also reads the Hadoop configuration, because Hive starts as a Hadoop client; Hive configuration overrides Hadoop configuration
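
As a quick illustration of the priority rules (a sketch using the dynamic-partition properties mentioned above), the set command can both inspect and override a property for the current session:

-- print the current value of a property (omit the value part)
set hive.exec.dynamic.partition.mode;
-- override it for this session only; this wins over --hiveconf and hive-site.xml
set hive.exec.dynamic.partition.mode = nonstrict;
-- the override disappears when the session ends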

2. Hive built-in operators

Overview
On the whole, the operators supported by Hive are divided into three categories: relational operations, arithmetic operations, and logical operations.
You can refer to official documents
or use the following methods to view the use of operators

-- show all functions and operators
show functions;
-- view the usage description of an operator or function
describe function count;
-- use extended to view a more detailed description
describe function extended count;

Test environment preparation
Create an empty table dual in Hive to test the functions of various operators

-- 1. create the dual table
create table dual(id string);

-- 2. load the file dual.txt into the dual table
-- dual.txt has only one line, whose content is a single space

-- 3. use the dual table in select statements to test operators and functions
select 1+1 from dual;

2.1 Relational Operators

  • Relational operators are binary operators that perform a comparison of two operands
  • Each relational operator returns a boolean type result (TRUE or FALSE)
Equality comparison: =, ==
Inequality comparison: <>, !=
Less than: <
Less than or equal: <=
Greater than: >
Greater than or equal: >=
Null check: IS NULL
Not-null check: IS NOT NULL
LIKE comparison: LIKE
Java-style LIKE (regular expression): RLIKE
REGEXP operation: REGEXP

example

-- 1. relational operators in Hive
-- is null: null check
select 1 from dual where 'test' is null;

-- is not null: not-null check
select 1 from dual where 'test' is not null;

-- like comparison: _ matches any single character, % matches any number of characters
-- negation: NOT A LIKE B
select 1 from dual where 'test' like 'te__';
select 1 from dual where 'test' like 'te%';
select 1 from dual where 'test' like '_e%';

-- rlike: checks whether a string matches a regular expression, synonym of REGEXP
select 1 from dual where 'test' rlike '^t.*t$';
select 1 from dual where '123456' rlike '^\\d+$'; -- check whether the string is all digits
select 1 from dual where '123456aa' rlike '^\\d+$';

-- regexp: same functionality as rlike, checks whether a string matches a regular expression
select 1 from dual where 'test' regexp '^t.*t$';

2.2 Arithmetic operators

Arithmetic operands must be of numeric type. Arithmetic operators are divided into unary and binary operators:
a unary operator has only one operand; a binary operator has two operands, with the operator between them.

Addition: +
Subtraction: -
Multiplication: *
Division: /
Integer division: div
Modulo: %
Bitwise AND: &
Bitwise OR: |
Bitwise XOR: ^
Bitwise NOT: ~

example

-- integer division: div
select 17 div 3;

-- modulo: %
select 17 % 3;

-- bitwise AND: & performs a bitwise AND of A and B
select 4 & 8 from dual;

-- bitwise OR: |
select 4 | 8 from dual;

-- bitwise XOR: ^
select 4 ^ 8 from dual;

2.3 Logical operators

AND: A AND B
OR: A OR B
NOT: NOT A, !A
IN: A IN (val1, val2, ...)
NOT IN: A NOT IN (val1, val2, ...)
Existence: [NOT] EXISTS (subquery)

example

-- 3. logical operators in Hive
-- AND: A AND B
select 1 from dual where 3 > 1 and 2 > 1;
-- OR: A OR B
select 1 from dual where 3 > 1 or 2 != 2;
-- NOT: NOT A, !A
select 1 from dual where not 2 > 1;
select 1 from dual where !(2 = 1);
-- IN: A IN (val1, val2, ...)
select 1 from dual where 11 in (11,22,33);
-- NOT IN: A NOT IN (val1, val2, ...)
select 1 from dual where 11 not in (22,33,44);
-- existence: [NOT] EXISTS (subquery)
select A.* from A
where exists (select B.id from B where A.id = B.id);

Other operators include the string concatenation operator (||), constructor operators, and so on; a brief example follows.
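
For example, on Hive versions that support the || operator (older releases only provide concat), the two queries below should return the same result; treat this as a sketch:

-- string concatenation with || (equivalent to concat where supported)
select 'hello' || ' ' || 'hive' from dual;
select concat('hello', ' ', 'hive') from dual;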

3. Getting Started with Hive Functions

3.1 Overview and classification criteria of Hive functions

Overview
There are many built-in functions in Hive to meet the different needs of users and improve the efficiency of SQL writing
1. Use show functions to view all the functions currently available
2. Use describe function extended funcname to view the usage of functions

Classification criteria
Hive functions are divided into two categories: built-in functions and user-defined functions UDF
1. Built-in functions can be divided into: numeric type functions, date type functions, string type functions, aggregate functions, conditional functions, etc.
2. User-defined functions can be divided into three categories according to the number of input and output rows: UDF, UDAF, UDTF

User-defined function UDF classification standard
UDF: ordinary functions, one row in, one row out
UDAF: aggregate functions, many rows in, one row out
UDTF: table-generating functions, one row in, many rows out

UDF taxonomy expansion
The UDF taxonomy was originally aimed at functions written and developed by users themselves, but it can be extended to all Hive functions, including built-in functions and user-defined functions. For example, in the official Hive documentation, aggregate functions are classified as built-in UDAF-type functions.

3.2 Hive built-in functions

Overview
Built-in functions are the functions that Hive implements and ships out of the box, ready to use. For details, please refer to the official documentation.
The built-in functions can be divided into 8 types according to the application classification: date function, string function, mathematical function, conditional function, type conversion function, data desensitization function, aggregate function, and other miscellaneous functions

3.2.1 String functions

  • String length function: length
  • String reversal function: reverse
  • String concatenation function: concat
  • String concatenation function with delimiter: concat_ws
  • Substring extraction function: substr, substring
  • String to uppercase function: upper, ucase
  • String to lowercase function: lower, lcase
  • Function to remove spaces: trim
  • Function to remove spaces on the left: ltrim
  • Function to remove spaces on the right: rtrim
  • Regular expression replacement function: regexp_replace
  • Regular expression extraction function: regexp_extract
  • URL parsing function: parse_url
  • JSON parsing function: get_json_object
  • Space string function: space
  • Repeat string function: repeat
  • ASCII code of the first character: ascii
  • Left padding function: lpad
  • Right padding function: rpad
  • Split string function: split
  • Set lookup function: find_in_set
-- string concatenation: concat
select concat("angela","baby");
-- string concatenation with a separator: concat_ws(separator, [string | array(string)]+)
select concat_ws('.','www',array('baidu','com'));
-- substring functions: substr(str, pos[, len]) or substring(str, pos[, len])
select substr("angelababy",-2); -- pos is a 1-based index; a negative value counts from the end
select substr("angelababy",2,2);
-- regular expression replacement: regexp_replace(str, regexp, rep)
select regexp_replace('100-200','(\\d+)', 'num');
-- regular expression extraction: regexp_extract(str, regexp[, idx]) extracts the content of the specified capture group matched by the regex
select regexp_extract('100-200','(\\d+)-(\\d+)', 2);
-- URL parsing: parse_url; to extract multiple parts at once, use the UDTF parse_url_tuple
select parse_url('http://www.baidu.com/123/fasfawq','HOST');
-- string splitting: split(str, regex)
select split('apache hive','\\s+'); -- \\s+ matches whitespace characters such as spaces and tabs
-- JSON parsing: get_json_object(json_txt, path)
-- $ refers to the JSON object
select get_json_object('[{"website":"www.baidu.com","name":"test"},{"website":"www.google.com","name":"test22"}]','$.[1].website');

3.2.2 Date functions

  • Get the current date: current_date
  • Get the current timestamp: current_timestamp
  • UNIX timestamp to date function: from_unixtime
  • Get the current UNIX timestamp function: unix_timestamp
  • Date to UNIX timestamp function: unix_timestamp
  • Convert date in specified format to UNIX timestamp function: unix_timestamp
  • Extract date function: to_date
  • Date to year function: year
  • Date to month function: month
  • Date to day function: day
  • Date to hour function: hour
  • Date to minute function: minute
  • Date to second function: second
  • Date to week-of-year function: weekofyear
  • Date comparison function: datediff
  • Date increase function: date_add
  • Date reduction function: date_sub
-- get the current date: current_date
select current_date();
-- get the current timestamp: current_timestamp
-- all calls to current_timestamp within the same query return the same value
select current_timestamp();
-- get the current UNIX timestamp: unix_timestamp
select unix_timestamp();
-- convert a date to a UNIX timestamp: unix_timestamp
select unix_timestamp("2011-12-07 13:01:03");
-- convert a date in a specified format to a UNIX timestamp: unix_timestamp
select unix_timestamp("20111207 13:01:03",'yyyyMMdd HH:mm:ss');
-- convert a UNIX timestamp to a date: from_unixtime
select from_unixtime(1618238391);
select from_unixtime(0,'yyyy-MM-dd HH:mm:ss');
-- date difference: datediff; dates must be in 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd' format
select datediff('2012-02-28','2012-05-28');
-- add days to a date: date_add
select date_add('2012-03-28',10);
-- subtract days from a date: date_sub
select date_sub('2012-01-01',10);

3.2.3 Mathematical functions

  • Rounding function: round
  • Specify precision rounding function: round
  • Rounding down function: floor
  • Round up function: ceil
  • Take a random function: rand
  • Binary function: bin
  • Base conversion function: conv
  • Absolute value function: abs
-- rounding: round returns the integer part of a double (standard rounding)
select round(3.1415926);
-- rounding to a given precision: round(double a, int d) returns a double rounded to d decimal places
select round(3.1415926, 5);
-- round down: floor
select floor(3.1415926);
select floor(-3.1415926);
-- round up: ceil
select ceil(3.1415926);
select ceil(-3.1415926);
-- random number: rand returns a random number between 0 and 1, different on every execution
select rand();
-- random number with a seed: rand(int seed) produces a stable sequence of random numbers
select rand(2);
-- binary conversion: bin(BIGINT a)
select bin(18);
-- base conversion: conv(BIGINT num, int from_base, int to_base)
select conv(17,10,16);
-- absolute value: abs
select abs(-3.9);

3.2.4 Collection functions

  • Collection size function: size(Map<K,V>), size(Array<T>)
  • Map keys function: map_keys(Map<K,V>)
  • Map values function: map_values(Map<K,V>)
  • Check whether an array contains a given element: array_contains(Array<T>, value)
  • Array sorting function: sort_array(Array<T>)
-- collection size function: size(Map<K,V>), size(Array<T>)
select size(`array`(11,22,33));
select size(`map`("id",10086,"name","zhangsan","age",18));

-- map keys function: map_keys(Map<K,V>)
select map_keys(`map`("id",10086,"name","zhangsan","age",18));
-- map values function: map_values(Map<K,V>)
select map_values(`map`("id",10086,"name","zhangsan","age",18));

-- check whether an array contains a given element: array_contains(Array<T>, value)
select array_contains(`array`(111,222,333),111);

-- array sorting function: sort_array(Array<T>)
select sort_array(`array`(12,2,32));

3.2.5 Conditional functions

Mainly used for conditional checks and logical conversions

  • If condition judgment: if (boolean testCondition, T valueTrue, T valueFalseOrNull)
  • Empty judgment function: isnull(a)
  • Non-empty judgment function: isnotnull(a)
  • Null value conversion function nvl(T value, T default_value)
  • Non-null lookup function: COALESCE(T v1, T v2,…)
  • Conditional conversion function: case a when b then c [when d then e]* [else f] end
  • nullif(a,b): If a = b, return null, otherwise return a
  • assert_true: raises an exception if the condition is not true, otherwise returns null
-- use the student table data created in an earlier lesson
-- if: conditional judgment
select if(1 = 2, 100, 200);
select if(sex = '男', 'M','W') from student limit 3;

-- null check function: isnull(a)
select isnull("allen");
select isnull(null);

-- not-null check function: isnotnull(a)
select isnotnull("allen");
select isnotnull(null);

-- null replacement function: nvl(T value, T default_value)
select nvl("allen",'helen');
select nvl(null,'test');

-- first non-null value: COALESCE(T v1, T v2,...)
-- returns the first non-null argument; if all arguments are NULL, returns NULL
select COALESCE(null,11,22,33);
select COALESCE(null,null);

-- conditional conversion function: case ... when ... then ... end
select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end;
select case sex when '男' then 'male' else 'female' end from student limit 3;

-- nullif(a,b)
-- returns null if a = b, otherwise returns a
select nullif(11,11);
select nullif(11,12);

-- assert_true(condition)
-- raises an exception if 'condition' is not true, otherwise returns null
select assert_true(11 >= 0);

3.2.6 Type conversion functions

Mainly used for explicit data type conversion

describe function extended cast;

-- convert between arbitrary data types: cast
select cast(12.14 as bigint);
select cast(12.14 as string);

3.2.7 Data masking functions

Mainly used for data masking (desensitization), hiding the original data

mask
mask_first_n(string str [, int n])
mask_last_n(string str [, int n])
mask_show_first_n(string str [, int n])
mask_show_last_n(string str [, int n])
mask_hash(string|char|varchar str)
-- mask
-- in the returned data, uppercase letters are converted to X, lowercase letters to x, and digits to n
select mask("abc123DFG");
select mask("abc123DFG",'-','.','^'); -- custom replacement characters

-- mask_first_n(string str[, int n])
-- masks the first n characters
select mask_first_n("abc123DFG",4);

-- mask_last_n(string str[, int n])
-- masks the last n characters
select mask_last_n("abc123DFG",4);

-- mask_show_first_n(string str[, int n])
-- masks everything except the first n characters
select mask_show_first_n("abc123DFG",4);

-- mask_show_last_n(string str[, int n])
-- masks everything except the last n characters
select mask_show_last_n("abc123DFG",4);

-- mask_hash(string|char|varchar str)
-- returns a hash of the string
select mask_hash("abc123DFG");

3.2.8 Other miscellaneous functions

  • Hive calls the java method: java_method(class, method[, arg1[, arg2]])
  • Reflection function: reflect(class,method[,arg1[,arg2]])
  • Hash value function: hash
  • current_user()、logged_in_user()、current_database()、version()
  • SHA-1 encryption: sha1(string/binary)
  • SHA-2 family algorithm encryption: sha2(string/binary, int) (SHA-224, SHA-256, SHA-384, SHA-512)
  • crc32 encryption
  • MD5 encryption: md5(string/binary)
-- if the jar containing the Java method to call is not bundled with Hive, add it with add jar
-- call a Java method from Hive
select java_method("java.lang.Math","max",11,22);

-- reflection function
select reflect("java.lang.Math","max",11,22);

-- hash function
select hash("allen");

-- current_user(), logged_in_user(), current_database(), version()

-- SHA-1 hash: sha1(string/binary)
select sha1("allen");

-- SHA-2
select sha2("allen",224);
select sha2("allen",512);

-- crc32 checksum
select crc32("allen");

3.3 User-defined functions

UDF ordinary functions
Feature: one row in, one row out (one input row produces one output row)

UDAF aggregate functions
The A in UDAF stands for Aggregation.
Feature: many rows in, one row out (multiple input rows produce a single output row).
Typical examples are count and sum; the common built-in aggregate functions are listed below, and a short collect_set/collect_list sketch follows the list.

  • count: Count the total number of rows retrieved
  • sum: Sum
  • avg: average
  • min: minimum value
  • max: maximum value
  • Data collection function (deduplication): collect_set(col)
  • Data collection function (without deduplication): collect_list(col)
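
A small sketch of UDAF behaviour, reusing the student table from the earlier examples (the name column is an assumption here; adjust it to your schema):

-- many rows in, one row out per group
-- collect_list keeps duplicates, collect_set removes them (name column is an assumption)
select sex,
       collect_list(name) as name_list,
       collect_set(name)  as name_set
from student
group by sex;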

UDTF table-generating functions
The T in UDTF stands for Table-Generating.
Feature: one row in, many rows out (one input row produces multiple output rows).
Functions of this type return a table-like result, for example the explode function.

3.3.1 Developing a Hive UDF to encrypt mobile phone numbers

Requirements
1. Check whether the input is non-empty and whether it has the correct number of digits for a mobile phone number
2. Validate the mobile phone number format and mask (encrypt) the numbers that match the rules
3. Return data that does not match the phone number rules unchanged

UDF implementation steps
1. Write a Java class that extends UDF and overloads the evaluate method, which implements the function's business logic
2. Overloading means that multiple variants can be implemented in one Java class
3. Package the program as a jar and upload it to the HS2 server's local file system or to HDFS
4. On the client command line, add the jar to Hive's classpath: hive> add jar /xxx/udf.jar;
5. Register a temporary function (give the UDF a name): create temporary function <function name> as '<full class path of the UDF>';
6. Use the function in HQL

Development environment preparation
Create a Maven project in IDEA and add the following pom dependencies for developing Hive UDF

<dependencies>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>3.1.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.1.4</version>
    </dependency>
</dependencies>

Step 1: Write business code

import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EncryptPhoneNumber extends UDF {

    public String evaluate(String phoNum) {
        String encryPhoNum = null;
        // the phone number is not empty and has 11 digits
        if (StringUtils.isNotEmpty(phoNum) && phoNum.trim().length() == 11) {
            // check whether the data matches the mainland China mobile phone number format
            String regex = "^(1[3-9])\\d{9}$";
            Pattern p = Pattern.compile(regex);
            Matcher m = p.matcher(phoNum);
            // check whether it matches the phone number rule
            if (m.matches()) {
                // use regex replacement and return the masked data
                encryPhoNum = phoNum.trim().replaceAll("(\\d{3})\\d{4}(\\d{4})", "$1****$2");
            } else {
                // numbers that do not match the rule are returned unchanged
                encryPhoNum = phoNum;
            }
        } else {
            // input that is not 11 digits long is returned unchanged
            encryPhoNum = phoNum;
        }
        return encryPhoNum;
    }
}

Step 2: Package the jar
Package the project in IDEA with the integrated Maven packaging plugin; here the dependencies are bundled into the jar as well.

Step 3: Upload the jar to the HS2 server
Upload the jar to the local file system of the Linux machine where the HS2 service runs, or to HDFS, and note the path for later use.

Step 4: Add the jar to the Hive Classpath
Use the command on the client to add the jar package to the classpath

add jar <jar path>;

Step 5: Register a temporary function
This simply gives a name to the function the user has written.

create temporary function <function name> as '<full UDF class path>';
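
Step 6: use the function in HQL. A sketch of the full flow, assuming the jar was uploaded as /root/hive-udf.jar and the class above lives in the default package (adjust the path and class name to your project):

-- add the jar to the session classpath (hypothetical path)
add jar /root/hive-udf.jar;
-- register a temporary function named encrypt_phone for the UDF class
create temporary function encrypt_phone as 'EncryptPhoneNumber';
-- call it like any built-in function
select encrypt_phone('13900001111');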

4. Advanced Hive functions

4.1 The explode function of Hive UDTF

Features

  • explode accepts map or array data as input and turns each element of the input into its own row: one column for an array, two columns (key and value) for a map
  • The effect of explode is exactly one row in, many rows out, which is why it is called a UDTF function
  • In general, explode can be used directly on its own
  • It can also be used together with lateral view as business requirements dictate

Exercise: Analysis of the NBA championship team roster
1. Practice the use of the explode function
2. Understand what is called a UDTF table generation function
3. Discover the restrictions on the use of UDTF functions

Business requirements:
There is a file "The_NBA_Championship.txt" listing NBA championship teams for a number of years.
The first field is the team name, and the second field is the years in which the championship was won.

Data example:
Chicago Bulls,1991|1992|1993|1996|1997|1998

Requirement: create a Hive table to map the data correctly, then split the championship years apart

Create table and load data

-- step 1: create the table
create table the_nba_championship(
    team_name string,
    champion_year array<string>
) row format delimited
fields terminated by ','
collection items terminated by '|';

-- step 2: load the data file into the table
load data local inpath '/root/hivedata/The_NBA_Championship.txt' into table the_nba_championship;

-- step 3: verify
select * from the_nba_championship;

UDTF syntax restriction
explode is a UDTF table-generating function. The result of explode can be understood as a virtual table whose data comes from the source table.
Selecting only source-table columns is fine, and selecting only the virtual-table columns produced by explode is fine as well; the problem appears when a query tries to return both source-table columns and the columns generated by explode at the same time.

Solution to the UDTF syntax restriction
At the SQL level the problem can be solved with a join between the source table and the virtual table.
Hive also provides a dedicated syntax, lateral view, designed specifically to work with UDTF functions such as explode.
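
To make the restriction concrete, here is a sketch of the kind of query that fails: selecting a regular column together with explode in a plain SELECT is rejected by Hive (the exact error text varies by version), and this is exactly what lateral view solves.

-- this fails: a UDTF cannot be combined with other select expressions directly
-- select team_name, explode(champion_year) from the_nba_championship;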

-- step 4: use explode to split champion_year
select explode(champion_year) from the_nba_championship;

-- step 5: lateral view + explode
select a.team_name,b.year
from the_nba_championship a lateral view explode(champion_year) b as year
order by b.year desc;

4.2 Hive Lateral View

The concept
Lateral View is a special syntax, which is mainly used with UDTF type functions to solve some query restrictions of UDTF functions.
Generally, UDTF will be used with lateral view.

Principle
The result of the UDTF is treated as a virtual table (similar to a view), and each row of the original table is joined with the rows the UDTF produces for it, generating a new virtual table. This avoids the UDTF usage restriction.
When using lateral view, you can also assign column names to the records generated by the UDTF; the generated columns can be used in group by, order by, limit and so on, without wrapping the query in a separate subquery.

-- basic lateral view syntax
select ... from tableA lateral view UDTF(xxx) alias as col1,col2,...;

4.3 Hive aggregation functions

overview

  • An aggregate function performs a calculation over a set of values and returns a single value
  • Aggregate functions are a typical case of many rows in, one row out; in Hive's taxonomy they are UDAF-type functions
  • They are usually used with the group by clause, aggregating after grouping

Basic Aggregation
HQL provides several built-in UDAF aggregate functions such as max(…), min(…) and avg(…)
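
A minimal sketch of basic aggregation, reusing the student table and its sex column from the conditional-function examples above:

-- each group collapses to a single output row
select sex, count(*) as cnt from student group by sex;
-- without group by, the whole table is treated as one group and a single row is returned
select count(*) as total from student;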

enhanced aggregation

  • Enhanced aggregation covers the grouping sets, cube, and rollup functionality; it is mainly used in OLAP multidimensional analysis, where a dimension is an angle from which the data is examined
  • The following case illustrates what these functions mean. The fields in the data are: month, day, and the user identifier cookieid
Data example:
2018-03,2018-03-10,cookie1
-- create the table
create table cookie_info(
    month string,
    day string,
    cookieid string
) row format delimited
fields terminated by ',';
-- load the data

grouping sets
grouping sets is a convenient way of writing several group by clauses in a single SQL statement. It is equivalent to a union all of the group by result sets for the different dimensions. grouping__id indicates which grouping set a result row belongs to.

select month,day,count(distinct cookieid) as nums,grouping__id
from cookie_info
group by month,day
grouping sets (month,day)
order by grouping__id;

-- grouping__id indicates which grouping set this result row belongs to
-- based on the grouping conditions month, day in grouping sets: 1 stands for month, 2 stands for day

-- equivalent to
select month,null,count(distinct cookieid) as nums,1 as grouping__id
from cookie_info group by month
union all
select null as month,day,count(distinct cookieid) as nums,2 as grouping__id
from cookie_info group by day;

cube

  • cube aggregates over all combinations of the group by dimensions
  • For a cube with n dimensions, the total number of combinations is 2^n
-- cube
select month,day,count(distinct cookieid) as nums,grouping__id
from cookie_info
group by month,day
with cube
order by grouping__id;

-- equivalent to
select null,null,count(distinct cookieid) as nums,0 as grouping__id
from cookie_info
union all
select month,null,count(distinct cookieid) as nums,1 as grouping__id
from cookie_info
union all
select null,day,count(distinct cookieid) as nums,2 as grouping__id
from cookie_info
union all
select month,day,count(distinct cookieid) as nums,3 as grouping__id
from cookie_info;

rollup

  • rollup is a subset of cube: aggregation is performed hierarchically, starting from the leftmost group by dimension
  • For example, with three dimensions a, b, c, rollup produces the combinations: (a,b,c), (a,b), (a), ()
-- rollup
-- hierarchical aggregation starting from the month dimension
select month,day,count(distinct cookieid) as nums,grouping__id
from cookie_info
group by month,day
with rollup
order by grouping__id;

4.4 Hive Window Functions

overview

  • Window functions are also called windowing functions or OLAP functions. Their distinguishing feature is that the input values come from a "window" of one or more rows in the result set of a select statement
  • If a function has an OVER clause, it is a window function
  • A window function can be thought of as a calculation similar to an aggregate function, but a conventional group by aggregation collapses the aggregated rows and outputs one row per group, whereas a window function keeps every row after aggregation and can attach values computed over the window to each row of the result set
-- create the table and load data
create table employee(
    id int,
    name string,
    deg string,
    salary int,
    dept string
) row format delimited
fields terminated by ',';
-- load the data

-- sum + group by
select dept,sum(salary) as total from employee group by dept;
-- sum + window function aggregation
select id,name,deg,salary,dept,sum(salary) over(partition by dept) as total from employee;

Syntax rules

Function(arg1,...,argn) OVER ([PARTITION BY <...>] [ORDER BY <...>] [<window_expression>])
-- Function(arg1,...,argn) can be any of the following categories
    -- aggregate functions: e.g. sum, max, avg
    -- ranking functions: e.g. rank, row_number
    -- analytic functions: e.g. lead, lag, first_value

-- OVER [PARTITION BY <...>] is similar to group by and specifies the grouping; each group can be called a window
-- if there is no PARTITION BY, all rows of the table form a single group

-- [ORDER BY <...>] specifies how rows are sorted within each group; ASC and DESC are supported

-- [<window_expression>] specifies the range of rows the window operates on; the default is all rows in the window

4.4.1 Exercise: Analysis of Website User Page Views

In website analytics, cookies are often used to identify different users, and cookie data can be used to track the pages each user visits.
The user website-access data below is used to learn the window function syntax in Hive.

Practice environment

Two data files
1. Field meanings: cookieid, visit time, pv (number of page views)
cookie1,2018-04-10,5
2. Field meanings: cookieid, visit time, visited page url
cookie1,2018-04-10 10:00:02,url2

Create tables and load data

create table website_pv_info(
    cookieid string,
    createtime string,
    pv int
) row format delimited
fields terminated by ',';

create table website_url_info(
    cookieid string,
    createtime string,
    url string
) row format delimited
fields terminated by ',';

(1) Window aggregate functions
Window aggregate functions are simply aggregate functions such as sum, max, min, and avg used over a window.
The sum() function is used as the example here; the other aggregate functions are used in the same way.

-- 1. total pv per user: ordinary sum + group by aggregation
select cookieid,sum(pv) as total_pv from website_pv_info group by cookieid;

-- 2. sum + window function: four usages in total; note whether each is an overall or a cumulative aggregation
-- sum(...) over()                              sum over all rows of the table
-- sum(...) over(order by ...)                  running cumulative sum
-- sum(...) over(partition by ...)              sum over all rows in the same group
-- sum(...) over(partition by ... order by ...) running cumulative sum within each group
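
Putting the four usages side by side on website_pv_info (a sketch; the column names follow the table created above):

select cookieid, createtime, pv,
    sum(pv) over() as total_all,                                    -- sum over all rows of the table
    sum(pv) over(partition by cookieid) as total_by_cookie,         -- sum over all rows of the same cookieid
    sum(pv) over(order by createtime) as running_all,               -- running sum over the whole table
    sum(pv) over(partition by cookieid order by createtime) as running_by_cookie -- running sum within each cookieid
from website_pv_info;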

(2) Window expression

  • With the full sum(...) over(partition by ... order by ...) syntax, a cumulative aggregation is performed; the default cumulative behavior is to aggregate from the first row of the partition up to the current row
  • The window expression gives us control over the row range, for example from 2 rows before to 3 rows after the current row
  • The syntax is as follows
The keyword is rows between, with the following options:
preceding: before the current row
following: after the current row
current row: the current row
unbounded: the boundary of the partition
unbounded preceding: from the start of the partition
unbounded following: to the end of the partition
-- window expression
select cookieid,createtime,pv,
    sum(pv) over(partition by cookieid order by createtime) as pv1 -- default: from the first row to the current row
from website_pv_info;
    
-- from the first row to the current row
select cookieid,createtime,pv,
    sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as pv2
from website_pv_info;

-- from 3 rows before to the current row
select cookieid,createtime,pv,
    sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as pv3
from website_pv_info;

-- from 3 rows before to 1 row after
select cookieid,createtime,pv,
    sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv4
from website_pv_info;

-- from the current row to the last row
select cookieid,createtime,pv,
    sum(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as pv5
from website_pv_info;

-- from the first row to the last row, i.e. all rows in the group
select cookieid,createtime,pv,
    sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and unbounded following) as pv6
from website_pv_info;

(3) Window ranking functions: the row_number family
row_number: within each group, assigns each row a unique sequence number starting from 1, incrementing, ignoring ties
rank: within each group, assigns each row a sequence number starting from 1; tied rows receive the same number and the following positions are skipped
dense_rank: within each group, assigns each row a sequence number starting from 1; tied rows receive the same number and the following positions are not skipped

-- window ranking functions
select cookieid,createtime,pv,
    RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS m1,
    DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS m2,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS m3
from website_pv_info
where cookieid = 'cookie1';
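
A common application of row_number is a top-N query per group; for example (a sketch), the three highest-pv records for every cookieid:

-- keep the 3 highest-pv rows per cookieid
select cookieid, createtime, pv
from (
    select cookieid, createtime, pv,
           row_number() over(partition by cookieid order by pv desc) as rn
    from website_pv_info
) t
where rn <= 3;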

(4) Window ranking function: ntile

  • ntile divides the rows in each group into the specified number of buckets and assigns each row its bucket number
  • If the rows cannot be distributed evenly, buckets with smaller numbers are filled first, and the number of rows in any two buckets differs by at most 1
-- divide the rows in each group into 3 buckets
select cookieid,createtime,pv,
    NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS m2
from website_pv_info
ORDER BY cookieid,createtime;

(5) Window analytic functions

  • LAG(col,n,DEFAULT) returns the value of col from the nth row before the current row within the window; the first parameter is the column name, the second is n (optional, default 1), the third is the default value (used when the nth preceding row is null; if not specified, null is returned)
  • LEAD(col,n,DEFAULT) returns the value of col from the nth row after the current row within the window
  • FIRST_VALUE: the first value up to the current row after sorting within the group
  • LAST_VALUE: the last value up to the current row after sorting within the group
-- LAG
select cookieid,createtime,url,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
    LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
    LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
from website_url_info;

-- LEAD
select cookieid,createtime,url,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
    LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
    LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time
from website_url_info;

4.5 Sampling functions

overview

  • When the amount of data is very large, working with a subset of the data speeds up processing and analysis
  • This is sampling: a technique for identifying and analyzing a subset of the data in order to discover patterns and trends in the whole data set
  • In HQL, data can be sampled in three ways: random sampling, bucket table sampling, and block sampling

(1) Random sampling
Random sampling uses the rand() function to make sure data is picked at random, and LIMIT to restrict the number of rows returned. The advantage is randomness; the disadvantage is speed, especially when the table holds a lot of data.
1. DISTRIBUTE BY + SORT BY with rand() is recommended: it also distributes the data randomly between the mappers and reducers, so the underlying execution is efficient.
2. An ORDER BY rand() statement achieves the same result, but performance is poor, because ORDER BY is a global sort and forces a single reducer.

-- requirement: randomly sample 2 students
select * from student
distribute by rand() sort by rand() limit 2;

-- order by + rand() achieves the same effect, but is less efficient
select * from student
order by rand() limit 2;

(2) Block sampling (based on data blocks)

  • Block sampling allows randomly selecting n rows of data, a percentage of the data, or a specified size of data
  • The sampling granularity is the HDFS block size
  • The advantage is speed; the disadvantage is that it is not truly random
-- block sampling
-- sample by number of rows
select * from student TABLESAMPLE(1 ROWS);

-- sample by percentage of data size
select * from student TABLESAMPLE(50 PERCENT);

-- sample by data size
-- supported units: b/B, k/K, m/M, g/G
select * from student TABLESAMPLE(1k);

(3) Bucket table sampling (based on bucketed tables)
This is a special sampling method, optimized for bucketed tables. The advantage is that it is both random and fast.
The syntax is as follows

TABLESAMPLE(BUCKET x OUT OF y [ON colname])

-- 1. y must be a multiple or a factor of the table's total bucket count; Hive determines the sampling ratio based on y
    -- e.g. if the table has 4 buckets: when y=2, (4/2)=2 buckets of data are sampled; when y=8, (4/8)=1/2 of a bucket is sampled
-- 2. x indicates which bucket to start sampling from
    -- e.g. if the table has 4 buckets, tablesample(bucket 4 out of 4) samples (4/4=)1 bucket of data, namely the 4th bucket
    -- note: x must be less than or equal to y, otherwise an error is raised
-- 3. ON colname indicates what the sampling is based on
    -- ON rand() means random sampling
    -- ON <bucketing column> means sampling by the bucketing column; more efficient and recommended
-- bucket table sampling
-- sample based on the whole row (rand())
select * from t_usa_covid19_bucket TABLESAMPLE(BUCKET 1 OUT OF 2 ON rand());

-- sample by the bucketing column
describe formatted t_usa_covid19_bucket;
select * from t_usa_covid19_bucket TABLESAMPLE(BUCKET 1 OUT OF 2 ON state);

Origin blog.csdn.net/hutc_Alan/article/details/131481170