Introduction
Hello, friends. As the saying goes, to do good work you must first sharpen your tools. Hive is one of our main tools for working with big data, so being fluent in its built-in functions makes later data testing much easier.
Let's skip the small talk and get straight to the substance.
The contents are as follows:
- Data preparation
- Character functions
- Aggregate functions
- Mathematical functions
- Time functions
- Window functions
- Conditional functions
1 Data preparation
First, we create a table that records page-view traffic (pv) per user and site. You can create it directly in MySQL:
/*
SQLyog Ultimate v12.09 (64 bit)
MySQL - 5.7.16-log : Database -
*********************************************************************
*/
/*!40101 SET NAMES utf8 */;
/*!40101 SET SQL_MODE=''*/;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/`user_database` /*!40100 DEFAULT CHARACTER SET utf8 */;
USE `user_database`;
/*Table structure for table `user_view` */
DROP TABLE IF EXISTS `user_view`;
CREATE TABLE `user_view` (
`site_id` char(4) DEFAULT NULL,
`user_name` char(11) DEFAULT NULL,
`pv` int(4) DEFAULT NULL,
`dt` char(16) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
/*Data for the table `user_view` */
insert into `user_view`(`site_id`,`user_name`,`pv`,`dt`) values
('A10','Sone',2,'20200801'),
('A10','welsh',3,'20200801'),
('A10','Sone',16,'20200801'),
('A10','Albert',20,'20200802'),
('A10','GG',32,'20200801'),
('A20','Albert',42,'20200801'),
('A20','welsh',10,'20200801'),
('A20','welsh',15,'20200802'),
('A10','Albert',20,'20200801'),
('A20','Sone',NULL,'20200802'),
('A20','welsh',15,'20200802'),
('A20','Albert',10,'20200802'),
('A10','Jojo',16,'20200802'),
('A20','welsh',35,'20200803'),
('A10','welsh',33,'20200803'),
('A20','Sone',66,'20200803'),
('A20','Jojo',15,'20200802'),
('A10','Albert',53,'20200803'),
('A10','Jojo',12,'20200803'),
('A20','GG',35,'20200803'),
('A20','J.K',30,'20200803');
/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;
2 Character functions
Description: concatenating strings, extracting substrings, trimming whitespace, padding, and splitting
Enumeration: concat, concat_ws, substring, trim, repeat, lpad, rpad, split, find_in_set
2.1 concat
Description: concatenates strings
SELECT CONCAT(user_name,dt) FROM user_view
# Output:
"welsh20200801"
"Albert20200801"
...
2.2 concat_ws
Description: concatenates strings with a separator between them
SELECT CONCAT_WS(':',user_name,dt) FROM user_view
# Output:
"welsh:20200801"
"Albert:20200801"
...
2.3 substring
Description: extracts part of a string; start is 1-based
Usage: substring(string str, int start, int len)
SELECT SUBSTRING(user_name,1,3) FROM user_view
# Output:
"wel"
"Alb"
...
2.4 trim
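Description: removes the spaces on both sides of a string
Usage: trim(string str)
select trim(' welsh ')
# Output:
"welsh"
2.5 repeat
Description: repeats a string n times
Usage: repeat(string str, int n)
select repeat('welsh',2)
# Output: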
"welshwelsh"
2.6 lpad
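Description: left-padding function; pads str on the left with pad until the result is len characters long
Usage: lpad(string str, int len, string pad)
select lpad('welsh',10, 'ddd')
# Output:
"dddddwelsh"
2.7 rpad
Description: right-padding function; pads str on the right with pad until the result is len characters long
Usage: rpad(string str, int len, string pad)
select rpad('welsh',10, 'ddd')
# Output: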
"welshddddd"
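2.8 split
Description: splits a string around matches of pat (treated as a regular expression) and returns an array
Usage: split(string str, string pat)
select split('welshUAlbertUGG','U')
# Output:
["welsh","Albert","GG"]
2.9 find_in_set
Description: search function; returns the 1-based position of the first occurrence of str in the comma-separated strList, or 0 if it is not found
Usage: find_in_set(string str, string strList)
select find_in_set('welsh','Albert,and,welsh,go,to,Swimming')
# Output: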
3
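To make the padding and lookup rules above concrete, here is a minimal Python sketch that emulates lpad, rpad and find_in_set. It is not Hive code: the helper names simply mirror the Hive functions, and the behavior (truncating when str is longer than len, returning 0 when the item is missing or contains a comma) is written from my reading of the Hive manual, so treat it as an approximation.

```python
def lpad(s: str, length: int, pad: str) -> str:
    """Emulate lpad: left-pad s with pad (repeated as needed) to exactly `length` chars."""
    if length <= len(s):
        return s[:length]  # a longer input is shortened, not padded
    fill = (pad * length)[: length - len(s)]
    return fill + s


def rpad(s: str, length: int, pad: str) -> str:
    """Emulate rpad: right-pad s with pad (repeated as needed) to exactly `length` chars."""
    if length <= len(s):
        return s[:length]
    fill = (pad * length)[: length - len(s)]
    return s + fill


def find_in_set(s: str, str_list: str) -> int:
    """Emulate find_in_set: 1-based position of s in a comma-separated list, 0 if absent."""
    if "," in s:
        return 0  # an item containing a comma can never match
    items = str_list.split(",")
    return items.index(s) + 1 if s in items else 0


print(lpad("welsh", 10, "ddd"))
print(rpad("welsh", 10, "ddd"))
print(find_in_set("welsh", "Albert,and,welsh,go,to,Swimming"))
```

Note that the pad string is cycled and cut to fit, which is why a three-character pad can still fill five positions.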
3 Aggregate functions
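Description: counting, summing, averaging, and finding the maximum and minimum of the data
Enumeration: count, sum, avg, min, max, collect_list, collect_set
3.1 count
Description: counts rows; with distinct, counts after de-duplication
Usage: count(*), count(col), count(distinct col)
# count(*) counts all rows, including rows that contain NULL values
select count(*) from user_view
# count(col) counts only the rows where col is not NULL
select count(pv) from user_view
# count(distinct col) counts the distinct non-NULL values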
select count(distinct user_name) from user_view
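The difference between the three count forms is easiest to see on the sample data itself. The following Python sketch (an emulation, not Hive) copies the pv and user_name columns from the insert statement in section 1 and mimics each query; the variable names are my own.

```python
# pv and user_name columns of user_view, in insert order (one NULL pv).
pv = [2, 3, 16, 20, 32, 42, 10, 15, 20, None, 15,
      10, 16, 35, 33, 66, 15, 53, 12, 35, 30]
user_name = ["Sone", "welsh", "Sone", "Albert", "GG", "Albert", "welsh",
             "welsh", "Albert", "Sone", "welsh", "Albert", "Jojo", "welsh",
             "welsh", "Sone", "Jojo", "Albert", "Jojo", "GG", "J.K"]

count_star = len(pv)                                 # count(*): every row
count_pv = sum(1 for v in pv if v is not None)       # count(pv): NULLs skipped
count_distinct_users = len(set(user_name))           # count(distinct user_name)

print(count_star, count_pv, count_distinct_users)
```

The single NULL pv (Sone on 20200802) is the only reason count(*) and count(pv) differ here.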
3.2 sum
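Description: sums values; with distinct, sums after de-duplication
Usage: sum(col), sum(distinct col)
# sum: total of all values
select SUM(pv) FROM user_view
# sum(distinct col): total of the distinct values
SELECT SUM(DISTINCT pv) FROM user_view
3.3 avg, min, max
Description: average value; with distinct, averages after de-duplication. min and max return the smallest and largest value
Usage: avg(col), avg(distinct col), min(col), max(col)
# avg: average value
SELECT avg(pv) FROM user_view
# avg(distinct pv): average of the distinct values
SELECT avg(distinct pv) FROM user_view
# min: minimum value
SELECT min(pv) FROM user_view
# max: maximum value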
SELECT max(pv) FROM user_view
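As a rough illustration of how NULLs and distinct interact with these aggregates, here is a small Python emulation. The five-value list is a made-up example, not the user_view data, and the code assumes the usual SQL rule that aggregates skip NULL values.

```python
# Hypothetical pv column with one duplicate and one NULL.
pv = [10, 15, 15, None, 35]

values = [v for v in pv if v is not None]   # aggregates ignore NULLs
distinct_values = set(values)               # distinct removes duplicates first

sum_pv = sum(values)                                         # sum(pv)
sum_distinct_pv = sum(distinct_values)                       # sum(distinct pv)
avg_pv = sum(values) / len(values)                           # avg(pv)
avg_distinct_pv = sum(distinct_values) / len(distinct_values)  # avg(distinct pv)
min_pv, max_pv = min(values), max(values)                    # min(pv), max(pv)

print(sum_pv, sum_distinct_pv, avg_pv, avg_distinct_pv, min_pv, max_pv)
```

The NULL row lowers neither the sum nor the average, because it is left out of both the numerator and the denominator.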
3.4 collect_list
Description: assembles the values of a column into an array, without de-duplication
Usage: collect_list(col)
select collect_list(user_name) from dmall_gaea_analysis.user_view;
# Output:
["Sone","welsh","Sone","Albert","GG","Albert","welsh","welsh","Albert","Sone","welsh","Albert","Jojo","welsh","welsh","Sone","Jojo","Albert","Jojo","GG","J.K"]
3.5 collect_set
Description: assembles the values of a column into an array, with duplicates removed
Usage: collect_set(col)
select collect_set(user_name) from dmall_gaea_analysis.user_view;
# Output:
["Sone","welsh","Albert","GG","Jojo","J.K"]
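The two collect functions differ only in whether duplicates survive. A minimal Python emulation on the user_name column looks like this; keeping first-seen order in the set version is my own choice to reproduce the output shown above, and as far as I know Hive does not promise any particular element order.

```python
# user_name column of user_view, in insert order.
user_name = ["Sone", "welsh", "Sone", "Albert", "GG", "Albert", "welsh",
             "welsh", "Albert", "Sone", "welsh", "Albert", "Jojo", "welsh",
             "welsh", "Sone", "Jojo", "Albert", "Jojo", "GG", "J.K"]

# collect_list: keep every non-NULL value, duplicates included.
collected_list = [name for name in user_name if name is not None]

# collect_set: drop duplicates; dict.fromkeys keeps first-seen order.
collected_set = list(dict.fromkeys(collected_list))

print(len(collected_list))
print(collected_set)
```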
4 Mathematical functions
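Description: computes the variance, population standard deviation, and sample standard deviation of the data
Enumeration: variance, stddev_pop, stddev_samp
# variance: population variance
SELECT variance(pv) FROM user_view
# stddev_pop: population standard deviation
SELECT stddev_pop(pv) FROM user_view
# stddev_samp: sample standard deviation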
SELECT stddev_samp(pv) FROM user_view
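If the difference between the population and sample versions is unclear, the Python standard library has direct counterparts. The sketch below uses a small made-up list rather than the pv column, and the mapping (variance to pvariance, stddev_pop to pstdev, stddev_samp to stdev) is my understanding of the Hive definitions.

```python
import math
import statistics

# Hypothetical sample: mean is 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = statistics.pvariance(data)   # like variance(col): divide by n
stddev_pop = statistics.pstdev(data)    # like stddev_pop(col): sqrt of the above
stddev_samp = statistics.stdev(data)    # like stddev_samp(col): divide by n - 1

print(variance, stddev_pop, stddev_samp)
```

The sample version is always slightly larger because it divides by n - 1 instead of n.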
5 Time functions
Description: getting the current time, formatting, the difference between two dates, adding days, and subtracting days
Enumeration: unix_timestamp, FROM_UNIXTIME, to_date, weekofyear, datediff, date_add, date_sub
5.1 unix_timestamp
Description: returns the current Unix timestamp in seconds
Usage: unix_timestamp()
SELECT unix_timestamp()
# Output:
1600226901
5.2 FROM_UNIXTIME
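Description: formats a Unix timestamp; often used together with unix_timestamp() to get the current time
Usage: FROM_UNIXTIME(bigint unixtime, string format)
SELECT FROM_UNIXTIME(unix_timestamp(),'yyyyMMdd')
# Output:
20200916
5.3 to_date
Description: returns the date part of a timestamp string
Usage: to_date(string timestamp)
SELECT to_date('2020-09-10 10:03:01') as now_time
# Output: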
2020-09-10
5.4 weekofyear
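Description: returns the week number of the year for the given date
Usage: weekofyear(string date)
SELECT weekofyear('2020-09-08 10:03:01') as now_time
# Output:
37
5.5 datediff
Description: number of days between two dates (end date minus start date)
Usage: datediff(string enddate, string startdate)
select datediff('2020-09-09','2020-08-08')
# Output: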
32
5.6 date_add
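Description: adds N days to a date
Usage: date_add(string startdate, int days)
select date_add('2020-09-08',10) as date_time
# Output:
2020-09-18
5.7 date_sub
Description: subtracts N days from a date
Usage: date_sub(string startdate, int days)
select date_sub('2020-09-08',10) as date_time
# Output: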
2020-08-29
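The date arithmetic above can be double-checked outside Hive. Here is a minimal Python sketch with the standard datetime module; the helper names mirror the Hive functions but are my own, and weekofyear is assumed to follow ISO week numbering.

```python
from datetime import date, timedelta


def datediff(end: str, start: str) -> int:
    """Days from start to end, like datediff(enddate, startdate)."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days


def date_add(start: str, days: int) -> str:
    """Date that lies `days` days after start."""
    return (date.fromisoformat(start) + timedelta(days=days)).isoformat()


def date_sub(start: str, days: int) -> str:
    """Date that lies `days` days before start."""
    return (date.fromisoformat(start) - timedelta(days=days)).isoformat()


def weekofyear(ts: str) -> int:
    """ISO week number of the date part of a 'yyyy-MM-dd HH:mm:ss' string."""
    return date.fromisoformat(ts[:10]).isocalendar()[1]


print(datediff("2020-09-09", "2020-08-08"))
print(date_add("2020-09-08", 10))
print(date_sub("2020-09-08", 10))
print(weekofyear("2020-09-08 10:03:01"))
```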
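6 Window functions
Description: Often used to rank existing data
Enumeration: row_number, RANK, DENSE_RANK
row_number(): after partitioning, numbers the rows starting from 1; rows with equal values still get different consecutive numbers, and the order among them is not guaranteed (in practice it usually follows the order in which the rows are read)
RANK(): after partitioning, ranks the rows starting from 1; rows with equal values share a rank and leave a gap in the ranking after them
DENSE_RANK(): after partitioning, ranks the rows starting from 1; rows with equal values share a rank and leave no gap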
select
user_name,pv,
row_number() over (partition by site_id,dt order by pv desc) as ord_1,
RANK() over (partition by site_id,dt order by pv desc) as ord_2,
DENSE_RANK() over (partition by site_id,dt order by pv desc) as ord_3
from dmall_gaea_analysis.user_view where dt='20200803' and site_id='A20'
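# Output:
user_name  pv  ord_1  ord_2  ord_3
Sone       66  1      1      1
welsh      35  2      2      2
GG         35  3      2      2
J.K        30  4      4      3
Conclusion: because welsh and GG have the same pv, the three functions rank the rows as follows
row_number() ranking: 1, 2, 3, 4
RANK() ranking: 1, 2, 2, 4
DENSE_RANK() ranking: 1, 2, 2, 3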
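The three numbering rules can also be traced by hand. This Python sketch (an emulation, not Hive) walks over the four A20 rows for 20200803 in descending pv order and assigns all three numbers. Placing welsh before GG relies on Python's stable sort, whereas Hive makes no such promise for tied rows.

```python
# The rows of partition (site_id='A20', dt='20200803'), in insert order.
rows = [("welsh", 35), ("Sone", 66), ("GG", 35), ("J.K", 30)]

# order by pv desc; the sort is stable, so tied rows keep their input order.
ordered = sorted(rows, key=lambda r: r[1], reverse=True)

result = []
rank = dense_rank = 0
prev_pv = None
for row_number, (name, pv) in enumerate(ordered, start=1):
    if pv != prev_pv:
        rank = row_number   # RANK jumps to the row position, leaving gaps
        dense_rank += 1     # DENSE_RANK advances by exactly one
        prev_pv = pv
    result.append((name, pv, row_number, rank, dense_rank))

for line in result:
    print(line)
```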
7 Conditional functions
Description: often used to handle NULL values
Enumeration: case
select
user_name,
case when pv is null then 0 else pv end as pv
from dmall_gaea_analysis.user_view where pv is null
# Output:
user_name pv
Sone 0