An article to get the hive function, it is recommended to collect!

Introduction

Hello, friends, as the saying goes: if you want to do good things, you must first sharpen your tools. Hive is a tool for us to deal with big data, so we need to use some functions of hive proficiently, so that the subsequent data testing will be handy.

Don't talk nonsense, directly serve dry goods

The directory is as follows

  1. data preparation

  2. Character function

  3. Aggregate function

  4. Mathematical function

  5. Time function

  6. Window function

  7. Condition function

1 Data preparation

First, we create a SQL table to collect user traffic pages, students can directly create it under mysql:

/*
SQLyog Ultimate v12.09 (64 bit)
MySQL - 5.7.16-log : Database -
*********************************************************************
*/

/*!40101 SET NAMES utf8 */;

/*!40101 SET SQL_MODE=''*/;

/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/`user_database` /*!40100 DEFAULT CHARACTER SET utf8 */;

USE `user_database`;

/*Table structure for table `user_view` */

DROP TABLE IF EXISTS `user_view`;

CREATE TABLE `user_view` (
  `site_id` char(4) DEFAULT NULL,
  `user_name` char(11) DEFAULT NULL,
  `pv` int(4) DEFAULT NULL,
  `dt` char(16) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

/*Data for the table `user_view` */

insert  into `user_view`(`site_id`,`user_name`,`pv`,`dt`) values ('A10','Sone',2,'20200801'),('A10','welsh',3,'20200801'),('A10','Sone',16,'20200801'),('A10','Albert',20,'20200802'),('A10','GG',32,'  20200801'),('A20','Albert',42,'  20200801'),('A20','welsh',10,'20200801'),('A20','welsh',15,'20200802'),('A10','Albert',20,'20200801'),('A20','Sone',NULL,'20200802'),('A20','welsh',15,'20200802'),('A20','Albert',10,'20200802'),('A10','Jojo',16,'20200802'),('A20','welsh',35,'20200803'),('A10','welsh',33,'20200803'),('A20','Sone',66,'20200803'),('A20','Jojo',15,'20200802'),('A10','Albert',53,'20200803'),('A10','Jojo',12,'20200803'),('A20','GG',35,'20200803'),('A20','J.K',30,'20200803');

/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;

Preview:
An article to get the hive function, it is recommended to collect!

2 Character functions

Description: splicing, intercepting and removing spaces

枚举:concat、concat_ws、substring、trim、lpad、rpad、split、find_in_set

2.1 concat
说明:拼接字符

SELECT CONCAT(user_name,dt) FROM user_view

# 输出:
"welsh20200801"
"Albert20200801"
...
2.2 concat_ws
说明:拼接字符且分割

SELECT CONCAT_WS(':',user_name,dt) FROM user_view

# 输出:
"welsh:20200801"
"Albert:20200801"
...
2.3 substring
说明:截取字符串

用法:subString(col, intstart, int len)

SELECT CONCAT_WS(':',user_name,dt) FROM user_view

# 输出:
"welsh:20200801"
"Albert:20200801"
...
2.4 trim
说明:去掉两边的空格

用法:trim(col)

select trim('    welsh     ') 
# 输出:
"welsh"
2.5 repeat
说明:复制函数

用法:repeat(string str, int n)

select repeat('welsh',2)
# 输出:
"welshwelsh"
2.6 lpad
说明:填充函数,默认从左开始补充

用法:lpad(string str, int len, string pad)

select lpad('welsh',10, 'ddd')
# 输出:
"dddddwelsh"
2.7 rpad
说明:右补充函数,默认从右开始补充

用法:rpad(string str, int len, string pad)

select rpad('welsh',10, 'ddd')
# 输出:
"welshddddd"
2.8 split:
说明:分割函数,返回list

用法:split(string str, stringpat)

select split('welshUAlbertUGG','U')
# 输出:
["welsh","Albert","GG"]
2.9 find_in_set:
说明:查找函数,返回首次出现该字符位置

用法:find_in_set(string str, string strList)

select find_in_set('welsh','Albert,and,welsh,go,to,Swimming')
# 输出:
3

3 Aggregate functions

说明:对数据汇总、相加、平均、最大值、最小值

枚举:count、sum、avg、min、max、collect_list、collect_set
3.1 count
说明:汇总,若使用distinct则是去重后再汇总

用法:count(*),count(distint col)

# count统计包含null值总数
select count(*) from user_view

# count 不含null值总数
select count(pv) from user_view

# count(distinct col)统计去重总数
select count(distinct user_name) from user_view
3.2 sum
说明:相加,若使用distinct则是去重后再汇总

用法:sum(*),sum(distint col)

# sum 统计总值
select SUM(pv) FROM user_view

# sum 统计去重后总值
SELECT SUM(DISTINCT pv) FROM user_view
3.3 avg
说明:平均值,若使用distinct则是去重后在求平均值

用法:avg(*),avg(distint col)

# avg平均值
SELECT avg(pv) FROM user_view

# avg(distinct pv)去重后平均值
SELECT avg(distinct pv) FROM user_view

# min最小值
SELECT min(pv) FROM user_view

# max最大值
SELECT max(pv) FROM user_view
3.4 collect_list
说明:将字段组装成一个list,没有去重

用法:collect_list(col)

select collect_list(user_name) from dmall_gaea_analysis.user_view;
# 输出:
["Sone","welsh","Sone","Albert","GG","Albert","welsh","welsh","Albert","Sone","welsh","Albert","Jojo","welsh","welsh","Sone","Jojo","Albert","Jojo","GG","J.K"]
3.4 collect_set
说明:将字段组装成一个list,去重

用法:collect_set(col)

select collect_set(user_name) from dmall_gaea_analysis.user_view;

# 输出:
["Sone","welsh","Albert","GG","Jojo","J.K"]

4 Mathematical functions

说明:对数据球方差、标准偏差、样本标准层

枚举:variance、stddev_pop、stddev_samp

  # variance方差
  SELECT variance(pv) FROM user_view

  # stddev_pop标准偏差
  SELECT stddev_pop(pv) FROM user_view

  # stddev_samp样本标准偏差
  SELECT stddev_samp(pv) FROM user_view

5 Time function

Description: Time acquisition, formatting, 2 time difference, time increase, time decrease

枚举:unix_timestamp、FROM_UNIXTIME、to_date、weekofyear、weekofyear、datediff、date_add、date_sub

5.1 unix_timestamp

说明:获取当前时间戳

用法:unix_timestamp()

SELECT unix_timestamp()

# 输出:
1600226901

5.2 FROM_UNIXTIME

说明:格式化时间戳,通常与unix_timestamp()一起用,获取当前时间

用法:FROM_UNIXTIME()

SELECT FROM_UNIXTIME(unix_timestamp(),'yyyyMMdd') 

# 输出:
20200916

5.3 to_date

说明:格式化时间

用法:to_date()

SELECT to_date('2020-09-10 10:03:01') as now_time

# 输出:
2020-09-10

5.4 weekofyear

说明:返回当前周

用法:weekofyear()

SELECT weekofyear('2020-09-08 10:03:01') as now_time

# 输出:
37

5.5 datediff

说明:日期相差天数

用法:datediff()

select datediff('2020-09-09','2020-08-08')

# 输出:
32

5.6 date_add

说明:日期增加

用法:date_add()

select date_add('2020-09-08',10) as date_time

# 输出:
2020-09-18

5.7 date_sub

说明:日期减少N天

用法:date_sub()

select date_sub('2020-09-08',10) as date_time
# 输出:
2020-08-29

6 Window function

Description: Often used to rank existing data

Enumeration: row_number, RANK, DENSE_RANK

row_number():  分组后,从1开始排名,遇到相同值按照表中记录的顺序进行排列

RANK():分组后,从1开始排名,遇到相同值会在名次中留下空位

DENSE_RANK():分组后,从1开始排名,遇到相同值不会留下空位

select 
    user_name,pv,
    row_number() over (partition by site_id,dt order by pv desc) as ord_1,
    RANK() over (partition by site_id,dt order by pv desc) as ord_2,
    DENSE_RANK() over (partition by site_id,dt order by pv desc) as ord_3
from dmall_gaea_analysis.user_view where dt='20200803' and site_id='A20'
输出:

user_name   pv  ord_1   ord_2   ord_3
Sone    66  1   1   1
welsh   35  2   2   2
GG  35  3   2   2
J.K 30  4   4   3
结论:由于welsh 与 GG 的pv值一样,所以根据规则排名如下

row_number() 排名:1234

RANK() 排名:1224

DENSE_RANK() 排名:1223

7 Conditional function

说明:常用于对null进行处理

枚举:case

select 
    user_name,
    case when pv is null then 0 else pv end as pv
from dmall_gaea_analysis.user_view where pv is null

# 输出:
user_name       pv
Sone            0  

Follow my WeChat public account [Data Ape Wen Da]

Get hive official authoritative manual

An article to get the hive function, it is recommended to collect!

Guess you like

Origin blog.51cto.com/14974545/2543115