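1. Background of Hive

1) Writing MapReduce programs by hand is tedious.

2) People with a traditional RDBMS background want to keep working in SQL.

Hive was originally developed at Facebook and later open-sourced:

1) It was built to compute statistics over massive volumes of structured log data.

2) It is a data warehouse built on top of Hadoop.

3) It provides a SQL-like query language: HQL.

4) It supports several execution engines underneath [MR/Tez/Spark]. The default is MR in both 1.x and 2.x (2.x marks MR as deprecated); Tez or Spark can be selected through the hive.execution.engine setting.

5) It provides unified metadata management. Hive data is stored in HDFS, and HDFS has no schema [the files are just plain files, so the meaning of each field cannot be determined from them]. The metadata is usually stored in MySQL.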

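2. Hive architecture

The top layer consists of clients: the shell and Thrift/JDBC (server/jdbc), plus web UIs such as HUE and Zeppelin.

Metastore (metadata): in production it is usually stored in MySQL.

For a database: name, location, owner, and so on.

For a table: name, location, owner, column name/type, ...

An HQL statement goes through the Driver: the SQL Parser builds a syntax tree, the Query Optimizer picks the best execution plan, and a physical plan is generated. Serialization/deserialization and UDFs (user-defined functions) take part along the way; Hive ships many built-in functions, but when they are not enough for real work, users can define their own. Finally the plan is executed (Execution), which means it is converted into MapReduce jobs [in short, Hive turns HQL into MapReduce]. The jobs are submitted to the Hadoop cluster and read and write data in HDFS/HBase.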

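3. Hive deployment architecture

4. A brief look at installing and configuring Hive

(1) Download the Hive package from the official site. Hive-1.2.1 is used here [Hive 1.x runs on MapReduce by default; Hive 2.x still defaults to MapReduce but marks it deprecated in favor of Tez or Spark].

(2) Upload Hive-1.2.1 to the server and go into the /hive-1.2.1/conf directory. It contains a hive-default.xml.template file that documents Hive's default configuration.

(3) The Hive metastore is stored in MySQL, so the Hive server has to be told which MySQL server to connect to. Create a hive-site.xml under hive-1.2.1/conf with the MySQL connection settings; values in this file override the corresponding defaults listed in hive-default.xml.template.

hive-site.xml: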

<configuration>
  <property>
	<name>javax.jdo.option.ConnectionURL</name>
	<value>jdbc:mysql://hdp-03:3306/hive?createDatabaseIfNotExist=true</value>
	<description>JDBC connect string for a JDBC metastore</description>
  </property>

  <property>
	<name>javax.jdo.option.ConnectionDriverName</name>
	<value>com.mysql.jdbc.Driver</value>
	<description>Driver class name for a JDBC metastore</description>
  </property>

  <property>
	<name>javax.jdo.option.ConnectionUserName</name>
	<value>root</value>
	<description>username to use against metastore database</description>
  </property>

  <property>
	<name>javax.jdo.option.ConnectionPassword</name>
	<value>root</value>
	<description>password to use against metastore database</description>
  </property>
</configuration>

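(4) Hive does not ship with the MySQL driver, so upload mysql-connector-java-5.1.39.jar to hive-1.2.1/lib.

(5) Add HADOOP_HOME and HIVE_HOME to the environment variables: vi /etc/profile, then source /etc/profile.

(6) To test the installation, simply type hive; the hive> interactive prompt should appear.

A few basic settings make Hive more convenient to use, for example:

Show the current database in the prompt:

hive>set hive.cli.print.current.db=true;

Show column names in query results:

hive>set hive.cli.print.header=true;

These settings only apply to the current session and are lost once the Hive session is restarted. To make them permanent:

In the current user's home directory on Linux [/root for the root user], create a .hiverc file and put the settings in it:

vi .hiverc

set hive.cli.print.header=true;
set hive.cli.print.current.db=true;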
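5. Starting the Hive service and connecting clients

5.1 Starting the Hive service

Start the Hive service:

[root@hdp-02 hive-1.2.1]# bin/hiveserver2 -hiveconf hive.root.logger=DEBUG,console

This starts the service in the foreground. To start it in the background instead:

[root@hdp-02 hive-1.2.1]# nohup bin/hiveserver2 1>/dev/null 2>&1 &

Meaning: run bin/hiveserver2 in the background without hanging up, send standard output to /dev/null (the Linux "black hole", i.e. discard it), and redirect standard error to standard output.

Background:

nohup is short for "no hang up".

nohup: if you are running a process and want it to keep running after you log out, start it with nohup. The process then continues after you log out or close the terminal.

&: run in the background.

Example:

nohup command > myout.file 2>&1 &

The trailing & in this example runs the command in the background. The file descriptors are 0 = stdin (standard input), 1 = stdout (standard output), 2 = stderr (standard error).

2>&1 redirects standard error (2) to wherever standard output (1) currently points. Since standard output has already been redirected to myout.file, both streams end up in that file.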
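The redirection described above can be reproduced outside the shell. The following is a minimal Python sketch, not part of the Hive setup itself: the child command and the helper name run_with_merged_output are made up for illustration, and it only shows that sending stdout to a file and stderr to stdout leaves both streams in the same file.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_merged_output(log_path):
    """Run a child process with stdout -> log_path and stderr -> stdout."""
    # Hypothetical child: writes one line to each stream.
    child_code = "import sys; print('to stdout'); print('to stderr', file=sys.stderr)"
    with open(log_path, "w") as log:
        # stdout=log mirrors `> myout.file`; stderr=STDOUT mirrors `2>&1`.
        subprocess.run(
            [sys.executable, "-c", child_code],
            stdout=log,
            stderr=subprocess.STDOUT,
            check=True,
        )
    return Path(log_path).read_text()


with tempfile.TemporaryDirectory() as tmp:
    captured = run_with_merged_output(Path(tmp) / "myout.file")
```

Both lines end up in captured, which is what the nohup example achieves for hiveserver2's log output.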

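5.2 Connecting with a Hive client

Once the service is up, you can connect to it from another node with beeline.

Option 1:

[root@hdp-02 hive-1.2.1]# bin/beeline

Press Enter to get the beeline prompt, then connect to hiveserver2:

beeline> !connect jdbc:hive2://hdp-02:10000

(hdp-02 is the host on which hiveserver2 is running; the default port is 10000)

Option 2:

Connect directly at startup:

bin/beeline -u jdbc:hive2://hdp-02:10000 -n root

After that you can run ordinary SQL queries.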

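6. Running Hive from scripts [production]

Typing a large number of Hive queries into the interactive shell is extremely inefficient, so production setups mostly run Hive from scripts.

The key point is that hive can execute a given HQL statement as a one-off command:

[root@hdp-02 ~]#  hive -e "insert into table t_dest select * from t_src;"

Going one step further, such commands can be put into a shell script, which makes it easy to run, control and schedule many Hive jobs. For example:

Write the shell script: vi t_order_etl.sh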

#!/bin/bash
hive -e "select * from db_order.t_order"
hive -e "select * from default.t_user"
hql="create table  default.t_bash as select * from db_order.t_order"
hive -e "$hql"

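Make the script executable and run it, for example: chmod +x t_order_etl.sh && ./t_order_etl.sh

[Common practice]

If the HQL to run is particularly complex, put it in a file: vi x.hql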

select * from db_order.t_order;
select count(1) from db_order.t_user;

Then run it with hive -f /root/x.hql

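7. Creating databases and tables, and importing data

7.1 Creating a database

Hive has a default database:

Database name: default

Database directory: hdfs://hdp-02:9000/user/hive/warehouse

Create a new database:

create database db_order;

Once the database is created, a database directory appears in HDFS:

hdfs://hdp-02:9000/user/hive/warehouse/db_order.db

7.2 Creating tables

7.2.1 Basic CREATE TABLE statement

use db_order;

create table t_order(id string,create_time string,amount float,uid string);

Once the table is created, a table directory appears under its database directory:

hdfs://hdp-02:9000/user/hive/warehouse/db_order.db/t_order

However, with this statement Hive assumes that the fields in the table's data files are separated by ^A (the \001 character, typed as Ctrl+V followed by Ctrl+A).

To specify the delimiter explicitly, create the table like this:

create table t_order(id string,create_time string,amount float,uid string)

row format delimited

fields terminated by ',';

This declares that the fields in the table's data files are separated by ",".

7.2.2 Dropping a table

drop table t_order;

Dropping a table has two effects:

Hive removes the table's information from the metastore database;

Hive also deletes the table's directory from HDFS.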

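7.2.3 Managed (internal) tables and external tables

Managed table (MANAGED_TABLE): the table directory follows Hive's conventions and lives under the warehouse directory /user/hive/warehouse.

External table (EXTERNAL_TABLE): the table directory is chosen by whoever creates the table [for example, collected logs in /log/2019-04-09]. To map those files to a Hive table there is no need to move them into /user/hive/warehouse, which is more convenient and avoids disturbing external programs that depend on the files' location.

create external table t_access(ip string,url string,access_time string)

row format delimited

fields terminated by ','

location '/log/2019-04-09';

Differences between external and managed tables:

A managed table's directory is inside the Hive warehouse directory; an external table's directory is specified by the user.
Dropping a managed table: Hive removes the metadata and deletes the table's data directory.
Dropping an external table: Hive only removes the metadata.

The lowest-level tables of a Hive data warehouse always come from external systems. To avoid interfering with how those systems work, create external tables in Hive that map the data directories they produce; the tables produced by later ETL steps are better created as managed tables.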

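7.2.4 Partitioned tables

In essence, a partitioned table has partition subdirectories under the table directory for its data files, so that at query time the MR job can process only the data in the relevant subdirectories and read less data.

Take a website's daily page-view records. They belong in one table, but sometimes we only need to analyze the records of a single day.

In that case, make the table partitioned and load each day's data into its own partition.

Each daily partition directory needs a directory name (the partition column):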

/user/hive/warehouse/t_pv_log/day=2019-04-08/
			    /day=2019-04-09/

/user/hive/warehouse/t_buyer_log/city=beijing/
			    /city=shanghai/

This way day=2019-04-08 and day=2019-04-09 both belong to t_pv_log; a query can target a single date or read the whole of t_pv_log.

7.2.4.1 Example with one partition column:

1) Create a partitioned table

create table t_access(ip string,url string,access_time string)

partitioned by(day string)

row format delimited

fields terminated by ',';

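When data is later inserted into the table, a day value has to be specified, for example day=2017-09-16, and the data goes into that directory. (Note: the subdirectory day=2017-09-16 does not exist when the table is created; it is created in HDFS when data is inserted or loaded.)

Note: a partition column must not duplicate a column already in the table definition, otherwise the names clash. A partition column is a pseudo-column, and it also appears in select results.

2) Load data into partitions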

load data local inpath '/root/access.log.2019-04-08.log' into table t_access partition(day='20190408');

load data local inpath '/root/access.log.2019-04-09.log' into table t_access partition(day='20190409');

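[Note: with local inpath, the path refers to a local directory on the machine where the Hive service runs.] After loading, /user/hive/warehouse/access.db/t_access contains the directories day=20190408 and day=20190409, and day=20190408 holds the log file we loaded.

3) Query the partitioned data

a) Total PV for April 8: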

select count(*) from t_access where day='20190408';

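In essence, the partition column is used like an ordinary table column [it is actually a pseudo-column], so a where clause can select the partition.

b) Total PV over all data in the table: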

select count(*) from t_access;

In essence, just leave out the partition condition.
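To make the effect of the partition filter concrete, here is a small Python sketch that models the table directory on a local disk. It is only an analogy under stated assumptions: the helper names partition_dirs and count_rows, the file name access.log and the row contents are invented for illustration, and real Hive resolves partitions through the metastore rather than by listing directories.

```python
import tempfile
from pathlib import Path


def partition_dirs(table_dir, day=None):
    """Return the partition directories a query would have to read."""
    dirs = sorted(p for p in Path(table_dir).iterdir() if p.name.startswith("day="))
    if day is None:
        return dirs  # no partition filter: every partition is scanned
    return [p for p in dirs if p.name == f"day={day}"]


def count_rows(dirs):
    """count(*) over all data files in the given partition directories."""
    return sum(len(f.read_text().splitlines()) for d in dirs for f in d.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "t_access"
    # Build two partitions with 3 and 2 rows (hypothetical contents).
    for day, rows in {"20190408": 3, "20190409": 2}.items():
        part = table / f"day={day}"
        part.mkdir(parents=True)
        (part / "access.log").write_text("ip,url,time\n" * rows)

    pruned = partition_dirs(table, day="20190408")   # where day='20190408'
    pruned_names = [p.name for p in pruned]
    pv_one_day = count_rows(pruned)
    pv_all = count_rows(partition_dirs(table))       # no where clause
```

With the filter only one directory is read; without it both are.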

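7.3 Importing and exporting data

Option 1: put files into the table directory by hand with hdfs commands.

Option 2: in the Hive interactive shell, load local data into the table directory with a hive command

hive>load data local inpath '/root/order.data.2' into table t_order;

Option 3: load a data file that is already in HDFS into the table directory with a hive command

hive>load data inpath '/access.log.2019-04-09.log' into table t_access partition(day='20190409');

Note the difference between loading a local file and loading an HDFS file:
Loading a local file into a table: the file is copied.
Loading an HDFS file into a table: the file is moved (into the table's directory).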
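The copy-versus-move difference can be illustrated with ordinary files. This Python sketch only mimics the behavior on a local temporary directory; the directory layout and file names are hypothetical stand-ins, not Hive or HDFS calls.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    table_dir = root / "warehouse" / "t_access"
    table_dir.mkdir(parents=True)

    local_file = root / "local.log"     # stands in for a file on local disk
    staged_file = root / "staged.log"   # stands in for a file already in HDFS
    local_file.write_text("a\n")
    staged_file.write_text("b\n")

    # load data local inpath ...: the source file is copied into the table directory
    shutil.copy(local_file, table_dir / local_file.name)
    # load data inpath ...: the source file is moved into the table directory
    shutil.move(str(staged_file), str(table_dir / staged_file.name))

    local_still_there = local_file.exists()
    staged_still_there = staged_file.exists()
    loaded = sorted(p.name for p in table_dir.iterdir())
```

After the two operations the local source still exists, while the staged source is gone from its original location.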

Exporting data from a Hive table to files under a given path

(1) Export data from a Hive table to files in HDFS

insert overwrite directory '/root/access-data'

row format delimited fields terminated by ','

select * from t_access;

(2) Export data from a Hive table to files on the local disk

insert overwrite local directory '/root/access-data'

row format delimited fields terminated by ','

select * from t_access limit 100000;

7.4 Hive file formats

Hive supports many file formats: SEQUENCE FILE | TEXT FILE | PARQUET FILE | RC FILE


create table t_pq(movie string,rate int) stored as textfile;

create table t_pq(movie string,rate int) stored as sequencefile;

create table t_pq(movie string,rate int) stored as parquetfile;

Demo:
1) First create a table stored as text files

create table t_access_text(ip string, url string,access_time string)
row format delimited fields terminated by ','
stored as textfile;

Load text data into the table:

load data local inpath '/root/access-data/000000_0' into table t_access_text;

2) Create a table stored as sequence files:

create table t_access_seq(ip string,url string,access_time string)
stored as sequencefile;

Query the text table and insert the rows into the sequencefile table; the data files it produces are in sequencefile format:

insert into t_access_seq
select * from t_access_text;

3) Create a table stored as parquet files:

create table t_access_parq(ip string,url string,access_time string)
stored as parquetfile;

8. Hive data types

8.1 Numeric types

TINYINT (1-byte signed integer, from -128 to 127)

SMALLINT (2-byte signed integer, from -32,768 to 32,767)

INT/INTEGER (4-byte signed integer, from -2,147,483,648 to 2,147,483,647)

BIGINT   (8-byte signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)

FLOAT     (4-byte single precision floating point number)

DOUBLE   (8-byte double precision floating point number)

Example:

create table t_test(a string ,b int,c bigint,d float,e double,f tinyint,g smallint);

8.2 Date and time types

TIMESTAMP (Note: Only available starting with Hive 0.8.0)
DATE (Note: Only available starting with Hive 0.12.0)

Example: suppose we have the following data file:

1,zhangsan,1985-06-30
2,lisi,1986-07-10
3,wangwu,1985-08-09

Then we can create a table to map the data

create table t_customer(id int,name string,birthday date)

row format delimited fields terminated by ',';

Then load the data

load data local inpath '/root/customer.dat' into table t_customer;

After that, queries return the data correctly

8.3 String types

STRING

VARCHAR (Note: Only available starting with Hive 0.12.0)

CHAR (Note: Only available starting with Hive 0.13.0)

8.4 Miscellaneous types

BOOLEAN
BINARY (Note: Only available starting with Hive 0.8.0)

8.5 Complex types

8.5.1 array type

arrays: ARRAY<data_type> (Note: negative values and non-constant expressions are allowed as of Hive 0.14)

Example: using the array type

Suppose the following data needs to be mapped by a Hive table:

战狼2,吴京:吴刚:龙母,2017-08-16
三生三世十里桃花,刘亦菲:痒痒,2017-08-20

Idea: the lead actors are conveniently mapped to an array

Create the table:

create table t_movie(moive_name string , actors array<string> , first_show date)

row format delimited 

fields terminated by ','

collection items terminated by ':';

Load the data:

load data local inpath '/root/movie.dat' into table t_movie;

Queries:

select * from t_movie;

select moive_name,actors[0] from t_movie;

-- use array_contains(field,'keyword') to check whether an array column contains keyword
select moive_name,actors from t_movie where array_contains(actors,'吴刚');

-- number of lead actors per movie (actors is an array; size(field) returns the array length)
select moive_name,size(actors) from t_movie;
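How the two delimiters turn a text row into a string, an array and a date can be sketched in Python. This is an illustrative model only; parse_movie_line is a hypothetical helper, and Hive's own row parsing is done by its SerDe rather than by code like this.

```python
def parse_movie_line(line):
    """Split one text row the way the t_movie table definition describes it."""
    name, actors, first_show = line.split(",")    # fields terminated by ','
    return {
        "moive_name": name,
        "actors": actors.split(":"),              # collection items terminated by ':'
        "first_show": first_show,
    }


rows = [parse_movie_line(line) for line in [
    "战狼2,吴京:吴刚:龙母,2017-08-16",
    "三生三世十里桃花,刘亦菲:痒痒,2017-08-20",
]]

first_actor = [r["actors"][0] for r in rows]                            # actors[0]
with_wugang = [r["moive_name"] for r in rows if "吴刚" in r["actors"]]   # array_contains
actor_counts = [len(r["actors"]) for r in rows]                         # size(actors)
```

The three list comprehensions correspond to the actors[0], array_contains and size queries above.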

8.5.2 map type

maps: MAP<primitive_type, data_type> (Note: negative values and non-constant expressions are allowed as of Hive 0.14.)

1) Suppose we have the following data:

1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:xiaolin#mother:ruhua#sister:xiaoniu,29
4,mayun,father:mababa#mother:xiaoqiang,26

The key-value family members in this data can be described with a map type

2) CREATE TABLE statement:

create table t_person(id int,name string,family_members map<string,string>,age int)

row format delimited fields terminated by ','

collection items terminated by '#'

map keys terminated by ':';

3) Queries

select * from t_person;

-- value of a given key in the map column (each person's father)

select id,name,family_members['father'] as father from t_person;

-- all keys of the map column (each person's kinds of relatives)

select id,name,map_keys(family_members) as relation from t_person;

-- all values of the map column (the names of each person's relatives)

select id,name,map_values(family_members) from t_person;

-- number of relatives per person (using the size() function)

select id,name,size(family_members) as relations,age from t_person;

-- combined: people who have a brother [who has a brother, and who the brother is]

-- option 1
select id,name,brother_name 
from 
(select id,name,family_members['brother'] as brother_name from t_person) tmp
where brother_name is not null;

-- option 2
select id,name,age, family_members['brother'] 
from t_person where array_contains(map_keys(family_members),'brother');
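The same data can be walked through in Python to check what the map queries should return. This is a sketch for reasoning about the results, with parse_person_line as a hypothetical helper; it is not how Hive evaluates the queries.

```python
def parse_person_line(line):
    """Split one text row according to the t_person delimiters."""
    pid, name, family, age = line.split(",")             # fields terminated by ','
    # '#' separates map entries, ':' separates key and value
    members = dict(item.split(":") for item in family.split("#"))
    return {"id": int(pid), "name": name, "family_members": members, "age": int(age)}


people = [parse_person_line(line) for line in [
    "1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28",
    "2,lisi,father:mayun#mother:huangyi#brother:guanyu,22",
    "3,wangwu,father:xiaolin#mother:ruhua#sister:xiaoniu,29",
    "4,mayun,father:mababa#mother:xiaoqiang,26",
]]

# family_members['father']
fathers = {p["name"]: p["family_members"].get("father") for p in people}
# size(family_members)
relation_counts = {p["name"]: len(p["family_members"]) for p in people}
# array_contains(map_keys(family_members), 'brother')
has_brother = [
    (p["id"], p["name"], p["family_members"]["brother"])
    for p in people
    if "brother" in p["family_members"]
]
```

Only zhangsan and lisi have a brother key, so both "brother" queries should return exactly those two rows.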

8.5.3 struct type

structs: STRUCT<col_name : data_type, ...>

1) Suppose we have the following data:

1,zhangsan,18:male:beijing
2,lisi,28:female:shanghai

The user information consists of age (integer), sex (string) and address (string)

To describe the whole user information in a single column, a struct can be used

2) Create the table:

create table t_person_struct(id int,name string,info struct<age:int,sex:string,addr:string>)

row format delimited fields terminated by ','

collection items terminated by ':';

3) Queries

select * from t_person_struct;

-- select id, name and age
select id,name,info.age from t_person_struct;

8.6 Altering table definitions

These statements only change Hive metadata and never touch the data in the table; it is up to the user to make sure the actual data layout matches the metadata.

Rename a table:

ALTER TABLE table_name RENAME TO new_table_name

Example: alter table t_1 rename to t_x;

Rename a partition:

alter table t_partition partition(department='xiangsheng',sex='male',howold=20) rename to partition(department='1',sex='1',howold=20);

Add a partition:

alter table t_partition add partition (department='2',sex='0',howold=40);

Drop a partition:

alter table t_partition drop partition (department='2',sex='2',howold=24);

Change the file format of a table:

ALTER TABLE table_name [PARTITION partitionSpec] SET FILEFORMAT file_format

ALTER TABLE t_partition partition(department='2',sex='0',howold=40 ) set fileformat sequencefile;

Change a column definition:

ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|(AFTER column_name)]

alter table t_user change price jiage float first;

Add/replace columns:

ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)

alter table t_user add columns (sex string,addr string);
alter table t_user replace columns (id string,age int,price float);
9. Hive query syntax

9.1 Basic query examples

select * from t_access;

select count(*) from t_access;

select max(ip) from t_access;

9.2 Conditional queries

select * from t_access where access_time < '2017-08-06 15:30:20';

select * from t_access where access_time < '2017-08-06 16:30:20' and ip>'192.168.33.3';

9.3 Join query examples

Suppose we have a file a.txt

a,1
b,2
c,3
d,4

Suppose we have a file b.txt

a,xx
b,yy
d,zz
e,pp

With a.txt loaded into t_a(name, numb) and b.txt loaded into t_b(name, nick), run the various join queries:

1) inner join (join)

select 
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
join t_b b
on a.name=b.name;

Result (from the sample files above):

a	1	a	xx
b	2	b	yy
d	4	d	zz

2) left outer join (left join)

select 
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
left outer join t_b b
on a.name=b.name;

3) right outer join (right join)

select 
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
right outer join t_b b
on a.name=b.name;

4) full outer join (full join)

select 
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
full join t_b b
on a.name=b.name;

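5) left semi join

Hive versions before 0.13 do not support EXISTS/IN subqueries in the where clause; left semi join achieves the same effect:

select 
a.name as aname,
a.numb as anumb
from t_a a
left semi join t_b b
on a.name=b.name;

Note: the select clause of a left semi join must not reference columns of the right table, so select b.* does not work either.

In a left semi join, the right table must not be referenced in the where condition.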
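To check what each join type returns for a.txt and b.txt, the joins can be simulated in Python. This is a minimal nested-loop sketch for reasoning about results, with None standing in for NULL; the join helper is made up for illustration and says nothing about how Hive executes joins.

```python
t_a = [("a", "1"), ("b", "2"), ("c", "3"), ("d", "4")]        # (name, numb)
t_b = [("a", "xx"), ("b", "yy"), ("d", "zz"), ("e", "pp")]    # (name, nick)


def join(left, right, how):
    """Tiny nested-loop join on the first column; None stands in for NULL."""
    out = []
    matched_right = set()
    for l in left:
        hits = [r for r in right if r[0] == l[0]]
        matched_right.update(hits)
        if hits:
            out += [l + r for r in hits]
        elif how in ("left", "full"):
            out.append(l + (None, None))      # keep unmatched left rows
    if how in ("right", "full"):
        # keep unmatched right rows
        out += [(None, None) + r for r in right if r not in matched_right]
    return out


inner = join(t_a, t_b, "inner")
left = join(t_a, t_b, "left")
right = join(t_a, t_b, "right")
full = join(t_a, t_b, "full")
# left semi join: rows of t_a with at least one match in t_b, left columns only
semi = [l for l in t_a if any(r[0] == l[0] for r in t_b)]
```

The inner join keeps a, b and d; the left join adds c with NULLs on the right; the right join adds e with NULLs on the left; the full join has all five names; the semi join returns only t_a's columns for a, b and d.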

9.4 group by: grouping and aggregation

1) Create a table for the access-log data [partitioned]

create table t_access(ip string,url string,access_time string)
partitioned by (day string)
row format delimited fields terminated by ',';

2) Write the test data

//1) create /root/hivetest/access.log.0804
192.168.33.3,http://www.edu360.cn/stu,2017-08-04 15:30:20
192.168.33.3,http://www.edu360.cn/teach,2017-08-04 15:35:20
192.168.33.4,http://www.edu360.cn/stu,2017-08-04 15:30:20
192.168.33.4,http://www.edu360.cn/job,2017-08-04 16:30:20
192.168.33.5,http://www.edu360.cn/job,2017-08-04 15:40:20

//2) create /root/hivetest/access.log.0805
192.168.33.3,http://www.edu360.cn/stu,2017-08-05 15:30:20
192.168.44.3,http://www.edu360.cn/teach,2017-08-05 15:35:20
192.168.33.44,http://www.edu360.cn/stu,2017-08-05 15:30:20
192.168.33.46,http://www.edu360.cn/job,2017-08-05 16:30:20
192.168.33.55,http://www.edu360.cn/job,2017-08-05 15:40:20

//3) create /root/hivetest/access.log.0806
192.168.133.3,http://www.edu360.cn/register,2017-08-06 15:30:20
192.168.111.3,http://www.edu360.cn/register,2017-08-06 15:35:20
192.168.34.44,http://www.edu360.cn/pay,2017-08-06 15:30:20
192.168.33.46,http://www.edu360.cn/excersize,2017-08-06 16:30:20
192.168.33.55,http://www.edu360.cn/job,2017-08-06 15:40:20
192.168.33.46,http://www.edu360.cn/excersize,2017-08-06 16:30:20
192.168.33.25,http://www.edu360.cn/job,2017-08-06 15:40:20
192.168.33.36,http://www.edu360.cn/excersize,2017-08-06 16:30:20
192.168.33.55,http://www.edu360.cn/job,2017-08-06 15:40:20

3) Load the data

-- load the data
load data local inpath '/root/hivetest/access.log.0804' into table t_access partition(day = '2017-08-04');
load data local inpath '/root/hivetest/access.log.0805' into table t_access partition(day = '2017-08-05');
load data local inpath '/root/hivetest/access.log.0806' into table t_access partition(day = '2017-08-06');

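4) Show the table's partitions

show partitions t_access;

5) Run the SQL

-- Q1: for each day after August 4, the total number of visits to http://www.edu360.cn/job and the largest visitor ip (the three versions below are equivalent)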
select day,max(url),count(1),max(ip)
from t_access
where url = 'http://www.edu360.cn/job'
group by day
having day > '2017-08-04';

select day,'http://www.edu360.cn/job',count(1),max(ip)
from t_access
where url = 'http://www.edu360.cn/job'
group by day
having day > '2017-08-04';

select day,url,count(1),max(ip)
from t_access
where url = 'http://www.edu360.cn/job'
group by day,url
having day > '2017-08-04';


-- Q2: for each day after August 4, the total number of visits per page and the largest visitor ip
select day,url,count(1),max(ip)
from t_access where day > '2017-08-04'
group by day,url;


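Note: once a query has a group by clause, the select clause may only contain grouping columns and aggregate functions.

-- Why must where come before group by, and why can conditions after group by only use having?

Because where filters the data before the actual query logic runs,

while having filters the result produced by the group by aggregation.

Execution order of the statement:

1) where filters out rows that do not meet the condition
2) the aggregate functions and group by compute the aggregated result
3) having filters out aggregated rows that do not meet its condition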
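The three steps above can be traced in Python on a handful of rows shaped like the t_access sample data. This is a sketch of the logical order only; the row subset is chosen for illustration and the variable names are hypothetical.

```python
from collections import defaultdict

# (day, url, ip): a few rows in the shape of the t_access sample data
rows = [
    ("2017-08-04", "http://www.edu360.cn/job", "192.168.33.4"),
    ("2017-08-05", "http://www.edu360.cn/job", "192.168.33.46"),
    ("2017-08-05", "http://www.edu360.cn/job", "192.168.33.55"),
    ("2017-08-05", "http://www.edu360.cn/stu", "192.168.33.3"),
    ("2017-08-06", "http://www.edu360.cn/job", "192.168.33.55"),
    ("2017-08-06", "http://www.edu360.cn/job", "192.168.33.25"),
    ("2017-08-06", "http://www.edu360.cn/job", "192.168.33.55"),
]

# 1) where: filter raw rows before any grouping
filtered = [r for r in rows if r[1] == "http://www.edu360.cn/job"]

# 2) group by day, with the aggregates count(1) and max(ip)
groups = defaultdict(list)
for day, url, ip in filtered:
    groups[day].append(ip)
aggregated = {day: (len(ips), max(ips)) for day, ips in groups.items()}

# 3) having: filter the aggregated rows
result = {day: agg for day, agg in aggregated.items() if day > "2017-08-04"}
```

Note that max(ip) compares strings, exactly as in the HQL, so "192.168.33.55" beats "192.168.33.46" character by character rather than numerically.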

9.5 Subqueries

-- Q3: for each day after August 4, the total number of visits per page and the largest visitor ip, keeping only pages with more than 2 visits
select day,url,count(1) as cnts,max(ip)
from t_access where day > '2017-08-04'
group by day,url 
having cnts > 2;

-- option 2: use a subquery (its result is effectively an intermediate table, which is then filtered by a condition)
select day,url,cnts,max_ip
from
(select day,url,count(1) as cnts,max(ip) as max_ip
from t_access where day > '2017-08-04'
group by day,url) temp
where temp.cnts > 2;

10. Using Hive functions


10.1 Type conversion functions

-- string to int
select cast("5" as int); 
-- string to date
select cast("2017-08-03" as date) ;
-- timestamp to date
select cast(current_timestamp as date);

10.2 Math functions

select round(5.4);   -- 5.0
select round(5.1345,3);  -- 5.135
select ceil(5.4);  -- same as select ceiling(5.4);  result: 6
select floor(5.4);  -- 5
select abs(-5.4);  -- 5.4
select greatest(3,5,6);  -- 6
select least(3,5,6);  -- 3


select max(age) from t_person;   -- aggregate function
select min(age) from t_person;   -- aggregate function

10.3 String functions

substr(string, int start) ## extract a substring

substring(string, int start)

Example: select substr("abcdefg",2); // returns bcdefg

substr(string, int start, int len)

substring(string, int start, int len)

Example: select substr("abcdefg",2,3); // returns bcd

concat(string A, string B...) ## concatenate strings

concat_ws(string SEP, string A, string B...) ## SEP is the separator

Example: select concat("ab","xy");

select concat_ws(".","192","168","33","44"); // returns 192.168.33.44

length(string A) ## string length

Example: select length("192.168.33.44");

split(string str, string pat) ## split a string

Example: select split("192.168.33.44" , "." ); // wrong: "." is a special character in regular expressions, so this raises no error but does not give the intended result

select split("192.168.33.44","\\."); // correct, returns ["192","168","33","44"]

select split("192.168.33.44","\\.")[1]; // correct, returns 168
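The regular-expression pitfall is not specific to Hive. A short Python sketch with the standard re module shows the same effect; it illustrates the regex semantics only, not Hive's implementation of split.

```python
import re

ip = "192.168.33.44"

# "." as a regex matches every character, so only empty strings are left over.
wrong = re.split(".", ip)

# Escaping the dot makes it a literal separator.
right = re.split(r"\.", ip)

# Same element as split(...)[1] in HQL: the index is 0-based.
second = right[1]
```

In HQL the pattern is written "\\." because the string literal itself consumes one level of backslash escaping before the regex engine sees it.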

upper(string str) ## convert to upper case

lower(string str) ## convert to lower case

10.4 Time functions

select current_timestamp; ## current timestamp

select current_date; ## current date

## current Unix timestamp, in seconds (not milliseconds)

select unix_timestamp();

from_unixtime (bigint unixtime [, string format] )

Example: select from_unixtime(unix_timestamp());

select from_unixtime(unix_timestamp(),"yyyy/MM/dd HH:mm:ss");

## string to Unix timestamp

unix_timestamp(string date, string pattern)

Example: select unix_timestamp("2019-04-10 02:50:30");

select unix_timestamp("2019/04/10 02:50:30","yyyy/MM/dd HH:mm:ss");

## string to date

select to_date("2017-09-17 16:58:32");
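The relationship between unix_timestamp and from_unixtime can be sketched in Python. The two functions below are hypothetical stand-ins named after the Hive functions; they use strftime-style patterns instead of Hive's Java-style patterns and always interpret times as UTC, whereas Hive uses its configured time zone.

```python
import calendar
import time


def unix_timestamp(text, pattern="%Y-%m-%d %H:%M:%S"):
    """Seconds since the epoch for a formatted time string, interpreted as UTC."""
    return calendar.timegm(time.strptime(text, pattern))


def from_unixtime(seconds, pattern="%Y-%m-%d %H:%M:%S"):
    """Format epoch seconds as a UTC time string."""
    return time.strftime(pattern, time.gmtime(seconds))


ts = unix_timestamp("2019-04-10 02:50:30")
same_ts = unix_timestamp("2019/04/10 02:50:30", "%Y/%m/%d %H:%M:%S")
round_trip = from_unixtime(ts, "%Y/%m/%d %H:%M:%S")
```

The value is a count of seconds, so the same instant written in two formats gives the same number, and formatting that number with a pattern gives the original text back.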

10.5 Table-generating functions

10.5.1 Turning array elements into rows: explode(field)

Suppose we have the following data:

1,zhangsan,化学:物理:数学:语文
2,lisi,化学:数学:生物:生理:卫生
3,wangwu,化学:语文:英语:体育:生物

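Map it to a table:

create table t_stu_subject(id int,name string,subjects array<string>)

row format delimited fields terminated by ','

collection items terminated by ':';

Applying explode() to the array column, as in select explode(subjects) from t_stu_subject, outputs one row per array element.

The result of explode can then be used to get the distinct list of subjects:

select distinct tmp.sub
from 
(select explode(subjects) as sub from t_stu_subject) tmp;

10.5.2 lateral view with table-generating functions

select id,name,tmp.sub 
from t_stu_subject lateral view explode(subjects) tmp as sub;

How to read it: lateral view works like a join between two tables
Left table: the original table
Right table: the table produced by explode(some collection column)
In addition, the join only pairs data that comes from the same source row

This makes further queries easy.
For example, find the students who take biology (生物)

select a.id,a.name,a.sub from 
(select id,name,tmp.sub as sub from t_stu_subject lateral view explode(subjects) tmp as sub) a
where sub='生物';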
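The difference between a bare explode and lateral view explode can be modeled with list comprehensions. This Python sketch uses the three sample rows; it is an illustration of the row-level semantics, not of Hive's execution.

```python
students = [
    (1, "zhangsan", ["化学", "物理", "数学", "语文"]),
    (2, "lisi", ["化学", "数学", "生物", "生理", "卫生"]),
    (3, "wangwu", ["化学", "语文", "英语", "体育", "生物"]),
]

# explode(subjects): one output row per array element, other columns are dropped
exploded = [sub for _, _, subjects in students for sub in subjects]
distinct_subjects = set(exploded)

# lateral view explode(subjects) tmp as sub:
# each element is paired only with the row it came from
lateral = [(sid, name, sub) for sid, name, subjects in students for sub in subjects]

# where sub='生物'
biology = [(sid, name) for sid, name, sub in lateral if sub == "生物"]
```

Because each element stays attached to its own id and name, filtering on the exploded value still tells us which student it belongs to.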

10.6 Conditional functions

10.6.1 case when

Example:

select id,name,
case
when age<28 then 'young'
when age>27 and age<40 then 'middle-aged'
else 'old'
end
from t_user;

10.6.2 IF

select id,if(age>25,'working','worked') from t_user;

select moive_name,if(array_contains(actors,'吴刚'),'好电影','…') from t_movie;

10.7 JSON parsing function (a table-generating function)

json_tuple function

Example:

-- movie, rate, timeStamp and uid are keys in the JSON; as(...) names the generated columns
select json_tuple(json,'movie','rate','timeStamp','uid') as(movie,rate,ts,uid) from t_rating_json;

For comparison, the raw rows can be inspected with:

select * 
from t_rating_json
limit 10;

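10.8 Analytic function: row_number() over(), top N per group

Suppose we have the following data:

1,18,a,male
2,19,b,male
3,22,c,female
4,16,d,female
5,30,e,male
6,26,f,female

-- create the table
create table t_rn(id int , age int , name string ,sex string)
row format delimited fields terminated by ',';

-- load the data
load data local inpath '/root/hivetest/t_rn.data' into table t_rn;

The task: for each sex, find the 2 rows with the highest age

Consider: a group by aggregate only produces one result per group, such as the maximum or the minimum

So the approach is: group --> sort --> number the rows --> filter with where rn < 3 (rows whose number within the group is below 3 are the first two).

-- partition by sex groups the rows by sex; order by age desc sorts each group by age, descending
select * from 
(select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rn
from t_rn) tmp
where rn < 3;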
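The group, sort, number, filter approach can be traced in Python on the six sample rows. This is a sketch of the logic behind row_number() over(partition by ... order by ...); top_n_per_group is a hypothetical helper, and the output order across groups is an artifact of the sort used here.

```python
from itertools import groupby

rows = [  # (id, age, name, sex)
    (1, 18, "a", "male"), (2, 19, "b", "male"), (3, 22, "c", "female"),
    (4, 16, "d", "female"), (5, 30, "e", "male"), (6, 26, "f", "female"),
]


def top_n_per_group(rows, n):
    # Sort by sex, then by age descending, so each group is contiguous and ordered.
    ordered = sorted(rows, key=lambda r: (r[3], -r[1]))
    result = []
    for _, group in groupby(ordered, key=lambda r: r[3]):   # partition by sex
        for rn, row in enumerate(group, start=1):            # row_number()
            if rn <= n:                                      # where rn < n + 1
                result.append(row + (rn,))
    return result


top2 = top_n_per_group(rows, 2)
```

Each result row carries its number within the group as the last element, matching the rn column of the HQL query.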

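10.9 Cumulative report queries (window function sum() over())

The window function sum() over() accumulates values row by row within a window, which is what a running-total report needs

Suppose we have the following data: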

A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
C,2015-01,10
C,2015-01,20
A,2015-02,4
A,2015-02,6
C,2015-02,30
C,2015-02,10
B,2015-02,10
B,2015-02,5
A,2015-03,14
A,2015-03,6
B,2015-03,20
B,2015-03,25
C,2015-03,10
C,2015-03,20

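Map it to a table:

create table t_access_times(username string,month string,counts int)
row format delimited fields terminated by ',';

We need an HQL script that produces the following cumulative report:

User	Month	Monthly total	Running total up to the month
A	2015-01	33	33
A	2015-02	10	43
A	2015-03	20	63
B	2015-01	30	30
B	2015-02	15	45

Suppose the first three columns (user uuid, month, monthly total amount) already exist in a table t_access_amount. The running total accumulate is then, within each user's group (for example A) sorted by month ascending, the sum of all months up to and including the current one: January total = January; February total = January + February; March total = January + February + March; and so on.

Compute each user's running total up to the current month

The SQL statement (the subquery computes the monthly totals from t_access_times):

select username,month,amount,
sum(amount) over(partition by username order by month rows between unbounded preceding and current row) as accumulate
from 
(select username,month,sum(counts) as amount
from t_access_times
group by username,month) tmp;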
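The expected report can be cross-checked against the raw rows with a few lines of Python. This is a sketch of the two steps involved (monthly totals per user, then a running total per user ordered by month); the variable names are hypothetical and the data is the sample listed above.

```python
from collections import defaultdict
from itertools import accumulate

raw = """A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
C,2015-01,10
C,2015-01,20
A,2015-02,4
A,2015-02,6
C,2015-02,30
C,2015-02,10
B,2015-02,10
B,2015-02,5
A,2015-03,14
A,2015-03,6
B,2015-03,20
B,2015-03,25
C,2015-03,10
C,2015-03,20"""

# Step 1: group by user and month, sum(counts)
monthly = defaultdict(int)
for line in raw.splitlines():
    user, month, counts = line.split(",")
    monthly[(user, month)] += int(counts)

# Step 2: per user, ordered by month, running sum of the monthly totals
report = []
for user in sorted({u for u, _ in monthly}):
    months = sorted(m for u, m in monthly if u == user)
    amounts = [monthly[(user, m)] for m in months]
    for month, amount, total in zip(months, amounts, accumulate(amounts)):
        report.append((user, month, amount, total))
```

Each tuple in report is (user, month, monthly total, running total), the same four columns as the report table.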

10.10 Hive user-defined functions

Suppose we have the following JSON data: rating.json

{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}

Create a table to map this data

create table t_ratingjson(json string);

load data local inpath '/root/hivetest/rating.json' into table t_ratingjson;

We want to turn the raw data above into the following form:

1193,5,978300760,1
661,3,978302109,1
914,3,978301968,1
3408,4,978300275,1

Idea: this becomes easy if we can define a JSON-parsing function

create table t_rate
as
select myjson(json,1) as movie,cast(myjson(json,2) as int) as rate,myjson(json,3) as ts,myjson(json,4) as uid from t_ratingjson;

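Solution:
How to define your own function in Hive:
1) Write a Java class (extends UDF, with an overloaded method public C evaluate(A a,B b)) that implements the function you need (here: take a JSON string and an index, return one value). C is the return type seen by Hive; A and B are the two arguments.

package cn.itcats.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class MyJsonParser extends UDF{
	
	// Overload: the return type and the number and types of parameters are entirely up to you
	// Here: given a JSON line and a 1-based field index, return that field's value
	public String evaluate(String json, int index) {
		if (json == null) {
			return null;
		}
		// Splitting on the double quote leaves the four values at positions 3, 7, 11 and 15
		String[] split = json.split("\"");
		int pos = 4 * index - 1;
		return (index >= 1 && pos < split.length) ? split[pos] : null;
	}
}

2) Package the Java program as a jar and upload it to the machine where Hive runs
3) On the Hive command line, add the jar to the classpath:
hive> add jar /root/hivetest/myjson.jar;
4) On the Hive command line, create a function named myjson and bind it to the Java class you wrote
hive> create temporary function myjson as 'cn.itcats.hive.udf.MyJsonParser';

See the official UDF documentation: https://cwiki.apache.org/confluence/display/Hive/HivePlugins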

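11. Exercise

11.1 Word count in HQL

Suppose we have the following text file:

hello tom hello jim
hello rose hello tom
tom love rose rose love jim
jim love tom love is what
what is love

We need to do a word count with Hive

-- create the table
create table t_wc(sentence string);

-- load the data
load data local inpath '/root/hivetest/xx.txt' into table t_wc;

HQL answer:

First use split(sentence,' ') to split on spaces, which returns an array

Then apply explode(array), treat the result as a temporary table, and group and aggregate it to get the counts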

SELECT word,count(1) as cnts
FROM (
    SELECT explode(split(sentence, ' ')) AS word
    FROM t_wc
    ) tmp
GROUP BY word
order by cnts desc;
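The expected counts for the sample text can be verified with a short Python sketch that follows the same split, explode, group and count steps. It is only a cross-check of the result, not a replacement for the HQL.

```python
from collections import Counter

lines = [
    "hello tom hello jim",
    "hello rose hello tom",
    "tom love rose rose love jim",
    "jim love tom love is what",
    "what is love",
]

# split(sentence, ' ') + explode: one word per "row"; Counter does the group by + count(1)
counts = Counter(word for line in lines for word in line.split(" "))

# order by cnts desc
ranked = counts.most_common()
```

The word love comes first with 5 occurrences, followed by hello and tom with 4 each.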

itcats_cn


Understanding Hive in depth [Hive architecture, installation and configuration, Hive syntax]