Hive: A Summary

When reposting, please credit the source: https://blog.csdn.net/l1028386804/article/details/88674278

I. Hive Commands

1. Run a query and exit immediately
Use the hive -e form:

hive -e 'select count(*) from test'

2. Suppress non-essential output with the -S (silent) option

hive -S -e 'select count(*) from test'

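3. To run several queries in one go, save them in a file with the .hql suffix and execute the file with hive -f:

hive -f test.hql

4. A line beginning with "--" is treated as a comment in a Hive script:

-- count the rows in the test table
select count(*) from test;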

5. Show the usage help for the Hive CLI

hive --help --service cli

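II. Hive Delimiters

\n: line terminator; separates records
^A (Ctrl+A): written as the octal escape \001 in text; separates fields (columns)
^B (Ctrl+B): written as the octal escape \002 in text; separates the elements of an ARRAY or STRUCT, or the key-value pairs of a MAP
^C (Ctrl+C): written as the octal escape \003 in text; separates the key from the value inside a MAP entry

For example, the following create table statement spells these delimiters out: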

create table student(
name string,
age int,
cource array<string>,
body map<string, float>,
address struct<street:string, city:string, state:string>
)
row format delimited 
fields terminated by '\001' 
collection items terminated by '\002' 
map keys terminated by '\003' 
lines terminated by '\n' 
stored as textfile;
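To make the three delimiter levels concrete, here is a minimal Python sketch that decodes one text line laid out for the student table above. The helper name parse_student and the sample values are hypothetical, and the sketch ignores NULL markers and empty collections, which a real SerDe would handle.

```python
# Hypothetical helper: decode one text-format row of the `student` table
# using \001 (fields), \002 (collection items) and \003 (map keys).
FIELD, ITEM, KEY = "\x01", "\x02", "\x03"


def parse_student(line):
    # Level 1: split the row into its five columns.
    name, age, cource, body, address = line.rstrip("\n").split(FIELD)
    # Level 2: a struct is stored as its member values joined by \002.
    street, city, state = address.split(ITEM)
    return {
        "name": name,
        "age": int(age),
        # Level 2: array elements are joined by \002.
        "cource": cource.split(ITEM),
        # Level 2 + 3: map entries are joined by \002, key and value by \003.
        "body": {
            k: float(v)
            for k, v in (entry.split(KEY) for entry in body.split(ITEM))
        },
        "address": {"street": street, "city": city, "state": state},
    }


# Made-up sample row, built with the same delimiters.
row = FIELD.join([
    "Tom",
    "18",
    ITEM.join(["math", "physics"]),
    ITEM.join(["height" + KEY + "1.75", "weight" + KEY + "60.5"]),
    ITEM.join(["1 Main St", "Chengdu", "Sichuan"]),
]) + "\n"

record = parse_student(row)
print(record)
```

The point of the sketch is only the nesting: fields first, then collection items, then map keys.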

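If you have no special requirements for the file or data format when creating a table, you can rely on Hive's defaults for some or all of these settings and omit the corresponding clauses. For example: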

create table student(
name string,
age int,
cource array<string>,
body map<string, float>,
address struct<street:string, city:string, state:string>
)
row format delimited 
fields terminated by '\t';

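Here only the field delimiter is overridden (set to a tab); everything else uses Hive's defaults. If the field delimiter does not need to be changed either, the whole row format delimited fields terminated by '\t' clause can be omitted as well.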

III. Databases in Hive

The database Hive uses by default is named default. New databases are created with the create database statement:

create database test;

If the database already exists, this statement raises an error. To avoid that, use:

create database if not exists test;

Use the show databases statement to list the existing databases:

show databases;

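On HDFS, the directory of each database created this way has a name ending in .db.
To store a particular database somewhere else, specify the location when creating the database: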

create database test location '/user/hadoop/temp';

To inspect an existing database, use:

describe database test;

To view the structure of an existing table:

describe test;

To switch the current working database:

use test;

To drop a database:

drop database if exists test cascade;

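if exists is optional and, as before, avoids the error raised when the database does not exist. cascade is also optional; it drops the tables in the database along with the database itself. By default Hive refuses to drop a non-empty database, and an attempt to do so fails with the error "Database test is not empty. One or more tables exist." Once a database is dropped, its HDFS directory is deleted as well.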

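IV. Tables in Hive

1. Listing the tables in a database

show tables;

To list the tables of a particular database, either switch to it with use and then run show tables:

use test;
show tables;

or use the following command: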

show tables in test;

2. Creating a table

A complete create table statement looks like this:

create table if not exists test.student(
name string comment 'student name',
age int comment 'student age',
cource array<string>,
body map<string, float>,
address struct<street:string,city:string,state:string> comment 'the info of student'
)
row format delimited 
fields terminated by '\001' 
collection items terminated by '\002' 
map keys terminated by '\003' 
lines terminated by '\n' 
stored as textfile 
location '/user/hive/warehouse/test.db/student';

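To view the columns and their comments:

desc student;

To view table-level details and comments:

desc extended student;

or, in a more readable layout:

desc formatted student;

To list the existing tables:

show tables;

or, for a specific database:

show tables in test;

A table can also be created by copying the schema of another table (the data is not copied):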

create table if not exists test.student2 like test.student;

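3. Managed tables

Unless stated otherwise at creation time, every table is a so-called managed table (MANAGED TABLE). Hive takes responsibility for managing the table's data and by default stores it under the data warehouse directory. When a managed table is dropped, Hive deletes both its data and its metadata.

External tables:

An external table is created as follows: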

create external table if not exists test.student(
name string,
age int,
cource array<string>,
body map<string, float>,
address struct<street:string,city:string,state:string>
)
location '/user/test/x';

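The external keyword marks the table as external, and the location clause states that its data lives under the HDFS directory /user/test/x.
When an external table is dropped, Hive assumes it does not fully own the data, so it deletes only the table's metadata and leaves the data in place.

Whether a table is managed or external cannot be inferred from whether its data sits under Hive's default warehouse directory: a managed table can also be given a storage path with a location clause at creation time. As a rule of thumb, when data has to be shared by several tools, create an external table to make the ownership of the data explicit.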

Partitioned tables:

Partitioning splits a table horizontally. A partitioned table is created as follows:

create table student_info(
student_id string,
name string,
age int,
sex string,
father_name string,
mother_name string
)
partitioned by (province string, city string);

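Note: partition columns must not duplicate any column defined in the table body. A Hive table is stored as a directory on HDFS, and each partition is a subdirectory of the table directory.
For a simple query whose filter only restricts partition columns (for example a plain select * with a where clause on the partition), Hive can read the matching partition directories directly without launching a MapReduce job. Tables are most commonly partitioned by creation or modification time, so a single table can accumulate a large number of partitions.
Setting Hive's mode to "strict" blocks any query on a partitioned table that does not restrict the partitions: such a job is refused at submission. Either set the hive.mapred.mode property to strict in hive-site.xml (the default in older releases is nonstrict), or run the following in the Hive command line:

set hive.mapred.mode=strict;

The two differ in scope: the former applies to all sessions, the latter only to the current session.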

Use show partitions student_info to list the partitions of student_info, and describe extended student_info to see the detailed information of the partitioned table.
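The "partition = subdirectory" layout can be sketched in a few lines of Python. The helper names and the warehouse path below are assumptions for illustration; Hive additionally escapes special characters in partition values, which the sketch leaves out.

```python
# Sketch of how partition key/value pairs map to subdirectories of the
# table directory, and how a filter on a partition column narrows the
# set of directories that has to be read.

def partition_path(table_dir, partition_spec):
    """Build <table_dir>/<col1>=<val1>/<col2>=<val2>/... in declaration order."""
    parts = [f"{column}={value}" for column, value in partition_spec]
    return "/".join([table_dir.rstrip("/")] + parts)


def matching_partitions(paths, column, value):
    """Keep only the partition directories where column == value."""
    wanted = f"{column}={value}"
    return [p for p in paths if wanted in p.split("/")]


base = "/user/hive/warehouse/student_info"  # assumed table directory
partitions = [
    partition_path(base, [("province", "sichuan"), ("city", "chengdu")]),
    partition_path(base, [("province", "sichuan"), ("city", "mianyang")]),
    partition_path(base, [("province", "yunnan"), ("city", "kunming")]),
]

# Equivalent in spirit to: ... where province = 'sichuan'
pruned = matching_partitions(partitions, "province", "sichuan")
print(pruned)
```

A query without any partition filter has no such shortcut and must read every directory, which is what strict mode guards against.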

External partitioned tables:

create external table student_info1(
student_id string,
name string,
age int,
sex string,
father_name string,
mother_name string
)
partitioned by (province string, city string);

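Unlike the ordinary external table above, no storage path was given when this table was created, so a query run right after creation returns no data. A location has to be registered separately for each combination of partition key values:

alter table student_info1 add partition (province = 'sichuan', city = 'chengdu') location 'hdfs://liuyazhuang11:9000/user/hadoop/temp/student_info/sichuan/chengdu';

As with any other external table, dropping an external partitioned table does not delete its data.

For managed and external tables alike, once a table is partitioned, data must be loaded into a specific partition. For example:

load data inpath '/user/hadoop/data' into table student_info partition (province = 'sichuan', city = 'chengdu');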

4. Dropping a table

drop table test;
drop table if exists test;

5. Altering a table

First, create an external table with the following command:

create external table if not exists test(id string, name string) partitioned by (x string, y string);

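5-1. Renaming a table

alter table test rename to test2;

5-2. Adding, modifying, and dropping partitions
Adding a partition (typically for an external table):

alter table test add partition(x='x1', y='y1') location '/user/test/x1/y1';

Modifying a partition:

alter table test partition(x='x1', y='y1') set location '/user/test/x1/y1';

This command changes the path of an existing partition.

Dropping a partition: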

alter table test drop partition(x='x1', y='y1');

5-3. Modifying column information
A column can be renamed, and its data type, comment, and position in the table can be changed:

alter table test 
change column id uid int 
comment 'the unique id' 
after name;

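This renames the id column of test to uid, declares its type as int (the type must be restated even if it is unchanged), sets its comment to "the unique id", and finally moves the column after name.

5-4. Adding columns
One or more columns can be added with:

alter table test add columns(new_col int, new_col2 string);

5-5. Dropping or replacing columns

alter table test replace columns(new_col int, new_col2 string);

This command removes all existing columns of test and redefines the column list. Only the metadata changes; the table data itself is neither lost nor modified.

Note: alter table only modifies the table's metadata, so make sure the table data still matches the modified schema; otherwise the data becomes unusable.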

V. Data Operations

1. Loading data

load data inpath '/user/hadoop/o' into table test;

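This command appends all files under the HDFS directory /user/hadoop/o to the test table (the files are moved into the table's directory). To replace the records already in test, add the overwrite keyword:

load data inpath '/user/hadoop/o' overwrite into table test;

If test is a partitioned table, the partition must be specified in the statement:

load data inpath '/user/hadoop/o' overwrite into table test partition (part = 'a');

To load data directly from the local file system:

load data local inpath '/home/hadoop/o' into table test;
load data local inpath '/home/hadoop/o' overwrite into table test;

2. Inserting data into a table from a query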

insert overwrite table test select * from source;

When test is a partitioned table, the partition must be specified:

insert overwrite table test partition (part = 'a') select id, name from source;

A single query can also feed several disjoint outputs:

from source 
insert overwrite table test partition (part = 'a')
select id, name where id >=0 and id < 100
insert overwrite table test partition (part = 'b')
select id, name where id >=100 and id < 200
insert overwrite table test partition (part = 'c')
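select id, name where id >=200 and id < 300;

This way a single scan of source is enough to insert the matching rows into each partition of test, which is very convenient. To use this feature, the from clause must be written first.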
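The benefit of the from-first form is that the source is read once while each row is tested against every insert branch. A rough Python model of that idea follows; the function name, the branch names, and the sample rows are all made up for illustration.

```python
# Model of a multi-insert: one pass over the source, several outputs.

def multi_insert(source, branches):
    """branches is a list of (output_name, predicate) pairs."""
    outputs = {name: [] for name, _ in branches}
    for row in source:                      # the source is scanned exactly once
        for name, predicate in branches:
            if predicate(row):
                outputs[name].append(row)
    return outputs


source = [
    {"id": 5, "name": "a"},
    {"id": 150, "name": "b"},
    {"id": 250, "name": "c"},
    {"id": 999, "name": "d"},               # matches no branch, so it is dropped
]

branches = [
    ("part=a", lambda r: 0 <= r["id"] < 100),
    ("part=b", lambda r: 100 <= r["id"] < 200),
    ("part=c", lambda r: 200 <= r["id"] < 300),
]

outputs = multi_insert(source, branches)
print(outputs)
```

Running three separate insert ... select statements would give the same partitions but scan source three times.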

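3. Inserting data with dynamic partitioning

With dynamic partitioning, Hive infers the partitions to create from the values returned by the query. For example: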

insert overwrite table test partition(time) select id, modify_time from source;

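The partition column of test is time, and Hive creates one partition per distinct value of modify_time. Note that Hive picks the dynamic partition value by position, not by name: it uses the last column of the select list. If n dynamic partition columns are specified, the last n columns of the select list are used, in order.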
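The "last n columns by position" rule is easy to get wrong, so here is a small Python sketch of it. The function name and the sample rows are hypothetical; the sketch assumes at least one dynamic partition column.

```python
from collections import defaultdict

# Model of dynamic partitioning: the last n values of each selected row
# become the partition values, the remaining values are the stored columns.

def dynamic_partition_insert(rows, partition_columns):
    n = len(partition_columns)              # assumed to be >= 1
    partitions = defaultdict(list)
    for row in rows:
        data, keys = row[:-n], row[-n:]
        spec = "/".join(f"{c}={v}" for c, v in zip(partition_columns, keys))
        partitions[spec].append(data)
    return dict(partitions)


# Rows as produced by: select id, modify_time from source
rows = [(1, "2019-03-01"), (2, "2019-03-02"), (3, "2019-03-01")]
result = dynamic_partition_insert(rows, ["time"])
print(result)
```

The column is called modify_time in the query and time in the table, yet the rows still land in time=... partitions, because only the position matters.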

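Depending on the Hive version, dynamic partitioning may be disabled by default, so set the following parameters before running the statement:

set hive.exec.dynamic.partition=true;

Setting this to true enables dynamic partitioning.

set hive.exec.dynamic.partition.mode=nonstrict;

Setting the mode to nonstrict allows every partition column to be dynamic. By default (strict mode) Hive requires at least one static partition column, and static partition columns must come before dynamic ones.
Other parameters related to dynamic partitioning include:

hive.exec.max.dynamic.partitions.pernode: the maximum number of partitions each mapper or reducer may create.
hive.exec.max.dynamic.partitions: the maximum number of partitions a single dynamic-partition statement may create.
hive.exec.max.created.files: the maximum number of files a single MapReduce job may create.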

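4. Loading data with CTAS

CTAS is short for create table ... as select: a single statement that both creates a table and loads data into it. Hive supports it as follows:

create table test as select id, name from source;

5. Exporting data

Select the data in the shape you need with a query, then use an insert clause to export it to HDFS or to the local file system:

insert overwrite directory '/user/hadoop/r' select * from test;
insert overwrite local directory '/home/hadoop/r' select * from test;

If the data in the Hive table already has the format you need, simply copy the files or the directory: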

hadoop fs -cp /user/hive/warehouse/source_table /user/hadoop/

VI. Querying Data

1. select ... from statements

select col1, col2 from table;
select t.col1 c1, t.col2 c2 from table t;

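Nested queries:

select l.name, r.course from (select id, name from left_table) l join (select id, course from right_table) r on l.id = r.id;

Columns can also be selected with a regular expression, written between backticks:

select `user.*` from test;

This selects every column of test whose name starts with user; if test has columns such as user_name and user_age, both are returned. Newer Hive releases only interpret the backticked name as a regular expression when hive.support.quoted.identifiers is set to none.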

Use a limit clause to cap the number of rows returned:

select * from test limit 100;

To transform the values of a column inside a select statement, Hive supports the case ... when ... then form:

select id, name, sex,
case 
when sex = 'M' then 'male' 
when sex = 'F' then 'female' 
else 'invalid data'
end
from student;

2. where clauses

Query results often need to be restricted by conditions, which is what the where clause is for:

select * from student where age = 18;

3. group by and having

select count(*) from student group by age;
select avg(age) from student group by classId;

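Every selected column that does not appear in the group by clause must be wrapped in an aggregate function. The following statement is therefore invalid:

select name, avg(age) from student group by classId;

Hive rejects it with an "Expression not in GROUP BY key name" error.

To filter the grouped results, use a having clause: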

select classId, AVG(age) from student where sex = 'F' group by classId having avg(age) > 18;
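The order in which where, group by, and having apply can be modelled in plain Python. This is only a sketch of the evaluation order for the query above; the function name and the sample students are invented.

```python
from collections import defaultdict

# Model of: select classId, avg(age) from student
#           where sex = 'F' group by classId having avg(age) > 18

def avg_age_by_class(students, min_avg):
    groups = defaultdict(list)
    for s in students:
        if s["sex"] == "F":                          # where: filters rows before grouping
            groups[s["classId"]].append(s["age"])    # group by classId
    averages = {c: sum(ages) / len(ages) for c, ages in groups.items()}
    # having: filters whole groups after aggregation
    return {c: avg for c, avg in averages.items() if avg > min_avg}


students = [
    {"classId": 1, "sex": "F", "age": 18},
    {"classId": 1, "sex": "F", "age": 20},
    {"classId": 2, "sex": "F", "age": 17},
    {"classId": 2, "sex": "M", "age": 30},   # removed by the where step
]

result = avg_age_by_class(students, 18)
print(result)
```

Class 2 is dropped because its average over the remaining rows is 17; the 30-year-old never reaches the aggregate.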

4. JOIN statements

Suppose the id column of table1 contains:

1
2
3
4

and the id column of table2 contains:

2
3
5
6

4-1. INNER JOIN

select t1.id,t2.id from table1 t1 join table2 t2 on t1.id = t2.id;

The result is:

2       2
3       3

4-2. left/right outer join

Left outer join:
Keeps every row of the left table:

select t1.id, t2.id from table1 t1 left outer join table2 t2 on t1.id = t2.id;

The result is:

1       NULL
2       2
3       3
4       NULL

Right outer join:
Keeps every row of the right table:

select t1.id, t2.id from table1 t1 right outer join table2 t2 on t1.id = t2.id;

The result is:

2       2
3       3
NULL    5
NULL    6

4-3. full outer join

Keeps every row of both the left and the right table:

select t1.id, t2.id from table1 t1 full outer join table2 t2 on t1.id = t2.id;

The result is:

1       NULL
2       2
3       3
4       NULL
NULL    5
NULL    6

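4-4. left semi join

Returns the rows of the left table for which the right table contains a row satisfying the on condition. A left semi join stands in for the IN subquery of standard SQL; only columns of the left table may appear in the select list:

select t1.id from table1 t1 left semi join table2 t2 on t1.id = t2.id;

The result is:

2
3

With this data an inner join gives the same result, because each id occurs only once in table2; if table2 contained duplicate ids, the inner join would return the matching left rows more than once:

select t1.id from table1 t1 join table2 t2 on t1.id = t2.id;

The result is: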

2
3
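The join variants above can be reproduced with a few list comprehensions. In this sketch the id values 1, 2, 3, 4 and 2, 3, 5, 6 are inferred from the result listings in this section, the function names are made up, and None plays the role of NULL.

```python
# Tiny in-memory model of the join types on a single id column.
table1 = [1, 2, 3, 4]
table2 = [2, 3, 5, 6]


def inner_join(left, right):
    return [(l, r) for l in left for r in right if l == r]


def left_outer_join(left, right):
    rows = []
    for l in left:
        matches = [(l, r) for r in right if l == r]
        rows.extend(matches or [(l, None)])   # unmatched left rows get NULL
    return rows


def right_outer_join(left, right):
    # Mirror image of the left outer join, with the columns swapped back.
    return [(l, r) for r, l in left_outer_join(right, left)]


def full_outer_join(left, right):
    unmatched_right = [(None, r) for r in right if r not in left]
    return left_outer_join(left, right) + unmatched_right


def left_semi_join(left, right):
    keys = set(right)
    return [l for l in left if l in keys]     # each left row at most once


print(inner_join(table1, table2))
print(left_outer_join(table1, table2))
print(right_outer_join(table1, table2))
print(full_outer_join(table1, table2))
print(left_semi_join(table1, table2))
```

The difference between a semi join and an inner join shows up with duplicate keys on the right: left_semi_join([2], [2, 2]) keeps the left row once, while inner_join([2], [2, 2]) produces it twice.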

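4-5. map-side join

Hive supports joining on the map side; it is requested with a hint in the query:

select /*+mapjoin(t1)*/ t1.id, t2.id from table1 t1 join table2 t2 on t1.id = t2.id;

Setting hive.auto.convert.join=true lets Hive convert joins to map-side joins automatically. The hive.mapjoin.smalltable.filesize parameter defines the size below which a table counts as small; its default is 25000000 bytes.

4-6. Joining multiple tables

Hive can join more than two tables:

select * 
from table1 t1 
join table2 t2 on t1.id = t2.id 
join table3 t3 on t1.id = t3.id;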

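5. order by and sort by

order by in Hive has the same semantics as in SQL: it produces a total ordering of the result. That requires all rows to pass through a single reducer, otherwise a global order cannot be guaranteed:

select * from student order by classId desc, age asc;

Hive also offers sort by, which only sorts locally within each reducer:

select * from student sort by classId desc, age asc;

With a single reducer the two produce identical results. With more than one reducer, each reducer's output is sorted on its own, so the value ranges of different reducers may overlap.
For example, given the id column of a table table3, order by id returns one globally sorted sequence, whereas sort by id with two reducers returns two independently sorted runs, one per reducer.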
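A quick Python model makes the difference visible. The id values are made up, and the round-robin assignment of rows to reducers is an assumption of this sketch; Hive does not promise any particular distribution for a plain sort by.

```python
# order by: one global sort. sort by: one independent sort per reducer.

def order_by(values):
    return sorted(values)


def sort_by(values, num_reducers):
    # Assumption: rows are dealt to reducers round-robin.
    reducers = [values[i::num_reducers] for i in range(num_reducers)]
    return [sorted(part) for part in reducers]


ids = [5, 2, 1, 6, 3, 4]          # hypothetical id column

global_order = order_by(ids)
local_order = sort_by(ids, 2)
print(global_order)
print(local_order)
```

Each inner list is sorted, but concatenating them does not give a sorted result, which is the overlap described above. With one reducer, sort_by(ids, 1) contains exactly the order_by(ids) sequence.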

6. distribute by and sort by

Suppose rows with the same value in the first column should be sorted by the second column. distribute by combined with sort by does this:

select col1, col2 from ss distribute by col1 sort by col1, col2;

distribute by guarantees that all rows with the same col1 value go to the same reducer; sorting by col1 and then col2 inside each reducer gives the required order.
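Here is a minimal Python sketch of the two steps. It assumes integer col1 values and uses col1 modulo the reducer count as a stand-in for Hive's hash partitioning; the function name and the rows are invented.

```python
# Model of: select col1, col2 from ss distribute by col1 sort by col1, col2

def distribute_and_sort(rows, num_reducers):
    reducers = [[] for _ in range(num_reducers)]
    for col1, col2 in rows:
        # distribute by col1: equal col1 values always reach the same reducer
        reducers[col1 % num_reducers].append((col1, col2))
    # sort by col1, col2: each reducer sorts only its own rows
    return [sorted(part) for part in reducers]


rows = [(1, "b"), (2, "z"), (1, "a"), (3, "k"), (2, "y")]
out = distribute_and_sort(rows, 2)
print(out)
```

Within every reducer the rows for a given col1 are contiguous and ordered by col2, even though there is no global order across reducers.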

7. cluster by

When the columns used by distribute by and sort by are exactly the same and the sort order is ascending, cluster by can replace the distribute by / sort by pair.

8. Bucketing and sampling

Hive can sample a table by bucket:

select * from test tablesample(bucket 3 out of 10 on id);

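In bucket x out of y on z, y is the number of buckets the rows are divided into, x is the number of the bucket to take, and z is the bucketing column: a row's bucket is the hash of its z value modulo y. If no particular column is wanted, rand() can be used in place of z to sample rows at random: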

select * from test tablesample (bucket 3 out of 10 on rand());
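The hash-modulo rule can be sketched as follows. Two assumptions are baked in and should be checked against your Hive version: the hash of an integer is taken to be the integer itself, and buckets are numbered from 1, so bucket x holds the rows whose remainder is x - 1. The function names are hypothetical.

```python
# Model of: select * from test tablesample(bucket x out of y on id)

def in_bucket(z_hash, x, y):
    # Clear the sign bit first so the remainder is never negative.
    return (z_hash & 0x7FFFFFFF) % y == x - 1


def tablesample(rows, x, y, key):
    return [row for row in rows if in_bucket(key(row), x, y)]


ids = list(range(30))                      # hypothetical id column: 0..29
sample = tablesample(ids, 3, 10, key=lambda row: row)
print(sample)
```

Under these assumptions roughly one row in ten is returned, and the same rows come back on every run, unlike sampling on rand().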

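Sampling is more efficient when the table is declared as a bucketed table at creation time. In older Hive releases, also set hive.enforce.bucketing to true beforehand so that inserts honour the bucket definition: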

create table buckettable(id int) clustered by (id) into 4 buckets;

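The table is divided into 4 buckets. Then run the insert statement:

insert overwrite table buckettable select * from source;

The data is split into 4 files under the table's path, one file per bucket.

9. union all

union all is used very often in Hive, mainly to merge the rows of several tables. It requires the columns selected from each table to match exactly in type.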

select r.id, r.price 
from (
select m.id, m.price from monday m 
union all 
select t.id, t.price from tuesday t) r;

Note: older Hive releases do not support union all at the top level of a query; it has to be wrapped in a subquery as shown.

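VII. Hive Functions

Hive ships with many built-in functions that are used constantly in queries; show functions lists them.
The built-in functions fall into three groups: standard functions, aggregate functions, and table-generating functions.

1. Standard functions

These take one or more columns of a single row as arguments and return a single value. Common examples are to_date(string timestamp) and sqrt(double a).

2. Aggregate functions

These take zero or more columns from zero or more rows and return a single value. They are often used together with a group by clause. Common examples are sum(col), avg(col), max(col), and std(col).

3. Table-generating functions

A table-generating function takes zero or more inputs and produces multiple columns or multiple rows of output. A typical example is explode(Array a):

select explode(array("a", "b", "c")) as s from test;

The array function first builds an array, which is passed to explode; explode then emits one row per array element. For each row of test the output is: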

a
b
c
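The row-multiplying behaviour of explode can be mimicked in one line of Python. The function name and the stand-in rows for test are hypothetical; the sketch only shows that the number of output rows is the number of input rows times the array length.

```python
# Model of: select explode(array("a", "b", "c")) as s from test

def select_explode(table_rows, array):
    # One output row per array element, repeated for every input row.
    return [element for _ in table_rows for element in array]


one_row_table = [("only row",)]            # stand-in for a one-row test table
two_row_table = [("row 1",), ("row 2",)]

single = select_explode(one_row_table, ["a", "b", "c"])
double = select_explode(two_row_table, ["a", "b", "c"])
print(single)
print(double)
```

So the three-line output shown above corresponds to a test table with exactly one row; with two rows the a, b, c sequence appears twice.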
