Hive: Advanced Queries

Please credit the source when reposting: https://blog.csdn.net/l1028386804/article/details/80559811

I. Query operations

group by, order by, join, distribute by, sort by, cluster by, union all
order by: global sort
sort by: sorts the rows inside each reducer

II. Underlying implementation

Hive compiles these queries into MapReduce jobs.

III. Simple aggregations

1. count

count(*), count(1), count(col)

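2. sum

sum(value convertible to a number) returns bigint for integer input
sum(col) + 1 // the original note reports a type error here (bigint plus int); Hive normally widens int to bigint implicitly, so check this on your version
sum(col) + cast(1 as bigint) // works, both operands are bigint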

3. avg

avg(value convertible to a number) returns double

4. distinct: counting different values

count(distinct col)
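
The aggregates above can be combined in one query. The following is a minimal sketch, assuming a hypothetical table orders(user_id int, amount int); neither the table nor its columns appear in the original.

```sql
-- Hypothetical table: orders(user_id int, amount int)
select
  count(*)                as all_rows,         -- every row, nulls included
  count(amount)           as non_null_amounts, -- only rows where amount is not null
  count(distinct user_id) as users,            -- number of different user_id values
  sum(amount)             as total,            -- bigint for integer input
  avg(amount)             as mean              -- double
from orders;
```

Note that count(col) skips rows where col is null, while count(*) and count(1) count every row.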

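III. order by

Sorts the result by one or more columns.
For example:

select col1, other...
from table
where condition
order by col1, col2 [asc | desc]

Notes:

order by can take several columns; the default direction is ascending, and strings compare in lexicographic order
order by is a global sort
order by needs a reduce phase and always runs with a single reducer, whatever reducer count is configured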
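
Because everything funnels through one reducer, a global sort is usually paired with a limit. A small sketch, again assuming a hypothetical table orders(user_id int, amount int):

```sql
-- Hypothetical table: orders(user_id int, amount int)
select user_id, amount
from orders
where amount > 0
order by amount desc, user_id asc  -- global order, produced by a single reducer
limit 100;                         -- keeps the final result small
```

When hive.mapred.mode is set to strict, Hive rejects an order by that has no limit clause.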

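IV. group by

Groups rows by the values of the given columns; rows with equal values land in the same group.
For example:

select col1 [, col2], count(1), sel_expr (an aggregate)
from table
where condition
group by col1 [, col2]
[having condition]

Notes:

Every non-aggregated column in the select list must appear in group by
Apart from those plain columns, the select list may only contain aggregates
group by also accepts expressions, for example a substr of col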

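Characteristics:

It runs a reduce phase and is bounded by the number of reducers, which is set through mapred.reduce.tasks
The number of output files equals the number of reducers, and each file's size depends on how much data its reducer processed

set mapred.reduce.tasks = 5;

Problems:

Heavy network load
Data skew, which the hive.groupby.skewindata parameter is meant to mitigate

set hive.groupby.skewindata = true;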
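
Putting the settings and the clause together, a grouped query might look like the sketch below. The table orders(user_id int, amount int) and the threshold of 10 are assumptions made for illustration.

```sql
set mapred.reduce.tasks = 5;         -- five reducers, so five output files
set hive.groupby.skewindata = true;  -- ask Hive to plan around skewed keys

-- Hypothetical table: orders(user_id int, amount int)
select user_id,
       count(1)    as order_cnt,
       sum(amount) as total
from orders
where amount > 0                     -- row filter, applied before grouping
group by user_id
having count(1) > 10;                -- group filter, applied after aggregation
```

The where clause trims rows before they are shuffled to the reducers, which also helps with the network load mentioned above.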

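V. Join

Table joins

Two tables m and n are joined on the on condition; one row of m and one row of n combine into one new row

join (inner equi-join): a value reaches the final result only if it exists in both m and n
left outer join: every row of the left table is output whether or not it has a match in the right table; rows of the right table are output only when they match a left row
right outer join: the mirror image of left outer join
left semi join: similar to exists
mapjoin: performs the join on the map side, in memory, with no reduce phase; it is an optimization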

For example:

select m.col as col, m.col2 as col2, n.col3 as col3
from
(select col, col2
from test
where ... -- this filter runs on the map side
) m -- left table
[left outer | right outer | left semi] join
n -- right table
on m.col = n.col
where condition -- this filter runs on the reduce side

set hive.optimize.skewjoin = true;
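
To make the template concrete, here is a sketch with two hypothetical tables, users(id int, name string) and orders(user_id int, amount int). The first query uses a left outer join to find users without orders; the second uses a left semi join to find users with at least one order.

```sql
-- Hypothetical tables: users(id int, name string), orders(user_id int, amount int)

-- Users that have no matching order
select u.id, u.name
from (
  select id, name
  from users
  where id > 0             -- filters the left table before the join
) u
left outer join orders o
  on u.id = o.user_id
where o.user_id is null;   -- applied after the join

-- Users that have at least one order (behaves like exists)
select u.id, u.name
from users u
left semi join orders o
  on u.id = o.user_id;     -- only columns of u may appear in the select list
```

With left semi join the right table can be referenced only in the on condition, which is why the second query selects nothing from o.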

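VI. mapjoin

mapjoin (map-side join)

The small table is loaded into memory on the map side; the large table is then read and joined against the in-memory copy
The small table reaches the mappers through the distributed cache

Advantages:

Uses no reduce resources of the cluster (reducers are comparatively scarce)
Skips the reduce phase, so the job finishes sooner
Lowers network load

Disadvantages:

Uses memory on every compute node, since each one loads its own copy, so the in-memory table must not be large
Produces more small files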
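
There are two ways to enable it. The first is to set the following parameter so that Hive chooses between a common join and a map join automatically, based on the SQL:

set hive.auto.convert.join = true;
hive.mapjoin.smalltable.filesize defaults to 25 MB

The second is to request it explicitly with a hint:

select /*+ mapjoin(n) */ m.col, m.col2, n.col3 from m
join n
on m.col = n.col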

When to use mapjoin:

One of the joined tables is very small
Non-equi joins
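
One way to check whether Hive actually chose a map join is to inspect the query plan. The sketch below assumes two hypothetical tables, a large orders(user_id int, region_id int, amount int) and a small dim_region(region_id int, region string).

```sql
set hive.auto.convert.join = true;
set hive.mapjoin.smalltable.filesize = 25000000;  -- bytes; tables below this size count as small

-- Hypothetical tables: orders (large), dim_region (small)
explain
select o.user_id, o.amount, d.region
from orders o
join dim_region d
  on o.region_id = d.region_id;
```

If the conversion applied, the plan should contain a map join operator instead of a reduce-side join; the exact wording of the plan output varies between Hive versions.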

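VII. Hive bucketing

Bucketing

Within each table or partition, Hive can further organize the data into buckets, so a bucket is a finer-grained division of the data
Hive buckets on a specific column
Hive hashes the column value and takes the remainder after dividing by the number of buckets to decide which bucket a row goes into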
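
The hash-then-modulo rule can be approximated with Hive's built-in hash and pmod functions. This is a sketch only: the table users(id int, name string) is hypothetical, and the expression mirrors the rule described above and is not guaranteed to match Hive's internal bucket numbering in every version.

```sql
-- Hypothetical table: users(id int, name string)
-- Shows which of 4 buckets each row would fall into under hash-mod-N
select id,
       pmod(hash(id), 4) as bucket_no  -- non-negative remainder, 0 to 3
from users;
```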

Benefits

More efficient query processing
More efficient sampling

Using buckets

create table bucketed_user
(
id int,
name string
)
clustered by (id) sorted by (name) into 4 buckets
row format delimited
fields terminated by '\t'
stored as textfile;

set hive.enforce.bucketing=true;

select * from bucketed_user tablesample(bucket 1 out of 2 on id);
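
The statements above define and sample the table but do not show how it gets filled. A minimal sketch, assuming a hypothetical unbucketed source table users(id int, name string):

```sql
set hive.enforce.bucketing = true;  -- make the insert honor the 4-bucket layout

-- Hypothetical source table: users(id int, name string)
insert overwrite table bucketed_user
select id, name
from users;

-- With 4 buckets, "1 out of 4" reads only the first bucket,
-- while "1 out of 2" reads the 1st and 3rd buckets (half the data)
select * from bucketed_user tablesample(bucket 1 out of 4 on id);
```

Loading through insert ... select matters here: copying files directly into the table directory would not distribute rows into buckets by hash.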

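Bucket join
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

Two tables that are bucketed on the same columns, including the join column, can be joined efficiently with a map-side join. If both tables share a column and both are bucketed on it, only the buckets holding the same column values need to be joined with each other, which greatly reduces the amount of data each join has to touch.
For a map-side join, both tables are bucketed in the same way. The mapper processing a bucket of the left table knows that the matching rows of the right table sit in the corresponding bucket, so it only has to fetch that bucket (a small fraction of the data stored in the right table) to perform the join. This optimization does not require both tables to have the same number of buckets; bucket counts that are multiples of one another also work.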
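
As an illustration of the multiple-of rule, the sketch below joins bucketed_user (4 buckets on id) with a hypothetical second table bucketed_order, which is bucketed on the join column into 8 buckets. The second table and its columns are assumptions, not part of the original.

```sql
-- Hypothetical table, bucketed on the join column with a multiple of 4 buckets
create table bucketed_order
(
  user_id int,
  amount int
)
clustered by (user_id) into 8 buckets;

set hive.optimize.bucketmapjoin = true;

-- Each mapper over a bucket of bucketed_order loads only the matching
-- bucket of bucketed_user instead of the whole table
select /*+ mapjoin(u) */ u.id, u.name, o.amount
from bucketed_order o
join bucketed_user u
  on o.user_id = u.id;
```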

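VIII. distribute by and sort by

distribute by: spreading data

distribute by col
Sends rows to different reducers according to col; rows with the same col value go to the same reducer

sort by: sorting

sort by col
Sorts rows by col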

select col1, col2 from m
distribute by col1
sort by col1 asc, col2 desc;

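Used together, they guarantee that the output of each reducer is sorted

distribute by compared with group by

Both partition the data by key
Both use a reduce phase
The only difference is that distribute by merely spreads the data, while group by gathers rows with the same key and must be followed by an aggregation

order by compared with sort by

order by: a global sort
sort by: only guarantees that the output of each reducer is sorted; with a single reducer it behaves like order by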
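
The difference shows up as soon as more than one reducer is involved. A sketch, assuming a hypothetical table orders(user_id int, amount int):

```sql
set mapred.reduce.tasks = 3;

-- Hypothetical table: orders(user_id int, amount int)

-- Three reducers, each emitting its own rows in descending amount;
-- the three outputs concatenated are not globally sorted
select user_id, amount
from orders
sort by amount desc;

-- One reducer regardless of the setting above; the result is totally ordered
select user_id, amount
from orders
order by amount desc;
```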

When to use distribute by and sort by

Map output files of uneven size
Reduce output files of uneven size
Too many small files
Oversized files
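
A common way to address uneven or tiny output files is to rewrite the data while controlling how rows are spread across reducers. This is a sketch of one such approach, not something the original prescribes; the tables orders and orders_compact (same schema) and the reducer count of 10 are assumptions.

```sql
set mapred.reduce.tasks = 10;  -- target roughly 10 output files

-- Hypothetical tables: orders and orders_compact(user_id int, amount int)
insert overwrite table orders_compact
select user_id, amount
from orders
distribute by cast(rand() * 10 as int);  -- spread rows evenly over the 10 reducers
```

Distributing by a random bucket number evens out file sizes; distributing by a real column instead keeps related rows in the same file at the risk of skew.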

IX. cluster by

Gathers rows with the same value together and sorts them
Effect:
cluster by col is equivalent to distribute by col sort by col
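
The two forms can be written side by side. A sketch, assuming a hypothetical table orders(user_id int, amount int):

```sql
-- Hypothetical table: orders(user_id int, amount int)

-- Short form
select user_id, amount
from orders
cluster by user_id;

-- Equivalent long form
select user_id, amount
from orders
distribute by user_id
sort by user_id;
```

cluster by always sorts in ascending order; when a descending sort or a different sort column is needed, the long form has to be used.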

X. union all

Merges the rows of several tables into one result. Hive versions before 1.2.0 do not support plain union, only union all
Example:

select col
from(
select a as col from t1
union all
select b as col from t2
)tmp

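Requirements:

Column names must match
Column types must match
Column counts must match
The sub-queries must not carry aliases
To query the merged result, the union as a whole must have an alias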
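
The last requirement is what makes it possible to aggregate over the merged rows. A sketch reusing the t1 and t2 tables from the sample above; the count column is added for illustration.

```sql
-- Both branches expose the same column name (col), type and column count
select tmp.col, count(1) as cnt
from (
  select a as col from t1
  union all
  select b as col from t2
) tmp              -- alias on the union as a whole, required by the outer query
group by tmp.col;
```

Since union all keeps duplicates, a value that appears in both t1 and t2 is counted once per occurrence.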