Hive高级查询(group by、 order by、 join等)

时间 2015-07-19 10:40:34 IT社区推荐资讯相似文章 (1)

原文 http://itindex.net/detail/53925-hive-group-order

主题 Hive

查询操作

group by、 order by、 join 、 distribute by、
sort by、 clusrer by、 union all

底层的实现 mapreduce

常见的聚合操作

count计数

count(*)    所有值不全为NULL时，加1操作
count(1)    不管有没有值，只要有这条记录，值就加1
count(col)  col列里面的值为null，值不会加1，这个列里面的值不为NULL，才加1

sum求和

sum(可转成数字的值）返回bigint

avg求平均值

avg（可转成数字的值）返回double

distinct不同值个数

count(distinct col)

order by

按照某些字段排序

样例

select col1,other...
from table
where conditio
order by col1,col2 [asc|desc]

注意

order by后面可以有多列进行排序，默认按字典排序

order by为全局排序

order by需要reduce操作，且 只有一个reduce ，与配置无关。 数据量很大时，慎用。

执行流程从表中读取数据，执行where条件，以col1，col2列的值做成组合key，其他列值作为value，然后在把数据传到同一个reduce中，根据需要的排序方式进行。

group by

按照某些字段的值进行分组，有相同值放到一起。

样例

select col1 [,col2] ,count(1),sel_expr（聚合操作）from table
where condition         -->Map端执行
group by col1 [,col2]   -->Reduce端执行
[having]                -->Reduce端执行

注意

select后面非聚合列，必须出现在group by中

select后面除了普通列就是一些聚合操作

group by后面也可以跟表达式，比如substr(col)

特性

使用了reduce操作， 受限于reduce数量 ，设置reduce参数 mapred.reduce.tasks

输出文件个数与reduce数相同，文件大小与reduce处理的数据量有关。

问题

网络负载过重

数据倾斜，优化参数 hive.groupby.skewindata 为true，会启动一个优化程序，避免数据倾斜。

执行流程从表中读取数据，执行where条件，以col1列分组，把col列的内容作为key，其他列值作为value，上传到reduce，在reduce端执行聚合操作和having过滤。

eg:

set mapred.reduce.tasks=5;
select * from TabOrder order by ch asc,num desc;

set mapred.reduce.tasks=3;
select ch ,count(1) as num from TabOrder group by ch;

set hive.groupby.skewindata = true;
select ch ,count(1) as num from TabOrder group by ch having count(1)>2;

select col from tablename group by col; <==> select distinct col from tablename;

Join表连接

两个表m,n之间按照on条件连接，m中的一条记录和n中的一条记录组成一条新记录。

join等值连接（内连接），只有某个值在m和n中同时存在时。

left outer join 左外连接，左边表中的值无论是否在b中存在时，都输出；右边表中的值，只有在左边表中存在时才输出。

right outer join 和 left outer join 相反。

left semi join 类似 exists 。即查找a表中的数据， 是否在b表中存在，找出存在的数据。

mapjoin：在map端完成join操作，不需要用reduce，基于内存做join，属于优化操作。

select m.col as col1, m.col2 as col2, n.col3 as col3 from
(select col1,col2 from，test where ...   (map端执行)
)m  （左表）
[left outer |right outer | left semi] join
n   （右表）
on m.col=n.col
where condition     (reduced端执行)

set hive.optimize.skewjoin=true;

读取数据执行where条件，按col列分组，把col列的内容作为key，其他列作为value，传到reduce，在reduce端执行连接操作和where过滤。

eg:

create table m(
ch string,
num string
)
row format delimited 
fields terminated by '\t' 
lines terminated by '\n' 
stored as textfile;
load data local inpath '/liguodong/hivedata/m' into table m;
create table n(
ch string,
num string
)
row format delimited 
fields terminated by '\t' 
lines terminated by '\n' 
stored as textfile;
load data local inpath '/liguodong/hivedata/n' into table n;
select * from m;
select * from n;

内连接
select s.ch,s.num,t.num from
(select ch,num from m)s
join
(select ch,num from n)t
on s.ch=t.ch;

左外连接
select s.ch,s.num,t.num from
(select ch,num from m)s
left outer  join
(select ch,num from n)t
on s.ch=t.ch;

右外连接
select s.ch,s.num,t.num from
(select ch,num from m)s
right outer  join
(select ch,num from n)t
on s.ch=t.ch;

数据输出对比

select s.ch,s.num from
(select ch,num from m)s
left semi join
(select ch,num from n)t
on s.ch=t.ch;
运行结果：
A       1
C       5
C       3

MapJoin

mapjoin(map side join)在map端把小表加载到内存中，然后读取大表，和内存中的小表完成连接操作。其中使用了分布式缓存技术。

优点

不消耗集群的reduce资源（reduce相对紧缺）。

减少了reduce操作，加快程序执行。

降低网络负载。

缺点

占用部分内存，所以加载到内存中的表不能过大，因为每个计算节点都会加载一次。

生成较多的小文件。

执行流程从大表读取数据，执行where条件。把小表加载到内存中，每读取大表中的一条数据，都要和内存中的小表数据进行比较。

第一种方式，自动方式

配置以下参数

hive**自动**根据sql，选择使用common join或者map join

set hive.auto.convert.join=true;
hive.mapjoin.smalltable.filesize默认值是25mb

第二种方式，手动指定

select /*+mapjoin(n)*/ m.col, m.col2, n.col3 from m
join n on m.col=n.col;

注意： /*+mapjoin(n)*/ 不能省略，只需替换表名n值即可。

简单总结一下， map join的使用场景 ：

1、关联操作中有一张表非常小

2、不等值的链接操作

select c.city,p.province 
from
(select province,city from city)c
join
(select province from province)p
on c.province=p.province;

mapjoin手动方式
select /*+mapjoin(p)*/ c.city,p.province 
from
(select province,city from city)c
join
(select province from province)p
on c.province=p.province;

比较二则的比较时间。

Hive分桶JOIN

对于每一个表（table）或者分区，Hive可以进一步组织成桶，也就是说桶是更为细粒度的数据范围划分。

Hive是针对某一列进行分桶。

Hive采用对列值哈希，然后除以桶的个数求余的方式决定该条记录存放在哪个桶当中。

好处

获得更高的查询处理效率。

使取样（sampling）更高效。

create table bucketed_user
(
id  int,
name  string
)
clustered by (id) sorted by (name) into 4 buckets
row format delimited fields terminated by '\t' 
stored as textfile;

set hive.enforce.bucketing=true;

分桶的使用

select * from bucketed_user tablesample(bucket 1 out of 2 on id)

bucket join

set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

连接两个在（包含连接列）相同列上划分了桶的表，可以使用Map端连接（Map side join）高效的实现。比如JOIN操作。

对于JOIN操作两个表有一个相同的列，如果对这两个表都进行了捅操作。那么将保存相同列值的桶进行JOIN操作就可以，可以大大减少JOIN的数据量。

对于map端连接的情况，两个表以相同方式划分桶。处理左边表内某个桶的mapper知道右边表内相匹配的行在对应的桶内。因此，mapper只需要获取那个桶（这只是右边表内存储数据的·小部分）即可进行连接。

这一优化方法并不一定要求两个表必须桶的个数相同，两个表的桶个数是倍数关系也可以。

distribute by、sort by

distribute 分散数据distribute by col – 按照col列把数据分散到不同的reduce。

Sort排序sort by col – 按照col列把数据排序

select col1,col2 from M
distribute by col1
sort by col1 asc,col2 desc

两者结合出现，确保每个reduce的输出都是有序的。

distribute by与group by对比

都是按key值划分数据

都使用reduce操作

**唯一不同的是**distribute by只是单纯的分散数据，而group by把相同key的数据聚集到一起，后续必须是聚合操作。

order by与sort by 对比

order by是全局排序

sort by只是确保每个reduce上面输出的数据有序。如果只有一个reduce时，和order by作用一样。

执行流程

从表中读取数据，执行where条件。

设置reduce数为3，以distribute by列的值作为key,其他列值作为value，然后把数据根据key值传到不同的reduce，然后按sort by字段进行排序。

应用场景

map输出的文件大小不均

reduce输出文件大小不均

小文件过多

文件超大

把一个大文件放到一些小文件中
set mapred.reduce.tasks=5;-->下面的city将会输出到五个文件中

insert overwrite table city
selsct time,country,province,city from info
distribute by province;


把一些小文件放到一个大文件中
set mapred.reduce.tasks=1;-->下面的province将会输出到一个大文件中
insert overwrite table province partition(dt='20150719')
selsct time,country,province from city 
distribute by country;

注：province是一个分区表。

cluster by

把有相同值的数据聚集到一起，并排序。

效果等价于distribute by col sort by col

cluster by col <==> distribute by col sort by col

union all

多个表的数据合并成一个表，hive不支持union

select col from(
select a as col from t1
union all
select b as col from t2
)tmp

执行流程从表中读取数据，执行where条件。合并到同一个表中。

union all必须满足如下要求

字段名字一样

字段类型一样

字段个数一样

子表不能有别名

如果需要从合并之后的表中查询数据，那么合并的表必须要有别名

select * from (
select * from m
union all
select * from n
)temp;

如果两张表的字段名不一样，要将一个表修改别名同另一个表的字段名一样。
select * from (
select col1,col2 from m
union all
select col1,col3 as col2 from n
)temp;

作者：scgaliguodong123_ 发表于2015/7/19 22:43:08 原文链接