The GROUP BY Statement and Advanced GROUP BY Features

Syntax:

groupByClause: GROUP BY groupByExpression (, groupByExpression)*

groupByExpression: expression

groupByQuery: SELECT expression (, expression)* FROM src groupByClause?

Advanced usage:

Multi-GROUP BY inserts

Map-side aggregation for GROUP BY

GROUPING SETS

CUBE

ROLLUP

GROUPING__ID

Grouping function

GROUP BY syntax

groupByClause: GROUP BY groupByExpression (, groupByExpression)*

groupByExpression: expression

groupByQuery: SELECT expression (, expression)* FROM src groupByClause?

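By default, a GROUP BY clause refers to columns by name. Since Hive 0.11 you can also refer to them by position: for 0.11.0 through 2.1.x, set hive.groupby.orderby.position.alias = true (the default is false); from 2.2.0 on, set hive.groupby.position.alias = true (the default is false).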

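Simple examples

`Counting rows:`

SELECT COUNT(*) FROM table2;
You can also use COUNT(1) in place of COUNT(*).
To count the distinct users in each group after grouping, for example per gender: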

INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

You can do this (several aggregations at once, with every DISTINCT on the same column):

INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

But you cannot do this (two DISTINCT aggregations on different columns):

INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
FROM pv_users
GROUP BY pv_users.gender;

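`The SELECT statement and the GROUP BY clause`

When you use GROUP BY, the SELECT list may contain only the grouped columns; any other column has to go through an aggregate function (UDAF). (UDFs will be covered separately later.)
Take the following example: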

CREATE TABLE t1(a INTEGER, b INTEGER);

SELECT
a,
sum(b)
FROM
t1
GROUP BY
a;

The following statement does not work:

SELECT
a,
b
FROM
t1
GROUP BY
a;

This is because the SELECT list references b, which is not a GROUP BY column. Suppose the table looks like this:

a b

100 1
100 2
100 3

After grouping by a, which value should b take? Hive could return the lowest or the highest value, but it does not do so. To use such a column, wrap it in a UDAF.
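The ambiguity can be made concrete with a small Python sketch (an analogy, not Hive code; the rows mirror the example table). Grouping by a leaves three candidate values for b, while an aggregate such as sum reduces them to one.

```python
from collections import defaultdict

# Rows of the example table t1 as (a, b) pairs.
rows = [(100, 1), (100, 2), (100, 3)]

# Group by column a and collect every b value that lands in each group.
groups = defaultdict(list)
for a, b in rows:
    groups[a].append(b)

# One group, three candidate values for b: no single value can be selected.
candidates = dict(groups)

# An aggregate reduces each group to exactly one value, like SELECT a, sum(b).
summed = {a: sum(bs) for a, bs in groups.items()}

print(candidates)
print(summed)
```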

Advanced features

Multi-GROUP BY inserts

Example:

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
SELECT pv_users.age, count(DISTINCT pv_users.userid)
GROUP BY pv_users.age;
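
The point of the multi-insert form is that pv_users is read once and feeds both aggregations. A rough Python analogy, where the rows are made up for illustration and stand in for pv_users:

```python
from collections import defaultdict

# Hypothetical pv_users rows: (userid, gender, age).
pv_users = [
    (1, "F", 20),
    (2, "M", 20),
    (1, "F", 20),  # a repeated page view by user 1
    (3, "F", 31),
]

users_by_gender = defaultdict(set)
users_by_age = defaultdict(set)

# A single pass over the source updates both aggregations.
for userid, gender, age in pv_users:
    users_by_gender[gender].add(userid)
    users_by_age[age].add(userid)

# count(DISTINCT userid) per group is the size of each set.
gender_sum = {g: len(u) for g, u in users_by_gender.items()}
age_sum = {a: len(u) for a, u in users_by_age.items()}
print(gender_sum, age_sum)
```
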
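Map-side aggregation for GROUP BY

Set hive.map.aggr=true to enable it (the default was false in the earliest releases and has been true since Hive 0.3). This improves execution efficiency at the cost of memory.

set hive.map.aggr=true;
SELECT COUNT(*) FROM table2;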
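
The idea behind map-side aggregation can be sketched in Python: each mapper pre-aggregates its own split in an in-memory hash table and sends only partial counts to the reducer. The splits and function names below are hypothetical and only illustrate the trade-off between memory and shuffled data.

```python
from collections import Counter

def map_side_aggregate(split):
    """Pre-aggregate one input split in memory; returns partial counts per key."""
    return Counter(split)

def reduce_merge(partials):
    """Merge the partial counts from all mappers into the final counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Two hypothetical input splits holding the grouping key of each row.
splits = [["F", "M", "F", "F"], ["M", "F"]]

partials = [map_side_aggregate(s) for s in splits]
final = reduce_merge(partials)

# Records sent to the reducer: one per (mapper, key) instead of one per row.
shuffled_with_aggr = sum(len(p) for p in partials)
shuffled_without_aggr = sum(len(s) for s in splits)
print(dict(final), shuffled_with_aggr, shuffled_without_aggr)
```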

Grouping Sets, Cubes, Rollups, and the GROUPING__ID Function
This part covers the enhanced aggregation features of the GROUP BY clause.

GROUPING SETS

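GROUPING SETS is a sub-clause of GROUP BY that lets you specify several groupings after a single GROUP BY. It can be understood as several GROUP BY queries whose results are combined with UNION ALL. The examples below illustrate this.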

Take acorn_3g.test_xinyan_reg as an example:

hive -e "use acorn_3g;desc test_xinyan_reg;"
user_id bigint None
device_id int None device type (phone, tablet)
os_id int None operating system type
app_id int None mobile app_id
client_version string None client version
from_id int None fourth-level channel
A few demos to help illustrate:

### `GROUPING SETS statement and its equivalent Hive statement`

GROUPING SETS statement:
select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id))
Equivalent Hive statement:
SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id

GROUPING SETS statement:
select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id),(os_id,app_id))
Equivalent Hive statement:
SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id
UNION ALL
SELECT null,os_id,app_id,count(user_id) FROM test_xinyan_reg group by os_id,app_id

GROUPING SETS statement:
select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id),(device_id))
Equivalent Hive statement:
SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id
UNION ALL
SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id

GROUPING SETS statement:
select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id),(os_id),(device_id,os_id),())
Equivalent Hive statement:
SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id
UNION ALL
SELECT null,os_id,null,count(user_id) FROM test_xinyan_reg group by os_id
UNION ALL
SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id
UNION ALL
SELECT null,null,null,count(user_id) FROM test_xinyan_reg
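
The "one GROUP BY per grouping set, then UNION ALL" reading can be mimicked in a few lines of Python. This is only a sketch of the semantics: the rows are hypothetical, user_id is assumed to be non-NULL (so count(user_id) is a plain row count), and the data is assumed to contain no NULLs of its own.

```python
from collections import Counter

COLUMNS = ("device_id", "os_id", "app_id")

def grouping_sets_count(rows, grouping_sets):
    """Count rows per group for each grouping set, NULL-padding unused columns."""
    result = Counter()
    for gset in grouping_sets:
        for row in rows:
            # Keep the columns of this grouping set, replace the others by None.
            key = tuple(row[c] if c in gset else None for c in COLUMNS)
            result[key] += 1
    return result

# Hypothetical rows of test_xinyan_reg (only the three grouped columns).
rows = [
    {"device_id": 1, "os_id": 10, "app_id": 100},
    {"device_id": 1, "os_id": 11, "app_id": 100},
    {"device_id": 2, "os_id": 10, "app_id": 101},
]

# Same shape as: grouping sets((device_id,os_id),(device_id))
out = grouping_sets_count(rows, [("device_id", "os_id"), ("device_id",)])
for key, n in sorted(out.items(), key=str):
    print(key, n)
```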

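CUBE

CUBE, nicknamed the data cube, lets Hive aggregate over any combination of the listed dimensions. cube(a,b,c) first groups by (a,b,c), then by (a,b), (a,c), (a), (b,c), (b), (c), and finally over the whole table; in other words, it aggregates over every combination of the selected columns.

select device_id,os_id,app_id,client_version,from_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id,client_version,from_id with cube;

Written by hand, the equivalent HQL looks like this (generated by a small program; typing it out would be exhausting):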

SELECT device_id,null,null,null,null ,count(user_id) FROM test_xinyan_reg group by device_id
UNION ALL
SELECT null,os_id,null,null,null ,count(user_id) FROM test_xinyan_reg group by os_id
UNION ALL
SELECT device_id,os_id,null,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id
UNION ALL
SELECT null,null,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by app_id
UNION ALL
SELECT device_id,null,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,app_id
UNION ALL
SELECT null,os_id,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by os_id,app_id
UNION ALL
SELECT device_id,os_id,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id
UNION ALL
SELECT null,null,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by client_version
UNION ALL
SELECT device_id,null,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,client_version
UNION ALL
SELECT null,os_id,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by os_id,client_version
UNION ALL
SELECT device_id,os_id,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,client_version
UNION ALL
SELECT null,null,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by app_id,client_version
UNION ALL
SELECT device_id,null,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,client_version
UNION ALL
SELECT null,os_id,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,client_version
UNION ALL
SELECT device_id,os_id,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,client_version
UNION ALL
SELECT null,null,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by from_id
UNION ALL
SELECT device_id,null,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,from_id
UNION ALL
SELECT null,os_id,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,from_id
UNION ALL
SELECT device_id,os_id,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,from_id
UNION ALL
SELECT null,null,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by app_id,from_id
UNION ALL
SELECT device_id,null,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,from_id
UNION ALL
SELECT null,os_id,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,from_id
UNION ALL
SELECT device_id,os_id,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,from_id
UNION ALL
SELECT null,null,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by client_version,from_id
UNION ALL
SELECT device_id,null,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,client_version,from_id
UNION ALL
SELECT null,os_id,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,client_version,from_id
UNION ALL
SELECT device_id,os_id,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,client_version,from_id
UNION ALL
SELECT null,null,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by app_id,client_version,from_id
UNION ALL
SELECT device_id,null,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,client_version,from_id
UNION ALL
SELECT null,os_id,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,client_version,from_id
UNION ALL
SELECT device_id,os_id,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,client_version,from_id
UNION ALL

SELECT null,null,null,null,null ,count(user_id) FROM test_xinyan_reg
Painful to look at, isn't it? That is the power of CUBE. (On older Hive versions the UNION ALL approach is the workaround, for lack of anything better.)
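
A generator for such a statement is short. The sketch below is one possible way to write it, not the author's original program: it enumerates every subset of the columns with itertools.combinations, so the order of the SELECTs differs from the listing above, but the set of 32 queries is the same.

```python
from itertools import combinations

def cube_union_all(table, columns, aggregate):
    """Build the UNION ALL query equivalent to GROUP BY ... WITH CUBE."""
    selects = []
    # Every subset of the columns, from the empty set up to all of them.
    for size in range(len(columns) + 1):
        for subset in combinations(columns, size):
            select_list = [c if c in subset else "null" for c in columns]
            sql = f"SELECT {','.join(select_list)},{aggregate} FROM {table}"
            if subset:
                sql += f" group by {','.join(subset)}"
            selects.append(sql)
    return "\nUNION ALL\n".join(selects)

columns = ["device_id", "os_id", "app_id", "client_version", "from_id"]
query = cube_union_all("test_xinyan_reg", columns, "count(user_id)")
print(query)
```

Five columns give 2^5 = 32 grouping sets, which is why the hand-written version is so long.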

ROLLUP

ROLLUP aggregates at successively coarser levels by dropping grouping columns from right to left, producing the subtotals of a hierarchy.

select device_id,os_id,app_id,client_version,from_id,count(user_id)
from test_xinyan_reg
group by device_id,os_id,app_id,client_version,from_id with rollup;

This is equivalent to the following SQL:

select device_id,os_id,app_id,client_version,from_id,count(user_id)
from test_xinyan_reg
group by device_id,os_id,app_id,client_version,from_id
grouping sets ((device_id,os_id,app_id,client_version,from_id),(device_id,os_id,app_id,client_version),(device_id,os_id,app_id),(device_id,os_id),(device_id),());
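
The grouping sets of a ROLLUP are simply the prefixes of the column list, longest first. A minimal Python sketch (the function name is made up for illustration) that rebuilds the clause above:

```python
def rollup_grouping_sets(columns):
    """Return the grouping sets of a ROLLUP: every prefix, longest first."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]

columns = ["device_id", "os_id", "app_id", "client_version", "from_id"]
sets = rollup_grouping_sets(columns)

# Render the prefixes as a GROUPING SETS clause.
clause = "grouping sets (" + ",".join("(" + ",".join(s) + ")" for s in sets) + ")"
print(clause)
```

For n columns ROLLUP yields n + 1 grouping sets, against 2^n for CUBE.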

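GROUPING__ID

When a column is not part of a particular grouping, its value is shown as NULL, which can collide with NULLs that are really in the data. We need a way to tell "not grouped" apart from "genuinely NULL". (Write a small permutation-and-combination routine and it becomes clear at once: the grouping ID is just the binary flags of the grouped columns put together.)

grouping_id computes the grouping level. It can only be used together with a GROUP BY clause. In databases that expose it as a function, its arguments must match the GROUP BY columns: with GROUP BY A, B you must write grouping_id(A, B). In Hive it is the argument-less GROUPING__ID column used in the query further down. Its relation to grouping() is this: grouping_id(A, B) is equivalent to grouping(A) + grouping(B), where + is not arithmetic addition but concatenation of the binary digits. For example, if grouping(A) = 1 and grouping(B) = 1, then grouping_id(A, B) = 11 in binary, which is 3 in decimal. Running the SQL below on the original table produced too many rows for the effect to be visible, so I changed the table data; comparing the two result sets makes the effect obvious.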
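
The "concatenate, do not add" rule is easy to get wrong, so here is a tiny Python sketch of it. The helper name is hypothetical; it only shows the bit arithmetic, with the first flag as the most significant bit.

```python
def combine_grouping_bits(*bits):
    """Concatenate 0/1 grouping flags into one integer, first flag most significant."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value

print(combine_grouping_bits(1, 1))  # flags 1 and 1 -> binary 11
print(combine_grouping_bits(0, 1))  # flags 0 and 1 -> binary 01
print(combine_grouping_bits(1, 0))  # flags 1 and 0 -> binary 10
```

Note that the argument order matters: flags (0, 1) and (1, 0) give different results, whereas plain addition would give 1 in both cases.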

Here is an example taken straight from the official documentation.

Column1 (key) Column2 (value)
1 NULL
1 1
2 2
3 3
3 NULL
4 5

HQL query:

>  SELECT key, value, GROUPING__ID, count(*) from T1 GROUP BY key, value WITH ROLLUP

The result is as follows (GROUPING__ID is shown in decimal and in binary):
key   value  GROUPING__ID  binary  count(*)
NULL  NULL   0             00      6
1     NULL   1             10      2
1     NULL   3             11      1
1     1      3             11      1
2     NULL   1             10      1
2     2      3             11      1
3     NULL   1             10      2
3     NULL   3             11      1
3     3      3             11      1
4     NULL   1             10      1
4     5      3             11      1

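Convert GROUPING__ID to binary and read it against the columns. In the table above a bit of 1 means the column took part in the grouping, so a NULL in a column whose bit is 1 is a NULL from the data itself, while a NULL in a column whose bit is 0 was produced by the rollup. Hive 2.3.0 and later invert this numbering (1 marks a column that was aggregated away), which is why the grouping function example later in this article shows different GROUPING__ID values. (The DataFilterNull.py class can scan the result and filter out rows whose columns are NULL or "".)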
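
To see where each number comes from, the rollup over T1 can be replayed in Python. This is a sketch of the semantics only, using the bit convention of the result table above (a bit is 1 when the column takes part in the grouping, with key as the low bit); it is not how Hive computes the result internally.

```python
from collections import Counter

# The sample table T1 as (key, value) rows; None stands for NULL.
T1 = [(1, None), (1, 1), (2, 2), (3, 3), (3, None), (4, 5)]

def rollup_with_grouping_id(rows):
    """Count rows per ROLLUP group, tagging each group with its grouping ID."""
    counts = Counter()
    # ROLLUP over (key, value) groups by (key, value), then (key), then ().
    for key_in, value_in in ((True, True), (True, False), (False, False)):
        # key is the low bit, value the next bit; 1 means "part of the grouping".
        grouping_id = (1 if key_in else 0) | ((1 if value_in else 0) << 1)
        for key, value in rows:
            group = (key if key_in else None, value if value_in else None)
            counts[group + (grouping_id,)] += 1
    return counts

result = rollup_with_grouping_id(T1)
# (1, NULL) appears twice: once as a subtotal, once as a real NULL in the data.
print(result[(1, None, 1)], result[(1, None, 3)])
```

The two (1, NULL) rows look identical without the grouping ID; with it, the subtotal and the genuine NULL are distinguishable.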

Grouping function

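The grouping function distinguishes between two kinds of NULL: a NULL that is already in the table data, and a NULL generated by ROLLUP, CUBE, or GROUPING SETS. For the first kind grouping(column) returns 0; for the second kind it returns 1. In the example below, the rows where grouping(...) returns 1 carry a NULL generated by the rollup, so a query could display them with a label such as ROLLUP-NULL instead.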

SELECT key, value, GROUPING__ID,
grouping(key, value), grouping(value, key), grouping(key), grouping(value),
count(*)
FROM T1
GROUP BY key, value WITH ROLLUP;

This query will produce the following results.

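key   value  GROUPING__ID  grouping(key, value)  grouping(value, key)  grouping(key)  grouping(value)  count(*)
NULL  NULL   3             3                     3                     1              1                6
1     NULL   1             1                     2                     0              1                2
1     NULL   0             0                     0                     0              0                1
1     1      0             0                     0                     0              0                1
2     NULL   1             1                     2                     0              1                1
2     2      0             0                     0                     0              0                1
3     NULL   1             1                     2                     0              1                2
3     NULL   0             0                     0                     0              0                1
3     3      0             0                     0                     0              0                1
4     NULL   1             1                     2                     0              1                1
4     5      0             0                     0                     0              0                1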

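hive.new.job.grouping.set.cardinality: as you may have guessed, GROUPING SETS, CUBE, and ROLLUP can cause problems when the cardinality (the number of grouping sets they expand to) is large. This setting is the threshold Hive uses to decide how to optimize such queries. According to the advice given here, it should be set higher than the number of combinations produced by your GROUPING SETS, CUBE, or ROLLUP.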