Usage of DISTINCT in Hive

In Hive, DISTINCT removes duplicate rows from a result set; in some cases it produces the same result as GROUP BY.

Below we test some of DISTINCT's behavior. First, create a test table:

create table test.trip_tmp(
id int,
user_id int,
salesman_id int,
huose_id int
);

Insert some sample data:

insert into test.trip_tmp values(1, 2, 3, 3);
insert into test.trip_tmp values(1, 2, 3, 3);
insert into test.trip_tmp values(2, 2, 3, 3);
insert into test.trip_tmp values(3, 2, 3, 3);
insert into test.trip_tmp values(4, 2, 5, 3);
insert into test.trip_tmp values(6, 3, 3, 3);
insert into test.trip_tmp values(5, 4, 2, 3);
insert into test.trip_tmp values(5, 2, 3, 3);
insert into test.trip_tmp values(6, 2, 5, 3);
insert into test.trip_tmp values(5, 2, 3, 3);
insert into test.trip_tmp values(5, 2, 5, 3);

View all rows in the table:

select * from test.trip_tmp;
OK
1	2	3	3
1	2	3	3
5	2	5	3
2	2	3	3
3	2	3	3
4	2	5	3
6	3	3	3
5	4	2	3
5	2	3	3
6	2	5	3
5	2	3	3
Time taken: 0.277 seconds, Fetched: 11 row(s)

Deduplicate across all of the table's columns:

select distinct id, user_id, salesman_id, huose_id from test.trip_tmp;

OK
1	2	3	3
2	2	3	3
3	2	3	3
4	2	5	3
5	2	3	3
5	2	5	3
5	4	2	3
6	2	5	3
6	3	3	3
Time taken: 13.142 seconds, Fetched: 9 row(s)

After applying DISTINCT, rows that were duplicated across all of the selected columns have been removed.
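As noted at the top, DISTINCT over all columns is equivalent to GROUP BY over those same columns. Hive follows standard SQL here, so the equivalence can be illustrated with any SQL engine — a minimal sketch using Python's built-in sqlite3 module (not Hive, but the semantics are the same for this query):

```python
import sqlite3

# Build an in-memory table with the same sample rows as test.trip_tmp above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trip_tmp (id INT, user_id INT, salesman_id INT, huose_id INT)")
rows = [
    (1, 2, 3, 3), (1, 2, 3, 3), (2, 2, 3, 3), (3, 2, 3, 3),
    (4, 2, 5, 3), (6, 3, 3, 3), (5, 4, 2, 3), (5, 2, 3, 3),
    (6, 2, 5, 3), (5, 2, 3, 3), (5, 2, 5, 3),
]
conn.executemany("INSERT INTO trip_tmp VALUES (?, ?, ?, ?)", rows)

# SELECT DISTINCT over all columns...
distinct_rows = conn.execute(
    "SELECT DISTINCT id, user_id, salesman_id, huose_id FROM trip_tmp ORDER BY 1, 2, 3, 4"
).fetchall()

# ...returns the same rows as GROUP BY over all columns.
grouped_rows = conn.execute(
    "SELECT id, user_id, salesman_id, huose_id FROM trip_tmp "
    "GROUP BY id, user_id, salesman_id, huose_id ORDER BY 1, 2, 3, 4"
).fetchall()

print(distinct_rows == grouped_rows)  # True
print(len(distinct_rows))             # 9, matching the Hive output above
```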

In Hive, DISTINCT must come immediately after SELECT; placing a column name before DISTINCT causes a parse error:

select huose_id, distinct id, user_id, salesman_id from test.trip_tmp;
NoViableAltException(96@[80:1: selectItem : ( ( tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );])
	at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA13.specialStateTransition(HiveParser_SelectClauseParser.java:4625)
	at org.antlr.runtime.DFA.predict(DFA.java:80)
	at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:1616)
	at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1177)
	at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:951)
	at org.apache.hadoop.hive.ql.parse.HiveParser.selectClause(HiveParser.java:42192)
	at org.apache.hadoop.hive.ql.parse.HiveParser.atomSelectStatement(HiveParser.java:36852)
	at org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:37119)
	at org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:36765)
	at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:35954)
	at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:35842)
	at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:2285)
	at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1334)
	at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:208)
	at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:77)
	at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:70)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:468)
	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
	at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
FAILED: ParseException line 1:17 cannot recognize input near 'distinct' 'id' ',' in selection target
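The fix is simply to move DISTINCT to the front of the select list; the order of the columns after DISTINCT does not change what gets deduplicated. A quick check of the corrected query using Python's sqlite3 (standard SQL, not Hive, but the same behavior applies here):

```python
import sqlite3

# Same sample data as test.trip_tmp above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trip_tmp (id INT, user_id INT, salesman_id INT, huose_id INT)")
rows = [
    (1, 2, 3, 3), (1, 2, 3, 3), (2, 2, 3, 3), (3, 2, 3, 3),
    (4, 2, 5, 3), (6, 3, 3, 3), (5, 4, 2, 3), (5, 2, 3, 3),
    (6, 2, 5, 3), (5, 2, 3, 3), (5, 2, 5, 3),
]
conn.executemany("INSERT INTO trip_tmp VALUES (?, ?, ?, ?)", rows)

# DISTINCT first, then any column order you like.
result = conn.execute(
    "SELECT DISTINCT huose_id, id, user_id, salesman_id FROM trip_tmp"
).fetchall()
print(len(result))  # 9 unique rows, as in the Hive output above
```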

DISTINCT can also be written with a parenthesized column list, but Hive interprets the parentheses as a struct constructor, so the effect is much the same as putting all of the columns together:

select distinct (id, user_id, huose_id), salesman_id from test.trip_tmp;
OK
{"col1":1,"col2":2,"col3":3}	3
{"col1":2,"col2":2,"col3":3}	3
{"col1":3,"col2":2,"col3":3}	3
{"col1":4,"col2":2,"col3":3}	5
{"col1":5,"col2":2,"col3":3}	3
{"col1":5,"col2":2,"col3":3}	5
{"col1":5,"col2":4,"col3":3}	2
{"col1":6,"col2":2,"col3":3}	5
{"col1":6,"col2":3,"col3":3}	3
Time taken: 9.201 seconds, Fetched: 9 row(s)

DISTINCT cannot be combined with an aggregate function in the same SELECT list, or Hive raises an error:

select distinct id, user_id, salesman_id, count(huose_id) from test.trip_tmp;
FAILED: SemanticException [Error 10128]: Line 1:42 Not yet supported place for UDAF 'count'

However, DISTINCT can be used inside an aggregate function:

select count(distinct id) from test.trip_tmp;
OK
6
Time taken: 4.775 seconds, Fetched: 1 row(s)
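count(distinct id) counts the unique values of id; with ids {1, 2, 3, 4, 5, 6} in the sample data, the answer is 6. This form is standard SQL as well, so the same count can be reproduced with sqlite3 (illustration only, not Hive):

```python
import sqlite3

# Same sample data as test.trip_tmp above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trip_tmp (id INT, user_id INT, salesman_id INT, huose_id INT)")
rows = [
    (1, 2, 3, 3), (1, 2, 3, 3), (2, 2, 3, 3), (3, 2, 3, 3),
    (4, 2, 5, 3), (6, 3, 3, 3), (5, 4, 2, 3), (5, 2, 3, 3),
    (6, 2, 5, 3), (5, 2, 3, 3), (5, 2, 5, 3),
]
conn.executemany("INSERT INTO trip_tmp VALUES (?, ?, ?, ?)", rows)

# COUNT(DISTINCT id) counts unique ids, ignoring duplicates.
n = conn.execute("SELECT COUNT(DISTINCT id) FROM trip_tmp").fetchone()[0]
print(n)  # 6, matching the Hive output above
```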

Finally, prefer GROUP BY over DISTINCT where you can: GROUP BY generally performs better, and the difference becomes noticeable on large datasets.

Reposted from blog.csdn.net/lz6363/article/details/85842146