Advanced queries in Hive

1. Query operation

group by, order by, join, distribute by, sort by, cluster by, union all

[Under the hood, Hive executes these as MapReduce jobs]

2. Common aggregation operations

2.1 count

count(*) counts every row, including rows that contain NULL values

count(1) also counts every row: the constant 1 is never NULL, so each record adds 1

count(col) counts only the rows where col is not NULL; a row whose col value is NULL adds nothing

[Example]

Raw data

name ,adx,tran_id ,cost,ts

select count(*) from t2; -- counts every row, even one whose columns are all NULL (NULL means no value; '' is an empty string, which is still a value)

select count(1) from t2; -- counts every row, same result as count(*)

select count(adx) from t2; -- counts only the rows where adx is not NULL

-- In the sample data, the second column (adx) of the last three rows is NULL
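The three counting rules can be checked with a small sketch. It makes several assumptions: SQLite (through Python's sqlite3 module) stands in for Hive, since the NULL handling of count is standard SQL, and the rows of t2 are invented for illustration because the original sample data is not shown.

```python
import sqlite3

# Assumption: SQLite stands in for Hive; the rows below are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t2 (name TEXT, adx TEXT, tran_id TEXT, cost REAL, ts INTEGER)"
)
rows = [
    ("m1", "5", "t1", 1.5, 100),
    ("m2", "", "t2", 2.0, 101),        # adx is an empty string, not NULL
    ("m3", None, "t3", 0.5, 102),      # adx is NULL
    ("m4", None, "t4", 1.0, 103),      # adx is NULL
    (None, None, None, None, None),    # every column is NULL
]
conn.executemany("INSERT INTO t2 VALUES (?, ?, ?, ?, ?)", rows)

count_star = conn.execute("SELECT count(*) FROM t2").fetchone()[0]
count_one = conn.execute("SELECT count(1) FROM t2").fetchone()[0]
count_adx = conn.execute("SELECT count(adx) FROM t2").fetchone()[0]

print(count_star, count_one, count_adx)  # 5 5 2
```

The all-NULL row is still counted by count(*) and count(1), while count(adx) keeps only the two rows whose adx is not NULL (the empty string counts as a value).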

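Analysis:

When the amount of data is large,

count(*) does not inspect the column values of each row; it only counts rows, so there is no reason to avoid it

count(1) counts rows in the same way and returns the same result as count(*)

count(field) has to check the field for NULL in every row, and its result differs from the other two whenever the field contains NULL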

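2.2 sum [the total of the values in a column]

sum(col), where col holds values that can be converted to numbers; NULL values are skipped; returns bigint for integer columns and double for floating-point columns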

select sum(adx) from t2;

2.3 avg [the average of the values in a column]

avg(col), where col holds values that can be converted to numbers; NULL values are skipped; returns double

select avg(adx) from t2;
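A short sketch of how sum and avg treat NULL, again assuming SQLite as a stand-in for Hive and an invented two-column version of t2. The return types printed here are SQLite's (integer and float), which only loosely mirror Hive's bigint and double.

```python
import sqlite3

# Assumption: SQLite stands in for Hive; table contents are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t2 (name TEXT, cost INTEGER)")
conn.executemany(
    "INSERT INTO t2 VALUES (?, ?)",
    [("a", 10), ("b", 20), ("c", None)],  # the last cost is NULL
)

# Both aggregates skip the NULL row, so avg divides by 2, not 3.
total, average = conn.execute("SELECT sum(cost), avg(cost) FROM t2").fetchone()

print(total, average)  # 30 15.0
```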

2.4 distinct [placed in front of a column name, distinct removes the duplicate values of that column]

count(distinct col)

select distinct name from t2;

-- Removes duplicates and returns each value once; NULL also appears as one value

select count(distinct name) from t2; -- counts the distinct non-NULL values
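The difference between listing distinct values and counting them shows up when the column contains NULL. A minimal sketch, assuming SQLite as a stand-in for Hive and made-up names:

```python
import sqlite3

# Assumption: SQLite stands in for Hive; the names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t2 (name TEXT)")
conn.executemany(
    "INSERT INTO t2 VALUES (?)",
    [("a",), ("a",), ("b",), (None,), (None,)],
)

# SELECT DISTINCT keeps NULL as one of the returned values.
distinct_names = [r[0] for r in conn.execute("SELECT DISTINCT name FROM t2")]

# count(DISTINCT ...) ignores NULL, like every other count(col).
distinct_count = conn.execute("SELECT count(DISTINCT name) FROM t2").fetchone()[0]

print(len(distinct_names), distinct_count)  # 3 2
```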

3. ORDER BY

Sorts the result by one or more columns

select col1,other...

from table

where condition

order by col1 [asc|desc], col2 [asc|desc]

Notes

order by can be followed by multiple columns; the default direction is ascending, and string columns are compared lexicographically

order by is a global sort: the whole result set ends up in one total order

Because the order is global, order by performs the final sort in a single reducer regardless of configuration; increasing the configured number of reducers does not help. Use it with caution when the amount of data is large.
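The ordering rules (multiple sort columns, a direction per column, lexicographic comparison of strings) can be sketched as follows. SQLite is assumed as a stand-in for Hive and the rows are invented; the single-reducer behaviour is specific to Hive and is not reproduced here.

```python
import sqlite3

# Assumption: SQLite stands in for Hive; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t2 (name TEXT, adx TEXT, cost INTEGER)")
conn.executemany(
    "INSERT INTO t2 VALUES (?, ?, ?)",
    [("b", "9", 3), ("a", "10", 1), ("a", "9", 2), ("b", "10", 4)],
)

# Two sort keys, each with its own direction.
ordered = conn.execute(
    "SELECT name, cost FROM t2 ORDER BY name ASC, cost DESC"
).fetchall()

# adx is a string column, so '10' sorts before '9' ...
as_text = [r[0] for r in conn.execute("SELECT adx FROM t2 ORDER BY adx")]

# ... unless it is cast to a number first.
as_number = [
    r[0] for r in conn.execute("SELECT adx FROM t2 ORDER BY CAST(adx AS INTEGER)")
]

print(ordered)    # [('a', 2), ('a', 1), ('b', 4), ('b', 3)]
print(as_text)    # ['10', '10', '9', '9']
print(as_number)  # ['9', '9', '10', '10']
```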
