1. Query operations
group by, order by, join, distribute by, sort by, cluster by, union all
[Under the hood, these are all executed as MapReduce jobs]
2. Common aggregation operations
2.1 count
count(*) Counts every row, including rows whose columns are all NULL
count(1) Counts every row as well: the constant 1 is never NULL, so each record adds 1
count(col) Counts only the rows where col is not NULL; a row with NULL in col adds nothing
[Example]
Raw data
name, adx, tran_id, cost, ts
select count(*) from t2; -- every row is counted, even an all-NULL row (NULL means no value; '' is an empty string, which is still a value)
select count(1) from t2; -- every row is counted
select count(adx) from t2; -- rows where adx is NULL are not counted
-- The second column (adx) of the last three rows is NULL
Analysis:
When the amount of data is large,
count(*) does not check each row for NULLs; it returns the same result as count(1) and is generally handled the same way, so there is no need to avoid it
count(1) counts rows directly; efficient
count(col) has to test col for NULL in every row, so it does slightly more work and can return a smaller number
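The three forms of count can be compared on a small made-up table. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for Hive so that it runs anywhere, and the rows are hypothetical (the original raw data is not shown). NULL handling for count is standard SQL, so the results should match what Hive returns for the same rows.

```python
import sqlite3

def count_variants():
    # In-memory SQLite database standing in for a Hive table (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table t2 (name text, adx text, tran_id text, cost real, ts text)"
    )
    rows = [
        ("a", "1", "id1", 1.0, "100"),
        ("b", "2", "id2", 2.0, "101"),
        ("c", None, "id3", 3.0, "102"),    # adx is NULL
        ("d", None, "id4", 4.0, "103"),    # adx is NULL
        (None, None, None, None, None),    # every column is NULL
    ]
    conn.executemany("insert into t2 values (?, ?, ?, ?, ?)", rows)
    star = conn.execute("select count(*) from t2").fetchone()[0]
    one = conn.execute("select count(1) from t2").fetchone()[0]
    col = conn.execute("select count(adx) from t2").fetchone()[0]
    conn.close()
    return star, one, col

print(count_variants())  # (5, 5, 2)
```

count(*) and count(1) both count the all-NULL row, while count(adx) skips the three rows whose adx is NULL.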
2.2 sum [total of a column's values]
sum(col) col must be numeric or convertible to a number; returns bigint for integer columns and double for floating-point columns
select sum(adx) from t2;
2.3 avg [average of a column's values]
avg(col) col must be numeric or convertible to a number; returns double
select avg(adx) from t2;
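A minimal sketch of sum and avg on a numeric column, again using Python's sqlite3 module as a stand-in for Hive. The table contents and the choice of the cost column are assumptions made for illustration. Both aggregates skip NULL values, so avg divides by the number of non-NULL values, not by the number of rows.

```python
import sqlite3

def sum_and_avg():
    # In-memory SQLite database standing in for a Hive table (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t2 (name text, cost integer)")
    conn.executemany(
        "insert into t2 values (?, ?)",
        [("a", 10), ("b", 20), ("c", None)],  # hypothetical rows; one NULL cost
    )
    total, average = conn.execute(
        "select sum(cost), avg(cost) from t2"
    ).fetchone()
    conn.close()
    return total, average

print(sum_and_avg())  # (30, 15.0)
```

The average is 15.0 rather than 10.0 because the NULL row is left out of both the numerator and the denominator.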
2.4 distinct [placed before the column name; removes duplicate values of that column]
count(distinct col)
select distinct name from t2;
-- Removes duplicates and returns each distinct value once; NULL also appears as one value
select count(distinct name) from t2;
-- Counts the distinct non-NULL values
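The difference between select distinct and count(distinct col) shows up when the column contains NULL. The sketch below uses Python's sqlite3 module as a stand-in for Hive, with hypothetical rows; the behaviour shown is standard SQL.

```python
import sqlite3

def distinct_names():
    # In-memory SQLite database standing in for a Hive table (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t2 (name text)")
    conn.executemany(
        "insert into t2 values (?)",
        [("a",), ("a",), ("b",), (None,)],  # hypothetical rows
    )
    # select distinct keeps NULL as one of the returned values.
    values = {row[0] for row in conn.execute("select distinct name from t2")}
    # count(distinct ...) ignores NULL.
    n = conn.execute("select count(distinct name) from t2").fetchone()[0]
    conn.close()
    return values, n

values, n = distinct_names()
print(len(values), n)  # 3 2
```

select distinct returns three values (a, b and NULL), while count(distinct name) reports 2.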
3. ORDER BY
Sorts the result by one or more columns
select col1, other...
from table
where condition
order by col1 [asc|desc], col2 [asc|desc]
Notes
order by can be followed by multiple columns; the default direction is ascending, and string columns are compared lexicographically
order by performs a global sort
order by requires a reduce phase, and it always runs with a single reducer regardless of configuration (raising the configured number of reducers has no effect). Use it with caution when the amount of data is large.
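A small sketch of a multi-column order by with a different direction per column. It uses Python's sqlite3 module as a stand-in for Hive, and the table name t and its rows are hypothetical. It illustrates only the ordering semantics, not the single-reducer execution described above.

```python
import sqlite3

def ordered_rows():
    # In-memory SQLite database standing in for a Hive table (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t (col1 text, col2 integer)")
    conn.executemany(
        "insert into t values (?, ?)",
        [("b", 1), ("a", 2), ("a", 1), ("b", 2)],  # hypothetical rows
    )
    # col1 ascending (the default), ties broken by col2 descending.
    rows = conn.execute(
        "select col1, col2 from t order by col1 asc, col2 desc"
    ).fetchall()
    conn.close()
    return rows

print(ordered_rows())  # [('a', 2), ('a', 1), ('b', 2), ('b', 1)]
```

Rows are first ordered by col1; col2 only decides the order among rows that share the same col1.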