It has been a few days since I last wrote a blog post ~
1、order by
order by sorts the data globally, with the same effect as order by in databases such as Oracle and MySQL. To produce a global order, Hive sends all of the data to a single reducer: no matter how much data there is or how many files are processed, only one reduce is started for the sort. If hive.mapred.mode = strict is set (the default here is nonstrict), order by must be accompanied by a limit that caps the number of output rows. The reason is that all of the data ends up in the same reducer, and with a large amount of data the query may never produce a result, so strict mode forces you to bound the output.
With set hive.mapred.mode = strict; in effect and no limit specified, running the select fails with the following error (excerpt):
LIMIT must also be specified
Stage-1: number of mappers: 73; number of reducers: 105
With order by, a second job is then launched with a single reducer to merge-sort the data:
Stage-2: number of mappers: 60; number of reducers: 1
Compared with sort by:
Stage-1: number of mappers: 73; number of reducers: 105
sort by needs only this one job.
2、sort by
sort by sorts the data within each reducer separately, so the result is only locally ordered and a global order is not guaranteed. It is generally used together with distribute by.
Because the sort happens on each reduce side, the output of every reducer is ordered on its own, but the output as a whole is not necessarily ordered unless there is only one reducer. In practice you can do the local sort first and the global sort afterwards, which is much more efficient (once each part is sorted locally, a merge sort is enough to produce the global order).
If mapred.reduce.tasks = 1, sort by has the same effect as order by. If it is greater than 1, the output is split into several files, each sorted by the specified field, and a global order is not guaranteed.
select prov_id,nvl(count(*),0) from A.a where part_id='201807' and day_id='21' group by prov_id distribute by prov_id sort by prov_id;
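The remark above that locally sorted reducer outputs can be turned into a global order with a merge sort can be sketched in plain Java. This is only an illustration under assumptions, not Hive's own code: the class name SortedRunMerger and the integer data are hypothetical, and each inner list stands in for the already sorted output of one reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; this is not how Hive is implemented.
class SortedRunMerger {
    // Merges runs that are each sorted ascending (like the output files of
    // several reducers after sort by) into one globally sorted list.
    static List<Integer> merge(List<List<Integer>> runs) {
        // Each heap entry is {value, run index, position inside that run}.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) {
                heap.add(new int[] {runs.get(r).get(0), r, 0});
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();          // smallest remaining value
            out.add(top[0]);
            List<Integer> run = runs.get(top[1]);
            int next = top[2] + 1;
            if (next < run.size()) {          // refill from the same run
                heap.add(new int[] {run.get(next), top[1], next});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> reducerOutputs = Arrays.asList(
            Arrays.asList(1, 4, 9), Arrays.asList(2, 3, 10), Arrays.asList(5, 6));
        System.out.println(merge(reducerOutputs));
    }
}
```

The merge only ever compares the current head of each run, which is why sorting locally first and merging afterwards is cheaper than pushing every unsorted row through one reducer.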
3、distribute by
distribute by controls how the map-side output is divided among the reducers: rows with the same distribute by value are sent to the same reducer. This does not mean one reducer per value; a reducer may receive several values, but all rows with the same value land in the same reducer. Hive takes the hashCode of the distribute by field modulo the number of reducers and assigns the row to the corresponding reducer.
This corresponds to the partition step of a MapReduce program, where by default (key.hashCode() & Integer.MAX_VALUE) % numReduce determines which reduce task processes a record.
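import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// TextPair is a custom composite key class that is not shown here.
public class myPartitioner extends Partitioner<TextPair, Text> {
    @Override
    public int getPartition(TextPair key, Text value, int num) {
        // Guard against a modulo by zero when there are no reduce tasks.
        if (num == 0) {
            return 0;
        }
        // Mask off the sign bit so the result is never negative,
        // then take the remainder to pick a reduce task.
        return (key.getFirst().hashCode() & Integer.MAX_VALUE) % num;
    }
}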
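To see the hash-and-modulo rule at work without a Hadoop cluster, here is a small standalone sketch. It makes a few assumptions: PartitionDemo, partitionFor and distribute are hypothetical names, and Java's String.hashCode stands in for whatever hash Hive actually computes for the distribute by field. Note that in Java % binds more tightly than &, so the parentheses around the masked hash are required.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of hash partitioning; not Hive or Hadoop code.
class PartitionDemo {
    // Same rule as the partitioner above, applied to a plain String key.
    static int partitionFor(String key, int numReduce) {
        if (numReduce == 0) {
            return 0;
        }
        // The mask clears the sign bit, so the index is always in [0, numReduce).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduce;
    }

    // Groups keys by the reducer index they would be sent to.
    static Map<Integer, List<String>> distribute(List<String> keys, int numReduce) {
        Map<Integer, List<String>> byReducer = new TreeMap<>();
        for (String key : keys) {
            byReducer.computeIfAbsent(partitionFor(key, numReduce),
                                      r -> new ArrayList<>()).add(key);
        }
        return byReducer;
    }

    public static void main(String[] args) {
        // Both "a" rows end up in the same reducer; "d" happens to share it.
        System.out.println(distribute(Arrays.asList("a", "b", "c", "d", "a"), 3));
    }
}
```

The grouping shows both properties described above: equal values always share a reducer, and one reducer can still hold several distinct values.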
4、Cluster By
cluster by is equivalent to using distribute by and sort by together on the same field. However, with cluster by you cannot specify the sort rule as asc or desc; the rows are always sorted in the default ascending order.
select prov_id,nvl(count(*),0) from zba_dwd.dwd_d_use_cb_sms where part_id='201807' and day_id='21' group by prov_id Cluster By prov_id;
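The way cluster by combines the two steps can be mimicked in a few lines of Java. This is a rough sketch under assumptions rather than Hive's implementation: ClusterBySketch and clusterBy are hypothetical names, String.hashCode stands in for Hive's hash, and numReduce is assumed to be at least 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not Hive code.
class ClusterBySketch {
    // Returns one list per reducer. Assumes numReduce >= 1.
    static List<List<String>> clusterBy(List<String> keys, int numReduce) {
        List<List<String>> reducers = new ArrayList<>();
        for (int i = 0; i < numReduce; i++) {
            reducers.add(new ArrayList<>());
        }
        // Step 1 (distribute by): route each key to a reducer by hash.
        for (String key : keys) {
            int index = (key.hashCode() & Integer.MAX_VALUE) % numReduce;
            reducers.get(index).add(key);
        }
        // Step 2 (sort by): sort inside each reducer, ascending.
        for (List<String> reducer : reducers) {
            Collections.sort(reducer);
        }
        return reducers;
    }

    public static void main(String[] args) {
        System.out.println(clusterBy(Arrays.asList("d", "a", "c", "b", "a"), 3));
    }
}
```

Each inner list is ordered, but reading the lists one after another does not give a global order, which matches the local-sort behaviour described for sort by.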