Hive: order by, sort by, distribute by, cluster by

It has been a few days since I last wrote a blog post ~

1. order by

order by sorts all of the data globally, with the same effect as order by in databases such as Oracle and MySQL. To produce the global order, Hive starts a new job and sends all of the data to a single reducer, no matter how much data there is or how many files are processed. If hive.mapred.mode = strict is set (the default is nonstrict in Hive 0.x and 1.x), order by must be accompanied by a limit clause that caps the number of output rows. The reason is that all of the data ends up on one reducer, and with a large data set the query may never produce a result, so strict mode forces you to bound the output.

With set hive.mapred.mode = strict in effect and no limit specified, the query fails with the following error:

LIMIT must also be specified

Stage-1: number of mappers: 73; number of reducers: 105
A second job is then launched with a single reducer to merge-sort the data:
Stage-2: number of mappers: 60; number of reducers: 1

vs sort by

Stage-1: number of mappers: 73; number of reducers: 105
Only one job is launched.

2. sort by

sort by sorts the rows inside each reducer separately. The result is locally ordered, and global order is not guaranteed. It is generally used together with distribute by.

It is a local sort performed on the reduce side: the output of each reducer is ordered, but the overall result is not necessarily ordered unless there is only one reducer. In practice it is often much more efficient to sort locally first and then globally, because once the local sorts are done, a merge sort over the reducer outputs is enough to produce the global order.

If mapred.reduce.tasks = 1, the effect is the same as order by. If it is greater than 1, the output is split into several files; each file is sorted by the specified field, but global order is not guaranteed.

select prov_id, nvl(count(*), 0) from A.a where part_id = '201807' and day_id = '21' group by prov_id distribute by prov_id sort by prov_id;
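The local-sort-then-merge idea described in this section can be sketched in plain Java. This is only an illustration, not how Hive is implemented: the class name SortByDemo, its two methods, and the sample numbers are all made up for this sketch, and each inner list stands in for the rows one reducer received.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

final class SortByDemo {
    // Simulates "sort by": every reducer sorts only the rows it received.
    static List<List<Integer>> sortWithinReducers(List<List<Integer>> reducerInputs) {
        List<List<Integer>> out = new ArrayList<>();
        for (List<Integer> rows : reducerInputs) {
            List<Integer> sorted = new ArrayList<>(rows);
            Collections.sort(sorted);
            out.add(sorted);
        }
        return out;
    }

    // Merges already-sorted reducer outputs into one globally ordered list (k-way merge).
    static List<Integer> mergeSorted(List<List<Integer>> sortedOutputs) {
        // Each heap entry is {value, reducerIndex, positionInThatReducer}.
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int r = 0; r < sortedOutputs.size(); r++) {
            if (!sortedOutputs.get(r).isEmpty()) {
                heap.add(new int[] {sortedOutputs.get(r).get(0), r, 0});
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            merged.add(top[0]);
            List<Integer> source = sortedOutputs.get(top[1]);
            int next = top[2] + 1;
            if (next < source.size()) {
                heap.add(new int[] {source.get(next), top[1], next});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Two hypothetical reducers with unsorted input rows.
        List<List<Integer>> inputs = Arrays.asList(
                Arrays.asList(31, 11, 51),
                Arrays.asList(81, 13, 34));
        List<List<Integer>> local = sortWithinReducers(inputs);
        System.out.println(local);              // [[11, 31, 51], [13, 34, 81]]
        System.out.println(mergeSorted(local)); // [11, 13, 31, 34, 51, 81]
    }
}
```

Each reducer's list is ordered on its own, but reading them one after another is not (51 comes before 13), which is why a final merge step is needed for a global order.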

 

3. distribute by

distribute by controls how the map-side output is divided among the reducers: rows with the same value in the distribute by column are sent to the same reducer. A reducer is not limited to a single value, but all rows sharing a value go to one reducer. Hive hashes the specified column, takes the result modulo the number of reducers, and assigns the row to the corresponding reducer.

This corresponds to the partition step of a MapReduce program. By default, the reduce task that processes a record is determined by (key.hashCode() & Integer.MAX_VALUE) % numReduce.

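import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// TextPair is a user-defined composite key class (not shown here); getFirst() returns its first field.
public class MyPartitioner extends Partitioner<TextPair, Text> {

    @Override
    public int getPartition(TextPair key, Text value, int numPartitions) {
        // Guard against division by zero when there are no reducers.
        if (numPartitions == 0) {
            return 0;
        }
        // Clear the sign bit so the hash is non-negative, then take it modulo the reducer count.
        return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}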
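To see the effect of this formula without a Hadoop cluster, here is a small stand-alone sketch. It assumes plain String keys and Java's String.hashCode() as a stand-in hash; Hive's own hash function for a column may differ. The class name DistributeByDemo, its methods, and the sample prov_id values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DistributeByDemo {
    // Same formula as the partitioner: clear the sign bit, then take the hash modulo the reducer count.
    static int reducerFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Groups keys by the reducer index they are assigned to.
    static Map<Integer, List<String>> distribute(List<String> keys, int numReducers) {
        Map<Integer, List<String>> byReducer = new TreeMap<>();
        for (String key : keys) {
            byReducer.computeIfAbsent(reducerFor(key, numReducers), r -> new ArrayList<>()).add(key);
        }
        return byReducer;
    }

    public static void main(String[] args) {
        // Hypothetical prov_id values, some of them repeated.
        List<String> provIds = Arrays.asList("011", "013", "011", "031", "013", "011");
        System.out.println(distribute(provIds, 3));
    }
}
```

Every occurrence of the same key lands on the same reducer, while one reducer may hold several different keys. Note that in Java % binds more tightly than &, so the parentheses around the mask are required for the result to stay in the range 0 to numReducers - 1.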

4. cluster by

cluster by is equivalent to using distribute by and sort by together on the same column. However, cluster by does not let you choose the sort direction: neither asc nor desc can be specified, and the rows are always sorted in ascending order.

select prov_id, nvl(count(*), 0) from zba_dwd.dwd_d_use_cb_sms where part_id = '201807' and day_id = '21' group by prov_id cluster by prov_id;
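As a rough mental model, cluster by on a column behaves like the two steps below applied one after the other. This is a sketch under assumptions, not Hive's code: String.hashCode() stands in for Hive's hash function, and the class name ClusterByDemo and the sample values are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ClusterByDemo {
    // Step 1 (distribute by): send each key to a reducer chosen by hash modulo reducer count.
    // Step 2 (sort by): sort each reducer's keys in ascending order.
    static List<List<String>> clusterBy(List<String> keys, int numReducers) {
        List<List<String>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new ArrayList<>());
        }
        for (String key : keys) {
            int index = (key.hashCode() & Integer.MAX_VALUE) % numReducers;
            reducers.get(index).add(key);
        }
        for (List<String> rows : reducers) {
            Collections.sort(rows); // ascending only, mirroring cluster by
        }
        return reducers;
    }

    public static void main(String[] args) {
        // Hypothetical prov_id values spread over 3 simulated reducers.
        System.out.println(clusterBy(Arrays.asList("031", "011", "013", "011"), 3));
    }
}
```

Within each simulated reducer the keys come out in ascending order, and equal keys always sit together in the same reducer.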

 

Origin blog.csdn.net/qq_31780525/article/details/81978213