[Big Data] MapReduce components: Partition and sorting

Motivation

Sometimes the statistical results must be written to different files (partitions) according to some condition. For example: output the traffic statistics to different files (partitions) according to the province that each mobile phone number belongs to.

An analogy: freshmen (the <k,v> pairs) arrive at school and are assigned to different dormitories (reduce tasks). If all of the data emitted by the map side went to the single default reduce node, reduce-side parallel computation would be wasted and the I/O pressure on that node would be heavy. This is why partitioning exists.

The default partition is computed from the hashCode of the key, modulo the number of reduce tasks, so the user cannot control which partition a given key is stored in:

public class HashPartitioner<K, V> extends Partitioner<K, V> {
    public int getPartition(K key, V value, int numReduceTasks) {
        // default rule: the key's hash modulo the number of reduce tasks
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
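In this default rule, the bitmask key.hashCode() & Integer.MAX_VALUE clears the sign bit, so the modulo result is always a valid, non-negative partition number even when hashCode() returns a negative value.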

a) By default there is a single partition.

b) The number of partitions corresponds to the number of reduce tasks; every reduce task executes the same reducer code, each over its own partition.

c) For a custom partition, the number of partitions returned must match the number of reduce tasks defined. Specifically: the custom partitioner class extends Partitioner, and when overriding getPartition() the number of distinct partition numbers it returns must equal the X in job.setNumReduceTasks(X).

d) A custom partition relies on the custom Partitioner component. Each partition is consumed by exactly one reduce task, and each reduce task processes the records of its own partition and writes them to its own output file, part-r-0000X.
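A practical note, based on Hadoop's standard behavior: if job.setNumReduceTasks(X) is larger than the number of partition values actually returned, the extra reduce tasks simply write empty part-r-0000X files; if getPartition() returns a value greater than or equal to X, the job fails with an illegal-partition error.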

Case:

Output the statistical results to different files (partitions) according to the province each mobile phone number belongs to (approximated here by the number's three-digit prefix).

(1) Input data

(2) Input format

id  mobile number  network ip  URL  upstream traffic  downstream traffic  network status code

1,  13736230513,  192.196.100.1,  www.atguigu.com,  2481,  24681,  200
2,  13846544121,  192.196.100.2,  ,                 264,   0,      200
3,  13956435636,  192.196.100.3,  ,                 132,   1512,   200
4,  13966251146,  192.168.100.1,  ,                 240,   0,      404

(3) Expected output data format

Files starting with 137

id  mobile number  network ip  URL  upstream traffic  downstream traffic  network status code
1,  13736230513,  192.196.100.1,  www.atguigu.com,  2481,  24681,  200

Files starting with 138

id  mobile number  network ip  URL  upstream traffic  downstream traffic  network status code
2,  13846544121,  192.196.100.2,  ,  264,  0,  200

Files starting with 139

id  mobile number  network ip  URL  upstream traffic  downstream traffic  network status code
3,  13956435636,  192.196.100.3,  ,  132,  1512,  200
4,  13966251146,  192.168.100.1,  ,  240,  0,  404

Idea: the mobile phone number is used as the key, and the traffic record of the row is used as the value.

(1) In MapReduce, the kv pairs output by the map are grouped by key and then distributed to different reduce tasks. The default distribution rule: partition by the key's hashcode % the number of reduce tasks.

(2) To distribute (group) the data according to our own needs, we need to override the distribution component, Partitioner.

(3) Define a custom FlowPartitioner that extends the abstract class Partitioner: phone numbers beginning with 136 go to partition 0 (the first reduce task completes their statistics), and numbers beginning with 137, 138, 139 and all others go to partitions 1-4.

(4) The getPartition() method filters the results output by the mapper and returns a partition number according to the prefixes set in step (3).

(5) Each reduce task processes the data of its own partition and writes it to its own part-r-0000x file.

(6) Register the custom partitioner class and the number of reduce tasks on the job in the driver class:

 job.setPartitionerClass(FlowPartitioner.class);
 job.setNumReduceTasks(5);

Add a FlowPartitioner class:

package hdfs_demo.partiyioner;


import hdfs_demo.telFlow.FlowBean;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FlowPartitioner extends Partitioner<Text, FlowBean> {
    /**
     * Return the partition number for a record
     * @param text          the mobile phone number (map output key)
     * @param flowBean      the traffic record (map output value)
     * @param numPartitions the number of reduce tasks
     * @return the partition number
     */
    // perform the partitioning by phone-number prefix
    @Override
    public int getPartition(Text text, FlowBean flowBean, int numPartitions) {

        String phone = text.toString(); // get the phone number

        switch (phone.substring(0,3)){
            case "136":
                return 0;
            case "137":
                return 1;
            case "138":
                return 2;
            case "139":
                return 3;
            default:
                return 4;
        }
    }
}
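For context, here is a minimal sketch of what the FlowMapper referenced by the driver below might look like. The field positions follow the input format above; the FlowBean(upFlow, downFlow) constructor is an assumption, since hdfs_demo.telFlow is not shown in the original post.

package hdfs_demo.telFlow;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class FlowMapper extends Mapper<LongWritable, Text, Text, FlowBean> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // fields: id, phone, ip, URL, upstream traffic, downstream traffic, status code
        String[] fields = value.toString().split(",");

        String phone = fields[1].trim();
        long upFlow = Long.parseLong(fields[4].trim());
        long downFlow = Long.parseLong(fields[5].trim());

        // emit (phone, traffic record); FlowPartitioner routes it by the phone prefix
        context.write(new Text(phone), new FlowBean(upFlow, downFlow)); // assumed constructor
    }
}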

Add partition settings in FlowDriver.class:

package hdfs_demo.partiyioner;

import hdfs_demo.telFlow.FlowBean;
import hdfs_demo.telFlow.FlowMapper;
import hdfs_demo.telFlow.FlowReducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class FlowDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        //create the configuration object
        Configuration conf = new Configuration();
        //create a job instance
        Job job = Job.getInstance(conf, "telFlowCount");

        //the class that launches the MapReduce job
        job.setJarByClass(FlowDriver.class);

        //set the mapper and reducer
        job.setMapperClass(FlowMapper.class);
        job.setReducerClass(FlowReducer.class);

        //set the map output types: Text, FlowBean
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);

        //set the reduce output types: Text, FlowBean
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        // register the custom partitioner class and the number of reduce tasks
        job.setPartitionerClass(FlowPartitioner.class);
        job.setNumReduceTasks(5);

        //set the input data path
        FileInputFormat.setInputPaths(job, new Path("G:\\idea-workspace\\hdfs_java_api\\Resource\\telinfo.txt"));
        //set the path for the reducer output
        FileOutputFormat.setOutputPath(job, new Path("G:\\idea-workspace\\hdfs_java_api\\Resource\\result"));

        //submit the job and wait for completion
        boolean b = job.waitForCompletion(true);

        System.out.println(b);
    }
}
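After the job finishes, the result directory contains five output files, part-r-00000 through part-r-00004: numbers beginning with 136 land in part-r-00000, 137 in part-r-00001, 138 in part-r-00002, 139 in part-r-00003, and all other numbers in part-r-00004.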

Sorting only requires implementing the compareTo method in FlowBean. The default stub

public int compareTo(Object o) {
    return 0;
}

is replaced with a comparison on the total traffic:

public int compareTo(Object o) {
    FlowBean bean = (FlowBean) o;
    // sort in descending order, from largest to smallest
    return this.sumFlow > bean.getSumFlow() ? -1 : 1;
}
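For completeness, a minimal sketch of a FlowBean that would support both the driver above and this kind of sorting, assuming the field names upFlow/downFlow/sumFlow (the real hdfs_demo.telFlow.FlowBean is not shown in the original post). Note that in MapReduce the sort happens on the map output key during the shuffle, so to sort by total traffic the bean is typically made the key and must implement WritableComparable:

package hdfs_demo.telFlow;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// A hypothetical FlowBean: upstream, downstream and total traffic
public class FlowBean implements WritableComparable<FlowBean> {

    private long upFlow;
    private long downFlow;
    private long sumFlow;

    public FlowBean() { }  // no-arg constructor required by Hadoop for deserialization

    public FlowBean(long upFlow, long downFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = upFlow + downFlow;
    }

    public long getSumFlow() { return sumFlow; }

    // serialization: write the fields in a fixed order
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    // deserialization: read the fields back in the same order
    @Override
    public void readFields(DataInput in) throws IOException {
        upFlow = in.readLong();
        downFlow = in.readLong();
        sumFlow = in.readLong();
    }

    // descending order by total traffic, as in the snippet above
    @Override
    public int compareTo(FlowBean o) {
        return this.sumFlow > o.getSumFlow() ? -1 : 1;
    }

    @Override
    public String toString() {
        return upFlow + "\t" + downFlow + "\t" + sumFlow;
    }
}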

