04 - Common MR Algorithm Implementations and the Shuffle Mechanism

Hadoop's serialization mechanism

Writable is Hadoop's serialization interface; Comparable is the interface that defines sort order. In MR, records are sorted by key.

1. Define a custom type for the values passed between map and reduce
    
     
     
    package com.apollo.mr.flowsum;

    import org.apache.hadoop.io.WritableComparable;

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    /**
     * Writable is Hadoop's serialization interface, Comparable defines the sort order;
     * MR sorts by key.
     * Created by 85213 on 2016/12/13.
     */
    public class FlowBean implements WritableComparable<FlowBean> {

        private String phoneNB;
        private long u_flow;
        private long d_flow;
        private long s_flow;

        // Deserialization instantiates the bean via reflection, which needs a no-arg
        // constructor, so one is defined explicitly.
        public FlowBean() {}

        // Convenience constructor for initializing the object's data.
        public FlowBean(String phoneNB, long u_flow, long d_flow) {
            this.phoneNB = phoneNB;
            this.u_flow = u_flow;
            this.d_flow = d_flow;
            this.s_flow = u_flow + d_flow;
        }

        // Serialize the object's fields into the output stream.
        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(phoneNB);
            out.writeLong(u_flow);
            out.writeLong(d_flow);
            out.writeLong(s_flow);
        }

        // Note: the read order must match the write order exactly.
        @Override
        public void readFields(DataInput in) throws IOException {
            phoneNB = in.readUTF();
            u_flow = in.readLong();
            d_flow = in.readLong();
            s_flow = in.readLong();
        }

        // Sort rule: descending by total flow.
        @Override
        public int compareTo(FlowBean o) {
            return Long.compare(o.s_flow, this.s_flow);
        }

        @Override
        public String toString() {
            return u_flow + "\t" + d_flow + "\t" + s_flow;
        }

        public String getPhoneNB() { return phoneNB; }
        public void setPhoneNB(String phoneNB) { this.phoneNB = phoneNB; }
        public long getU_flow() { return u_flow; }
        public void setU_flow(long u_flow) { this.u_flow = u_flow; }
        public long getD_flow() { return d_flow; }
        public void setD_flow(long d_flow) { this.d_flow = d_flow; }
        public long getS_flow() { return s_flow; }
        public void setS_flow(long s_flow) { this.s_flow = s_flow; }
    }
2. Use the custom type as the mapper's output value type
    
     
     
    package com.apollo.mr.flowsum;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    import java.io.IOException;

    /**
     * FlowBean is a custom data type. To be transferred between Hadoop nodes it must follow
     * Hadoop's serialization mechanism, i.e. implement the corresponding serialization interface.
     * Created by 85213 on 2016/12/13.
     */
    public class FlowMapper extends Mapper<LongWritable, Text, Text, FlowBean> {

        // Take one line of the log, split it into fields, extract the fields we need
        // (phone number, upstream flow, downstream flow), wrap them as a k/v pair and emit it.
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // One line of input
            String line = value.toString();
            // Parse the fields
            String[] fields = line.split("\t");
            String phoneNB = fields[1];
            long u_flow = Long.parseLong(fields[7]);
            long d_flow = Long.parseLong(fields[8]);
            context.write(new Text(phoneNB), new FlowBean(phoneNB, u_flow, d_flow));
        }
    }
3. To emit the custom type from the reducer, the custom type must also implement the toString() method
     
      
      
    package com.apollo.mr.flowsum;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    import java.io.IOException;

    /**
     * Created by 85213 on 2016/12/13.
     */
    public class FlowReducer extends Reducer<Text, FlowBean, Text, FlowBean> {

        @Override
        protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {
            long u_flow_counter = 0;
            long d_flow_counter = 0;
            // Sum the upstream and downstream flow of all records for this phone number.
            for (FlowBean value : values) {
                u_flow_counter += value.getU_flow();
                d_flow_counter += value.getD_flow();
            }
            context.write(key, new FlowBean(key.toString(), u_flow_counter, d_flow_counter));
        }
    }
4. A standard runner
      
       
       
    package com.apollo.mr.flowsum;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    /**
     * Created by 85213 on 2016/12/13.
     */
    public class FlowRunner extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            // Use the Configuration injected by ToolRunner instead of creating a new one,
            // so that generic options (-D ...) passed on the command line take effect.
            Configuration conf = getConf();
            Job job = Job.getInstance(conf);

            job.setJarByClass(FlowRunner.class);
            job.setMapperClass(FlowMapper.class);
            job.setReducerClass(FlowReducer.class);

            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(FlowBean.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(FlowBean.class);

            // args[0] is the output path, args[1] is the input path.
            FileOutputFormat.setOutputPath(job, new Path(args[0]));
            FileInputFormat.setInputPaths(job, new Path(args[1]));

            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            int res = ToolRunner.run(new Configuration(), new FlowRunner(), args);
            System.exit(res);
        }
    }

Custom sorting in Hadoop

MR sorts records by key during the shuffle, so data arrives at the reducer already ordered; to sort by our own criterion we make the bean the map output key and let its compareTo method define the order.
1. Implement the WritableComparable interface
       
        
        
FlowBean is the same class shown in step 1 of the previous section; the part that matters here is its compareTo implementation, which orders beans by total flow in descending order:

    // Sort rule: descending by total flow. The shuffle calls this on the map output keys.
    @Override
    public int compareTo(FlowBean o) {
        return Long.compare(o.s_flow, this.s_flow);
    }
2. The sort MR job
        
         
         
    package com.apollo.mr.flowsort;

    import com.apollo.mr.flowsum.FlowBean;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    import java.io.IOException;

    /**
     * Created by 85213 on 2016/12/14.
     */
    public class SortMR {

        // FlowBean is the map output key, so the shuffle sorts records by FlowBean.compareTo.
        public static class SortMapper extends Mapper<LongWritable, Text, FlowBean, NullWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
                String line = value.toString();
                String[] fields = line.split("\t");
                String phoneNB = fields[0];
                long u_flow = Long.parseLong(fields[1]);
                long d_flow = Long.parseLong(fields[2]);
                context.write(new FlowBean(phoneNB, u_flow, d_flow), NullWritable.get());
            }
        }

        public static class SortReducer extends Reducer<FlowBean, NullWritable, Text, FlowBean> {
            @Override
            protected void reduce(FlowBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
                String phoneNB = key.getPhoneNB();
                context.write(new Text(phoneNB), key);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf);

            job.setJarByClass(SortMR.class);
            job.setMapperClass(SortMapper.class);
            job.setReducerClass(SortReducer.class);

            job.setMapOutputKeyClass(FlowBean.class);
            job.setMapOutputValueClass(NullWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(FlowBean.class);

            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Custom partitioning in MR

Implementing custom partitioning involves two things: 1. define the partitioning rule; 2. set the number of reduce tasks.

1. Define the partitioning rule
          
           
           
    package com.apollo.mr.areapartition;

    import org.apache.hadoop.mapreduce.Partitioner;

    import java.util.HashMap;

    /**
     * Created by 85213 on 2016/12/14.
     */
    public class AreaPartitioner<KEY, VALUE> extends Partitioner<KEY, VALUE> {

        private static HashMap<String, Integer> areaMap = new HashMap<>();

        static {
            areaMap.put("135", 0);
            areaMap.put("136", 1);
            areaMap.put("137", 2);
            areaMap.put("138", 3);
            areaMap.put("139", 4);
        }

        // The partition number returned here decides which reduce task receives the record.
        @Override
        public int getPartition(KEY key, VALUE value, int numPartitions) {
            // Take the phone prefix from the key and look it up in the area dictionary;
            // unknown prefixes fall into partition 5.
            Integer areaCode = areaMap.get(key.toString().substring(0, 3));
            return areaCode == null ? 5 : areaCode;
        }
    }
2. Set the number of reduce tasks
           
            
            
    package com.apollo.mr.areapartition;

    import com.apollo.mr.flowsum.FlowBean;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    import java.io.IOException;

    /**
     * Compute flow statistics from the raw traffic log and write the results for
     * different provinces to different output files. Two mechanisms are customized:
     * 1. the partitioning logic, via a custom Partitioner
     * 2. the number of reduce tasks
     * Created by 85213 on 2016/12/14.
     */
    public class FlowSumArea {

        public static class FlowSumAreaMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
            @Override
            protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
                String line = value.toString();
                String[] fields = line.split("\t");
                String phoneNB = fields[1];
                long u_flow = Long.parseLong(fields[7]);
                long d_flow = Long.parseLong(fields[8]);
                context.write(new Text(phoneNB), new FlowBean(phoneNB, u_flow, d_flow));
            }
        }

        public static class FlowSumAreaReducer extends Reducer<Text, FlowBean, Text, FlowBean> {
            @Override
            protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {
                long u_flow_count = 0;
                long d_flow_count = 0;
                for (FlowBean flowBean : values) {
                    u_flow_count += flowBean.getU_flow();
                    d_flow_count += flowBean.getD_flow();
                }
                context.write(key, new FlowBean(key.toString(), u_flow_count, d_flow_count));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf);

            job.setJarByClass(FlowSumArea.class);
            job.setMapperClass(FlowSumAreaMapper.class);
            job.setReducerClass(FlowSumAreaReducer.class);

            // Plug in our custom partitioning logic.
            job.setPartitionerClass(AreaPartitioner.class);
            // The number of reduce tasks should match the number of partitions (5 areas + 1 "other").
            job.setNumReduceTasks(6);

            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(FlowBean.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(FlowBean.class);

            // args[0] is the output path, args[1] is the input path.
            FileOutputFormat.setOutputPath(job, new Path(args[0]));
            FileInputFormat.setInputPaths(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Using a Combiner in Hadoop

In MapReduce, when a map task produces a large amount of output, network bandwidth becomes the bottleneck. How can we shrink the data handed to the reducers without affecting the final result? One option is the Combiner, often described as a "local reduce": the reducers' input is the combiners' output. Note that a reducer can only double as the combiner when its operation is commutative and associative (e.g. summing or taking a max).
         
          
          
    @Override
    public int run(String[] arg0) throws Exception {
        Configuration conf = getConf();
        Job job = Job.getInstance(conf, "demo1");

        String inputPath = ArgsTool.getArg(arg0, "input");
        String outputPath = ArgsTool.getArg(arg0, "output");
        FileInputFormat.addInputPath(job, new Path(inputPath));
        FileOutputFormat.setOutputPath(job, new Path(outputPath));

        job.setJarByClass(Demo1.class);
        job.setMapperClass(DemoMap.class);
        job.setReducerClass(DemoReduce.class);
        // Reuse the reducer as the combiner; it runs on the map side to pre-aggregate values.
        job.setCombinerClass(DemoReduce.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

The shuffle in Hadoop


Brief outline:

------------ shuffle ------------
I. Map-task shuffle
1. partition
2. sort
3. [combine] -- no combiner by default (pre-merges data in memory)

4. merge on disk -- the small spill files are merged into one large file

II. Fetch/copy
1. Each reducer fetches its own partition, by partition number, from the output files of all map tasks.

III. Reduce-side shuffle
1. sort -- the segments pulled from the different map tasks are interleaved, so they are merge-sorted again
2. group -- by default, records with equal keys form one group
------------ shuffle ------------

A detailed walk-through:


1. Buffer in memory: the map output is first collected in an in-memory buffer.

2. Partition
MapReduce provides the Partitioner interface. Based on the key (and value) and the number of reduce tasks, it decides which reduce task should handle the current output record. The default implementation hashes the key and takes it modulo the number of reduce tasks (an MR job has one reduce task by default). This default scheme simply spreads records evenly across the reducers; if you need different behaviour, write your own Partitioner and set it on the job.

The record is then written into the in-memory buffer, whose purpose is to collect map output in batches and reduce the impact of disk I/O. The key/value pair and its partition number are all written to the buffer; before being written, the key and value are serialized into byte arrays.
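For reference, the default behaviour described above corresponds to Hadoop's built-in HashPartitioner (org.apache.hadoop.mapreduce.lib.partition.HashPartitioner). The following is a minimal illustrative sketch of that logic; the class name HashLikePartitioner is made up for this example, not Hadoop's own source:

    import org.apache.hadoop.mapreduce.Partitioner;

    // Default-style partitioning: hash the key, mask off the sign bit,
    // then take the result modulo the number of reduce tasks.
    public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }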
3. Spill to disk
The in-memory buffer has a size limit, 100 MB by default. A map task that produces a lot of output could overflow it, so under certain conditions the buffered data is temporarily written to disk and the buffer is reused. This memory-to-disk write is called a spill. Spilling is done by a separate thread, so it does not block the thread that writes map results into the buffer. To keep map output flowing while the spill runs, the buffer has a spill threshold, spill.percent, which defaults to 0.8: once the buffered data reaches the threshold (buffer size * spill percent = 100 MB * 0.8 = 80 MB), the spill thread starts, locks those 80 MB and writes them out, while the map task keeps writing into the remaining 20 MB. The two activities do not interfere with each other.
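As a minimal sketch, these two knobs can be tuned on the job's Configuration in the driver; the snippet assumes the Hadoop 2.x property names (mapreduce.task.io.sort.mb for the buffer size, mapreduce.map.sort.spill.percent for the threshold):

    // In the driver, before creating the Job:
    Configuration conf = new Configuration();
    // Size of the map-side sort buffer, in MB (default 100).
    conf.setInt("mapreduce.task.io.sort.mb", 200);
    // Fraction of the buffer that triggers a spill (default 0.80).
    conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
    Job job = Job.getInstance(conf);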
4. Sort
When the spill thread starts, it sorts the keys in those 80 MB (records are ordered by partition, and by key within each partition). Sorting is the default behaviour of the MapReduce model, and here it operates on the serialized bytes.
5. Combine (merging the data in memory)
If a combiner has been configured, it runs at this point: it aggregates the values of key/value pairs that share the same key, which reduces the amount of data spilled to disk.
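For instance, a word-count style job could use a summing combiner like the illustrative sketch below (the class name SumCombiner is made up for this example); it collapses all counts for a key within one map task into a single record before the data is spilled and shipped:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Adds up the counts for one key locally, so only one record per key
    // per map task is spilled and sent to the reducers.
    public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }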
6. Merge (merging the spill files on disk)
Every spill produces one spill file on disk, so if the map output is large and several spills happen, several spill files accumulate. When the map task finishes, whatever is still in the buffer is also spilled, so there is always at least one spill file (if the map output is small, only one is produced). Because the map task must end up with a single output file, the spill files are merged together; this process is called merge.
Note that since merge combines several spill files into one, the same key may appear in more than one of them; if the client has configured a Combiner, it is applied again during the merge to combine records with the same key.
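A small sketch of the related tuning knobs, assuming the Hadoop 2.x property names (mapreduce.task.io.sort.factor controls how many spill files are merged in one pass, and mapreduce.map.combine.minspills sets the minimum number of spill files before the combiner is run again during the merge):

    // In the driver, on the same Configuration used to create the Job:
    // How many spill files / streams are merged in one pass (default 10).
    conf.setInt("mapreduce.task.io.sort.factor", 10);
    // Run the combiner during the merge only when at least this many spill files exist (default 3).
    conf.setInt("mapreduce.map.combine.minspills", 3);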

------------------
1. Copy. The reduce side simply pulls data: the reduce task starts a number of copy threads (Fetchers) that request the map tasks' output files over HTTP from the TaskTracker (NodeManager under YARN) on which each map task ran. Since the map tasks have already finished, those files are kept on that node's local disk.

2. Merge. The merge here works like the map-side merge, except that the data being merged was copied from different map tasks. Copied data first goes into an in-memory buffer; its size is more flexible than on the map side because it is based on the reduce task's JVM heap size, and since the reduce function does not run during the shuffle phase, most of the heap can be devoted to the shuffle. Merge takes three forms: 1) memory to memory, 2) memory to disk, 3) disk to disk. The first form is disabled by default, which can be confusing. When the amount of data held in memory reaches a threshold, the memory-to-disk merge starts; as on the map side this is a spill, the Combiner runs here too if one is configured, and many spill files are produced on disk. The memory-to-disk merge keeps running until there is no more map-side data to fetch, and then the disk-to-disk merge runs to produce the final file that feeds the reduce.
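A minimal sketch of the reduce-side shuffle knobs mentioned above, assuming the Hadoop 2.x property names (the parallel fetcher count, the fraction of the heap used to buffer copied map output, and the usage level that triggers the memory-to-disk merge):

    // In the driver, before creating the Job:
    Configuration conf = new Configuration();
    // Number of parallel fetcher threads per reduce task (default 5).
    conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 5);
    // Fraction of the reducer heap used to buffer copied map output (default 0.70).
    conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);
    // Buffer usage ratio that triggers the memory-to-disk merge (default 0.66).
    conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);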

3. Sort. The segments pulled from the different map tasks are each sorted internally but interleaved with one another, so they are merge-sorted again into one globally sorted sequence.

4. Group. After sorting, the records are grouped; by default two records belong to the same group when their keys are equal, and each group corresponds to one call of the reduce function.
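The grouping rule can be overridden with Job.setGroupingComparatorClass. As a purely illustrative sketch (the class name PhoneGroupingComparator is made up for this example), the FlowBean keys from the sort job could be grouped by phone number only, so that all records for one phone reach a single reduce call regardless of their flow values:

    import com.apollo.mr.flowsum.FlowBean;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Illustrative grouping comparator: two FlowBean keys belong to the same group
    // when their phone numbers match, regardless of the flow fields used for sorting.
    public class PhoneGroupingComparator extends WritableComparator {
        protected PhoneGroupingComparator() {
            super(FlowBean.class, true); // create instances so compare() receives deserialized keys
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            FlowBean x = (FlowBean) a;
            FlowBean y = (FlowBean) b;
            return x.getPhoneNB().compareTo(y.getPhoneNB());
        }
    }

    // In the driver: job.setGroupingComparatorClass(PhoneGroupingComparator.class);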

How the split size is computed

MapReduce split size (a worked sketch follows this list):
- max.split (100 MB)
- min.split (10 MB)
- block (64 MB)
- splitSize = max( min.split , min( max.split , block ) )

With these values, min(100 MB, 64 MB) = 64 MB and max(10 MB, 64 MB) = 64 MB, so each split covers exactly one block.
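A minimal sketch of the formula in Java, using the example values above (the class and method names are made up for illustration; in Hadoop 2.x the inputs correspond to mapreduce.input.fileinputformat.split.maxsize, mapreduce.input.fileinputformat.split.minsize and the file's block size):

    public class SplitSize {

        // splitSize = max(minSplit, min(maxSplit, blockSize))
        static long computeSplitSize(long blockSize, long minSplit, long maxSplit) {
            return Math.max(minSplit, Math.min(maxSplit, blockSize));
        }

        public static void main(String[] args) {
            long mb = 1024L * 1024L;
            // max.split = 100 MB, min.split = 10 MB, block = 64 MB  ->  split = 64 MB (one block)
            System.out.println(computeSplitSize(64 * mb, 10 * mb, 100 * mb) / mb + " MB");
        }
    }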

Compressing MapReduce output
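A minimal sketch of how compression is typically enabled in the driver, assuming GzipCodec is available on the cluster and using the Hadoop 2.x property names for map-output compression:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // In the driver:
    Configuration conf = new Configuration();
    // Compress the intermediate map output to cut shuffle traffic.
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec", GzipCodec.class, CompressionCodec.class);
    Job job = Job.getInstance(conf);
    // Compress the final job output files.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);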

Reposted from blog.csdn.net/wang11yangyang/article/details/58140157