MapReduce overview (Big Data learning route)

  MapReduce is an offline, distributed, parallel computing framework: a programming framework for distributed computing programs and the core framework for developing "Hadoop-based data analysis applications". The core function of MapReduce is to integrate the user's own business logic code with its built-in default components into a complete distributed computing program that runs concurrently on a Hadoop cluster.

  The principle is similar to how HDFS solves its problem: HDFS splits a large file into many smaller pieces and stores them on the hosts of the cluster.

  In the same way, MapReduce splits a complex computation into sub-computations, hands them to the hosts of the cluster, and the hosts run them in parallel.

  1.1 Background of MapReduce

  Because of hardware resource constraints, a single machine cannot process massive amounts of data.

  Once a single-machine program is extended to run distributed across a cluster, the complexity and difficulty of development increase greatly.

  With the MapReduce framework, developers can concentrate most of their work on the business logic, while the complexity of distributed computing is handled by the framework.

  1.2 MapReduce programming model

  MapReduce is a distributed computing model.

  It abstracts parallel computation into two functions:

  Map (mapping): apply a specified operation to each element of a list of independent elements; the elements can be processed with a high degree of parallelism.

  Reduce (reduction): merge the elements of a list into a result.

  A simple MapReduce program only needs to specify map(), reduce(), the input, and the output; the framework takes care of everything else.
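
  To make the two abstractions concrete, here is a minimal sketch in plain Java (an analogy only, not the Hadoop API) that counts words in a small list: each element is independently mapped to a (word, 1) pair, and pairs with the same key are then reduced by summing. The class name MapReduceIdea is made up for this illustration.

 import java.util.Arrays;
 import java.util.List;
 import java.util.Map;
 import java.util.stream.Collectors;

 public class MapReduceIdea {
     public static void main(String[] args) {
         List<String> words = Arrays.asList("java", "linux", "java");
         // "map": turn each element independently into a (word, 1) pair
         // "reduce": merge pairs that share a key by summing their 1s
         Map<String, Integer> counts = words.stream()
                 .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));
         System.out.println(counts);   // e.g. {java=2, linux=1}
     }
 }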



1.3 Key MapReduce terms

  Job: every computation request submitted by a user is called a job.

  Task: every job has to be split up and handed to multiple hosts to complete; each unit produced by this split is a task.

  Tasks come in the following three types:

  Map: responsible for the map phase of the overall data-processing flow

  Reduce: responsible for the reduce phase of the overall data-processing flow

  MRAppMaster: responsible for scheduling the whole program's execution and coordinating its state


1.4 MapReduce execution flow


The flow is as follows:

When an MR program starts, MRAppMaster is launched first. Once MRAppMaster is up, it uses the job's description information to calculate the number of maptask instances required and applies to the cluster for machines to start that many maptask processes.

After a maptask process starts, it processes the data in its assigned slice. The main steps are:

It uses the RecordReader of the user-specified inputformat to read the data and form input KV pairs.

The input KV pairs (k is the starting offset of the line in the file, v is the content of that line) are passed to the user-defined map() method, which runs the business logic and writes the KV pairs it collects to a cache.

The KV pairs in the cache are partitioned by K, sorted, and spilled to files on disk.

Once MRAppMaster observes that all maptask processes have completed their tasks, it starts the number of reducetask processes specified by the client's parameters and tells each reducetask process which range of data (data partition) to handle.

After a reducetask process starts, MRAppMaster tells it where the data it has to process is located. The reducetask fetches several maptask output files from the machines where the maptasks ran, merge-sorts them locally, groups KV pairs with the same key, calls the user-defined reduce() method once per group to run the business logic, collects the resulting output KV pairs, and finally calls the user-specified outputformat to write the result data to external storage.
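
As a side note on the "data partition" step above, the sketch below (an assumed illustration, not part of the original post's code) shows the two standard Job settings that control it: setNumReduceTasks decides how many reducetask processes MRAppMaster will start, and the partitioner class decides which partition, and therefore which reducetask, each map-output key is sent to. HashPartitioner is Hadoop's default; the class name PartitionSketch is made up.

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

 public class PartitionSketch {
     public static void main(String[] args) throws Exception {
         Job job = Job.getInstance(new Configuration());
         // MRAppMaster will start this many reducetask processes,
         // i.e. the map output is divided into 3 data partitions
         job.setNumReduceTasks(3);
         // the partitioner decides which partition (and therefore which reducetask)
         // each map-output key goes to; HashPartitioner is the default
         job.setPartitionerClass(HashPartitioner.class);
     }
 }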

1.5 Writing MapReduce programs
  • Writing a distributed parallel program based on the MapReduce computing model is very simple; the programmer's main coding work is to implement the Map and Reduce functions.
  • All the other complications of parallel programming, such as distributed storage, job scheduling, load balancing, fault tolerance, and network communication, are handled by the YARN framework.
  • In MapReduce, the map and reduce functions follow this general form:

 map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)

  • The Mapper interface:

 protected void map(KEY key, VALUE value, Context context)
    throws IOException, InterruptedException {  
}

  • The Reducer interface:

 protected void reduce(KEY key, Iterable<VALUE> values,
 Context context) throws IOException, InterruptedException {
}

  • Basic code structure of a MapReduce program


 2 MapReduce example development

2.1 Programming steps

The program written by the user is divided into three parts: Mapper, Reducer, and Driver (the client that submits and runs the MR program).

The Mapper's input data is in KV-pair form (the types of K and V can be customized).

The Mapper's output data is in KV-pair form (the types of K and V can be customized).

The Mapper's business logic is written in the map() method.

The map() method is called (by the maptask process) once for every <K,V> pair.

The Reducer's input data types correspond to the Mapper's output data types, and are also KV pairs.

The Reducer's business logic is written in the reduce() method.

The reducetask process calls the reduce() method once for each group of <k,v> pairs that share the same k.

Both the user-defined Mapper and Reducer must extend their respective parent classes.

The whole program needs a Driver to submit it; what is submitted is a job object describing all the necessary information.

2.2 Writing the classic wordcount program

Requirement: given a batch of files (on the TB or PB scale), count how many times each word appears across all of them.

 For example, there are three files, named qf_course.txt, qf_stu.txt, and qf_teacher.

 Contents of qf_course.txt:

 php java linux
bigdata VR
C C++ java web
linux shell

 Contents of qf_stu.txt:

 tom jim lucy
lily sally
andy
tom jim sally

 Contents of qf_teacher:

 jerry Lucy tom
jim

Approach

– Count the occurrences of each word within each file - map()

– Add up the occurrences of the same word across the different files - reduce()

Implementation

– Create a simple Maven project

– Add the hadoop-client dependency jar; the main content of pom.xml is as follows:

 <dependencies>
   <dependency>
     <groupId>org.apache.hadoop</groupId>
     <artifactId>hadoop-client</artifactId>
     <version>2.7.1</version>
   </dependency>

   <dependency>
     <groupId>junit</groupId>
     <artifactId>junit</artifactId>
     <version>4.11</version>
     <scope>test</scope>
   </dependency>
 </dependencies>
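
Note: the hadoop-client version declared here (2.7.1) should match, or at least be compatible with, the Hadoop version installed on the cluster; the Driver comments below also assume a hadoop-2.7.1 installation.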

 

– Write the code

– Define a custom mapper class

 import java.io.IOException;

 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Mapper;

 /**
  * The four generic type parameters of Mapper, from left to right, are:
  *
  * LongWritable KEYIN: by default, the starting offset of the line of text read by the MR framework, a Long,
  *   similar to a line number; Hadoop has its own more compact serialization interface, so LongWritable is used instead of Long
  * Text VALUEIN: by default, the content of the line of text read by the MR framework, a String; as above, Text is used
  *
  * Text KEYOUT: the key of the output data after the user-defined logic has run; here it is the word, a String; as above, Text is used
  * IntWritable VALUEOUT: the value of the output data after the user-defined logic has run; here it is the word count, an Integer; as above, IntWritable is used
  */
 public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable>{

     /**
      * The business logic of the map phase is written in this custom map() method.
      * maptask calls our custom map() method once for every line of input data.
      */
     @Override
     protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

         // first convert the line of text that maptask hands us into a String
         String line = value.toString();
         // split the line into words on spaces
         String[] words = line.split(" ");

         /**
          * output each word as <word,1>
          * <lily,1> <lucy,1>  <c,1> <c++,1> <tom,1>
          */
         for(String word:words){
             // use the word as the key and the count 1 as the value, so the data can later be
             // distributed by word and identical words end up at the same reduce task
             context.write(new Text(word), new IntWritable(1));
         }
     }
 }

 

– Define a custom reducer class

 import java.io.IOException;

 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Reducer;

 /**
  * The four generic type parameters of Reducer, from left to right, are:
  *  Text KEYIN: corresponds to the mapper's KEYOUT
  *  IntWritable VALUEIN: corresponds to the mapper's VALUEOUT
  *
  *  KEYOUT is the word
  *  VALUEOUT is the output data type of the custom reduce logic's result; here it is the total count
  */
 public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{

     /**
      * <tom,1>
      * <tom,1>
      * <linux,1>
      * <banana,1>
      * <banana,1>
      * <banana,1>
      * The key parameter is the key shared by a group of kv pairs for the same word;
      * values is the collection of values that share that key:
      *  <tom,[1,1]>   <linux,[1]>   <banana,[1,1,1]>
      */
     @Override
     protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

         int count = 0;  // accumulates the number of occurrences of the word

         for(IntWritable value:values){
             count += value.get();
         }
         context.write(key, new IntWritable(count));
     }
 }

 

– Write a Driver class

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 /**
  * Acts as a client of the YARN cluster.
  * It packages the runtime parameters of our MR program, specifies the jar,
  * and finally submits the job to YARN.
  */
 public class WordcountDriver {
     /**
      * This class runs on the Hadoop client. As soon as main runs, the YARN client starts up
      * and communicates with the YARN server; the YARN server is responsible for launching the
      * MapReduce program and using the WordcountMapper and WordcountReducer classes.
      */
     public static void main(String[] args) throws Exception {
         // this code needs two input parameters: the first specifies the source files to process;
         // the second is the output path for the results
         if (args == null || args.length == 0) {
             args = new String[2];
             // both paths are HDFS file paths
             args[0] = "hdfs://192.168.18.64:9000/wordcount/input/";
             args[1] = "hdfs://192.168.18.64:9000/wordcount/output";
         }
         /**
          * When nothing is set explicitly and the program runs on a machine with Hadoop installed,
          * /home/hadoop/app/hadoop-2.7.1/etc/hadoop/core-site.xml
          * is read automatically and loaded into the Configuration
          */
         Configuration conf = new Configuration();
         Job job = Job.getInstance(conf);

         // specify the local path of the jar that contains this program
         job.setJarByClass(WordcountDriver.class);

         // specify the mapper class this job uses
         job.setMapperClass(WordcountMapper.class);
         // specify the kv types of the mapper's output data
         job.setMapOutputKeyClass(Text.class);
         job.setMapOutputValueClass(IntWritable.class);

         // specify the Reducer class this job uses
         job.setReducerClass(WordcountReducer.class);
         // specify the kv types of the final output data
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(IntWritable.class);

         // specify the directory containing the job's raw input files
         FileInputFormat.setInputPaths(job, new Path(args[0]));
         // specify the directory for the job's output results
         FileOutputFormat.setOutputPath(job, new Path(args[1]));

         // submit the job's configured parameters, and the jar containing the Java classes the job uses, to YARN
         /*job.submit();*/
         boolean res = job.waitForCompletion(true);
         System.exit(res?0:1);
     }
 }
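
Once the three classes are written, the project is typically packaged into a jar (for example with mvn package) and submitted from a machine with Hadoop installed using the hadoop jar command, passing the HDFS input and output paths as the two arguments. Note that the output directory must not exist beforehand, otherwise FileOutputFormat will reject the job.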

Wordcount processing flow

The input files are split into splits. Because the test files are small, each file becomes one split, and each file is divided by line into <key,value> pairs. This step is done automatically by the MapReduce framework; the offset used as the key includes the characters taken by the line break (which differs between Windows and Linux).


The split <key,value> pairs are handed to the user-defined map method, which processes them and produces new <key,value> pairs.


After the map method's <key,value> output is obtained, the Mapper sorts the pairs by key and, when a combiner is configured, runs the combine step, which adds up the values of identical keys to produce the Mapper's final output.
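
If a combiner is wanted, the reducer itself can be reused for it, which is a common choice for word count. The line below is an assumed addition, not part of the original Driver shown above:

 // assumed addition to WordcountDriver.main(), before job.waitForCompletion(true):
 // reuse the reducer as a combiner so identical keys are pre-summed on the map side
 job.setCombinerClass(WordcountReducer.class);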


The Reducer first sorts the data received from the Mappers, then hands it to the user-defined reduce method, which produces new <key,value> pairs that become the output of WordCount.
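
A small worked illustration of these stages, using the sample files above (byte offsets assume Unix-style "\n" line endings):

 split qf_course.txt by line into <key,value> pairs (key = byte offset of the line):
   <0,"php java linux">  <15,"bigdata VR">  <26,"C C++ java web">  <41,"linux shell">

 map() output:
   <php,1> <java,1> <linux,1> <bigdata,1> <VR,1>
   <C,1> <C++,1> <java,1> <web,1> <linux,1> <shell,1>

 after shuffle/sort, values of identical keys are grouped (across all three files, e.g. tom):
   <java,[1,1]>  <linux,[1,1]>  <tom,[1,1,1]>  ...

 reduce() output:
   <java,2>  <linux,2>  <tom,3>  ...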



Origin www.cnblogs.com/gcghcxy/p/11346514.html