Hadoop 2.X MR Job Flow
Scenario overview: As a high-level layer built on top of HDFS, MR is designed for offline data computation over a large distributed file system. MR jobs are best suited to scenarios where computation latency is not critical; MR can play different roles at the various stages of an ETL pipeline, and in some scenarios a chain of MR jobs can carry out the entire ETL flow.
MR overview: Hadoop MR (MapReduce) is a software framework for processing large volumes of offline data in batch, running on large clusters of commodity hardware in a reliable, fault-tolerant manner. Its core idea is to turn a job's input into many map tasks that run in parallel, then reduce their outputs to produce the final result. A job's input and output are both stored on the file system, and the MR framework takes care of scheduling the tasks, monitoring them, and re-running the ones that fail.
| Because moving computation is cheaper than moving data, the MR compute nodes and the data nodes are typically the same nodes.
| The MR framework consists of a single master ResourceManager, one NodeManager on each cluster node, and one MRAppMaster per application.
| At a minimum, an MR program specifies its input/output and provides map and reduce functions, typically by implementing the appropriate interfaces or abstract classes; these, together with other job parameters, make up the job configuration.
| A Hadoop job is submitted by a job client: the job client hands the job and its configuration to the ResourceManager, which distributes them to the cluster nodes, schedules and monitors the tasks, and provides status and diagnostic information back to the job client.
MR job flow (diagram; image sourced from the internet)
Input and output:
| An MR job works entirely on <key, value> pairs: the framework takes a set of <key, value> pairs as the job's input and produces a set of <key, value> pairs as its output, and the data types may change along the way.
| Since jobs run distributed, MR requires every key and value to be serializable by implementing the Writable interface; key classes must additionally implement WritableComparable so the framework can sort records by key (a minimal sketch of such a key follows the pipeline below).
| The input/output of an MR job goes through an evolution of the following shape:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
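To make the Writable/WritableComparable requirement concrete, here is a minimal sketch of a custom key type; the class and its fields (CarKey, plate, ts) are hypothetical and not part of the original text:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: sorts by plate, then by timestamp.
public class CarKey implements WritableComparable<CarKey> {
    private String plate = "";
    private long ts;

    public void write(DataOutput out) throws IOException {
        out.writeUTF(plate);  // serialize fields in a fixed order
        out.writeLong(ts);
    }

    public void readFields(DataInput in) throws IOException {
        plate = in.readUTF(); // deserialize in the same order
        ts = in.readLong();
    }

    public int compareTo(CarKey o) {
        int c = plate.compareTo(o.plate);
        return c != 0 ? c : Long.compare(ts, o.ts);
    }

    // Keep hashCode consistent with equals so HashPartitioner
    // routes equal keys to the same reducer.
    @Override
    public int hashCode() { return plate.hashCode(); }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof CarKey && compareTo((CarKey) obj) == 0;
    }
}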
MR interfaces
| The Mapper interface maps the job's input to key/value sets: a Mapper consumes the input key/value pairs and generates intermediate key/value records, and a given input pair may produce zero or more output pairs. The Mapper interface is defined as follows:
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    …
    public void run(Context context) throws IOException, InterruptedException {
    }
}
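For orientation, the default run() implementation drives the task lifecycle roughly as follows (a simplified sketch of the Hadoop source; setup/map/cleanup are the methods subclasses usually override):

public void run(Context context) throws IOException, InterruptedException {
    setup(context);                  // called once, before any input
    while (context.nextKeyValue()) { // iterate over the task's input split
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);                // called once, after all input is consumed
}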
| Shuffle: the output of the Mapper stage is shuffled to the reducers, with the data being sorted along the way.
| Sort: by default the MR framework groups the Reducer's input by key; what a reduce call processes is the Mapper-stage output after this key-based grouping.
| Secondary Sort: a custom RawComparator registered via Job.setSortComparatorClass(Class) controls how intermediate keys are sorted, and Job.setGroupingComparatorClass(Class) specifies how keys are grouped before reduce is invoked; combining the two yields a secondary sort, as sketched below.
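A minimal sketch of the grouping side, assuming a hypothetical composite Text key laid out as "natural#secondary" (the class names and key layout are assumptions, not from the original):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Groups composite keys by their natural part only, so all "natural#*"
// keys reach a single reduce() call, already sorted by the full key.
public class NaturalKeyGroupingComparator extends WritableComparator {
    protected NaturalKeyGroupingComparator() {
        super(Text.class, true); // true => instantiate keys for deserialization
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        String left = a.toString().split("#")[0];
        String right = b.toString().split("#")[0];
        return left.compareTo(right);
    }
}

// Wiring (FullKeyComparator is a hypothetical sort comparator over the whole key):
// job.setSortComparatorClass(FullKeyComparator.class);
// job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);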
| The Reducer interface is used to merge the Mappers' intermediate results; a single MR job may run several reducers concurrently. Reduce involves three phases: shuffle, sort, and reduce. The Reducer interface is as follows:
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    …
    protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
            throws IOException, InterruptedException {
    }
}
| Reducer NONE: the MR framework allows a job to have only a Mapper and no Reduce phase; in that case the Mapper output is written directly to the file system, and no sorting takes place before it is written, as shown below.
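Switching a job into this map-only mode is a one-line configuration:

// In the driver, before submission: zero reduce tasks makes the job map-only;
// each map task's output goes straight to the output path, unsorted.
job.setNumReduceTasks(0);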
| Partitioner: the Partitioner divides the Mappers' intermediate results into partitions by key, and the number of partitions equals the number of reduce tasks; it therefore controls which of the reducers each intermediate key is sent to (the ITGSParition class in the Top-N example below is a concrete custom Partitioner).
| Counter: a counting facility provided by MR for collecting job statistics, illustrated below.
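Counters are typically incremented inside map() or reduce() and read back in the driver once the job finishes; the mapper class and the group/counter names below are hypothetical:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper that tallies unparsable records via a counter.
class CountingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            Integer.parseInt(value.toString().trim());
            context.write(value, new IntWritable(1));
        } catch (NumberFormatException e) {
            // Group and counter names are arbitrary strings.
            context.getCounter("Quality", "MALFORMED_RECORDS").increment(1);
        }
    }
}

// Driver side, after job.waitForCompletion(true):
//   long bad = job.getCounters().findCounter("Quality", "MALFORMED_RECORDS").getValue();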
Appendix:
| A default MR program with the implicit default settings spelled out; this example is complete and can be run directly:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class T {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "TEST");
        // Default input configuration
        {
            /* key: LongWritable (byte offset); value: Text (the line) */
            job.setInputFormatClass(TextInputFormat.class);
        }
        // Default Mapper configuration (the identity Mapper passes pairs
        // through unchanged, so map output types match the input types)
        {
            job.setMapperClass(Mapper.class);
            job.setMapOutputKeyClass(LongWritable.class);
            job.setMapOutputValueClass(Text.class);
        }
        // Default map routing algorithm
        {
            /* Hashes each record's key to decide which partition it belongs to;
               each partition corresponds to one reduce task
               ==> the number of partitions equals the job's number of reducers */
            job.setPartitionerClass(HashPartitioner.class);
        }
        // Default Reduce configuration
        {
            /* The data has been sorted by key before reaching reduce */
            job.setReducerClass(Reducer.class);
            job.setNumReduceTasks(1);
        }
        // Default output configuration
        {
            /* Converts each key/value pair to strings separated by a tab and
               writes them out one record per line */
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
        }
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        if (!job.waitForCompletion(true))
            return;
    }
}
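Run unchanged on a single text file, this job is an identity pass: each output line is the record's byte offset, a tab, and the original line, sorted by offset within the lone reduce task.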
| A WordCount implementation; this example is complete:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "WordCount");
        job.setInputFormatClass(CombineTextInputFormat.class);
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMaper.class);
        job.setReducerClass(WordReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        CombineTextInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        if (!job.waitForCompletion(true))
            return;
    }
}

class WordMaper extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable ikey, Text ivalue, Context context)
            throws IOException, InterruptedException {
        Text word = null;
        final IntWritable one = new IntWritable(1);
        // Tokenize the line on whitespace and emit (word, 1) for each token
        StringTokenizer st = new StringTokenizer(ivalue.toString());
        while (st.hasMoreTokens()) {
            word = new Text(st.nextToken());
            context.write(word, one);
        }
    }
}

class WordReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable intw : values) {
            sum += intw.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
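Because the per-word sums are associative and commutative, the same WordReduce class could also be registered as a combiner to shrink the shuffle, e.g. job.setCombinerClass(WordReduce.class); this registration is a suggested addition, not part of the original example.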
| A slightly more complex example: a Top-N implementation
import java.io.IOException;
import java.util.TreeMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CarTopN {
    public static final String TOPN = "TOPN";

    public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);
        Path output = new Path(args[1]);
        Integer N = Integer.parseInt(args[2]);
        Configuration conf = new Configuration();
        // Define the N
        conf.setInt(CarTopN.TOPN, N);
        Job job = Job.getInstance(conf, "CAR_Top10_BY_TGSID");
        job.setJarByClass(CarTopN.class);
        job.setInputFormatClass(CombineTextInputFormat.class);
        job.setMapperClass(CarTopNMapper.class);
        // Not used
        // job.setCombinerClass(CarTopNCombine.class);
        job.setReducerClass(CarTopNReduce.class);
        job.setNumReduceTasks(1);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setPartitionerClass(ITGSParition.class);
        FileSystem fs = FileSystem.get(conf);
        // Pre-process the input: only read files whose writer has finished
        // (marked by a companion path ending in .writed) and whose size is
        // greater than 0
        {
            FileStatus childs[] = fs.globStatus(input, new PathFilter() {
                public boolean accept(Path path) {
                    if (path.toString().endsWith(".writed")) {
                        return true;
                    }
                    return false;
                }
            });
            Path temp = null;
            for (FileStatus file : childs) {
                // String.replace, not replaceAll: ".writed" would be a regex
                temp = new Path(file.getPath().toString().replace(".writed", ""));
                if (fs.listStatus(temp)[0].getLen() > 0) {
                    FileInputFormat.addInputPath(job, temp);
                }
            }
        }
        CombineTextInputFormat.setMaxInputSplitSize(job, 67108864);
        // Force-clean the output directory
        if (fs.exists(output)) {
            fs.delete(output, true);
        }
        FileOutputFormat.setOutputPath(job, output);
        if (!job.waitForCompletion(true))
            return;
    }
}

class ITGSParition extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        return (Math.abs(key.hashCode())) % numPartitions;
    }
}

class CarTopNMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value,
            Mapper<LongWritable, Text, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        String temp = value.toString();
        if (temp.length() > 13) {
            temp = temp.substring(12);
            String[] items = temp.split(",");
            if (items.length > 14) { // must cover items[14] below
                // TGSID as the key
                try {
                    String tgsid = items[14].substring(6);
                    Integer.parseInt(tgsid);
                    context.write(new Text(tgsid), new IntWritable(1));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }
}

// Not registered with the job (see "Not used" above)
class CarTopNCombine extends Reducer<Text, IntWritable, Text, IntWritable> {
    // An ordered map that always holds at most the top N entries and never
    // stores excess data (note: entries with equal counts overwrite each other)
    private final TreeMap<Integer, String> tm = new TreeMap<Integer, String>();
    private int N;

    @Override
    protected void setup(Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        N = conf.getInt(CarTopN.TOPN, 10);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        Integer weight = 0;
        for (IntWritable iw : values) {
            weight += iw.get();
        }
        tm.put(weight, key.toString());
        // Keep only the top N
        if (tm.size() > N) {
            tm.remove(tm.firstKey());
        }
    }

    @Override
    protected void cleanup(Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        // Emit the final results to the next stage
        for (Integer key : tm.keySet()) {
            context.write(new Text("byonet:" + tm.get(key)), new IntWritable(key));
        }
    }
}

// Core Top-N computation.
// Avoid storing Hadoop Writable types in Java collections where possible;
// instance reuse by the framework can cause strange bugs.
class CarTopNReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final TreeMap<Integer, String> tm = new TreeMap<Integer, String>();
    private int N;

    @Override
    protected void setup(Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        N = conf.getInt(CarTopN.TOPN, 10);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        Integer weight = 0;
        for (IntWritable iw : values) {
            weight += iw.get();
        }
        tm.put(weight, key.toString());
        if (tm.size() > N) {
            tm.remove(tm.firstKey());
        }
    }

    @Override
    protected void cleanup(Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        for (Integer key : tm.keySet()) {
            context.write(new Text("byonet:" + tm.get(key)), new IntWritable(key));
        }
    }
}