MapReduce example: counting the words in a text file

1. Environment

Hadoop 2.8.1
The input file is uploaded to HDFS; the program reads it from HDFS, runs the computation, and writes the result back to HDFS.

2. Preparation

2.1 Upload word.txt to HDFS

Contents of word.txt:

Could not obtain block, Could not obtain block, Could not obtain block
Could not obtain block
Could not obtain block
Could not obtain block
> hdfs dfs -put ./word.txt /user/zhangsan/
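The same upload can also be done programmatically. A minimal sketch using the HDFS Java API, assuming the NameNode address used later in this post (172.16.29.11:9000) and a local ./word.txt:

package wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadWordFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address, taken from the job driver below
        conf.set("fs.defaultFS", "hdfs://172.16.29.11:9000");
        FileSystem fs = FileSystem.get(conf);
        // Copy the local word.txt into the HDFS user directory
        fs.copyFromLocalFile(new Path("./word.txt"), new Path("/user/zhangsan/word.txt"));
        fs.close();
    }
}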

2.2 Add the dependency JARs to the project:

hadoop-2.8.1\share\hadoop\hdfs\hadoop-hdfs-2.8.1.jar
all JARs under hadoop-2.8.1\share\hadoop\hdfs\lib\

hadoop-2.8.1\share\hadoop\common\hadoop-common-2.8.1.jar
all JARs under hadoop-2.8.1\share\hadoop\common\lib\

all JARs under hadoop-2.8.1\share\hadoop\mapreduce\ except hadoop-mapreduce-examples-2.8.1.jar
all JARs under hadoop-2.8.1\share\hadoop\mapreduce\lib\

all JARs under hadoop-2.8.1\share\hadoop\yarn\
all JARs under hadoop-2.8.1\share\hadoop\yarn\lib\

3. Code

3.1 The Mapper class

package wordcount;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
/*
 * KEYIN: type of the key in the input kv pairs
 * VALUEIN: type of the value in the input kv pairs
 * KEYOUT: type of the key in the output kv pairs
 * VALUEOUT: type of the value in the output kv pairs
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    /*
     * The map method is called by the map task: the task reads the input one line
     * at a time and invokes our map method once per line.
     * Parameters passed on each call:
     *      key:   the byte offset of the line within the file (LongWritable)
     *      value: the text of the line (Text)
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Get the line as a String
        String line = value.toString();
        // Split the line into words
        String[] words = line.split(" ");

        // Emit <word, 1> for each word
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
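Creating a new Text and a new IntWritable for every word works, but a common optimization is to reuse the Writable instances across calls, since the framework serializes each kv pair as soon as context.write returns. An illustrative variant of the map method (not part of the original example):

    // Reused output objects; safe because context.write serializes immediately
    private final Text word = new Text();
    private final IntWritable one = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        for (String w : value.toString().split(" ")) {
            word.set(w);               // reuse the same Text object
            context.write(word, one);  // emit <word, 1>
        }
    }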

3.2 The Reducer class

package wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.io.Text;

/*
 * KEYIN: the key type output by the mapper
 * VALUEIN: the value type output by the mapper
 * KEYOUT: the key type of the final result written by the reducer
 * VALUEOUT: the value type of the final result written by the reducer
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    /*
     * The reduce method is called by the reduce task.
     *
     * The reduce task groups the kv pairs delivered by the shuffle phase: pairs with the
     * same key form one group, and our reduce method is invoked once per group.
     * For example, given <hello,1><hello,1><hello,1><tom,1><tom,1><tom,1>,
     * reduce is called once for the hello group and once for the tom group.
     * Parameters passed on each call:
     *      key:    the key shared by the group
     *      values: an iterator over all values in the group
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Counter for this word
        int count = 0;
        // Iterate over all values in the group and accumulate
        for (IntWritable value : values) {
            count += value.get();
        }

        // Emit the final count for this word
        context.write(key, new IntWritable(count));
    }
}

3.3 The job driver class

package wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


public class WordCountJobSubmitter {  

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {  
        Configuration conf = new Configuration();  
        Job wordCountJob = Job.getInstance(conf);  

        // Important: tell the framework which jar this job's classes live in
        wordCountJob.setJarByClass(WordCountJobSubmitter.class);

        // Set the mapper class for this job
        wordCountJob.setMapperClass(WordCountMapper.class);
        // Set the reducer class for this job
        wordCountJob.setReducerClass(WordCountReducer.class);

        // Set the kv types output by the map phase
        wordCountJob.setMapOutputKeyClass(Text.class);
        wordCountJob.setMapOutputValueClass(IntWritable.class);

        // Set the kv types of the final output
        wordCountJob.setOutputKeyClass(Text.class);
        wordCountJob.setOutputValueClass(IntWritable.class);

        // Input path of the text to process, and the output path for the results
        FileInputFormat.setInputPaths(wordCountJob, "hdfs://172.16.29.11:9000/user/zhangsan/word.txt");
        FileOutputFormat.setOutputPath(wordCountJob, new Path("hdfs://172.16.29.11:9000/output/"));

        // Submit the job to the Hadoop cluster and wait for it to finish
        wordCountJob.waitForCompletion(true);
    }  
}  
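Two optional additions to the driver, shown here as a sketch rather than as part of the original example: because summing counts is associative and commutative, the same WordCountReducer can be registered as a combiner to shrink the shuffle data, and the boolean returned by waitForCompletion can be turned into a process exit code:

        // Optional: run the reducer as a map-side combiner to reduce shuffle traffic
        wordCountJob.setCombinerClass(WordCountReducer.class);

        // Optional: report success/failure through the exit code
        boolean success = wordCountJob.waitForCompletion(true);
        System.exit(success ? 0 : 1);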

3.4 Debug and run in Eclipse

3.5 Package and run

Package the project as mapreduce.jar and run it:

> hadoop jar mapreduce.jar
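If the jar's manifest does not declare a Main-Class, the driver class has to be named explicitly on the command line (class name taken from the code above):

> hadoop jar mapreduce.jar wordcount.WordCountJobSubmitter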

Open the YARN web UI at http://172.16.29.11:8088/cluster/apps to confirm that the job completed successfully.

Note: the output directory must not already exist. If it does, the job fails with: Output directory hdfs://172.16.29.11:9000/output already exists
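One way to avoid this error is to delete the output directory from the driver before submitting the job. A minimal sketch that could replace the setOutputPath line in WordCountJobSubmitter (it needs an extra import of org.apache.hadoop.fs.FileSystem):

        // Delete the output directory if it already exists, so the job can be rerun
        Path outputPath = new Path("hdfs://172.16.29.11:9000/output/");
        FileSystem fs = outputPath.getFileSystem(conf);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true);   // true = delete recursively
        }
        FileOutputFormat.setOutputPath(wordCountJob, outputPath);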

View the output:

> hdfs dfs -ls /output/  # list the output files
> hdfs dfs -cat /output/part-r-00000  # view the contents of this file
Could   6
block   4
block,  2
not     6
obtain  6
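Note that the mapper splits only on single spaces, which is why "block," (with the trailing comma) and "block" are counted as separate words above. If punctuation should be ignored, the split in WordCountMapper could be changed along these lines (an illustrative variant, not the original code):

        // Split on runs of non-word characters so "block," and "block" are counted together
        String[] words = line.split("\\W+");
        for (String word : words) {
            if (!word.isEmpty()) {          // skip empty tokens caused by leading punctuation
                context.write(new Text(word), new IntWritable(1));
            }
        }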

Reference: https://blog.csdn.net/litianxiang_kaola/article/details/71154302

Reposted from: https://blog.csdn.net/apple9005/article/details/80582845