Word Frequency Analysis with the MapReduce Framework (Case 1)

  • When writing a MapReduce program, the key-value input and output data must use the types provided by Hadoop rather than Java's built-in types, for example long → LongWritable, int → IntWritable, String → Text.
  • Communication between nodes uses RPC: the RPC protocol serializes a message into a binary byte stream and sends it to the remote node,
    which deserializes the stream back into the original message.
  • To use a custom key-value type in a MapReduce program, the type must implement the corresponding interface, such as Writable or WritableComparable (a sketch follows this list).
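
For illustration, here is a minimal sketch (not from the original article) of what such a custom key type might look like; the WordPair class and its two string fields are hypothetical:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class WordPair implements WritableComparable<WordPair> {
    private String first = "";
    private String second = "";

    //Hadoop instantiates key types via reflection, so an empty constructor is required
    public WordPair() {}

    @Override
    public void write(DataOutput out) throws IOException {
        //serialize the fields in a fixed order
        out.writeUTF(first);
        out.writeUTF(second);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        //deserialize the fields in the same order they were written
        first = in.readUTF();
        second = in.readUTF();
    }

    @Override
    public int compareTo(WordPair other) {
        //keys are sorted by this comparison during the shuffle
        int cmp = first.compareTo(other.first);
        return cmp != 0 ? cmp : second.compareTo(other.second);
    }
}

In practice a custom key should also override hashCode() and equals(), because the default HashPartitioner uses hashCode() to decide which reducer a key is sent to.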

Mapper code

In the map phase every line of input is split into words, and each word is emitted with a count of 1.
* Four type parameters:
* KEYIN    type of the input key: the byte offset of the line within the file
* VALUEIN  type of the input value: one line of text
* KEYOUT   type of the output key
* VALUEOUT type of the output value


import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapTask extends Mapper<LongWritable, Text, Text, IntWritable>{
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        //value holds one line of input
        //split the line into words
        String[] split = value.toString().split(" ");
        for (String word : split) {
            //emit the word as the key and 1 as the value, e.g. "hello 1"
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
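
For example, assuming an input line of hello world hello, this mapper emits the pairs (hello, 1), (world, 1) and (hello, 1); the framework then groups the pairs by key during the shuffle before they reach the reducer.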

Reducer code

Purpose: records with the same key are grouped together, and the reducer sums their counts to obtain each word's frequency.
* KEYIN    type of the key emitted by the map phase, i.e. the reducer's input key type
* VALUEIN  type of the value emitted by the map phase, i.e. the reducer's input value type
* KEYOUT   type of the reducer's output key
* VALUEOUT type of the reducer's output value

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceTask extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        int count = 0;
        //sum the values for this key to get the word's frequency
        for (IntWritable value : values) {
            count = count + value.get();
            //count++;  would also work here, since every value is 1
        }
        //emit the word and its total count
        context.write(key, new IntWritable(count));
    }
}
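
An optional optimization, not used in the original code, is to register this same reducer as a combiner so that partial sums are computed on the map side and less data is sent over the network:

job.setCombinerClass(ReduceTask.class);

This is safe for word count because addition is associative and commutative.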

Driver code

1. Running on the HDFS cluster

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        //point the job at the HDFS cluster
        conf.set("fs.defaultFS", "hdfs://hadoop01:9000");
        Job job = Job.getInstance(conf);
        //set which Mapper and Reducer the job uses, and which class's jar to submit
        job.setMapperClass(MapTask.class);
        job.setReducerClass(ReduceTask.class);
        job.setJarByClass(Driver.class);

        //set the output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        //set the input and output paths
        FileInputFormat.addInputPath(job, new Path("/hello.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/wordcount/wc-output"));

        //delete the output directory if it already exists
        FileSystem fs = FileSystem.get(conf);
        if(fs.exists(new Path("/wordcount/wc-output"))) {
            fs.delete(new Path("/wordcount/wc-output"), true);
        }

        //submit the job and monitor it until it finishes
        boolean completion = job.waitForCompletion(true);
        System.out.println(completion ? "Job finished" : "Job failed");
    }
}
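
A common variation, not in the original article, is to read the input and output paths from the command-line arguments instead of hard-coding them, so the same jar can be reused for different data sets:

//hypothetical replacement for the two hard-coded paths above
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));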

Package the project into a jar, upload it to the cluster, and run it with hadoop jar followed by the jar file and the fully qualified name of the driver class.
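
For example, if the jar were named wc.jar and the Driver class lived in a package com.example.wc (both names are placeholders, not from the original article), the command would look like: hadoop jar wc.jar com.example.wc.Driver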

2. Submitting to the cluster from Eclipse (running a local jar)

public class Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        System.setProperty("HADOOP_USER_NAME", "root");//user to submit the job as
        conf.set("fs.defaultFS", "hdfs://hadoop01:9000");//the HDFS cluster to submit to
        conf.set("mapreduce.framework.name", "yarn");//run the computation on YARN
        conf.set("yarn.resourcemanager.hostname", "hadoop01");//the ResourceManager host is hadoop01
        conf.set("mapreduce.app-submission.cross-platform", "true");
        Job job = Job.getInstance(conf,"eclipseToCluster");

        job.setMapperClass(MapTask.class);
        job.setReducerClass(ReduceTask.class);
        //job.setJarByClass(Driver.class);
        //package the project into a jar and pass the jar's location to job.setJar()
        job.setJar("C:\\Users\\dell\\Desktop\\wc.jar");

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/hello.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/wordcount/wclipse-out"));

        //delete the output directory if it already exists
        FileSystem fs = FileSystem.get(conf);
        if(fs.exists(new Path("/wordcount/wclipse-out"))) {
            fs.delete(new Path("/wordcount/wclipse-out"), true);
        }

        boolean completion = job.waitForCompletion(true);
        System.out.println(completion ? 0 : 1);
    }
}

Package the project into a jar at the path passed to job.setJar() before running; when you Run As in Eclipse, it is this jar that gets shipped to the cluster (job.setJarByClass() is not enough here, because when running from the IDE the classes are not yet packaged in a jar).

3. Running locally in Eclipse

public class Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        //System.setProperty("HADOOP_USER_NAME", "root");//user to submit the job as
        /*conf.set("fs.defaultFS", "hdfs://hadoop01:9000");
        conf.set("mapreduce.framework.name", "yarn");
        conf.set("yarn.resourcemanager.hostname", "hadoop01");
        conf.set("mapreduce.app-submission.cross-platform", "true");*/
        Job job = Job.getInstance(conf,"eclipseToCluster");

        job.setMapperClass(MapTask.class);
        job.setReducerClass(ReduceTask.class);
        job.setJarByClass(Driver.class);
        //job.setJar("C:\\Users\\dell\\Desktop\\wc.jar");

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("D:\\a\\hello.txt"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\a\\wordcount\\wclipse-out"));

        //delete the local output directory if it already exists
        File file = new File("D:\\a\\wordcount\\wclipse-out");
        if(file.exists()){
            FileUtils.deleteDirectory(file);
        }

        boolean completion = job.waitForCompletion(true);
        System.out.println(completion ? "Job finished" : "Job failed");
    }
}

Run directly in Eclipse: with fs.defaultFS and mapreduce.framework.name left at their defaults, the job runs with the local job runner against the local file system, so you can simply check the output files on your own machine to see whether the job succeeded.
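
If the job succeeds, the output directory will contain a _SUCCESS marker file and a part-r-00000 file whose lines each hold a word and its count separated by a tab, for example: hello	2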


Reposted from blog.csdn.net/amin_hui/article/details/82020826