Word Frequency Analysis with the MapReduce Framework (Case 1)

  • When writing a MapReduce program, the key-value input and output data must use the types provided by Hadoop rather than Java's built-in types, for example long → LongWritable, int → IntWritable, String → Text.
  • Communication between nodes uses RPC: the RPC protocol serializes a message into a binary byte stream and sends it to the remote node,
    which deserializes the stream back into the original message.
  • To use a custom key-value type in a MapReduce program, the type must implement the corresponding interface, such as Writable or WritableComparable (a sketch follows this list).
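
For illustration, here is a minimal sketch (not from the original article) of what such a custom key type might look like; the WordPair class and its two string fields are hypothetical:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class WordPair implements WritableComparable<WordPair> {
    private String first = "";
    private String second = "";

    //Hadoop instantiates key types via reflection, so an empty constructor is required
    public WordPair() {}

    @Override
    public void write(DataOutput out) throws IOException {
        //serialize the fields in a fixed order
        out.writeUTF(first);
        out.writeUTF(second);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        //deserialize the fields in the same order they were written
        first = in.readUTF();
        second = in.readUTF();
    }

    @Override
    public int compareTo(WordPair other) {
        //keys are sorted by this comparison during the shuffle
        int cmp = first.compareTo(other.first);
        return cmp != 0 ? cmp : second.compareTo(other.second);
    }
}

In practice a custom key should also override hashCode() and equals(), because the default HashPartitioner uses hashCode() to decide which reducer a key is sent to.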

Mapper code

In the map phase every line of input is split into words, and each word is emitted with a count of 1.
* Four type parameters:
* KEYIN    type of the input key: the byte offset of the line within the file
* VALUEIN  type of the input value: one line of text
* KEYOUT   type of the output key
* VALUEOUT type of the output value


import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapTask extends Mapper<LongWritable, Text, Text, IntWritable>{
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        //value holds one line of input
        //split the line into words
        String[] split = value.toString().split(" ");
        for (String word : split) {
            //emit the word as the key and 1 as the value, e.g. "hello 1"
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
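
For example, assuming an input line of hello world hello, this mapper emits the pairs (hello, 1), (world, 1) and (hello, 1); the framework then groups the pairs by key during the shuffle before they reach the reducer.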

Reducer code

Purpose: records with the same key are grouped together, and the reducer sums their counts to obtain each word's frequency.
* KEYIN    type of the key emitted by the map phase, i.e. the reducer's input key type
* VALUEIN  type of the value emitted by the map phase, i.e. the reducer's input value type
* KEYOUT   type of the reducer's output key
* VALUEOUT type of the reducer's output value

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceTask extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        int count = 0;
        //sum the values for this key to get the word's frequency
        for (IntWritable value : values) {
            count = count + value.get();
            //count++;  would also work here, since every value is 1
        }
        //emit the word and its total count
        context.write(key, new IntWritable(count));
    }
}
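
An optional optimization, not used in the original code, is to register this same reducer as a combiner so that partial sums are computed on the map side and less data is sent over the network:

job.setCombinerClass(ReduceTask.class);

This is safe for word count because addition is associative and commutative.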

Driver code

1. Running on the HDFS cluster

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        //point the job at the HDFS cluster
        conf.set("fs.defaultFS", "hdfs://hadoop01:9000");
        Job job = Job.getInstance(conf);
        //set which Mapper and Reducer the job uses, and which class's jar to submit
        job.setMapperClass(MapTask.class);
        job.setReducerClass(ReduceTask.class);
        job.setJarByClass(Driver.class);

        //set the output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        //set the input and output paths
        FileInputFormat.addInputPath(job, new Path("/hello.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/wordcount/wc-output"));

        //delete the output directory if it already exists
        FileSystem fs = FileSystem.get(conf);
        if(fs.exists(new Path("/wordcount/wc-output"))) {
            fs.delete(new Path("/wordcount/wc-output"), true);
        }

        //submit the job and monitor it until it finishes
        boolean completion = job.waitForCompletion(true);
        System.out.println(completion ? "Job finished" : "Job failed");
    }
}
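
A common variation, not in the original article, is to read the input and output paths from the command-line arguments instead of hard-coding them, so the same jar can be reused for different data sets:

//hypothetical replacement for the two hard-coded paths above
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));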

Package the project into a jar, upload it to the cluster, and run it with hadoop jar followed by the jar file and the fully qualified name of the driver class.
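
For example, if the jar were named wc.jar and the Driver class lived in a package com.example.wc (both names are placeholders, not from the original article), the command would look like: hadoop jar wc.jar com.example.wc.Driver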

2. Submitting to the cluster from Eclipse (running a local jar)

public class Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        System.setProperty("HADOOP_USER_NAME", "root");//user to submit the job as
        conf.set("fs.defaultFS", "hdfs://hadoop01:9000");//the HDFS cluster to submit to
        conf.set("mapreduce.framework.name", "yarn");//run the computation on YARN
        conf.set("yarn.resourcemanager.hostname", "hadoop01");//the ResourceManager host is hadoop01
        conf.set("mapreduce.app-submission.cross-platform", "true");
        Job job = Job.getInstance(conf,"eclipseToCluster");

        job.setMapperClass(MapTask.class);
        job.setReducerClass(ReduceTask.class);
        //job.setJarByClass(Driver.class);
        //package the project into a jar and pass the jar's location to job.setJar()
        job.setJar("C:\\Users\\dell\\Desktop\\wc.jar");

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/hello.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/wordcount/wclipse-out"));

        //delete the output directory if it already exists
        FileSystem fs = FileSystem.get(conf);
        if(fs.exists(new Path("/wordcount/wclipse-out"))) {
            fs.delete(new Path("/wordcount/wclipse-out"), true);
        }

        boolean completion = job.waitForCompletion(true);
        System.out.println(completion ? 0 : 1);
    }
}

Package the project into a jar at the path passed to job.setJar() before running; when you Run As in Eclipse, it is this jar that gets shipped to the cluster (job.setJarByClass() is not enough here, because when running from the IDE the classes are not yet packaged in a jar).

3. Running locally in Eclipse

public class Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        //System.setProperty("HADOOP_USER_NAME", "root");//user to submit the job as
        /*conf.set("fs.defaultFS", "hdfs://hadoop01:9000");
        conf.set("mapreduce.framework.name", "yarn");
        conf.set("yarn.resourcemanager.hostname", "hadoop01");
        conf.set("mapreduce.app-submission.cross-platform", "true");*/
        Job job = Job.getInstance(conf,"eclipseToCluster");

        job.setMapperClass(MapTask.class);
        job.setReducerClass(ReduceTask.class);
        job.setJarByClass(Driver.class);
        //job.setJar("C:\\Users\\dell\\Desktop\\wc.jar");

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("D:\\a\\hello.txt"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\a\\wordcount\\wclipse-out"));

        //delete the local output directory if it already exists
        File file = new File("D:\\a\\wordcount\\wclipse-out");
        if(file.exists()){
            FileUtils.deleteDirectory(file);
        }

        boolean completion = job.waitForCompletion(true);
        System.out.println(completion ? "Job finished" : "Job failed");
    }
}

Run directly in Eclipse: with fs.defaultFS and mapreduce.framework.name left at their defaults, the job runs with the local job runner against the local file system, so you can simply check the output files on your own machine to see whether the job succeeded.
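
If the job succeeds, the output directory will contain a _SUCCESS marker file and a part-r-00000 file whose lines each hold a word and its count separated by a tab, for example: hello	2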


Reposted from blog.csdn.net/amin_hui/article/details/82020826