WordCount in MapReduce

MapReduce is a programming model for processing massive amounts of data. It runs the computation in a distributed fashion across different nodes, which greatly improves efficiency and makes large-scale data analysis practical.

When a MapReduce job starts, it first launches many map tasks. After the map tasks have processed their share of the data, many reduce tasks must be launched as well. Starting all of these tasks by hand is not practical, so an automated scheduling platform is needed. For this purpose Hadoop developed YARN, a platform for running MapReduce-style distributed computing programs.

Below we write a WordCount MapReduce (MR) program in Java. The main idea of a MapReduce program is: the map phase maps each input record to key/value pairs, and the reduce phase aggregates (reduces) the values that share a key.
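
For example, given an input line "hello world hello world", the map phase emits one (word, 1) pair per word, the framework groups the pairs by key during the shuffle, and the reduce phase sums each group:

map:     (hello, 1) (world, 1) (hello, 1) (world, 1)
shuffle: hello -> [1, 1]   world -> [1, 1]
reduce:  (hello, 2) (world, 2)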

  1. Map side
package demo;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * !!! Be careful to import these data types from the correct (org.apache.hadoop.io) packages !!!
 * Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>: KEYIN is the key produced by the reader the map
 * task uses to read the text; by default it is the byte offset of a line, of type Long.
 * VALUEIN is the value of that line, by default a String. KEYOUT is the key that our custom
 * mapper logic will emit, whose type we choose according to our own logic, and the same
 * applies to VALUEOUT.
 *
 * Because MapReduce performs a distributed computation, data has to be transferred between
 * nodes and persisted, so it must be serialized. The serialization mechanism built into the
 * JDK is heavyweight and inefficient, so Hadoop developed its own serialization mechanism.
 * Any persistent data type passed around in the program therefore needs to implement
 * Hadoop's own serialization framework.
 *
 * Hadoop ships wrapper types for the common data types that already implement its
 * serialization mechanism:
 * LongWritable   ==> Long
 * Text           ==> String
 * IntWritable    ==> Integer
 * DoubleWritable ==> Double
 * .....
 */
public class MapWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {

    /**
     * Every map task calls this method; map() is invoked once for every line the map task
     * reads. key is the starting byte offset of the line and value is the line content.
     *
     * @param key
     * @param value
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        String line = value.toString();
        // Split the line content into individual words on the given delimiter
        String[] words = line.split(" ");
        // Emit each word with a count of 1 and hand it to the reduce phase
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
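
The Javadoc above notes that any data type passed between map and reduce must implement Hadoop's own serialization framework. As a minimal sketch (the class name and fields here are hypothetical, not part of the original program), a custom value type only needs to implement the Writable interface's write() and readFields() methods; a key type would additionally implement WritableComparable so it can be sorted during the shuffle:

package demo;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical value type: a word together with the length of the line it came from.
public class WordInfoWritable implements Writable {

    private String word;
    private int lineLength;

    // Hadoop needs a no-arg constructor so it can create instances via reflection.
    public WordInfoWritable() {
    }

    public WordInfoWritable(String word, int lineLength) {
        this.word = word;
        this.lineLength = lineLength;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize the fields in a fixed order.
        out.writeUTF(word);
        out.writeInt(lineLength);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialize the fields in exactly the same order.
        word = in.readUTF();
        lineLength = in.readInt();
    }

    public String getWord() { return word; }

    public int getLineLength() { return lineLength; }
}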

  2. Reduce side
package demo;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;


import java.io.IOException;
import java.util.Iterator;

/**
 * Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>: KEYIN and VALUEIN correspond to the mapper's
 * KEYOUT and VALUEOUT, while KEYOUT and VALUEOUT are the data types our reduce logic wants
 * to output.
 */
public class ReduceWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {

    /**
     * Each of the many reduce tasks calls this reduce() method; it is invoked once per group
     * of values that share the same key.
     * @param key
     * @param values
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        int count = 0;
        Iterator<IntWritable> it = values.iterator();
        while (it.hasNext()) {
            count += it.next().get();
        }
        context.write(key, new IntWritable(count));
        System.out.println(key + " " + count);
    }
}
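
Since the reduce logic is just a sum, which is commutative and associative, the same ReduceWordCount class could optionally also be registered as a combiner, so that each map task pre-aggregates its own (word, 1) pairs before they are shuffled across the network. A minimal sketch of the one extra line that would go into the driver shown in the next section:

// Optional: reuse the reducer as a combiner to cut down shuffle traffic
job.setCombinerClass(ReduceWordCount.class);
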
  3. Client
package demo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


public class JobSubmitter {

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        // Specify which platform the job runs on; here we run the MR program locally
        conf.set("mapreduce.framework.name", "local");

        // If running on YARN, the ResourceManager host must be specified instead
        //conf.set("yarn.resourcemanager.hostname", "lx01");


        // Client-side job handle
        Job job = Job.getInstance(conf);
        // Let the framework locate the MR program's jar via the class loading mechanism
        job.setJarByClass(JobSubmitter.class);
        job.setMapperClass(MapWordCount.class);
        job.setReducerClass(ReduceWordCount.class);

        // Specify the map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Specify the reduce (final) output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Specify the directory the job reads its input from
        FileInputFormat.setInputPaths(job, new Path("F:\\etl_test_data\\wordcount.txt"));
        // Specify the directory the job writes its results to
        FileOutputFormat.setOutputPath(job, new Path("F:\\etl_test_data\\output\\"));

        // Submit the job and wait for it to finish
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : -1);
    }
}
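
The input and output paths above are hard-coded to local Windows directories, which only makes sense with the "local" framework. A common alternative, sketched below, is to take the paths from the command line with ToolRunner, so the same jar can be submitted to a cluster with `hadoop jar` (this assumes the cluster configuration files are on the classpath; the class name WordCountDriver and the argument layout are just illustrative, not part of the original program):

package demo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver that reads the input/output paths from the command line, e.g.:
//   hadoop jar wordcount.jar demo.WordCountDriver /input/wordcount.txt /output/wordcount
public class WordCountDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(MapWordCount.class);
        job.setReducerClass(ReduceWordCount.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses generic Hadoop options (-D, -conf, ...) before calling run()
        System.exit(ToolRunner.run(new Configuration(), new WordCountDriver(), args));
    }
}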
  4. Compare the original data with the result data: Figure 1 shows the original data and Figure 2 shows the result data

Figure 1: the original input data (image omitted)

Figure 2: the word count result (image omitted)
You can see that the word counts we wanted have been computed by the MapReduce program.

Origin blog.csdn.net/AnameJL/article/details/109862846