Hadoop MapReduce: A Distributed Computing Framework for Big Data

1. Overview of MapReduce

  • Hadoop MapReduce is a distributed computing framework for writing batch applications. Programs written with it can be submitted to a Hadoop cluster to process large-scale data sets in parallel.

  • A MapReduce job works by splitting the input dataset into independent chunks, which are processed in parallel by the map tasks; the framework sorts the map output, which is then fed to the reduce tasks.

  • The MapReduce framework operates exclusively on <key, value> pairs: it treats the input of a job as a set of <key, value> pairs and produces a set of <key, value> pairs as output.

  • Both key and value classes must implement the Writable interface; key classes must additionally implement WritableComparable so the framework can sort them. A minimal sketch of such a type follows the data flow below.

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
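
Built-in types such as LongWritable, Text, and IntWritable already meet this requirement. Purely as an illustration, here is a minimal sketch of what a custom key type might look like (WordPair is a hypothetical example, not part of the word count case below):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class WordPair implements WritableComparable<WordPair> {

    private String first = "";
    private String second = "";

    // A no-argument constructor is required so the framework can instantiate the type via reflection
    public WordPair() { }

    public WordPair(String first, String second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize the fields in a fixed order
        out.writeUTF(first);
        out.writeUTF(second);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialize in the same order as write()
        first = in.readUTF();
        second = in.readUTF();
    }

    @Override
    public int compareTo(WordPair o) {
        // Sort order used by the framework when sorting map output keys
        int cmp = first.compareTo(o.first);
        return cmp != 0 ? cmp : second.compareTo(o.second);
    }

    @Override
    public int hashCode() {
        // HashPartitioner uses hashCode() to assign keys to reduce tasks
        return first.hashCode() * 31 + second.hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof WordPair)) return false;
        WordPair other = (WordPair) obj;
        return first.equals(other.first) && second.equals(other.second);
    }
}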

2. Brief description of MapReduce programming model

[Figure: MapReduce word count workflow: input, splitting, mapping, shuffling, reducing]

  1. input: read the text file;
  2. splitting: split the file by line; the K1 obtained here is the line offset and V1 is the text content of the corresponding line;
  3. mapping: split each line by spaces in parallel, producing List(K2, V2), where K2 is a word; since this is word frequency counting, V2 is 1, meaning one occurrence;
  4. shuffling: because mapping may run in parallel on different machines, shuffling distributes data with the same key to the same node for merging so that the final result can be counted; this yields K2 (each word) and List(V2), an iterable collection whose elements are the V2 values produced in mapping;
  5. reducing: the goal here is to count the total occurrences of each word, so reducing sums List(V2) and outputs the result.

In the MapReduce programming model, the splitting and shuffling operations are implemented by the framework; only mapping and reducing need to be implemented by our own code, which is where the name MapReduce comes from.

3. MapReduce word frequency statistics case

3.1 Project Introduction

Here is a classic case of word frequency statistics: count the number of occurrences of each word in the following sample data.

Spark	HBase
Hive	Flink	Storm	Hadoop	HBase	Spark
Flink
HBase	Storm
HBase	Hadoop	Hive	Flink
HBase	Flink	Hive	Storm
Hive	Flink	Hadoop
HBase	Hive
Hadoop	Spark	HBase	Storm
HBase	Hadoop	Hive	Flink
HBase	Flink	Hive	Storm
Hive	Flink	Hadoop
HBase	Hive

For convenience, the project source code includes a utility class, WordCountDataUtils, which generates word frequency sample data; the generated file can be written to the local file system or directly to HDFS. A rough sketch of such a generator is shown below.
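
The utility class itself is not reproduced in this article; the following is only a sketch of what the local-file variant of such a generator could look like (the method name and line layout are assumptions, while WORD_LIST holds the six words that appear in the sample data):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class WordCountDataUtils {

    // The six words that appear in the sample data
    public static final List<String> WORD_LIST =
            Arrays.asList("Spark", "HBase", "Hive", "Flink", "Storm", "Hadoop");

    /**
     * Write the given number of lines of randomly chosen, tab-separated words to a local file.
     */
    public static void generateLocalFile(String path, int lineCount) throws IOException {
        Random random = new Random();
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < lineCount; i++) {
            int wordsPerLine = 1 + random.nextInt(4);   // 1 to 4 words per line
            for (int j = 0; j < wordsPerLine; j++) {
                builder.append(WORD_LIST.get(random.nextInt(WORD_LIST.size()))).append("\t");
            }
            builder.append("\n");
        }
        Files.write(Paths.get(path), builder.toString().getBytes(StandardCharsets.UTF_8));
    }
}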

3.2 Project dependencies
To program against MapReduce, you need to import the hadoop-client dependency:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
</dependency>
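
Note that ${hadoop.version} is a Maven property assumed to be defined in the POM's <properties> section; it should match the Hadoop version running on your cluster, for example:

<properties>
    <!-- Replace with the Hadoop version of your cluster -->
    <hadoop.version>2.6.0</hadoop.version>
</properties>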

3.3 WordCountMapper
WordCountMapper splits each line of data by the specified separator. Note that Hadoop's own types must be used in MapReduce, because the types predefined by Hadoop are all serializable and comparable: they all implement the WritableComparable interface.

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException,
                                                                      InterruptedException {
        // Split the line by tab and emit <word, 1> for each word
        String[] words = value.toString().split("\t");
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }

}

WordCountMapper corresponds to the Mapping operation in the figure below:
[Figure: mapping step of the MapReduce word count workflow]

WordCountMapper inherits from the Mapper class, which is a generic class defined as follows:

WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
   ......
}
  • KEYIN: the type of the mapping input key, i.e. the offset of each line (the position of the line's first character in the whole file); a Long, corresponding to Hadoop's LongWritable type;
  • VALUEIN: the type of the mapping input value, i.e. a line of data; a String, corresponding to Hadoop's Text type;
  • KEYOUT: the type of the key output by mapping, i.e. each word; a String, corresponding to Hadoop's Text type;
  • VALUEOUT: the type of the value output by mapping, i.e. the number of occurrences of each word; an int is used here, corresponding to Hadoop's IntWritable type.

3.4 WordCountReducer

Count the occurrences of each word in the reduce phase:

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException,
                                                                                  InterruptedException {
        // Sum all the counts emitted for this word and output <word, total>
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }
}

As shown in the figure below, the output of shuffling is the input of reduce: the key is a word, and the values are an iterable collection of counts, such as (1, 1, 1, ...).
[Figure: shuffling output feeding the reducing step]

3.5 WordCountApp

Assemble the MapReduce job and submit it to the server for running. The code is as follows:

/**
 * Assemble the job and submit it to the cluster
 */
public class WordCountApp {

    // Hard-coded here for clarity; in real development these can be passed in externally
    private static final String HDFS_URL = "hdfs://192.168.0.107:8020";
    private static final String HADOOP_USER_NAME = "root";

    public static void main(String[] args) throws Exception {

        // The input and output paths are specified by external arguments
        if (args.length < 2) {
            System.out.println("Input and output paths are necessary!");
            return;
        }

        // The Hadoop user name must be specified; otherwise creating directories on HDFS
        // may throw a permission-denied exception
        System.setProperty("HADOOP_USER_NAME", HADOOP_USER_NAME);

        Configuration configuration = new Configuration();
        // Specify the HDFS address
        configuration.set("fs.defaultFS", HDFS_URL);

        // Create a Job
        Job job = Job.getInstance(configuration);

        // Set the main class of the job
        job.setJarByClass(WordCountApp.class);

        // Set the Mapper and Reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Set the Combiner (optional)
        job.setCombinerClass(WordCountReducer.class);
        // Set a custom partitioning rule (optional)
        job.setPartitionerClass(CustomPartitioner.class);

        // Set the types of the Mapper's output key and value
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Set the types of the Reducer's output key and value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // If the output directory already exists, it must be deleted first;
        // otherwise re-running the program throws an exception
        FileSystem fileSystem = FileSystem.get(new URI(HDFS_URL), configuration, HADOOP_USER_NAME);
        Path outputPath = new Path(args[1]);
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);
        }

        // Set the job's input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, outputPath);

        // Submit the job to the cluster and wait for it to finish;
        // passing true prints the corresponding progress
        boolean result = job.waitForCompletion(true);

        // Close the FileSystem created earlier
        fileSystem.close();

        // Exit the JVM with a status code based on the job result
        System.exit(result ? 0 : -1);
    }
}

It should be noted that if the Mapper's output types are not set explicitly, the framework assumes they are the same as the job's (Reducer's) output types.
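
In this word count case the map output types happen to match the job's output types, so under that assumption the two setMapOutput* calls above could be dropped and the following would be enough:

// The map output types default to the job output types set below,
// so setMapOutputKeyClass/setMapOutputValueClass can be omitted here
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);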

3.6 Submit to the server to run

In actual development, you can set up a Hadoop development environment locally and run the job directly from the IDE for testing. Here we focus on packaging the job and submitting it to the server. Since this project uses no third-party dependencies other than Hadoop, it can be packaged directly:

mvn clean package

Submit the job with the following command:

hadoop jar /usr/appjar/hadoop-word-count-1.0.jar \
com.heibaiying.WordCountApp \
/wordcount/input.txt /wordcount/output/WordCountApp

Check the generated directory on HDFS after the job completes; each line of part-r-00000 contains a word and its count, separated by a tab:

# List the output directory
hadoop fs -ls /wordcount/output/WordCountApp

# View the statistics
hadoop fs -cat /wordcount/output/WordCountApp/part-r-00000

[Figure: word count results in part-r-00000]

4. Word Frequency Statistics Advanced: Combiner

4.1 Code implementation
To use the combiner, just add the following line when assembling the job:

// Set the Combiner
job.setCombinerClass(WordCountReducer.class);

4.2 Execution results
After adding the combiner, the statistical results do not change, but the effect of the combiner can be seen from the printed log:

Printed log without the combiner:
[Figure: job counters without the combiner]

Printed log after adding the combiner:
[Figure: job counters with the combiner]

Here there is only one input file and it is smaller than 128 MB, so only one map task processes it. The counters show that after the combiner runs, the number of records passed to reduce drops from 3519 to 6 (there are only 6 distinct words in the sample). In this case, the combiner greatly reduces the amount of data that needs to be transferred. The reducer can be reused as the combiner here because summing is commutative and associative, and its input and output types are identical.

5. Word Frequency Statistics Advanced: Partitioner

5.1 Default Partitioner

Suppose there is a requirement to output the statistics for different words to different files. This kind of requirement is quite common; for example, when counting product sales, the results may need to be split by product category. To achieve this, a custom Partitioner is needed.

First, MapReduce's default partitioning rule: when building a job, if no partitioner is specified, the default HashPartitioner is used, which hashes the key and takes the remainder modulo numReduceTasks. It is implemented as follows:

public class HashPartitioner<K, V> extends Partitioner<K, V> {

  public int getPartition(K key, V value,
                          int numReduceTasks) {
    // Mask off the sign bit to keep the result non-negative, then take the remainder
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

}

5.2 Custom Partitioner

Here we extend Partitioner to define a custom partitioning rule that partitions by word:

public class CustomPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text text, IntWritable intWritable, int numPartitions) {
        // Each word gets its own partition, determined by its position in WORD_LIST;
        // this assumes every word in the input appears in WORD_LIST, since indexOf
        // would return -1 (an invalid partition) for an unknown word
        return WordCountDataUtils.WORD_LIST.indexOf(text.toString());
    }
}

When building the job, specify our custom partitioning rule and set the number of reduce tasks accordingly:

// Set the custom partitioning rule
job.setPartitionerClass(CustomPartitioner.class);
// Set the number of reduce tasks
job.setNumReduceTasks(WordCountDataUtils.WORD_LIST.size());

5.3 Execution results

The execution results are as follows: 6 output files are generated, and each file contains the statistics for the corresponding word:
[Figure: six output files, one per word]

6. Case 2 Introduction

reference link

The author of the linked article cleans the text content, uploads it to HDFS manually, analyzes it, imports the results into a Hive database for sorting and merging, then uses the Sqoop tool to import them into MySQL, and finally uses the SSM framework to return them to a front-end page for display.

In fact, this pipeline can be automated:

  • The code in this article can clean the text content and upload it to the HDFS platform automatically
  • The analysis step runs Linux commands, which can be executed from Java via the Runtime class (see the sketch after this list)
  • Code can read the contents of the output directory and import the resulting data into the Hive database
  • Then move the data from Hive to MySQL
  • Finally, the SSM framework displays the data on the front-end page

I don't know whether this is fully feasible; most of the processing found online is done manually. I am only offering the idea here, and readers can implement it themselves, for example by using Java code to import Hive table data into MySQL.
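
As a rough sketch of the second point above, a shell command such as an HDFS upload or a Hive query could be launched from Java as follows. This is only an illustration under my own assumptions (the class name, helper method, and example command are not from the original article), using ProcessBuilder, which plays the same role as the Runtime class mentioned above:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ShellCommandRunner {

    /**
     * Run a shell command and return its exit code, printing the combined output.
     */
    public static int run(String command) throws Exception {
        // Launch the command through the shell so pipes and quoting work as expected
        ProcessBuilder builder = new ProcessBuilder("/bin/sh", "-c", command);
        builder.redirectErrorStream(true);
        Process process = builder.start();

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        // Wait for the command to finish and return its exit code
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Example: upload a cleaned file to HDFS before running the MapReduce job
        int exitCode = run("hdfs dfs -put -f cleaned.txt /wordcount/input.txt");
        System.out.println("Exit code: " + exitCode);
    }
}

The exit code can be checked before moving on to the next step of the pipeline, such as triggering the Hive import.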
