The execution process of each stage of Hadoop MapReduce, and Python code implementing a simple WordCount program

Video material: the Dark Horse Programmer introductory big data Hadoop video tutorial, suitable for self-study of big data Hadoop from zero.

Map stage execution process

  1. The files in the input directory are divided one by one into logical splits (逻辑切片) according to certain rules. The default split size equals the block size (Split size = Block size = 128 MB); a file smaller than 128 MB forms a single split, and each split is processed by one MapTask. (A sketch of the split-size calculation follows this list.)
  2. The data in each split is read and parsed according to certain rules and returned as <key, value> pairs.
  3. The map method of the Mapper class is called to process the data. The map method is called once for every <key, value> pair that is read and parsed.
  4. According to certain rules, the key-value pairs output by Map are partitioned (分区, partition). By default there is no partitioning, because there is only one ReduceTask; the number of partitions equals the number of ReduceTasks.
  5. The Map output is written to a memory buffer (内存缓冲区) and spilled (溢出, spill) to disk when a threshold ratio is reached. During the spill the data is sorted by key; by default, keys are sorted lexicographically.
  6. Finally, all spill files are merged (合并, merge) into one file.
    (figure: diagram of the Map stage execution flow)
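In Hadoop, FileInputFormat computes the split size as max(minSize, min(maxSize, blockSize)), which gives the 128 MB default mentioned above. Below is a minimal Python sketch of that calculation (it is not Hadoop's actual Java code, and the 300 MB file size is just a hypothetical example):

#!/usr/bin/python3
# Sketch of how FileInputFormat divides a file into logical splits.

def compute_split_size(block_size, min_size=1, max_size=float("inf")):
    # Hadoop's formula: max(minSize, min(maxSize, blockSize))
    return max(min_size, min(max_size, block_size))

def logical_splits(file_size, block_size=128 * 1024 * 1024):
    # Returns (offset, length) pairs; each split is handled by one MapTask.
    split_size = compute_split_size(block_size)
    splits, offset, remaining = [], 0, file_size
    # Hadoop only cuts off a full split while the remainder is more than
    # about 1.1x the split size, so a small tail stays with the last split.
    while remaining / split_size > 1.1:
        splits.append((offset, split_size))
        offset += split_size
        remaining -= split_size
    if remaining > 0:
        splits.append((offset, remaining))
    return splits

print(logical_splits(300 * 1024 * 1024))   # three splits: 128 MB, 128 MB, 44 MB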

Reduce phase execution process

  1. Each ReduceTask actively copies and pulls (复制拉取) from the MapTasks the data that it needs to process.
  2. All the pulled data is merged (merge), that is, the scattered data is combined into one large dataset, and the merged data is then sorted (排序).
  3. The reduce method is called on the sorted key-value pairs: key-value pairs with the same key (键相同的键值对) trigger one call of the reduce method, as sketched below. Finally, the output key-value pairs are written to a file on HDFS.
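Conceptually, after the merge and sort, the framework groups adjacent pairs that share a key and calls reduce once per group. Here is a rough Python sketch of that grouping using itertools.groupby on already-sorted pairs (the sample pairs and reduce_func are hypothetical stand-ins, not part of Hadoop):

#!/usr/bin/python3
# Sketch of the reduce-side group-by-key step on already-sorted pairs.
from itertools import groupby

# Hypothetical output of the shuffle/sort phase for the word.txt example.
sorted_pairs = [("Hadoop", 1), ("Hello", 1), ("Hello", 1), ("Hello", 1),
                ("MapReduce", 1), ("World", 1)]

def reduce_func(key, values):
    # Stand-in for the user's reduce method: sum the counts for one key.
    print("%s\t%d" % (key, sum(values)))

for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
    # One reduce call per distinct key, as described in step 3 above.
    reduce_func(key, (count for _, count in group))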

A Python implementation of the MapReduce WordCount example

First, we introduce a tool called Hadoop Streaming, which helps users create a special kind of map/reduce job in which executable files or scripts serve as the mapper or the reducer. For example:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper mapper.py \
    -reducer reducer.py

When the Mapper task runs, it splits its input into lines and feeds each line to the standard input (STDIN) of the executable process. At the same time, the mapper collects the content of the process's standard output (STDOUT) and converts each line it receives into a key/value pair, which becomes the output of the mapper.

When the Reducer task runs, it also splits its input into lines and feeds each line to the standard input of the executable process. At the same time, the reducer collects the content of the process's standard output and converts each line into a key/value pair, which becomes the output of the reducer.
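By default, Hadoop Streaming treats everything up to the first tab character of a line as the key and the rest as the value; a line without a tab becomes a key with an empty (null) value. A small illustrative sketch of that convention (the function line_to_pair is made up for this example):

#!/usr/bin/python3
# Illustration of Hadoop Streaming's default line-to-key/value convention.

def line_to_pair(line):
    line = line.rstrip("\n")
    if "\t" in line:
        key, value = line.split("\t", 1)   # split at the first tab only
    else:
        key, value = line, ""              # no tab: the whole line is the key
    return key, value

print(line_to_pair("Hello\t1"))   # ('Hello', '1')
print(line_to_pair("Hello"))      # ('Hello', '')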

The following is a code example.
Create a new folder WordCountTask, then create a new text file word.txt in that folder with the following content:

Hello World
Hello Hadoop
Hello MapReduce

Create two files, mapper.py and reducer.py, under the WordCountTask folder:

mapper.py

#!/usr/bin/python3

import sys

for line in sys.stdin:
    # strip leading and trailing whitespace from the input line
    line = line.strip()
    # split the line into words
    words = line.split()
    for word in words:
        # write the result to standard output (STDOUT);
        # it becomes the input of the Reduce-stage code
        print("%s\t%s" % (word, 1))

Make both scripts executable (chmod +x mapper.py reducer.py), then enter the command cat word.txt | ./mapper.py; the running results are as follows:
(screenshot: the mapper's output)

reducer.py

#!/usr/bin/python3

import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    # each input line from the mapper has the form "word\t1"
    word, count = line.split("\t", 1)
    try:
        count = int(count)
    except ValueError:
        # skip lines whose count is not a number
        continue
    # the input is sorted by key, so all lines for the same word are adjacent
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # a new word has started: emit the total for the previous word
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word

# emit the total for the last word (guard against empty input)
if current_word is not None:
    print("%s\t%s" % (current_word, current_count))

Enter the command cat word.txt | ./mapper.py | sort | ./reducer.py, and the running results are as follows:
(screenshot: the reducer's output)

  • To explain: the symbol | is the pipe character in Linux. The pipe chains multiple commands together: the output printed by the previous command is used as the input of the following command.
  • The sort command sorts the lines of a text file; here it stands in for the shuffle/sort step between Map and Reduce. A Python version of this whole pipeline is sketched below.
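The same local test can also be driven from Python instead of the shell. Here is a sketch using the standard subprocess module; it assumes mapper.py, reducer.py and word.txt sit in the current directory and that the two scripts are executable, as set up above:

#!/usr/bin/python3
# Local simulation of: cat word.txt | ./mapper.py | sort | ./reducer.py
import subprocess

with open("word.txt") as f:
    mapped = subprocess.run(["./mapper.py"], stdin=f, capture_output=True,
                            text=True, check=True).stdout

# Sorting the mapper output by key stands in for the shuffle/sort phase.
shuffled = "".join(sorted(mapped.splitlines(keepends=True)))

reduced = subprocess.run(["./reducer.py"], input=shuffled, capture_output=True,
                         text=True, check=True).stdout
print(reduced, end="")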

Running on the Hadoop HDFS file system

Now run the mapper and reducer just written on the pseudo-distributed Hadoop cluster built from three virtual machines.

  1. First you need to start Hadoop and the required components:
    (screenshot: Hadoop services started)
  2. Create a new folder WordCountTask under the root directory of the HDFS file system, and upload word.txt to this directory:
[root@master ~]# hadoop fs -mkdir /WordCountTask
[root@master ~]# hadoop fs -put WordCountTask/word.txt  /WordCountTask
  3. Run the command:
[root@master ~]# hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-input /WordCountTask/ \
-output /WordCountTask/out \
-file /root/WordCountTask/mapper.py \
-mapper /root/WordCountTask/mapper.py  \
-file /root/WordCountTask/reducer.py \
-reducer /root/WordCountTask/reducer.py 
  4. The final result of the program is written to the path specified by the -output parameter. This path is generated automatically by the program and must not exist before the job is run; you can view the result with, for example, hadoop fs -cat /WordCountTask/out/part-*. Hadoop Streaming is the stream-processing package that ships with Hadoop. The flow of the program is that the original text is streamed to the Map function, the Map function passes its result to the Reduce function after processing, and the final result is saved on HDFS.

Origin: blog.csdn.net/weixin_45735297/article/details/129904802