Video material: Dark Horse Programmer's big data Hadoop introductory video tutorial, suitable for zero-based self-study of big data Hadoop.
Map stage execution process
- The files in the input directory are cut, one by one, into logical splits according to a standard. By default the split size equals the block size (Split size = Block size = 128 MB); a file smaller than 128 MB forms a single split, and each split is processed by one MapTask.
- The data in each split is read and parsed according to certain rules and returned as <key, value> pairs.
- The map method in the Mapper class is called to process the data: each time a <key, value> pair is read and parsed, the map method is called once.
- The key-value pairs output by map are partitioned according to certain rules. By default there is no partitioning, because there is only one ReduceTask; the number of partitions equals the number of ReduceTasks.
- The map output is written to an in-memory buffer and spilled to disk when the buffer fills to a threshold ratio. During the spill the data is sorted by key; by default, keys are sorted lexicographically.
- Finally, all spill files are merged into one file.
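The Map-phase steps above can be sketched in plain Python. This is a simplified single-process illustration, not Hadoop's actual implementation: the byte-offset key is replaced by a line number, the spill-to-disk step is just an in-memory list, and `map_func` is an example map method.

```python
# Simplified, single-process illustration of the Map-phase steps.
# Real Hadoop runs one MapTask per split in parallel on the cluster.

SPLIT_SIZE = 128 * 1024 * 1024  # default split size = block size (128 MB)
NUM_REDUCE_TASKS = 1            # one ReduceTask -> one partition by default

def logical_splits(data, split_size=SPLIT_SIZE):
    """Cut the input into logical splits; data < 128 MB forms one split."""
    return [data[i:i + split_size] for i in range(0, len(data), split_size)] or [data]

def map_func(key, value):
    """Example map method: emit (word, 1) for each word in the line."""
    for word in value.split():
        yield (word, 1)

def run_map_task(split):
    buffer = []  # in-memory buffer that would spill to disk in real Hadoop
    for line_no, line in enumerate(split.splitlines()):
        # One map call per parsed <key, value> pair (here: line number, line text)
        for kv in map_func(line_no, line):
            partition = hash(kv[0]) % NUM_REDUCE_TASKS  # partition by key
            buffer.append((partition, kv))
    # Sort by key within each partition before "spilling" (lexicographic by default)
    buffer.sort(key=lambda item: (item[0], item[1][0]))
    return buffer

output = run_map_task("Hello World\nHello Hadoop\n")
```

With one partition, all pairs land in partition 0 and come out sorted by word.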
Reduce phase execution process
- Each ReduceTask actively pulls (copies) the data it needs to process from the MapTasks.
- All the pulled data is merged, i.e. the scattered pieces are combined into one large dataset, which is then sorted by key.
- The reduce method is called on the sorted key-value pairs: for each group of key-value pairs sharing the same key, the reduce method is called once. Finally, the output key-value pairs are written to HDFS files.
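The Reduce-phase steps can be sketched the same way. Again a simplified single-process illustration: real Hadoop pulls map output over the network, one ReduceTask per partition, and `reduce_func` here is just an example reduce method that sums counts.

```python
# Simplified illustration of the Reduce-phase steps: pull -> merge -> sort -> reduce.
from itertools import groupby
from operator import itemgetter

def reduce_func(key, values):
    """Example reduce method: sum the counts seen for one key."""
    return (key, sum(values))

def run_reduce_task(pulled_map_outputs):
    # Merge: combine the scattered per-MapTask outputs into one list
    merged = [kv for part in pulled_map_outputs for kv in part]
    # Sort the merged data by key
    merged.sort(key=itemgetter(0))
    # Call reduce once per group of key-value pairs sharing the same key
    results = [reduce_func(key, (count for _, count in group))
               for key, group in groupby(merged, key=itemgetter(0))]
    return results  # real Hadoop writes these pairs to HDFS files

out = run_reduce_task([[("Hello", 1), ("World", 1)], [("Hello", 1)]])
```

Because the merged data is sorted, `groupby` is enough to gather all values for one key before calling reduce, which is exactly why the sort step matters.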
Python code implements WordCount instance of MapReduce
First, we introduce a tool called Hadoop Streaming, which helps users create a special kind of map/reduce job. In these jobs, executable files or script files serve as the mapper or reducer. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper mapper.py \
-reducer reducer.py
When the Mapper task runs, it splits the input into lines and feeds each line to the standard input (STDIN) of the executable process. At the same time, the mapper collects the content of the process's standard output (STDOUT) and converts each line it receives into a key/value pair as the mapper's output.
When the Reducer task runs, it splits the input into lines and feeds each line to the standard input of the executable process. At the same time, the reducer collects the content of the standard output of the executable file process, and converts each line of content into a key/value pair as the output of the reducer.
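By default, Hadoop Streaming treats everything in an output line up to the first tab character as the key and the remainder of the line as the value (the separator is configurable). A minimal sketch of that line-to-pair conversion:

```python
def line_to_kv(line, separator="\t"):
    """Split one streaming output line into a (key, value) pair.

    By default the key is everything before the first separator (a tab)
    and the value is the rest of the line.
    """
    line = line.rstrip("\n")
    if separator in line:
        key, value = line.split(separator, 1)
    else:
        # A line with no separator becomes a key with an empty value
        key, value = line, ""
    return key, value
```

This is why the mapper below emits `word\t1`: the tab marks where the key ends and the count begins.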
The following content is a code example:
Create a new folder WordCountTask, create a new text document word.txt inside it, and enter the following content:
Hello World
Hello Hadoop
Hello MapReduce
Create two files, mapper.py and reducer.py, under the WordCountTask folder:
mapper.py
#!/usr/bin/python3
import sys

for line in sys.stdin:
    # Strip leading and trailing whitespace from the input line
    line = line.strip()
    # Split the line into words
    words = line.split()
    for word in words:
        # Write the result to standard output (STDOUT); it becomes
        # the input of the Reduce-phase code
        print("%s\t%s" % (word, 1))
Make the script executable (chmod +x mapper.py), then enter the command cat word.txt | ./mapper.py; the running results are as follows:
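The expected output can also be checked in-process; the snippet below mirrors mapper.py's logic on the word.txt content, emitting one `word<TAB>1` line per word in input order:

```python
# Mirror of mapper.py's logic as an in-process sanity check
text = "Hello World\nHello Hadoop\nHello MapReduce\n"

mapper_output = []
for line in text.splitlines():
    line = line.strip()
    for word in line.split():
        # Same "word<TAB>1" format mapper.py prints to STDOUT
        mapper_output.append("%s\t%s" % (word, 1))
```

Note that at this point the words are not yet sorted or counted; that happens in the sort and reduce steps below.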
reducer.py
#!/usr/bin/python3
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split("\t", 1)
    try:
        count = int(count)
    except ValueError:
        # Skip lines whose count is not a number
        continue
    # The input is sorted by key, so identical words arrive consecutively
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word

# Emit the count for the last word
if word == current_word:
    print("%s\t%s" % (current_word, current_count))
Enter the command cat word.txt | ./mapper.py | sort | ./reducer.py; the running results are as follows:
- To explain: the symbol | is the pipe character in Linux. A pipe chains multiple commands together, using the printed output of the previous command as the input of the next. The sort command sorts the lines of a text file; here it stands in for the shuffle/sort step between the map and reduce phases.
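The same local test pipeline can be reproduced in pure Python. The sketch below inlines the logic of mapper.py, sort, and reducer.py into one function (it uses a Counter for brevity rather than the reducer's consecutive-key loop, so it is an equivalent illustration, not the identical algorithm):

```python
# Pure-Python stand-in for: cat word.txt | ./mapper.py | sort | ./reducer.py
from collections import Counter

def wordcount_pipeline(text):
    # mapper step: emit "word\t1" for every word
    mapped = ["%s\t1" % w for line in text.splitlines() for w in line.split()]
    # sort step: group identical keys together, like the sort command
    mapped.sort()
    # reducer step: sum the counts per word
    counts = Counter()
    for line in mapped:
        word, count = line.split("\t", 1)
        counts[word] += int(count)
    return ["%s\t%d" % (w, c) for w, c in sorted(counts.items())]

result = wordcount_pipeline("Hello World\nHello Hadoop\nHello MapReduce\n")
```

For the word.txt above this yields `Hadoop 1, Hello 3, MapReduce 1, World 1`, matching what the shell pipeline prints.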
Runs on the Hadoop HDFS file system
Next, run the mapper and reducer we just wrote on the Hadoop pseudo-distributed environment built from three virtual machines.
- First you need to start Hadoop and the required components:
- Create a new folder WordCountTask under the root directory of the HDFS file system, and upload word.txt to this directory:
[root@master ~]# hadoop fs -mkdir /WordCountTask
[root@master ~]# hadoop fs -put WordCountTask/word.txt /WordCountTask
- Run the command:
[root@master ~]# hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-input /WordCountTask/ \
-output /WordCountTask/out \
-file /root/WordCountTask/mapper.py \
-mapper /root/WordCountTask/mapper.py \
-file /root/WordCountTask/reducer.py \
-reducer /root/WordCountTask/reducer.py
- The final output of the program is written to the path specified by the -output parameter. This path is generated automatically by the job and must not exist before the job runs. Hadoop Streaming is the stream-processing utility that ships with Hadoop. The flow of the program: the raw text is streamed to the map function; after processing, the map function passes its results to the reduce function; the final result is saved on HDFS.