Summary of MapReduce using Hadoop 3 in Python 3

How MapReduce works

Introduction to MapReduce

MapReduce is a distributed computing model proposed by Google, originally used in the search field to solve computation problems over massive data sets.

MapReduce is divided into two parts: Map (mapping) and Reduce (reduction).

  1. When you submit a computing job to the MapReduce framework, it first splits the job into several Map tasks and assigns them to different nodes for execution. Each Map task processes a part of the input data.
  2. When a Map task completes, it produces intermediate files that serve as the input to the Reduce tasks. The main goal of a Reduce task is to aggregate the outputs of the earlier Map tasks and produce the final result. For example, in a word-count job each Map task emits (word, 1) pairs and each Reduce task sums the counts for one word.

MapReduce basic patterns and processing ideas

When processing large-scale data, the thinking behind MapReduce operates at three levels:

1. Dealing with big data processing: divide and conquer

        For big data whose parts have no computational dependencies on each other, the most natural way to achieve parallelism is a divide-and-conquer strategy.

2. Rising to an abstract model: Map and Reduce

         Parallel computing approaches such as MPI lack a high-level parallel programming model: programmers have to handle storage, computation, distribution and other details themselves. To overcome this shortcoming, MapReduce borrows ideas from the Lisp functional language and provides two functions, Map and Reduce, as a high-level abstraction of the parallel programming model.

3. Rising to the architecture: a unified framework that hides system-level details from programmers

        Parallel computing approaches such as MPI also lack the support of a unified computing framework: programmers have to deal with many details such as data storage, partitioning, distribution, result collection and error recovery themselves. To this end, MapReduce designs and provides a unified computing framework that hides most of these system-level concerns from the programmer.

Big Data Processing: Divide and Conquer

 Establish Map and Reduce abstract models

Drawing on ideas from the functional programming language Lisp, two abstract operation functions Map and Reduce are defined:

Map: (k1:v1) -> [(k2:v2)]
Reduce: (k2:[v2]) -> [(k3:v3)]
Each Map task processes an initial data block of the same structure and size, i.e. (k1:v1), where k1 is a key such as the data block index or the data block address, and v1 is the data itself.

After processing by the Map nodes, many intermediate data sets are produced; the brackets [] denote a collection of values. The data received by a Reduce node is the merged intermediate data, in which all values with the same key have been grouped together, i.e. (k2:[v2]). After Reduce processing, the final results (k3:v3) are produced.
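
A minimal sketch of these two signatures in plain Python, using word count as the example (map_fn and reduce_fn are illustrative names here, not part of any Hadoop API):

# Sketch of the abstract model for word count.
# map_fn and reduce_fn are illustrative names, not a Hadoop API.
def map_fn(k1, v1):
    # (k1:v1) -> [(k2:v2)]: k1 is the block index, v1 is the block's text.
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2, v2_list):
    # (k2:[v2]) -> [(k3:v3)]: all values for one key arrive merged together.
    return [(k2, sum(v2_list))]

print(map_fn(0, "hello world hello"))  # [('hello', 1), ('world', 1), ('hello', 1)]
print(reduce_fn("hello", [1, 1]))      # [('hello', 2)]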

Rising to the architecture: a unified framework that hides system-level details from programmers

 Core process description:

1. There is a large data set to be processed; it is divided into data blocks of the same size (for example 64MB), and the corresponding user program is prepared.

2. The system contains a master node (Master) responsible for scheduling, as well as Map and Reduce worker nodes (Worker).

3. The user job is submitted to the master node.

4. The master node finds and configures available Map nodes for the job and transmits the program to them.

5. The master node likewise finds and configures available Reduce nodes for the job and sends the program to them.

6. The master node starts the program on each Map node; each Map node reads as much local or rack-local data as possible for its computation (moving the code closer to the data reduces the volume of data communication in the cluster).

7. Each Map node processes the data blocks it has read, performs some data arrangement work (combining, sorting, etc.), and stores the intermediate results on its local machine; at the same time it notifies the master node that its computing task is complete and tells it where the intermediate result data is stored.

8. Once the master node learns that all Map nodes have finished their computation, it starts the Reduce nodes; each Reduce node reads the intermediate data remotely, using the location information held by the master node.

9. The results computed by the Reduce nodes are aggregated and written to a result file, giving the overall processing result.
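
The flow above can be imitated in a few lines of plain Python: split the input into blocks, run a map function on every block, group (shuffle) the intermediate pairs by key, and run a reduce function on every group. This is only a single-process sketch of the idea (run_job and block_size are made-up names), not how Hadoop itself is implemented:

# Single-process sketch of the split -> map -> shuffle -> reduce flow.
# It only imitates the data movement; scheduling, fault tolerance and
# distributed storage are exactly what the real framework hides from us.
from collections import defaultdict

def run_job(text, block_size=3):
    lines = text.splitlines()
    # 1. Split the input into equally sized "blocks" (stand-ins for 64MB splits).
    blocks = [lines[i:i + block_size] for i in range(0, len(lines), block_size)]

    # 2. Map phase: each block independently emits (word, 1) pairs.
    intermediate = []
    for block in blocks:
        for line in block:
            intermediate.extend((word, 1) for word in line.split())

    # 3. Shuffle: merge values that share the same key, giving (k2, [v2]).
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # 4. Reduce phase: summarize each group into the final (k3, v3) result.
    return {key: sum(values) for key, values in groups.items()}

print(run_job("to be or not to be\nthat is the question", block_size=1))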
 

Python implements MapReduce 

Python MapReduce code

        The "trick" of writing MapReduce jobs in Python is to use the Hadoop Streaming API, which passes data between the Map function and the Reduce function through STDIN (standard input) and STDOUT (standard output).

      The only thing we need to do is read the input data from Python's sys.stdin and write our output to sys.stdout. Hadoop Streaming takes care of everything else.

Map phase

PyCharm functional test code:

# _*_ coding : UTF-8 _*_
# Author     : zhuozhiwengang
# Created on : 2023/8/14 15:38
# File name  : pythonMap_2
# IDE        : PyCharm
import sys

# Read each line from standard input, split it into words,
# and emit one "word<TAB>1" pair per word.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print("%s\t%s" % (word, 1))
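
Note that the mapper does no counting of its own: it simply emits one "word<TAB>1" line per word and relies on the framework's shuffle/sort step to bring identical words together before the reducer sees them.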


Reduce phase

PyCharm functional test code:

import sys

current_word = None
current_count = 0
word = None

# Hadoop Streaming sorts the mapper output by key before it reaches the
# reducer, so all lines for the same word arrive consecutively.
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:  # ignore the line if count is not a number
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word

if word == current_word:  # don't forget the final output
    print("%s\t%s" % (current_word, current_count))
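
Because the reducer only compares each line with the previous one, it relies on its input being sorted by key, which Hadoop Streaming guarantees between the map and reduce phases. The two scripts can therefore be tested locally with an ordinary pipeline such as cat input.txt | python mapper.py | sort -k1,1 | python reducer.py (the file names here are only placeholders) before submitting them to the cluster.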


Hadoop Streaming 

Hadoop Streaming is a utility that ships with Hadoop and helps users create and run a special kind of Map/Reduce job, in which the mapper and reducer can be arbitrary executables or scripts.

Example: we can write the mapper and reducer as Python scripts, mapper.py and reducer.py.

Syntax:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper mapper.py \
-reducer reducer.py

The Hadoop Streaming tool creates a Map/Reduce job, submits it to the cluster, and monitors the job until it completes. For a specific task, our focus is therefore on how to write the Python scripts.
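
Note that, depending on the cluster setup, the scripts usually also have to be shipped with the job so that every node can find them; Hadoop Streaming provides a -file option for this (for example, appending -file mapper.py -file reducer.py to the command above), and the scripts either need to be executable with a shebang line or be invoked through an explicit python interpreter.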

Summary:

1. The code must read from the standard input stream and write to the standard output stream.

2. Because the program is uploaded to the cluster for execution, some Python libraries may not be available there, so pay attention to the dependencies you use.
 

Operation example

To be added 


Origin blog.csdn.net/zhouzhiwengang/article/details/132258157