Hadoop study notes (8): MapReduce

1. MapReduce programming model

MapReduce is a distributed computing framework for solving computational problems over massive data sets.

MapReduce abstracts the entire parallel computing process into two functions:

  Map (mapping): applies a specified operation to each element of a list of independent elements; because the elements are independent, this step is highly parallelizable.

  Reduce: combines the elements of a list into a single result.

A simple MapReduce program only needs to specify the map() and reduce() functions plus the input and output; the framework does the rest.
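As a toy illustration of the two-function model, plain Python's built-in `map` and `reduce` capture the same idea (this is only an analogy, not Hadoop code):

```python
from functools import reduce

# Map: apply an independent operation to every element (parallelizable,
# since no element depends on another).
squared = list(map(lambda x: x * x, [1, 2, 3, 4]))

# Reduce: combine the elements of the list into one result.
total = reduce(lambda a, b: a + b, squared)

print(squared, total)  # [1, 4, 9, 16] 30
```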

2. Map process (take wordcount as an example):

1 Read the input line by line; each line is parsed into key/value form, where the key is the byte offset at which the line starts and the value is the line's content. The map function is called once for each key/value pair.

Suppose there is a file with the content:

hello hadoop!

hello world!

Then the reading process of Map is:

key (byte offset)  value            map output
0                  hello hadoop!    --> hello:1 hadoop!:1
14                 hello world!     --> hello:1 world!:1
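The byte-offset keys above can be reproduced with a small simulation of a line-based reader (modeled on Hadoop's TextInputFormat, which keys each line by the byte offset of its first character; the function name here is illustrative, and '\n' line endings are assumed):

```python
data = "hello hadoop!\nhello world!\n"

def read_lines_with_offsets(text):
    """Yield (byte-offset, line) pairs, mimicking a line record reader."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line)  # offset advances past the newline too

for key, value in read_lines_with_offsets(data):
    print(key, value)
# 0 hello hadoop!
# 14 hello world!
```

Note that the second offset is 14, not 13: "hello hadoop!" is 13 bytes, and the trailing newline is counted as well.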

2 Write your own logic: process the input key/value pair and emit new key/value pairs.

key value
hello 1
hadoop! 1
hello 1
world!  1
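The wordcount map logic that produces the table above can be sketched in plain Python (not the Hadoop Mapper API; `map_fn` is an illustrative name):

```python
# For each input line, emit (word, 1) for every word in it.
def map_fn(key, value):
    for word in value.split():
        yield word, 1

pairs = []
for offset, line in [(0, "hello hadoop!"), (14, "hello world!")]:
    pairs.extend(map_fn(offset, line))

print(pairs)
# [('hello', 1), ('hadoop!', 1), ('hello', 1), ('world!', 1)]
```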


3 Partition the output key/value pairs; each partition will later be handled by one reduce task.


Note: the shuffle phase includes partitioning and sorting.
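Partitioning is typically done by hashing the key, as in Hadoop's default HashPartitioner (which uses the key's hashCode modulo the number of reduce tasks). A plain-Python sketch follows; since Python's built-in `hash()` of a string is salted per process, a simple deterministic hash is used here as a stand-in (an assumption, not Hadoop's actual function):

```python
def simple_hash(s):
    # Deterministic stand-in for Java's String.hashCode().
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) & 0x7FFFFFFF
    return h

def partition(key, num_reduce_tasks):
    return simple_hash(key) % num_reduce_tasks

# Distribute the map output across 2 partitions (i.e. 2 reduce tasks).
buckets = {0: [], 1: []}
for key, value in [("hello", 1), ("hadoop!", 1), ("hello", 1), ("world!", 1)]:
    buckets[partition(key, 2)].append((key, value))
```

The important property is that every occurrence of the same key lands in the same partition, so one reduce task sees all of that key's values.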

4 Sort the data in each partition by key and group it: the values belonging to the same key are collected into one list.

key      list<value>
hadoop!  {1}
hello    {1, 1}
world!   {1}
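The sort-and-group step within one partition can be sketched in plain Python (illustrative, not Hadoop's internal implementation):

```python
from itertools import groupby

# Sort the map output by key, then collect each key's values into a list.
pairs = [("hello", 1), ("hadoop!", 1), ("hello", 1), ("world!", 1)]
pairs.sort(key=lambda kv: kv[0])

grouped = {k: [v for _, v in g] for k, g in groupby(pairs, key=lambda kv: kv[0])}
print(grouped)
# {'hadoop!': [1], 'hello': [1, 1], 'world!': [1]}
```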


5 (Optional) Locally reduce (combine) the grouped data on the map side before it is sent over the network.
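This optional step corresponds to a combiner: the reduce logic is applied locally on each map node, shrinking the data shuffled across the network from (key, {1, 1}) down to (key, 2). A plain-Python sketch (function names are illustrative):

```python
# Apply the reduce logic locally to each map node's grouped output.
def combine(key, values):
    return key, sum(values)

grouped = {"hello": [1, 1], "hadoop!": [1], "world!": [1]}
combined = dict(combine(k, vs) for k, vs in grouped.items())
print(combined)  # {'hello': 2, 'hadoop!': 1, 'world!': 1}
```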

3. Reduce process:

1 The outputs of the multiple map tasks are copied over the network to different reduce nodes, each reduce node fetching its own partition.

2 Merge and sort the outputs of the multiple map tasks. Then write the logic of the reduce function: process each input key and its list of values, and emit new key/value pairs.

3 Save the output of reduce to a file.
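Putting the phases together, the whole wordcount job can be simulated in a single plain-Python process (a sketch of the data flow, not the Hadoop API; `map_fn` and `reduce_fn` are illustrative names):

```python
from itertools import groupby

def map_fn(key, value):
    # Emit (word, 1) for every word in the line.
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # Sum the counts collected for one key.
    yield key, sum(values)

lines = ["hello hadoop!", "hello world!"]

# Map phase.
pairs = [kv for i, line in enumerate(lines) for kv in map_fn(i, line)]

# Shuffle: sort by key and group each key's values into a list.
pairs.sort(key=lambda kv: kv[0])
grouped = ((k, [v for _, v in g]) for k, g in groupby(pairs, key=lambda kv: kv[0]))

# Reduce phase; in a real job this output would be written to a file.
result = dict(kv for k, vs in grouped for kv in reduce_fn(k, vs))
print(result)  # {'hadoop!': 1, 'hello': 2, 'world!': 1}
```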
