1. MapReduce programming model
MapReduce is a distributed computing framework for solving computing problems over massive data sets.
MapReduce abstracts the entire parallel computing process into two functions:
Map: applies a specified operation to each element of a list of independent elements; because the elements are independent, these operations can run with a high degree of parallelism.
Reduce: combines the elements of a list into a single result.
A simple MapReduce program only needs to specify the map() and reduce() functions plus the input and output locations; the framework does the rest.
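The two-function abstraction can be illustrated with Python's built-in map and functools.reduce. This is a local analogy of the model, not the Hadoop API:

```python
from functools import reduce

# Map: apply an independent operation to each element (parallelizable).
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))  # [1, 4, 9, 16]

# Reduce: combine the mapped elements into one result.
total = reduce(lambda a, b: a + b, squares)  # 30
```

In the distributed setting, the framework runs many map operations in parallel across machines and then combines their outputs in the reduce phase.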
2. Map process (take wordcount as an example):
1 The input is read line by line, and each line is parsed into key/value form. The map function is called once for each key-value pair.
Suppose there is a file with the content:
hello hadoop!
hello world!
Then the map-side read produces (the key is the line's starting byte offset):
key | value | map output |
0 | hello hadoop! | --> hello:1 hadoop!:1 |
14 | hello world! | --> hello:1 world!:1 |
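This read step can be roughly simulated as follows; the key is the line's starting byte offset, as Hadoop's default TextInputFormat produces (function and variable names here are illustrative):

```python
def read_records(text):
    """Split text into (byte_offset, line) key-value pairs."""
    records, offset = [], 0
    for line in text.splitlines(keepends=True):
        records.append((offset, line.rstrip("\n")))
        offset += len(line)  # count the newline toward the next offset
    return records

pairs = read_records("hello hadoop!\nhello world!\n")
# [(0, 'hello hadoop!'), (14, 'hello world!')]
```

The second line starts at offset 14 because "hello hadoop!" plus its newline occupies 14 bytes.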
2 Write your own map logic: process each input key/value pair and convert it into new key/value output.
key | value |
hello | 1 |
hadoop! | 1 |
hello | 1 |
world! | 1 |
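The map logic for word count can be sketched as a plain-Python stand-in for a Hadoop Mapper (the names are illustrative):

```python
def map_fn(offset, line):
    """Emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

# Apply the map function to each input key-value pair.
emitted = map_fn(0, "hello hadoop!") + map_fn(14, "hello world!")
# [('hello', 1), ('hadoop!', 1), ('hello', 1), ('world!', 1)]
```

Each call handles one line independently, which is what makes the map phase easy to parallelize.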
3 Partition the output key/value pairs.
Note: the shuffle phase includes partitioning and sorting.
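Hadoop's default partitioner assigns a pair to a reduce task by hashing the key modulo the number of reducers, so all pairs with the same key reach the same reducer. A sketch using Python's hash for illustration (Hadoop uses the key's Java hashCode):

```python
def partition(key, num_reducers):
    """Assign a key to one of num_reducers reduce tasks by hashing."""
    return hash(key) % num_reducers

num_reducers = 2
p1 = partition("hello", num_reducers)
p2 = partition("hello", num_reducers)
# Within one run, the same key always lands in the same partition.
```

This property is what guarantees that step 4 below sees every occurrence of a given key in one place.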
4 Sort the data in each partition by key and group it: the values belonging to the same key are collected into one list.
key | list<value> |
hello | [1, 1] |
hadoop! | [1] |
world! | [1] |
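Sorting and grouping within a partition can be simulated like this (the helper name is illustrative):

```python
from itertools import groupby

def sort_and_group(pairs):
    """Sort pairs by key, then collect each key's values into a list."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return [(k, [v for _, v in group])
            for k, group in groupby(ordered, key=lambda kv: kv[0])]

groups = sort_and_group([('hello', 1), ('hadoop!', 1),
                         ('hello', 1), ('world!', 1)])
# [('hadoop!', [1]), ('hello', [1, 1]), ('world!', [1])]
```

Note that grouping only collects the values; summing them is left to the reduce (or combiner) step.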
5 (Optional) Reduce the grouped data locally on the map node; in Hadoop this is called a combiner.
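For word count, the combiner is typically the same logic as the reducer, applied on the map side to shrink the data sent over the network. A sketch:

```python
def combine(key, values):
    """Locally pre-aggregate counts for one key before the shuffle."""
    return (key, sum(values))

combined = [combine('hello', [1, 1]), combine('hadoop!', [1])]
# Fewer, smaller pairs cross the network: [('hello', 2), ('hadoop!', 1)]
```

A combiner is only safe when the reduce operation is associative and commutative, as summation is here.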
3. Reduce process:
1 The outputs of the map tasks are copied over the network to the reduce nodes according to their partitions.
2 Merge and sort the outputs of the map tasks. Then write your own reduce logic: process each input key and its list of values, and convert them into new key/value output.
3 Save the reduce output to a file.
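The reduce side for word count can be sketched end to end in plain Python (again a stand-in for a Hadoop Reducer, with illustrative names):

```python
def reduce_fn(key, values):
    """Sum the counts collected for one word."""
    return (key, sum(values))

# Merged, sorted, grouped input arriving from the map tasks:
grouped = [('hadoop!', [1]), ('hello', [1, 1]), ('world!', [1])]
result = [reduce_fn(k, vs) for k, vs in grouped]
# [('hadoop!', 1), ('hello', 2), ('world!', 1)]
```

Each (word, total) pair in the result corresponds to one line of the final output file.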