MapReduce basics

MapReduce: a distributed computing framework used to break up the processing of large amounts of data.

Map specifies an operation to be applied to each element of a data set, producing an intermediate result of key-value pairs; Reduce then aggregates all values that share the same intermediate key to obtain the final result.
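To make the two stages concrete, here is a small self-contained sketch in plain Java (no Hadoop needed) that simulates map, shuffle, and reduce in one process. Word count as the job, and the class name WordCountFlow, are assumptions made for illustration only:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// A single-process simulation of map -> shuffle -> reduce.
public class WordCountFlow {
    public static void main(String[] args) {
        List<String> input = List.of("hello world", "hello mapreduce");

        // Map stage: apply an operation to each element of the data set,
        // emitting intermediate (key, value) pairs: here, (word, 1).
        List<Map.Entry<String, Integer>> intermediate = input.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .map(word -> Map.entry(word, 1))
                .collect(Collectors.toList());

        // Shuffle: group all values that share the same intermediate key.
        Map<String, List<Integer>> grouped = intermediate.stream()
                .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Reduce stage: fold each key's value list into the final result.
        grouped.forEach((word, ones) ->
                System.out.println(word + " = " + ones.stream().mapToInt(Integer::intValue).sum()));
        // Prints hello = 2, world = 1, mapreduce = 1 (order may vary).
    }
}
```

The grouping step in the middle is exactly what the Shuffle stage does in a real cluster, just across machines instead of inside one process.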

Advantages:

1) Easy to program: you only need to implement a few simple interfaces

2) Good scalability: when computing resources are insufficient, you can expand computing power by adding more machines

3) High fault tolerance: if a machine goes down, the computing tasks on that machine are transferred to another node

Disadvantages: not suitable for real-time processing (stream computing); the data MapReduce processes is static, for example historical data. (The result of each stage is written to disk, so the next stage must first read it back from disk.)

Let's first get to know the following classes and stages (a sketch putting them together follows this list):

1) Mapper: a generic class with four type parameters: the map function's input key, input value, output key, and output value. The generic type parameters must be reference types, not primitive types (e.g., int, double, char); in Hadoop these are Writable wrapper types such as LongWritable, Text, and IntWritable. The output is a key-value pair.

2) Shuffle: the stage that moves Map output data over to the Reduce side.

3) Combine: local aggregation; it pre-aggregates the output of a single map so as to minimize the amount of data that has to be transferred.

4) Partitioner: the partition processor. It takes the key's hash value modulo the number of reducers, 3 in this example: a result of 0 goes to the first Reducer, 1 to the second, and 2 to the third. No matter which mapper a key comes from, identical keys always end up in the same Reducer.

5) Reduce: receives the per-map statistics from every Map and performs the global aggregation, processing all the values of each key it takes over.
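Putting these pieces together: below is a minimal word-count sketch against the Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). The class names are invented for this example, and word count itself is an assumption, not something the post fixes:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<input key, input value, output key, output value>: all four type
// parameters are Writable reference types, never primitives.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);   // emit intermediate (word, 1) pairs
        }
    }
}

// Reducer: receives every value for one key and folds them into a result.
// The same class can also be registered as the Combiner for per-map local sums.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}

// Partitioner: hash of the key modulo the reducer count, which is exactly
// the mod-3 routing described above when the job runs with three reducers.
class WordCountPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Registering the reducer class a second time as the combiner is what item 3) describes: each map task pre-sums its own (word, 1) pairs, so less data crosses the network during Shuffle.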

The basic principle of MapReduce


1) Map side: output records are written into a ring buffer (100 MB by default).

When the buffer reaches 80% full, its contents are spilled to disk. Before each spill the data is partitioned and sorted (and combined within each partition, if a combiner is set); each spill produces a spill file. Finally all spill files are merged and sorted by partition into a single file, which waits for the reducers to come and fetch it. The configuration properties behind these numbers are shown in the first sketch after this list.

2) Reduce side: each reducer fetches its own partition (chosen by the key's hash value on the map side) from every map task, merges the fetched files with a merge sort on the key/value pairs, and then groups the records by key before calling reduce. A driver sketch that wires everything together follows below.
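For step 1's numbers: assuming Hadoop 2.x/3.x property names, the ring-buffer size and spill threshold are controlled by mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent. A minimal sketch (the class name SpillTuning is made up):

```java
import org.apache.hadoop.conf.Configuration;

public class SpillTuning {
    // Returns a Configuration with the map-side spill knobs set explicitly
    // to their documented defaults: a 100 MB ring buffer, spilled at 80% full.
    public static Configuration withDefaults() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 100);
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        return conf;
    }
}
```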
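Finally, a driver sketch that wires the Mapper, Combiner, Partitioner, and Reducer from the earlier example together, with three reducers to match the mod-3 partitioning. It reuses SpillTuning from the previous sketch; again, this is an illustration, not the post's own code:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(SpillTuning.withDefaults(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);    // per-map local sums (item 3)
        job.setPartitionerClass(WordCountPartitioner.class);
        job.setReducerClass(WordCountReducer.class);
        job.setNumReduceTasks(3);                        // three reducers, matching the mod-3 routing

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths come from the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```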
