My personal understanding of MapReduce

In the Hadoop system, MapReduce plays the role of the distributed computing engine; its core idea is divide and conquer.

Simply put, the whole process is Map + Shuffle + Reduce.

A single MapReduce job can have only one Map phase and one Reduce phase; for more complex logic, you can chain several MapReduce jobs and run them serially.

Before and during the Map phase

The client submits the input split information, the job's jar package, and the XML configuration files to HDFS.

YARN starts a corresponding number of MapTasks according to the number of input splits (one MapTask per split).

Each MapTask uses the user-specified RecordReader to parse its InputSplit into key/value pairs.

The map function is called once for each key/value pair, and its output is collected via OutputCollector.collect().

Map: the input key is the byte offset of the line (a long), the input value is the line of text (Text), and the output is a new key/value pair.
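The classic word-count example makes this concrete. Below is a minimal pure-Python sketch of the idea, not Hadoop's actual Java API (where a Mapper subclass would call OutputCollector.collect() or context.write()):

```python
# Map step of word count: the framework hands the mapper one record at a
# time -- the key is the line's byte offset, the value is the line itself.
def word_count_map(offset, line):
    # Emit one (word, 1) pair per word, the analogue of calling
    # OutputCollector.collect(word, 1) for each word in the Java API.
    return [(word, 1) for word in line.split()]

pairs = word_count_map(0, "to be or not to be")
# pairs: [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
```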

Shuffle

The map output is first partitioned, then written into a ring (circular) buffer.

When the ring buffer is 80% full, it begins to spill to disk (the buffer's default size is 100 MB).

The data in each spill is sorted before the spill file is written (quicksort).

All the small spill files from the different batches are then merged, bringing together records of the same partition (merge sort).

Before the Reduce side pulls the data, a Combiner can optionally be applied.
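The map-side half of shuffle (partition, sort, combine) can be sketched in plain Python. This is an illustration only: the hash-based partition mimics Hadoop's default HashPartitioner, and the combiner is just a local sum of the (word, 1) pairs:

```python
from collections import defaultdict

def partition(key, num_reduce_tasks):
    # Mimics Hadoop's default HashPartitioner: equal keys always land
    # in the same partition, hence at the same ReduceTask.
    return hash(key) % num_reduce_tasks

def spill(pairs, num_reduce_tasks):
    # One "spill file": pairs bucketed by partition, each bucket
    # sorted by key (Hadoop uses quicksort here).
    buckets = defaultdict(list)
    for k, v in pairs:
        buckets[partition(k, num_reduce_tasks)].append((k, v))
    return {p: sorted(b) for p, b in buckets.items()}

def combine(sorted_pairs):
    # Local pre-aggregation (the Combiner) before the reduce side
    # pulls the data: adjacent pairs with the same key are summed.
    out = []
    for k, v in sorted_pairs:
        if out and out[-1][0] == k:
            out[-1] = (k, out[-1][1] + v)
        else:
            out.append((k, v))
    return out
```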

The appropriate number of ReduceTasks is started.

The Reduce side pulls its data from the map side into memory; when memory fills up, the data is spilled to disk.

Each ReduceTask pulls its own partition from the result files of the different MapTasks, then merges those files (merge sort).

Records with the same key are grouped together.
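The reduce-side merge and grouping can be sketched as a k-way merge of the already-sorted runs pulled from the different MapTasks, followed by grouping on the key (again a plain-Python illustration, not Hadoop's API):

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_and_group(sorted_runs):
    # k-way merge sort of the sorted runs (one run per MapTask)...
    merged = heapq.merge(*sorted_runs)
    # ...then group consecutive pairs that share the same key, yielding
    # (key, [values]) -- exactly what one reduce() call receives.
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, [v for _, v in group]
```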

Reduce phase

The reduce method is called once for each key group.

The results are written out via TextOutputFormat (the default output format).
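For word count, the reduce call and the default text output can be sketched like this: reduce receives one key with all of its values, and TextOutputFormat writes each result as a "key TAB value" line (again pure Python, illustrating the Java behavior):

```python
def word_count_reduce(key, values):
    # reduce() is invoked once per key group; for word count it
    # simply sums the 1s emitted by the map side.
    return key, sum(values)

def text_output_format(results):
    # Mimics TextOutputFormat's default line format: key \t value.
    return "\n".join(f"{k}\t{v}" for k, v in results)

out = text_output_format(word_count_reduce(k, vs)
                         for k, vs in [("be", [1, 1]), ("to", [1, 1])])
# out: "be\t2\nto\t2"
```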


MRAppMaster: the coordinating process responsible for scheduling the entire job and tracking its status.

Origin blog.csdn.net/jnb_985027859/article/details/94649687