A brief introduction to MapReduce and its workflow

Execution steps of the MR programming model:

  1. Prepare input data for map processing

  2. Map processing

  3. Shuffle

  4. Reduce processing

  5. Result output

 (input) <k1,v1> -> map -> <k2,v2> -> combine -> <k2,v2> -> reduce -> <k3,v3> (output)
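The five steps above can be sketched in a single process using word count, the canonical MapReduce example. This is a minimal illustration of the logical <k1,v1> -> map -> combine -> reduce -> <k3,v3> flow, not the actual Hadoop API; all class and method names here are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Word count as a single-process sketch of the MR data flow.
public class WordCountFlow {
    public static Map<String, Integer> run(List<String> lines) {
        // map: (offset, line) -> list of (word, 1) pairs
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                if (!word.isEmpty()) mapped.add(Map.entry(word, 1));

        // combine/shuffle/reduce collapsed into one group-and-sum step;
        // in a real job, combine runs per mapper and reduce runs per key group.
        Map<String, Integer> reduced = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapped)
            reduced.merge(kv.getKey(), kv.getValue(), Integer::sum);
        return reduced;
    }

    public static void main(String[] args) {
        // -> {hello=2, mr=1, world=1}
        System.out.println(run(List.of("hello world", "hello mr")));
    }
}
```

In a real cluster the map outputs are partitioned across nodes between the map and reduce phases, which is exactly what the Shuffle step below describes.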


Processing flow:

  1. Input is read through the InputFormat hierarchy (InputFormat -> FileInputFormat -> TextInputFormat): the getSplits method produces an array of Splits, getRecordReader then reads each Split, and every line read is handed to a Mapper for processing

  2. The map output on each node is handed to that node's Partitioner (the Shuffle process), which decides by key whether a record is routed to a Reducer on another node or stays on this node

  3. The intermediate data is sorted by key

  4. The grouped results are processed by Reduce

  5. When processing completes, the output is written through OutputFormat -> FileOutputFormat -> TextOutputFormat to the local filesystem or HDFS
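The key-based routing decision in step 2 is typically a hash of the key modulo the number of reducers. The sketch below shows that default scheme; it mirrors how Hadoop's HashPartitioner behaves, but the class and method here are illustrative, not the real API.

```java
// A sketch of default hash partitioning during Shuffle:
// each map output key is assigned to one of numReducers partitions.
public class PartitionSketch {
    // Mask off the sign bit so the result is non-negative,
    // then take the remainder modulo the reducer count.
    public static int getPartition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        for (String k : new String[] {"apple", "banana", "cherry"})
            System.out.println(k + " -> reducer " + getPartition(k, 3));
    }
}
```

Because the partition depends only on the key, all values for a given key always land on the same Reducer, which is what makes the per-key grouping in the Reduce phase possible.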

 

Split: the block of data processed by MR and the smallest computing unit in MR. By default, a Split corresponds one-to-one with a Block in HDFS (the smallest storage unit in HDFS, 128 MB by default); the split size can also be set manually (changing it is not recommended)
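With the default one-Split-per-Block mapping, the number of Splits (and therefore Mappers) follows directly from the file size. A small illustration, assuming the 128 MB default block size mentioned above; the file size is a made-up example:

```java
public class SplitCount {
    public static final long BLOCK_SIZE = 128L * 1024 * 1024; // HDFS default block size, 128 MB

    // With split size == block size (the default), the number of
    // Splits, and therefore Mappers, is ceil(fileSize / splitSize).
    public static long numSplits(long fileSizeBytes, long splitSize) {
        return (fileSizeBytes + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long file = 256L * 1024 * 1024; // a hypothetical 256 MB file
        System.out.println(numSplits(file, BLOCK_SIZE));     // 2 splits at the default size
        System.out.println(numSplits(file, BLOCK_SIZE / 2)); // 4 splits when halved: n Blocks -> 2n Splits
    }
}
```

Halving the split size doubles the number of Mappers for the same data, which is the n Blocks to 2n Splits situation described in the figure explanation further down.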

InputFormat: splits the input data into Splits, via InputSplit[] getSplits(JobConf var1, int var2)

  TextInputFormat: used to process data in text format

OutputFormat: writes the output data

 

 

Explanation of the figure above:

  Normally one Split corresponds to one Block, but the figure shows the result after changing the split size.

  A file is divided into n Blocks, corresponding to 2n Splits. After InputFormat processing, each Split is handled by one Mapper; after Shuffle grouping and sorting, the data goes to multiple Reducers, and each Reducer produces one output file

 

 

 

MapReduce 1.x architecture: one JobTracker + multiple TaskTrackers

    JobTracker: responsible for resource management and job scheduling

    TaskTracker: periodically reports node health, resource usage, and task status to the JobTracker, and receives JT commands such as starting or killing tasks

 

MapReduce 2.x (YARN): resource management is split out of the JobTracker into a global ResourceManager plus per-node NodeManagers, while a per-application ApplicationMaster handles job scheduling

  


Origin blog.csdn.net/asd54090/article/details/80920592