The MapTask implementation mechanism in MapReduce

Overview of the map stage: the input file is first logically divided into multiple splits; a RecordReader then reads the contents of each split line by line and hands each line to the map method for processing (implemented by the user). The map output is passed to the OutputCollector, which partitions the results by key (hash partitioning by default) and then writes them into a buffer. Every map task has such an in-memory buffer; when the buffer is nearly full, the buffered data is spilled to a temporary file on disk. After the whole map task has finished, all of the temporary files it created on disk are merged into a single, final output file, which then waits for the reduce tasks to pull its data.
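To make the user-implemented map step concrete, here is a minimal, WordCount-style Mapper sketch (illustrative only; the class and field names are not from the original article). Each call to map receives one line of input, and each context.write call hands a key/value pair to the collector described below.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Minimal word-count style Mapper: the input key is the byte offset of the line,
// the input value is the line of text produced by LineRecordReader.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Split the line into words and emit <word, 1> for each one.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // handed to the collector, then partitioned and buffered
            }
        }
    }
}
```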

Detailed steps:

  1. First, the input-reading component InputFormat (TextInputFormat by default) calls its getSplits method to logically slice the files in the input directory into splits. The number of splits determines the number of MapTasks that will be started; by default each split corresponds to one block.
  2. After the input file has been divided into splits, a RecordReader object (LineRecordReader by default) reads one line at a time, using \n as the delimiter, and returns a <key, value> pair. The key is the byte offset of the first character of the line, and the value is the text of that line.
  3. Each <key, value> pair returned from reading the split is passed to the user's own subclass of Mapper, where the user-overridden map function (like the sketch above) is executed. The map function is called once for every line the RecordReader reads.
  4. After the map logic has finished for a record, each result is collected by calling context.write. Inside this collect step, the data is first partitioned, using HashPartitioner by default.
    1. MapReduce provides the Partitioner interface. Its job is to decide, based on the key (or value) and the number of reduce tasks, which reduce task should ultimately process the current output record. The default behaviour is to hash the key and take it modulo the number of reduce tasks. This default only spreads the load evenly across the reducers on average; if you have your own partitioning requirements, you can implement a custom Partitioner and set it on the job (see the Partitioner sketch after this list).
  5. Next, the data is written into memory. This region of memory is called the ring buffer (a circular buffer); it collects map results in batches in order to reduce the impact of disk IO. Both the key/value pair and its partition result are written into the buffer. Before being written, the key and value are serialized into byte arrays.
    1. The ring buffer is in fact a byte array. It stores the serialized key/value data together with metadata about each record, including its partition, the start position of the key, the start position of the value, and the length of the value. The ring structure is just an abstraction over this array.
    2. The buffer has a limited size, 100MB by default. When the map task produces a lot of output, the buffer could exhaust memory, so under certain conditions the buffered data has to be temporarily written to disk and the buffer then reused. This process of writing data from memory to disk is called a spill (overflow write). It is performed by a separate thread, so it does not affect the thread that writes map results into the buffer. The spill thread must also not block the map from emitting output, so a threshold ratio, spill.percent, controls when the spill starts. The default ratio is 0.8: when the buffered data reaches the threshold (buffer size * spill.percent = 100MB * 0.8 = 80MB), the spill thread starts and locks those 80MB while it spills them to disk; meanwhile the map task keeps writing its remaining output into the other 20MB, and the two proceed independently (a configuration sketch for these settings follows this list).
  6. When the spill thread starts, it sorts the keys within those 80MB of memory (Sort). Sorting is a default behaviour of the MapReduce model; here the sort is performed on the serialized bytes.
    1. If a Combiner has been set on the job, this is when it is used. It adds up the values of key/value pairs that share the same key, reducing the amount of data the spill writes to disk. The Combiner optimizes the intermediate results of MapReduce, so it may be applied multiple times throughout the model.
    2. In which scenarios can a Combiner be used? From the analysis above, the Combiner's output is the Reducer's input, and the Combiner must not change the final result. It should therefore only be used when the reduce input key/value types match the output key/value types and its use does not affect the final result, for example summation or taking a maximum. Use the Combiner with care: used well, it helps job execution efficiency; used badly, it distorts the final reduce result (a usage sketch follows this list).
  7. Every spill produces a temporary file on disk (after first checking whether a Combiner should run). If the map output is really large, many such spills occur and the disk ends up with a corresponding number of temporary files. When the map task has finished processing all of its data, the temporary files on disk are merged into a single output file, and an index file is also written that records, for each reduce task, the offset of its corresponding data within that file.
  8. The entire map phase is finished.
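For step 4.1 above, a custom Partitioner can be plugged in when the default hash-modulo behaviour is not enough. The sketch below is only an assumed example (the routing rule and class name are not from the article): it pins one particular key to reducer 0 and hash-partitions everything else across the remaining reducers.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative custom Partitioner: route the key "error" to reducer 0,
// hash-partition all other keys across the remaining reducers.
public class LogLevelPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks <= 1 || "error".equals(key.toString())) {
            return 0;
        }
        // Same idea as the default HashPartitioner: hash, mask the sign bit, take the modulo.
        return (key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1) + 1;
    }
}
```

It would be registered on the job with `job.setPartitionerClass(LogLevelPartitioner.class)` together with an appropriate `job.setNumReduceTasks(...)`.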
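For step 5 above, the ring-buffer size and spill threshold are ordinary job configuration settings. A minimal sketch, assuming the Hadoop 2.x/3.x property names (mapreduce.task.io.sort.mb, mapreduce.map.sort.spill.percent, mapreduce.task.io.sort.factor); check the names against the Hadoop version you actually run.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Size of the in-memory ring buffer for map output, in MB (default 100).
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        // Fraction of the buffer that triggers a spill to disk (default 0.80).
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        // How many spill files are merged at once in the final merge (default 10).
        conf.setInt("mapreduce.task.io.sort.factor", 10);

        Job job = Job.getInstance(conf, "spill config example");
        // ... set mapper, reducer, input and output paths as usual, then job.waitForCompletion(true)
    }
}
```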
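For steps 6.1 and 6.2, when the operation is associative and the types line up (summation, maximum, and so on), the Combiner is usually just the Reducer class reused. Below is a minimal sketch of the familiar word-count sum reducer; the class name is an assumption here, not taken from the article.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Summation is Combiner-safe: adding up partial sums on the map side
// does not change the total computed on the reduce side.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```

It is enabled with `job.setCombinerClass(IntSumReducer.class)` alongside `job.setReducerClass(IntSumReducer.class)`. As a counter-example, an averaging reducer must not be reused as a Combiner, because an average of partial averages generally differs from the overall average.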
 



Origin www.cnblogs.com/TiePiHeTao/p/b538975b98d02effba4b125491f5d398.html