The Hadoop Shuffle Process

1. The Mapper's Shuffle

  1. After receiving its FileSplit, the MapTask reads the split line by line.
  2. Each line read triggers one call to the map method (a minimal Mapper sketch follows this list).
  3. The output produced by each map call is written to an in-memory buffer.
  4. The default buffer size is 100 MB and can be adjusted with io.sort.mb.
  5. Inside the buffer the data is partitioned (partition), sorted (sort), and optionally combined (combine).
  6. When buffer usage reaches the 0.8 threshold, a background thread starts writing the buffered data to a file in the specified directory; this process is called a spill.
  7. Every spill produces a new spill file.
  8. After all the data has been written, the spill files are merged (merge) into a single new file that is partitioned and sorted.
  9. If the number of spill files at the final merge is >= 3, the Combiner is run once more after the merge completes.
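To make the map side concrete, here is a minimal WordCount-style Mapper sketch; the class and field names are illustrative and not from the original post. Each call to map handles one input line, and every context.write goes into the in-memory sort buffer described above.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative Mapper: map() is invoked once per line of the FileSplit,
// and its output is collected into the map-side sort buffer.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // buffered, partitioned, sorted, then spilled
        }
    }
}
```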

 

2. Points to Note

  1. During the last spill, the data remaining in the buffer is flushed into the final spill file.
  2. A spill is theoretically 80 MB by default, but serialization overhead and the final flush, among other factors, must be taken into account (see the configuration sketch after this list).
  3. The size of the split a MapTask processes cannot be used to estimate how much data the MapTask will output.
  4. Each split corresponds to one MapTask, and each MapTask has its own buffer.
  5. The buffer is essentially a byte array.
  6. The buffer is a ring (circular) buffer; the benefit is that buffer addresses can be reused.
  7. The purpose of the threshold is to prevent the spill process from blocking further map output.
  8. The merge step may not occur at all (for example, when only one spill file is produced).
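As a rough sketch of how the buffer size and spill threshold are set, assuming the Hadoop 2.x property names (older releases use io.sort.mb and io.sort.spill.percent):

```java
import org.apache.hadoop.conf.Configuration;

public class SpillSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Sort buffer size in MB; 100 MB is the default mentioned above.
        conf.setInt("mapreduce.task.io.sort.mb", 100);
        // Spill threshold as a fraction of the buffer: at 0.80 a background
        // thread starts spilling once roughly 80 MB is filled, so the map
        // can keep writing into the remaining 20% of the ring buffer.
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        System.out.println(conf.get("mapreduce.task.io.sort.mb") + " MB sort buffer");
    }
}
```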

 

3. The Reducer's Shuffle

  1. Each ReduceTask fetches its partition from the map output files over HTTP; this process is called fetch.
  2. Each ReduceTask merges the partition data it has fetched and then sorts it.
  3. Records with the same key are aggregated and their values are placed into an iterator; this step is called grouping.
  4. The reduce method is then called, with the key and the value iterator passed in (a minimal Reducer sketch follows this list).
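A minimal Reducer sketch (again with illustrative names) shows what the grouping step delivers: one key plus an iterator over all of its values.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative Reducer: reduce() receives a key and the grouped value iterator
// produced by the fetch/merge/sort/group steps described above.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {   // iterate the grouped values for this key
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```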

 

4. Points to Note

  1. The default number of fetch threads is 5.
  2. The ReduceTask start threshold is 5%, i.e. ReduceTasks are started once 5% of the MapTasks have completed.
  3. The default merge factor is 10, i.e. every 10 files are merged into one (see the configuration sketch after this list).
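These defaults correspond roughly to the following configuration properties, assuming Hadoop 2.x names; the values shown are the defaults, so setting them explicitly only documents them:

```java
import org.apache.hadoop.conf.Configuration;

public class ReduceShuffleDefaults {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Parallel fetch (copier) threads per ReduceTask; default 5.
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 5);
        // Fraction of MapTasks that must finish before ReduceTasks start; default 0.05 (5%).
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.05f);
        // Merge factor: number of files merged in a single pass; default 10.
        conf.setInt("mapreduce.task.io.sort.factor", 10);
    }
}
```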

 

5. Flow Chart

 

6. Shuffle Tuning

  1. Map-stage tuning:
    1. Enlarge the buffer; it can generally be adjusted to 250~350 MB.
    2. Introduce a combine step.
    3. Compress the merged output file to reduce network transfer cost.
  2. Reduce-stage tuning:
    1. Increase the number of fetch threads.
    2. Lower the ReduceTask start threshold.
    3. Increase the merge factor (a driver sketch applying these settings follows this list).
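A hedged driver sketch applying these tuning knobs; the class names (ShuffleTunedDriver, WordCountReducer) and the concrete values are illustrative, and the property names assume Hadoop 2.x:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTunedDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Map stage: enlarge the sort buffer and compress map output.
        conf.setInt("mapreduce.task.io.sort.mb", 300);               // in the 250~350 MB range
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        // Reduce stage: more fetch threads, earlier start, larger merge factor.
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.02f); // below the 5% default
        conf.setInt("mapreduce.task.io.sort.factor", 50);

        Job job = Job.getInstance(conf, "shuffle-tuned wordcount");
        job.setJarByClass(ShuffleTunedDriver.class);
        job.setCombinerClass(WordCountReducer.class);  // introduce the combine step
        // Mapper/Reducer classes and input/output paths omitted for brevity.
    }
}
```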

 

Origin blog.csdn.net/yang134679/article/details/93780797