1. Mapper's Shuffle
- After receiving a FileSplit, the MapTask reads it line by line
- The map method is called once for each line read
- After each map call completes, its output is written to a buffer
- The buffer defaults to 100 MB and can be adjusted with io.sort.mb (named mapreduce.task.io.sort.mb in Hadoop 2 and later)
- In the buffer, the data is partitioned, sorted, and combined (if a Combiner is configured)
- When buffer utilization reaches the threshold of 0.8, a background thread starts writing the buffered data to a spill file in the specified directory; this process is called a spill (Spill)
- Every spill produces a new spill file
- After all the data has been written, all spill files are merged once (merge) into a single new file that is partitioned and sorted
- If there are 3 or more spill files at the final merge, the Combiner is run once more on the merged data
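
The steps above can be sketched as a small simulation. This is only an illustration in plain Python, not Hadoop's implementation: the buffer counts records instead of bytes, `partition` is a hypothetical toy partitioner, and the Combiner is left out.

```python
from heapq import merge

BUFFER_CAPACITY = 10   # records; stand-in for the 100 MB buffer (assumption)
SPILL_THRESHOLD = 0.8  # spill once the buffer is 80% full
NUM_PARTITIONS = 2     # illustrative number of ReduceTasks


def partition(key):
    # Hypothetical deterministic partitioner: key length modulo partitions.
    return len(key) % NUM_PARTITIONS


def sort_key(record):
    # Records are ordered by partition first, then by key.
    return (partition(record[0]), record[0])


def map_side_shuffle(records):
    """Return (spill_files, merged_file) for a stream of (key, value) records."""
    buffer, spill_files = [], []
    for record in records:
        buffer.append(record)
        if len(buffer) >= BUFFER_CAPACITY * SPILL_THRESHOLD:
            spill_files.append(sorted(buffer, key=sort_key))  # one new spill file
            buffer = []
    if buffer:
        # Whatever is left in the buffer is flushed as the last spill file.
        spill_files.append(sorted(buffer, key=sort_key))
    # Merge the sorted spill files into one partitioned, sorted file.
    merged_file = list(merge(*spill_files, key=sort_key))
    return spill_files, merged_file


words = "a bb a ccc bb a dddd a bb ccc a bb".split()
spills, merged = map_side_shuffle((w, 1) for w in words)
print(len(spills), "spill files,", len(merged), "merged records")
```

With 12 input records and a threshold of 8 records, the first spill happens after the eighth record and the remaining four are flushed at the end.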
2. Notes
- When spilling occurs, the data still remaining in the buffer at the end is flushed to the last spill file
- A spill file is theoretically 80 MB by default, but serialization, the final flush, and other factors mean the actual size can differ
- The size of the split a MapTask processes says nothing about how much data that MapTask outputs
- Each split corresponds to one MapTask, and each MapTask has its own buffer
- The buffer is essentially a byte array
- The buffer is a ring (circular) buffer; the benefit is that the same buffer memory can be reused
- The purpose of the threshold is to keep a spill from blocking the map: while the spill thread writes out the filled 80%, map output can continue into the remaining space
- The merge may not occur, for example when only one spill file was produced
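
To make the ring buffer and threshold notes concrete, here is a minimal sketch of a ring buffer over a byte array. The class and its method names are hypothetical, and Hadoop's real map output buffer is more involved (it also stores record metadata in the same array); the sketch only shows memory reuse and the threshold signal.

```python
class RingBuffer:
    """Toy ring buffer over a fixed byte array (illustrative, not Hadoop's)."""

    def __init__(self, capacity, threshold=0.8):
        self.data = bytearray(capacity)  # the buffer is just a byte array
        self.capacity = capacity
        self.threshold = threshold
        self.start = 0  # index of the oldest byte not yet spilled
        self.used = 0   # number of bytes waiting to be spilled

    def write(self, payload):
        """Append bytes; return True when a spill should be started."""
        if len(payload) > self.capacity - self.used:
            raise BufferError("buffer full: the writer would have to block")
        for byte in payload:
            # Wrap around the end of the array instead of allocating more.
            self.data[(self.start + self.used) % self.capacity] = byte
            self.used += 1
        return self.used >= self.capacity * self.threshold

    def spill(self):
        """Return the pending bytes and free their space for reuse."""
        out = bytes(
            self.data[(self.start + i) % self.capacity] for i in range(self.used)
        )
        self.start = (self.start + self.used) % self.capacity
        self.used = 0
        return out


rb = RingBuffer(10)
below = rb.write(b"abcdefg")   # 7 of 10 bytes used: below the 0.8 threshold
reached = rb.write(b"h")       # 8 of 10 bytes used: threshold reached
first = rb.spill()
rb.write(b"ijkl")              # wraps around and reuses the start of the array
second = rb.spill()
```

After the first spill the write position is at index 8, so the next four bytes occupy indices 8, 9, 0, and 1 of the same array.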
3. Reducer's Shuffle
- Each ReduceTask obtains its partition of the map output files over HTTP; this process is called fetch
- Each ReduceTask merges the partition data it has fetched and sorts it again
- Values with the same key are gathered into one iterator; this step is called grouping
- The reduce method is called with the key and the iterator as arguments
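
The merge, grouping, and reduce steps can be sketched with Python's standard library. This is a simplified model under the assumption that each fetched segment is already sorted by key; `reduce_side_shuffle` and `sum_reducer` are hypothetical names, and the HTTP fetch itself is not modeled.

```python
from heapq import merge
from itertools import groupby
from operator import itemgetter


def reduce_side_shuffle(fetched_segments, reduce_fn):
    """Merge sorted segments, group values by key, call reduce_fn per key."""
    # Merge the already-sorted segments into one sorted stream.
    merged = merge(*fetched_segments, key=itemgetter(0))
    results = []
    # Grouping: consecutive records with the same key share one iterator.
    for key, group in groupby(merged, key=itemgetter(0)):
        values = (value for _, value in group)
        results.append(reduce_fn(key, values))
    return results


def sum_reducer(key, values):
    # Example reduce method: sum all values for the key.
    return (key, sum(values))


segments = [
    [("a", 1), ("b", 1)],  # fetched from MapTask 1
    [("a", 1), ("c", 1)],  # fetched from MapTask 2
    [("b", 1)],            # fetched from MapTask 3
]
counts = reduce_side_shuffle(segments, sum_reducer)
print(counts)
```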
4. Notes
- The default number of fetch threads is 5
- The ReduceTask start threshold is 5%, i.e. ReduceTasks are started once 5% of the MapTasks have completed
- The merge factor defaults to 10, i.e. every 10 files are merged into one file
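
As a rough illustration of what the merge factor means for the number of merge passes, the following sketch counts passes when every pass merges groups of `factor` files into one. It is a simplified model, not Hadoop's actual merge scheduling, and `merge_rounds` is a hypothetical helper.

```python
import math


def merge_rounds(num_files, factor=10):
    """Count merge passes needed to reduce num_files to a single file."""
    rounds = 0
    while num_files > 1:
        # Each pass turns every group of up to `factor` files into one file.
        num_files = math.ceil(num_files / factor)
        rounds += 1
    return rounds


for n in (1, 10, 25, 101):
    print(n, "files ->", merge_rounds(n), "pass(es) with factor 10")
```

Under this model a larger merge factor means fewer passes over the data, at the cost of more files being open at the same time during each pass.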
5. Flow chart
6. Shuffle tuning
- Map stage tuning:
  - Increase the buffer size; it can generally be raised to 250 ~ 350 MB
  - Introduce a Combiner where the job allows it
  - Compress the merged output file to reduce network transfer cost
- Reduce stage tuning:
  - Increase the number of fetch threads
  - Lower the ReduceTask start threshold
  - Increase the merge factor
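
As a sketch of how these knobs might be passed to a job, the snippet below turns a set of settings into `-D name=value` arguments. The property names are the Hadoop 2 and later names (the text above only mentions the older io.sort.mb), the values are purely illustrative, and `-D` options are only picked up if the job's driver parses generic options (for example via ToolRunner). Check the names against your Hadoop version before relying on them.

```python
# Illustrative values only; suitable settings depend on the cluster and the job.
TUNING = {
    "mapreduce.task.io.sort.mb": 256,                      # larger map-side buffer
    "mapreduce.map.output.compress": "true",               # compress map output
    "mapreduce.reduce.shuffle.parallelcopies": 10,         # more fetch threads
    "mapreduce.job.reduce.slowstart.completedmaps": 0.03,  # lower start threshold
    "mapreduce.task.io.sort.factor": 20,                   # larger merge factor
}


def to_generic_options(settings):
    """Render settings as a flat list of '-D', 'name=value' arguments."""
    args = []
    for name, value in settings.items():
        args += ["-D", f"{name}={value}"]
    return args


options = to_generic_options(TUNING)
print(" ".join(options))
```

The Combiner is not a configuration value in this sketch; it is normally set in the job driver, for example with `Job.setCombinerClass`.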