Understanding of Hadoop operating mechanism
Shuffle process
Map-side Shuffle
- Spill: write the data processed by each MapTask to disk
  - Every record emitted by a MapTask is first tagged with its partition number
  - Tagged records are written into a circular in-memory buffer (default size: 100 MB)
  - When the buffer reaches the 80% threshold, that portion is locked and prepared for spilling
  - The locked 80% is sorted as K2V2 so that records of the same partition sit together
    - In memory: quicksort
  - The sorted portion is then written to disk as one small file
  - By the end, each MapTask has produced many sorted small files
- Merge: merge all the small files belonging to each MapTask into one large file
  - Merge sort: a merge pass over files that are already sorted
  - Each MapTask ends up with one large, fully sorted file
- When a MapTask finishes, it notifies the application manager (ApplicationMaster), and the ApplicationMaster notifies the ReduceTasks
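The spill-then-merge flow above can be sketched as a small simulation (plain Python, not the Hadoop API; the partition function, buffer batches, and record values are illustrative assumptions):

```python
import heapq

NUM_PARTITIONS = 2  # illustrative; real jobs derive this from the ReduceTask count

def spill(buffer):
    # Sort one buffer-full by (partition, key) so records of the same
    # partition sit together (Hadoop sorts in memory with quicksort),
    # then "write" the result as one small sorted file
    return sorted(buffer)

# Records emitted by one MapTask, each tagged with a partition number first
records = [("banana", 1), ("apple", 1), ("date", 1), ("cherry", 1), ("apple", 1)]
tagged = [(len(key) % NUM_PARTITIONS, key, val) for key, val in records]

# The buffer reaches its threshold twice, producing two sorted spill files
runs = [spill(tagged[:3]), spill(tagged[3:])]

# Merge phase: a merge sort over the sorted runs yields one large sorted file
merged = list(heapq.merge(*runs))
assert merged == sorted(tagged)
```

`heapq.merge` plays the role of the merge pass here: because each input run is already sorted, producing the final ordered file only requires streaming comparisons, not a full re-sort.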
Reduce-side Shuffle
- Merge: each ReduceTask fetches its own partition's data from every MapTask
- Merge sort: merge-sort all the data belonging to this ReduceTask
- As a result, the data inside each ReduceTask is globally sorted and grouped
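The fetch, merge-sort, and grouping steps on the Reduce side can be sketched like this (a conceptual simulation with assumed sample data, not the Hadoop API):

```python
import heapq
from itertools import groupby

# Each ReduceTask fetches its own partition's sorted output from every MapTask
map_outputs = [
    [("a", 1), ("c", 1)],   # from MapTask 0 (already sorted)
    [("a", 1), ("b", 1)],   # from MapTask 1
    [("b", 1), ("c", 1)],   # from MapTask 2
]

# Merge-sort the fetched runs, then group adjacent records that share a key
merged = heapq.merge(*map_outputs)
grouped = {k: [v for _, v in grp] for k, grp in groupby(merged, key=lambda kv: kv[0])}
assert grouped == {"a": [1, 1], "b": [1, 1], "c": [1, 1]}
```

Grouping can be done with a single linear pass (`groupby`) only because the merge step already put equal keys next to each other; this is why sorting is a prerequisite for grouping in the shuffle.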
Exercise: custom grouping comparator on order id, sorting comparator on order price

Sample records (order id, product id, price):
Order_0000001 Pdt_01 222.8
Order_0000001 Pdt_05 25.8
Order_0000002 Pdt_03 522.8
Order_0000002 Pdt_04 122.4
Order_0000002 Pdt_05 722.4
Order_0000003 Pdt_01 222.8
Order_0000003 Pdt_01 1000.8
Order_0000003 Pdt_01 999.8

Records sorted by price in descending order:
Order_0000003 Pdt_01 1000.8
Order_0000002 Pdt_05 722.4
Order_0000003 Pdt_01 222.8
Order_0000001 Pdt_01 222.8
Order_0000002 Pdt_04 122.4
Order_0000001 Pdt_05 25.8
Grouping rules
- Default: K2's sorting comparator is also used as the grouping comparator
- Custom: extend WritableComparator and implement the compare method, then register it with job.setGroupingComparatorClass
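The effect of pairing a sorting comparator (order id ascending, price descending) with a grouping comparator that compares only the order id can be simulated in plain Python (not the Hadoop API) on the exercise data above:

```python
from itertools import groupby

# Records from the exercise: (order_id, product, price)
rows = [
    ("Order_0000001", "Pdt_01", 222.8), ("Order_0000001", "Pdt_05", 25.8),
    ("Order_0000002", "Pdt_03", 522.8), ("Order_0000002", "Pdt_04", 122.4),
    ("Order_0000002", "Pdt_05", 722.4), ("Order_0000003", "Pdt_01", 222.8),
    ("Order_0000003", "Pdt_01", 1000.8), ("Order_0000003", "Pdt_01", 999.8),
]

# Sorting comparator: order id ascending, then price descending
rows.sort(key=lambda r: (r[0], -r[2]))

# Grouping comparator: compare only the order id, so each order is one group;
# the first record of each group is then the highest-priced item of that order
max_per_order = [next(grp) for _, grp in groupby(rows, key=lambda r: r[0])]
assert max_per_order == [
    ("Order_0000001", "Pdt_01", 222.8),
    ("Order_0000002", "Pdt_05", 722.4),
    ("Order_0000003", "Pdt_01", 1000.8),
]
```

This is the classic "secondary sort" pattern: the sort comparator decides the order inside each reduce group, while the grouping comparator decides the group boundaries.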
Shuffle optimization
- Try to avoid the shuffle phase altogether where possible
  - Map Join
- Combiner: aggregation on the Map side
  - When there are many MapTasks and each one processes relatively little data, let every MapTask pre-aggregate its own output so that less data enters the Reduce phase
  - Aggregation logic: the Reduce logic
  - Implementation: job.setCombinerClass(Reduce.class)
  - When it runs: every time a sort finishes (at spill time, and again during the merge)
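The effect of a Combiner can be shown with a word-count-style simulation (plain Python, with assumed sample data; in Hadoop the same Reduce logic would be registered via job.setCombinerClass):

```python
from collections import Counter

def reduce_logic(pairs):
    # Sum the values for each key; usable both as the Combiner and the Reducer
    totals = Counter()
    for key, val in pairs:
        totals[key] += val
    return sorted(totals.items())

# One MapTask's raw output: many (word, 1) pairs
map_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)]

# With a Combiner, only the pre-aggregated pairs cross the network to Reduce
combined = reduce_logic(map_output)
assert combined == [("a", 3), ("b", 2)]
assert len(combined) < len(map_output)  # less data enters the Reduce phase
```

Note the standard caveat: the Reduce logic can only be reused as a Combiner when the aggregation is associative and commutative (sums and counts qualify; averages do not, without restructuring).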
- Compress: compression
  - Use compression to reduce disk and network I/O and speed up data transfer
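A quick illustration of why compression pays off for shuffle data, which tends to be highly repetitive (gzip is used here as a stand-in for whichever codec the cluster configures):

```python
import gzip

# Repetitive intermediate data, like typical shuffle output
payload = ("Order_0000001\tPdt_01\t222.8\n" * 1000).encode()
compressed = gzip.compress(payload)

# Far fewer bytes hit the disk and the network, at the cost of some CPU
assert len(compressed) < len(payload)
```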
Supplements in MapReduce
- Split rules
  - Check: remaining file size / split size > 1.1
    - If greater than 1.1, cut off one split of exactly the split size and repeat on the remainder
    - The final remainder becomes its own split
    - If not greater than 1.1, the entire remaining file is one split
  - Split size = max(minimum split size, min(maximum split size, block size))
  - Example (block size 128 MB, so the threshold is 128 × 1.1 = 140.8 MB):
    - 135 MB file: 135 < 140.8, so the whole file is a single split
    - 145 MB file: 145 > 140.8, so it is cut into two splits
      - split1: 128 MB
      - split2: 17 MB
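The split-size formula and the 1.1 rule above can be written out directly (a sketch of the logic, not Hadoop's actual FileInputFormat code):

```python
def split_size(min_split, max_split, block_size):
    # splitSize = max(minSplitSize, min(maxSplitSize, blockSize))
    return max(min_split, min(max_split, block_size))

def compute_splits(file_size, size):
    # While the remainder exceeds 1.1x the split size, cut off one full split
    splits, remaining = [], file_size
    while remaining / size > 1.1:
        splits.append(size)
        remaining -= size
    if remaining > 0:
        splits.append(remaining)  # the remainder becomes the last split
    return splits

MB = 1  # work in MB units for readability
size = split_size(1 * MB, 256 * MB, 128 * MB)       # -> 128 MB
assert compute_splits(135 * MB, size) == [135]      # 135 < 140.8: one split
assert compute_splits(145 * MB, size) == [128, 17]  # 145 > 140.8: two splits
```

The 1.1 slack exists to avoid producing a tiny trailing split: a 135 MB file becomes one MapTask rather than a 128 MB task plus a wasteful 7 MB task.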
- MapReduce Join
  - Join algorithms
    - Reduce Join: happens on the Reduce side; shuffle brings records from the two datasets that share a join key together during grouping
      - Must go through shuffle
      - Suitable for joining big data with big data
    - Map Join: load the small dataset into distributed memory, so that a complete copy of it joins with each slice of the big dataset
      - No shuffle required
      - Suitable for joining small data with big data
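A Map Join can be sketched as follows (a conceptual simulation with assumed sample data; in Hadoop the small table would be shipped to each MapTask via the distributed cache):

```python
# Small dataset (e.g., a product dimension table) held fully in memory
products = {"Pdt_01": "Keyboard", "Pdt_05": "Mouse"}

# Big dataset processed slice by slice; each record probes the in-memory table,
# so matching rows are joined without any shuffle
orders = [("Order_0000001", "Pdt_01", 222.8), ("Order_0000001", "Pdt_05", 25.8)]
joined = [(oid, pid, products.get(pid, "UNKNOWN"), price)
          for oid, pid, price in orders]
assert joined == [
    ("Order_0000001", "Pdt_01", "Keyboard", 222.8),
    ("Order_0000001", "Pdt_05", "Mouse", 25.8),
]
```

The trade-off is memory: this only works while the small side fits comfortably in each task's RAM; otherwise a Reduce Join is required.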
Resource management and task scheduling of YARN
- master-slave architecture
- How a MapReduce program runs on YARN
- Task Scheduling Mechanism in YARN
- FIFO: a single queue; multiple programs cannot run concurrently or in parallel
- Capacity: capacity scheduling; multiple queues, each internally FIFO; queues run in parallel, and resources can be dynamically preempted
- Fair: fair scheduling; multiple queues that share resources internally; queues run in parallel, each queue supports concurrency internally, dynamic preemption is allowed, and weight priorities can be configured
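The weight-priority idea behind fair scheduling can be illustrated with a toy share calculation (this is only the proportional-share intuition, not the actual YARN FairScheduler algorithm, and the queue names and weights are made up):

```python
def fair_shares(total_resources, weights):
    # Each queue receives resources in proportion to its configured weight
    weight_sum = sum(weights.values())
    return {q: total_resources * w / weight_sum for q, w in weights.items()}

# A 100-container cluster with a production queue weighted 3x over dev
shares = fair_shares(100, {"prod": 3, "dev": 1})
assert shares == {"prod": 75.0, "dev": 25.0}
```

Under preemption, a queue running below its share may reclaim containers from queues running above theirs, which is what lets multiple queues stay responsive in parallel.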