Understanding of Hadoop operating mechanism

  1. Shuffle process

    • Map-side Shuffle

      • Spill: write the data processed by each MapTask to disk
        • Every record a MapTask outputs is first tagged with its target partition
        • The tagged records are written into a circular in-memory buffer (100 MB by default)
        • When the buffer reaches the 80% threshold, that region is locked and prepared for spilling, while new records keep filling the remaining 20% (buffer size and threshold are tunable; see the sketch after this list)
        • The locked K2/V2 records are sorted so that records of the same partition sit together
          • In memory: quick sort
        • The sorted region is then written to disk as a small file
        • Each MapTask therefore ends up producing many sorted small files
      • Merge: merge all the small files of each MapTask into one large file
        • Merge sort: efficient because the input files are already sorted
        • Each MapTask ends up with one large, globally sorted file
      • When a MapTask finishes, it notifies the ApplicationMaster (AppMaster), which in turn notifies the ReduceTasks
    • Reduce-side Shuffle

      • Merge: each ReduceTask fetches its own partition from every MapTask's output
        • Merge sort: merge and sort all of the data belonging to that ReduceTask
      • The result: the data within each ReduceTask is globally sorted and grouped by key
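
      The buffer size, spill threshold, and merge width above are tunable per job. A minimal driver-side sketch (these are the standard mapreduce.task.* / mapreduce.map.* property names; the values are only illustrative):

        // Sketch: tuning the map-side shuffle buffer; pass conf to Job.getInstance(conf)
        import org.apache.hadoop.conf.Configuration;

        public class ShuffleBufferTuning {
            public static void main(String[] args) {
                Configuration conf = new Configuration();
                conf.setInt("mapreduce.task.io.sort.mb", 200);            // ring buffer size, default 100 MB
                conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f); // spill threshold, default 80%
                conf.setInt("mapreduce.task.io.sort.factor", 10);         // files merged at once, default 10
            }
        }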
    • Exercise: custom grouping on the order id with a sort comparator on the order price, e.g. to find the highest-priced transaction in each order. Input:

      Order_0000001	Pdt_01	222.8
      Order_0000001	Pdt_05	25.8
      Order_0000002	Pdt_03	522.8
      Order_0000002	Pdt_04	122.4
      Order_0000002	Pdt_05	722.4
      Order_0000003	Pdt_01	222.8
      Order_0000003	Pdt_01	1000.8
      Order_0000003	Pdt_01	999.8

      With the price sorted in descending order and records grouped by order id, the first record of each reduce group is the answer:

      Order_0000003	Pdt_01	1000.8
      Order_0000002	Pdt_05	722.4
      Order_0000001	Pdt_01	222.8

    • Grouping rules

      • Default: K2's sort comparator is also used as the grouping comparator

      • Custom: extend WritableComparator, implement the compare method, and register it in the driver:

        job.setGroupingComparatorClass(...)
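
      For the order exercise above, a minimal sketch of such a comparator, assuming a composite map output key OrderBean (a hypothetical WritableComparable carrying the order id and the price): the sort comparator orders by id and then by price descending, while this grouping comparator looks at the order id only, so all records of one order land in a single reduce() call:

        // Group reduce input purely by order id; the price part of the key is ignored here.
        import org.apache.hadoop.io.WritableComparable;
        import org.apache.hadoop.io.WritableComparator;

        public class OrderGroupingComparator extends WritableComparator {

            public OrderGroupingComparator() {
                // true: tell the parent to instantiate OrderBean keys for compare()
                super(OrderBean.class, true);
            }

            @Override
            public int compare(WritableComparable a, WritableComparable b) {
                OrderBean o1 = (OrderBean) a;
                OrderBean o2 = (OrderBean) b;
                // Same order id => same group, regardless of price
                return o1.getOrderId().compareTo(o2.getOrderId());
            }
        }

      Registered with job.setGroupingComparatorClass(OrderGroupingComparator.class); since prices are sorted descending, the first value of each group is that order's highest-priced transaction.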
        
  2. Shuffle optimization

    • Try to let the program avoid the shuffle phase entirely where possible
      • Map Join
    • Combiner: aggregation on the Map side
      • The number of MapTasks is relatively large and each one processes relatively little data, so letting every MapTask pre-aggregate its own output reduces the amount of data entering Reduce
      • The aggregation logic is the Reduce logic
      • Implementation: job.setCombinerClass(Reduce.class)
      • Timing: the Combiner runs once each time a sort finishes, i.e. at every spill (and it may run again when spill files are merged)
    • Compress: compression
      • Use compression to reduce disk and network IO and speed up data transfer (see the sketch after this list)
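
    A minimal driver-side sketch combining both optimizations, assuming the Snappy codec is available on the cluster (MyReducer is a placeholder for the job's Reducer class):

      // Combiner + map-output compression in the job driver.
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.io.compress.CompressionCodec;
      import org.apache.hadoop.io.compress.SnappyCodec;
      import org.apache.hadoop.mapreduce.Job;

      public class ShuffleOptimizedDriver {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              // Compress intermediate (map-side) output to cut disk and network IO
              conf.setBoolean("mapreduce.map.output.compress", true);
              conf.setClass("mapreduce.map.output.compress.codec",
                            SnappyCodec.class, CompressionCodec.class);

              Job job = Job.getInstance(conf, "shuffle-optimized");
              // Reuse the Reduce logic as a map-side pre-aggregation
              job.setCombinerClass(MyReducer.class);
          }
      }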
  3. MapReduce supplements

    • Split rules (how input files are cut into InputSplits)
      • Compare remaining file size / split size against 1.1
        • If greater than 1.1×, a chunk of one split size becomes a split
          • the rest is handled as the next split
        • If not larger, the entire file is one split
      • Split size: max(minimum split size, min(maximum split size, block size)), i.e. the block size by default
      • Example: is the file larger than 1.1 × a 128M block (= 140.8M)? (see the sketch after this list)
        • 135M ≤ 140.8M
          • the whole file is a single split
        • 145M > 140.8M
          • split1: 128M
          • split2: 17M
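
      A simplified sketch of this logic, modeled on FileInputFormat.getSplits() (SPLIT_SLOP is Hadoop's name for the 1.1 factor):

        // Simplified split computation, following FileInputFormat's logic.
        import java.util.ArrayList;
        import java.util.List;

        public class SplitCalc {
            static final double SPLIT_SLOP = 1.1;

            static long computeSplitSize(long minSize, long maxSize, long blockSize) {
                return Math.max(minSize, Math.min(maxSize, blockSize));
            }

            static List<Long> splitLengths(long fileSize, long splitSize) {
                List<Long> splits = new ArrayList<>();
                long remaining = fileSize;
                // Carve off full-size splits while the remainder exceeds 1.1x a split
                while ((double) remaining / splitSize > SPLIT_SLOP) {
                    splits.add(splitSize);
                    remaining -= splitSize;
                }
                if (remaining > 0) {
                    splits.add(remaining); // the tail (at most 1.1x splitSize) is one split
                }
                return splits;
            }

            public static void main(String[] args) {
                long mb = 1L << 20;
                System.out.println(splitLengths(135 * mb, 128 * mb)); // one 135M split (135/128 <= 1.1)
                System.out.println(splitLengths(145 * mb, 128 * mb)); // 128M + 17M
            }
        }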
    • MapReduce Join
      • Join rules
      • Join algorithms
        • Reduce Join: happens on the Reduce side; shuffle brings records from the two datasets that share the same join key together during grouping
          • Must go through Shuffle
          • Suitable for joining big data with big data
        • Map Join: put the small dataset into distributed memory so that every MapTask joins a complete copy of the small data with its own part of the big data (see the sketch after this list)
          • No Shuffle needed
          • Suitable for joining small data with big data
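
      A minimal Map Join sketch using the distributed cache; the file name small_table.txt and the tab-separated layouts are assumptions for illustration. The driver registers the small file with job.addCacheFile(new URI("hdfs://.../small_table.txt#small_table.txt")); the #small_table.txt fragment creates a local symlink the mapper can open by name:

        // Map Join: load the cached small table in setup(), join against it in map().
        import java.io.BufferedReader;
        import java.io.FileReader;
        import java.io.IOException;
        import java.util.HashMap;
        import java.util.Map;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

            private final Map<String, String> smallTable = new HashMap<>();

            @Override
            protected void setup(Context context) throws IOException, InterruptedException {
                // Symlinked into the task's working directory by the distributed cache
                try (BufferedReader reader = new BufferedReader(new FileReader("small_table.txt"))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        String[] fields = line.split("\t"); // assumed: joinKey \t smallPayload
                        smallTable.put(fields[0], fields[1]);
                    }
                }
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split("\t"); // assumed: joinKey \t bigPayload
                String matched = smallTable.get(fields[0]);
                if (matched != null) { // inner-join semantics: emit only matching keys
                    context.write(new Text(value + "\t" + matched), NullWritable.get());
                }
            }
        }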
  4. Resource management and task scheduling in YARN

    • Master-slave architecture: one ResourceManager (master) plus NodeManagers (slaves)
    • How a MapReduce program runs on YARN
    • Task scheduling mechanisms in YARN
      • FIFO: a single queue; multiple programs can run neither concurrently nor in parallel
      • Capacity: capacity scheduling; multiple queues, each internally FIFO; queues run in parallel, and resources can be preempted dynamically
      • Fair: fair scheduling; multiple queues that share resources internally; queues run in parallel and each queue can also run jobs concurrently; dynamic preemption and configurable weight priorities are supported
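
    With the Capacity or Fair scheduler, each job picks the queue it is submitted to. A minimal sketch, assuming a queue named dev has been configured on the cluster (the default queue is called default):

      // Submit a job to a specific scheduler queue (the queue name is an assumption).
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.mapreduce.Job;

      public class QueuedJobDriver {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              conf.set("mapreduce.job.queuename", "dev"); // queue set up by the cluster admin
              Job job = Job.getInstance(conf, "queued-job");
          }
      }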

Origin: blog.csdn.net/mitao666/article/details/110474056