The map side is responsible for filtering and distributing data; the reduce side merges and sorts it. Everything between map output and reduce input is the shuffle process.
Implemented functionality
partition
Determines which reducer will process the current key
Default: take the key's hash value modulo the number of reducers
grouping
Merges the values of the same key into one group
sort
Sorts each key-value pair by key, lexicographically
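The default partitioning rule can be illustrated in plain Java. This is a standalone sketch, not Hadoop's actual `HashPartitioner` class, though the formula mirrors its documented behavior:

```java
// Standalone sketch of the default hash partitioning rule.
// Hadoop's HashPartitioner computes:
//   (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
public class ShufflePartitionSketch {

    // Returns the reducer index a key would be routed to.
    public static int partitionFor(String key, int numReduceTasks) {
        // Mask off the sign bit so negative hash codes still yield a valid index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"hadoop", "hive", "spark", "hbase"};
        for (String k : keys) {
            System.out.println(k + " -> reduce" + partitionFor(k, 2));
        }
    }
}
```

Because the mapping depends only on the key's hash, every occurrence of the same key lands on the same reducer, which is what makes the later grouping step possible.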
Process
Map-side shuffle
spill stage: writing overflow to disk
The output of each map task first enters a circular memory buffer (100 MB by default)
partition
Assign each key to a partition (i.e. mark which reducer it goes to)
hadoop 1 reduce0
hive 1 reduce0
spark 1 reduce1
hadoop 1 reduce0
hbase 1 reduce1
sort
Sort by key within each partition
hadoop 1 reduce0
hadoop 1 reduce0
hive 1 reduce0
hbase 1 reduce1
spark 1 reduce1
spill
When the buffer reaches its 80% threshold, spilling to disk begins
The sorted, partitioned data currently in the buffer is written to a file on disk (e.g. file1)
Over the life of the task this produces multiple small spill files
The buffer size and spill threshold can be set in mapred-site.xml
Set the buffer size:
<property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>100</value>
</property>
Set the spill threshold:
<property>
    <name>mapreduce.task.io.sort.spill.percent</name>
    <value>0.8</value>
</property>
merge
Combine the multiple small files generated by spilling
Sorting: the data within the same partition is re-sorted, using the key's comparator for comparison; finally a single file is produced per map task
file1:
hadoop 1 reduce0
hadoop 1 reduce0
hive 1 reduce0
hbase 1 reduce1
spark 1 reduce1
file2:
hadoop 1 reduce0
hadoop 1 reduce0
hive 1 reduce0
hbase 1 reduce1
spark 1 reduce1
end_file:
hadoop 1 reduce0
hadoop 1 reduce0
hadoop 1 reduce0
hadoop 1 reduce0
hive 1 reduce0
hive 1 reduce0
hbase 1 reduce1
hbase 1 reduce1
spark 1 reduce1
spark 1 reduce1
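The merge step above is essentially a k-way merge of already-sorted runs. A minimal sketch using a priority queue, with spill files simulated as sorted in-memory lists (the real merger streams records from disk):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of merging several sorted spill "files" (lists here) into one sorted run.
public class SpillMergeSketch {

    public static List<String> merge(List<List<String>> spills) {
        // Min-heap of {spillIndex, recordIndex}, ordered by the record it points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> spills.get(a[0]).get(a[1]).compareTo(spills.get(b[0]).get(b[1])));
        for (int i = 0; i < spills.size(); i++) {
            if (!spills.get(i).isEmpty()) heap.add(new int[]{i, 0});
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            List<String> spill = spills.get(top[0]);
            out.add(spill.get(top[1]));
            // Advance to the next record of the spill we just consumed from.
            if (top[1] + 1 < spill.size()) heap.add(new int[]{top[0], top[1] + 1});
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> file1 = Arrays.asList("hadoop", "hadoop", "hive");
        List<String> file2 = Arrays.asList("hadoop", "hive", "spark");
        System.out.println(merge(Arrays.asList(file1, file2)));
        // prints [hadoop, hadoop, hadoop, hive, hive, spark]
    }
}
```

Because each spill file is already sorted, the merge only ever compares the current head of each run, which keeps the cost proportional to the total record count times log of the number of spills.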
When a map task finishes, it notifies the ApplicationMaster, which in turn notifies the reducers to pull the data
Reduce-side shuffle
map task1:
hadoop 1 reduce0
hadoop 1 reduce0
hadoop 1 reduce0
hadoop 1 reduce0
hive 1 reduce0
hive 1 reduce0
hbase 1 reduce1
hbase 1 reduce1
spark 1 reduce1
spark 1 reduce1
map task2:
hadoop 1 reduce0
hadoop 1 reduce0
hadoop 1 reduce0
hadoop 1 reduce0
hive 1 reduce0
hive 1 reduce0
hbase 1 reduce1
hbase 1 reduce1
spark 1 reduce1
spark 1 reduce1
Each reducer starts multiple threads that pull the data belonging to its own partition from every machine over HTTP
reduce0:
hadoop 1 reduce0
hadoop 1 reduce0
hadoop 1 reduce0
hadoop 1 reduce0
hadoop 1 reduce0
hadoop 1 reduce0
hadoop 1 reduce0
hadoop 1 reduce0
hive 1 reduce0
hive 1 reduce0
hive 1 reduce0
hive 1 reduce0
merge: combine this partition's data from the output of every map task
sort: sort all the data belonging to this partition as a whole
grouping: merge the values of the same key, using the key's Comparable implementation to perform the comparison
hadoop, list<1,1,1,1,1,1,1,1>
hive, list<1,1,1,1>
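Conceptually, grouping turns the sorted record stream into one (key, list-of-values) pair per key, as in the example above. A plain-Java sketch of that collapse (the real framework does it lazily while iterating, using a grouping comparator, rather than materializing lists):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of reduce-side grouping: collapse a sorted (key, value) stream
// into one entry per key holding all of that key's values.
public class GroupingSketch {

    public static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> sorted) {
        // LinkedHashMap preserves the sorted key order of the incoming stream.
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> kv : sorted) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Sorting has already placed equal keys next to each other.
        List<Map.Entry<String, Integer>> records = Arrays.asList(
            Map.entry("hadoop", 1), Map.entry("hadoop", 1), Map.entry("hive", 1));
        System.out.println(group(records));
        // prints {hadoop=[1, 1], hive=[1]}
    }
}
```

Grouping only works because the sort step already placed equal keys next to each other; the reducer then receives each key exactly once, together with its full value list.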
optimization
combine
Performs an early aggregation during the map phase; generally equivalent to running the reduce logic ahead of time on each map's output
job.setCombinerClass(WCReduce.class);
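The benefit of the combiner can be illustrated outside Hadoop: pre-aggregating the map output locally, word-count style, shrinks the number of records that have to be spilled and shuffled. A standalone sketch (WCReduce above is the actual class the job would reuse; this simulation only mimics its summing behavior):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the effect of a combiner: locally pre-aggregate map output
// so fewer (key, value) records are spilled and shuffled.
public class CombinerEffectSketch {

    public static Map<String, Integer> combine(List<String> mapOutputKeys) {
        // TreeMap keeps keys sorted, matching the sorted map output.
        Map<String, Integer> combined = new TreeMap<>();
        for (String key : mapOutputKeys) {
            combined.merge(key, 1, Integer::sum); // sum counts per key, like a local reduce
        }
        return combined;
    }

    public static void main(String[] args) {
        List<String> mapOutput = Arrays.asList("hadoop", "hadoop", "hive", "hadoop", "hive");
        Map<String, Integer> combined = combine(mapOutput);
        // 5 records in, 2 records out: that difference is the shuffle traffic saved.
        System.out.println(combined);
    }
}
```

A combiner is only safe when the reduce operation is commutative and associative (as summation is), since the framework may invoke it zero, one, or many times per key.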
compress
Compress intermediate result sets to reduce disk I/O and network I/O
Compression configuration
1. default: the built-in default configuration shipped with Hadoop
2. site: the site configuration files used for customization; changes require a restart to take effect
3. conf object: per-program custom configuration set in code
4. runtime parameters: user customization via -D options when submitting the job
bin/yarn jar xx.jar -Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.Lz4Codec main_class input_path output_path
Check which compression codecs the native library supports
bin/hadoop checknative
Configure compression via the conf configuration object
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

public static void main(String[] args) {
    Configuration configuration = new Configuration();
    // Configure compression of the map-side intermediate result set
    configuration.set("mapreduce.map.output.compress", "true");
    configuration.set("mapreduce.map.output.compress.codec",
            "org.apache.hadoop.io.compress.Lz4Codec");
    // Configure compression of the reduce-side result set
    configuration.set("mapreduce.output.fileoutputformat.compress", "true");
    configuration.set("mapreduce.output.fileoutputformat.compress.codec",
            "org.apache.hadoop.io.compress.Lz4Codec");
    try {
        int status = ToolRunner.run(configuration, new MRDriver(), args);
        System.exit(status);
    } catch (Exception e) {
        e.printStackTrace();
    }
}