During the shuffle stage, a large amount of map output data is transferred to the reduce phase, which can involve heavy network I/O. When the output data is large, Hadoop's compression mechanism can be used to compress it (the codec can be specified), reducing network bandwidth consumption and storage usage:
Map output can be compressed (reducing the amount of data transferred over the network during the shuffle from map output to reduce input).
Reduce output can be compressed (reducing the HDFS storage occupied by the final data).
1.1. Compression algorithms supported by Hadoop
Run hadoop checknative to see which compression libraries this Hadoop build supports. If openssl (or another entry) shows false, install the corresponding native dependency.
Compression formats supported by Hadoop:
| Compression format | Tool | Algorithm | File extension | Splittable |
|---|---|---|---|---|
| DEFLATE | none | DEFLATE | .deflate | no |
| Gzip | gzip | DEFLATE | .gz | no |
| bzip2 | bzip2 | bzip2 | .bz2 | yes |
| LZO | lzop | LZO | .lzo | no |
| LZ4 | none | LZ4 | .lz4 | no |
| Snappy | none | Snappy | .snappy | no |
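As the table shows, gzip uses the DEFLATE algorithm internally; a gzip stream is a DEFLATE payload wrapped in a gzip header and CRC trailer. This can be illustrated with the JDK's java.util.zip package alone (a minimal sketch, not Hadoop's actual codec code, though Hadoop's DefaultCodec builds on the same zlib DEFLATE):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.GZIPOutputStream;

public class DeflateVsGzip {
    public static void main(String[] args) throws Exception {
        byte[] input = "hadoop hadoop hadoop hadoop hadoop "
                .repeat(100).getBytes(StandardCharsets.UTF_8);

        // Raw DEFLATE, as would be written to a .deflate file
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[input.length];
        int deflated = deflater.deflate(buf);
        deflater.end();

        // Gzip = the same DEFLATE payload plus a gzip header and trailer
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(input);
        }
        byte[] gzipped = bos.toByteArray();

        System.out.println("original=" + input.length
                + " deflate=" + deflated
                + " gzip=" + gzipped.length);
        // Every gzip stream starts with the magic bytes 0x1f 0x8b
        System.out.println("gzip magic ok: "
                + ((gzipped[0] & 0xff) == 0x1f && (gzipped[1] & 0xff) == 0x8b));
    }
}
```

Both compressed sizes are far smaller than the repetitive input, and the gzip output is only slightly larger than the raw DEFLATE output, accounting for its header and trailer.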
Java codec class for each compression format:
| Compression format | Java codec class |
|---|---|
| DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
| gzip | org.apache.hadoop.io.compress.GzipCodec |
| bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
| LZO | com.hadoop.compression.lzo.LzopCodec |
| LZ4 | org.apache.hadoop.io.compress.Lz4Codec |
| Snappy | org.apache.hadoop.io.compress.SnappyCodec |
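Hadoop resolves these codec classes by name at runtime from the compress.codec settings, instantiating them reflectively. The pattern can be sketched with the standard library alone; java.util.zip.Deflater below is only a stand-in class name, since the Hadoop codec classes are not on a plain JDK classpath:

```java
// Sketch of the look-up-by-class-name pattern used for the
// mapreduce.*.compress.codec settings. In a real job the name would be
// e.g. "org.apache.hadoop.io.compress.SnappyCodec".
public class CodecLookup {
    public static Object instantiate(String className) throws Exception {
        Class<?> clazz = Class.forName(className);
        return clazz.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        Object codec = instantiate("java.util.zip.Deflater");
        System.out.println(codec.getClass().getName()); // java.util.zip.Deflater
    }
}
```

This is why a misspelled codec class name (e.g. DeFaultCodec) only fails when the job runs: the string is not checked until the class is looked up.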
Option 1: set compression in code
Set compression for the map stage:
Configuration configuration = new Configuration();
configuration.set("mapreduce.map.output.compress", "true");
configuration.set("mapreduce.map.output.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec");
Set compression for the reduce (job output) stage; note that the compress.type setting (RECORD or BLOCK) only applies to SequenceFile outputs:
configuration.set("mapreduce.output.fileoutputformat.compress", "true");
configuration.set("mapreduce.output.fileoutputformat.compress.type", "RECORD");
configuration.set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec");
Option 2: configure compression globally
Alternatively, modify the mapred-site.xml configuration file and restart the cluster, so that all MapReduce jobs use compression.
Compress map output:
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
Compress reduce output:
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.type</name>
  <value>RECORD</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
mapred-site.xml must be modified on every node, and remember to restart the cluster after the change.