Shuffle-stage data compression

In the shuffle stage, a large amount of map output data is sent to the reduce phase, and this process can involve heavy network I/O.

When the output data volume is large, you can use the compression mechanism Hadoop provides and specify the compression codec, reducing both network bandwidth consumption and storage usage:

Compress the map output (reduces the amount of data transferred over the network during the shuffle, from map output to reduce input).

Compress the reduce output (the final data written to HDFS; this mainly reduces HDFS storage usage).

1.1. Compression algorithms supported by Hadoop

Run hadoop checknative to see which native compression libraries Hadoop supports; if openssl shows false, install the missing dependency first.
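For example, the following invocation checks all of the native libraries at once:

hadoop checknative -a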

 

Compression formats supported by Hadoop:

Compression format   Tool    Algorithm   File extension   Splittable
DEFLATE              none    DEFLATE     .deflate         no
gzip                 gzip    DEFLATE     .gz              no
bzip2                bzip2   bzip2       .bz2             yes
LZO                  lzop    LZO         .lzo             no
LZ4                  none    LZ4         .lz4             no
Snappy               none    Snappy      .snappy          no

 

The Java codec class corresponding to each compression format:

Compression format   Codec class
DEFLATE              org.apache.hadoop.io.compress.DefaultCodec
gzip                 org.apache.hadoop.io.compress.GzipCodec
bzip2                org.apache.hadoop.io.compress.BZip2Codec
LZO                  com.hadoop.compression.lzo.LzopCodec
LZ4                  org.apache.hadoop.io.compress.Lz4Codec
Snappy               org.apache.hadoop.io.compress.SnappyCodec
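To illustrate how these codec classes are used outside of a MapReduce job, here is a minimal sketch that compresses a local file through the standard CompressionCodec API; the file names are hypothetical, and SnappyCodec requires the native snappy library reported by hadoop checknative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.util.ReflectionUtils;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

public class SnappyCompressFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Create the codec via ReflectionUtils so the Configuration is injected
        CompressionCodec codec = ReflectionUtils.newInstance(SnappyCodec.class, conf);
        // input.txt and input.txt.snappy are hypothetical file names
        try (InputStream in = new FileInputStream("input.txt");
             OutputStream out = codec.createOutputStream(new FileOutputStream("input.txt.snappy"))) {
            IOUtils.copyBytes(in, out, 4096); // stream and compress in 4 KB chunks
        }
    }
}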

 

Option 1: set compression in code

Set compression for the map stage:

Configuration configuration = new Configuration();

configuration.set("mapreduce.map.output.compress","true");

configuration.set("mapreduce.map.output.compress.codec","org.apache.hadoop.io.compress.SnappyCodec");

 

Set compression for the reduce stage (the compress.type property only applies to SequenceFile outputs, where the valid values are NONE, RECORD, and BLOCK):

configuration.set("mapreduce.output.fileoutputformat.compress","true");

configuration.set("mapreduce.output.fileoutputformat.compress.type","RECORD");

configuration.set("mapreduce.output.fileoutputformat.compress.codec","org.apache.hadoop.io.compress.SnappyCodec");

 

Option 2: configure global MapReduce compression

We can modify the mapred-site.xml configuration file and then restart the cluster, so that all MapReduce jobs use compression.

Compress the map output data:

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

Compress the reduce output data:

<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.type</name>
  <value>RECORD</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

mapred-site.xml must be modified on every node, and remember to restart the cluster after the changes are made.
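To verify that output compression took effect, check the extension of the job output files (FileOutputFormat appends the codec's default extension, e.g. part-r-00000.snappy) and view them with hadoop fs -text, which decompresses files based on their extension; the path below is a hypothetical example:

hadoop fs -text /output/part-r-00000.snappy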

 

 

 


