MapReduce data compression schemes

Purpose of compression:

    Reduce disk I/O, the disk space that data occupies, and the amount of data transmitted over the network
    Criteria for judging a compression algorithm:
            Compression time: the shorter the better
            Compression ratio: the higher the better
            Hardware requirements: for example, whether the CPU supports the algorithm
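
The trade-off between compression time and compression ratio can be observed with a small experiment. The sketch below is only an illustration: it uses the JDK's java.util.zip.Deflater (DEFLATE, the algorithm behind zlib) rather than Hadoop's own codec classes, and the class name CompressionTradeoff, its helper methods, and the generated log-like sample data are all hypothetical. Actual numbers depend on the data and the machine.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

final class CompressionTradeoff {

    /** Compresses input at the given Deflater level and returns the compressed size in bytes. */
    static int compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        // Drain the compressor; only the number of bytes produced is kept.
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /** Builds repetitive sample text (hypothetical log lines). */
    static byte[] sampleData(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("2020-01-01 INFO request id=").append(i % 100).append(" status=200\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleData(200_000);
        // Compare the fastest level with the level that compresses hardest.
        for (int level : new int[] {Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION}) {
            long start = System.nanoTime();
            int size = compressedSize(data, level);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("level=%d ratio=%.1f time=%dms%n",
                    level, (double) data.length / size, millis);
        }
    }
}
```

Running main prints one line per level with the measured ratio and elapsed time, which makes it easy to see how much extra time a higher ratio costs on a given data set.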
    Where compression can be used in MapReduce (MR):
            Map output: compressing it reduces the amount of data transferred during the shuffle
            Reduce output: compressing it reduces the disk space occupied by the final result
    Check which compression algorithms Hadoop supports:
            
[root@node-1 ~]# hadoop checknative
Native library checking:
hadoop: true /export/servers/hadoop-2.6.0cdh5.14.0/lib/native/libhadoop.so.1.0.0
zlib: true /lib64/libz.so.1
snappy: true /usr/lib64/libsnappy.so.1
lz4: true revision:10301
bzip2: true /lib64/libbz2.so.1
openssl: true /usr/lib64/libcrypto.so
If an algorithm is missing (reported as false), install its library online with yum, or recompile Hadoop.
 

Recommended compression algorithm:

Snappy

How to use compression in MR:

    Set it in the MapReduce program: affects only the current MR job
    Configure it in mapred-site.xml: affects all MR jobs
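
For the cluster-wide option, a configuration fragment might look like the following. This is a sketch, not taken from the original post: it assumes the Hadoop 2.x property names and that the Snappy native library is available, as in the checknative output above. The first two properties compress the map output, the last two compress the final job output.

```xml
<!-- mapred-site.xml (inside <configuration>): assumed Hadoop 2.x property names -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```

For a single job, the same names can instead be set on the job's Configuration object in the driver program before the job is submitted, which leaves other jobs unaffected.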

Common MapReduce algorithms:

  • Word count
  • Data de-duplication
  • Sorting
  • Top K
  • Selection
  • Projection
  • Grouping
  • Multi-table joins
  • Single-table association (self-join)
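
To make the first item concrete, and to show what kind of data map-output compression would shrink, here is an in-memory sketch of word count. It is plain Java with no Hadoop dependency, so it only imitates the map, shuffle, and reduce phases; the class WordCountSketch and its method names are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WordCountSketch {

    /** Map phase: emit a (word, 1) pair for every word in one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Shuffle phase: group the emitted values by key, ordered by key. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: sum the values collected for one key. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs map, shuffle, and reduce over all lines in memory. */
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(pairs).forEach((word, ones) -> counts.put(word, reduce(ones)));
        return counts;
    }

    public static void main(String[] args) {
        // prints {a=2, b=2, c=1}
        System.out.println(wordCount(Arrays.asList("a b a", "b c")));
    }
}
```

The list of (word, 1) pairs produced by map is highly repetitive, which is why compressing the map output tends to pay off during the shuffle.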

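Summary:

  • In Hadoop, a codec is represented by an implementation of the CompressionCodec interface. Implementations in org.apache.hadoop.io.compress include DefaultCodec, GzipCodec, BZip2Codec, Lz4Codec, and SnappyCodec.
  • Properties that compress the map output: mapreduce.map.output.compress and mapreduce.map.output.compress.codec

  • Properties that compress the final job output: mapreduce.output.fileoutputformat.compress and mapreduce.output.fileoutputformat.compress.codec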
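
What a codec does can be illustrated without Hadoop on the classpath. The sketch below uses the JDK's GZIPOutputStream and GZIPInputStream as a stand-in: a compressing stream wraps a raw output stream, and a decompressing stream wraps a raw input stream. Hadoop's CompressionCodec follows the same stream-wrapping idea through its createOutputStream and createInputStream methods. The class GzipRoundTrip and its methods are hypothetical names for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipRoundTrip {

    /** Wraps an in-memory sink in a compressing stream and writes the raw bytes through it. */
    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(sink)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Closing the GZIP stream has flushed the trailer into the sink.
        return sink.toByteArray();
    }

    /** Wraps the compressed bytes in a decompressing stream and reads them back. */
    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                sink.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = "hello hello hello hello".getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(raw);
        System.out.println("round trip ok: " + Arrays.equals(raw, decompress(packed)));
    }
}
```

Because the caller only sees ordinary input and output streams, switching algorithms comes down to choosing a different codec class, which is what the codec properties select.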

 




Origin www.cnblogs.com/TiePiHeTao/p/225e91b6bf460e7bc20d86c4502a8a88.html