Purpose of compression:
Reduce the disk space that data occupies, and reduce the amount of data moved through disk and network I/O
What to look for in a compression algorithm:
Compression and decompression time: the shorter the better
Compression ratio: the higher the better
Hardware requirements, for example whether the CPU supports the algorithm
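
The trade-off between time and compression ratio can be seen with a small experiment. The sketch below is only a stand-in: it uses the JDK's built-in DEFLATE (java.util.zip), not a Hadoop codec, and the class name CompressionTradeoff, its helper methods and the sample data are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Stand-in demo: JDK DEFLATE, not a Hadoop codec. All names here are hypothetical.
class CompressionTradeoff {

    // Build repetitive, log-like text so the compressor has redundancy to remove.
    static byte[] sampleData(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("node-").append(i % 8).append("\tGET /index.html\t200\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compress with a Deflater level (1 = fastest, 9 = smallest output).
    static byte[] compress(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Inverse of compress(); wraps the checked exception so callers stay simple.
    static byte[] decompress(byte[] packed) {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported input");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Compression ratio as original size / compressed size (higher is better).
    static double ratio(byte[] original, byte[] packed) {
        return (double) original.length / packed.length;
    }

    public static void main(String[] args) {
        byte[] input = sampleData(50_000);
        for (int level : new int[] {Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION}) {
            long start = System.nanoTime();
            byte[] packed = compress(input, level);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("level %d: %d -> %d bytes, ratio %.1f, %d us%n",
                    level, input.length, packed.length, ratio(input, packed), micros);
        }
    }
}
```

Running main prints the size, ratio and elapsed time for the fastest and the strongest level. The timings depend on the machine and JVM warm-up, so treat them as a rough indication only.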
Where compression can be used in MapReduce:
Map output: compressing it reduces the amount of data transferred during the shuffle
Reduce output: compressing it reduces the disk space occupied by the final results
Check which compression algorithms Hadoop supports:
[root@node-1 ~]# hadoop checknative
Native library checking:
hadoop:  true /export/servers/hadoop-2.6.0cdh5.14.0/lib/native/libhadoop.so.1.0.0
zlib:    true /lib64/libz.so.1
snappy:  true /usr/lib64/libsnappy.so.1
lz4:     true revision:10301
bzip2:   true /lib64/libbz2.so.1
openssl: true /usr/lib64/libcrypto.so
If an algorithm is missing, install its library online with yum, or recompile Hadoop with native support for it
Recommended compression algorithm:
Snappy
How to use compression in MapReduce:
Set it in the MapReduce program: affects only the current job
Configure it in mapred-site.xml: affects all MapReduce jobs
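
Both options come down to the same handful of properties. The following is a minimal sketch, assuming the Hadoop 2.x property names and the Snappy codec class org.apache.hadoop.io.compress.SnappyCodec. To keep it runnable without Hadoop on the classpath, it collects the settings in a plain map instead of a Hadoop Configuration; the class JobCompressionSettings and its methods are hypothetical names for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: a plain map stands in for org.apache.hadoop.conf.Configuration.
class JobCompressionSettings {

    static final String SNAPPY = "org.apache.hadoop.io.compress.SnappyCodec";

    // The four settings, using the Hadoop 2.x (MRv2) property names.
    static Map<String, String> snappySettings() {
        Map<String, String> props = new LinkedHashMap<>();
        // Compress intermediate map output to shrink shuffle traffic.
        props.put("mapreduce.map.output.compress", "true");
        props.put("mapreduce.map.output.compress.codec", SNAPPY);
        // Compress the final reduce output to save disk space.
        props.put("mapreduce.output.fileoutputformat.compress", "true");
        props.put("mapreduce.output.fileoutputformat.compress.codec", SNAPPY);
        return props;
    }

    // Render the settings as <property> elements for mapred-site.xml (all jobs).
    static String toSiteXml(Map<String, String> props) {
        StringBuilder sb = new StringBuilder("<configuration>\n");
        for (Map.Entry<String, String> e : props.entrySet()) {
            sb.append("  <property>\n")
              .append("    <name>").append(e.getKey()).append("</name>\n")
              .append("    <value>").append(e.getValue()).append("</value>\n")
              .append("  </property>\n");
        }
        return sb.append("</configuration>\n").toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = snappySettings();
        // Per job: in a real driver each pair would be passed to conf.set(name, value)
        // on a Hadoop Configuration before the Job is created.
        props.forEach((name, value) ->
                System.out.println("conf.set(\"" + name + "\", \"" + value + "\");"));
        // Cluster-wide: the same pairs as a mapred-site.xml fragment.
        System.out.print(toSiteXml(props));
    }
}
```

Settings made in the driver apply to that job only, while the XML fragment would apply to every job submitted with that configuration.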
- Data deduplication
- Sorting
- Top K
- Selection
- Projection
- Grouping
- Multi-table joins
- Single-table joins (self-joins)
Summary:
- In Hadoop, a codec is represented by an implementation of the CompressionCodec interface. Here are a few implementations: