- Use an appropriate file storage format
When creating a table, prefer a columnar storage format such as ORC or Parquet. Because columnar formats physically store the data of each column together, Hive only needs to read the columns that a query actually references, which greatly reduces the amount of data processed.
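As a sketch, a table can be declared with ORC storage as follows (the table and column names here are hypothetical, for illustration only):

```sql
-- Hypothetical example: store the table in the columnar ORC format.
CREATE TABLE user_events (
  user_id BIGINT,
  event_type STRING,
  event_time TIMESTAMP
)
STORED AS ORC;

-- A query that references only one column reads only that column's data:
-- SELECT event_type, count(*) FROM user_events GROUP BY event_type;
```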
- Use appropriate file compression
Hive queries are ultimately converted into MapReduce programs for execution, and the performance bottleneck of MapReduce is usually disk I/O and network I/O. The most effective way to relieve this bottleneck is to reduce the amount of data, and compression is a good way to do so. Although compression reduces the data volume, compressing and decompressing consumes CPU; in Hadoop, however, the CPU is usually not the bottleneck, so the otherwise idle CPU capacity can be put to good use on compression.
Common compression formats:

| Compression format | Splittable | Native library | Compression ratio | Speed | Bundled with Hadoop |
|---|---|---|---|---|---|
| gzip | No | Yes | High | Fairly fast | Yes |
| LZO | Yes | Yes | Fairly high | Very fast | No (requires installation) |
| snappy | No | Yes | Fairly high | Very fast | No (requires installation) |
| bzip2 | Yes | No | Highest | Slow | Yes |
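Note that for columnar formats such as ORC, the compression codec is usually chosen per table via a table property rather than via MapReduce job settings. A hedged example (the table name is hypothetical; `orc.compress` also accepts `ZLIB` and `NONE`):

```sql
-- Hypothetical example: an ORC table compressed with snappy.
CREATE TABLE user_events_snappy (
  user_id BIGINT,
  event_type STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");
```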
The corresponding codec classes:

| Compression format | Codec class |
|---|---|
| gzip | org.apache.hadoop.io.compress.GzipCodec |
| LZO | com.hadoop.compression.lzo.LzoCodec |
| snappy | org.apache.hadoop.io.compress.SnappyCodec |
| bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
| zlib | org.apache.hadoop.io.compress.DefaultCodec |
| lz4 | org.apache.hadoop.io.compress.Lz4Codec |
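For Hadoop to recognize a codec, it must be listed in the `io.compression.codecs` property, typically in `core-site.xml`. A sketch, assuming only the codecs bundled with Hadoop are installed (the exact list depends on your cluster):

```xml
<!-- Hypothetical core-site.xml fragment registering compression codecs. -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.Lz4Codec</value>
</property>
```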
When choosing a compression format, consider:

- Compression ratio
- Compression and decompression speed
- Whether the format supports splitting
Using compression:

Compress job output files with gzip, at the block level:

set mapreduce.output.fileoutputformat.compress=true // default: false
set mapreduce.output.fileoutputformat.compress.type=BLOCK // default: RECORD
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec // default: org.apache.hadoop.io.compress.DefaultCodec
Compress the map output with gzip as well:

set mapreduce.map.output.compress=true // default: false
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec // default: org.apache.hadoop.io.compress.DefaultCodec
Compress both Hive's final output and its intermediate results:

set hive.exec.compress.output=true // default: false, i.e. final output is not compressed
set hive.exec.compress.intermediate=true // default: false; when true, intermediate MR output is compressed
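Putting the settings together, an end-to-end session might look like the following sketch (the table names are hypothetical):

```sql
-- Hypothetical session: compress intermediate and final output with gzip,
-- then materialize a query result.
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;

INSERT OVERWRITE TABLE events_archive
SELECT * FROM user_events;
```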