Hive performance tuning (1) ---- choosing a file storage format and compression scheme

  • Use an appropriate file storage format

    When creating a table, prefer a columnar storage format such as ORC or Parquet. In a columnar table, the data for each column is stored together physically, so a query only has to read the columns it actually references, which greatly reduces the amount of data Hive processes.
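As a minimal sketch, the columnar format is chosen when the table is created (the table and column names below are illustrative, not from the original):

```sql
-- Hypothetical table stored as ORC; each column's values are laid out
-- together on disk, so queries read only the columns they reference.
CREATE TABLE page_views (
  user_id   BIGINT,
  url       STRING,
  view_time TIMESTAMP
)
STORED AS ORC;

-- This query only needs to scan the user_id column:
SELECT user_id, COUNT(*) AS views
FROM page_views
GROUP BY user_id;
```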

  • Use appropriate file compression

    Hive queries are ultimately executed as MapReduce jobs, and the main MapReduce performance bottlenecks are disk I/O and network I/O. The key to relieving them is reducing the amount of data moved, and compression is a good way to do that. Compression does cost CPU time, but in a Hadoop cluster the CPU is usually not the bottleneck, so compression puts otherwise idle CPU capacity to good use.

    Common file compression format:

Compression format | Splittable | Native library | Compression ratio | Speed | Bundled with Hadoop
gzip               | No         | Yes            | High              | Fast  | Yes
LZO                | Yes        | Yes            | Fairly high       | Fast  | No (must be installed)
snappy             | No         | Yes            | Fairly high       | Fast  | No (must be installed)
bzip2              | Yes        | No             | Highest           | Slow  | Yes

    Corresponding to the respective compression classes:

Compression format | Codec class
gzip               | org.apache.hadoop.io.compress.GzipCodec
LZO                | com.hadoop.compression.lzo.LzoCodec
snappy             | org.apache.hadoop.io.compress.SnappyCodec
bzip2              | org.apache.hadoop.io.compress.BZip2Codec
zlib               | org.apache.hadoop.io.compress.DefaultCodec
lz4                | org.apache.hadoop.io.compress.Lz4Codec

    Criteria for choosing a compression format:

      Compression ratio

      Compression and decompression speed

      Whether to support Split

    Using compression:

      Compress job output files with gzip at BLOCK granularity:

    set mapreduce.output.fileoutputformat.compress=true;  -- default is false

    set mapreduce.output.fileoutputformat.compress.type=BLOCK;  -- default is RECORD

    set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;  -- default is org.apache.hadoop.io.compress.DefaultCodec
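Applied together in one session, the three settings above might look like this (the target directory and table name are hypothetical; for its final output Hive also consults hive.exec.compress.output):

```sql
-- Sketch only: the table and output path are hypothetical.
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;

-- With compression on, the files written here carry a .gz suffix.
INSERT OVERWRITE DIRECTORY '/tmp/page_views_export'
SELECT * FROM page_views;
```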

       Also compress map output with gzip:

    set mapreduce.map.output.compress=true;

    set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec;  -- default is org.apache.hadoop.io.compress.DefaultCodec

       Compress Hive's final output and intermediate data:

    set hive.exec.compress.output=true;  -- default is false (no compression)

    set hive.exec.compress.intermediate=true;  -- default is false; set to true to compress intermediate MR data
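For example, one common combination is a fast codec for the intermediate stage and a high-ratio codec for the smaller final output (this snappy/gzip pairing is a suggestion, not from the original):

```sql
-- Intermediate map output: favor speed (snappy).
SET hive.exec.compress.intermediate=true;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Final job output: favor compression ratio (gzip).
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
```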

 

Origin www.cnblogs.com/zbw1112/p/11898368.html