Hive-supported file formats and compression formats and their characteristics

Hive file formats

1-TEXTFILE

  • Plain text, Hive's default format; the data is not compressed, so disk overhead and data-parsing overhead are both large.
  • Corresponding Hive API: org.apache.hadoop.mapred.TextInputFormat and org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat;
  • Can be combined with Gzip or Bzip2 compression (Hive detects the codec automatically and decompresses the files while a query executes), but text compressed this way cannot be split by Hive, so the data cannot be processed in parallel (a minimal example follows this list).
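
As a minimal sketch (the table name logs_text and its columns are hypothetical), a TEXTFILE table can be created and loaded straight from a local file:

    -- TEXTFILE is the default, so STORED AS TEXTFILE is optional but explicit
    CREATE TABLE logs_text (
      id  INT,
      msg STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- Text data loads directly; no conversion step is needed
    LOAD DATA LOCAL INPATH '/tmp/logs.tsv' INTO TABLE logs_text;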

2-SequenceFile

  • A binary file format provided by Hadoop; it is a standard Hadoop-supported format;
  • Data is serialized directly into the file; a SequenceFile cannot be read as plain text, but its contents can be viewed with hadoop fs -text;
  • SequenceFile is easy to use, splittable, and compressible; it supports the compression types NONE, RECORD, and BLOCK (BLOCK is preferred; see the sketch after this list);
  • Corresponding Hive API: org.apache.hadoop.mapred.SequenceFileInputFormat and org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
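
A sketch of a SequenceFile table with BLOCK compression, populated from the hypothetical logs_text table above:

    -- Request compressed output and the preferred BLOCK compression type
    SET hive.exec.compress.output=true;
    SET io.seqfile.compression.type=BLOCK;

    CREATE TABLE logs_seq (
      id  INT,
      msg STRING
    )
    STORED AS SEQUENCEFILE;

    -- A SequenceFile table is filled with INSERT ... SELECT, not LOAD DATA
    INSERT OVERWRITE TABLE logs_seq SELECT id, msg FROM logs_text;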

3-RCFILE

  • RCFile uses a hybrid row/column storage layout: data is first partitioned row by row into row groups, which guarantees that a whole record stays in one block and avoids reading multiple blocks for a single record; within each row group the data is stored column by column, which benefits compression and fast column access (see the sketch after this list);
  • Corresponding Hive API: org.apache.hadoop.hive.ql.io.RCFileInputFormat and org.apache.hadoop.hive.ql.io.RCFileOutputFormat
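
Only the STORED AS clause changes for an RCFile table (the table name is again hypothetical); it is populated from a text table with the same INSERT ... SELECT pattern:

    CREATE TABLE logs_rc (
      id  INT,
      msg STRING
    )
    STORED AS RCFILE;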

4-ORCFile

  • ORCFile is an optimized version of RCFile that improves Hive's read, write, and data-processing performance and provides higher compression efficiency (see the sketch after this list);
  • Advantages:
    • Each task outputs only a single file, which reduces the load on the NameNode;
    • Supports complex data types such as datetime and decimal, and the compound types struct, list, and map;
    • Lightweight index data is stored inside the file;
    • Block-mode compression is chosen by data type: run-length encoding for integer columns, dictionary encoding for string columns;
    • Multiple independent RecordReaders can read the same file in parallel;
    • Files can be split without scanning for markers;
    • The memory needed for reading and writing is bounded;
    • Metadata is stored with Protocol Buffers, which supports adding and removing columns.
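
A minimal ORC sketch; the orc.compress table property selects the internal codec (ZLIB is the default, SNAPPY and NONE are the alternatives):

    CREATE TABLE logs_orc (
      id  INT,
      msg STRING
    )
    STORED AS ORC
    TBLPROPERTIES ('orc.compress' = 'ZLIB');  -- or 'SNAPPY' / 'NONE'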

5-Parquet

  • Parquet is also a columnar storage format and likewise has good compression properties; in addition it can reduce the amount of data scanned and the deserialization time. A sketch follows.
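
A Parquet table follows the same pattern (STORED AS PARQUET is available in Hive 0.13 and later; the parquet.compression table property is one way to choose the codec):

    CREATE TABLE logs_parquet (
      id  INT,
      msg STRING
    )
    STORED AS PARQUET
    TBLPROPERTIES ('parquet.compression' = 'SNAPPY');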

To sum up

  • TEXTFILE consumes relatively large storage space, and compressed text cannot be split and merged; query efficiency is the lowest, but data can be stored directly and loads the fastest;
  • SequenceFile consumes the most storage space; its compressed files can be split and merged, and query efficiency is high; data must be loaded by converting from a text file;
  • ORCFile and RCFile consume the least storage space and have the highest query efficiency; data must be loaded by converting from a text file, and the loading speed is the lowest;
  • Parquet is a columnar storage format that combines good compression performance with efficient table scans;

Tables stored as SequenceFile, ORCFile (ORC), or RCFile cannot be loaded directly from local data files. The data must first be loaded into a table in TEXTFILE format, and then inserted from that TEXTFILE table into the SequenceFile, ORCFile (ORC), or RCFile table, as shown below.
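
The two-step load, sketched with the hypothetical tables from above:

    -- Step 1: load the raw file into the TEXTFILE staging table
    LOAD DATA LOCAL INPATH '/tmp/logs.tsv' INTO TABLE logs_text;

    -- Step 2: convert by selecting into the binary-format table
    INSERT OVERWRITE TABLE logs_orc SELECT id, msg FROM logs_text;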

Hive supported compression formats

  • Hive supports the compression formats Gzip, Bzip2, LZO, and Snappy (example below).
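
A sketch of enabling one of these codecs for query output (the codec classes shown ship with Hadoop; LZO requires a separately installed library):

    SET hive.exec.compress.output=true;
    SET mapreduce.output.fileoutputformat.compress=true;
    -- Alternative built-in codecs: org.apache.hadoop.io.compress.GzipCodec, BZip2Codec
    SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;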


     
[Figure: characteristics of each compression format]
