Several storage formats for Hive tables

Hive file storage formats:

  • textFile

    textFile is the default format.

    Storage: row-based

    Disadvantages: high disk overhead and high data parsing overhead; Hive cannot merge or split compressed text files.

  • sequencefile

    A binary file format that serializes data as a sequence of <key, value> pairs.

    Storage: row-based

    Advantages: splittable and compressible (block compression is generally chosen); compatible with MapFile in the Hadoop API.
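
    As a rough illustration of what "binary <key, value> pairs stored one after another" means, here is a minimal Python sketch. It is not the real SequenceFile on-disk layout (which also has a file header, sync markers and optional compression); the length-prefixed record layout and the function names are assumptions made for this example only.

```python
import io
import struct

def write_records(stream, records):
    """Append each (key, value) pair as: key length, value length, key bytes, value bytes."""
    for key, value in records:
        # Two big-endian unsigned 32-bit lengths form the record header (hypothetical layout).
        stream.write(struct.pack(">II", len(key), len(value)))
        stream.write(key)
        stream.write(value)

def read_records(stream):
    """Yield (key, value) pairs until the stream is exhausted."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return
        key_len, value_len = struct.unpack(">II", header)
        yield stream.read(key_len), stream.read(value_len)

buf = io.BytesIO()
write_records(buf, [(b"1", b"alice"), (b"2", b"bob")])
buf.seek(0)
pairs = list(read_records(buf))
```

    Each record is written whole, one after another, which is why the format counts as row-based.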

  • Rcfile

    Storage: data is divided into blocks by row, and each block is stored by column.

    Fast compression and fast column access.

    Reading a record touches as few blocks as possible.

    Reading the required columns only needs the header definition of each row group.

    For full-table reads, performance may not be clearly better than sequencefile.
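
    The "split by row, store by column" idea can be sketched in a few lines of Python. This is only an in-memory illustration under simplifying assumptions (no headers, no compression); `to_row_groups` and `read_column` are hypothetical names, not part of any Hive or Hadoop API.

```python
def to_row_groups(rows, group_size):
    """Split rows into groups, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # zip(*chunk) transposes the chunk: one tuple per column.
        groups.append([list(col) for col in zip(*chunk)])
    return groups

def read_column(groups, index):
    """Read a single column by touching only that column in each row group."""
    values = []
    for group in groups:
        values.extend(group[index])
    return values

rows = [(1, "a", 10.0), (2, "b", 20.0), (3, "c", 30.0)]
groups = to_row_groups(rows, group_size=2)
names = read_column(groups, 1)
```

    A query that needs one column reads only that column's values in each group, while a full scan still has to read every column of every group, which fits the note that full reads gain little over sequencefile.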

  • ORC

    Storage: data is divided into blocks by row, and each block is stored by column.

    Fast compression and fast column access.

    More efficient than rcfile; it is an improved version of rcfile.

    From the official documentation:

    The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.

    ORC is essentially the RC file format with several optimizations. Its main advantages are:
      (1) each task outputs only a single file, which reduces the load on the NameNode;
      (2) it supports a wide range of data types, for example datetime, decimal, and complex types (struct, list, map, and union);
      (3) lightweight index data is stored inside the file;
      (4) block-mode compression is chosen by data type: a. integer columns use run-length encoding; b. string columns use dictionary encoding;
      (5) the same file can be read in parallel by several independent RecordReaders;
      (6) files can be split without scanning for markers;
      (7) the memory needed for reading and writing is bounded;
      (8) metadata is stored with Protocol Buffers, so adding and removing columns is supported.
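
    The two encodings named in item (4) can be sketched as follows. This is a simplified Python illustration of the general techniques only; ORC's actual encoders are more elaborate, and the function names here are hypothetical.

```python
def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def dictionary_encode(values):
    """Replace each string with an index into a list of distinct strings."""
    dictionary = []   # distinct values, in order of first appearance
    positions = {}    # value -> index in dictionary
    codes = []
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        codes.append(positions[v])
    return dictionary, codes

runs = run_length_encode([7, 7, 7, 3, 3, 9])
dictionary, codes = dictionary_encode(["cn", "us", "cn", "cn", "us"])
```

    Both pay off when a column holds many repeated values: long runs shrink to a single pair, and repeated strings shrink to small integer codes.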

  • Custom format

    Users can define their own input and output formats by implementing InputFormat and OutputFormat.


Origin www.cnblogs.com/zbw1112/p/11897866.html