Hive file storage formats:
- TextFile
TextFile is the default format.
Storage: row-oriented
Disadvantages: high disk overhead; high data parsing overhead; Hive cannot merge or split compressed text files.
- SequenceFile
A binary file format that serializes data as a sequence of <key, value> pairs.
Storage: row-oriented
Advantages: splittable and compressible (block compression is generally chosen); compatible with MapFile in the Hadoop API.
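The idea of a binary file made of <key, value> pairs can be sketched in a few lines of Python. This is only a toy layout, assuming a length-prefixed record; it is not the real SequenceFile on-disk format (which also has a header, sync markers, and optional compression), and the names write_records and read_records are hypothetical.

```python
import io
import struct

def write_records(stream, records):
    """Append each (key, value) pair as: key length, value length, key bytes, value bytes."""
    for key, value in records:
        # Two big-endian unsigned 32-bit lengths form the record header (an assumed layout).
        stream.write(struct.pack(">II", len(key), len(value)))
        stream.write(key)
        stream.write(value)

def read_records(stream):
    """Yield (key, value) pairs back in the order they were written."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return  # end of stream
        key_len, value_len = struct.unpack(">II", header)
        yield stream.read(key_len), stream.read(value_len)

buffer = io.BytesIO()
write_records(buffer, [(b"1", b"alice"), (b"2", b"bob")])
buffer.seek(0)
pairs = list(read_records(buffer))
print(pairs)
```

Because every record is stored whole, one after another, this is row-oriented storage: reading any single field still means walking through complete records.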
- RCFile
Storage: data is split into blocks by row (row groups), and within each block it is stored column by column.
Fast compression and fast column access.
Reading a record touches as few blocks as possible.
Reading a given column only requires reading the header definition of each row group.
For full-table reads, performance may not have a clear advantage over SequenceFile.
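The "split by row, store by column" layout can be illustrated with a small in-memory sketch. It assumes plain Python lists in place of the real RCFile byte layout, and to_row_groups and read_column are hypothetical helper names.

```python
def to_row_groups(rows, group_size):
    """Split rows into groups, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # zip(*chunk) transposes the rows of this group into columns.
        groups.append([list(column) for column in zip(*chunk)])
    return groups

def read_column(groups, column_index):
    """Read one column by touching only that column in each row group."""
    values = []
    for group in groups:
        values.extend(group[column_index])
    return values

rows = [
    (1, "alice", 30),
    (2, "bob", 25),
    (3, "carol", 41),
    (4, "dave", 37),
]
groups = to_row_groups(rows, group_size=2)
names = read_column(groups, 1)
print(groups)
print(names)
```

All fields of a given row stay inside one group, while a query that needs only one column can skip the other columns in every group.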
- ORC
Storage: data is split into blocks by row, and within each block it is stored column by column.
Fast compression and fast column access.
More efficient than RCFile; it is an improved version of RCFile.
Official website:
The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.
ORC is essentially the RCFile storage format with several optimizations. Its main advantages are:
(1) each task outputs only a single file, which reduces the load on the NameNode;
(2) it supports a range of data types, for example datetime, decimal, and the complex types (struct, list, map, and union);
(3) lightweight index data is stored within the file;
(4) block-mode compression is chosen by data type: a. integer columns use run-length encoding; b. string columns use dictionary encoding;
(5) the same file can be read concurrently by multiple independent RecordReaders;
(6) files can be split without scanning for markers;
(7) the amount of memory needed for reading and writing is bounded;
(8) metadata is stored using Protocol Buffers, so adding and removing fields is supported.
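The two encodings named in point (4) can be sketched as follows. This shows only the basic idea of each technique; the encodings ORC actually uses are more elaborate, and the function names here are hypothetical.

```python
def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([value, 1])  # start a new run
    return runs

def dictionary_encode(strings):
    """Replace each string with an index into a list of distinct strings."""
    dictionary = []
    positions = {}
    indexes = []
    for s in strings:
        if s not in positions:
            positions[s] = len(dictionary)
            dictionary.append(s)
        indexes.append(positions[s])
    return dictionary, indexes

runs = run_length_encode([7, 7, 7, 3, 3, 9])
dictionary, indexes = dictionary_encode(["cn", "us", "cn", "cn", "us"])
print(runs)
print(dictionary, indexes)
```

Both techniques pay off on columnar data because values from the same column tend to repeat, which is much less common when whole rows are stored next to each other.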
- Custom format
Users can define their own input and output formats by implementing InputFormat and OutputFormat.