Comparison of Hive storage formats

Apache Hive supports several file formats commonly used in Apache Hadoop, such as TextFile, RCFile, SequenceFile, Avro, ORC, and Parquet.
Cloudera Impala also supports these file formats. When creating a table, use the clause STORED AS (TextFile | RCFile | SequenceFile | AVRO | ORC | Parquet) to specify the storage format.
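As a minimal sketch of how the STORED AS clause fits into a table definition, the Python helper below assembles a HiveQL CREATE TABLE statement as a string. The helper name create_table_ddl, the table page_views, and its columns are hypothetical and used only for illustration; the statement is printed, not executed against Hive.

```python
# Storage format keywords accepted after STORED AS in HiveQL.
HIVE_FORMATS = ("TEXTFILE", "SEQUENCEFILE", "RCFILE", "AVRO", "ORC", "PARQUET")


def create_table_ddl(table, columns, storage_format):
    """Build a CREATE TABLE statement (hypothetical helper, illustration only)."""
    fmt = storage_format.upper()
    if fmt not in HIVE_FORMATS:
        raise ValueError(f"unsupported storage format: {storage_format}")
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return f"CREATE TABLE {table} ({cols}) STORED AS {fmt}"


# Hypothetical table and columns.
ddl = create_table_ddl(
    "page_views",
    [("user_id", "BIGINT"), ("url", "STRING")],
    "orc",
)
print(ddl)
```

Changing only the last argument switches the same table definition to another format, which is how the formats described below are usually compared.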
 
In TextFile, each line is a record, terminated by a newline character (\n). The data is not compressed by default, so both disk overhead and parsing overhead are high. TextFile can be combined with Gzip or Bzip2 (the system detects the compression automatically and decompresses the data when a query runs). With Gzip, however, Hive cannot split the file, so the data cannot be processed in parallel. Bzip2 files remain splittable in Hadoop, at the cost of slower compression.
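The splitting limitation can be illustrated with a small sketch using only the Python standard library. It is not how Hive or Hadoop compute input splits; the function names and sample data are made up for this illustration. A reader dropped into the middle of a plain text file can skip to the next newline and parse from there, while a reader dropped into the middle of a gzip stream cannot decode anything.

```python
import gzip
import zlib

# Hypothetical sample data: 10,000 newline-terminated CSV records.
lines = [f"{i},user{i},click\n".encode() for i in range(10_000)]
plain = b"".join(lines)
packed = gzip.compress(plain)


def readable_from_middle_plain(data):
    """Start at an arbitrary offset, skip to the next newline, parse the rest."""
    mid = len(data) // 2
    start = data.index(b"\n", mid) + 1
    rest = data[start:]
    # Every remaining line should be a complete three-field record.
    return all(line.count(b",") == 2 for line in rest.splitlines())


def readable_from_middle_gzip(data):
    """Try to decompress starting at an arbitrary offset inside the stream."""
    mid = len(data) // 2
    try:
        gzip.decompress(data[mid:])
        return True
    except (OSError, EOFError, zlib.error):
        # No gzip header and no usable decoder state at this offset.
        return False


print("plain text readable from the middle:", readable_from_middle_plain(plain))
print("gzip readable from the middle:", readable_from_middle_gzip(packed))
```

This is why a single large gzip-compressed text file ends up being read by one task, while the uncompressed file can be divided among many.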
 
SequenceFile is a binary file format provided by the Hadoop API. It is easy to use, splittable, and compressible. It supports three compression options: NONE, RECORD, and BLOCK. RECORD compression has a low compression ratio, so BLOCK compression is generally recommended.
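The difference between RECORD and BLOCK compression can be approximated with a short sketch. It uses zlib from the Python standard library as a stand-in codec and does not reproduce the actual SequenceFile layout; the records are hypothetical. Compressing each record on its own gives the codec very little context, while compressing a batch of records together lets it exploit the repetition between them.

```python
import zlib

# Hypothetical records with the kind of repetition typical of log data.
records = [f"user{i % 50},click,2019-11-18".encode() for i in range(1000)]

raw_bytes = sum(len(r) for r in records)

# RECORD-style: every record is compressed separately.
record_bytes = sum(len(zlib.compress(r)) for r in records)

# BLOCK-style: many records are compressed together as one block.
block_bytes = len(zlib.compress(b"\n".join(records)))

print("raw:", raw_bytes)
print("record-level compression:", record_bytes)
print("block-level compression:", block_bytes)
```

On short records like these, per-record compression can even produce more bytes than the raw input because of per-stream overhead, which matches the recommendation to prefer BLOCK.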
 
RCFile combines row-based and column-based storage. First, the data is partitioned by row into blocks, which guarantees that a given record lives in a single block and avoids having to read multiple blocks to read one record. Second, within each block the data is stored by column, which facilitates compression and fast column access.
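The two-step layout can be sketched in a few lines of Python. This is a simplified in-memory model, not the RCFile on-disk format; the function names, the sample rows, and the group size of 2 are assumptions made for illustration.

```python
def to_row_groups(rows, group_size):
    """Partition rows into groups, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # Transpose the chunk: one list per column.
        columns = [list(col) for col in zip(*chunk)]
        groups.append(columns)
    return groups


def read_column(groups, col_index):
    """Read one column across all groups without touching other columns."""
    return [value for group in groups for value in group[col_index]]


def read_record(groups, group_index, row_index):
    """Rebuild one record; all of its fields live in a single group."""
    return tuple(col[row_index] for col in groups[group_index])


# Hypothetical rows: (id, name, age).
rows = [
    (1, "alice", 30),
    (2, "bob", 25),
    (3, "carol", 41),
    (4, "dave", 37),
    (5, "erin", 29),
]
groups = to_row_groups(rows, group_size=2)

print(read_column(groups, 1))
print(read_record(groups, 1, 0))
```

Reading a column only visits that column's list in each group, and reading a record only visits one group, which are the two properties the layout is meant to provide.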
 
Avro is an open source project that provides data serialization and data exchange services for Hadoop. It lets you exchange data between the Hadoop ecosystem and programs written in any programming language. Avro is one of the popular file formats in Hadoop-based big data applications.
 
ORC stands for Optimized Row Columnar file format. The ORC file format provides an efficient way to store data in Hive tables. It was designed to overcome the limitations of the other Hive file formats. Using ORC files improves performance when Hive reads, writes, and processes data from large tables.

Parquet is a column-oriented binary file format. Parquet is efficient for large-scale queries, and it is especially useful for queries that scan particular columns of a table. Parquet tables can be compressed with Snappy or gzip; Snappy is currently the default.

[Figure: Comparison of storage formats]

[Figure: Parquet versus ORC comparison]
Summary: if the data is both stored and queried only in Hive, the ORC format is recommended; if the data is stored in Hive and queried with Impala, Parquet is recommended.

 


Origin www.cnblogs.com/hello-wei/p/11883663.html