Hive file storage formats

This article collects notes from around the Internet on the performance of several Hive file storage formats and on Hadoop's file storage formats.

Hive has three built-in file formats: TEXTFILE, SEQUENCEFILE, and RCFILE. TEXTFILE and SEQUENCEFILE are row-oriented formats, while RCFILE mixes row and column storage: the data is first divided horizontally into N row groups, and within each row group every column is stored contiguously. In addition, Hive supports custom formats; for details, see: Hive file storage format.

Row storage on HDFS offers fast data loading and adapts well to dynamic workloads, because it keeps all fields of the same record on the same cluster node. But it cannot deliver fast query response times: when a query touches only a few of the columns, row storage cannot skip the unneeded columns and seek directly to the required ones. It also has a bottleneck in storage-space utilization: because a table mixes columns of different types and value distributions, row storage cannot easily achieve a high compression ratio. RCFILE is a column-oriented storage format implemented on top of SEQUENCEFILE. Besides retaining fast data loading and good adaptability to dynamic workloads, it removes some of SEQUENCEFILE's bottlenecks.

The following is a brief introduction to these types:

TextFile:
The default format of Hive. Data is stored uncompressed, so disk overhead is large and so is the cost of parsing the data.
It can be combined with compression codecs such as Gzip, Bzip2, and Snappy (Hive detects the codec automatically and decompresses during query execution), but a text file compressed this way cannot be split, so Hive cannot process the data in parallel.
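As a minimal sketch of this setup (table names, column list, and source table are hypothetical, not from the original post), writing gzip-compressed text output looks like:

CREATE TABLE demo_text_table (id BIGINT, name STRING)
STORED AS TEXTFILE;

set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- The resulting .gz files cannot be split, so later queries over
-- this table will not read the files in parallel.
INSERT OVERWRITE TABLE demo_text_table SELECT id, name FROM source_table;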

SequenceFile:
SequenceFile is a binary file format provided by the Hadoop API. It serializes data to file as <key, value> pairs, internally using Hadoop's standard Writable interface for serialization and deserialization, and it is compatible with MapFile in the Hadoop API. Hive's SequenceFile inherits from the Hadoop API's SequenceFile, but its key is empty and the actual row is stored in the value, which avoids the map-stage sorting that MR would otherwise perform.
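A minimal sketch of writing a block-compressed SequenceFile table (table names and columns are hypothetical; the settings match those used in the tests below):

CREATE TABLE demo_seq_table (id BIGINT, name STRING)
STORED AS SEQUENCEFILE;

set hive.exec.compress.output=true;
set mapred.output.compress=true;
set io.seqfile.compression.type=BLOCK;  -- NONE / RECORD / BLOCK

INSERT OVERWRITE TABLE demo_seq_table SELECT id, name FROM source_table;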

SequenceFile file structure diagram (figure omitted). The general header format is:

SEQ: 3 bytes (magic number)
Version: 1 byte (version number)
keyClassName: key class name
valueClassName: value class name
compression (boolean): whether compression is enabled for the file
blockCompression (boolean): whether block compression is used
compression codec: the compression codec class
Metadata: file metadata
Sync: marker ending the header
Block-Compressed SequenceFile format (figure omitted).
RCFile
RCFile is a column-oriented data format introduced by Hive. It follows the design of "partition horizontally first, then vertically": data is split into row groups, and within each row group it is stored column by column. When a query does not need certain columns, it skips the I/O for those columns. Note that in the map phase RCFile still copies the entire data block from the remote node to the local directory; it does not seek directly past unneeded columns to the columns it must read, but locates them by scanning the header of each row group, because the header at the HDFS block level does not record in which row group each column starts and ends. Consequently, when all columns are read, RCFile performs worse than SequenceFile.
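A minimal sketch of an RCFile table and a query that benefits from column pruning (table and column names are hypothetical):

CREATE TABLE demo_rc_table (id BIGINT, name STRING, score DOUBLE)
STORED AS RCFILE;

-- Only the bytes of the name and score columns are read from each
-- row group; the id column is skipped on I/O.
SELECT name, score FROM demo_rc_table WHERE name = 'XXX';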

The following introduces row storage and column storage (for details, refer to: Facebook Data Warehouse Secret: RCFile Efficient Storage Structure)

Row Storage

Example of row storage within an HDFS block (figure omitted):

The advantages of a row storage structure on a Hadoop system are fast data loading and good adaptability to dynamic workloads, because row storage guarantees that all fields of the same record reside on the same cluster node, i.e. in the same HDFS block. The disadvantages are equally obvious: it cannot support fast query processing, because it cannot skip unnecessary column reads when a query touches only a few of a table's many columns; and because a row mixes columns with different data values, row storage cannot easily achieve a very high compression ratio, so space utilization is hard to improve.

Column Storage
Example of column storage within an HDFS block (figure omitted):

The example shows a table stored by column group on HDFS. Columns A and B are stored in the same column group, while columns C and D are each stored in separate column groups. Column storage can avoid reading unnecessary columns at query time, and because the data within a column is similar, it compresses to a higher ratio. However, it does not provide fast query processing on Hadoop systems, because tuple reconstruction is expensive: column storage does not guarantee that all fields of the same record are stored on the same cluster node. In this example, the four fields of a record are stored in three HDFS blocks located on different nodes, so reconstructing records causes a large amount of data transfer over the cluster network. Although pre-grouping multiple columns together can reduce this overhead, it adapts poorly to highly dynamic workload patterns.

RCFile combines the fast queries of row storage with the space savings of column storage. First, RCFile guarantees that the data of the same row is located on the same node, so the cost of tuple reconstruction is very low. Second, like column storage, RCFile can compress data along the column dimension and can skip unnecessary column reads.
Example of RCFile storage within an HDFS block (figure omitted):



Data test

Number of data records in the source table: 67236221

Step 1: Create tables in the three file formats; for table-creation syntax, refer to Hive file storage format. A sketch of the DDL is shown below.
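A plausible sketch of the DDL (the original post does not show it; the real column list is unknown, and only game and game_server are taken from the later queries):

CREATE TABLE hzr_test_text_table (game STRING, game_server STRING)
PARTITIONED BY (product STRING, dt STRING)
STORED AS TEXTFILE;

CREATE TABLE hzr_test_sequence_table (game STRING, game_server STRING)
PARTITIONED BY (product STRING, dt STRING)
STORED AS SEQUENCEFILE;

CREATE TABLE hzr_test_rcfile_table (game STRING, game_server STRING)
PARTITIONED BY (product STRING, dt STRING)
STORED AS RCFILE;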

--TextFile 
set hive.exec.compress.output=true; 
set mapred.output.compress=true; 
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; 
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec; 
INSERT OVERWRITE table hzr_test_text_table PARTITION(product='xxx',dt='2013-04-22') 
SELECT xxx,xxx.... FROM xxxtable WHERE product='xxx' AND dt='2013-04-22'; 
 
--SequenceFile 
set hive.exec.compress.output=true; 
set mapred.output.compress=true; 
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; 
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec; 
set io.seqfile.compression.type=BLOCK; 
INSERT OVERWRITE table hzr_test_sequence_table PARTITION(product='xxx',dt='2013-04-22') 
SELECT xxx,xxx.... FROM xxxtable WHERE product='xxx' AND dt='2013-04-22'; 
 
--RCFile 
set hive.exec.compress.output=true; 
set mapred.output.compress=true; 
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; 
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec; 
INSERT OVERWRITE table hzr_test_rcfile_table PARTITION(product='xxx',dt='2013-04-22') 
SELECT xxx,xxx.... FROM xxxtable WHERE product='xxx' AND dt='2013-04-22'; 


Step 2: Measure the time taken by insert overwrite table tablename select ... and the storage space used.

Type           Insert time (s)   Storage space (GB)
SequenceFile   97.291            7.13
RCFile         120.901           5.73
TextFile       290.517           6.80

Comparison of insert time consumption and count(1) time consumption: (figures omitted)
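The count(1) timings presumably came from queries of the following form (a hedged guess; the original post only shows the charts):

select count(1) from hzr_test_text_table where product='xxx' and dt='2013-04-22';
select count(1) from hzr_test_sequence_table where product='xxx' and dt='2013-04-22';
select count(1) from hzr_test_rcfile_table where product='xxx' and dt='2013-04-22';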

Step 3: Query response time

Test 1

Scheme 1: test query efficiency over entire rows:
select * from hzr_test_sequence_table where game='XXX';
select * from hzr_test_rcfile_table where game='XXX';
select * from hzr_test_text_table where game='XXX';

Scheme 2: test query efficiency over specific columns:
select game,game_server from hzr_test_sequence_table where game='XXX';
select game,game_server from hzr_test_rcfile_table where game='XXX';
select game,game_server from hzr_test_text_table where game='XXX';

File format    Whole-row query time (s)   Specific-column query time (s)
Sequence       42.241                     39.918
RCFile         37.395                     36.248
Text           43.164                     41.632

Time-consuming comparison of the schemes: (figure omitted)
Test 2:
The purpose of this test is to verify whether RCFILE's data-reading method and its lazy decompression give a performance advantage. The reading method reads only the metadata and the relevant columns, saving I/O; lazy decompression decompresses only the relevant column data, and does not decompress column data for rows that fail the WHERE condition, which saves both I/O and CPU.

Scheme 1:
Number of records: 698020

insert overwrite local directory 'XXX/XXXX' select game,game_server from hzr_test_xxx_table where game='XXX'; 


Scheme 2:
Number of records: 67236221

insert overwrite local directory 'xxx/xxxx' select game,game_server from hzr_test_xxx_table; 

Scheme 3:
Number of records:

insert overwrite local directory 'xxx/xxx' 
select game from hzr_xxx_rcfile_table; 

File type      Scheme 1 (s)   Scheme 2 (s)   Scheme 3 (s)
TextFile       54.895         69.428         167.667
SequenceFile   137.096        77.03          123.667
RCFile         44.28          57.037         89.9


The results above show that RCFILE's query efficiency is higher than SEQUENCEFILE's, and when reading specific columns RCFILE still outperforms SEQUENCEFILE.
