Hive file storage formats for big data

A, Hive file storage formats

Hive supports several main storage formats: TEXTFILE, SEQUENCEFILE, ORC, and PARQUET.

[Figure: a logical table (left) and its two physical layouts (right): row storage first, then column storage]

In the figure, the left side is the logical table; on the right, the first layout is row storage and the second is column storage.

Characteristics of row storage: when a query needs an entire row that satisfies some condition, column storage has to visit each field's aggregated block to fetch that row's value for every column, while row storage only needs to locate one value and the rest of the row sits in adjacent places. In this case, row storage queries are faster.

Characteristics of column storage: because the data of each field is stored together, a query that only touches a few fields can greatly reduce the amount of data read; and since the data type within each field is always the same, column storage allows better, type-specific compression algorithms to be designed.

TEXTFILE and SEQUENCEFILE are row-based storage formats; ORC and PARQUET are column-based.

1, Row storage format: TextFile

This is the default format. The data is not compressed, so both disk overhead and data parsing overhead are large. It can be combined with Gzip or Bzip2, but when Gzip is used this way Hive cannot split the data, so the data cannot be processed in parallel.
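As a sketch, a TextFile table can be declared explicitly (the table name, columns, and file path below are made up for illustration):

```sql
-- Hypothetical example table; TEXTFILE is the default, so STORED AS TEXTFILE
-- could be omitted.
CREATE TABLE log_text (
    track_time STRING,
    url        STRING,
    ip         STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- A Gzip-compressed file can be loaded as-is, but since Hive cannot split it,
-- a single mapper has to read the whole file:
LOAD DATA LOCAL INPATH '/tmp/log.tsv.gz' INTO TABLE log_text;
```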

2, Column storage format: ORC

ORC (Optimized Row Columnar) is a storage format introduced in Hive 0.11.

Each ORC file consists of one or more stripes, each stripe about 250MB in size. A stripe corresponds to the RowGroup concept, but the size grows from 4MB to 250MB, which should improve sequential-read throughput. Each stripe has three parts: Index Data, Row Data, and Stripe Footer:

[Figure: ORC file structure, showing stripes composed of Index Data, Row Data, and Stripe Footer]

1) Index Data: a lightweight index; by default, an index entry is written every 10,000 rows. The index only records the offset of each field of a row within the Row Data.

2) Row Data: stores the actual data. It takes a batch of rows and stores those rows column by column. Each column is encoded and split into multiple Streams for storage.

3) Stripe Footer: stores the type, length, and other metadata of each Stream.

Each file has a File Footer, which stores the number of rows in each stripe and the data type of each column, among other things. The tail of each file is a PostScript, which records the compression type of the whole file and the length of the File Footer. When reading the file, the reader seeks to the end of the file and reads the PostScript, parses the File Footer length from it, reads the File Footer, parses the information of each stripe from it, and then reads each stripe; in other words, the file is read from back to front.
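As a sketch, creating a Hive table stored as ORC looks like this (the table name and columns are made up; `orc.compress` is the ORC table property that selects the codec, with ZLIB as the default):

```sql
-- Hypothetical example table stored as ORC with Snappy compression.
CREATE TABLE log_orc (
    track_time STRING,
    url        STRING,
    ip         STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

-- ORC files are binary, so data is normally written through Hive itself,
-- e.g. with an INSERT ... SELECT from an existing table, rather than with
-- LOAD DATA on a text file.
```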

--------------------------------------------

3, Column storage format: Parquet

Parquet is a columnar storage format for analytical workloads, developed jointly by Twitter and Cloudera. In May 2015, it graduated from the Apache Incubator to become an Apache top-level project.

Parquet files are stored in binary form, so they are not directly human-readable. A file contains both its data and its metadata, so the Parquet format is self-describing.

Typically, when Parquet data is stored, the row group size is set according to the HDFS block size. Since the smallest unit of data each Mapper task processes is generally one block, every row group can then be handled by one Mapper task, which increases the parallelism of task execution. The Parquet file format is shown below.

[Figure: Parquet file format, showing the Magic Code, row groups with column chunks and pages, the file metadata, and the Footer length]

The figure shows the contents of a Parquet file. A single file can store multiple row groups. The beginning and the end of the file both hold the file's Magic Code, used to verify that it is a Parquet file. The Footer length records the size of the file metadata, so the offset of the metadata can be computed from this value and the file length. The file metadata includes the metadata of every row group and the Schema of the data stored in the file. Besides the per-row-group metadata, the metadata of each page is stored at the start of that page. Parquet has three types of pages: data pages, dictionary pages, and index pages. A data page stores the values of a column in the current row group; a dictionary page stores the encoding dictionary for that column's values (each column chunk contains at most one dictionary page); an index page would store the index of that column in the current row group, but Parquet does not yet support index pages.
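As a sketch, a Parquet table can be created like this (names are made up; depending on the Hive version, the codec may instead be set with a session setting such as `SET parquet.compression=SNAPPY;`):

```sql
-- Hypothetical example table stored as Parquet with Snappy compression.
CREATE TABLE log_parquet (
    track_time STRING,
    url        STRING,
    ip         STRING
)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression" = "SNAPPY");
```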

Summary of file compression ratios:

ORC > Parquet > TextFile

Summary of query speed: the three formats are roughly similar.

In real projects, the storage format of Hive tables is usually chosen to be ORC or Parquet, and the compression codec is usually Snappy or LZO.
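One way to check the compression-ratio ranking above is to load the same data into one table per format and compare the on-disk sizes from inside the Hive CLI (the table names and warehouse paths below are hypothetical):

```sql
-- Assuming log_text, log_orc, and log_parquet hold the same rows and live in
-- the default warehouse location:
dfs -du -h /user/hive/warehouse/log_text/;
dfs -du -h /user/hive/warehouse/log_orc/;
dfs -du -h /user/hive/warehouse/log_parquet/;
```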

 


Origin www.cnblogs.com/jeff190812/p/11619604.html