Hive data warehouse table data storage format selection method

file storage format

According to the Hive official website, Apache Hive supports several file formats familiar from Apache Hadoop, such as TextFile (plain text), RCFile (row-columnar file), SequenceFile (binary serialized file), Avro, ORC (optimized row-columnar file), and Parquet; of these, we currently use TextFile, SequenceFile, ORC, and Parquet.
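For orientation, here is a minimal sketch of how the storage format is declared when a table is created (the table names are made up for illustration):

-- The storage format is chosen per table with STORED AS (hypothetical table names)
create table log_text (id int, msg string) stored as textfile;
create table log_orc (id int, msg string) stored as orc;
create table log_par (id int, msg string) stored as parquet;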

Let's take a closer look at these two row-columnar storage formats.

1. ORC

1.1 ORC storage structure

We first get the ORC storage model diagram from the official website

It looks a little complicated, so let's simplify it a bit; I drew a simple diagram to illustrate.

The figure on the left shows the traditional row-based storage method, where data is stored row by row. Without an index, querying a single field requires reading each entire row and then filtering, which consumes more IO resources, so Hive initially used indexes to solve this problem.

However, because indexes are costly to maintain, "indexes have been abolished in the current Hive 3.x", and of course columnar storage had long since been introduced.

Columnar storage stores data one column at a time, as shown on the right of the figure above. In this case, querying the data of a single field is like an index lookup and is highly efficient. However, if you need to scan the whole table, it takes more resources, because all the columns have to be read separately and then reassembled into rows. So ORC row-columnar storage appeared, which combines the advantages of both:

  1. When a full table scan is required, the data can be read row group by row group;

  2. When only certain columns are needed, just the specified columns are read within each row group, instead of all fields of all rows in every row group; a small query sketch follows this list.
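To make point 2 concrete, a small hedged sketch (the table and column names are hypothetical):

-- Full table scan: data is read row group by row group
select * from user_orc;
-- Column query: only the name column's data within each row group is read
select name from user_orc;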

After understanding the basic logic of ORC storage, let's take a look at its storage model diagram.

At the same time, I also attached the detailed text below, so you can check it out:

  • Stripe: where the ORC file stores its data; each stripe is generally the size of an HDFS block. It contains the following three parts:

index data: stores some statistics for the stripe, as well as position index information for the data within the stripe.
row data: where the data itself is stored; it consists of multiple row groups (every 10,000 rows form a row group), and the data is stored as streams.
stripe footer: stores the directory of where the data streams are located.
  • File footer: contains the list of stripes in the file, the number of rows in each stripe, and the data type of each column. It also contains aggregate statistics for each column, such as min, max, row count, and sum.

  • Postscript: contains information about the compression parameters and the compressed size.

So in fact ORC provides three levels of indexes: file level, stripe level, and row-group level. When querying, these indexes can be used to skip most of the files and data blocks that do not match the query conditions.

Note, however, that in ORC the description information (metadata) of all the data is stored together with the data itself rather than in an external database.
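As a hedged illustration of how these indexes are used at query time: in many Hive versions the setting hive.optimize.index.filter controls whether WHERE predicates are pushed down into the ORC reader (verify the setting and its default in your version; the table name is hypothetical).

-- Push predicates down to the ORC reader, so stripes and row groups
-- whose min/max statistics cannot match the predicate are skipped
set hive.optimize.index.filter=true;
select id, name from user_orc where id = 1001;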

"Special note: ORC format tables also support transaction ACID, but tables that support transactions must be bucketed tables, so it is suitable for updating large batches of data. It is not recommended to frequently update small batches of data with transactions"

-- Enable concurrency support, needed for insert, delete and update transactions
set hive.support.concurrency=true;
-- Tables that support ACID transactions must be bucketed tables
set hive.enforce.bucketing=true;
-- Transactions require dynamic partitioning in non-strict mode
set hive.exec.dynamic.partition.mode=nonstrict;
-- Set the transaction manager to org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
-- (the original org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager does not support transactions)
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- Run the compaction initiator and cleaner threads on the same metastore instance
set hive.compactor.initiator.on=true;
-- Number of compactor worker threads per metastore instance
set hive.compactor.worker.threads=1;
-- (2) Create the table
create table student_txn
(id int,
name string
)
-- Must be bucketed
clustered by (id) into 2 buckets
-- Add transaction support in the table properties
stored as orc
TBLPROPERTIES('transactional'='true');
-- (3) Insert data
-- Insert a row with id 1001 and name student_1001
insert into table student_txn values('1001','student_1001');
-- (4) Update data
update student_txn set name='student_1zh' where id='1001';
-- (5) Query the table; the name for id 1001 has been changed to student_1zh

1.2 Hive configuration for ORC

Table configuration properties (configured when creating a table, for example tblproperties ('orc.compress'='snappy');)

  • orc.compress: the compression type of the ORC file. "The optional types are NONE, ZLIB and SNAPPY; the default value is ZLIB (Snappy does not support splitting)" --- this configuration is the most critical one.

  • orc.compress.size: the size of each compression chunk; the default value is 262144 (256 KB).

  • orc.stripe.size: the size of the memory buffer pool that can be used when writing a stripe; the default value is 67108864 (64 MB).

  • orc.row.index.stride: the row-group level index stride, i.e. the number of rows per row group; the default is 10000, and it must be set to a value greater than or equal to 10000.

  • orc.create.index: whether to create row-group level indexes; the default is true.

  • orc.bloom.filter.columns: the columns for which bloom filters should be created.

  • orc.bloom.filter.fpp: the false positive probability used by the bloom filter; the default value is 0.05.

Extension: Using a bloom filter in Hive makes it possible to determine quickly, and with relatively little file space, whether data is stored in the table. However, data that does not belong to the table may also be judged as belonging to it; this is called the false positive probability. Developers can tune this probability, but the lower the probability, the more space the Bloom filter needs.
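A hedged sketch of how these properties are usually combined when creating a table (the table and column names are made up; the values simply mirror the defaults and options described above):

-- Hypothetical ORC table: Snappy compression plus a bloom filter on the id column
create table user_orc (id int, name string)
stored as orc
tblproperties (
  'orc.compress'='SNAPPY',
  'orc.bloom.filter.columns'='id',
  'orc.bloom.filter.fpp'='0.05'
);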

2. Parquet

Having covered ORC above, we now have a basic understanding of row-columnar storage; Parquet is another high-performance row-columnar storage structure.

2.1 Storage structure of Parquet

Since ORC is so efficient, why is there also Parquet? Because "Parquet aims to make a compressed, efficient columnar data representation available to any project in the Hadoop ecosystem".

Parquet is language-independent and not bound to any particular data processing framework. It works with a variety of languages and components. Components that can work with Parquet include:

Query Engines: Hive, Impala, Pig, Presto, Drill, Tajo, HAWQ, IBM Big SQL

Computing Framework: MapReduce, Spark, Cascading, Crunch, Scalding, Kite

Data Models: Avro, Thrift, Protocol Buffers, POJOs

Let's take a look at the storage structure of Parquet, starting with the diagram on the official website.

Well, it's a bit large, so I'll draw a simplified version.

Parquet files are stored in binary format, so they cannot be read directly. Like ORC, a file's metadata is stored together with the data, so Parquet files are self-describing.

  1. Row Group: each row group contains a certain number of rows, and at least one row group is stored in an HDFS file; it is similar to ORC's stripe concept.

  2. Column Chunk: each column in a row group is stored in a column chunk, and all the column chunks of a row group are stored consecutively in the file. The values in a column chunk are all of the same type, and different column chunks may be compressed with different algorithms.

  3. Page: each column chunk is divided into multiple pages. A page is the smallest encoding unit, and different pages within the same column chunk may use different encodings.

2.2 Parquet table configuration properties

  • parquet.block.size: the default value is 134217728 bytes (128 MB), the size of a Row Group held in memory. Setting this value larger can improve the read efficiency of Parquet files, but correspondingly consumes more memory when writing.

  • parquet.page.size: the default value is 1048576 bytes (1 MB), the size of each page. This refers specifically to the compressed page size; a page is decompressed first when it is read. The page is the smallest unit on which Parquet operates, and a full page must be read to access any of its data. If this value is set too small, it causes performance problems when compressing.

  • parquet.compression: the default value is UNCOMPRESSED, the compression method for pages. "The available compression methods are UNCOMPRESSED, SNAPPY, GZIP and LZO".

  • parquet.enable.dictionary: the default is true, whether dictionary encoding is enabled.

  • parquet.dictionary.page.size: the default value is 1048576 bytes (1 MB). When dictionary encoding is enabled, a dictionary page is created for each column in each row group in Parquet. If the stored data pages contain many repeated values, dictionary encoding can achieve a good compression effect and also reduce the memory footprint of each page.
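As a hedged sketch, in many setups these Parquet properties can also be passed through as session-level configuration before writing a table (treat the pass-through and the values as assumptions to verify in your environment, not as recommendations):

-- Illustrative values only: larger row groups trade write-side memory for read efficiency
set parquet.block.size=134217728;
set parquet.page.size=1048576;
set parquet.compression=SNAPPY;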

3. Comparison of ORC and Parquet

At the same time, in a case from the author of "Hive Performance Tuning in Practice", two tables, one stored as ORC and one as Parquet, were loaded with the same data and queried with the same SQL. "It was found that the number of rows read with ORC is much smaller than with Parquet", so with ORC as the storage format, more unnecessary data can be filtered out with the help of its metadata, and queries need fewer cluster resources than with Parquet. (For a more detailed performance analysis, see https://blog.csdn.net/yu616568/article/details/51188479)

"So ORC still looks better in terms of storage"

Compression method

| Format | Splittable | Average compression speed | Text file compression efficiency | Hadoop Compression Codec | Pure Java implementation | Native | Remark |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gzip | No | Fast | High | org.apache.hadoop.io.compress.GzipCodec | Yes | Yes | |
| lzo | Yes (depending on the library used) | Very fast | Medium | com.hadoop.compression.lzo.LzoCodec | Yes | Yes | Requires LZO to be installed on each node |
| bzip2 | Yes | Slow | Very high | org.apache.hadoop.io.compress.BZip2Codec | Yes | Yes | Uses pure Java for the splittable version |
| zlib | No | Slow | Medium | org.apache.hadoop.io.compress.DefaultCodec | Yes | Yes | Hadoop's default compression codec |
| Snappy | No | Very fast | Low | org.apache.hadoop.io.compress.SnappyCodec | No | Yes | Snappy has a pure Java port, but it does not work with Spark/Hadoop |
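For reference, a hedged sketch of selecting one of the codecs above for Hive job output (these are standard Hive/Hadoop properties, but check the defaults in your distribution):

-- Compress final query output with the Snappy codec listed in the table above
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;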

How to choose the combination of storage and compression?

Given the characteristics of ORC and Parquet described above, the common combinations are:

1. ORC format storage, Snappy compression

create table stu_orc(id int,name string)
stored as orc 
tblproperties ('orc.compress'='snappy');

2. Parquet format storage, LZO compression

create table stu_par(id int,name string)
stored as parquet 
tblproperties ('parquet.compression'='lzo');

3. Parquet format storage, Snappy compression

create table stu_par(id int,name string)
stored as parquet 
tblproperties ('parquet.compression'='snappy');

Because Hive SQL is ultimately converted into MR tasks, if a file is stored as ORC and compressed with Snappy, then since Snappy does not support file splitting, the compressed file "can only be read by one task". If the compressed file is large, the map task processing that file will take much longer than the map tasks reading ordinary files, which is what is often called "data skew when maps read files".

To avoid this situation, compression algorithms that support file splitting, such as bzip2 and Zip, need to be used when compressing the data. However, ORC does not support the compression methods just mentioned, which is why people may avoid ORC when large files are expected, in order to prevent data skew.
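As a hedged sketch of the alternative described above (the staging table name, the choice of bzip2, and the reuse of the stu_orc table from the earlier example are illustrative assumptions):

-- Write a plain-text staging table with a splittable codec, so large files can still be split across map tasks
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;
create table stu_text_bz2 (id int, name string)
stored as textfile;
insert overwrite table stu_text_bz2
select id, name from stu_orc;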

The same is true with Hive on Spark. Spark, as a distributed engine, usually tries to read data from multiple different machines in parallel. To achieve this, each worker node must be able to find the beginning of a new record, which requires the file to be splittable; a file in a compression format that cannot be split has to be read in full by a single node, which can easily create a performance bottleneck.


Origin blog.csdn.net/ytp552200ytp/article/details/126154261