Should I choose ORC or Parquet for building Hive data warehouse tables, and LZO or Snappy for compression?

In the previous article I mentioned that, in the ODS layer of the data warehouse, I used the storage mode STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat', and ran into a case where count(*) returned a different result from select *. That pushed me to study file storage formats in detail (the problem itself is still unsolved, but at least I am on the way to the truth of it; if you want that article, please jump to the Hive Environment Tuning Encyclopedia, where one bug remains unresolved, and it is a headache!).

In the data warehouse, it is recommended that, apart from interface tables (those imported from other databases or finally exported to other databases), all other tables use a consistent storage format and compression format.

Let's first go over the mainstream storage formats and compression methods currently used for Hive tables.

File storage format

According to the Hive official website, Apache Hive supports several file formats familiar from Apache Hadoop, such as TextFile (plain text), SequenceFile (binary serialized key-value files), RCFile (row-columnar files), Avro, ORC (Optimized Row Columnar) and Parquet. Of these, the ones we currently use most are TextFile, SequenceFile, ORC and Parquet.
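For reference, the storage format of a Hive table is chosen with the STORED AS clause when the table is created; a minimal sketch (the table and column names below are made up for illustration):

-- hypothetical tables, only to show how STORED AS selects the file format
create table log_text (id int, msg string) stored as textfile;
create table log_orc (id int, msg string) stored as orc;
create table log_parquet (id int, msg string) stored as parquet;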

Let's take a closer look at the two columnar formats, ORC and Parquet.

1. ORC

1.1 ORC storage structure

We first get the ORC storage model diagram from the official website

It looks a little complicated, so let's simplify it; I drew a simple diagram to illustrate the idea.

[Figure: row-based storage (left) vs. column-based storage (right)]

The left side of the figure shows the traditional row-based storage used by relational databases: data is stored row by row. Without an index, querying a single field means reading out the entire row and then filtering, which consumes a lot of IO resources. In its early days, Hive used indexes to work around this problem.

However, because indexes are costly to maintain, they have been removed in the current Hive 3.X, and of course columnar storage had long since been introduced.

Column storage is exactly what its name says: data is stored column by column, as shown on the right side of the figure above. Querying a single field then behaves like an indexed lookup and is very efficient. But scanning the whole table is more expensive, because every column has to be fetched separately and then reassembled. This is where ORC's hybrid row-columnar storage comes in: data is first split into row groups, and within each row group it is stored column by column, which gives two benefits:

     1. When a full table scan is needed, the data can be read row group by row group.
     2. When only certain columns are needed, only the specified columns are read within each row group, rather than all the fields of all the rows in every row group.
After understanding the basic logic of ORC storage, let's take a look at its storage model diagram.

[Figure: ORC storage model: stripes (index data, row data, stripe footer), file footer, postscript]

At the same time, I have also attached the detailed text below, so you can check it out:

  • Stripe: where ORC files store data. Each stripe is usually about the size of an HDFS block and contains the following three parts:
1. Index data: statistics for the stripe, plus position index information for the data inside the stripe.
2. Row data: where the actual data lives; it is made up of row groups (every 10,000 rows form one row group), and the data is stored as streams.
3. Stripe footer: the directory of where the data (the streams) is located.
  • File footer: contains the list of stripes in the file, the number of rows in each stripe, and the data type of each column. It also contains column-level aggregates such as minimum, maximum, row count, and sum.
  • Postscript: contains the compression parameters and the size of the compressed footer.

So ORC in fact provides three levels of index: file level, stripe level, and row-group level. When querying, these indexes can be used to skip most of the files and data blocks that cannot match the query conditions.

Note, however, that ORC keeps all of this descriptive information together with the stored data itself; no external database is used.
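As a sketch of how these indexes help (the table name and settings below are illustrative, and defaults vary by Hive version, so treat them as assumptions to verify): when a query carries a filter, the ORC reader can compare the predicate against the min/max statistics kept at file, stripe and row-group level and skip whole stripes or row groups.

-- let Hive push predicates down into the ORC reader so the built-in indexes are used
set hive.optimize.ppd=true;
set hive.optimize.index.filter=true;

-- hypothetical ORC table; stripes and row groups whose [min, max] range for id
-- excludes 1001 can be skipped using the index statistics
select name from some_orc_table where id = 1001;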

Special note: ORC tables also support ACID transactions, but transactional tables must be bucketed, so they are suited to updating data in large batches. Frequently updating small batches of data through transactions is not recommended.

-- (1) session settings required for transactions
-- enable concurrency support, needed for insert/update/delete transactions
set hive.support.concurrency=true;
-- tables that support ACID transactions must be bucketed
set hive.enforce.bucketing=true;
-- transactions require dynamic partitioning in non-strict mode
set hive.exec.dynamic.partition.mode=nonstrict;
-- set the transaction manager to org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- the default org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager does not support transactions
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- run the compaction initiator and cleanup threads on this metastore instance
set hive.compactor.initiator.on=true;
-- number of compactor worker threads per metastore instance
set hive.compactor.worker.threads=1;

-- (2) create the table
create table student_txn
(id int,
 name string
)
-- must be bucketed
clustered by (id) into 2 buckets
-- declare transaction support in the table properties
stored as orc
TBLPROPERTIES('transactional'='true');

-- (3) insert data
-- insert a row with id 1001 and name student 1001
insert into table student_txn values('1001','student 1001');

-- (4) update data
update student_txn set name='student_1zh' where id='1001';

-- (5) query the table; the row with id 1001 now has the name student_1zh
select * from student_txn;

1.2 Hive configuration about ORC

Table-level configuration properties (set when the table is created, for example tblproperties('orc.compress'='SNAPPY')):

  • orc.compress: the compression codec for the ORC file. Valid values are NONE, ZLIB and SNAPPY; the default is ZLIB (note that Snappy does not support splitting). This is the most important of these settings.
  • orc.compress.size: the size of each compression chunk; the default is 262144 (256 KB).
  • orc.stripe.size: the memory buffer size available when writing a stripe; the default is 67108864 (64 MB).
  • orc.row.index.stride: the number of rows covered by each row-group-level index entry; the default is 10000, and it must be at least 1000.
  • orc.create.index: whether to create row-group-level indexes; the default is true.
  • orc.bloom.filter.columns: the columns for which bloom filters should be created.
  • orc.bloom.filter.fpp: the false positive probability of the bloom filter; the default is 0.05.

Extension: with a bloom filter, Hive can quickly determine, using relatively little extra file space, whether a piece of data is stored in the table. However, a bloom filter can also report that data belongs to the table when it does not; this is called the false positive probability. Developers can tune this probability, but the lower it is set, the more space the bloom filter needs.
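Putting the properties above together, here is a sketch of an ORC table that sets the compression codec and a bloom filter on one column (the table, columns and values are made up for illustration):

create table user_action_orc (
  user_id bigint,
  action  string,
  ts      string
)
stored as orc
tblproperties (
  'orc.compress'='SNAPPY',               -- compression codec for the ORC data
  'orc.bloom.filter.columns'='user_id',  -- build a bloom filter for user_id
  'orc.bloom.filter.fpp'='0.05'          -- false positive probability of the bloom filter
);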

2. Parquet

Having covered ORC above, we now have a basic understanding of columnar storage; Parquet is another high-performance columnar storage format.

2.1 Parquet storage structure

Since ORC is so efficient, why do we need another format like Parquet? Because Parquet aims to make a compressed, efficient columnar data representation available to every project in the Hadoop ecosystem. The compression formats it supports are Snappy, GZIP and LZO.

Parquet is language-independent and not tied to any single data processing framework; it works with a variety of languages and components. Components that can work with Parquet include:

Query engines: Hive, Impala, Pig, Presto, Drill, Tajo, HAWQ, IBM Big SQL

Compute frameworks: MapReduce, Spark, Cascading, Crunch, Scalding, Kite

Data models: Avro, Thrift, Protocol Buffers, POJOs

Let's look at Parquet's storage structure, starting with the diagram from the official website.

[Figure: Parquet file structure, from the official documentation]

Well, it's a bit much, so I'll draw a simplified version.

[Figure: simplified Parquet layout: row groups, column chunks, pages]

Parquet files are stored in binary form, so they cannot be read directly. Like ORC files, Parquet files store their metadata together with the data, so Parquet files are self-describing.

  1. Row Group: each row group contains a certain number of rows, and at least one row group is stored per HDFS file; it is similar in concept to ORC's stripe.
  2. Column Chunk: within a row group, each column is stored in a column chunk, and all the column chunks of a row group are stored consecutively in the file. The values in a column chunk are all of the same type, and different column chunks may be compressed with different algorithms.
  3. Page: each column chunk is divided into multiple pages. A page is the smallest unit of encoding, and different pages of the same column chunk may use different encodings.

2.2 Parquet table configuration properties

  • parquet.block.size: default 134217728 bytes (128 MB); the size of a Row Group in memory. A larger value improves the read efficiency of Parquet files, but correspondingly requires more memory when writing.
  • parquet.page.size: default 1048576 bytes (1 MB); the size of each page. This refers specifically to the compressed page size; pages are decompressed before they are read. A page is the smallest unit Parquet operates on: a whole page must be read before its data can be accessed. If this value is set too small, it can cause performance problems during compression.
  • parquet.compression: default UNCOMPRESSED; the page compression codec. Valid values are UNCOMPRESSED, SNAPPY, GZIP and LZO.
  • parquet.enable.dictionary: default true; whether to enable dictionary encoding.
  • parquet.dictionary.page.size: default 1048576 bytes (1 MB). When dictionary encoding is enabled, a dictionary page is created for each column of each row group. If the stored data pages contain many repeated values, dictionary encoding compresses well and also reduces the memory footprint of each page.
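As a sketch of how these properties might be applied when writing a Parquet table from Hive (the property names are the ones listed above, whether each is honored depends on your Hive and Parquet versions, and both table names are made up):

-- session-level settings for the Parquet writer
set parquet.compression=SNAPPY;          -- page compression codec
set parquet.block.size=134217728;        -- 128 MB row groups
set parquet.page.size=1048576;           -- 1 MB pages
set parquet.enable.dictionary=true;      -- dictionary encoding

-- write into a Parquet table (both tables are hypothetical)
insert overwrite table user_action_parquet
select user_id, action, ts from user_action_text;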

3. Comparison of ORC and Parquet

[Figures: comparison of ORC and Parquet]

Likewise, in a case from the author of "Hive Performance Tuning", the same data was loaded into two tables, one stored as ORC and one as Parquet, and the same SQL queries were run against both. The queries against the ORC table read far fewer rows than those against the Parquet table: with ORC as the storage format, the metadata helps filter out more of the unneeded data, so the query needs fewer cluster resources than with Parquet. (For a more detailed performance analysis, see https://blog.csdn.net/yu616568/article/details/51188479)
So in terms of storage, ORC still looks better.

Compression method

| Format | Splittable | Compression speed | Compression ratio on text | Hadoop compression codec | Pure Java implementation | Native | Remarks |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gzip | No | Fast | High | org.apache.hadoop.io.compress.GzipCodec | Yes | Yes | |
| lzo | Yes (depending on the library used) | Very fast | Medium | com.hadoop.compression.lzo.LzoCodec | Yes | Yes | LZO needs to be installed on every node |
| bzip2 | Yes | Slow | Very high | org.apache.hadoop.io.compress.BZip2Codec | Yes | Yes | The splittable version is pure Java |
| zlib | No | Slow | Medium | org.apache.hadoop.io.compress.DefaultCodec | Yes | Yes | Hadoop's default compression codec |
| Snappy | No | Very fast | Low | org.apache.hadoop.io.compress.SnappyCodec | No | Yes | Snappy has a pure Java port, but it cannot be used in Spark/Hadoop |
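For completeness, here is a sketch of how these codecs are usually wired into Hive's MR jobs through standard Hadoop/Hive settings (codec availability depends on what is installed on the cluster):

-- compress intermediate (shuffle) data with a fast codec
set hive.exec.compress.intermediate=true;
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- compress the final job output
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;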

How to choose the combination of storage and compression?

Based on what ORC and Parquet each support, the combinations generally used are:

1. ORC format storage, Snappy compression

create table stu_orc(id int,name string)
stored as orc 
tblproperties ('orc.compress'='snappy');

2. Parquet format storage, Lzo compression

create table stu_par(id int,name string)
stored as parquet
tblproperties ('parquet.compression'='lzo');

3. Parquet format storage, Snappy compression

create table stu_par(id int,name string)
stored as parquet
tblproperties ('parquet.compression'='snappy');

Because Hive SQL is converted into MR jobs, if a file is stored as ORC and compressed with Snappy, then, since Snappy does not support file splitting, the compressed file will be read by a single task. If that compressed file is large, the map task that processes it will take far longer than the maps reading ordinary files. This is the often-mentioned map-side data skew caused by reading files.

To avoid this, compression algorithms that do support file splitting, such as bzip2 and Zip, should be used when compressing the data. But those are exactly the compression methods ORC does not support, which is why you may choose not to use ORC when large files are expected, so as to avoid this data skew.

The same applies to Hive on Spark. Spark, as a distributed architecture, usually tries to read data from several different machines in parallel. For that to work, each worker node must be able to find the beginning of a new record, which requires the file to be splittable; for compressed files that cannot be split, a single node has to read in all the data, which easily becomes a performance bottleneck. (The next article walks through the source code of how Spark reads files in detail.)

Therefore, in actual production, the more common choice is Parquet storage with LZO compression. This avoids the data skew caused by reading large files that cannot be split.
If the data volume is not large, however (no files larger than a few GB are expected), ORC storage with Snappy compression is still very efficient.

 

Original article: blog.csdn.net/qq_32445015/article/details/115312951