Comparison of query performance of Hive file storage formats

1. Hive file storage format

Hive supports the following file storage formats:

  • Text File
  • SequenceFile
  • RCFile
  • Avro Files
  • ORC Files
  • Parquet
  • Custom INPUTFORMAT and OUTPUTFORMAT

Here we mainly compare the query performance of the Text File, ORC File, and Parquet storage formats. Before the comparison, let's briefly introduce these three formats. Text File is Hive's default storage format and is row-oriented, so there is no need to say more about it here.

ORC File

1. Introduction to ORC

What is the ORC file storage format? First, look at the description from the official website:

The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.

In other words, ORC (Optimized Row Columnar) is an efficient file format for storing Hive data. It overcomes some limitations of the other file formats and improves Hive's performance when reading and writing data.

ORC actually makes some optimizations on top of the RCFile storage format. Its main advantages are (several of the tunables behind them are sketched as table properties after this list):
  (1) each task outputs only a single file, which reduces the load on the NameNode;
  (2) support for various complex data types, such as datetime, decimal, and the complex types (struct, list, map, and union);
  (3) lightweight index data stored within the file;
  (4) block-mode compression based on the data type: a. run-length encoding for integer columns; b. dictionary encoding for string columns;
  (5) concurrent reads of the same file using multiple independent RecordReaders;
  (6) the ability to split files without scanning for markers;
  (7) a bound on the amount of memory required for reading and writing;
  (8) metadata stored in Protocol Buffers, which allows columns to be added and removed.
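
To make this concrete, here is a minimal sketch of how several of these features are exposed as ORC table properties in Hive DDL. The property names and defaults below follow the Hive LanguageManual ORC page; the table itself is made up for illustration, and you should check the defaults against your own Hive version.

create table page_view_orc_tuned(
user_ip string,
request_url string
)
stored as ORC
tblproperties (
"orc.compress"="ZLIB",              -- codec: NONE / ZLIB / SNAPPY (ZLIB is the default)
"orc.stripe.size"="268435456",      -- stripe size in bytes (256 MB, the default)
"orc.row.index.stride"="10000",     -- rows between entries of the lightweight index
"orc.create.index"="true"           -- whether to build the lightweight indexes
);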

2. ORC structure

The following figure is the ORC structure diagram from the official website:
[Figure: ORC file structure diagram from the Apache Hive documentation]

A table stored in ORC format is horizontally split into multiple stripes; within each stripe the data is stored column by column, and all the columns are stored in a single file. The default size of each stripe is 256MB; compared with RCFile's 4MB stripes, the larger stripes make reading ORC data more efficient.
For a more detailed introduction, see the official documentation [ https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC ]
and Hortonworks' introduction [ https://zh.hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/ ]

Parquet format

Parquet is just a storage format: it is language- and platform-independent and does not need to be bound to any particular data processing framework. The components that currently work with Parquet include the following; as you can see, basically all of the commonly used query engines and computing frameworks have been adapted to it, and data produced by other serialization tools can easily be converted to the Parquet format. (A quick way to see this loose coupling from inside Hive is sketched after the list.)

  • Query Engines: Hive, Impala, Pig, Presto, Drill, Tajo, HAWQ, IBM Big SQL
  • Computing Frameworks: MapReduce, Spark, Cascading, Crunch, Scalding, Kite
  • Data Models: Avro, Thrift, Protocol Buffers, POJOs
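
As one example of that loose coupling, Hive attaches Parquet to a table through an ordinary SerDe plus input/output format pair, which you can inspect once the page_view_parquet table from the test below exists. The class names in the comments are those of Hive's built-in Parquet support:

describe formatted page_view_parquet;
-- The output should include lines like:
--   SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
--   InputFormat:   org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
--   OutputFormat:  org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat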

2. Performance comparison

Before testing, let's take a look at the comparison and analysis of these storage formats on Hortonworks' website:
[Figure: Hortonworks chart comparing the on-disk size of the same dataset across storage formats]
It is obvious from the figure that files in the ORC and Parquet storage formats are much smaller than in the Text format; in other words, their compression ratios are much higher. Let's verify this in practice.
To compare the three storage formats, first create three tables and load the same 100,000 rows of data into each:

create table page_view(
user_ip string,
username string,
user_time string,
request_url string,
request_state string,
request_port string,
limited string,
des_url string,
brower string,
brower_limit string,
to_url string
)
row format delimited fields terminated by '\t'
stored as TEXTFILE;

load data local inpath '/opt/datas/page_view.log' into table page_view;

create table page_view_orc(
user_ip string,
username string,
user_time string,
request_url string,
request_state string,
request_port string,
limited string,
des_url string,
brower string,
brower_limit string,
to_url string
)
row format delimited fields terminated by '\t'
stored as ORC;

insert into table page_view_orc select * from page_view;

create table page_view_parquet(
user_ip string,
username string,
user_time string,
request_url string,
request_state string,
request_port string,
limited string,
des_url string,
brower string,
brower_limit string,
to_url string
)
row format delimited fields terminated by '\t'
stored as PARQUET;

insert into table page_view_parquet select * from page_view;

After creating the tables, let's take a look at the size of the same 100,000 rows stored in each file storage format:

# Check the file sizes for the three formats:
dfs -du -h /user/hive/warehouse/db_hive.db/page_view;
34.0 M  /user/hive/warehouse/db_hive.db/page_view/page_view.log

dfs -du -h /user/hive/warehouse/db_hive.db/page_view_orc;
125.4 K  /user/hive/warehouse/db_hive.db/page_view_orc/000000_0

dfs -du -h /user/hive/warehouse/db_hive.db/page_view_parquet;
34.3 M  /user/hive/warehouse/db_hive.db/page_view_parquet/000000_0

Obviously, the file stored in ORC format is much smaller than the Text file, which confirms the chart above. As for why the Parquet file is about the same size as the Text file: the dataset may simply be too small, and another likely factor is that ORC compresses with ZLIB by default while Hive typically writes Parquet data uncompressed unless a codec is configured. In general, Parquet will occupy less space than the Text format for the same data.
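
To rule compression in or out as the cause, we could recreate the Parquet table with an explicit codec and compare sizes again. A minimal sketch, assuming a Hive version that honors the parquet.compression table property (this table was not part of the measured test):

create table page_view_parquet_snappy
stored as PARQUET
tblproperties ("parquet.compression"="SNAPPY")   -- SNAPPY / GZIP / UNCOMPRESSED
as select * from page_view;

dfs -du -h /user/hive/warehouse/db_hive.db/page_view_parquet_snappy;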

Next, let's write query statements and compare their query speed:

# Test the query speed of the three storage formats
select user_time,count(*) cnt from page_view group by user_time order by cnt desc limit 20;
Time taken: 73.15 seconds, Fetched: 1 row(s)

select user_time,count(*) cnt from page_view_orc group by user_time order by cnt desc limit 20;
Time taken: 57.091 seconds, Fetched: 1 row(s)

select user_time,count(*) cnt from page_view_parquet group by user_time order by cnt desc limit 20;
Time taken: 55.526 seconds, Fetched: 1 row(s)

select count(*) from page_view;
Time taken: 25.124 seconds, Fetched: 1 row(s)

select count(*) from page_view_orc;
Time taken: 25.759 seconds, Fetched: 1 row(s)

select count(*) from page_view_parquet;
Time taken: 24.998 seconds, Fetched: 1 row(s)

From the timings above, when executing the more complex query the ORC and Parquet formats are noticeably faster than the Text format, while the simple count(*) takes roughly the same time in all three.

From the above results, we can conclude that the ORC and Parquet storage formats read and write data faster than the Text format, and in most cases enterprises generally use one of these two formats.
At this point we might ask: why do these two storage formats perform better, and when should they be used? Briefly: in many cases our query statements use a where clause to filter on a column. With a row-oriented format (TextFile), Hive must read each whole row, locate the relevant column within it, and then filter; with the columnar ORC and Parquet formats, it can go directly to the filtered column and skip the rest, so performance improves substantially.
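
As a minimal sketch of such a query against the ORC table from the test above: hive.optimize.ppd and hive.optimize.index.filter are standard Hive settings (the former enables predicate pushdown, the latter lets ORC use its lightweight indexes to skip row groups), and the filter value is made up for illustration:

set hive.optimize.ppd=true;             -- predicate pushdown
set hive.optimize.index.filter=true;    -- use ORC's lightweight indexes to skip row groups

select user_time, request_url
from page_view_orc
where user_ip = '192.168.1.1';          -- hypothetical filter value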

Finally, let's talk about the compression formats supported by ORC. ORC supports three compression options: ZLIB, SNAPPY, and NONE, the last one being no compression.

Here we can also test the actual compression ratios of ZLIB and SNAPPY:

# Test compression:

create table page_view_orc_snappy(
user_ip string,
username string,
user_time string,
request_url string,
request_state string,
request_port string,
limited string,
des_url string,
brower string,
brower_limit string,
to_url string
)
row format delimited fields terminated by '\t'
stored as ORC tblproperties ("orc.compress"="SNAPPY");  # SNAPPY must be written in uppercase

insert into table page_view_orc_snappy select * from page_view;     # executes three MapReduce steps: select, conversion to ORC format, Snappy compression
Time taken: 17.879 seconds

select count(*) from page_view_orc_snappy;
Time taken: 26.956 seconds, Fetched: 1 row(s)

dfs -du -h /user/hive/warehouse/db_hive.db/page_view_orc_snappy;
329.4 K  /user/hive/warehouse/db_hive.db/page_view_orc_snappy/000000_0
# because ORC's default codec is ZLIB, whose compression ratio is higher than Snappy's

Here we created the table with the Snappy compression format, and the same data occupies 329.4 K, while above we saw that the ORC table with the default compression occupies only 125.4 K. Why does the file get bigger after compression? Many people find this strange. In fact, ORC's default compression format is ZLIB, so this result simply shows that ZLIB's compression ratio is higher than Snappy's.
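
To confirm which codec an ORC file actually uses, you can dump its metadata with the ORC file dump utility that ships with Hive (hive --orcfiledump, available since Hive 0.11; the exact output wording varies by version):

hive --orcfiledump /user/hive/warehouse/db_hive.db/page_view_orc/000000_0
# the file metadata should report: Compression: ZLIB
hive --orcfiledump /user/hive/warehouse/db_hive.db/page_view_orc_snappy/000000_0
# the file metadata should report: Compression: SNAPPY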

In summary, we can conclude that when dealing with larger datasets and more complex queries, storing the data in ORC or Parquet format with compression can improve performance considerably.
