Read, write, and query performance test of several common Hive storage formats (ORC, Parquet, SequenceFile, RCFile, Avro)

1. Test background

At work, I need to load historical app logs into Hive in a structured form so that they can be queried. Because the data volume is large, the data has to be stored compressed. I therefore ran write, read, and OLAP-style query tests on the storage formats that Hive officially provides, in order to find the one that performs best.


2. Overview of test methods

  1. Data source: a 100 GB sample of production data. The raw logs are textfiles in standard JSON.
  2. Test platform: the company's Ambari test platform, with 100 GB of physical memory.
  3. Test method: a script loads the textfiles into Hive to form one large table. (Note: the JSON is parsed with the org.apache.hive.hcatalog.data.JsonSerDe class from the hive-hcatalog-core.jar that ships with HDP Hive.)
  4. Partitioned tables in each storage format are then created from the large table.
  5. Core component HDP version selection


3. Practical operation

1. Create a large table js_data

CREATE TABLE IF NOT EXISTS data_ysz.js_data (referer STRING,ip STRING,articleId STRING,catalogCode STRING,userAgent STRING,sessionId STRING,title STRING,deviceId STRING,url STRING,visitTime STRING,catalogId STRING,atype STRING,domain STRING,action STRING,visitDate STRING) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
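If the Hive session cannot resolve the JsonSerDe class, one option is to register the jar in the session before running the CREATE TABLE statement. This is only a minimal sketch: the jar path below is an assumption based on a typical HDP layout and does not come from the original test, so adjust it to your own installation.

```sql
-- Assumed path (not from the original post); locate hive-hcatalog-core.jar on your own cluster.
ADD JAR /usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar;
```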

2. Load data into js_data

load data inpath '' into table js_data

3. Create partitioned tables in each storage format from the large table (in order: RCFile, ORC, SequenceFile, Parquet, Avro)

CREATE TABLE js_data_partitioned_rcfile(referer STRING,ip STRING,articleId STRING,catalogCode STRING,userAgent STRING,sessionId STRING,title STRING,deviceId STRING,url STRING,visitTime STRING,catalogId STRING,atype STRING,domain STRING,action STRING) PARTITIONED BY (visitDate STRING) STORED AS RCFILE;
CREATE TABLE js_data_partitioned_orc(referer STRING,ip STRING,articleId STRING,catalogCode STRING,userAgent STRING,sessionId STRING,title STRING,deviceId STRING,url STRING,visitTime STRING,catalogId STRING,atype STRING,domain STRING,action STRING) PARTITIONED BY (visitDate STRING) STORED AS ORC;
CREATE TABLE js_data_partitioned_sequencefile(referer STRING,ip STRING,articleId STRING,catalogCode STRING,userAgent STRING,sessionId STRING,title STRING,deviceId STRING,url STRING,visitTime STRING,catalogId STRING,atype STRING,domain STRING,action STRING) PARTITIONED BY (visitDate STRING) STORED AS SEQUENCEFILE;
CREATE TABLE js_data_partitioned_parquetfile(referer STRING,ip STRING,articleId STRING,catalogCode STRING,userAgent STRING,sessionId STRING,title STRING,deviceId STRING,url STRING,visitTime STRING,catalogId STRING,atype STRING,domain STRING,action STRING) PARTITIONED BY (visitDate STRING) STORED AS PARQUET;
CREATE TABLE js_data_partitioned_avrofile(referer STRING,ip STRING,articleId STRING,catalogCode STRING,userAgent STRING,sessionId STRING,title STRING,deviceId STRING,url STRING,visitTime STRING,catalogId STRING,atype STRING,domain STRING,action STRING) PARTITIONED BY (visitDate STRING) STORED AS AVRO;
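The statement used to fill these tables is not shown here. One common way to do it is a dynamic-partition insert from the large table, sketched below for the ORC table. The use of dynamic partitioning and the two SET options are assumptions about how the write step could be done, not a record of what was actually run; the same statement would be repeated with each of the other four table names.

```sql
-- Assumption: dynamic partitioning is used to populate the partitioned tables.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column (visitDate) must be the last column in the SELECT list.
INSERT OVERWRITE TABLE js_data_partitioned_orc PARTITION (visitDate)
SELECT referer, ip, articleId, catalogCode, userAgent, sessionId, title,
       deviceId, url, visitTime, catalogId, atype, domain, action,
       visitDate
FROM data_ysz.js_data;
```

The elapsed time Hive reports for a statement of this kind is one plausible way to obtain a per-format write time.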

4. Run the following SQL against each table in turn (replace table_name with the table under test)

select visitdate,count(*) as pv from table_name where action = '1' and domain = 'static.scms.sztv.com.cn' group by visitdate order by pv;
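As a concrete instance, the test query run against the ORC table would read as follows. Only the table name changes between runs, and it assumes the session's current database is the one that holds the partitioned tables.

```sql
-- Same test query, with the ORC table substituted for the table-name placeholder.
SELECT visitDate, COUNT(*) AS pv
FROM js_data_partitioned_orc
WHERE action = '1'
  AND domain = 'static.scms.sztv.com.cn'
GROUP BY visitDate
ORDER BY pv;
```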

4. Result data statistics

Performance test results

Storage format             ORC        SequenceFile   Parquet    RCFile     Avro
Size after compression     1.8 GB     67.0 GB        11 GB      63.8 GB    66.7 GB
Write time                 535.7 s    625.8 s        537.3 s    543.48 s   544.3 s
SQL query response time    19.63 s    184.07 s       24.22 s    88.5 s     281.65 s

5. Conclusion

1. Write time is essentially the same for all formats (roughly 536 s to 544 s), except for SequenceFile, which is noticeably slower at 625.8 s.

2. ORC has the best compression ratio: at 1.8 GB it takes more than 50 times less disk space than the 100 GB textfile. Parquet also compresses well (11 GB), while SequenceFile, RCFile, and Avro only shrink the data to about 64 to 67 GB.

3. In terms of SQL query speed, ORC (19.63 s) and Parquet (24.22 s) perform best, far ahead of RCFile (88.5 s), SequenceFile (184.07 s), and Avro (281.65 s).
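The ratios behind these three points can be worked out directly from the results table. The block below is a small worked calculation, taking the 100 GB sample size as the uncompressed baseline (an assumption, since the on-disk size of the textfile table was not measured separately).

```latex
% Compression ratio = raw size / stored size (raw textfile: 100 GB)
\begin{align*}
r_{\text{ORC}}          &= 100 / 1.8  \approx 55.6 \\
r_{\text{Parquet}}      &= 100 / 11   \approx 9.1  \\
r_{\text{RCFile}}       &= 100 / 63.8 \approx 1.57 \\
r_{\text{Avro}}         &= 100 / 66.7 \approx 1.50 \\
r_{\text{SequenceFile}} &= 100 / 67.0 \approx 1.49
\end{align*}

% Query time relative to ORC (19.63 s)
\begin{align*}
t_{\text{Parquet}} / t_{\text{ORC}}      &= 24.22 / 19.63  \approx 1.23  \\
t_{\text{RCFile}} / t_{\text{ORC}}       &= 88.5 / 19.63   \approx 4.51  \\
t_{\text{SequenceFile}} / t_{\text{ORC}} &= 184.07 / 19.63 \approx 9.38  \\
t_{\text{Avro}} / t_{\text{ORC}}         &= 281.65 / 19.63 \approx 14.35
\end{align*}
```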

Based on these results, I recommend storing the raw logs written to Hive in ORC or Parquet format, which is also consistent with current mainstream practice.
