Converting data storage from text to parquet, and the resulting OOM problem

1. Effect of converting the data to parquet

 
table1 is a table stored in textfile format. Its partition 20161122 is about 400M before conversion. It is converted into three partitions of the parquet-format table table1_parquet (20161122, 20161123 and 20161124), using no compression, snappy compression and gzip compression respectively.
(1) insert into table1_parquet partition (dt=20161122) select a,b,c from table1 where dt=20161122;
(2) set parquet.compression=snappy;
insert into table1_parquet partition (dt=20161123) select a,b,c from table1 where dt=20161122;
(3) set parquet.compression=gzip;
insert into table1_parquet partition (dt=20161124) select a,b,c from table1 where dt=20161122;
Conversion results (file size / space used including 3x HDFS replication / path / how the conversion was run, followed by three query runs):
(1) 416.2 M  1.2 G    /table1_parquet/dt=20161122  hive -e (no compression set)
Query times: 29s, 28s, 39s
(2) 133.6 M  400.7 M  /table1_parquet/dt=20161123  hive -e "set parquet.compression=snappy"
Query times: 35s, 28s, 39s
(3) 69.9 M   209.6 M  /table1_parquet/dt=20161124  hive -e "set parquet.compression=gzip"
Query times: 31s, 35s, 38s
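A minimal way to re-check these sizes from the Hive CLI, assuming the warehouse paths listed above, is something like:

-- the Hive CLI forwards dfs commands to HDFS; each line prints the raw file
-- size and the space consumed including replication
dfs -du -h /table1_parquet/dt=20161122;
dfs -du -h /table1_parquet/dt=20161123;
dfs -du -h /table1_parquet/dt=20161124;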
 
(1) Without compression, the parquet data is 416.2M, essentially unchanged from the original 400M.
(2) After converting to parquet with snappy compression, the size is 133.6M, about one third of the original.
(3) After converting to parquet with gzip compression, the size is 69.9M, about 17% of the original (a saving of roughly 83%).
The query time is basically unchanged in all three cases.
For reference, the commonly quoted figure is a space saving of about 75% after converting to parquet.
What are the advantages of columnar storage compared to row storage?
  1. Data that does not satisfy the filter conditions can be skipped, so only the required data is read and the amount of I/O is reduced (predicate pushdown).
  2. Compression encoding can reduce disk storage space. Since all values in a column share the same data type, more efficient encodings (such as run-length encoding and delta encoding) can be used to save further space.
  3. Only the required columns are read, vectorized operations are supported, and scan performance is better. (A small illustration follows this list.)
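As a rough illustration of points 1 and 3, using the table created above (the literal 'x' is made up and assumes b is a string column): a query like the following only needs to materialize column a, and the filters on dt and b can be evaluated inside the parquet scan.

-- only column a is read from the parquet files; row groups whose statistics
-- already rule out b='x' can be skipped entirely
select a from table1_parquet where dt=20161124 and b='x';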

Note: at first the data size did not shrink after the conversion. It turned out that the compression parameter had been set incorrectly: an Impala parameter was used instead of the Hive one.
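For reference, a minimal sketch of the distinction, assuming the mix-up was between Hive's parquet.compression setting and Impala's COMPRESSION_CODEC query option:

-- in Hive, parquet compression for the following INSERT is controlled by:
set parquet.compression=gzip;
-- the Impala equivalent is a query option, which Hive does not act on:
-- set COMPRESSION_CODEC=gzip;   (impala-shell only)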

 

2. OOM problem caused by the conversion to parquet

 

Problem: the historical data of table table1 was converted into the parquet table table1_parquet (the table is partitioned by day; each partition is about 10M before conversion and about 1M after conversion). A join over the data of the past 10 days then fails with a memory overflow (OOM) on the map side.

join operation:

 

Current time: 20161130
Query 1: Query the full amount of historical data (the query is normal)
select count(*) from(
select a.* from tmp.table1_parquet a
left join
(select * from tmp.table1_parquet where dt>=20160601) b
on a.logtime=b.logtime and a.order_id=b.order_id
where a.dt>=20160601
) as t;

Query 2: Query data for about 10 days (map side OOM)
select count(*) from(
select a.* from tmp.table1_parquet a
left join
(select * from tmp.table1_parquet where dt>=20161118) b
on a.logtime=b.logtime and a.order_id=b.order_id
where a.dt>=20161118
) as t;

 Query 3: Query the data of the past 5 days (the query is normal)
select count(*) from(
select a.* from tmp.table1_parquet a
left join
(select * from tmp.table1_parquet where dt>=20161128) b
on a.logtime=b.logtime and a.order_id=b.order_id
where a.dt>=20161128
) as t;
 

 

Query 2 exception information: [exception output not preserved]

Tracking this down: by default, Hive converts a join into a map-side join when the small table is below the threshold (about 20M here). The past ten days of data scanned by query 2 are about 10M on disk (parquet with gzip compression), so Hive chose a map-side join; but because those 10M are gzip-compressed parquet, the data is about 100M once decompressed, and the map-side join runs out of memory.


Solution:

Lower the parameter hive.auto.convert.join.noconditionaltask.size from 20M to 5M; the problem is solved.

(The parameter hive.mapjoin.smalltable.filesize was not found in this environment.)
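A minimal sketch of the fix described above (the value is given in bytes, so 5000000 is roughly 5M):

-- lower the threshold so the ~10M gzip-compressed parquet input no longer
-- qualifies for an automatic map-side join
set hive.auto.convert.join.noconditionaltask.size=5000000;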

 

Appendix 1: The parameters related to map-side join are as follows:

  • hive.auto.convert.join : whether to automatically convert a join to a mapjoin
  • hive.mapjoin.smalltable.filesize : the maximum file size of a small table; the default is 25000000, i.e. 25M
  • hive.auto.convert.join.noconditionaltask : whether to merge multiple mapjoins into one
  • hive.auto.convert.join.noconditionaltask.size : the maximum total file size of all small tables when multiple mapjoins are merged into one
Explanation: hive.auto.convert.join.noconditionaltask.size indicates the total size of tables that can be converted to a MapJoin. For example, if there are two tables A and B and each is smaller than this value, each is converted into a MapJoin on its own; if the sum of their sizes is also smaller than this value, the two MapJoins are merged into one.

For example, a large table is joined in sequence with 3 small tables a (10M), b (8M) and c (12M). Depending on the value of hive.auto.convert.join.noconditionaltask.size (a sketch of the corresponding settings follows this list):

1. If it is less than 18M, the mapjoins cannot be merged, so 3 separate mapjoins must be executed;
2. If it is greater than 18M and less than 30M, the mapjoins of tables a and b can be merged, so only 2 mapjoins are needed;
3. If it is greater than 30M, all 3 mapjoins can be merged into one.
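Sticking with that example, a hedged sketch of the settings involved (sizes in bytes; a, b and c are the hypothetical small tables above):

-- enable automatic conversion to mapjoin and the merging of multiple mapjoins
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
-- at ~20M: a (10M) + b (8M) = 18M fits, so their mapjoins are merged; c (12M) stays separate
set hive.auto.convert.join.noconditionaltask.size=20000000;
-- at ~31M: a + b + c = 30M fits, so all three mapjoins can be merged into one
-- set hive.auto.convert.join.noconditionaltask.size=31000000;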


 

 

Appendix 2: Predicate Pushdown
A predicate generally refers to the filter conditions in the WHERE clause.

Projection pushdown (map pushdown) and predicate pushdown mean that the execution engine pushes projections and predicates down into the storage format, so that operations can be optimized as close to the data as possible. The result is improved time and space efficiency, since columns not relevant to the query are discarded and never have to be handed to the execution engine.

This is very effective for columnar storage, because pushdown lets the storage format skip entire groups of columns that are not relevant to the query, which a columnar format can do very efficiently.

Next we will look at how to use Pushdown in a Hadoop pipeline.

Before coding, you need to enable the projection/predicate pushdown that Parquet provides out of the box in Hive and Pig. In a MapReduce program there are some manual steps you have to add to the Driver code to initiate pushdown. The relevant settings are shown below.
Hive --> Predicates [set hive.optimize.ppd = true; ]
Pig --> Projection [to be added later]
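As a small illustration on the Hive side, reusing the parquet table from section 1 (the literal 'x' is made up): with predicate pushdown enabled, the filters below can be evaluated inside the parquet reader instead of after the rows reach the execution engine.

set hive.optimize.ppd=true;   -- enable predicate pushdown in Hive
-- the dt and b conditions are pushed into the scan of table1_parquet,
-- so non-matching data is skipped before it is handed to the engine
select a, b from table1_parquet where dt>=20161118 and b='x';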
