Hive Performance Tuning
Preface
Hive accounts for a large share of offline (batch) big data development, so proficiency in Hive tuning is a basic requirement for every big data practitioner.
Table of Contents
- SQL optimization
- Impact of data block size on performance
- JOIN optimization
- Storage format impact on performance
- Partitioned tables and bucketed tables
- Execution engine
SQL optimization
- Use of with
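The with syntax (a common table expression, CTE) gives a subquery a name so that the rest of the statement can query it directly. Note that Hive does not necessarily hold the result in memory: by default the CTE is expanded inline where it is referenced.
-- common ways to use with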
-- routine style
with a1 as ( select * from a where id between 1 and 20 )
select * from a1;
-- from style
with a1 as ( select * from a where id between 1 and 20 )
from a1
select *;
-- chaining CTEs
with c as ( select * from b where id between 5 and 10 ), -- chained style: data flows a -> b -> c; b and c are intermediate results available to the query
b as ( select * from a where id between 1 and 20 )
select * from c ;
-- union example
with a1 as (select id,name from a where id between 5 and 10),
b1 as (select id,name from b where id between 25 and 30)
select * from a1 union all select * from b1;
-- insert example
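-- create table b like a; -- creates an empty table with the same structure
create table b as select * from a where 0 = 1; -- creates an empty table; if a is partitioned or bucketed, the partition and bucket definitions are lost
with a1 as (select id, name from a where id between 1 and 20)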
from a1
insert overwrite table b
select *;
-- ctas example (with combined with the create table as select syntax)
create table b as
with a1 as (select id,name from a where id between 1 and 20)
select * from a1;
-- view example
create view v1 as
with a1 as (select id,name from a where id between 1 and 20)
select * from a1;
-- view example, name collision
create view v1 as
with a1 as (select id,name from a where id between 5 and 10)
select * from a1;
with a1 as (select id,name from a where id between 1 and 20) -- same CTE name as inside v1; the view still uses its own a1
select * from v1;
- Use of the from-first syntax
- count(distinct ...) versus count over a group by subquery
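The two items above can be sketched as follows. This is a minimal illustration, not a benchmark: it reuses the tables a(id, name) and b from the earlier examples, and the table c is a hypothetical table with the same columns. The from-first form lets one scan of a feed several inserts. The group by rewrite is a commonly used alternative to count(distinct ...), which in a query without group by tends to funnel the de-duplication through a single reducer; newer Hive versions may apply a similar rewrite on their own.

```sql
-- from-first syntax: read a once, write two tables (c is hypothetical)
from a
insert overwrite table b select id, name where id between 1 and 20
insert overwrite table c select id, name where id between 21 and 40;

-- direct form: distinct values are typically de-duplicated in one reducer
select count(distinct id) from a;

-- rewritten form: the inner group by spreads de-duplication across reducers,
-- the outer query only counts the already unique rows
select count(1)
from (select id from a group by id) t;
```

Both counting queries return the same number, except that they differ in how the work is distributed; whether the rewrite is faster depends on data volume and the Hive version in use.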
Data block size optimization
TODO
JOIN optimization
Map-side join
A map join applies when a relatively small table is joined to a very large one. The small table is loaded entirely into memory, and the large table is then scanned by the map tasks: each row read from the large table is checked against the in-memory small table and joined when it matches. The whole join happens in the map phase, with no reduce phase and therefore no shuffle, which is its main advantage. In practice it is enabled like this:
set hive.auto.convert.join=true;
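A slightly fuller sketch of how this setting is typically used is shown below. The tables big_orders and small_dim and their columns are hypothetical; hive.mapjoin.smalltable.filesize is the size threshold in bytes below which Hive treats a table as small, and the value shown is its documented default.

```sql
-- let Hive convert eligible joins to map joins automatically
set hive.auto.convert.join=true;
-- tables below this size (in bytes) are candidates for the in-memory side
set hive.mapjoin.smalltable.filesize=25000000;

-- hypothetical tables: big_orders is large, small_dim fits in memory
select o.id, d.name
from big_orders o
join small_dim d on o.dim_id = d.id;

-- inspect the plan; a converted join shows up as a Map Join Operator
explain
select o.id, d.name
from big_orders o
join small_dim d on o.dim_id = d.id;
```

Checking the explain output is the simplest way to confirm that the conversion actually happened for a given query.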
Common join
A common join is also called a shuffle join or reduce-side join. It is used when the two tables are of comparable size, though not especially large. The input is split on the map side, with one map task per block; the map output is then shuffled so that rows with the same join key reach the same reducer, where the matching rows are joined. The drawback is that the shuffle exposes the job to data skew, which can hurt performance badly; data skew is discussed later.
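To see a common join in a plan, one option is to switch off automatic map join conversion for the session and look at the explain output. This is only a sketch for comparison purposes; table_x and table_y and their columns are hypothetical names.

```sql
-- disable map join conversion so the join runs on the reduce side
set hive.auto.convert.join=false;

-- hypothetical tables of similar size
explain
select x.id, x.val, y.val
from table_x x
join table_y y on x.id = y.id;
-- the plan should now contain a Join Operator in the reduce phase
-- instead of a Map Join Operator in the map phase
```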
SMB join
SMB stands for sort merge bucket: the data is sorted, merged, and placed into the corresponding bucket. Bucketing is a Hive technique similar to partitioning: rows are hashed on a key, and rows with the same hash value go into the same bucket. If both tables are bucketed on the join key before they are joined, the join becomes much cheaper, because a small part of table1 only has to be joined with the corresponding small part of table2. These joins are all equi-joins. Since rows with the same key sit in the same bucket, the scanning of irrelevant rows during the join is greatly reduced.
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
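The settings above only help if both tables are bucketed and sorted on the join key. A minimal sketch of such a setup follows; the table names, columns, and the bucket count of 8 are illustrative assumptions, and orders and users stand for existing unbucketed source tables.

```sql
-- both tables are bucketed and sorted on the join key, with the same bucket count
create table orders_bucketed (id int, amount double)
clustered by (id) sorted by (id asc) into 8 buckets;

create table users_bucketed (id int, name string)
clustered by (id) sorted by (id asc) into 8 buckets;

-- on Hive 1.x, also set hive.enforce.bucketing=true and hive.enforce.sorting=true
-- before loading; Hive 2.x and later always enforce them
insert overwrite table orders_bucketed select id, amount from orders;
insert overwrite table users_bucketed select id, name from users;

-- with the SMB settings enabled, matching buckets are merged pairwise
select o.id, u.name, o.amount
from orders_bucketed o
join users_bucketed u on o.id = u.id;
```

The key design point is that the bucketing column, the sort column, and the join column are all the same, so bucket n of one table only ever needs to be compared with bucket n of the other.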
Storage format optimization
TODO
Partitioned tables and bucketed tables
TODO
Execution engine
TODO
Hive on the Spark engine: https://www.cnblogs.com/lyy-blog/p/9598433.html