Hive series: performance tuning

Preface

Hive accounts for a large share of offline big data development work, so proficiency in Hive tuning is a basic requirement for every big data practitioner.

Table of Contents

  1. SQL optimization
  2. Impact of data block size on performance
  3. JOIN optimization
  4. Storage format impact on performance
  5. Partitioned tables and bucketed tables
  6. Engine

SQL optimization

  • Use of with

The with clause (a common table expression, or CTE) gives a subquery a name, so the rest of the same statement can reference its result directly instead of repeating the subquery.

-- Common ways to use with
-- routine style
with  a1 as  ( select  *  from  a where  id between  1 and 20 )  
select  * from  a1;

-- from style
with  a1 as  ( select  *  from  a where  id between  1 and 20 )
from  a1   
select  *;

-- chaining CTEs
with  c as  (  select * from  b where id between  5 and 10 ), -- chained style: data flows a -> b -> c; b and c are then available to the final query
b as  (  select * from  a where id between  1 and 20 ) 
select  *  from  c ;

-- union example   
with a1 as (select id,name from a where id between  5 and 10),  
b1 as (select id,name from b where id between  25 and 30)
select * from  a1  union  all  select  *  from  b1;

-- insert example
--create  table  b like a; -- creates an empty table with the same structure
create table b as select * from a where 0 = 1; -- creates an empty table; if a is partitioned or bucketed, that information is lost
with  a1 as (select  id, name from a where  id between  1 and 20)
from  a1
insert  overwrite  table  b
select  *;

-- ctas example (with combined with the create table as select syntax)
create  table b as
with  a1  as  (select  id,name from a where id between  1 and 20)
select  *  from  a1;

-- view example
create  view  v1  as
with  a1  as  (select  id,name from a where id between  1 and 20)
select  *  from  a1;

 
-- view example, name collision
create  view  v1  as
with a1 as (select id,name from a where id between 5 and 10)
select  *  from  a1;
with b1 as (select id,name from a where id between 1 and 20)
select  *  from  v1;

  • from syntax usage
  • count(distinct ...) versus count(1) over a group by subquery
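
The second bullet is usually understood as the rewrite below. This is a minimal sketch, assuming a hypothetical table user_log with a user_id column (neither appears in the original post). A count(distinct ...) without a group by is typically funneled through a single reducer, while the group by subquery lets many reducers deduplicate in parallel before the final count.

```sql
-- Hypothetical table: user_log(user_id bigint, ...)

-- Form 1: the distinct values are typically counted by a single reducer
select count(distinct user_id)
from user_log;

-- Form 2: deduplicate with group by first, then count the groups.
-- The null filter keeps the result equal to Form 1, because
-- count(distinct ...) ignores nulls while group by emits a null group.
select count(1)
from (
  select user_id
  from user_log
  where user_id is not null
  group by user_id
) t;
```

Form 2 adds an extra stage, so on small tables it can be slower; it is worth comparing both forms on real data before adopting one.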

Data block size optimization

TODO

JOIN optimization

Map-side join

A map join applies when one of the two joined tables is relatively small and the other is very large. The small table is loaded into memory, and the large table is then scanned in the map phase: for each row read from the large table, Hive looks up the matching rows in the in-memory small table and emits the joined result. The join finishes entirely on the map side with no reduce phase, so there is no shuffle, which is its main advantage. In practice it is enabled like this:

set hive.auto.convert.join=true;
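
As an illustration of how this setting is typically used, here is a minimal sketch. The tables orders and dim_city are hypothetical, and the threshold values shown are the documented defaults rather than recommendations.

```sql
-- Let Hive convert eligible joins into map-side joins automatically
set hive.auto.convert.join=true;
-- Size limit in bytes below which a table is treated as "small"
-- (used when the conversion goes through a conditional task)
set hive.mapjoin.smalltable.filesize=25000000;
-- Size limit in bytes used when the conversion is done without a
-- conditional task (hive.auto.convert.join.noconditionaltask=true)
set hive.auto.convert.join.noconditionaltask.size=10000000;

-- Hypothetical tables: orders is large, dim_city is small
explain
select o.order_id, c.city_name
from orders o
join dim_city c on o.city_id = c.city_id;
```

If the conversion took place, the plan printed by explain should contain a Map Join Operator instead of a join in the reduce stage.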

Common join

A common join is also called a shuffle join or reduce-side join. It is the usual case when the two tables are of similar size and neither is small enough to be held in memory. The data is first split on the map side, with one block handled by one map task. A shuffle then sends rows with the same join key to the same reducer, and the reducer combines them one by one. The drawback is that this path is exposed to data skew, which can hurt performance badly; data skew will be covered in a later post.
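
To see a common join in a plan, one option is to switch off the automatic map-join conversion and inspect the output of explain. This is a sketch only; it reuses the tables a and b (with columns id and name) from the with examples earlier.

```sql
-- Disable automatic map-join conversion so the join runs on the reduce side
set hive.auto.convert.join=false;

explain
select a.id, a.name, b.name
from a
join b on a.id = b.id;
```

On the MapReduce engine the join should then appear as a Join Operator inside the Reduce Operator Tree, which confirms that a shuffle is involved.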

SMBJoin

SMB stands for sort merge bucket: both tables are bucketed and sorted on the join key, and the join is then carried out by merging the corresponding buckets. Bucketing is a Hive technique similar to partitioning: rows are hashed on a key, and rows with the same hash value go into the same bucket. When both tables are bucketed before they are joined, performance improves considerably, because a small part of table1 only has to be joined with the matching small part of table2. The joins are all equi-joins, and since rows with the same key sit in the same bucket, far fewer irrelevant rows are scanned during the join.

set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
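
These settings only help if both tables are bucketed and sorted on the join key. A minimal sketch of such a layout follows; the table names t1_smb and t2_smb and the bucket count of 16 are hypothetical, and the source tables a and b are the ones used in the with examples.

```sql
-- On Hive versions before 2.0, also run the two lines below so inserts
-- honor the bucket and sort definition (later versions always enforce it):
-- set hive.enforce.bucketing=true;
-- set hive.enforce.sorting=true;

-- Both tables use the same bucketing key, sort key and bucket count
create table t1_smb (id int, name string)
clustered by (id) sorted by (id) into 16 buckets;

create table t2_smb (id int, name string)
clustered by (id) sorted by (id) into 16 buckets;

insert overwrite table t1_smb select id, name from a;
insert overwrite table t2_smb select id, name from b;

-- Equi-join on the bucketing key, eligible for a sort merge bucket join
select t1.id, t1.name, t2.name
from t1_smb t1
join t2_smb t2 on t1.id = t2.id;
```

Running explain on the final query is a reasonable way to check whether the optimizer actually chose the sort merge bucket path.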

Reference documents

hive join optimization

Storage format optimization

TODO

Partitioned tables and bucketed tables

TODO

Engine

TODO

Using the Spark engine with Hive: https://www.cnblogs.com/lyy-blog/p/9598433.html
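
For reference, the engine is selected per session with a single property. This is a sketch that assumes the chosen engine is already installed and configured for the cluster.

```sql
-- Switch the execution engine for the current session
-- (accepted values: mr, tez, spark)
set hive.execution.engine=spark;

-- Print the value currently in effect
set hive.execution.engine;
```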

Origin blog.csdn.net/dbc_zt/article/details/110229797