HiveSQL optimization

1. HiveSQL optimization in common scenarios

1. Column pruning

Since the underlying storage of most data warehouses uses columnar formats such as ORC/Parquet, column pruning reduces the number of fields that have to be scanned.
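For example, read only the columns that are actually needed instead of select * (table and column names below are illustrative):

```sql
-- column pruning: only uid and pv are read from the columnar files
select uid, pv
from user_access_log
where dt = '2019-01-01';
```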

2. Partition pruning

That is, add partition conditions when querying tables. Data warehouses usually hold enterprise-scale data with very large volumes, so most tables are partitioned to speed up statistics; partition pruning is therefore essential.

Filter with where as early as possible to reduce the amount of data processed downstream.

Data filtering

Reduce the amount of data in the two tables before joining them:

select t1.dim1,t1.measure1,t2.measure2
from 
     (select dim1,measure1 from a where dt = '2019-01-01') t1
join (select dim1,measure2 from b where dt = '2019-01-01') t2 on t1.dim1 = t2.dim1

 

Deduplication optimization

Using distinct to deduplicate is generally discouraged, unless the data volume is small enough that a single reducer can handle it without pressure. For large tables, use group by to deduplicate.
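A minimal sketch of the rewrite, with an illustrative table name:

```sql
-- instead of: select distinct user_id from access_log;
-- group by spreads the deduplication work across multiple reducers
select user_id
from access_log
group by user_id;
```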


Replace order by with sort by. order by sorts the result globally on a field, which forces all map-side data into a single reducer;
sort by starts multiple reducers as needed and guarantees that each reducer's output is locally ordered.
To control which reducer each map-side key is sent to, sort by is usually combined with distribute by;
without distribute by, the map-side data is distributed to the reducers randomly.
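A small sketch reusing the log table from the later example (event_time is an assumed column): rows are routed to reducers by uid, and each reducer's output is ordered by event_time.

```sql
select uid, event_type, event_time
from calendar_record_log
where pt_date = 20190201
distribute by uid   -- route all rows with the same uid to the same reducer
sort by event_time; -- sort within each reducer
```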

Put the small table in front, the big table behind

In the reduce phase, the table on the left of the join operator is loaded into memory, so putting the small table first improves efficiency.
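An illustrative join order under this rule (table names are made up): the small dimension table is written first, the large fact table last.

```sql
select d.dim1, f.measure1
from small_dim_table d   -- small table first: buffered in memory on the reducer
join big_fact_table f    -- large table last: streamed
  on d.dim1 = f.dim1;
```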

Use map join

Map join is especially suitable for joining a large table with a small table, and for non-equi joins. Hive completes the join of the build table and the probe table directly on the map side,
eliminating the reduce phase, which is very efficient.

set hive.auto.convert.join=true; -- automatically convert eligible joins to map joins; enabled by default
set hive.mapjoin.smalltable.filesize=25000000; -- size threshold under which a table is treated as small enough for map join; default 25M

-- run the query
select t1.a, t1.b
from table1 t1
join table2 t2
  on (t1.a = t2.a and t1.ftime = 20110802);
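If automatic conversion is unavailable or disabled, the map join can also be requested explicitly with the MAPJOIN hint (assuming t2 is the small table):

```sql
select /*+ MAPJOIN(t2) */ t1.a, t1.b
from table1 t1
join table2 t2
  on t1.a = t2.a;
```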

2. SQL optimization under data skew

1) Null or meaningless values in the key

    When the fact table is log data, some fields are often not recorded, and we set them to null as appropriate.
    If there are many such missing values, the null keys become very concentrated during the join and slow the job down.

    If the null rows are not needed, write a where clause in advance to filter them out.
    If they need to be kept, scatter the null keys randomly, for example by replacing records whose user ID is null with a random negative value:
```sql

select a.uid,a.event_type,b.nickname,b.age    
from (    
select    
  (case when uid is null then cast(rand()*-10240 as int) else uid end) as uid,    
  event_type from calendar_record_log    
where pt_date >= 20190201    
) a left outer join (    
select uid,nickname,age from user_info where status = 4    
) b on a.uid = b.uid;
```

2) Data skew caused by joining keys of different data types

If the two join keys have different data types, they need to be converted to the same type; by default, hashing to the reducers is done on the int value, so all records whose keys are not of int type end up on a single reducer.
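For example, if user_id is numeric in the log table but a string in the user table, casting the numeric side keeps the hashing consistent (access_log is an illustrative table name; user_info appears in the earlier example):

```sql
select a.user_id, b.nickname
from access_log a
left outer join user_info b
  on cast(a.user_id as string) = b.user_id;  -- hash on the string value on both sides
```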

3) Some values of the join key account for far more data than others

The random-number (salting) aggregation method can be used: first aggregate by the key plus a random number, then aggregate the partial results by the key alone.

select key,sum(pv) as pv
from
      (
          select key,round(rand()*1000) as rnd,sum(pv) as pv
          from a 
          group by key,round(rand()*1000)
      ) t
group by key

4) count(distinct) produces data skew

select count(distinct user_id) from a;
can be replaced with
select count(1) from (select user_id from a group by user_id) t;

Hive optimization

If the startup and initialization time of a map or reduce task is much longer than its actual processing time, a lot of resources are wasted.

Map stage optimization
Make each map task process an appropriate amount of data.

Map parameter settings
mapred.min.split.size: the minimum split size; the default is 1KB.
mapred.max.split.size: the maximum split size; the default is 256M.
The number of maps can be tuned through these parameters: decreasing max increases the number of maps, while increasing min decreases the number of maps.
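For instance, lowering the maximum split size produces more, smaller splits and therefore more map tasks (the value below is only illustrative):

```sql
-- cut the maximum split from the default 256M down to 64M to get more maps
set mapred.max.split.size=64000000;
```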

Map splitting
Suppose the input directory contains a single file a of size 780M.
With the default map parameters, a is split into 7 blocks (six of 128M and one of 12M), producing 7 maps.

Suppose the input directory contains three files a, b, and c of sizes 10M, 20M, and 130M.
Hadoop splits them into 4 blocks (10M, 20M, 128M, and 2M), producing 4 maps.

Solution
Merge small files before the map phase to reduce the number of maps:
set mapred.max.split.size=128000000; -- maximum size of a split
set mapred.min.split.size.per.node=100000000; -- minimum split size handled per node
set mapred.min.split.size.per.rack=100000000; -- minimum split size handled per rack
*** The first three parameters determine the size of the merged file blocks: blocks larger than 128M are split at 128M; blocks between 100M and 128M are split at 100M; pieces smaller than 100M (including small files and the remainders left over from splitting large files) are merged.

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; -- combine input files
*** This parameter merges small files before execution.

Reduce stage
If the number of reducers is set too high, many small files will be generated. Relevant parameters:
hive.exec.reducers.bytes.per.reducer (the amount of data processed by each reduce task; the default is 1000^3 = 1G)
hive.exec.reducers.max (the maximum number of reducers for each job; the default is 999)
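For example, either lower the per-reducer data volume so that more reducers are started, or fix the reducer count directly (values are illustrative):

```sql
-- each reducer handles about 500M instead of the default ~1G
set hive.exec.reducers.bytes.per.reducer=500000000;
-- or force a fixed number of reduce tasks for the job
set mapred.reduce.tasks=15;
```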

Merge small files
Whether to merge map output files: hive.merge.mapfiles=true (default value is true)
Whether to merge reduce output files: hive.merge.mapredfiles=false (default value is false)
Size of the merged files: hive.merge.size.per.task=256*1000*1000 (default value is 256000000)
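A typical combination of these settings (values are illustrative):

```sql
set hive.merge.mapfiles=true;            -- merge small files produced by map-only jobs
set hive.merge.mapredfiles=true;         -- also merge the output of map-reduce jobs
set hive.merge.size.per.task=256000000;  -- target size of the merged files (~256M)
```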
