Hive optimization

References:

http://shiyanjun.cn/archives/588.html
http://www.cnblogs.com/xd502djj/p/3799432.html
https://www.2cto.com/net/201708/668075.html
http://dacoolbaby.iteye.com/blog/1879002

Basic principles:

1: Filter data as early as possible to reduce the amount of data at each stage: add partition predicates when querying partitioned tables, and select only the columns that are actually needed.

select ... from A
join B on A.key = B.key
where A.userid > 10
  and B.userid < 10
  and A.dt = '20120417'
  and B.dt = '20120417';

should be rewritten as:

select ... from (
    select ... from A
    where dt = '20120417'
      and userid > 10
) a
join (
    select ... from B
    where dt = '20120417'
      and userid < 10
) b
on a.key = b.key;

2: Keep each operation as atomic as possible, and avoid packing complex logic into a single SQL statement.

Intermediate tables can be used to carry out complex logic step by step:

drop table if exists tmp_table_1;
create table if not exists tmp_table_1 as
select ...;

drop table if exists tmp_table_2;
create table if not exists tmp_table_2 as
select ...;

drop table if exists result_table;
create table if not exists result_table as
select ...;

drop table if exists tmp_table_1;
drop table if exists tmp_table_2;

 

 

3: Keep the number of jobs generated by a single SQL statement below 5 where possible.

 

4: Use mapjoin with care. It is generally applicable when the small table has fewer than 2,000 rows and is smaller than 1 MB (the limit can be raised somewhat). The small table should be placed on the left side of the join (currently many small tables in TCL are placed on the right side). Otherwise, the join consumes a great deal of disk and memory.
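As an illustration, a mapjoin can be requested explicitly with a hint (a minimal sketch; the table names small_table and big_table are assumptions, not from the original):

select /*+ MAPJOIN(s) */ s.key, b.value
from small_table s
join big_table b
on s.key = b.key;

With hive.auto.convert.join enabled, Hive can also apply this conversion automatically when the small table is below hive.mapjoin.smalltable.filesize.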

 

5: Before writing SQL, understand the characteristics of the data itself. If the query contains join or group by operations, watch out for data skew.

If data skew occurs, set the following:

set hive.exec.reducers.max = 200;
set mapred.reduce.tasks = 200; -- increase the number of reducers
set hive.groupby.mapaggr.checkinterval = 100000; -- the number of records per group by key above which the key is split; tune it to the actual data volume
set hive.groupby.skewindata = true; -- set to true if the group by stage is skewed
set hive.skewjoin.key = 100000; -- the number of records per join key above which the key is split; tune it to the actual data volume
set hive.optimize.skewjoin = true; -- set to true if the join stage is skewed
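A minimal sketch of applying these settings before a skewed aggregation (the table user_read_log and column userid are assumptions used only for illustration):

set hive.groupby.skewindata = true;
set mapred.reduce.tasks = 200;

select userid, count(1) as pv
from user_read_log
group by userid;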


Group By statement

  • Map-side partial aggregation:
    • Not every aggregation has to be completed on the reduce side; many can be partially aggregated on the map side first, with the final result computed on the reduce side.
    • Hash based.
    • Relevant parameters:
      • hive.map.aggr = true  whether to aggregate on the map side; the default is true
      • hive.groupby.mapaggr.checkinterval = 100000  the number of rows at which map-side aggregation is checked
  • Load balancing when data is skewed:
    • hive.groupby.skewindata = false
    • When this option is set to true, the generated query plan contains two MR jobs. In the first job, map output is distributed randomly to the reducers; each reducer performs a partial aggregation and emits its results, so rows with the same Group By key may land on different reducers, which balances the load. The second job distributes the pre-aggregated results to the reducers by the Group By key (guaranteeing that the same key reaches the same reducer) and completes the final aggregation.


The hive.groupby.skewindata variable

As the group by discussion above shows, this variable controls load balancing: when the data is skewed and it is set to true, Hive balances the load automatically.
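What the two-job plan does can also be sketched by hand with a random salt (illustrative only; the table some_table and the salt width of 10 are assumptions):

select key, sum(partial_cnt) as cnt
from (
    select key, count(1) as partial_cnt
    from some_table
    group by key, cast(rand() * 10 as int)  -- the salt spreads a hot key across up to 10 partial groups
) t
group by key;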

 

 

6: If a union all has more than 2 branches, or each branch carries a large amount of data, it should be split into multiple insert into statements. In actual testing this cut execution time by about 50%.

insert overwrite table tablename partition (dt = ...)
select ... from (
    select ... from A
    union all
    select ... from B
    union all
    select ... from C
) R
where ...;

 

can be rewritten as:

insert into table tablename partition (dt = ...)
select ... from A
where ...;

insert into table tablename partition (dt = ...)
select ... from B
where ...;

insert into table tablename partition (dt = ...)
select ... from C
where ...;

 

 

 

When Hive runs a job, the data is often skewed: the job gets stuck at 99% of the reduce phase, with the last 1% still not finished after several hours. This usually indicates data skew; the cause and remedy should be chosen according to the specific situation.

1. The join key is skewed, containing many null or abnormal values.

In this case, assign a random value to the outliers to spread the key.

For example:

select a.userid, a.name
from user_info a
join (
    select case when userid is null then cast(rand(47) * 100000 as int)
                else userid
           end as userid
    from user_read_log
) b
on a.userid = b.userid;

The rand function scatters the null values across different keys, so the skew on that key is resolved.

Note: if the outliers are not needed at all, it is best to filter them out beforehand; this greatly reduces the amount of computation.
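When the null keys carry no useful information, simply filtering them is even cheaper (a sketch under the same assumed table names):

select a.userid, a.name
from user_info a
join (
    select userid from user_read_log
    where userid is not null
) b
on a.userid = b.userid;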

2. When the key values are all valid, the solution is to set the following parameter:

set hive.exec.reducers.bytes.per.reducer = 1000000000;

That is, each reducer processes 1 GB of data by default. If the join still produces skew, also set:

set hive.optimize.skewjoin = true;
set hive.skewjoin.key = skew_key_threshold; -- default 100000

Hive cannot know at run time which key will skew or by how much, so this parameter sets the skew threshold: keys whose record count exceeds it are split, and the extra records are sent to reducers that are not yet loaded. A value of 2-4 times (total records processed / number of reducers) is generally acceptable.

Skew is common: queries with more than 2 levels of nested selects, translated into more than 3 MapReduce jobs, are prone to it, so it is recommended to set this parameter before running complex SQL. If you do not know what value to use, start from the default of 1 GB per reducer: skew_key_threshold = 1 GB / average row length, or simply set it to 250000000 (which assumes an average row length of about 4 bytes).

3. Too few reducers

set mapred.reduce.tasks = 800;

By default, Hive derives the number of reducers from hive.exec.reducers.bytes.per.reducer; once mapred.reduce.tasks is set explicitly, that calculation is bypassed, so the two parameters are generally not used together.

4. Group by skew

set hive.map.aggr = true; -- enable the map-side combiner

If the map-side rows are mostly distinct keys, aggregating them there is pointless and combining becomes pure overhead; Hive accounts for this with two parameters:

hive.groupby.mapaggr.checkinterval = 100000 (default)
hive.map.aggr.hash.min.reduction = 0.5 (default)

Together they mean: aggregate the first 100,000 rows as a sample; if (rows remaining after aggregation / 100,000) > 0.5, stop aggregating on the map side.

set hive.groupby.skewindata = true; -- decides whether group by handles skewed data; note that it only works when aggregating on a single column. It makes Hive generate two MR jobs: the map output of the first job is distributed randomly to the reducers for pre-aggregation, mitigating the skew caused by keys with too many rows.

5. Joining a small table with a large table

This can be optimized with mapjoin:

set hive.auto.convert.join = true; -- load the small table into memory
set hive.mapjoin.smalltable.filesize = 2500000; -- size limit (in bytes) of the table loaded into memory

 
