http://shiyanjun.cn/archives/588.html
http://www.cnblogs.com/xd502djj/p/3799432.html
https://www.2cto.com/net/201708/668075.html
http://dacoolbaby.iteye.com/blog/1879002
Basic principles:
1: Filter data as early as possible to reduce the volume of data at each stage: add partition predicates when querying partitioned tables, and select only the columns you actually need.
select ... from A
join B
on A.key = B.key
where A.userid > 10
and B.userid < 10
and A.dt = '20120417'
and B.dt = '20120417';
should be rewritten as:
select ... from (select ... from A
    where dt = '20120417'
    and userid > 10
) a
join (select ... from B
    where dt = '20120417'
    and userid < 10
) b
on a.key = b.key;
2: Make operations as atomic as possible, and avoid packing complex logic into a single SQL statement.
Intermediate tables can be used to break complex logic into steps:
drop table if exists tmp_table_1;
create table if not exists tmp_table_1 as
select ...;
drop table if exists tmp_table_2;
create table if not exists tmp_table_2 as
select ...;
drop table if exists result_table;
create table if not exists result_table as
select ...;
drop table if exists tmp_table_1;
drop table if exists tmp_table_2;
3: Keep the number of jobs generated by a single SQL statement below 5 where possible.
4: Use mapjoin carefully. It is generally only suitable when the small table has fewer than 2000 rows and is smaller than 1 MB (the limit can be raised somewhat after capacity expansion). Place the small table on the left side of the join (currently many small tables in TCL are placed on the right side of the join).
Otherwise, the join will consume a great deal of disk and memory.
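As a minimal sketch (the table names are illustrative, not from the text above), a mapjoin can be requested explicitly with the MAPJOIN hint, which builds an in-memory hash table from the small table on each mapper and skips the shuffle/reduce phase entirely:

```sql
-- Hypothetical tables: dim_city is the small table (< 1 MB),
-- user_log is the large table. The hint asks Hive to load
-- dim_city into memory and perform the join on the map side.
select /*+ MAPJOIN(c) */
       u.userid, u.dt, c.city_name
from user_log u
join dim_city c
  on u.city_id = c.city_id;
```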
5: Before writing SQL, first understand the characteristics of the data itself. If the query contains join or group by operations, watch out for data skew.
If data skew occurs, set the following parameters:
set hive.exec.reducers.max = 200;
set mapred.reduce.tasks = 200;  -- increase the number of reducers
set hive.groupby.mapaggr.checkinterval = 100000;  -- if a group by key has more records than this, it is split; tune the value to the actual data volume
set hive.groupby.skewindata = true;  -- set to true if the group by phase is skewed
set hive.skewjoin.key = 100000;  -- if a join key has more records than this, it is split; tune the value to the actual data volume
set hive.optimize.skewjoin = true;  -- set to true if the join phase is skewed
Group By statement
- Map-side partial aggregation:
- Not every aggregation has to be completed on the reduce side. Many aggregations can be partially performed on the map side first, with the final result produced on the reduce side.
- Hash-based.
- Relevant parameters:
- hive.map.aggr = true -- whether to aggregate on the map side; default is true
- hive.groupby.mapaggr.checkinterval = 100000 -- number of rows used to check map-side aggregation
- Load balancing under data skew:
- hive.groupby.skewindata = false
- When this option is set to true, the generated query plan contains two MR jobs. In the first job, the map output is distributed randomly across the reducers; each reducer performs a partial aggregation and emits its result. Because the same group by key may land on different reducers, the load is balanced. The second job then distributes the pre-aggregated results by group by key (guaranteeing that the same group by key goes to the same reducer) and completes the final aggregation.
(This behavior is controlled by the hive.groupby.skewindata variable.)
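A minimal sketch (the table and column names are illustrative) of enabling both map-side aggregation and the two-job skew handling described above before running a skewed group by:

```sql
-- Assumed table: page_view(userid, url), where a handful of
-- hot urls dominate the data. skewindata=true makes Hive run
-- two MR jobs: a randomly distributed partial aggregation,
-- then a final aggregation by url.
set hive.map.aggr = true;
set hive.groupby.skewindata = true;

select url, count(*) as pv
from page_view
group by url;
```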
6: If a union all has more than 2 branches, or each branch carries a large amount of data, split it into multiple insert into statements. In practical tests this improved execution time by about 50%.
insert overwrite table tablename partition (dt = ...)
select ... from (
    select ... from A
    union all
    select ... from B
    union all
    select ... from C
) R
where ...;
can be rewritten as:
insert into table tablename partition (dt = ...)
select ... from A
where ...;
insert into table tablename partition (dt = ...)
select ... from B
where ...;
insert into table tablename partition (dt = ...)
select ... from C
where ...;
When Hive runs a job, data skew often shows up as the job hanging at 99% of the reduce phase, with the last 1% not finishing for hours. This situation is very likely data skew; the cause and remedy should be chosen according to the specific case.
1. The join key is skewed, and contains many null or abnormal values.
In this case, assign a random value to the abnormal keys so that they are scattered across reducers.
For example:
select a.userid, a.name
from user_info a
join (
    select case when userid is null
                then cast(rand(47) * 100000 as int)
                else userid
           end as userid
    from user_read_log
) b on a.userid = b.userid;
The rand function scatters the null keys across different values, resolving the skew on that key.
Note: if the abnormal rows are not needed at all, it is better to filter them out in advance, which greatly reduces the amount of computation.
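Following the note above, a sketch of simply filtering the null keys out up front (table names reuse the example above):

```sql
-- If rows with a null userid carry no useful information,
-- dropping them in the subquery avoids both the skew and
-- the cost of shuffling those rows at all.
select a.userid, a.name
from user_info a
join (
    select userid
    from user_read_log
    where userid is not null
) b on a.userid = b.userid;
```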
2. When the key values are all valid, the solution is to tune the following parameters:
set hive.exec.reducers.bytes.per.reducer = 1000000000;
That is, each reducer processes 1 GB of data by default. If the join still produces skew, also set:
set hive.optimize.skewjoin = true;
set hive.skewjoin.key = skew_key_threshold; -- default 100000
At run time Hive cannot predict which key will be skewed or by how much, so this parameter sets the skew threshold: keys whose record counts exceed it are redistributed to reducers that have not yet reached the threshold. A value of 2 to 4 times (total records processed / number of reducers) is generally acceptable.
Skew is common: queries with more than 2 levels of nested selects, which translate into more than 3 MapReduce jobs, are prone to it. It is advisable to set this parameter before running complex SQL. If you are unsure what value to use, start from the official default of 1 GB per reducer, giving skew_key_threshold = 1 GB / average row length, or simply set it to 250000000 (which assumes an average row length of about 4 bytes).
3. The number of reducers is too small:
set mapred.reduce.tasks = 800;
By default, the hive.exec.reducers.bytes.per.reducer parameter takes effect first, and Hive computes the number of reducers from it automatically, so the two parameters are generally not used at the same time.
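A hedged sketch of the two alternatives above (pick one; as noted, they are generally not combined):

```sql
-- Option A: let Hive derive the reducer count from data volume.
-- 256 MB per reducer instead of the 1 GB default yields roughly
-- 4x more reducers on the same input.
set hive.exec.reducers.bytes.per.reducer = 256000000;

-- Option B: fix the reducer count explicitly (overrides Option A).
-- set mapred.reduce.tasks = 800;
```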
4. Skew in group by:
set hive.map.aggr = true; -- enable the map-side combiner
This performs combining on the map side. If the map-side keys are mostly distinct, aggregation there is pointless and combining is wasted effort. Hive accounts for this thoughtfully via two parameters:
hive.groupby.mapaggr.checkinterval = 100000 (default)
hive.map.aggr.hash.min.reduction = 0.5 (default)
Together they mean: pre-aggregate the first 100,000 rows; if (rows after aggregation / 100,000) > 0.5, stop aggregating on the map side.
set hive.groupby.skewindata = true; -- determines whether group by handles skewed data; note: it only supports aggregation on a single column
This makes Hive generate two MR jobs: the first randomly distributes the map output across reducers for pre-aggregation, alleviating the data skew caused by some keys having far too many records.
5. Joining a small table with a large table:
This can be optimized with mapjoin:
set hive.auto.convert.join = true; -- load the small table into memory
set hive.mapjoin.smalltable.filesize = 2500000; -- size threshold, in bytes, for a table to be loaded into memory
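A minimal sketch combining these settings with a small-to-large join (the table names are illustrative):

```sql
-- With auto conversion on, Hive rewrites this as a map-side
-- join automatically whenever dim_product is under the
-- filesize threshold, so no reduce phase is needed.
set hive.auto.convert.join = true;
set hive.mapjoin.smalltable.filesize = 2500000;

select o.order_id, o.amount, p.product_name
from orders o
join dim_product p
  on o.product_id = p.product_id;
```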