Hive study notes (4): optimization, part 1

Tuning HiveQL is something anyone who regularly develops with HQL should understand. Learning the execution details behind Hive teaches you how to use it more efficiently, and that pays off both in interviews and in day-to-day development.
The Hive version used here is 2.3.0.

1 Use EXPLAIN

explain prints Hive's execution plan, which helps us understand how Hive converts a query statement into MapReduce tasks.

Usage: prefix the HQL statement with explain:

explain select sum(num) from bucket_num;

The result:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: bucket_num
            Statistics: Num rows: 28 Data size: 114 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: num (type: int)
              outputColumnNames: num
              Statistics: Num rows: 28 Data size: 114 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: sum(num)
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  sort order: 
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col0 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.268 seconds, Fetched: 44 row(s)
  1. From the execution plan you can read the table name (bucket_num), the column name (num), its data type (int), and the aggregate function sum().
  2. A Hive job consists of one or more stages, with dependencies between them. More complex queries usually produce more stages, and more stages usually take longer to finish. A stage can be a MapReduce job, a sampling stage, a merge stage, or a limit stage. By default, Hive executes one stage at a time.
  3. STAGE PLANS
    Stage-1: contains most of the processing for the job.
    Stage-0: for this statement, a stage with no real operations; its Fetch Operator simply returns the result.
    Map Operator Tree: the beginning of the map phase.
    Reduce Operator Tree: the beginning of the reduce phase.
    _col0: a temporary column name generated for the intermediate result.
  4. Understanding how Hive analyzes each query is a good way to diagnose complex or inefficient queries.
  5. For even more detail, you can compare the output of:
explain extended select sum(num) from bucket_num;

2 Limit tuning

Limit tuning addresses the following problem: by default, a query with a limit clause still processes the entire data set before returning the first few rows, which wastes time and should be avoided where possible. Hive therefore has a property that, when enabled, lets a limit query be answered by sampling the source data instead:

<property>
	<name>hive.limit.optimize.enable</name>
	<value>true</value>
</property>

Once this feature is enabled, you can also control the maximum number of rows and the maximum number of files to sample:

<property>
	<name>hive.limit.row.max.size</name>
	<value>10000</value>
</property>
<property>
	<name>hive.limit.optimize.limit.file</name>
	<value>10</value>
</property>
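If you prefer not to edit hive-site.xml, the same knobs can be set per session; a minimal sketch (the values are illustrative):

set hive.limit.optimize.enable=true;
set hive.limit.row.max.size=10000;
set hive.limit.optimize.limit.file=10;

-- May now be answered from a sample of the input instead of a full scan.
select * from bucket_num limit 10;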

This feature has a drawback, though: because the sample may miss useful data, aggregate functions computed on top of it can return different (and possibly wrong) results from run to run.

3 Column pruning and partition pruning

Put simply, column pruning means reading only the columns the query actually references, and partition pruning means scanning only the partitions it needs.

When a table has many columns or holds a large amount of data, select * and queries without partition filters trigger full-column and full-table scans, which are very inefficient. The configuration item for column pruning is hive.optimize.cp, and the one for partition pruning is hive.optimize.pruner; both default to true. In the analysis phase of HiveQL, column pruning corresponds to the ColumnPruner logical optimizer.
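As a sketch, using the calendar_record_log table that appears in later examples (assumed here to be partitioned by pt_date), a well-pruned query reads only the referenced columns and a single partition:

-- Column pruning: only uid and event_type are read.
-- Partition pruning: only the pt_date=20190224 partition is scanned.
select uid,event_type
from calendar_record_log
where pt_date = 20190224;

In short: avoid select *, and filter on the partition column whenever possible.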

4 Predicate pushdown

Predicate pushdown is a concept that also exists in relational databases and applies to HQL as well: filter predicates are evaluated as early as possible so that less data has to flow through the rest of the query.

select a.uid,a.event_type,b.topic_id,b.title
from calendar_record_log a
left outer join (
  select uid,topic_id,title from forum_topic
  where pt_date = 20190224 and length(content) >= 100
) b on a.uid = b.uid
where a.pt_date = 20190224 and status = 0;

Note that the filter on forum_topic is written inside the subquery rather than outside it. Hive's configuration item for predicate pushdown is hive.optimize.ppd (default true), and the corresponding logical optimizer is PredicatePushDown; it moves the FilterOperator up in the OperatorTree, as shown in the figure below.
[Figure: the OperatorTree after optimization, with the FilterOperator FIL(9) moved up toward the table scan]
As the figure shows, FIL(9) has been moved up. The benefit is that part of the source data is filtered out early, which improves query efficiency. Be aware, though, that pushing a predicate down does not always preserve the query's result; whether it is safe depends on the situation.
For a detailed introduction to the four predicate-pushdown cases, see: https://blog.csdn.net/pengzonglu7292/article/details/81036712
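The caveat above is worth a concrete sketch. Using the tables from the earlier example, where the predicate on the right-hand table is placed changes the result of a left outer join:

-- Filter in the ON clause: unmatched rows from a are kept, with NULLs in b's columns.
select a.uid,b.topic_id
from calendar_record_log a
left outer join forum_topic b
  on a.uid = b.uid and b.pt_date = 20190224;

-- Filter in the WHERE clause: rows where b did not match are dropped,
-- so the outer join effectively becomes an inner join.
select a.uid,b.topic_id
from calendar_record_log a
left outer join forum_topic b
  on a.uid = b.uid
where b.pt_date = 20190224;

This is why the optimizer cannot blindly push every predicate past an outer join.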

5 sort by instead of order by

1) order by sorts globally, just like in other SQL dialects. All map-side output is sent to a single reducer; with a large data set this puts a heavy computational load on that reducer and can even bring it down.
If you must use order by, enable strict mode, which forces you to pair order by with a limit clause; the sort still runs on one reducer, but the limit bounds how much data it must handle (still not recommended):
set hive.mapred.mode = strict;
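With strict mode on, a minimal sketch of a permitted order by (the limit is what makes it legal):

-- Without the limit clause, strict mode rejects this statement.
select uid,upload_time
from calendar_record_log
where pt_date = 20190224
order by upload_time desc
limit 100;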

2) sort by performs a partial sort: depending on the situation, several reducers are started, and each reducer's output is sorted internally. To control which reducer each map-side row is sent to, sort by is usually combined with distribute by, which distributes rows to reducers according to the given key instead of at random.

select uid,upload_time,event_type,record_data
from calendar_record_log
where pt_date >= 20190201 and pt_date <= 20190224
distribute by uid
sort by upload_time desc,event_type desc;
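As a side note, when distribute by and sort by operate on the same column in ascending order, cluster by is an equivalent shorthand:

-- Equivalent to: distribute by uid sort by uid (ascending order only).
select uid,upload_time,event_type,record_data
from calendar_record_log
where pt_date >= 20190201 and pt_date <= 20190224
cluster by uid;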

6 group by configuration

Let's look at how group by aggregation works:
[Figure: a group by job with a map-side combiner pre-aggregating before the shuffle]
During a group by, starting a combiner to do partial pre-aggregation on the map side can effectively reduce the amount of shuffled data. The configuration item for pre-aggregation is hive.map.aggr, the default is true, and the corresponding optimizer is GroupByOptimizer; it is simple and convenient.
Through the hive.groupby.mapaggr.checkinterval parameter (default 100000) you control how many input rows are processed before Hive checks whether map-side aggregation is actually reducing the data volume; if it is not effective enough, the map-side hash aggregation is abandoned for that task.
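A minimal sketch of these settings applied to the bucket_num table from section 1 (the values shown are the defaults):

set hive.map.aggr=true;                         -- map-side partial aggregation
set hive.groupby.mapaggr.checkinterval=100000;  -- rows between effectiveness checks

select num, count(*) as cnt
from bucket_num
group by num;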

Of course, if some keys in a group by carry far more rows than others, data skew occurs. Hive has a built-in balancing mechanism for this, disabled by default:
set hive.groupby.skewindata=true;
When enabled, the group by is executed as two MR jobs. The first job distributes map output to reducers at random, so rows with the same key can land on different reducers, and each reducer performs a partial aggregation. The second job then aggregates the pre-processed data by key and outputs the final result, which balances the load.
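A sketch of the rewrite in action, assuming uid is a heavily skewed key (the table and column come from the earlier examples):

set hive.groupby.skewindata=true;

-- Runs as two MR jobs: the first spreads identical uids across reducers
-- at random and partially aggregates; the second merges the partial
-- counts by uid.
select uid, count(*) as cnt
from calendar_record_log
where pt_date = 20190224
group by uid;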
A configuration switch is a blunt instrument, though, and sometimes it cannot solve the problem at its root. It is better to understand where the skew in your data comes from and optimize the query statement itself.

Source: blog.csdn.net/u013963379/article/details/90183073