Hive optimization guide

Hive SQL optimization can be approached from the following directions.

 

Overview of optimization directions

 

Scenario 1. Deduplication

Scenario 2. Reduce the number of jobs

Scenario 3. Control parallelism sensibly

Scenario 4. Control the number of tasks / files in a job

Scenario 5. Sorting

Scenario 6. Shift more work to the map side to reduce the reducers' computation and data transfer cost

Scenario 7. Data skew

Scenario 8. Data pruning

Scenario 9. Reduce the number of I/O passes

 

Let's look at each direction in detail.

 

Detailed list of optimization directions

 

TIPS: Some optimization cases are too long to cover here, so each is written up in a separate article, linked below.

 

Scenario 1. Deduplication

1) UNION vs. UNION ALL: the difference and how to choose

2) GROUP BY as an alternative to DISTINCT

Article link: https://blog.csdn.net/u010003835/article/details/105493563

 

Scenario 2. Reduce the number of jobs

1) Use UNION ALL to reduce the number of jobs

2) Join multiple tables on the same key to reduce the number of jobs

Article link: https://blog.csdn.net/u010003835/article/details/105493938

 

Scenario 3. Control parallelism sensibly

Use the parallel-execution parameters sensibly.

This applies to statements such as:

1) UNION ALL

2) JOIN

Article link: https://blog.csdn.net/u010003835/article/details/105494048

 

Scenario 4. Control the number of tasks / files in a job

1) Control the number of mappers

2) Control the number of reducers

3) Control the number of files output by mappers and reducers

Article link: https://blog.csdn.net/u010003835/article/details/105494261

 

Scenario 5. Sorting

1) Choose between ORDER BY and SORT BY appropriately

2) Use LIMIT to bound the sorted output

Article link: https://blog.csdn.net/u010003835/article/details/105494790

 

Scenario 6. Shift more work to the map side to reduce the reducers' computation and data transfer cost

1) MAP JOIN

2) MAP AGGR: pre-aggregation on the map side

 

Scenario 7. Data skew

1) Skew caused by NULL values

2) Skew caused by type conversion between inconsistent data types

3) Skew caused by business data that is itself unevenly distributed

 

Scenario 8. Data pruning

1) Row pruning

   i. Take advantage of partitioned and bucketed tables

  ii. Filter out invalid records early, so that they are eliminated in the map stage

2) Column pruning

   i. Drop invalid columns and columns not needed by the computation

  ii. Use columnar storage

 

 

Scenario 9. Reduce the number of I/O passes

1) Multi-table insert: FROM A INSERT B SELECT a, ... INSERT C SELECT a, ...

2) Read once, use many times: WITH TABLE AS (...)

 

 

Let's go through these nine optimization directions one by one.

 

Test table and test data

+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `datacube_salary_org`(                |
|   `company_name` string COMMENT '????',            |
|   `dep_name` string COMMENT '????',                |
|   `user_id` bigint COMMENT '??id',                 |
|   `user_name` string COMMENT '????',               |
|   `salary` decimal(10,2) COMMENT '??',             |
|   `create_time` date COMMENT '????',               |
|   `update_time` date COMMENT '????')               |
| PARTITIONED BY (                                   |
|   `pt` string COMMENT '????')                      |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
| WITH SERDEPROPERTIES (                             |
|   'field.delim'=',',                               |
|   'serialization.format'=',')                      |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.mapred.TextInputFormat'       |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                           |
|   'hdfs://cdh-manager:8020/user/hive/warehouse/data_warehouse_test.db/datacube_salary_org' |
| TBLPROPERTIES (                                    |
|   'transient_lastDdlTime'='1586310488')            |
+----------------------------------------------------+
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| datacube_salary_org.company_name  | datacube_salary_org.dep_name  | datacube_salary_org.user_id  | datacube_salary_org.user_name  | datacube_salary_org.salary  | datacube_salary_org.create_time  | datacube_salary_org.update_time  | datacube_salary_org.pt  |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| s.zh                              | engineer                      | 1                            | szh                            | 28000.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| s.zh                              | engineer                      | 2                            | zyq                            | 26000.00                    | 2020-04-03                       | 2020-04-03                       | 20200405                |
| s.zh                              | tester                        | 3                            | gkm                            | 20000.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| x.qx                              | finance                       | 4                            | pip                            | 13400.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| x.qx                              | finance                       | 5                            | kip                            | 24500.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| x.qx                              | finance                       | 6                            | zxxc                           | 13000.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| x.qx                              | kiccp                         | 7                            | xsz                            | 8600.00                     | 2020-04-07                       | 2020-04-07                       | 20200405                |
| s.zh                              | engineer                      | 1                            | szh                            | 28000.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| s.zh                              | engineer                      | 2                            | zyq                            | 26000.00                    | 2020-04-03                       | 2020-04-03                       | 20200406                |
| s.zh                              | tester                        | 3                            | gkm                            | 20000.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| x.qx                              | finance                       | 4                            | pip                            | 13400.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| x.qx                              | finance                       | 5                            | kip                            | 24500.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| x.qx                              | finance                       | 6                            | zxxc                           | 13000.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| x.qx                              | kiccp                         | 7                            | xsz                            | 8600.00                     | 2020-04-07                       | 2020-04-07                       | 20200406                |
| s.zh                              | enginer                       | 1                            | szh                            | 28000.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| s.zh                              | enginer                       | 2                            | zyq                            | 26000.00                    | 2020-04-03                       | 2020-04-03                       | 20200407                |
| s.zh                              | tester                        | 3                            | gkm                            | 20000.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| x.qx                              | finance                       | 4                            | pip                            | 13400.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| x.qx                              | finance                       | 5                            | kip                            | 24500.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| x.qx                              | finance                       | 6                            | zxxc                           | 13000.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| x.qx                              | kiccp                         | 7                            | xsz                            | 8600.00                     | 2020-04-07                       | 2020-04-07                       | 20200407                |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+

Scenario 1. Deduplication

1) UNION vs. UNION ALL: the difference and how to choose

2) GROUP BY as an alternative to DISTINCT
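
As a minimal sketch of both points, the queries below run against the test table datacube_salary_org shown above; the partitions and columns are chosen only for illustration. UNION deduplicates the combined rows, which costs an extra deduplication stage, while UNION ALL simply concatenates its inputs, so UNION ALL is usually the better choice when the inputs cannot overlap or duplicates are acceptable. For distinct counts, COUNT(DISTINCT ...) without a GROUP BY typically funnels all rows through a single reducer, whereas a GROUP BY subquery lets several reducers share the deduplication at the price of an extra stage.

```sql
-- UNION: rows that appear in both partitions are collapsed.
-- On the test data this returns 7 rows.
SELECT company_name, dep_name, user_id
FROM datacube_salary_org
WHERE pt = '20200405'
UNION
SELECT company_name, dep_name, user_id
FROM datacube_salary_org
WHERE pt = '20200406';

-- UNION ALL: plain concatenation, no deduplication stage.
-- On the test data this returns 14 rows.
SELECT company_name, dep_name, user_id
FROM datacube_salary_org
WHERE pt = '20200405'
UNION ALL
SELECT company_name, dep_name, user_id
FROM datacube_salary_org
WHERE pt = '20200406';

-- DISTINCT form: simple, but the distinct count is computed by one reducer.
SELECT COUNT(DISTINCT user_id) AS user_cnt
FROM datacube_salary_org
WHERE pt = '20200407';

-- GROUP BY form: deduplicate in a subquery, then count the groups.
-- COUNT(user_id) skips a NULL group, matching COUNT(DISTINCT user_id).
SELECT COUNT(user_id) AS user_cnt
FROM (
  SELECT user_id
  FROM datacube_salary_org
  WHERE pt = '20200407'
  GROUP BY user_id
) t;
```

On a table as small as the test table the GROUP BY form is likely slower because of the extra stage; it tends to pay off only on large inputs.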

 

 

Scenario 2. Reduce the number of jobs

1) Use UNION ALL to reduce the number of jobs

2) Join multiple tables on the same key to reduce the number of jobs
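
The two ideas can be sketched against the test table as follows. Both queries are illustrative assumptions rather than the cases from the linked article; in practice the UNION ALL branches would usually read different tables. The first query unions the raw rows and aggregates once instead of aggregating each branch in its own job. The second joins three inputs on the same column, which Hive can generally plan as a single MapReduce job.

```sql
-- 1) Union the raw rows first, then aggregate once,
--    rather than running one GROUP BY per branch and merging the results.
SELECT company_name, SUM(salary) AS total_salary
FROM (
  SELECT company_name, salary
  FROM datacube_salary_org
  WHERE pt = '20200405'
  UNION ALL
  SELECT company_name, salary
  FROM datacube_salary_org
  WHERE pt = '20200406'
) t
GROUP BY company_name;

-- 2) Every join clause uses user_id, so the joins can share one shuffle.
--    Joining c on a different column of b would typically add a second job.
SELECT a.user_id,
       a.salary AS salary_0405,
       b.salary AS salary_0406,
       c.salary AS salary_0407
FROM datacube_salary_org a
JOIN datacube_salary_org b ON a.user_id = b.user_id
JOIN datacube_salary_org c ON b.user_id = c.user_id
WHERE a.pt = '20200405'
  AND b.pt = '20200406'
  AND c.pt = '20200407';
```

Prefixing either statement with EXPLAIN shows how many stages Hive actually plans, which is the reliable way to confirm the effect on a given version and configuration.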

 

 

Scenario 3. Control parallelism sensibly

Use the parallel-execution parameters sensibly.

This applies to statements such as:

1) UNION ALL

2) JOIN
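
A minimal sketch of the parallel-execution parameters, assuming the MapReduce engine and the test table above. The two UNION ALL branches aggregate independently, so their stages do not depend on each other and can run at the same time when parallel execution is enabled. The thread number shown is the usual default and is only an example value.

```sql
-- Allow independent stages of one query to run concurrently.
SET hive.exec.parallel=true;
-- Upper bound on the number of stages running at once (example value).
SET hive.exec.parallel.thread.number=8;

-- Each branch has its own aggregation stage; neither needs the other's output.
SELECT 'company' AS dim, company_name AS name, SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY company_name
UNION ALL
SELECT 'department' AS dim, dep_name AS name, SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY dep_name;
```

Parallel stages compete for the same cluster resources, so on a busy cluster a higher thread number does not necessarily shorten the run time.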

 

Scenario 4. Control the number of tasks / files in a job

1) Control the number of mappers

2) Control the number of reducers

3) Control the number of files output by mappers and reducers
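
The three controls map onto session settings roughly as sketched below. All byte values are illustrative assumptions, not recommendations; sensible values depend on file sizes and cluster capacity.

```sql
-- 1) Mappers: combine small files into splits and cap the split size.
--    A larger maximum split size means fewer mappers.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapreduce.input.fileinputformat.split.maxsize=256000000;

-- 2) Reducers: let Hive estimate the count from the input size ...
SET hive.exec.reducers.bytes.per.reducer=256000000;
SET hive.exec.reducers.max=1009;
-- ... or fix the count explicitly (uncomment to override the estimate).
-- SET mapreduce.job.reduces=10;

-- 3) Output files: merge small result files at the end of the job.
SET hive.merge.mapfiles=true;               -- merge after map-only jobs
SET hive.merge.mapredfiles=true;            -- merge after map-reduce jobs
SET hive.merge.size.per.task=256000000;     -- target size of merged files
SET hive.merge.smallfiles.avgsize=16000000; -- merge when the average output file is smaller than this
```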

 

 

Scenario 5. Sorting

1) Choose between ORDER BY and SORT BY appropriately

2) Use LIMIT to bound the sorted output
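
A short sketch of the difference, using the test table above. ORDER BY produces one globally sorted result and therefore goes through a single reducer; SORT BY only sorts within each reducer, and DISTRIBUTE BY decides which rows each reducer receives. Adding LIMIT to a global sort bounds how many rows have to be carried to the final output.

```sql
-- Global order, bounded by LIMIT: the three highest salaries in the partition.
-- On the test data: szh 28000.00, zyq 26000.00, kip 24500.00.
SELECT user_name, salary
FROM datacube_salary_org
WHERE pt = '20200407'
ORDER BY salary DESC
LIMIT 3;

-- Per-reducer order: rows of the same company go to the same reducer
-- and are sorted there; there is no total order across reducers.
SELECT company_name, user_name, salary
FROM datacube_salary_org
WHERE pt = '20200407'
DISTRIBUTE BY company_name
SORT BY company_name, salary DESC;
```

SORT BY is the better fit when only a per-group order is required, for example before writing grouped output files.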

Scenario 6. Shift more work to the map side to reduce the reducers' computation and data transfer cost

1) MAP JOIN

2) MAP AGGR: pre-aggregation on the map side
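
The sketch below shows the settings usually involved. The dimension table dim_department and its columns are hypothetical and exist only for this example; the size threshold is an example value. With automatic conversion enabled, Hive loads the small table into memory on every mapper and performs the join there, so no shuffle is needed for it. Map-side aggregation computes partial aggregates in the mappers, so fewer rows are sent to the reducers.

```sql
-- 1) MAP JOIN: convert joins automatically when one side is small enough.
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;  -- small-table threshold in bytes (example value)

-- dim_department(dep_name STRING, dep_desc STRING) is a hypothetical small table.
SELECT o.user_name, o.dep_name, d.dep_desc
FROM datacube_salary_org o
JOIN dim_department d ON o.dep_name = d.dep_name
WHERE o.pt = '20200407';

-- 2) MAP AGGR: pre-aggregate inside the mappers before the shuffle.
SET hive.map.aggr=true;

SELECT company_name, COUNT(1) AS user_cnt, SUM(salary) AS total_salary
FROM datacube_salary_org
WHERE pt = '20200407'
GROUP BY company_name;
```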

Scenario 7. Data skew

1) Skew caused by NULL values

2) Skew caused by type conversion between inconsistent data types

3) Skew caused by business data that is itself unevenly distributed
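
One possible way to handle each case is sketched below. The table user_action_log(user_id STRING, action STRING) is hypothetical and stands in for a large table whose join key contains many NULLs and is stored as a string. Rows with a NULL key can never match, so they are kept out of the join and appended afterwards; the explicit CAST makes both sides of the join the same type instead of relying on implicit conversion. The last three settings are Hive's built-in skew handling, and the threshold is an example value.

```sql
-- 1) + 2) Join only the non-NULL keys, with matching types on both sides,
--         then append the NULL-key rows without sending them through the join.
--         The result is the same as a LEFT JOIN over all log rows.
SELECT l.user_id, l.action, o.user_name
FROM (
  SELECT user_id, action
  FROM user_action_log
  WHERE user_id IS NOT NULL
) l
LEFT JOIN (
  SELECT user_id, user_name
  FROM datacube_salary_org
  WHERE pt = '20200407'
) o ON l.user_id = CAST(o.user_id AS STRING)
UNION ALL
SELECT user_id, action, CAST(NULL AS STRING) AS user_name
FROM user_action_log
WHERE user_id IS NULL;

-- 3) Naturally skewed keys: let Hive spread the hot keys.
SET hive.groupby.skewindata=true;  -- two-stage aggregation for skewed GROUP BY keys
SET hive.optimize.skewjoin=true;   -- handle skewed join keys in a follow-up job
SET hive.skewjoin.key=100000;      -- rows per key above which a key counts as skewed (example value)
```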

Scenario 8. Data pruning

1) Row pruning

   i. Take advantage of partitioned and bucketed tables

  ii. Filter out invalid records early, so that they are eliminated in the map stage

2) Column pruning

   i. Drop invalid columns and columns not needed by the computation

  ii. Use columnar storage
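
The points above can be sketched with the test table as follows. The ORC copy datacube_salary_orc and its bucket count are assumptions made for this example. Filtering on the partition column pt lets Hive skip whole partitions, filtering inside the subqueries removes rows before the join, and naming only the needed columns lets a columnar format such as ORC skip the rest.

```sql
-- Partitioned, bucketed, columnar copy of the test table (hypothetical name).
-- On Hive 1.x, run SET hive.enforce.bucketing=true; before inserting.
CREATE TABLE IF NOT EXISTS datacube_salary_orc (
  company_name STRING,
  dep_name     STRING,
  user_id      BIGINT,
  user_name    STRING,
  salary       DECIMAL(10,2),
  create_time  DATE,
  update_time  DATE
)
PARTITIONED BY (pt STRING)
CLUSTERED BY (user_id) INTO 4 BUCKETS
STORED AS ORC;

INSERT OVERWRITE TABLE datacube_salary_orc PARTITION (pt = '20200407')
SELECT company_name, dep_name, user_id, user_name, salary, create_time, update_time
FROM datacube_salary_org
WHERE pt = '20200407';

-- Row pruning: only one partition is read.
-- Column pruning: only dep_name and salary are read from the ORC files.
SELECT dep_name, SUM(salary) AS total_salary
FROM datacube_salary_orc
WHERE pt = '20200407'
GROUP BY dep_name;

-- Filter and project inside the subqueries so unneeded rows and columns
-- are dropped in the map stage, before the join.
SELECT a.user_id, a.salary AS salary_0406, b.salary AS salary_0407
FROM (
  SELECT user_id, salary FROM datacube_salary_org WHERE pt = '20200406'
) a
JOIN (
  SELECT user_id, salary FROM datacube_salary_org WHERE pt = '20200407'
) b ON a.user_id = b.user_id;
```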

Scenario 9. Reduce the number of I/O passes

1) Multi-table insert: FROM A INSERT B SELECT a, ... INSERT C SELECT a, ...

2) Read once, use many times: WITH TABLE AS (...)
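
A minimal sketch of both forms, reading from the test table above. The target tables salary_by_company and salary_by_dep are hypothetical and are created here only so the example is complete. In the multi-table insert the source is scanned once and feeds both INSERT clauses; the WITH clause names a subquery once so that several parts of the statement can refer to it.

```sql
-- Hypothetical target tables for the example.
CREATE TABLE IF NOT EXISTS salary_by_company (company_name STRING, total_salary DECIMAL(20,2));
CREATE TABLE IF NOT EXISTS salary_by_dep (dep_name STRING, total_salary DECIMAL(20,2));

-- 1) Multi-table insert: one scan of the source, two outputs.
FROM datacube_salary_org
INSERT OVERWRITE TABLE salary_by_company
  SELECT company_name, SUM(salary)
  WHERE pt = '20200407'
  GROUP BY company_name
INSERT OVERWRITE TABLE salary_by_dep
  SELECT dep_name, SUM(salary)
  WHERE pt = '20200407'
  GROUP BY dep_name;

-- 2) Define the input once with WITH and reference it several times.
WITH latest AS (
  SELECT company_name, dep_name, salary
  FROM datacube_salary_org
  WHERE pt = '20200407'
)
SELECT 'company' AS dim, company_name AS name, SUM(salary) AS total_salary
FROM latest
GROUP BY company_name
UNION ALL
SELECT 'department' AS dim, dep_name AS name, SUM(salary) AS total_salary
FROM latest
GROUP BY dep_name;
```

Depending on the Hive version and settings, a WITH subquery may be inlined at each reference rather than computed once, so it is worth checking the plan with EXPLAIN before relying on it to save a scan.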


Origin blog.csdn.net/u010003835/article/details/105334641