Hive SQL optimization can be approached from the following directions.
Overview of optimization directions
Scenario 1. Deduplication
Scenario 2. Reducing the number of jobs
Scenario 3. Controlling parallelism sensibly
Scenario 4. Controlling the number of tasks / output files
Scenario 5. Sorting
Scenario 6. Shifting work to the Map side to cut Reduce-side computation and data transfer
Scenario 7. Data skew
Scenario 8. Data pruning
Scenario 9. Reducing the number of I/O passes
Let's look at each of these in detail.
Detailed list of optimization directions
TIPS: Some of these optimizations involve examples too long for this article; they are covered in separate articles, linked below.
Scenario 1. Deduplication
1) The difference between UNION and UNION ALL, and how to choose between them
2) Replacing DISTINCT with GROUP BY
Article link: https://blog.csdn.net/u010003835/article/details/105493563
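A minimal sketch of both points, using the datacube_salary_org test table shown later in this article (behavior assumed for Hive 1.x/2.x; UNION triggers an extra deduplication stage that UNION ALL avoids):

```sql
-- UNION ALL keeps duplicates and skips the extra deduplication stage;
-- use UNION only when duplicates actually have to be removed.
SELECT company_name FROM datacube_salary_org WHERE pt = '20200405'
UNION ALL
SELECT company_name FROM datacube_salary_org WHERE pt = '20200406';

-- COUNT(DISTINCT ...) funnels all values through very few reducers;
-- a GROUP BY subquery spreads the deduplication across many reducers first.
SELECT COUNT(1)
FROM (
  SELECT user_id
  FROM   datacube_salary_org
  GROUP  BY user_id
) t;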
Scenario 2. Reducing the number of jobs
1) Use UNION ALL judiciously to reduce the number of jobs
2) Exploit a join key shared by multiple tables so they are joined in a single job
Article link: https://blog.csdn.net/u010003835/article/details/105493938
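A sketch of the shared-join-key point: when consecutive joins use the same key, Hive's planner can compile them into one MapReduce job instead of a chain of jobs. Here the three "tables" are just partitions of the test table, used for illustration:

```sql
-- Both joins use the same key (user_id), so Hive can execute them
-- in a single job rather than two chained jobs.
SELECT a.user_id, a.salary, b.salary, c.salary
FROM  (SELECT user_id, salary FROM datacube_salary_org WHERE pt = '20200405') a
JOIN  (SELECT user_id, salary FROM datacube_salary_org WHERE pt = '20200406') b
  ON   a.user_id = b.user_id
JOIN  (SELECT user_id, salary FROM datacube_salary_org WHERE pt = '20200407') c
  ON   a.user_id = c.user_id;
```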
Scenario 3. Controlling parallelism sensibly
Use the parallel-execution parameters appropriately for statements such as:
1) UNION ALL
2) JOIN
Article link: https://blog.csdn.net/u010003835/article/details/105494048
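The relevant settings, as a sketch (defaults and exact behavior vary by Hive version; verify against your deployment):

```sql
-- Allow independent stages of one query (e.g. the two branches of a
-- UNION ALL, or non-dependent subqueries feeding a JOIN) to run at once.
SET hive.exec.parallel = true;
SET hive.exec.parallel.thread.number = 8;  -- max stages running concurrently
```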
Scenario 4. Controlling the number of tasks / output files
1) Controlling the number of mappers
2) Controlling the number of reducers
3) Controlling the number of files written by mappers and reducers
Article link: https://blog.csdn.net/u010003835/article/details/105494261
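A sketch of the knobs for each of the three points (parameter names differ between Hadoop/Hive versions; the ones below are the common MRv2-era names, and the byte values are illustrative, not recommendations):

```sql
-- 1) Mapper count: driven by the input split size, in bytes.
SET mapreduce.input.fileinputformat.split.maxsize = 256000000;

-- 2) Reducer count: either derived from data volume, or fixed explicitly.
SET hive.exec.reducers.bytes.per.reducer = 256000000;
SET mapred.reduce.tasks = 8;   -- hard override; -1 restores auto-estimation

-- 3) Output files: merge the small files left behind by map-only and
--    map-reduce jobs into files of roughly the given size.
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
SET hive.merge.size.per.task = 256000000;
```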
Scenario 5. Sorting
1) Choose appropriately between ORDER BY and SORT BY
2) Cap sorted output with LIMIT
Article link: https://blog.csdn.net/u010003835/article/details/105494790
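A sketch combining both points against the test table: ORDER BY forces a single reducer to produce a total order, while SORT BY only orders rows within each reducer and therefore scales. For a top-N query, taking a per-reducer top-N with SORT BY ... LIMIT and then a final ORDER BY ... LIMIT over that small result is far cheaper than a global ORDER BY of the full data:

```sql
-- Inner query: each reducer sorts its own slice and keeps its top 3.
-- Outer query: a cheap total order over only the surviving candidates.
SELECT *
FROM (
  SELECT user_id, salary
  FROM   datacube_salary_org
  WHERE  pt = '20200405'
  SORT BY salary DESC
  LIMIT 3
) t
ORDER BY salary DESC
LIMIT 3;
```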
Scenario 6. Shifting work to the Map side to cut Reduce-side computation and data transfer
1) MAP JOIN
2) Map-side pre-aggregation (map aggr)
Scenario 7. Data skew
1) Skew caused by NULL values
2) Skew caused by implicit conversion between mismatched data types in join keys
3) Skew caused by naturally uneven distribution of the business data
Scenario 8. Data pruning
1) Row pruning
i. Exploit partitioned and bucketed table layouts
ii. Filter out invalid records early, so they are eliminated in the map stage
2) Column pruning
i. Drop columns that are not needed for the computation
ii. Use columnar storage
Scenario 9. Reducing the number of I/O passes
1) Multi-table insert: FROM A INSERT B SELECT a, ... INSERT C SELECT a, ...
2) Read the input once and reuse it with WITH t AS (...)
The examples throughout this article use the following test table and data.
Test table and test data
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE TABLE `datacube_salary_org`( |
| `company_name` string COMMENT '????', |
| `dep_name` string COMMENT '????', |
| `user_id` bigint COMMENT '??id', |
| `user_name` string COMMENT '????', |
| `salary` decimal(10,2) COMMENT '??', |
| `create_time` date COMMENT '????', |
| `update_time` date COMMENT '????') |
| PARTITIONED BY ( |
| `pt` string COMMENT '????') |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| WITH SERDEPROPERTIES ( |
| 'field.delim'=',', |
| 'serialization.format'=',') |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.mapred.TextInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION |
| 'hdfs://cdh-manager:8020/user/hive/warehouse/data_warehouse_test.db/datacube_salary_org' |
| TBLPROPERTIES ( |
| 'transient_lastDdlTime'='1586310488') |
+----------------------------------------------------+
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| datacube_salary_org.company_name | datacube_salary_org.dep_name | datacube_salary_org.user_id | datacube_salary_org.user_name | datacube_salary_org.salary | datacube_salary_org.create_time | datacube_salary_org.update_time | datacube_salary_org.pt |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| s.zh | engineer | 1 | szh | 28000.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| s.zh | engineer | 2 | zyq | 26000.00 | 2020-04-03 | 2020-04-03 | 20200405 |
| s.zh | tester | 3 | gkm | 20000.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | finance | 4 | pip | 13400.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | finance | 5 | kip | 24500.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | finance | 6 | zxxc | 13000.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | kiccp | 7 | xsz | 8600.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| s.zh | engineer | 1 | szh | 28000.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| s.zh | engineer | 2 | zyq | 26000.00 | 2020-04-03 | 2020-04-03 | 20200406 |
| s.zh | tester | 3 | gkm | 20000.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | finance | 4 | pip | 13400.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | finance | 5 | kip | 24500.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | finance | 6 | zxxc | 13000.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | kiccp | 7 | xsz | 8600.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| s.zh | enginer | 1 | szh | 28000.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| s.zh | enginer | 2 | zyq | 26000.00 | 2020-04-03 | 2020-04-03 | 20200407 |
| s.zh | tester | 3 | gkm | 20000.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | finance | 4 | pip | 13400.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | finance | 5 | kip | 24500.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | finance | 6 | zxxc | 13000.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | kiccp | 7 | xsz | 8600.00 | 2020-04-07 | 2020-04-07 | 20200407 |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+