Outline: https://mp.csdn.net/console/editor/html/105334641
Test table and test data
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE TABLE `datacube_salary_org`( |
| `company_name` string COMMENT '????', |
| `dep_name` string COMMENT '????', |
| `user_id` bigint COMMENT '??id', |
| `user_name` string COMMENT '????', |
| `salary` decimal(10,2) COMMENT '??', |
| `create_time` date COMMENT '????', |
| `update_time` date COMMENT '????') |
| PARTITIONED BY ( |
| `pt` string COMMENT '????') |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| WITH SERDEPROPERTIES ( |
| 'field.delim'=',', |
| 'serialization.format'=',') |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.mapred.TextInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION |
| 'hdfs://cdh-manager:8020/user/hive/warehouse/data_warehouse_test.db/datacube_salary_org' |
| TBLPROPERTIES ( |
| 'transient_lastDdlTime'='1586310488') |
+----------------------------------------------------+
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| datacube_salary_org.company_name | datacube_salary_org.dep_name | datacube_salary_org.user_id | datacube_salary_org.user_name | datacube_salary_org.salary | datacube_salary_org.create_time | datacube_salary_org.update_time | datacube_salary_org.pt |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| s.zh | engineer | 1 | szh | 28000.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| s.zh | engineer | 2 | zyq | 26000.00 | 2020-04-03 | 2020-04-03 | 20200405 |
| s.zh | tester | 3 | gkm | 20000.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | finance | 4 | pip | 13400.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | finance | 5 | kip | 24500.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | finance | 6 | zxxc | 13000.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | kiccp | 7 | xsz | 8600.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| s.zh | engineer | 1 | szh | 28000.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| s.zh | engineer | 2 | zyq | 26000.00 | 2020-04-03 | 2020-04-03 | 20200406 |
| s.zh | tester | 3 | gkm | 20000.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | finance | 4 | pip | 13400.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | finance | 5 | kip | 24500.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | finance | 6 | zxxc | 13000.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | kiccp | 7 | xsz | 8600.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| s.zh | enginer | 1 | szh | 28000.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| s.zh | enginer | 2 | zyq | 26000.00 | 2020-04-03 | 2020-04-03 | 20200407 |
| s.zh | tester | 3 | gkm | 20000.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | finance | 4 | pip | 13400.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | finance | 5 | kip | 24500.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | finance | 6 | zxxc | 13000.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | kiccp | 7 | xsz | 8600.00 | 2020-04-07 | 2020-04-07 | 20200407 |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
Scenario 2: Reducing the number of jobs
1) Use UNION ALL cleverly to reduce the number of jobs
2) Use the same JOIN condition across multiple tables to reduce the number of jobs
1) Use UNION ALL cleverly to reduce the number of jobs
Suppose we need to count the number of rows in each of several tables.
We could run a separate SQL statement per table, but that is inefficient (and not worth considering here).
Alternatively, we can merge the per-table counts with UNION ALL, which is still fairly slow.
For example:
SELECT
'a' AS type
,COUNT(1) AS num
FROM datacube_salary_basic_aggr AS a
UNION ALL
SELECT
'b' AS type
,COUNT(1) AS num
FROM datacube_salary_company_aggr AS b
UNION ALL
SELECT
'c' AS type
,COUNT(1) AS num
FROM datacube_salary_dep_aggr AS c
UNION ALL
SELECT
'd' AS type
,COUNT(1) AS num
FROM datacube_salary_total_aggr AS d
;
A better approach is to read all the tables in first, tag each row with its source table, and then aggregate over the tags.
Because multiple tables serve as input here, this stage runs with multiple mappers.
For example:
SELECT
type
,COUNT(1)
FROM
(
SELECT
'a' AS type
,total_salary
FROM datacube_salary_basic_aggr AS a
UNION ALL
SELECT
'b' AS type
,total_salary
FROM datacube_salary_company_aggr AS b
UNION ALL
SELECT
'c' AS type
,total_salary
FROM datacube_salary_dep_aggr AS c
UNION ALL
SELECT
'd' AS type
,total_salary
FROM datacube_salary_total_aggr AS d
) AS tmp
GROUP BY
type
;
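The effect of the tag-then-aggregate rewrite can be sketched in Python. This is a conceptual simulation of the two query plans, not Hive code; the table tags and row counts are taken from the example above:

```python
from collections import Counter

# Stand-ins for the four aggregate tables; each list is one table's rows.
tables = {
    "a": [{"total_salary": 1}] * 7,   # datacube_salary_basic_aggr: 7 rows
    "b": [{"total_salary": 1}] * 2,   # datacube_salary_company_aggr: 2 rows
    "c": [{"total_salary": 1}] * 4,   # datacube_salary_dep_aggr: 4 rows
    "d": [{"total_salary": 1}] * 1,   # datacube_salary_total_aggr: 1 row
}

# Inefficient plan: one COUNT per table (one MR job each),
# with the four results merged by UNION ALL afterwards.
per_table_counts = {t: len(rows) for t, rows in tables.items()}

# Efficient plan: tag every row with its table name during a single combined
# scan, then run one GROUP BY over the tagged stream (one MR job in total).
tagged_stream = [(t, row) for t, rows in tables.items() for row in rows]
grouped_counts = Counter(t for t, _ in tagged_stream)

assert per_table_counts == dict(grouped_counts)  # same result, fewer jobs
```

Both plans produce identical counts; the difference is only in how many jobs Hive must launch to compute them.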
Let's run EXPLAIN to see how these two approaches differ.
INFO : Starting task [Stage-6:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20200411043838_47d043d4-8b22-433c-a0be-4714aed8ab94); Time taken: 0.008 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-2 depends on stages: Stage-1, Stage-3, Stage-4, Stage-5 |
| Stage-3 is a root stage |
| Stage-4 is a root stage |
| Stage-5 is a root stage |
| Stage-0 depends on stages: Stage-2 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| Statistics: Num rows: 7 Data size: 2086 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| Statistics: Num rows: 7 Data size: 2086 Basic stats: COMPLETE Column stats: COMPLETE |
| Group By Operator |
| aggregations: count(1) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| Reduce Output Operator |
| sort order: |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| value expressions: _col0 (type: bigint) |
| Execution mode: vectorized |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(VALUE._col0) |
| mode: mergepartial |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: 'a' (type: string), _col0 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Union |
| Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| TableScan |
| Union |
| Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| TableScan |
| Union |
| Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| TableScan |
| Union |
| Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-3 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: b |
| Statistics: Num rows: 2 Data size: 400 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| Statistics: Num rows: 2 Data size: 400 Basic stats: COMPLETE Column stats: COMPLETE |
| Group By Operator |
| aggregations: count(1) |
| mode: hash |
| outputColumnNames: _col0 |
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| Reduce Output Operator |
| sort order: |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| value expressions: _col0 (type: bigint) |
| Execution mode: vectorized |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(VALUE._col0) |
| mode: mergepartial |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: 'b' (type: string), _col0 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-4 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: c |
| Statistics: Num rows: 4 Data size: 1160 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| Statistics: Num rows: 4 Data size: 1160 Basic stats: COMPLETE Column stats: COMPLETE |
| Group By Operator |
| aggregations: count(1) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| Reduce Output Operator |
| sort order: |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| value expressions: _col0 (type: bigint) |
| Execution mode: vectorized |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(VALUE._col0) |
| mode: mergepartial |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: 'c' (type: string), _col0 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-5 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: d |
| Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE |
| Group By Operator |
| aggregations: count(1) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| Reduce Output Operator |
| sort order: |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| value expressions: _col0 (type: bigint) |
| Execution mode: vectorized |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(VALUE._col0) |
| mode: mergepartial |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: 'd' (type: string), _col0 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
199 rows selected (0.242 seconds)
As you can see, the job is split into six stages.
Each SELECT 'x' AS type, COUNT(1) FROM ... query is its own unit of work; with four tables, that produces four jobs: Stage-1, Stage-3, Stage-4, and Stage-5.
Stage-2 is the UNION job that merges the four results.
Now let's look at the EXPLAIN output of the optimized version.
INFO : Starting task [Stage-5:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20200411044418_821a2232-e735-4cfb-9397-10da41ce7570); Time taken: 0.006 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| Statistics: Num rows: 7 Data size: 2086 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: 'a' (type: string) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 7 Data size: 595 Basic stats: COMPLETE Column stats: COMPLETE |
| Union |
| Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: _col0 (type: string) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
| Group By Operator |
| aggregations: count(1) |
| keys: _col0 (type: string) |
| mode: hash |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| Reduce Output Operator |
| key expressions: _col0 (type: string) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: string) |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| value expressions: _col1 (type: bigint) |
| TableScan |
| alias: b |
| Statistics: Num rows: 2 Data size: 400 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: 'b' (type: string) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 2 Data size: 170 Basic stats: COMPLETE Column stats: COMPLETE |
| Union |
| Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: _col0 (type: string) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
| Group By Operator |
| aggregations: count(1) |
| keys: _col0 (type: string) |
| mode: hash |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| Reduce Output Operator |
| key expressions: _col0 (type: string) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: string) |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| value expressions: _col1 (type: bigint) |
| TableScan |
| alias: c |
| Statistics: Num rows: 4 Data size: 1160 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: 'c' (type: string) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 4 Data size: 340 Basic stats: COMPLETE Column stats: COMPLETE |
| Union |
| Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: _col0 (type: string) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
| Group By Operator |
| aggregations: count(1) |
| keys: _col0 (type: string) |
| mode: hash |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| Reduce Output Operator |
| key expressions: _col0 (type: string) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: string) |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| value expressions: _col1 (type: bigint) |
| TableScan |
| alias: d |
| Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: 'd' (type: string) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 85 Basic stats: COMPLETE Column stats: COMPLETE |
| Union |
| Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: _col0 (type: string) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
| Group By Operator |
| aggregations: count(1) |
| keys: _col0 (type: string) |
| mode: hash |
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| Reduce Output Operator |
| key expressions: _col0 (type: string) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: string) |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| value expressions: _col1 (type: bigint) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(VALUE._col0) |
| keys: KEY._col0 (type: string) |
| mode: mergepartial |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
129 rows selected (0.219 seconds)
This time the query runs in only two stages: Stage-0 and Stage-1.
Why does this version produce fewer stages than the previous one? Because we UNION ALL the data first and aggregate afterwards: the four tables are read in a single pass, with the type column distinguishing their rows, which is more efficient. The trade-off is that this one stage needs more mappers, since it scans four tables at once.
Now let's compare the execution times of the two queries (on a small data set).
The less efficient approach, the first SQL:
INFO : Starting Job = job_1586423165261_0033, Tracking URL = http://cdh-manager:8088/proxy/application_1586423165261_0033/
INFO : Kill Command = /opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/hadoop/bin/hadoop job -kill job_1586423165261_0033
INFO : Hadoop job information for Stage-2: number of mappers: 4; number of reducers: 0
INFO : 2020-04-11 04:52:45,140 Stage-2 map = 0%, reduce = 0%
INFO : 2020-04-11 04:52:51,303 Stage-2 map = 25%, reduce = 0%, Cumulative CPU 1.56 sec
INFO : 2020-04-11 04:52:56,430 Stage-2 map = 50%, reduce = 0%, Cumulative CPU 3.17 sec
INFO : 2020-04-11 04:53:00,559 Stage-2 map = 75%, reduce = 0%, Cumulative CPU 4.9 sec
INFO : 2020-04-11 04:53:04,663 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 6.44 sec
INFO : MapReduce Total cumulative CPU time: 6 seconds 440 msec
INFO : Ended Job = job_1586423165261_0033
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.85 sec HDFS Read: 11511 HDFS Write: 116 SUCCESS
INFO : Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 4.7 sec HDFS Read: 10963 HDFS Write: 116 SUCCESS
INFO : Stage-Stage-4: Map: 1 Reduce: 1 Cumulative CPU: 4.09 sec HDFS Read: 11238 HDFS Write: 116 SUCCESS
INFO : Stage-Stage-5: Map: 1 Reduce: 1 Cumulative CPU: 4.41 sec HDFS Read: 10512 HDFS Write: 116 SUCCESS
INFO : Stage-Stage-2: Map: 4 Cumulative CPU: 6.44 sec HDFS Read: 20912 HDFS Write: 412 SUCCESS
INFO : Total MapReduce CPU Time Spent: 24 seconds 490 msec
INFO : Completed executing command(queryId=hive_20200411045047_295d368a-500d-4225-9d8e-e83ed3e51321); Time taken: 138.08 seconds
INFO : OK
+-----------+----------+
| _u1.type | _u1.num |
+-----------+----------+
| a | 7 |
| b | 2 |
| c | 4 |
| d | 1 |
+-----------+----------+
4 rows selected (138.285 seconds)
The more efficient approach:
INFO : Starting Job = job_1586423165261_0034, Tracking URL = http://cdh-manager:8088/proxy/application_1586423165261_0034/
INFO : Kill Command = /opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/hadoop/bin/hadoop job -kill job_1586423165261_0034
INFO : Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1
INFO : 2020-04-11 04:54:41,047 Stage-1 map = 0%, reduce = 0%
INFO : 2020-04-11 04:54:49,371 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 3.07 sec
INFO : 2020-04-11 04:54:54,558 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 5.37 sec
INFO : 2020-04-11 04:54:58,667 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 7.71 sec
INFO : 2020-04-11 04:55:02,777 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.55 sec
INFO : 2020-04-11 04:55:08,914 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.25 sec
INFO : MapReduce Total cumulative CPU time: 11 seconds 250 msec
INFO : Ended Job = job_1586423165261_0034
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 4 Reduce: 1 Cumulative CPU: 11.25 sec HDFS Read: 45492 HDFS Write: 151 SUCCESS
INFO : Total MapReduce CPU Time Spent: 11 seconds 250 msec
INFO : Completed executing command(queryId=hive_20200411045431_e528deb6-db83-4556-944d-802cdfa20d76); Time taken: 38.374 seconds
INFO : OK
+-------+------+
| type | _c1 |
+-------+------+
| a | 7 |
| b | 2 |
| c | 4 |
| d | 1 |
+-------+------+
4 rows selected (38.545 seconds)
The second approach is clearly faster: with fewer stages there are fewer rounds of resource allocation, so the total elapsed time drops.
2) Use the same JOIN condition across multiple tables to reduce the number of jobs
Suppose we have three tables, A, B, and C, which can be joined on either user_id or mobile (both columns are unique). First, let's build some test data with the following SQL:
CREATE TABLE IF NOT EXISTS join_multi_a(
user_id BIGINT
,mobile STRING
,sex BIGINT
);
CREATE TABLE IF NOT EXISTS join_multi_b(
user_id BIGINT
,mobile STRING
,user_name STRING
);
CREATE TABLE IF NOT EXISTS join_multi_c(
user_id BIGINT
,mobile STRING
,type STRING
);
INSERT OVERWRITE TABLE join_multi_a VALUES
(1, '123456', 0)
,(2, '234567', 1)
,(3, '345678', 1)
;
INSERT OVERWRITE TABLE join_multi_b VALUES
(1, '123456', 'sunzhenhua')
,(2, '234567', 'zyq')
,(3, '345678', 'zz')
;
INSERT OVERWRITE TABLE join_multi_c VALUES
(1, '123456', 'a')
,(2, '234567', 'b')
,(3, '345678', 'c')
;
Note: because the data set is tiny, Hive 0.11 and later would automatically apply the map-join optimization.
We need to disable it:
set hive.auto.convert.join=false;
For example, with map join enabled, the EXPLAIN output of the following statement would become:
SELECT
a.user_id
,a.mobile
,a.sex
,b.user_name
,c.type
FROM join_multi_a AS a
LEFT JOIN join_multi_b AS b
ON a.user_id = b.user_id
LEFT JOIN join_multi_c AS c
ON b.mobile = c.mobile
;
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-7 is a root stage |
| Stage-5 depends on stages: Stage-7 |
| Stage-0 depends on stages: Stage-5 |
| |
| STAGE PLANS: |
| Stage: Stage-7 |
| Map Reduce Local Work |
| Alias -> Map Local Tables: |
| b |
| Fetch Operator |
| limit: -1 |
| c |
| Fetch Operator |
| limit: -1 |
| Alias -> Map Local Operator Tree: |
| b |
| TableScan |
| alias: b |
| Statistics: Num rows: 3 Data size: 42 Basic stats: COMPLETE Column stats: NONE |
| HashTable Sink Operator |
| keys: |
| 0 user_id (type: bigint) |
| 1 user_id (type: bigint) |
| c |
| TableScan |
| alias: c |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| HashTable Sink Operator |
| keys: |
| 0 _col7 (type: string) |
| 1 mobile (type: string) |
| |
| Stage: Stage-5 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| Map Join Operator |
| condition map: |
| Left Outer Join0 to 1 |
| keys: |
| 0 user_id (type: bigint) |
| 1 user_id (type: bigint) |
| outputColumnNames: _col0, _col1, _col2, _col7, _col8 |
| Statistics: Num rows: 3 Data size: 33 Basic stats: COMPLETE Column stats: NONE |
| Map Join Operator |
| condition map: |
| Left Outer Join0 to 1 |
| keys: |
| 0 _col7 (type: string) |
| 1 mobile (type: string) |
| outputColumnNames: _col0, _col1, _col2, _col8, _col14 |
| Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: bigint), _col8 (type: string), _col14 (type: string) |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| Local Work: |
| Map Reduce Local Work |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
75 rows selected (0.181 seconds)
Here we want to fetch all of a user's information. Since either user_id or mobile can serve as the join key, there are two ways to write the query.
1. The inefficient version: join on user_id first, then on mobile
EXPLAIN
SELECT
a.user_id
,a.mobile
,a.sex
,b.user_name
,c.type
FROM join_multi_a AS a
LEFT JOIN join_multi_b AS b
ON a.user_id = b.user_id
LEFT JOIN join_multi_c AS c
ON b.mobile = c.mobile
;
2. Join on user_id in both places
EXPLAIN
SELECT
a.user_id
,a.mobile
,a.sex
,b.user_name
,c.type
FROM join_multi_a AS a
LEFT JOIN join_multi_b AS b
ON a.user_id = b.user_id
LEFT JOIN join_multi_c AS c
ON a.user_id = c.user_id
;
Let's look at the EXPLAIN results of these two statements with map join disabled.
Joining on user_id first, then on mobile:
INFO : Starting task [Stage-5:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20200411063314_438501bd-09fb-498a-8a33-286363c82b5e); Time taken: 0.004 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-2 depends on stages: Stage-1 |
| Stage-0 depends on stages: Stage-2 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| value expressions: mobile (type: string), sex (type: bigint) |
| TableScan |
| alias: b |
| Statistics: Num rows: 3 Data size: 42 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 3 Data size: 42 Basic stats: COMPLETE Column stats: NONE |
| value expressions: mobile (type: string), user_name (type: string) |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Outer Join0 to 1 |
| keys: |
| 0 user_id (type: bigint) |
| 1 user_id (type: bigint) |
| outputColumnNames: _col0, _col1, _col2, _col7, _col8 |
| Statistics: Num rows: 3 Data size: 33 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Reduce Output Operator |
| key expressions: _col7 (type: string) |
| sort order: + |
| Map-reduce partition columns: _col7 (type: string) |
| Statistics: Num rows: 3 Data size: 33 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: bigint), _col8 (type: string) |
| TableScan |
| alias: c |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: mobile (type: string) |
| sort order: + |
| Map-reduce partition columns: mobile (type: string) |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| value expressions: type (type: string) |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Outer Join0 to 1 |
| keys: |
| 0 _col7 (type: string) |
| 1 mobile (type: string) |
| outputColumnNames: _col0, _col1, _col2, _col8, _col14 |
| Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: bigint), _col8 (type: string), _col14 (type: string) |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
89 rows selected (0.161 seconds)
Joining on user_id in both places:
INFO : Starting task [Stage-4:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20200411063216_19f3048a-8c39-41c9-ad08-c49be9c06e6d); Time taken: 0.005 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| value expressions: mobile (type: string), sex (type: bigint) |
| TableScan |
| alias: b |
| Statistics: Num rows: 3 Data size: 42 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 3 Data size: 42 Basic stats: COMPLETE Column stats: NONE |
| value expressions: user_name (type: string) |
| TableScan |
| alias: c |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| value expressions: type (type: string) |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Outer Join0 to 1 |
| Left Outer Join0 to 2 |
| keys: |
| 0 user_id (type: bigint) |
| 1 user_id (type: bigint) |
| 2 user_id (type: bigint) |
| outputColumnNames: _col0, _col1, _col2, _col8, _col14 |
| Statistics: Num rows: 6 Data size: 66 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: bigint), _col8 (type: string), _col14 (type: string) |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 6 Data size: 66 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 6 Data size: 66 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
64 rows selected (0.178 seconds)
As you can see, joining on user_id first and then on mobile produces one extra stage,
while joining on user_id both times uses the same key, so the three-table join runs in a single stage.
Now let's compare the execution times of the two approaches.
user_id first, then mobile:
SELECT
a.user_id
,a.mobile
,a.sex
,b.user_name
,c.type
FROM join_multi_a AS a
LEFT JOIN join_multi_b AS b
ON a.user_id = b.user_id
LEFT JOIN join_multi_c AS c
ON b.mobile = c.mobile
;
INFO : Starting task [Stage-1:MAPRED] in parallel
INFO : Launching Job 2 out of 2
INFO : Starting task [Stage-2:MAPRED] in parallel
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 5.34 sec HDFS Read: 14226 HDFS Write: 213 SUCCESS
INFO : Stage-Stage-2: Map: 2 Reduce: 1 Cumulative CPU: 5.86 sec HDFS Read: 14889 HDFS Write: 180 SUCCESS
INFO : Total MapReduce CPU Time Spent: 11 seconds 200 msec
INFO : Completed executing command(queryId=hive_20200411063838_e348957a-1c29-49de-a45b-be69858d8a61); Time taken: 62.026 seconds
INFO : OK
+------------+-----------+--------+--------------+---------+
| a.user_id | a.mobile | a.sex | b.user_name | c.type |
+------------+-----------+--------+--------------+---------+
| 1 | 123456 | 0 | sunzhenhua | a |
| 2 | 234567 | 1 | zyq | b |
| 3 | 345678 | 1 | zz | c |
+------------+-----------+--------+--------------+---------+
3 rows selected (62.167 seconds)
Both joins on user_id:
SELECT
a.user_id
,a.mobile
,a.sex
,b.user_name
,c.type
FROM join_multi_a AS a
LEFT JOIN join_multi_b AS b
ON a.user_id = b.user_id
LEFT JOIN join_multi_c AS c
ON a.user_id = c.user_id
;
INFO : Starting task [Stage-1:MAPRED] in parallel
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 3 Reduce: 1 Cumulative CPU: 6.53 sec HDFS Read: 25156 HDFS Write: 180 SUCCESS
INFO : Total MapReduce CPU Time Spent: 6 seconds 530 msec
INFO : Completed executing command(queryId=hive_20200411064014_ddfb093d-a486-411b-8fff-9af2b6fb27d4); Time taken: 32.012 seconds
INFO : OK
+------------+-----------+--------+--------------+---------+
| a.user_id | a.mobile | a.sex | b.user_name | c.type |
+------------+-----------+--------+--------------+---------+
| 1 | 123456 | 0 | sunzhenhua | a |
| 2 | 234567 | 1 | zyq | b |
| 3 | 345678 | 1 | zz | c |
+------------+-----------+--------+--------------+---------+
3 rows selected (32.123 seconds)
Joining on the same condition is clearly faster than joining on different conditions.
To summarize:
With the same JOIN condition there are fewer stages, hence fewer MR jobs, so the query runs faster.
With different JOIN conditions there are more stages, adding an extra MR job's computation time plus its resource-allocation overhead, so the query runs slower.
The MR flow of the join below makes this clearer and more intuitive.
For key a,
the final reduce step is equivalent to three nested loops:
for x in List A {[a,3,4,A]}   (List A has only one element)
    for y in List B {[a,2,B]}   (List B has only one element)
        for z in List C {[a,3,C]}   (List C has only one element)
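The three-level loop above can be sketched as a small Python simulation of the reducer's merge for key a. This is a conceptual illustration, not Hive internals; the row values are the single-element lists from the example:

```python
# Rows each table contributes for join key 'a'; in a reduce-side join on a
# common key, the reducer receives one such list per table.
list_a = [("a", 3, 4, "A")]   # rows from table A with key 'a'
list_b = [("a", 2, "B")]      # rows from table B with key 'a'
list_c = [("a", 3, "C")]      # rows from table C with key 'a'

# The merge is the cross product of the three lists: exactly the triple
# nested loop described above, all inside one reduce call (one stage).
joined = []
for x in list_a:              # outer loop over table A's rows
    for y in list_b:          #   middle loop over table B's rows
        for z in list_c:      #     inner loop over table C's rows
            joined.append(x + y[1:] + z[1:])  # merge, dropping repeated keys

assert joined == [("a", 3, 4, "A", 2, "B", 3, "C")]
```

Because every table shares the key, a single shuffle brings all three row lists to the same reducer; with different join keys, the second join would need its own shuffle, and hence its own MR job.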