Hive_Hive Optimization Guide_Scenario 2_Reducing the Number of Jobs

Outline: https://mp.csdn.net/console/editor/html/105334641

Test table and test data

+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `datacube_salary_org`(                |
|   `company_name` string COMMENT '????',            |
|   `dep_name` string COMMENT '????',                |
|   `user_id` bigint COMMENT '??id',                 |
|   `user_name` string COMMENT '????',               |
|   `salary` decimal(10,2) COMMENT '??',             |
|   `create_time` date COMMENT '????',               |
|   `update_time` date COMMENT '????')               |
| PARTITIONED BY (                                   |
|   `pt` string COMMENT '????')                      |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
| WITH SERDEPROPERTIES (                             |
|   'field.delim'=',',                               |
|   'serialization.format'=',')                      |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.mapred.TextInputFormat'       |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                           |
|   'hdfs://cdh-manager:8020/user/hive/warehouse/data_warehouse_test.db/datacube_salary_org' |
| TBLPROPERTIES (                                    |
|   'transient_lastDdlTime'='1586310488')            |
+----------------------------------------------------+
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| datacube_salary_org.company_name  | datacube_salary_org.dep_name  | datacube_salary_org.user_id  | datacube_salary_org.user_name  | datacube_salary_org.salary  | datacube_salary_org.create_time  | datacube_salary_org.update_time  | datacube_salary_org.pt  |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| s.zh                              | engineer                      | 1                            | szh                            | 28000.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| s.zh                              | engineer                      | 2                            | zyq                            | 26000.00                    | 2020-04-03                       | 2020-04-03                       | 20200405                |
| s.zh                              | tester                        | 3                            | gkm                            | 20000.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| x.qx                              | finance                       | 4                            | pip                            | 13400.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| x.qx                              | finance                       | 5                            | kip                            | 24500.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| x.qx                              | finance                       | 6                            | zxxc                           | 13000.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| x.qx                              | kiccp                         | 7                            | xsz                            | 8600.00                     | 2020-04-07                       | 2020-04-07                       | 20200405                |
| s.zh                              | engineer                      | 1                            | szh                            | 28000.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| s.zh                              | engineer                      | 2                            | zyq                            | 26000.00                    | 2020-04-03                       | 2020-04-03                       | 20200406                |
| s.zh                              | tester                        | 3                            | gkm                            | 20000.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| x.qx                              | finance                       | 4                            | pip                            | 13400.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| x.qx                              | finance                       | 5                            | kip                            | 24500.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| x.qx                              | finance                       | 6                            | zxxc                           | 13000.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| x.qx                              | kiccp                         | 7                            | xsz                            | 8600.00                     | 2020-04-07                       | 2020-04-07                       | 20200406                |
| s.zh                              | enginer                       | 1                            | szh                            | 28000.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| s.zh                              | enginer                       | 2                            | zyq                            | 26000.00                    | 2020-04-03                       | 2020-04-03                       | 20200407                |
| s.zh                              | tester                        | 3                            | gkm                            | 20000.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| x.qx                              | finance                       | 4                            | pip                            | 13400.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| x.qx                              | finance                       | 5                            | kip                            | 24500.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| x.qx                              | finance                       | 6                            | zxxc                           | 13000.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| x.qx                              | kiccp                         | 7                            | xsz                            | 8600.00                     | 2020-04-07                       | 2020-04-07                       | 20200407                |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+

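Scenario 2. Reducing the number of jobs

1) Using UNION ALL cleverly to reduce the number of jobs

2) Using a JOIN condition shared by multiple tables to reduce the number of jobs

1) Using UNION ALL cleverly to reduce the number of jobs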

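Consider the following scenario: we need to count the rows in each of several tables.

We could write one SQL statement per table, but that is inefficient and not worth discussing.

Alternatively, we can count each table separately and combine the results with UNION ALL, but this is also fairly inefficient.

For example: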

SELECT 
 'a' AS type
 ,COUNT(1) AS num
FROM datacube_salary_basic_aggr AS a
UNION ALL
SELECT
 'b' AS type
 ,COUNT(1) AS num
FROM datacube_salary_company_aggr AS b
UNION ALL
SELECT 
 'c' AS type
 ,COUNT(1) AS num
FROM datacube_salary_dep_aggr AS c
UNION ALL
SELECT 
 'd' AS type
 ,COUNT(1) AS num
FROM datacube_salary_total_aggr AS d
;

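A better approach is to read the rows of all the tables first, tag each row with its source table, and only then aggregate.

Because several tables serve as input, there will be multiple mappers.

Example: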

SELECT 
 type
 ,COUNT(1)
FROM
(
SELECT 
 'a' AS type
 ,total_salary 
FROM datacube_salary_basic_aggr AS a
UNION ALL
SELECT
 'b' AS type
 ,total_salary
FROM datacube_salary_company_aggr AS b
UNION ALL
SELECT 
 'c' AS type
 ,total_salary
FROM datacube_salary_dep_aggr AS c
UNION ALL
SELECT 
 'd' AS type
 ,total_salary
FROM datacube_salary_total_aggr AS d
) AS tmp
GROUP BY
 type
;
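Before comparing plans, it can help to confirm that the two query shapes return the same counts. The following is a minimal sketch, assuming an in-memory SQLite database as a stand-in for Hive (so it says nothing about jobs or stages). The row counts per table follow the "Num rows" statistics in the EXPLAIN output of this article; the total_salary values and the helper names are hypothetical.

```python
import sqlite3

# Row counts per table mirror the "Num rows" statistics in the EXPLAIN output;
# the total_salary values themselves are made up for illustration.
TABLES = {
    "datacube_salary_basic_aggr": 7,
    "datacube_salary_company_aggr": 2,
    "datacube_salary_dep_aggr": 4,
    "datacube_salary_total_aggr": 1,
}
TAGS = dict(zip(TABLES, "abcd"))  # table name -> type tag


def build_db():
    """Create the four tables in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    for name, n_rows in TABLES.items():
        conn.execute(f"CREATE TABLE {name} (total_salary REAL)")
        conn.executemany(
            f"INSERT INTO {name} VALUES (?)",
            [(1000.0 * i,) for i in range(n_rows)],
        )
    return conn


def count_then_union(conn):
    """First form: aggregate each table, then UNION ALL the results."""
    sql = " UNION ALL ".join(
        f"SELECT '{TAGS[t]}' AS type, COUNT(1) AS num FROM {t}" for t in TABLES
    )
    return sorted(conn.execute(sql).fetchall())


def union_then_count(conn):
    """Second form: tag and UNION ALL the rows, then aggregate once."""
    inner = " UNION ALL ".join(
        f"SELECT '{TAGS[t]}' AS type, total_salary FROM {t}" for t in TABLES
    )
    sql = f"SELECT type, COUNT(1) AS num FROM ({inner}) AS tmp GROUP BY type"
    return sorted(conn.execute(sql).fetchall())


conn = build_db()
first = count_then_union(conn)
second = union_then_count(conn)
print(first == second, second)
```

One difference to keep in mind: if a table is empty, the first form still returns a row with a count of 0 for it, while the GROUP BY form returns no row for that tag.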

Let's use EXPLAIN to see how exactly the two approaches differ.

INFO  : Starting task [Stage-6:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200411043838_47d043d4-8b22-433c-a0be-4714aed8ab94); Time taken: 0.008 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-2 depends on stages: Stage-1, Stage-3, Stage-4, Stage-5 |
|   Stage-3 is a root stage                          |
|   Stage-4 is a root stage                          |
|   Stage-5 is a root stage                          |
|   Stage-0 depends on stages: Stage-2               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             Statistics: Num rows: 7 Data size: 2086 Basic stats: COMPLETE Column stats: COMPLETE |
|             Select Operator                        |
|               Statistics: Num rows: 7 Data size: 2086 Basic stats: COMPLETE Column stats: COMPLETE |
|               Group By Operator                    |
|                 aggregations: count(1)             |
|                 mode: hash                         |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
|                 Reduce Output Operator             |
|                   sort order:                      |
|                   Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
|                   value expressions: _col0 (type: bigint) |
|       Execution mode: vectorized                   |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: count(VALUE._col0)         |
|           mode: mergepartial                       |
|           outputColumnNames: _col0                 |
|           Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
|           Select Operator                          |
|             expressions: 'a' (type: string), _col0 (type: bigint) |
|             outputColumnNames: _col0, _col1        |
|             Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
|             File Output Operator                   |
|               compressed: false                    |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-2                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             Union                                  |
|               Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
|               File Output Operator                 |
|                 compressed: false                  |
|                 Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
|                 table:                             |
|                     input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                     output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|           TableScan                                |
|             Union                                  |
|               Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
|               File Output Operator                 |
|                 compressed: false                  |
|                 Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
|                 table:                             |
|                     input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                     output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|           TableScan                                |
|             Union                                  |
|               Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
|               File Output Operator                 |
|                 compressed: false                  |
|                 Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
|                 table:                             |
|                     input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                     output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|           TableScan                                |
|             Union                                  |
|               Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
|               File Output Operator                 |
|                 compressed: false                  |
|                 Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
|                 table:                             |
|                     input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                     output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-3                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: b                               |
|             Statistics: Num rows: 2 Data size: 400 Basic stats: COMPLETE Column stats: COMPLETE |
|             Select Operator                        |
|               Statistics: Num rows: 2 Data size: 400 Basic stats: COMPLETE Column stats: COMPLETE |
|               Group By Operator                    |
|                 aggregations: count(1)             |
|                 mode: hash                         |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
|                 Reduce Output Operator             |
|                   sort order:                      |
|                   Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
|                   value expressions: _col0 (type: bigint) |
|       Execution mode: vectorized                   |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: count(VALUE._col0)         |
|           mode: mergepartial                       |
|           outputColumnNames: _col0                 |
|           Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
|           Select Operator                          |
|             expressions: 'b' (type: string), _col0 (type: bigint) |
|             outputColumnNames: _col0, _col1        |
|             Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
|             File Output Operator                   |
|               compressed: false                    |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-4                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: c                               |
|             Statistics: Num rows: 4 Data size: 1160 Basic stats: COMPLETE Column stats: COMPLETE |
|             Select Operator                        |
|               Statistics: Num rows: 4 Data size: 1160 Basic stats: COMPLETE Column stats: COMPLETE |
|               Group By Operator                    |
|                 aggregations: count(1)             |
|                 mode: hash                         |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
|                 Reduce Output Operator             |
|                   sort order:                      |
|                   Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
|                   value expressions: _col0 (type: bigint) |
|       Execution mode: vectorized                   |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: count(VALUE._col0)         |
|           mode: mergepartial                       |
|           outputColumnNames: _col0                 |
|           Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
|           Select Operator                          |
|             expressions: 'c' (type: string), _col0 (type: bigint) |
|             outputColumnNames: _col0, _col1        |
|             Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
|             File Output Operator                   |
|               compressed: false                    |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-5                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: d                               |
|             Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE |
|             Select Operator                        |
|               Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE |
|               Group By Operator                    |
|                 aggregations: count(1)             |
|                 mode: hash                         |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
|                 Reduce Output Operator             |
|                   sort order:                      |
|                   Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
|                   value expressions: _col0 (type: bigint) |
|       Execution mode: vectorized                   |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: count(VALUE._col0)         |
|           mode: mergepartial                       |
|           outputColumnNames: _col0                 |
|           Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
|           Select Operator                          |
|             expressions: 'd' (type: string), _col0 (type: bigint) |
|             outputColumnNames: _col0, _col1        |
|             Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
|             File Output Operator                   |
|               compressed: false                    |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
199 rows selected (0.242 seconds)

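We can see that the query is split into 6 stages.

Each SELECT '' AS type, COUNT(1) FROM xxx; is one stage. With 4 tables, that gives 4 jobs: Stage-1, Stage-3, Stage-4, and Stage-5.

Stage-2 is the UNION job, which combines the results above. Stage-0 then fetches the final result.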
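The stage count can also be read off mechanically. Below is a small sketch that parses the STAGE DEPENDENCIES section of the plan above; the function name parse_stage_dependencies is hypothetical, and the input is simply copied from the EXPLAIN output.

```python
import re

# STAGE DEPENDENCIES section copied from the first EXPLAIN output above.
STAGE_DEPENDENCIES = """
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1, Stage-3, Stage-4, Stage-5
Stage-3 is a root stage
Stage-4 is a root stage
Stage-5 is a root stage
Stage-0 depends on stages: Stage-2
"""


def parse_stage_dependencies(text):
    """Return {stage: [stages it depends on]}; root stages map to []."""
    deps = {}
    for line in text.splitlines():
        # Tolerate the "|" borders that beeline draws around each line.
        line = line.strip().strip("|").strip()
        root = re.fullmatch(r"(Stage-\d+) is a root stage", line)
        child = re.fullmatch(r"(Stage-\d+) depends on stages: (.+)", line)
        if root:
            deps[root.group(1)] = []
        elif child:
            deps[child.group(1)] = [s.strip() for s in child.group(2).split(",")]
    return deps


deps = parse_stage_dependencies(STAGE_DEPENDENCIES)
roots = sorted(stage for stage, parents in deps.items() if not parents)
print(len(deps), roots, deps["Stage-2"])
```

The four root stages are the per-table COUNT jobs, and Stage-2 depends on all of them, which matches the reading of the plan given above.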

Now let's look at the EXPLAIN output of the optimized query.

INFO  : Starting task [Stage-5:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200411044418_821a2232-e735-4cfb-9397-10da41ce7570); Time taken: 0.006 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             Statistics: Num rows: 7 Data size: 2086 Basic stats: COMPLETE Column stats: COMPLETE |
|             Select Operator                        |
|               expressions: 'a' (type: string)      |
|               outputColumnNames: _col0             |
|               Statistics: Num rows: 7 Data size: 595 Basic stats: COMPLETE Column stats: COMPLETE |
|               Union                                |
|                 Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
|                 Select Operator                    |
|                   expressions: _col0 (type: string) |
|                   outputColumnNames: _col0         |
|                   Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
|                   Group By Operator                |
|                     aggregations: count(1)         |
|                     keys: _col0 (type: string)     |
|                     mode: hash                     |
|                     outputColumnNames: _col0, _col1 |
|                     Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
|                     Reduce Output Operator         |
|                       key expressions: _col0 (type: string) |
|                       sort order: +                |
|                       Map-reduce partition columns: _col0 (type: string) |
|                       Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
|                       value expressions: _col1 (type: bigint) |
|           TableScan                                |
|             alias: b                               |
|             Statistics: Num rows: 2 Data size: 400 Basic stats: COMPLETE Column stats: COMPLETE |
|             Select Operator                        |
|               expressions: 'b' (type: string)      |
|               outputColumnNames: _col0             |
|               Statistics: Num rows: 2 Data size: 170 Basic stats: COMPLETE Column stats: COMPLETE |
|               Union                                |
|                 Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
|                 Select Operator                    |
|                   expressions: _col0 (type: string) |
|                   outputColumnNames: _col0         |
|                   Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
|                   Group By Operator                |
|                     aggregations: count(1)         |
|                     keys: _col0 (type: string)     |
|                     mode: hash                     |
|                     outputColumnNames: _col0, _col1 |
|                     Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
|                     Reduce Output Operator         |
|                       key expressions: _col0 (type: string) |
|                       sort order: +                |
|                       Map-reduce partition columns: _col0 (type: string) |
|                       Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
|                       value expressions: _col1 (type: bigint) |
|           TableScan                                |
|             alias: c                               |
|             Statistics: Num rows: 4 Data size: 1160 Basic stats: COMPLETE Column stats: COMPLETE |
|             Select Operator                        |
|               expressions: 'c' (type: string)      |
|               outputColumnNames: _col0             |
|               Statistics: Num rows: 4 Data size: 340 Basic stats: COMPLETE Column stats: COMPLETE |
|               Union                                |
|                 Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
|                 Select Operator                    |
|                   expressions: _col0 (type: string) |
|                   outputColumnNames: _col0         |
|                   Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
|                   Group By Operator                |
|                     aggregations: count(1)         |
|                     keys: _col0 (type: string)     |
|                     mode: hash                     |
|                     outputColumnNames: _col0, _col1 |
|                     Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
|                     Reduce Output Operator         |
|                       key expressions: _col0 (type: string) |
|                       sort order: +                |
|                       Map-reduce partition columns: _col0 (type: string) |
|                       Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
|                       value expressions: _col1 (type: bigint) |
|           TableScan                                |
|             alias: d                               |
|             Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE |
|             Select Operator                        |
|               expressions: 'd' (type: string)      |
|               outputColumnNames: _col0             |
|               Statistics: Num rows: 1 Data size: 85 Basic stats: COMPLETE Column stats: COMPLETE |
|               Union                                |
|                 Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
|                 Select Operator                    |
|                   expressions: _col0 (type: string) |
|                   outputColumnNames: _col0         |
|                   Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
|                   Group By Operator                |
|                     aggregations: count(1)         |
|                     keys: _col0 (type: string)     |
|                     mode: hash                     |
|                     outputColumnNames: _col0, _col1 |
|                     Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
|                     Reduce Output Operator         |
|                       key expressions: _col0 (type: string) |
|                       sort order: +                |
|                       Map-reduce partition columns: _col0 (type: string) |
|                       Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
|                       value expressions: _col1 (type: bigint) |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: count(VALUE._col0)         |
|           keys: KEY._col0 (type: string)           |
|           mode: mergepartial                       |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
129 rows selected (0.219 seconds)

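As you can see, this plan has only two stages: Stage-0 and Stage-1.

Why does it have fewer stages than the previous version? Because we UNION ALL the data first and aggregate afterwards, the rows from the different tables are told apart by the type column and read in a single scan, which is more efficient. The trade-off is that this one stage needs more mappers, since it scans the data of all 4 tables at once.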
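The difference between the two plans can be mimicked in plain Python. This is only a sketch under assumptions: the helper names (`count_then_union`, `union_then_count`), the in-memory "tables", and the idea of counting one "job" per aggregation pass are illustrative and are not Hive APIs. The row counts are chosen to match the query result shown in this article.

```python
from collections import Counter

# Hypothetical in-memory stand-ins for the four inputs. Each row is just a
# placeholder value; the counts mirror the result shown in this article.
tables = {
    "a": ["row"] * 7,
    "b": ["row"] * 2,
    "c": ["row"] * 4,
    "d": ["row"] * 1,
}


def count_then_union(tables):
    """Aggregate each input separately, then combine the partial results.

    Each per-input aggregation is modelled as one job and the final
    UNION ALL as one more job (an assumption of this sketch).
    """
    jobs = 0
    result = {}
    for type_tag, rows in tables.items():
        result[type_tag] = len(rows)  # one aggregation pass per input
        jobs += 1
    jobs += 1  # the pass that unions the partial results
    return result, jobs


def union_then_count(tables):
    """Tag every row with its type, union everything, aggregate once."""
    tagged = [type_tag for type_tag, rows in tables.items() for _ in rows]
    return dict(Counter(tagged)), 1  # a single pass, hence a single job


slow_result, slow_jobs = count_then_union(tables)
fast_result, fast_jobs = union_then_count(tables)
print(slow_result == fast_result, slow_jobs, fast_jobs)
```

Both functions return the same counts; only the number of passes over the data differs, which is the point of the rewrite.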

Next, let's compare the execution time of the two (on a small data set).

The less efficient approach, the first SQL:

INFO  : Starting Job = job_1586423165261_0033, Tracking URL = http://cdh-manager:8088/proxy/application_1586423165261_0033/
INFO  : Kill Command = /opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/hadoop/bin/hadoop job  -kill job_1586423165261_0033
INFO  : Hadoop job information for Stage-2: number of mappers: 4; number of reducers: 0
INFO  : 2020-04-11 04:52:45,140 Stage-2 map = 0%,  reduce = 0%
INFO  : 2020-04-11 04:52:51,303 Stage-2 map = 25%,  reduce = 0%, Cumulative CPU 1.56 sec
INFO  : 2020-04-11 04:52:56,430 Stage-2 map = 50%,  reduce = 0%, Cumulative CPU 3.17 sec
INFO  : 2020-04-11 04:53:00,559 Stage-2 map = 75%,  reduce = 0%, Cumulative CPU 4.9 sec
INFO  : 2020-04-11 04:53:04,663 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 6.44 sec
INFO  : MapReduce Total cumulative CPU time: 6 seconds 440 msec
INFO  : Ended Job = job_1586423165261_0033
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.85 sec   HDFS Read: 11511 HDFS Write: 116 SUCCESS
INFO  : Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 4.7 sec   HDFS Read: 10963 HDFS Write: 116 SUCCESS
INFO  : Stage-Stage-4: Map: 1  Reduce: 1   Cumulative CPU: 4.09 sec   HDFS Read: 11238 HDFS Write: 116 SUCCESS
INFO  : Stage-Stage-5: Map: 1  Reduce: 1   Cumulative CPU: 4.41 sec   HDFS Read: 10512 HDFS Write: 116 SUCCESS
INFO  : Stage-Stage-2: Map: 4   Cumulative CPU: 6.44 sec   HDFS Read: 20912 HDFS Write: 412 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 24 seconds 490 msec
INFO  : Completed executing command(queryId=hive_20200411045047_295d368a-500d-4225-9d8e-e83ed3e51321); Time taken: 138.08 seconds
INFO  : OK
+-----------+----------+
| _u1.type  | _u1.num  |
+-----------+----------+
| a         | 7        |
| b         | 2        |
| c         | 4        |
| d         | 1        |
+-----------+----------+
4 rows selected (138.285 seconds)

The more efficient version:

INFO  : Starting Job = job_1586423165261_0034, Tracking URL = http://cdh-manager:8088/proxy/application_1586423165261_0034/
INFO  : Kill Command = /opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/hadoop/bin/hadoop job  -kill job_1586423165261_0034
INFO  : Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1
INFO  : 2020-04-11 04:54:41,047 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-04-11 04:54:49,371 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 3.07 sec
INFO  : 2020-04-11 04:54:54,558 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 5.37 sec
INFO  : 2020-04-11 04:54:58,667 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 7.71 sec
INFO  : 2020-04-11 04:55:02,777 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 9.55 sec
INFO  : 2020-04-11 04:55:08,914 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 11.25 sec
INFO  : MapReduce Total cumulative CPU time: 11 seconds 250 msec
INFO  : Ended Job = job_1586423165261_0034
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 4  Reduce: 1   Cumulative CPU: 11.25 sec   HDFS Read: 45492 HDFS Write: 151 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 11 seconds 250 msec
INFO  : Completed executing command(queryId=hive_20200411045431_e528deb6-db83-4556-944d-802cdfa20d76); Time taken: 38.374 seconds
INFO  : OK
+-------+------+
| type  | _c1  |
+-------+------+
| a     | 7    |
| b     | 2    |
| c     | 4    |
| d     | 1    |
+-------+------+
4 rows selected (38.545 seconds)

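The second approach is clearly faster (38.5 seconds versus 138.3 seconds). It is split into fewer stages, so it requests resources fewer times and the total elapsed time drops.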
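To put numbers on this, the job summary lines from the two runs can be tallied with a few lines of Python. This is a rough sketch: the `summarize` helper and the idea of treating wall-clock time minus cumulative CPU time as "non-CPU" time are assumptions made for illustration. Tasks can run in parallel, so that difference is only a crude proxy for scheduling and start-up overhead.

```python
import re

# Job summary lines copied from the two runs above.
slow_log = """\
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.85 sec   HDFS Read: 11511 HDFS Write: 116 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 4.7 sec   HDFS Read: 10963 HDFS Write: 116 SUCCESS
Stage-Stage-4: Map: 1  Reduce: 1   Cumulative CPU: 4.09 sec   HDFS Read: 11238 HDFS Write: 116 SUCCESS
Stage-Stage-5: Map: 1  Reduce: 1   Cumulative CPU: 4.41 sec   HDFS Read: 10512 HDFS Write: 116 SUCCESS
Stage-Stage-2: Map: 4   Cumulative CPU: 6.44 sec   HDFS Read: 20912 HDFS Write: 412 SUCCESS
"""
fast_log = """\
Stage-Stage-1: Map: 4  Reduce: 1   Cumulative CPU: 11.25 sec   HDFS Read: 45492 HDFS Write: 151 SUCCESS
"""

# One match per launched job; the capture group is its cumulative CPU time.
CPU_PATTERN = re.compile(r"Stage-Stage-\d+:.*?Cumulative CPU: ([\d.]+) sec")


def summarize(log_text, wall_seconds):
    """Count launched jobs and split wall time into CPU and non-CPU parts."""
    cpu_times = [float(m.group(1)) for m in CPU_PATTERN.finditer(log_text)]
    total_cpu = round(sum(cpu_times), 2)
    return {
        "jobs": len(cpu_times),
        "cpu_seconds": total_cpu,
        "non_cpu_seconds": round(wall_seconds - total_cpu, 2),
    }


slow = summarize(slow_log, 138.08)   # "Time taken" of the first SQL
fast = summarize(fast_log, 38.374)   # "Time taken" of the second SQL
print(slow)
print(fast)
```

On these logs the slower query launches 5 jobs against 1, and most of the wall-clock gap lies outside CPU time, which is consistent with the explanation that each extra job pays its own start-up and resource-request cost.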

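2) Use the same JOIN condition across multiple tables to reduce the number of jobs

Multiple tables that share the same JOIN condition.

Suppose we have three tables, A, B and C, that can be joined on either user_id or mobile (both are unique). First we build some basic data with the following SQL: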

CREATE TABLE IF NOT EXISTS join_multi_a(
 user_id BIGINT
 ,mobile STRING
 ,sex BIGINT 
);


CREATE TABLE IF NOT EXISTS join_multi_b(
 user_id BIGINT
 ,mobile STRING
 ,user_name STRING 
);


CREATE TABLE IF NOT EXISTS join_multi_c(
 user_id BIGINT
 ,mobile STRING
 ,type STRING
);


INSERT OVERWRITE TABLE join_multi_a VALUES 
 (1, '123456', 0)
 ,(2, '234567', 1)
 ,(3, '345678', 1)
;


INSERT OVERWRITE TABLE join_multi_b VALUES 
 (1, '123456', 'sunzhenhua')
 ,(2, '234567', 'zyq')
 ,(3, '345678', 'zz')
;


INSERT OVERWRITE TABLE join_multi_c VALUES 
 (1, '123456', 'a')
 ,(2, '234567', 'b')
 ,(3, '345678', 'c')
;

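Note: because our data set is very small, Hive 0.11 and later will automatically apply the Map Join optimization (hive.auto.convert.join defaults to true from that version on).

So we need to turn the Map Join optimization off: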

set hive.auto.convert.join=false;

For example, if the Map Join optimization is left enabled, the EXPLAIN output for the following statement looks like this:

SELECT 
 a.user_id 
 ,a.mobile
 ,a.sex
 ,b.user_name
 ,c.type
FROM join_multi_a AS a
LEFT JOIN join_multi_b AS b
 ON a.user_id = b.user_id
LEFT JOIN join_multi_c AS c
 ON b.mobile = c.mobile
; 

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-7 is a root stage                          |
|   Stage-5 depends on stages: Stage-7               |
|   Stage-0 depends on stages: Stage-5               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-7                                   |
|     Map Reduce Local Work                          |
|       Alias -> Map Local Tables:                   |
|         b                                          |
|           Fetch Operator                           |
|             limit: -1                              |
|         c                                          |
|           Fetch Operator                           |
|             limit: -1                              |
|       Alias -> Map Local Operator Tree:            |
|         b                                          |
|           TableScan                                |
|             alias: b                               |
|             Statistics: Num rows: 3 Data size: 42 Basic stats: COMPLETE Column stats: NONE |
|             HashTable Sink Operator                |
|               keys:                                |
|                 0 user_id (type: bigint)           |
|                 1 user_id (type: bigint)           |
|         c                                          |
|           TableScan                                |
|             alias: c                               |
|             Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
|             HashTable Sink Operator                |
|               keys:                                |
|                 0 _col7 (type: string)             |
|                 1 mobile (type: string)            |
|                                                    |
|   Stage: Stage-5                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
|             Map Join Operator                      |
|               condition map:                       |
|                    Left Outer Join0 to 1           |
|               keys:                                |
|                 0 user_id (type: bigint)           |
|                 1 user_id (type: bigint)           |
|               outputColumnNames: _col0, _col1, _col2, _col7, _col8 |
|               Statistics: Num rows: 3 Data size: 33 Basic stats: COMPLETE Column stats: NONE |
|               Map Join Operator                    |
|                 condition map:                     |
|                      Left Outer Join0 to 1         |
|                 keys:                              |
|                   0 _col7 (type: string)           |
|                   1 mobile (type: string)          |
|                 outputColumnNames: _col0, _col1, _col2, _col8, _col14 |
|                 Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE Column stats: NONE |
|                 Select Operator                    |
|                   expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: bigint), _col8 (type: string), _col14 (type: string) |
|                   outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
|                   Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE Column stats: NONE |
|                   File Output Operator             |
|                     compressed: false              |
|                     Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE Column stats: NONE |
|                     table:                         |
|                         input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                         output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                         serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|       Local Work:                                  |
|         Map Reduce Local Work                      |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
75 rows selected (0.181 seconds)
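What the plan above does can be sketched in Python: the local work stage loads the small tables b and c into in-memory hash tables, and a map-only pass over a probes them, so no reduce phase is needed. The function names `build_hash_table` and `map_join` are made up for this illustration and are not Hive internals; the rows mirror the test data inserted earlier.

```python
# Rows mirror the test data inserted into the three tables above.
join_multi_a = [(1, "123456", 0), (2, "234567", 1), (3, "345678", 1)]
join_multi_b = [(1, "123456", "sunzhenhua"), (2, "234567", "zyq"), (3, "345678", "zz")]
join_multi_c = [(1, "123456", "a"), (2, "234567", "b"), (3, "345678", "c")]

NULL_ROW = (None, None, None)  # stands in for an unmatched right-hand row


def build_hash_table(rows, key_index):
    """Local work: load a small table into memory, keyed by the join column."""
    table = {}
    for row in rows:
        if row[key_index] is None:
            continue  # a NULL key never matches in an equi-join
        table.setdefault(row[key_index], []).append(row)
    return table


def map_join(big_rows, b_by_user_id, c_by_mobile):
    """Map-only pass over table a, probing the two in-memory hash tables."""
    for user_id, mobile, sex in big_rows:
        # LEFT JOIN b ON a.user_id = b.user_id
        for b_row in b_by_user_id.get(user_id, [NULL_ROW]):
            b_mobile, user_name = b_row[1], b_row[2]
            # LEFT JOIN c ON b.mobile = c.mobile
            for c_row in c_by_mobile.get(b_mobile, [NULL_ROW]):
                yield (user_id, mobile, sex, user_name, c_row[2])


b_by_user_id = build_hash_table(join_multi_b, key_index=0)
c_by_mobile = build_hash_table(join_multi_c, key_index=1)
map_join_rows = list(map_join(join_multi_a, b_by_user_id, c_by_mobile))
for row in map_join_rows:
    print(row)
```

Because both lookups happen inside the mapper, the different join keys (user_id, then mobile) do not cost an extra job here, which is why Map Join has to be disabled to observe the effect discussed in this section.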

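Here we want to fetch all the information for each user, and there are two ways to write the query.

Because both user_id and mobile can serve as the join key, it is easy to end up with the inefficient form, shown first:

1. Join on user_id first, then on mobile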

EXPLAIN
SELECT 
 a.user_id 
 ,a.mobile
 ,a.sex
 ,b.user_name
 ,c.type
FROM join_multi_a AS a
LEFT JOIN join_multi_b AS b
 ON a.user_id = b.user_id
LEFT JOIN join_multi_c AS c
 ON b.mobile = c.mobile
; 

2. Join on user_id both times

SELECT 
 a.user_id 
 ,a.mobile
 ,a.sex
 ,b.user_name
 ,c.type
FROM join_multi_a AS a
LEFT JOIN join_multi_b AS b
 ON a.user_id = b.user_id
LEFT JOIN join_multi_c AS c
 ON a.user_id = c.user_id
; 

Let's look at the EXPLAIN output of each statement with Map Join disabled.

Join on user_id first, then on mobile

INFO  : Starting task [Stage-5:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200411063314_438501bd-09fb-498a-8a33-286363c82b5e); Time taken: 0.004 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-2 depends on stages: Stage-1               |
|   Stage-0 depends on stages: Stage-2               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
|             Reduce Output Operator                 |
|               key expressions: user_id (type: bigint) |
|               sort order: +                        |
|               Map-reduce partition columns: user_id (type: bigint) |
|               Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
|               value expressions: mobile (type: string), sex (type: bigint) |
|           TableScan                                |
|             alias: b                               |
|             Statistics: Num rows: 3 Data size: 42 Basic stats: COMPLETE Column stats: NONE |
|             Reduce Output Operator                 |
|               key expressions: user_id (type: bigint) |
|               sort order: +                        |
|               Map-reduce partition columns: user_id (type: bigint) |
|               Statistics: Num rows: 3 Data size: 42 Basic stats: COMPLETE Column stats: NONE |
|               value expressions: mobile (type: string), user_name (type: string) |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Outer Join0 to 1               |
|           keys:                                    |
|             0 user_id (type: bigint)               |
|             1 user_id (type: bigint)               |
|           outputColumnNames: _col0, _col1, _col2, _col7, _col8 |
|           Statistics: Num rows: 3 Data size: 33 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-2                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             Reduce Output Operator                 |
|               key expressions: _col7 (type: string) |
|               sort order: +                        |
|               Map-reduce partition columns: _col7 (type: string) |
|               Statistics: Num rows: 3 Data size: 33 Basic stats: COMPLETE Column stats: NONE |
|               value expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: bigint), _col8 (type: string) |
|           TableScan                                |
|             alias: c                               |
|             Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
|             Reduce Output Operator                 |
|               key expressions: mobile (type: string) |
|               sort order: +                        |
|               Map-reduce partition columns: mobile (type: string) |
|               Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
|               value expressions: type (type: string) |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Outer Join0 to 1               |
|           keys:                                    |
|             0 _col7 (type: string)                 |
|             1 mobile (type: string)                |
|           outputColumnNames: _col0, _col1, _col2, _col8, _col14 |
|           Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE Column stats: NONE |
|           Select Operator                          |
|             expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: bigint), _col8 (type: string), _col14 (type: string) |
|             outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
|             Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE Column stats: NONE |
|             File Output Operator                   |
|               compressed: false                    |
|               Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE Column stats: NONE |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
89 rows selected (0.161 seconds)

Join on user_id both times

INFO  : Starting task [Stage-4:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200411063216_19f3048a-8c39-41c9-ad08-c49be9c06e6d); Time taken: 0.005 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
|             Reduce Output Operator                 |
|               key expressions: user_id (type: bigint) |
|               sort order: +                        |
|               Map-reduce partition columns: user_id (type: bigint) |
|               Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
|               value expressions: mobile (type: string), sex (type: bigint) |
|           TableScan                                |
|             alias: b                               |
|             Statistics: Num rows: 3 Data size: 42 Basic stats: COMPLETE Column stats: NONE |
|             Reduce Output Operator                 |
|               key expressions: user_id (type: bigint) |
|               sort order: +                        |
|               Map-reduce partition columns: user_id (type: bigint) |
|               Statistics: Num rows: 3 Data size: 42 Basic stats: COMPLETE Column stats: NONE |
|               value expressions: user_name (type: string) |
|           TableScan                                |
|             alias: c                               |
|             Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
|             Reduce Output Operator                 |
|               key expressions: user_id (type: bigint) |
|               sort order: +                        |
|               Map-reduce partition columns: user_id (type: bigint) |
|               Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
|               value expressions: type (type: string) |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Outer Join0 to 1               |
|                Left Outer Join0 to 2               |
|           keys:                                    |
|             0 user_id (type: bigint)               |
|             1 user_id (type: bigint)               |
|             2 user_id (type: bigint)               |
|           outputColumnNames: _col0, _col1, _col2, _col8, _col14 |
|           Statistics: Num rows: 6 Data size: 66 Basic stats: COMPLETE Column stats: NONE |
|           Select Operator                          |
|             expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: bigint), _col8 (type: string), _col14 (type: string) |
|             outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
|             Statistics: Num rows: 6 Data size: 66 Basic stats: COMPLETE Column stats: NONE |
|             File Output Operator                   |
|               compressed: false                    |
|               Statistics: Num rows: 6 Data size: 66 Basic stats: COMPLETE Column stats: NONE |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
64 rows selected (0.178 seconds)

As you can see, joining on user_id first and then on mobile produces one extra stage.

When both joins use user_id, the key is the same, so the whole three-table join is handled in a single stage.
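The rule of thumb can be written down as a small Python function. This is a deliberately simplified model, not Hive's planner: the name `estimate_join_jobs` and the representation of join conditions as column-name pairs are assumptions of this sketch, and it only covers a chain of equi-joins with Map Join disabled.

```python
def estimate_join_jobs(join_conditions):
    """Estimate the number of MR jobs for a chain of equi-joins.

    join_conditions is a list of (left_column, right_column) pairs in query
    order, for example ("a.user_id", "b.user_id"). A join is folded into the
    current job when one of its columns already belongs to that job's key;
    otherwise the data must be re-shuffled on a new key, i.e. a new job.
    """
    jobs = 0
    current_key = set()
    for left, right in join_conditions:
        if left in current_key or right in current_key:
            current_key.update((left, right))  # same shuffle key, same job
        else:
            jobs += 1
            current_key = {left, right}        # new shuffle key, new job
    return jobs


# Query 1: user_id first, then mobile.
mixed_keys = [("a.user_id", "b.user_id"), ("b.mobile", "c.mobile")]
# Query 2: user_id both times.
same_key = [("a.user_id", "b.user_id"), ("a.user_id", "c.user_id")]

print(estimate_join_jobs(mixed_keys), estimate_join_jobs(same_key))
```

For the two queries in this section the model gives 2 and 1, matching the Stage-1 plus Stage-2 plan and the single Stage-1 plan in the EXPLAIN output above. Always confirm with EXPLAIN on a real query, since the actual planner takes more factors into account.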

Next, let's look at the execution time of the two approaches:

user_id first, then mobile

SELECT 
 a.user_id 
 ,a.mobile
 ,a.sex
 ,b.user_name
 ,c.type
FROM join_multi_a AS a
LEFT JOIN join_multi_b AS b
 ON a.user_id = b.user_id
LEFT JOIN join_multi_c AS c
 ON b.mobile = c.mobile
; 
INFO  : Starting task [Stage-1:MAPRED] in parallel
INFO  : Launching Job 2 out of 2
INFO  : Starting task [Stage-2:MAPRED] in parallel
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 5.34 sec   HDFS Read: 14226 HDFS Write: 213 SUCCESS
INFO  : Stage-Stage-2: Map: 2  Reduce: 1   Cumulative CPU: 5.86 sec   HDFS Read: 14889 HDFS Write: 180 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 11 seconds 200 msec
INFO  : Completed executing command(queryId=hive_20200411063838_e348957a-1c29-49de-a45b-be69858d8a61); Time taken: 62.026 seconds
INFO  : OK
+------------+-----------+--------+--------------+---------+
| a.user_id  | a.mobile  | a.sex  | b.user_name  | c.type  |
+------------+-----------+--------+--------------+---------+
| 1          | 123456    | 0      | sunzhenhua   | a       |
| 2          | 234567    | 1      | zyq          | b       |
| 3          | 345678    | 1      | zz           | c       |
+------------+-----------+--------+--------------+---------+
3 rows selected (62.167 seconds)

Both joins on user_id

SELECT 
 a.user_id 
 ,a.mobile
 ,a.sex
 ,b.user_name
 ,c.type
FROM join_multi_a AS a
LEFT JOIN join_multi_b AS b
 ON a.user_id = b.user_id
LEFT JOIN join_multi_c AS c
 ON a.user_id = c.user_id
; 
INFO  : Starting task [Stage-1:MAPRED] in parallel
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 6.53 sec   HDFS Read: 25156 HDFS Write: 180 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 6 seconds 530 msec
INFO  : Completed executing command(queryId=hive_20200411064014_ddfb093d-a486-411b-8fff-9af2b6fb27d4); Time taken: 32.012 seconds
INFO  : OK
+------------+-----------+--------+--------------+---------+
| a.user_id  | a.mobile  | a.sex  | b.user_name  | c.type  |
+------------+-----------+--------+--------------+---------+
| 1          | 123456    | 0      | sunzhenhua   | a       |
| 2          | 234567    | 1      | zyq          | b       |
| 3          | 345678    | 1      | zz           | c       |
+------------+-----------+--------+--------------+---------+
3 rows selected (32.123 seconds)

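As you can see, the query that joins on the same condition runs noticeably faster than the one with different JOIN conditions (32.1 seconds versus 62.2 seconds).

To summarize:

With the same JOIN condition there are fewer stages, so fewer MR jobs are launched and the query finishes sooner.

With different JOIN conditions there are more stages. The extra MR job adds its own computation time and its own resource-request time, so the query is slower.

The MR flow of such a JOIN, outlined below, makes this easier to see.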

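For key a

In the end, the reduce phase amounts to a three-level nested loop:

for x in List{[a,3,4,A]}   (list A contains only one element)

  for y in List{[a,2,B]}   (list B contains only one element)

   for z in List{[a,3,C]}   (list C contains only one element)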
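The same flow can be simulated end to end in Python. This is a minimal sketch, not Hive's implementation: the function names `map_phase`, `shuffle` and `reduce_phase` are chosen for this illustration, and the rows are the test data from the three join_multi tables. Every table is shuffled on user_id, so one reduce call sees the rows of all three tables for a given key.

```python
from collections import defaultdict

# Rows mirror the test data inserted into the three tables earlier.
join_multi_a = [(1, "123456", 0), (2, "234567", 1), (3, "345678", 1)]
join_multi_b = [(1, "123456", "sunzhenhua"), (2, "234567", "zyq"), (3, "345678", "zz")]
join_multi_c = [(1, "123456", "a"), (2, "234567", "b"), (3, "345678", "c")]


def map_phase():
    """Emit (join key, (table tag, value columns)) for every input row."""
    for user_id, mobile, sex in join_multi_a:
        yield user_id, ("a", (mobile, sex))
    for user_id, _mobile, user_name in join_multi_b:
        yield user_id, ("b", (user_name,))
    for user_id, _mobile, type_ in join_multi_c:
        yield user_id, ("c", (type_,))


def shuffle(pairs):
    """Group the emitted values by key, keeping one list per source table."""
    groups = defaultdict(lambda: {"a": [], "b": [], "c": []})
    for key, (tag, payload) in pairs:
        groups[key][tag].append(payload)
    return groups


def reduce_phase(groups):
    """Join the per-table lists of each key with three nested loops."""
    for key in sorted(groups):
        lists = groups[key]
        # LEFT OUTER JOIN: an empty right-hand list is padded with a NULL row.
        list_b = lists["b"] or [(None,)]
        list_c = lists["c"] or [(None,)]
        for mobile, sex in lists["a"]:
            for (user_name,) in list_b:
                for (type_,) in list_c:
                    yield (key, mobile, sex, user_name, type_)


joined_rows = list(reduce_phase(shuffle(map_phase())))
for row in joined_rows:
    print(row)
```

With unique keys each list holds a single element, so the triple loop runs once per key. If the second join used mobile instead, the output of the first reduce would have to be shuffled again on the new key, which is the extra job seen earlier.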

Reposted from blog.csdn.net/u010003835/article/details/105493938