Outline: https://mp.csdn.net/console/editor/html/105334641
Test table and test data
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE TABLE `datacube_salary_org`( |
| `company_name` string COMMENT '????', |
| `dep_name` string COMMENT '????', |
| `user_id` bigint COMMENT '??id', |
| `user_name` string COMMENT '????', |
| `salary` decimal(10,2) COMMENT '??', |
| `create_time` date COMMENT '????', |
| `update_time` date COMMENT '????') |
| PARTITIONED BY ( |
| `pt` string COMMENT '????') |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| WITH SERDEPROPERTIES ( |
| 'field.delim'=',', |
| 'serialization.format'=',') |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.mapred.TextInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION |
| 'hdfs://cdh-manager:8020/user/hive/warehouse/data_warehouse_test.db/datacube_salary_org' |
| TBLPROPERTIES ( |
| 'transient_lastDdlTime'='1586310488') |
+----------------------------------------------------+
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| datacube_salary_org.company_name | datacube_salary_org.dep_name | datacube_salary_org.user_id | datacube_salary_org.user_name | datacube_salary_org.salary | datacube_salary_org.create_time | datacube_salary_org.update_time | datacube_salary_org.pt |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| s.zh | engineer | 1 | szh | 28000.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| s.zh | engineer | 2 | zyq | 26000.00 | 2020-04-03 | 2020-04-03 | 20200405 |
| s.zh | tester | 3 | gkm | 20000.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | finance | 4 | pip | 13400.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | finance | 5 | kip | 24500.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | finance | 6 | zxxc | 13000.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | kiccp | 7 | xsz | 8600.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| s.zh | engineer | 1 | szh | 28000.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| s.zh | engineer | 2 | zyq | 26000.00 | 2020-04-03 | 2020-04-03 | 20200406 |
| s.zh | tester | 3 | gkm | 20000.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | finance | 4 | pip | 13400.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | finance | 5 | kip | 24500.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | finance | 6 | zxxc | 13000.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | kiccp | 7 | xsz | 8600.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| s.zh | enginer | 1 | szh | 28000.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| s.zh | enginer | 2 | zyq | 26000.00 | 2020-04-03 | 2020-04-03 | 20200407 |
| s.zh | tester | 3 | gkm | 20000.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | finance | 4 | pip | 13400.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | finance | 5 | kip | 24500.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | finance | 6 | zxxc | 13000.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | kiccp | 7 | xsz | 8600.00 | 2020-04-07 | 2020-04-07 | 20200407 |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
Scenario 2: Reducing the number of jobs
1) Use UNION ALL cleverly to reduce the number of jobs
2) Use the same JOIN condition across multiple tables to reduce the number of jobs
1) Use UNION ALL cleverly to reduce the number of jobs
Suppose we need to count the number of rows in each of several tables.
We could run a separate SQL statement per table, but that is inefficient (and not worth considering here).
Alternatively, we can merge the per-table counts with UNION ALL, which is still fairly slow.
For example:
SELECT
'a' AS type
,COUNT(1) AS num
FROM datacube_salary_basic_aggr AS a
UNION ALL
SELECT
'b' AS type
,COUNT(1) AS num
FROM datacube_salary_company_aggr AS b
UNION ALL
SELECT
'c' AS type
,COUNT(1) AS num
FROM datacube_salary_dep_aggr AS c
UNION ALL
SELECT
'd' AS type
,COUNT(1) AS num
FROM datacube_salary_total_aggr AS d
;
A better approach is to read all the tables in first, tag each row with its source table, and then aggregate over the tags.
Because multiple tables serve as input here, this stage runs with multiple mappers.
For example:
SELECT
type
,COUNT(1)
FROM
(
SELECT
'a' AS type
,total_salary
FROM datacube_salary_basic_aggr AS a
UNION ALL
SELECT
'b' AS type
,total_salary
FROM datacube_salary_company_aggr AS b
UNION ALL
SELECT
'c' AS type
,total_salary
FROM datacube_salary_dep_aggr AS c
UNION ALL
SELECT
'd' AS type
,total_salary
FROM datacube_salary_total_aggr AS d
) AS tmp
GROUP BY
type
;
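The effect of the tag-then-aggregate rewrite can be sketched in Python. This is a conceptual simulation of the two query plans, not Hive code; the table tags and row counts are taken from the example above:

```python
from collections import Counter

# Stand-ins for the four aggregate tables; each list is one table's rows.
tables = {
    "a": [{"total_salary": 1}] * 7,   # datacube_salary_basic_aggr: 7 rows
    "b": [{"total_salary": 1}] * 2,   # datacube_salary_company_aggr: 2 rows
    "c": [{"total_salary": 1}] * 4,   # datacube_salary_dep_aggr: 4 rows
    "d": [{"total_salary": 1}] * 1,   # datacube_salary_total_aggr: 1 row
}

# Inefficient plan: one COUNT per table (one MR job each),
# with the four results merged by UNION ALL afterwards.
per_table_counts = {t: len(rows) for t, rows in tables.items()}

# Efficient plan: tag every row with its table name during a single combined
# scan, then run one GROUP BY over the tagged stream (one MR job in total).
tagged_stream = [(t, row) for t, rows in tables.items() for row in rows]
grouped_counts = Counter(t for t, _ in tagged_stream)

assert per_table_counts == dict(grouped_counts)  # same result, fewer jobs
```

Both plans produce identical counts; the difference is only in how many jobs Hive must launch to compute them.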
Let's run EXPLAIN to see how these two approaches differ.
INFO : Starting task [Stage-6:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20200411043838_47d043d4-8b22-433c-a0be-4714aed8ab94); Time taken: 0.008 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-2 depends on stages: Stage-1, Stage-3, Stage-4, Stage-5 |
| Stage-3 is a root stage |
| Stage-4 is a root stage |
| Stage-5 is a root stage |
| Stage-0 depends on stages: Stage-2 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| Statistics: Num rows: 7 Data size: 2086 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| Statistics: Num rows: 7 Data size: 2086 Basic stats: COMPLETE Column stats: COMPLETE |
| Group By Operator |
| aggregations: count(1) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| Reduce Output Operator |
| sort order: |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| value expressions: _col0 (type: bigint) |
| Execution mode: vectorized |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(VALUE._col0) |
| mode: mergepartial |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: 'a' (type: string), _col0 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Union |
| Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| TableScan |
| Union |
| Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| TableScan |
| Union |
| Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| TableScan |
| Union |
| Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-3 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: b |
| Statistics: Num rows: 2 Data size: 400 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| Statistics: Num rows: 2 Data size: 400 Basic stats: COMPLETE Column stats: COMPLETE |
| Group By Operator |
| aggregations: count(1) |
| mode: hash |
| outputColumnNames: _col0 |
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| Reduce Output Operator |
| sort order: |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| value expressions: _col0 (type: bigint) |
| Execution mode: vectorized |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(VALUE._col0) |
| mode: mergepartial |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: 'b' (type: string), _col0 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-4 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: c |
| Statistics: Num rows: 4 Data size: 1160 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| Statistics: Num rows: 4 Data size: 1160 Basic stats: COMPLETE Column stats: COMPLETE |
| Group By Operator |
| aggregations: count(1) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| Reduce Output Operator |
| sort order: |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| value expressions: _col0 (type: bigint) |
| Execution mode: vectorized |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(VALUE._col0) |
| mode: mergepartial |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: 'c' (type: string), _col0 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-5 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: d |
| Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE |
| Group By Operator |
| aggregations: count(1) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| Reduce Output Operator |
| sort order: |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| value expressions: _col0 (type: bigint) |
| Execution mode: vectorized |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(VALUE._col0) |
| mode: mergepartial |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: 'd' (type: string), _col0 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
199 rows selected (0.242 seconds)
As you can see, the job is split into six stages.
Each SELECT 'x' AS type, COUNT(1) FROM ... query is its own unit of work; with four tables, that produces four jobs: Stage-1, Stage-3, Stage-4, and Stage-5.
Stage-2 is the UNION job that merges the four results.
Now let's look at the EXPLAIN output of the optimized version.
INFO : Starting task [Stage-5:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20200411044418_821a2232-e735-4cfb-9397-10da41ce7570); Time taken: 0.006 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| Statistics: Num rows: 7 Data size: 2086 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: 'a' (type: string) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 7 Data size: 595 Basic stats: COMPLETE Column stats: COMPLETE |
| Union |
| Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: _col0 (type: string) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
| Group By Operator |
| aggregations: count(1) |
| keys: _col0 (type: string) |
| mode: hash |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| Reduce Output Operator |
| key expressions: _col0 (type: string) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: string) |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| value expressions: _col1 (type: bigint) |
| TableScan |
| alias: b |
| Statistics: Num rows: 2 Data size: 400 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: 'b' (type: string) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 2 Data size: 170 Basic stats: COMPLETE Column stats: COMPLETE |
| Union |
| Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: _col0 (type: string) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
| Group By Operator |
| aggregations: count(1) |
| keys: _col0 (type: string) |
| mode: hash |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| Reduce Output Operator |
| key expressions: _col0 (type: string) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: string) |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| value expressions: _col1 (type: bigint) |
| TableScan |
| alias: c |
| Statistics: Num rows: 4 Data size: 1160 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: 'c' (type: string) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 4 Data size: 340 Basic stats: COMPLETE Column stats: COMPLETE |
| Union |
| Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: _col0 (type: string) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
| Group By Operator |
| aggregations: count(1) |
| keys: _col0 (type: string) |
| mode: hash |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| Reduce Output Operator |
| key expressions: _col0 (type: string) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: string) |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| value expressions: _col1 (type: bigint) |
| TableScan |
| alias: d |
| Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: 'd' (type: string) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 85 Basic stats: COMPLETE Column stats: COMPLETE |
| Union |
| Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
| Select Operator |
| expressions: _col0 (type: string) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE |
| Group By Operator |
| aggregations: count(1) |
| keys: _col0 (type: string) |
| mode: hash |
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| Reduce Output Operator |
| key expressions: _col0 (type: string) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: string) |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| value expressions: _col1 (type: bigint) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(VALUE._col0) |
| keys: KEY._col0 (type: string) |
| mode: mergepartial |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
129 rows selected (0.219 seconds)
This time the query runs in only two stages: Stage-0 and Stage-1.
Why does this version produce fewer stages than the previous one? Because we UNION ALL the data first and aggregate afterwards: the four tables are read in a single pass, with the type column distinguishing their rows, which is more efficient. The trade-off is that this one stage needs more mappers, since it scans four tables at once.
Now let's compare the execution times of the two queries (on a small data set).
The less efficient approach, the first SQL:
INFO : Starting Job = job_1586423165261_0033, Tracking URL = http://cdh-manager:8088/proxy/application_1586423165261_0033/
INFO : Kill Command = /opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/hadoop/bin/hadoop job -kill job_1586423165261_0033
INFO : Hadoop job information for Stage-2: number of mappers: 4; number of reducers: 0
INFO : 2020-04-11 04:52:45,140 Stage-2 map = 0%, reduce = 0%
INFO : 2020-04-11 04:52:51,303 Stage-2 map = 25%, reduce = 0%, Cumulative CPU 1.56 sec
INFO : 2020-04-11 04:52:56,430 Stage-2 map = 50%, reduce = 0%, Cumulative CPU 3.17 sec
INFO : 2020-04-11 04:53:00,559 Stage-2 map = 75%, reduce = 0%, Cumulative CPU 4.9 sec
INFO : 2020-04-11 04:53:04,663 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 6.44 sec
INFO : MapReduce Total cumulative CPU time: 6 seconds 440 msec
INFO : Ended Job = job_1586423165261_0033
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.85 sec HDFS Read: 11511 HDFS Write: 116 SUCCESS
INFO : Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 4.7 sec HDFS Read: 10963 HDFS Write: 116 SUCCESS
INFO : Stage-Stage-4: Map: 1 Reduce: 1 Cumulative CPU: 4.09 sec HDFS Read: 11238 HDFS Write: 116 SUCCESS
INFO : Stage-Stage-5: Map: 1 Reduce: 1 Cumulative CPU: 4.41 sec HDFS Read: 10512 HDFS Write: 116 SUCCESS
INFO : Stage-Stage-2: Map: 4 Cumulative CPU: 6.44 sec HDFS Read: 20912 HDFS Write: 412 SUCCESS
INFO : Total MapReduce CPU Time Spent: 24 seconds 490 msec
INFO : Completed executing command(queryId=hive_20200411045047_295d368a-500d-4225-9d8e-e83ed3e51321); Time taken: 138.08 seconds
INFO : OK
+-----------+----------+
| _u1.type | _u1.num |
+-----------+----------+
| a | 7 |
| b | 2 |
| c | 4 |
| d | 1 |
+-----------+----------+
4 rows selected (138.285 seconds)
The more efficient approach:
INFO : Starting Job = job_1586423165261_0034, Tracking URL = http://cdh-manager:8088/proxy/application_1586423165261_0034/
INFO : Kill Command = /opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/hadoop/bin/hadoop job -kill job_1586423165261_0034
INFO : Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1
INFO : 2020-04-11 04:54:41,047 Stage-1 map = 0%, reduce = 0%
INFO : 2020-04-11 04:54:49,371 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 3.07 sec
INFO : 2020-04-11 04:54:54,558 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 5.37 sec
INFO : 2020-04-11 04:54:58,667 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 7.71 sec
INFO : 2020-04-11 04:55:02,777 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.55 sec
INFO : 2020-04-11 04:55:08,914 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.25 sec
INFO : MapReduce Total cumulative CPU time: 11 seconds 250 msec
INFO : Ended Job = job_1586423165261_0034
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 4 Reduce: 1 Cumulative CPU: 11.25 sec HDFS Read: 45492 HDFS Write: 151 SUCCESS
INFO : Total MapReduce CPU Time Spent: 11 seconds 250 msec
INFO : Completed executing command(queryId=hive_20200411045431_e528deb6-db83-4556-944d-802cdfa20d76); Time taken: 38.374 seconds
INFO : OK
+-------+------+
| type | _c1 |
+-------+------+
| a | 7 |
| b | 2 |
| c | 4 |
| d | 1 |
+-------+------+
4 rows selected (38.545 seconds)
The second approach is clearly faster: with fewer stages there are fewer rounds of resource allocation, so the total elapsed time drops.
2) Use the same JOIN condition across multiple tables to reduce the number of jobs
Suppose we have three tables, A, B, and C, which can be joined on either user_id or mobile (both columns are unique). First, let's build some test data with the following SQL:
CREATE TABLE IF NOT EXISTS join_multi_a(
user_id BIGINT
,mobile STRING
,sex BIGINT
);
CREATE TABLE IF NOT EXISTS join_multi_b(
user_id BIGINT
,mobile STRING
,user_name STRING
);
CREATE TABLE IF NOT EXISTS join_multi_c(
user_id BIGINT
,mobile STRING
,type STRING
);
INSERT OVERWRITE TABLE join_multi_a VALUES
(1, '123456', 0)
,(2, '234567', 1)
,(3, '345678', 1)
;
INSERT OVERWRITE TABLE join_multi_b VALUES
(1, '123456', 'sunzhenhua')
,(2, '234567', 'zyq')
,(3, '345678', 'zz')
;
INSERT OVERWRITE TABLE join_multi_c VALUES
(1, '123456', 'a')
,(2, '234567', 'b')
,(3, '345678', 'c')
;
Note: because the data set is tiny, Hive 0.11 and later would automatically apply the map-join optimization.
We need to disable it:
set hive.auto.convert.join=false;
For example, with map join enabled, the EXPLAIN output of the following statement would become:
SELECT
a.user_id
,a.mobile
,a.sex
,b.user_name
,c.type
FROM join_multi_a AS a
LEFT JOIN join_multi_b AS b
ON a.user_id = b.user_id
LEFT JOIN join_multi_c AS c
ON b.mobile = c.mobile
;
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-7 is a root stage |
| Stage-5 depends on stages: Stage-7 |
| Stage-0 depends on stages: Stage-5 |
| |
| STAGE PLANS: |
| Stage: Stage-7 |
| Map Reduce Local Work |
| Alias -> Map Local Tables: |
| b |
| Fetch Operator |
| limit: -1 |
| c |
| Fetch Operator |
| limit: -1 |
| Alias -> Map Local Operator Tree: |
| b |
| TableScan |
| alias: b |
| Statistics: Num rows: 3 Data size: 42 Basic stats: COMPLETE Column stats: NONE |
| HashTable Sink Operator |
| keys: |
| 0 user_id (type: bigint) |
| 1 user_id (type: bigint) |
| c |
| TableScan |
| alias: c |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| HashTable Sink Operator |
| keys: |
| 0 _col7 (type: string) |
| 1 mobile (type: string) |
| |
| Stage: Stage-5 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| Map Join Operator |
| condition map: |
| Left Outer Join0 to 1 |
| keys: |
| 0 user_id (type: bigint) |
| 1 user_id (type: bigint) |
| outputColumnNames: _col0, _col1, _col2, _col7, _col8 |
| Statistics: Num rows: 3 Data size: 33 Basic stats: COMPLETE Column stats: NONE |
| Map Join Operator |
| condition map: |
| Left Outer Join0 to 1 |
| keys: |
| 0 _col7 (type: string) |
| 1 mobile (type: string) |
| outputColumnNames: _col0, _col1, _col2, _col8, _col14 |
| Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: bigint), _col8 (type: string), _col14 (type: string) |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| Local Work: |
| Map Reduce Local Work |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
75 rows selected (0.181 seconds)
Here we want to fetch all of a user's information. Since either user_id or mobile can serve as the join key, there are two ways to write the query.
1. The inefficient version: join on user_id first, then on mobile
EXPLAIN
SELECT
a.user_id
,a.mobile
,a.sex
,b.user_name
,c.type
FROM join_multi_a AS a
LEFT JOIN join_multi_b AS b
ON a.user_id = b.user_id
LEFT JOIN join_multi_c AS c
ON b.mobile = c.mobile
;
2. Join on user_id in both places
EXPLAIN
SELECT
a.user_id
,a.mobile
,a.sex
,b.user_name
,c.type
FROM join_multi_a AS a
LEFT JOIN join_multi_b AS b
ON a.user_id = b.user_id
LEFT JOIN join_multi_c AS c
ON a.user_id = c.user_id
;
Let's look at the EXPLAIN results of these two statements with map join disabled.
Joining on user_id first, then on mobile:
INFO : Starting task [Stage-5:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20200411063314_438501bd-09fb-498a-8a33-286363c82b5e); Time taken: 0.004 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-2 depends on stages: Stage-1 |
| Stage-0 depends on stages: Stage-2 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| value expressions: mobile (type: string), sex (type: bigint) |
| TableScan |
| alias: b |
| Statistics: Num rows: 3 Data size: 42 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 3 Data size: 42 Basic stats: COMPLETE Column stats: NONE |
| value expressions: mobile (type: string), user_name (type: string) |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Outer Join0 to 1 |
| keys: |
| 0 user_id (type: bigint) |
| 1 user_id (type: bigint) |
| outputColumnNames: _col0, _col1, _col2, _col7, _col8 |
| Statistics: Num rows: 3 Data size: 33 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Reduce Output Operator |
| key expressions: _col7 (type: string) |
| sort order: + |
| Map-reduce partition columns: _col7 (type: string) |
| Statistics: Num rows: 3 Data size: 33 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: bigint), _col8 (type: string) |
| TableScan |
| alias: c |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: mobile (type: string) |
| sort order: + |
| Map-reduce partition columns: mobile (type: string) |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| value expressions: type (type: string) |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Outer Join0 to 1 |
| keys: |
| 0 _col7 (type: string) |
| 1 mobile (type: string) |
| outputColumnNames: _col0, _col1, _col2, _col8, _col14 |
| Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: bigint), _col8 (type: string), _col14 (type: string) |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
89 rows selected (0.161 seconds)
Joining on user_id in both places:
INFO : Starting task [Stage-4:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20200411063216_19f3048a-8c39-41c9-ad08-c49be9c06e6d); Time taken: 0.005 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| value expressions: mobile (type: string), sex (type: bigint) |
| TableScan |
| alias: b |
| Statistics: Num rows: 3 Data size: 42 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 3 Data size: 42 Basic stats: COMPLETE Column stats: NONE |
| value expressions: user_name (type: string) |
| TableScan |
| alias: c |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| value expressions: type (type: string) |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Outer Join0 to 1 |
| Left Outer Join0 to 2 |
| keys: |
| 0 user_id (type: bigint) |
| 1 user_id (type: bigint) |
| 2 user_id (type: bigint) |
| outputColumnNames: _col0, _col1, _col2, _col8, _col14 |
| Statistics: Num rows: 6 Data size: 66 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: bigint), _col8 (type: string), _col14 (type: string) |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 6 Data size: 66 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 6 Data size: 66 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
64 rows selected (0.178 seconds)
As you can see, joining on user_id first and then on mobile produces one extra stage,
while joining on user_id both times uses the same key, so the three-table join runs in a single stage.
Now let's compare the execution times of the two approaches.
user_id first, then mobile:
SELECT
a.user_id
,a.mobile
,a.sex
,b.user_name
,c.type
FROM join_multi_a AS a
LEFT JOIN join_multi_b AS b
ON a.user_id = b.user_id
LEFT JOIN join_multi_c AS c
ON b.mobile = c.mobile
;
INFO : Starting task [Stage-1:MAPRED] in parallel
INFO : Launching Job 2 out of 2
INFO : Starting task [Stage-2:MAPRED] in parallel
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 5.34 sec HDFS Read: 14226 HDFS Write: 213 SUCCESS
INFO : Stage-Stage-2: Map: 2 Reduce: 1 Cumulative CPU: 5.86 sec HDFS Read: 14889 HDFS Write: 180 SUCCESS
INFO : Total MapReduce CPU Time Spent: 11 seconds 200 msec
INFO : Completed executing command(queryId=hive_20200411063838_e348957a-1c29-49de-a45b-be69858d8a61); Time taken: 62.026 seconds
INFO : OK
+------------+-----------+--------+--------------+---------+
| a.user_id | a.mobile | a.sex | b.user_name | c.type |
+------------+-----------+--------+--------------+---------+
| 1 | 123456 | 0 | sunzhenhua | a |
| 2 | 234567 | 1 | zyq | b |
| 3 | 345678 | 1 | zz | c |
+------------+-----------+--------+--------------+---------+
3 rows selected (62.167 seconds)
Both joins on user_id:
SELECT
a.user_id
,a.mobile
,a.sex
,b.user_name
,c.type
FROM join_multi_a AS a
LEFT JOIN join_multi_b AS b
ON a.user_id = b.user_id
LEFT JOIN join_multi_c AS c
ON a.user_id = c.user_id
;
INFO : Starting task [Stage-1:MAPRED] in parallel
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 3 Reduce: 1 Cumulative CPU: 6.53 sec HDFS Read: 25156 HDFS Write: 180 SUCCESS
INFO : Total MapReduce CPU Time Spent: 6 seconds 530 msec
INFO : Completed executing command(queryId=hive_20200411064014_ddfb093d-a486-411b-8fff-9af2b6fb27d4); Time taken: 32.012 seconds
INFO : OK
+------------+-----------+--------+--------------+---------+
| a.user_id | a.mobile | a.sex | b.user_name | c.type |
+------------+-----------+--------+--------------+---------+
| 1 | 123456 | 0 | sunzhenhua | a |
| 2 | 234567 | 1 | zyq | b |
| 3 | 345678 | 1 | zz | c |
+------------+-----------+--------+--------------+---------+
3 rows selected (32.123 seconds)
Joining on the same condition is clearly faster than joining on different conditions.
To summarize:
With the same JOIN condition there are fewer stages, hence fewer MR jobs, so the query runs faster.
With different JOIN conditions there are more stages, adding an extra MR job's computation time plus its resource-allocation overhead, so the query runs slower.
The MR flow of the join below makes this clearer and more intuitive.
For key a,
the final reduce step is equivalent to three nested loops:
for x in List A {[a,3,4,A]}   (List A has only one element)
    for y in List B {[a,2,B]}   (List B has only one element)
        for z in List C {[a,3,C]}   (List C has only one element)
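The three-level loop above can be sketched as a small Python simulation of the reducer's merge for key a. This is a conceptual illustration, not Hive internals; the row values are the single-element lists from the example:

```python
# Rows each table contributes for join key 'a'; in a reduce-side join on a
# common key, the reducer receives one such list per table.
list_a = [("a", 3, 4, "A")]   # rows from table A with key 'a'
list_b = [("a", 2, "B")]      # rows from table B with key 'a'
list_c = [("a", 3, "C")]      # rows from table C with key 'a'

# The merge is the cross product of the three lists: exactly the triple
# nested loop described above, all inside one reduce call (one stage).
joined = []
for x in list_a:              # outer loop over table A's rows
    for y in list_b:          #   middle loop over table B's rows
        for z in list_c:      #     inner loop over table C's rows
            joined.append(x + y[1:] + z[1:])  # merge, dropping repeated keys

assert joined == [("a", 3, 4, "A", 2, "B", 3, "C")]
```

Because every table shares the key, a single shuffle brings all three row lists to the same reducer; with different join keys, the second join would need its own shuffle, and hence its own MR job.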