Hive: LEFT SEMI JOIN / LEFT OUTER JOIN compared with (IN / NOT IN) and (EXISTS / NOT EXISTS)

Reference article: https://blog.csdn.net/happyrocking/article/details/79885071

This article looks at LEFT SEMI JOIN in Hive and how it relates to subqueries written with (IN / NOT IN) and (EXISTS / NOT EXISTS).

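LEFT SEMI JOIN basics

First, we need to understand what LEFT SEMI JOIN is.

Characteristics

1. The restriction of LEFT SEMI JOIN is that the right-hand table of the JOIN clause may be filtered only in the ON clause; referencing it in the WHERE clause, the SELECT clause, or anywhere else is not allowed.

2. LEFT SEMI JOIN carries only the join key of the right-hand table through the map stage, so the final SELECT may contain only columns of the left table.

3. Because LEFT SEMI JOIN has IN (keySet) semantics, when the right table contains duplicate keys a left-table row is emitted once and the remaining matches are skipped, whereas a regular JOIN keeps iterating over every match. With duplicate values in the right table, LEFT SEMI JOIN therefore produces a single row where JOIN produces several, which also tends to make LEFT SEMI JOIN faster.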

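For example, take tables A and B below, combine them with JOIN or with LEFT SEMI JOIN, and select all columns. The results differ as follows:

Note: the column crossed out in blue in the figure does not actually exist in the LEFT SEMI JOIN result, because the final SELECT may contain only columns of the left table.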
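
The difference described in point 3 can be reproduced with a small sketch. The tables demo_left and demo_right are hypothetical and exist only for this illustration; the right table deliberately holds a duplicate key.

```sql
-- Hypothetical demo tables (not part of the original article).
DROP TABLE IF EXISTS demo_left;
DROP TABLE IF EXISTS demo_right;

CREATE TABLE demo_left  (id BIGINT);
CREATE TABLE demo_right (id BIGINT);

INSERT INTO TABLE demo_left  VALUES (1), (2);
-- Key 1 appears twice on the right side.
INSERT INTO TABLE demo_right VALUES (1), (1), (3);

-- Inner join: one output row per matching pair, so id = 1 comes back twice.
SELECT l.id
FROM demo_left AS l
JOIN demo_right AS r
 ON l.id = r.id
;

-- Left semi join: each left row is returned at most once, so id = 1 comes back once.
SELECT l.id
FROM demo_left AS l
LEFT SEMI JOIN demo_right AS r
 ON l.id = r.id
;
```

If the duplicates on the right side matter to the result, a regular JOIN is the right tool; if only the existence of a match matters, LEFT SEMI JOIN avoids the extra rows.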

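LEFT SEMI JOIN can be thought of as the join-based alternative to the subquery forms IN and EXISTS. The negated forms, NOT IN and NOT EXISTS, cannot be written as a LEFT SEMI JOIN and need a LEFT OUTER JOIN instead.

Before Hive 0.13, subqueries inside (IN / NOT IN) and (EXISTS / NOT EXISTS) were not supported, so LEFT SEMI JOIN had to be used in their place.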


Building the basic test data

DROP TABLE IF EXISTS data_semi_a;

DROP TABLE IF EXISTS data_semi_b;


CREATE TABLE IF NOT EXISTS data_semi_a 
(
 user_id BIGINT
 ,sex_id BIGINT 
);

CREATE TABLE IF NOT EXISTS data_semi_b
(
 user_id BIGINT
 ,sex_id BIGINT
 ,age BIGINT
);

INSERT INTO TABLE data_semi_a VALUES
(NULL ,0)
,(1, 1)
,(1, 0)
,(2, 1)
,(3, 0)
,(4, 1)
;

INSERT INTO TABLE data_semi_b VALUES
(NULL, 0, 3)
,(1, 0, 12)
,(2, 1, 14)
;

Test data:

data_semi_a

+----------------------+---------------------+
| data_semi_a.user_id  | data_semi_a.sex_id  |
+----------------------+---------------------+
| NULL                 | 0                   |
| 1                    | 1                   |
| 1                    | 0                   |
| 2                    | 1                   |
| 3                    | 0                   |
| 4                    | 1                   |
+----------------------+---------------------+

data_semi_b

+----------------------+---------------------+------------------+
| data_semi_b.user_id  | data_semi_b.sex_id  | data_semi_b.age  |
+----------------------+---------------------+------------------+
| NULL                 | 0                   | 3                |
| 1                    | 0                   | 12               |
| 2                    | 1                   | 14               |
+----------------------+---------------------+------------------+

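A single-condition LEFT SEMI JOIN is equivalent to IN

Note

LEFT SEMI JOIN is equivalent to IN; it works by passing along only the join KEY of the right-hand table.

So in A LEFT SEMI JOIN B, the SELECT list cannot contain any column of B.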

SELECT 
 a.user_id
 ,a.sex_id
 ,b.age
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON a.user_id = b.user_id
;

Error: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 4:1 Invalid table alias or column reference 'b': (possible column names are: user_id, sex_id) (state=42000,code=10004)
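
As a sketch of how the ON-only restriction applies to the test tables, a filter on the right-hand table is written inside ON. The threshold b.age > 12 is an arbitrary value chosen for illustration. With the test data above only the b row with user_id = 2 (age 14) passes the filter, so by SQL semantics both queries should return the single row (2, 1).

```sql
-- Filter on the right-hand table placed in ON (allowed).
SELECT
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON a.user_id = b.user_id
 AND b.age > 12
;

-- The same filter written with IN: it moves into the subquery.
SELECT
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE a.user_id IN (
 SELECT b.user_id
 FROM data_semi_b AS b
 WHERE b.age > 12
);
```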

A single-condition LEFT SEMI JOIN is equivalent to IN, as in the following SQL.

SQL statement

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON a.user_id = b.user_id
;

Equivalent SQL using IN

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE a.user_id IN (
 SELECT b.user_id 
 FROM data_semi_b AS b
);

Let us compare the results of running the two queries.

Result of the LEFT SEMI JOIN query

INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO  : 2020-04-12 10:53:09,591 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-04-12 10:53:17,849 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 3.12 sec
INFO  : 2020-04-12 10:53:22,975 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.84 sec
INFO  : 2020-04-12 10:53:29,141 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.77 sec
INFO  : MapReduce Total cumulative CPU time: 7 seconds 770 msec
INFO  : Ended Job = job_1586423165261_0087
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 7.77 sec   HDFS Read: 16677 HDFS Write: 135 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 7 seconds 770 msec
INFO  : Completed executing command(queryId=hive_20200412105301_9f643e42-c966-4140-8c72-330be6bdd73c); Time taken: 28.939 seconds
INFO  : OK
+------------+-----------+
| a.user_id  | a.sex_id  |
+------------+-----------+
| 1          | 0         |
| 1          | 1         |
| 2          | 1         |
+------------+-----------+
3 rows selected (29.073 seconds)

Result of the IN query

INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO  : 2020-04-12 10:37:26,143 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-04-12 10:37:33,376 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 2.71 sec
INFO  : 2020-04-12 10:37:39,510 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.6 sec
INFO  : 2020-04-12 10:37:44,680 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.41 sec
INFO  : MapReduce Total cumulative CPU time: 7 seconds 410 msec
INFO  : Ended Job = job_1586423165261_0085
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 7.41 sec   HDFS Read: 16726 HDFS Write: 135 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 7 seconds 410 msec
INFO  : Completed executing command(queryId=hive_20200412103717_2ab604da-f301-4fee-b9bd-9c22ad6e65a1); Time taken: 27.796 seconds
INFO  : OK
+------------+-----------+
| a.user_id  | a.sex_id  |
+------------+-----------+
| 1          | 0         |
| 1          | 1         |
| 2          | 1         |
+------------+-----------+
3 rows selected (27.902 seconds)

Next, let us look at the EXPLAIN output of the two statements:

EXPLAIN output of the LEFT SEMI JOIN query:

INFO  : Starting task [Stage-3:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200412105949_53e51917-8c04-4f6f-b9fd-32ab71a2888b); Time taken: 0.005 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             filterExpr: user_id is not null (type: boolean) |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is not null (type: boolean) |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: user_id (type: bigint) |
|                 Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|                 value expressions: sex_id (type: bigint) |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: user_id is not null (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is not null (type: boolean) |
|               Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: user_id (type: bigint) |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   keys: _col0 (type: bigint)       |
|                   mode: hash                       |
|                   outputColumnNames: _col0         |
|                   Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     key expressions: _col0 (type: bigint) |
|                     sort order: +                  |
|                     Map-reduce partition columns: _col0 (type: bigint) |
|                     Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Semi Join 0 to 1               |
|           keys:                                    |
|             0 user_id (type: bigint)               |
|             1 _col0 (type: bigint)                 |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
65 rows selected (0.136 seconds)

EXPLAIN output of the IN query:

INFO  : Starting task [Stage-3:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200412110229_81d9cf79-50e2-46f1-8152-a399038861c7); Time taken: 0.005 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             filterExpr: user_id is not null (type: boolean) |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is not null (type: boolean) |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: user_id (type: bigint) |
|                 Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|                 value expressions: sex_id (type: bigint) |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: user_id is not null (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is not null (type: boolean) |
|               Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: user_id (type: bigint) |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   keys: _col0 (type: bigint)       |
|                   mode: hash                       |
|                   outputColumnNames: _col0         |
|                   Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     key expressions: _col0 (type: bigint) |
|                     sort order: +                  |
|                     Map-reduce partition columns: _col0 (type: bigint) |
|                     Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Semi Join 0 to 1               |
|           keys:                                    |
|             0 user_id (type: bigint)               |
|             1 _col0 (type: bigint)                 |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
65 rows selected (0.127 seconds)

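As we can see, the two queries are identical both in their results and in their EXPLAIN output.

In fact, Hive executes IN internally as a LEFT SEMI JOIN, as the "Left Semi Join 0 to 1" entry in the IN plan shows.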
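
The plans above can be reproduced by prefixing each query with EXPLAIN. A minimal sketch follows; the exact plan text depends on the Hive version and execution engine, so it may not match the listings character for character.

```sql
-- Plan of the LEFT SEMI JOIN form.
EXPLAIN
SELECT
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON a.user_id = b.user_id
;

-- Plan of the IN form, for side-by-side comparison.
EXPLAIN
SELECT
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE a.user_id IN (
 SELECT b.user_id
 FROM data_semi_b AS b
);
```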

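Implementing NOT IN with LEFT OUTER JOIN

Note that LEFT SEMI JOIN cannot implement NOT IN.

Root cause: the Hive version used here does not support non-equality join conditions in ON. Besides, a condition such as a.user_id != b.user_id would only ask whether some row of b has a different user_id, which is not what NOT IN means.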

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON (a.user_id != b.user_id)
;

Error: Error while compiling statement: FAILED: SemanticException [Error 10017]: Line 6:4 Both left and right aliases encountered in JOIN 'user_id' (state=42000,code=10017)

Correct form

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE a.user_id NOT IN (
 SELECT b.user_id 
 FROM data_semi_b AS b
);
INFO  : Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1
INFO  : 2020-04-12 23:02:26,751 Stage-2 map = 0%,  reduce = 0%
INFO  : 2020-04-12 23:02:33,938 Stage-2 map = 50%,  reduce = 0%, Cumulative CPU 1.76 sec
INFO  : 2020-04-12 23:02:39,172 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 3.35 sec
INFO  : 2020-04-12 23:02:47,688 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 7.88 sec
INFO  : MapReduce Total cumulative CPU time: 7 seconds 880 msec
INFO  : Ended Job = job_1586423165261_0106
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-4: Map: 1  Reduce: 1   Cumulative CPU: 6.49 sec   HDFS Read: 8372 HDFS Write: 96 SUCCESS
INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 5.65 sec   HDFS Read: 11974 HDFS Write: 96 SUCCESS
INFO  : Stage-Stage-2: Map: 2  Reduce: 1   Cumulative CPU: 7.88 sec   HDFS Read: 14131 HDFS Write: 87 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 20 seconds 20 msec
INFO  : Completed executing command(queryId=hive_20200412230117_fef818dc-e433-4880-9c8d-f6a9d28a08a9); Time taken: 91.471 seconds
INFO  : OK
+------------+-----------+
| a.user_id  | a.sex_id  |
+------------+-----------+
+------------+-----------+
No rows selected (91.674 seconds)
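
The empty result follows from SQL three-valued logic: data_semi_b contains a row whose user_id is NULL, and x NOT IN (1, 2, NULL) can never evaluate to TRUE. A common workaround, sketched here as an assumption about the intended semantics rather than as part of the original article, is to exclude NULL keys inside the subquery. By SQL semantics this variant should return the rows (3, 0) and (4, 1); the a row with a NULL user_id is still dropped, because NULL NOT IN (1, 2) is itself NULL.

```sql
SELECT
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE a.user_id NOT IN (
 SELECT b.user_id
 FROM data_semi_b AS b
 WHERE b.user_id IS NOT NULL  -- drop NULL keys so that NOT IN can evaluate to TRUE
);
```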

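Since NOT IN cannot be expressed with LEFT SEMI JOIN, the NOT IN query is rewritten with LEFT OUTER JOIN.

LEFT OUTER JOIN version of the SQL

Note that the query below puts b.user_id IS NULL in ON and b.user_id IS NOT NULL in WHERE. These two predicates contradict each other, so the query returns no rows for any input. It matches the NOT IN result here only because data_semi_b contains a NULL user_id, which makes NOT IN return an empty result as well.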

SELECT 
  a.user_id
  ,a.sex_id 
FROM data_semi_a AS a
LEFT OUTER JOIN data_semi_b AS b
 ON a.user_id = b.user_id
 AND b.user_id IS NULL
WHERE a.user_id IS NOT NULL
 AND b.user_id IS NOT NULL
;
INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO  : 2020-04-12 23:04:47,896 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-04-12 23:04:55,176 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 2.91 sec
INFO  : 2020-04-12 23:05:00,288 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.53 sec
INFO  : 2020-04-12 23:05:06,449 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.45 sec
INFO  : MapReduce Total cumulative CPU time: 8 seconds 450 msec
INFO  : Ended Job = job_1586423165261_0107
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 8.45 sec   HDFS Read: 16358 HDFS Write: 87 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 8 seconds 450 msec
INFO  : Completed executing command(queryId=hive_20200412230438_62ce326e-1b03-4c5a-a842-6816dc6feda3); Time taken: 28.871 seconds
INFO  : OK
+------------+-----------+
| a.user_id  | a.sex_id  |
+------------+-----------+
+------------+-----------+
No rows selected (28.979 seconds)
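
For comparison, the conventional anti-join pattern keeps only the equality in ON and tests b.user_id IS NULL in WHERE. This is a sketch of the common idiom, not the query that produced the log above. By SQL semantics, on the test data it should return (3, 0) and (4, 1); dropping the a.user_id IS NOT NULL predicate would additionally return the (NULL, 0) row.

```sql
SELECT
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT OUTER JOIN data_semi_b AS b
 ON a.user_id = b.user_id       -- equality only
WHERE b.user_id IS NULL         -- keep left rows that found no match in b
 AND a.user_id IS NOT NULL      -- mimic NOT IN over a non-empty subquery, which drops NULL keys
;
```

Unlike NOT IN, this form is not affected by NULL keys in data_semi_b, because a NULL on the right side simply never matches in the equality condition.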

Let us look at the execution plans of the two queries that were run above.

EXPLAIN output of the NOT IN query:

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-4 is a root stage                          |
|   Stage-1 depends on stages: Stage-4               |
|   Stage-2 depends on stages: Stage-1               |
|   Stage-0 depends on stages: Stage-2               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-4                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: user_id is null (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is null (type: boolean) |
|               Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   aggregations: count()            |
|                   mode: hash                       |
|                   outputColumnNames: _col0         |
|                   Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     sort order:                    |
|                     Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|                     value expressions: _col0 (type: bigint) |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: count(VALUE._col0)         |
|           mode: mergepartial                       |
|           outputColumnNames: _col0                 |
|           Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|           Filter Operator                          |
|             predicate: (_col0 = 0) (type: boolean) |
|             Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 keys: 0 (type: bigint)             |
|                 mode: hash                         |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|                 File Output Operator               |
|                   compressed: false                |
|                   table:                           |
|                       input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                       output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                       serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Reduce Output Operator                 |
|               sort order:                          |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               value expressions: user_id (type: bigint), sex_id (type: bigint) |
|           TableScan                                |
|             Reduce Output Operator                 |
|               sort order:                          |
|               Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Semi Join 0 to 1               |
|           keys:                                    |
|             0                                      |
|             1                                      |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 6 Data size: 73 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-2                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             Reduce Output Operator                 |
|               key expressions: _col0 (type: bigint) |
|               sort order: +                        |
|               Map-reduce partition columns: _col0 (type: bigint) |
|               Statistics: Num rows: 6 Data size: 73 Basic stats: COMPLETE Column stats: NONE |
|               value expressions: _col1 (type: bigint) |
|           TableScan                                |
|             alias: b                               |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: user_id (type: bigint)  |
|               outputColumnNames: _col0             |
|               Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: _col0 (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: _col0 (type: bigint) |
|                 Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Outer Join0 to 1               |
|           keys:                                    |
|             0 _col0 (type: bigint)                 |
|             1 _col0 (type: bigint)                 |
|           outputColumnNames: _col0, _col1, _col5   |
|           Statistics: Num rows: 6 Data size: 80 Basic stats: COMPLETE Column stats: NONE |
|           Filter Operator                          |
|             predicate: _col5 is null (type: boolean) |
|             Statistics: Num rows: 3 Data size: 40 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: _col0 (type: bigint), _col1 (type: bigint) |
|               outputColumnNames: _col0, _col1      |
|               Statistics: Num rows: 3 Data size: 40 Basic stats: COMPLETE Column stats: NONE |
|               File Output Operator                 |
|                 compressed: false                  |
|                 Statistics: Num rows: 3 Data size: 40 Basic stats: COMPLETE Column stats: NONE |
|                 table:                             |
|                     input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                     output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+

EXPLAIN output of the LEFT OUTER JOIN query:

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             filterExpr: user_id is not null (type: boolean) |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is not null (type: boolean) |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: user_id (type: bigint) |
|                 Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|                 value expressions: sex_id (type: bigint) |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: (user_id is null and user_id is not null) (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (user_id is null and user_id is not null) (type: boolean) |
|               Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: user_id (type: bigint) |
|                 Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Outer Join0 to 1               |
|           keys:                                    |
|             0 user_id (type: bigint)               |
|             1 user_id (type: bigint)               |
|           outputColumnNames: _col0, _col1, _col5   |
|           Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|           Filter Operator                          |
|             predicate: _col5 is not null (type: boolean) |
|             Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: _col0 (type: bigint), _col1 (type: bigint) |
|               outputColumnNames: _col0, _col1      |
|               Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|               File Output Operator                 |
|                 compressed: false                  |
|                 Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|                 table:                             |
|                     input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                     output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
63 rows selected (0.143 seconds)

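LEFT SEMI JOIN for a multi-condition IN, that is, EXISTS

Note: IN can only be used with a single column. For multiple columns we need to use EXISTS.

The following SQL using IN is invalid: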

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE (a.user_id, a.sex_id) IN (
 SELECT  
 b.user_id
 ,b.sex_id
 FROM data_semi_b AS b
)
;

Error: Error while compiling statement: FAILED: ParseException line 6:0 mismatched input 'SELECT' expecting ( near '(' in expression specification (state=42000,code=40000)

We need to use the following form instead:

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON a.user_id = b.user_id
 AND a.sex_id = b.sex_id
;

Or

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE EXISTS (
 SELECT 1
 FROM data_semi_b AS b
 WHERE 
  a.user_id = b.user_id
  AND a.sex_id = b.sex_id
)
;

Result

INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO  : 2020-04-12 23:46:16,157 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-04-12 23:46:24,375 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 3.04 sec
INFO  : 2020-04-12 23:46:28,545 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.84 sec
INFO  : 2020-04-12 23:46:35,732 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.85 sec
INFO  : MapReduce Total cumulative CPU time: 7 seconds 850 msec
INFO  : Ended Job = job_1586423165261_0110
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 7.85 sec   HDFS Read: 17951 HDFS Write: 119 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 7 seconds 850 msec
INFO  : Completed executing command(queryId=hive_20200412234607_8b6acba0-54bb-420f-80df-a5efd5dc9ae5); Time taken: 29.286 seconds
INFO  : OK
+------------+-----------+
| a.user_id  | a.sex_id  |
+------------+-----------+
| 1          | 0         |
| 2          | 1         |
+------------+-----------+
2 rows selected (29.379 seconds)
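
Note that the (NULL, 0) row, which exists in both tables, is missing from the result: NULL = NULL does not evaluate to true, so neither form matches it. If NULL keys should match each other, one option is Hive's null-safe equality operator <=>. The query below is only a sketch, assuming your Hive version accepts <=> in a LEFT SEMI JOIN condition; it was not run on the cluster used above.

```sql
-- Sketch: treat NULL keys as equal by using the null-safe operator <=>.
-- Assumption: <=> is accepted in the ON clause of LEFT SEMI JOIN
-- in the Hive version you are running.
SELECT
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON a.user_id <=> b.user_id
 AND a.sex_id <=> b.sex_id
;
```

With the test data, this form would be expected to keep the (NULL, 0) row in addition to the two rows shown above.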

Let's compare the EXPLAIN output of the two approaches:

LEFT SEMI JOIN

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (user_id is not null and sex_id is not null) (type: boolean) |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint), sex_id (type: bigint) |
|                 sort order: ++                     |
|                 Map-reduce partition columns: user_id (type: bigint), sex_id (type: bigint) |
|                 Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (user_id is not null and sex_id is not null) (type: boolean) |
|               Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: user_id (type: bigint), sex_id (type: bigint) |
|                 outputColumnNames: _col0, _col1    |
|                 Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   keys: _col0 (type: bigint), _col1 (type: bigint) |
|                   mode: hash                       |
|                   outputColumnNames: _col0, _col1  |
|                   Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     key expressions: _col0 (type: bigint), _col1 (type: bigint) |
|                     sort order: ++                 |
|                     Map-reduce partition columns: _col0 (type: bigint), _col1 (type: bigint) |
|                     Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Semi Join 0 to 1               |
|           keys:                                    |
|             0 user_id (type: bigint), sex_id (type: bigint) |
|             1 _col0 (type: bigint), _col1 (type: bigint) |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
64 rows selected (0.121 seconds)
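
In the plan above, the b branch runs a Group By Operator on (user_id, sex_id) before the join, so the right-side keys are deduplicated first. Logically, this is close to an inner join against the distinct keys of b. The query below is only an illustrative rewrite under that reading, not what Hive executes internally:

```sql
-- Illustrative rewrite: inner join against the deduplicated keys of b.
-- For non-NULL keys this should return the same rows as the LEFT SEMI JOIN,
-- because each row of a can match at most one distinct (user_id, sex_id) pair.
SELECT
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
JOIN (
 SELECT DISTINCT
  user_id
  ,sex_id
 FROM data_semi_b
) AS b
 ON a.user_id = b.user_id
 AND a.sex_id = b.sex_id
;
```

Without the DISTINCT, duplicate key pairs in b would multiply the matching rows of a, which is the difference between a plain join and a semi join.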

EXISTS 

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (user_id is not null and sex_id is not null) (type: boolean) |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint), sex_id (type: bigint) |
|                 sort order: ++                     |
|                 Map-reduce partition columns: user_id (type: bigint), sex_id (type: bigint) |
|                 Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (user_id is not null and sex_id is not null) (type: boolean) |
|               Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: user_id (type: bigint), sex_id (type: bigint) |
|                 outputColumnNames: _col0, _col1    |
|                 Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   keys: _col0 (type: bigint), _col1 (type: bigint) |
|                   mode: hash                       |
|                   outputColumnNames: _col0, _col1  |
|                   Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     key expressions: _col0 (type: bigint), _col1 (type: bigint) |
|                     sort order: ++                 |
|                     Map-reduce partition columns: _col0 (type: bigint), _col1 (type: bigint) |
|                     Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Semi Join 0 to 1               |
|           keys:                                    |
|             0 user_id (type: bigint), sex_id (type: bigint) |
|             1 _col0 (type: bigint), _col1 (type: bigint) |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
64 rows selected (0.147 seconds)

As we can see, the two approaches produce identical execution plans.
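
For reference, plans like the two above can be obtained by prefixing each query with EXPLAIN. The exact output depends on the Hive version and execution engine, so treat this as a sketch for reproducing the comparison rather than a guarantee of identical text:

```sql
-- Plan for the LEFT SEMI JOIN form
EXPLAIN
SELECT
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON a.user_id = b.user_id
 AND a.sex_id = b.sex_id
;

-- Plan for the EXISTS form
EXPLAIN
SELECT
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE EXISTS (
 SELECT 1
 FROM data_semi_b AS b
 WHERE
  a.user_id = b.user_id
  AND a.sex_id = b.sex_id
)
;
```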

Reposted from blog.csdn.net/u010003835/article/details/105476658