Hive: LEFT SEMI JOIN / LEFT OUTER JOIN versus (IN / NOT IN) and (EXISTS / NOT EXISTS)

Reference article: https://blog.csdn.net/happyrocking/article/details/79885071

This article looks at LEFT SEMI JOIN in Hive and how it relates to subqueries written with (IN / NOT IN) and (EXISTS / NOT EXISTS).

LEFT SEMI JOIN basics

First, we need to understand what LEFT SEMI JOIN is.

Characteristics

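1. The restriction of LEFT SEMI JOIN is that the right-hand table of the JOIN clause may only be referenced in filter conditions inside the ON clause; it cannot be filtered in the WHERE clause, referenced in the SELECT clause, or used anywhere else.

2. LEFT SEMI JOIN passes only the join key of the right-hand table to the map stage, so the final SELECT may contain left-table columns only.

3. LEFT SEMI JOIN behaves like IN (keySet): once a left-table row has found a match, further duplicate rows in the right table are skipped, whereas JOIN keeps iterating over them. So when the right table contains duplicate keys, LEFT SEMI JOIN emits a single row per left-table row while JOIN emits several, which also tends to make LEFT SEMI JOIN faster.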

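For example, joining table A with table B using JOIN versus LEFT SEMI JOIN and selecting all columns gives different results:

[Figure: result of A JOIN B compared with A LEFT SEMI JOIN B]

Note: the column marked with a blue cross in the figure does not actually exist in the LEFT SEMI JOIN result, because the final SELECT may contain left-table columns only.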
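The duplicate-key behaviour described in point 3 can be checked with a small sketch. It is written in Python with the built-in sqlite3 module purely as a stand-in for Hive, which is an assumption of this example; SQLite has no LEFT SEMI JOIN, so the semi-join is expressed as EXISTS. The tables `a` and `b` and their rows are hypothetical and are not the article's test data.

```python
import sqlite3

# Hypothetical tables (not from the article): b contains user_id 1 twice.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (user_id BIGINT, sex_id BIGINT);
    CREATE TABLE b (user_id BIGINT, age BIGINT);
    INSERT INTO a VALUES (1, 0), (2, 1), (3, 0);
    INSERT INTO b VALUES (1, 12), (1, 13), (2, 14);
""")

# Inner join: one output row per matching (a, b) pair,
# so the left row with user_id 1 is emitted twice.
inner_rows = conn.execute("""
    SELECT a.user_id, a.sex_id
    FROM a
    JOIN b ON a.user_id = b.user_id
    ORDER BY a.user_id
""").fetchall()

# Semi-join semantics written as EXISTS: each left row is emitted
# at most once, no matter how many right rows share its key.
semi_rows = conn.execute("""
    SELECT a.user_id, a.sex_id
    FROM a
    WHERE EXISTS (SELECT 1 FROM b WHERE b.user_id = a.user_id)
    ORDER BY a.user_id
""").fetchall()

print(inner_rows)  # [(1, 0), (1, 0), (2, 1)]
print(semi_rows)   # [(1, 0), (2, 1)]
```

The row counts differ only because of the duplicated key in `b`; with unique keys on the right side the two queries would return the same left-table rows.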

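In practice, LEFT SEMI JOIN can be seen as an alternative to the subquery forms IN and EXISTS; the negated forms NOT IN and NOT EXISTS need a different rewrite, discussed later in this article.

Before Hive 0.13, subqueries inside (IN / NOT IN) and (EXISTS / NOT EXISTS) were not supported, so LEFT SEMI JOIN had to be used instead.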


Building the test data

DROP TABLE IF EXISTS data_semi_a;

DROP TABLE IF EXISTS data_semi_b;


CREATE TABLE IF NOT EXISTS data_semi_a 
(
 user_id BIGINT
 ,sex_id BIGINT 
);

CREATE TABLE IF NOT EXISTS data_semi_b
(
 user_id BIGINT
 ,sex_id BIGINT
 ,age BIGINT
);

INSERT INTO TABLE data_semi_a VALUES
(NULL ,0)
,(1, 1)
,(1, 0)
,(2, 1)
,(3, 0)
,(4, 1)
;

INSERT INTO TABLE data_semi_b VALUES
(NULL, 0, 3)
,(1, 0, 12)
,(2, 1, 14)
;

Test data:

data_semi_a

+----------------------+---------------------+
| data_semi_a.user_id  | data_semi_a.sex_id  |
+----------------------+---------------------+
| NULL                 | 0                   |
| 1                    | 1                   |
| 1                    | 0                   |
| 2                    | 1                   |
| 3                    | 0                   |
| 4                    | 1                   |
+----------------------+---------------------+

data_semi_b

+----------------------+---------------------+------------------+
| data_semi_b.user_id  | data_semi_b.sex_id  | data_semi_b.age  |
+----------------------+---------------------+------------------+
| NULL                 | 0                   | 3                |
| 1                    | 0                   | 12               |
| 2                    | 1                   | 14               |
+----------------------+---------------------+------------------+

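A single-condition LEFT SEMI JOIN is equivalent to IN

Note

LEFT SEMI JOIN is equivalent to IN; it works by passing along only the join KEY from the right-hand table.

So in A LEFT SEMI JOIN B, columns of B must not appear in the SELECT list. The following query fails for that reason: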

SELECT 
 a.user_id
 ,a.sex_id
 ,b.age
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON a.user_id = b.user_id
;

Error: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 4:1 Invalid table alias or column reference 'b': (possible column names are: user_id, sex_id) (state=42000,code=10004)

A single-condition LEFT SEMI JOIN is equivalent to IN, as in the following example.

LEFT SEMI JOIN statement

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON a.user_id = b.user_id
;

Equivalent IN statement

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE a.user_id IN (
 SELECT b.user_id 
 FROM data_semi_b AS b
);

Let us compare the results of the two statements.

Result of the LEFT SEMI JOIN

INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO  : 2020-04-12 10:53:09,591 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-04-12 10:53:17,849 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 3.12 sec
INFO  : 2020-04-12 10:53:22,975 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.84 sec
INFO  : 2020-04-12 10:53:29,141 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.77 sec
INFO  : MapReduce Total cumulative CPU time: 7 seconds 770 msec
INFO  : Ended Job = job_1586423165261_0087
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 7.77 sec   HDFS Read: 16677 HDFS Write: 135 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 7 seconds 770 msec
INFO  : Completed executing command(queryId=hive_20200412105301_9f643e42-c966-4140-8c72-330be6bdd73c); Time taken: 28.939 seconds
INFO  : OK
+------------+-----------+
| a.user_id  | a.sex_id  |
+------------+-----------+
| 1          | 0         |
| 1          | 1         |
| 2          | 1         |
+------------+-----------+
3 rows selected (29.073 seconds)

Result of the IN query

INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO  : 2020-04-12 10:37:26,143 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-04-12 10:37:33,376 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 2.71 sec
INFO  : 2020-04-12 10:37:39,510 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.6 sec
INFO  : 2020-04-12 10:37:44,680 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.41 sec
INFO  : MapReduce Total cumulative CPU time: 7 seconds 410 msec
INFO  : Ended Job = job_1586423165261_0085
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 7.41 sec   HDFS Read: 16726 HDFS Write: 135 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 7 seconds 410 msec
INFO  : Completed executing command(queryId=hive_20200412103717_2ab604da-f301-4fee-b9bd-9c22ad6e65a1); Time taken: 27.796 seconds
INFO  : OK
+------------+-----------+
| a.user_id  | a.sex_id  |
+------------+-----------+
| 1          | 0         |
| 1          | 1         |
| 2          | 1         |
+------------+-----------+
3 rows selected (27.902 seconds)

Next, the EXPLAIN output of the two statements:

EXPLAIN output for LEFT SEMI JOIN:

INFO  : Starting task [Stage-3:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200412105949_53e51917-8c04-4f6f-b9fd-32ab71a2888b); Time taken: 0.005 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             filterExpr: user_id is not null (type: boolean) |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is not null (type: boolean) |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: user_id (type: bigint) |
|                 Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|                 value expressions: sex_id (type: bigint) |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: user_id is not null (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is not null (type: boolean) |
|               Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: user_id (type: bigint) |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   keys: _col0 (type: bigint)       |
|                   mode: hash                       |
|                   outputColumnNames: _col0         |
|                   Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     key expressions: _col0 (type: bigint) |
|                     sort order: +                  |
|                     Map-reduce partition columns: _col0 (type: bigint) |
|                     Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Semi Join 0 to 1               |
|           keys:                                    |
|             0 user_id (type: bigint)               |
|             1 _col0 (type: bigint)                 |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
65 rows selected (0.136 seconds)

EXPLAIN output for IN:

INFO  : Starting task [Stage-3:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200412110229_81d9cf79-50e2-46f1-8152-a399038861c7); Time taken: 0.005 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             filterExpr: user_id is not null (type: boolean) |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is not null (type: boolean) |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: user_id (type: bigint) |
|                 Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|                 value expressions: sex_id (type: bigint) |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: user_id is not null (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is not null (type: boolean) |
|               Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: user_id (type: bigint) |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   keys: _col0 (type: bigint)       |
|                   mode: hash                       |
|                   outputColumnNames: _col0         |
|                   Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     key expressions: _col0 (type: bigint) |
|                     sort order: +                  |
|                     Map-reduce partition columns: _col0 (type: bigint) |
|                     Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Semi Join 0 to 1               |
|           keys:                                    |
|             0 user_id (type: bigint)               |
|             1 _col0 (type: bigint)                 |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
65 rows selected (0.127 seconds)

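As shown, the two statements produce identical results and identical EXPLAIN plans.

In other words, Hive executes the IN subquery as a LEFT SEMI JOIN internally: both plans contain the same Left Semi Join operator.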
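Both plans filter `user_id is not null` on each side before joining. A minimal sketch of why that filter is safe, again using Python's sqlite3 module as an assumed stand-in for Hive (EXISTS plays the role of LEFT SEMI JOIN, which SQLite lacks): a NULL key never compares equal to anything, so the `(NULL, 0)` row of `data_semi_a` cannot match even though `data_semi_b` also holds a NULL key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same rows as the article's data_semi_a / data_semi_b tables.
conn.executescript("""
    CREATE TABLE data_semi_a (user_id BIGINT, sex_id BIGINT);
    CREATE TABLE data_semi_b (user_id BIGINT, sex_id BIGINT, age BIGINT);
    INSERT INTO data_semi_a VALUES
        (NULL, 0), (1, 1), (1, 0), (2, 1), (3, 0), (4, 1);
    INSERT INTO data_semi_b VALUES
        (NULL, 0, 3), (1, 0, 12), (2, 1, 14);
""")

# IN with a subquery, as in the article.
in_rows = conn.execute("""
    SELECT a.user_id, a.sex_id
    FROM data_semi_a AS a
    WHERE a.user_id IN (SELECT b.user_id FROM data_semi_b AS b)
    ORDER BY a.user_id, a.sex_id
""").fetchall()

# Semi-join semantics expressed as a correlated EXISTS.
exists_rows = conn.execute("""
    SELECT a.user_id, a.sex_id
    FROM data_semi_a AS a
    WHERE EXISTS (
        SELECT 1 FROM data_semi_b AS b WHERE b.user_id = a.user_id
    )
    ORDER BY a.user_id, a.sex_id
""").fetchall()

print(in_rows)      # [(1, 0), (1, 1), (2, 1)]
print(exists_rows)  # [(1, 0), (1, 1), (2, 1)]
```

The three rows are the same ones Hive returned above; the row with a NULL `user_id` is absent from both results.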

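Implementing NOT IN with LEFT OUTER JOIN

Note that LEFT SEMI JOIN cannot implement NOT IN.

The immediate reason: the Hive version used here does not support non-equality join conditions, so the attempt below fails to compile. (Even where such a condition is accepted, a.user_id != b.user_id would only test that some row of b has a different key, which is not what NOT IN means.)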

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON (a.user_id != b.user_id)
;

Error: Error while compiling statement: FAILED: SemanticException [Error 10017]: Line 6:4 Both left and right aliases encountered in JOIN 'user_id' (state=42000,code=10017)

A form that works: NOT IN with a subquery

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE a.user_id NOT IN (
 SELECT b.user_id 
 FROM data_semi_b AS b
);

INFO  : Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1
INFO  : 2020-04-12 23:02:26,751 Stage-2 map = 0%,  reduce = 0%
INFO  : 2020-04-12 23:02:33,938 Stage-2 map = 50%,  reduce = 0%, Cumulative CPU 1.76 sec
INFO  : 2020-04-12 23:02:39,172 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 3.35 sec
INFO  : 2020-04-12 23:02:47,688 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 7.88 sec
INFO  : MapReduce Total cumulative CPU time: 7 seconds 880 msec
INFO  : Ended Job = job_1586423165261_0106
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-4: Map: 1  Reduce: 1   Cumulative CPU: 6.49 sec   HDFS Read: 8372 HDFS Write: 96 SUCCESS
INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 5.65 sec   HDFS Read: 11974 HDFS Write: 96 SUCCESS
INFO  : Stage-Stage-2: Map: 2  Reduce: 1   Cumulative CPU: 7.88 sec   HDFS Read: 14131 HDFS Write: 87 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 20 seconds 20 msec
INFO  : Completed executing command(queryId=hive_20200412230117_fef818dc-e433-4880-9c8d-f6a9d28a08a9); Time taken: 91.471 seconds
INFO  : OK
+------------+-----------+
| a.user_id  | a.sex_id  |
+------------+-----------+
+------------+-----------+
No rows selected (91.674 seconds)
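The empty result follows from SQL's three-valued logic: `data_semi_b.user_id` contains a NULL, so `x NOT IN (...)` is never true for any `x`. The sketch below reproduces this on the article's data using Python's sqlite3 module, which is an assumption made for the sake of a runnable example rather than a Hive run; the second query, with the NULL filtered out of the subquery, is an illustrative addition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same rows as the article's data_semi_a / data_semi_b tables.
conn.executescript("""
    CREATE TABLE data_semi_a (user_id BIGINT, sex_id BIGINT);
    CREATE TABLE data_semi_b (user_id BIGINT, sex_id BIGINT, age BIGINT);
    INSERT INTO data_semi_a VALUES
        (NULL, 0), (1, 1), (1, 0), (2, 1), (3, 0), (4, 1);
    INSERT INTO data_semi_b VALUES
        (NULL, 0, 3), (1, 0, 12), (2, 1, 14);
""")

# The subquery yields {NULL, 1, 2}. For user_id 3, "3 NOT IN (NULL, 1, 2)"
# evaluates to NULL (unknown), not TRUE, so the row is filtered out.
not_in_rows = conn.execute("""
    SELECT a.user_id, a.sex_id
    FROM data_semi_a AS a
    WHERE a.user_id NOT IN (SELECT b.user_id FROM data_semi_b AS b)
""").fetchall()

# Excluding NULL keys inside the subquery gives the rows one might expect.
not_in_without_nulls = conn.execute("""
    SELECT a.user_id, a.sex_id
    FROM data_semi_a AS a
    WHERE a.user_id NOT IN (
        SELECT b.user_id FROM data_semi_b AS b
        WHERE b.user_id IS NOT NULL
    )
    ORDER BY a.user_id
""").fetchall()

print(not_in_rows)           # []
print(not_in_without_nulls)  # [(3, 0), (4, 1)]
```

The `(NULL, 0)` row of `data_semi_a` is absent from both results, because `NULL NOT IN (...)` is itself unknown.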

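NOT IN cannot be rewritten as a LEFT SEMI JOIN; a LEFT OUTER JOIN is needed instead. The usual anti-join pattern joins on equality and then keeps only the left rows that found no match (WHERE b.user_id IS NULL). Note that the query below, as it was actually run, puts b.user_id IS NULL in the ON clause and b.user_id IS NOT NULL in the WHERE clause; these predicates contradict each other, so it returns no rows for any input. It matches the NOT IN result here only because NOT IN is itself empty when the subquery returns a NULL.

The LEFT OUTER JOIN statement that was run: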

SELECT 
  a.user_id
  ,a.sex_id 
FROM data_semi_a AS a
LEFT OUTER JOIN data_semi_b AS b
 ON a.user_id = b.user_id
 AND b.user_id IS NULL
WHERE a.user_id IS NOT NULL
 AND b.user_id IS NOT NULL
;

INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO  : 2020-04-12 23:04:47,896 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-04-12 23:04:55,176 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 2.91 sec
INFO  : 2020-04-12 23:05:00,288 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.53 sec
INFO  : 2020-04-12 23:05:06,449 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.45 sec
INFO  : MapReduce Total cumulative CPU time: 8 seconds 450 msec
INFO  : Ended Job = job_1586423165261_0107
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 8.45 sec   HDFS Read: 16358 HDFS Write: 87 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 8 seconds 450 msec
INFO  : Completed executing command(queryId=hive_20200412230438_62ce326e-1b03-4c5a-a842-6816dc6feda3); Time taken: 28.871 seconds
INFO  : OK
+------------+-----------+
| a.user_id  | a.sex_id  |
+------------+-----------+
+------------+-----------+
No rows selected (28.979 seconds)
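For comparison, here is a minimal sketch of the conventional anti-join pattern (equality in ON, `b.user_id IS NULL` in WHERE) on the same data. It uses Python's sqlite3 module as an assumed stand-in for Hive, and the variable names are illustrative. It also shows that this pattern behaves like NOT EXISTS rather than NOT IN when NULL keys are present.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same rows as the article's data_semi_a / data_semi_b tables.
conn.executescript("""
    CREATE TABLE data_semi_a (user_id BIGINT, sex_id BIGINT);
    CREATE TABLE data_semi_b (user_id BIGINT, sex_id BIGINT, age BIGINT);
    INSERT INTO data_semi_a VALUES
        (NULL, 0), (1, 1), (1, 0), (2, 1), (3, 0), (4, 1);
    INSERT INTO data_semi_b VALUES
        (NULL, 0, 3), (1, 0, 12), (2, 1, 14);
""")

# Anti-join: keep left rows for which the outer join found no partner.
anti_join_rows = conn.execute("""
    SELECT a.user_id, a.sex_id
    FROM data_semi_a AS a
    LEFT OUTER JOIN data_semi_b AS b
      ON a.user_id = b.user_id
    WHERE b.user_id IS NULL
    ORDER BY a.user_id, a.sex_id
""").fetchall()

# NOT EXISTS returns the same rows, including the NULL-keyed left row.
not_exists_rows = conn.execute("""
    SELECT a.user_id, a.sex_id
    FROM data_semi_a AS a
    WHERE NOT EXISTS (
        SELECT 1 FROM data_semi_b AS b WHERE b.user_id = a.user_id
    )
    ORDER BY a.user_id, a.sex_id
""").fetchall()

# Dropping NULL-keyed left rows as well.
anti_join_non_null = conn.execute("""
    SELECT a.user_id, a.sex_id
    FROM data_semi_a AS a
    LEFT OUTER JOIN data_semi_b AS b
      ON a.user_id = b.user_id
    WHERE b.user_id IS NULL
      AND a.user_id IS NOT NULL
    ORDER BY a.user_id, a.sex_id
""").fetchall()

print(anti_join_rows)      # [(None, 0), (3, 0), (4, 1)]
print(not_exists_rows)     # [(None, 0), (3, 0), (4, 1)]
print(anti_join_non_null)  # [(3, 0), (4, 1)]
```

None of these is empty, unlike the NOT IN result on this data, so an anti-join is a drop-in replacement for NOT IN only when the key columns are known to contain no NULLs.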

Let us look at how these two statements are executed.

EXPLAIN output for NOT IN:

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-4 is a root stage                          |
|   Stage-1 depends on stages: Stage-4               |
|   Stage-2 depends on stages: Stage-1               |
|   Stage-0 depends on stages: Stage-2               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-4                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: user_id is null (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is null (type: boolean) |
|               Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   aggregations: count()            |
|                   mode: hash                       |
|                   outputColumnNames: _col0         |
|                   Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     sort order:                    |
|                     Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|                     value expressions: _col0 (type: bigint) |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: count(VALUE._col0)         |
|           mode: mergepartial                       |
|           outputColumnNames: _col0                 |
|           Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|           Filter Operator                          |
|             predicate: (_col0 = 0) (type: boolean) |
|             Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 keys: 0 (type: bigint)             |
|                 mode: hash                         |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|                 File Output Operator               |
|                   compressed: false                |
|                   table:                           |
|                       input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                       output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                       serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Reduce Output Operator                 |
|               sort order:                          |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               value expressions: user_id (type: bigint), sex_id (type: bigint) |
|           TableScan                                |
|             Reduce Output Operator                 |
|               sort order:                          |
|               Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Semi Join 0 to 1               |
|           keys:                                    |
|             0                                      |
|             1                                      |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 6 Data size: 73 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-2                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             Reduce Output Operator                 |
|               key expressions: _col0 (type: bigint) |
|               sort order: +                        |
|               Map-reduce partition columns: _col0 (type: bigint) |
|               Statistics: Num rows: 6 Data size: 73 Basic stats: COMPLETE Column stats: NONE |
|               value expressions: _col1 (type: bigint) |
|           TableScan                                |
|             alias: b                               |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: user_id (type: bigint)  |
|               outputColumnNames: _col0             |
|               Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: _col0 (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: _col0 (type: bigint) |
|                 Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Outer Join0 to 1               |
|           keys:                                    |
|             0 _col0 (type: bigint)                 |
|             1 _col0 (type: bigint)                 |
|           outputColumnNames: _col0, _col1, _col5   |
|           Statistics: Num rows: 6 Data size: 80 Basic stats: COMPLETE Column stats: NONE |
|           Filter Operator                          |
|             predicate: _col5 is null (type: boolean) |
|             Statistics: Num rows: 3 Data size: 40 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: _col0 (type: bigint), _col1 (type: bigint) |
|               outputColumnNames: _col0, _col1      |
|               Statistics: Num rows: 3 Data size: 40 Basic stats: COMPLETE Column stats: NONE |
|               File Output Operator                 |
|                 compressed: false                  |
|                 Statistics: Num rows: 3 Data size: 40 Basic stats: COMPLETE Column stats: NONE |
|                 table:                             |
|                     input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                     output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+

EXPLAIN output for LEFT OUTER JOIN:

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             filterExpr: user_id is not null (type: boolean) |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is not null (type: boolean) |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: user_id (type: bigint) |
|                 Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|                 value expressions: sex_id (type: bigint) |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: (user_id is null and user_id is not null) (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (user_id is null and user_id is not null) (type: boolean) |
|               Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: user_id (type: bigint) |
|                 Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Outer Join0 to 1               |
|           keys:                                    |
|             0 user_id (type: bigint)               |
|             1 user_id (type: bigint)               |
|           outputColumnNames: _col0, _col1, _col5   |
|           Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|           Filter Operator                          |
|             predicate: _col5 is not null (type: boolean) |
|             Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: _col0 (type: bigint), _col1 (type: bigint) |
|               outputColumnNames: _col0, _col1      |
|               Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|               File Output Operator                 |
|                 compressed: false                  |
|                 Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|                 table:                             |
|                     input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                     output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
63 rows selected (0.143 seconds)

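Using LEFT SEMI JOIN for a multi-column IN, i.e. EXISTS

Note: in the Hive version tested, IN with a subquery can only be used with a single column; for multiple columns we need EXISTS or LEFT SEMI JOIN.

The following IN query is invalid: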

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE (a.user_id, a.sex_id) IN (
 SELECT  
 b.user_id
 ,b.sex_id
 FROM data_semi_b AS b
)
;

Error: Error while compiling statement: FAILED: ParseException line 6:0 mismatched input 'SELECT' expecting ( near '(' in expression specification (state=42000,code=40000)

Instead, use the following form:

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON a.user_id = b.user_id
 AND a.sex_id = b.sex_id
;

or

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE EXISTS (
 SELECT 1
 FROM data_semi_b AS b
 WHERE 
  a.user_id = b.user_id
  AND a.sex_id = b.sex_id
)
;

Result

INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO  : 2020-04-12 23:46:16,157 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-04-12 23:46:24,375 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 3.04 sec
INFO  : 2020-04-12 23:46:28,545 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.84 sec
INFO  : 2020-04-12 23:46:35,732 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.85 sec
INFO  : MapReduce Total cumulative CPU time: 7 seconds 850 msec
INFO  : Ended Job = job_1586423165261_0110
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 7.85 sec   HDFS Read: 17951 HDFS Write: 119 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 7 seconds 850 msec
INFO  : Completed executing command(queryId=hive_20200412234607_8b6acba0-54bb-420f-80df-a5efd5dc9ae5); Time taken: 29.286 seconds
INFO  : OK
+------------+-----------+
| a.user_id  | a.sex_id  |
+------------+-----------+
| 1          | 0         |
| 2          | 1         |
+------------+-----------+
2 rows selected (29.379 seconds)
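
To see why only two rows come back, the same test data can be loaded into an in-memory SQLite database and the EXISTS form run locally. This is a minimal sketch, not Hive itself: it assumes SQLite treats NULL in `=` comparisons the same way Hive does for this query (NULL = NULL is not true), which is consistent with the result above. The table and column names come from this article; the `sqlite3` setup and variable names are scaffolding added for illustration.

```python
import sqlite3

# In-memory database standing in for Hive (assumption: same NULL semantics for "=").
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_semi_a (user_id INTEGER, sex_id INTEGER);
CREATE TABLE data_semi_b (user_id INTEGER, sex_id INTEGER, age INTEGER);
INSERT INTO data_semi_a VALUES (NULL, 0), (1, 1), (1, 0), (2, 1), (3, 0), (4, 1);
INSERT INTO data_semi_b VALUES (NULL, 0, 3), (1, 0, 12), (2, 1, 14);
""")

# Same EXISTS query as in the article, with ORDER BY added for a stable result.
exists_sql = """
SELECT a.user_id, a.sex_id
FROM data_semi_a AS a
WHERE EXISTS (
    SELECT 1
    FROM data_semi_b AS b
    WHERE a.user_id = b.user_id
      AND a.sex_id = b.sex_id
)
ORDER BY a.user_id, a.sex_id
"""
rows = conn.execute(exists_sql).fetchall()
print(rows)  # [(1, 0), (2, 1)]
```

The row (NULL, 0) exists in both tables but is not returned, because comparing NULL with NULL does not evaluate to true. The row (1, 1) is dropped because b only contains (1, 0), so the second column does not match.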

Let's look at the EXPLAIN output for the two approaches:

LEFT SEMI JOIN

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (user_id is not null and sex_id is not null) (type: boolean) |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint), sex_id (type: bigint) |
|                 sort order: ++                     |
|                 Map-reduce partition columns: user_id (type: bigint), sex_id (type: bigint) |
|                 Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (user_id is not null and sex_id is not null) (type: boolean) |
|               Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: user_id (type: bigint), sex_id (type: bigint) |
|                 outputColumnNames: _col0, _col1    |
|                 Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   keys: _col0 (type: bigint), _col1 (type: bigint) |
|                   mode: hash                       |
|                   outputColumnNames: _col0, _col1  |
|                   Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     key expressions: _col0 (type: bigint), _col1 (type: bigint) |
|                     sort order: ++                 |
|                     Map-reduce partition columns: _col0 (type: bigint), _col1 (type: bigint) |
|                     Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Semi Join 0 to 1               |
|           keys:                                    |
|             0 user_id (type: bigint), sex_id (type: bigint) |
|             1 _col0 (type: bigint), _col1 (type: bigint) |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
64 rows selected (0.121 seconds)

EXISTS

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (user_id is not null and sex_id is not null) (type: boolean) |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint), sex_id (type: bigint) |
|                 sort order: ++                     |
|                 Map-reduce partition columns: user_id (type: bigint), sex_id (type: bigint) |
|                 Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (user_id is not null and sex_id is not null) (type: boolean) |
|               Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: user_id (type: bigint), sex_id (type: bigint) |
|                 outputColumnNames: _col0, _col1    |
|                 Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   keys: _col0 (type: bigint), _col1 (type: bigint) |
|                   mode: hash                       |
|                   outputColumnNames: _col0, _col1  |
|                   Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     key expressions: _col0 (type: bigint), _col1 (type: bigint) |
|                     sort order: ++                 |
|                     Map-reduce partition columns: _col0 (type: bigint), _col1 (type: bigint) |
|                     Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Semi Join 0 to 1               |
|           keys:                                    |
|             0 user_id (type: bigint), sex_id (type: bigint) |
|             1 _col0 (type: bigint), _col1 (type: bigint) |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
64 rows selected (0.147 seconds)

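As we can see, the two approaches produce the same execution plan: both compile to a Left Semi Join operator on (user_id, sex_id), with NULL keys filtered out on both sides.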
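
The shared plan can be read as three steps: drop rows whose join keys are NULL on both sides (Filter Operator), deduplicate the key pairs from b (Group By Operator), then keep each row of a whose key pair appears in that set (Left Semi Join). The following is a minimal Python sketch of that reading, using the article's test data. `left_semi_join` is a hypothetical helper written for illustration; it models the result of the plan, not how Hive implements the operator internally.

```python
from typing import List, Optional, Tuple

Row = Tuple[Optional[int], ...]

def left_semi_join(left: List[Row], right: List[Row], n_keys: int = 2) -> List[Row]:
    """Model of the plan: null filter -> dedupe right keys -> membership test."""
    # Filter Operator + Group By Operator on b: non-null key tuples, deduplicated.
    right_keys = {
        r[:n_keys] for r in right
        if all(k is not None for k in r[:n_keys])
    }
    # Filter Operator on a + Left Semi Join: each left row is emitted at most once.
    return [
        row for row in left
        if all(k is not None for k in row[:n_keys]) and row[:n_keys] in right_keys
    ]

data_semi_a = [(None, 0), (1, 1), (1, 0), (2, 1), (3, 0), (4, 1)]
data_semi_b = [(None, 0, 3), (1, 0, 12), (2, 1, 14)]

result = left_semi_join(data_semi_a, data_semi_b)
print(result)  # [(1, 0), (2, 1)]
```

Because the keys from b are collected into a set, a key pair that appears several times in b still yields only one output row for each matching row of a.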

高达一号
