Reference article: https://blog.csdn.net/happyrocking/article/details/79885071
This article looks at LEFT SEMI JOIN in Hive and how it relates to (IN / NOT IN) and (EXISTS / NOT EXISTS) subqueries.
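LEFT SEMI JOIN basics
First, let's look at what LEFT SEMI JOIN is.
Characteristics
1. The restriction of LEFT SEMI JOIN is that the right-hand table of the JOIN clause may only be referenced in the ON clause. Filtering on it in the WHERE clause, the SELECT clause, or anywhere else is not allowed.
2. LEFT SEMI JOIN passes only the table's join key to the map stage, so the final SELECT of a LEFT SEMI JOIN may only contain columns from the left table.
3. Because LEFT SEMI JOIN behaves like IN (keySet), when the right table has duplicate records for a key the left-table row is emitted once and the scan moves on, whereas JOIN keeps iterating over every match. So when the right table contains duplicate values, LEFT SEMI JOIN produces one row where JOIN produces several, which also tends to make LEFT SEMI JOIN faster.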
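For example, joining table A and table B with JOIN versus LEFT SEMI JOIN and then selecting all columns gives different results (illustrated by a figure in the original post).
Note: the column marked with a blue cross in that figure does not actually exist in the LEFT SEMI JOIN result, because the final SELECT may only contain columns from the left table.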
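To make the difference concrete, here is a minimal Python sketch of the two behaviours. The tables `table_a` and `table_b` and their values are hypothetical, chosen only so that the right table has a duplicate key. The code models the semantics described above, not how Hive executes the join.

```python
# Hypothetical tables of (key, value) rows; key 1 appears twice on the right.
table_a = [(1, "a1"), (2, "a2"), (3, "a3")]
table_b = [(1, "b1"), (1, "b2"), (2, "b3")]


def inner_join(left, right):
    """Emit one combined row for every matching (left, right) pair."""
    return [l + r for l in left for r in right if l[0] == r[0]]


def left_semi_join(left, right):
    """Emit each left row at most once if its key exists on the right."""
    right_keys = {r[0] for r in right}  # duplicates collapse into one key
    return [l for l in left if l[0] in right_keys]


joined = inner_join(table_a, table_b)
semi_joined = left_semi_join(table_a, table_b)
print(len(joined), joined)
print(len(semi_joined), semi_joined)
```

The inner join repeats the row with key 1 once per matching right row and carries the right-hand columns along, while the semi join returns left-table columns only and each left row at most once.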
In practice, LEFT SEMI JOIN can be regarded as an alternative to the subquery forms of IN and EXISTS.
Before Hive 0.13, subqueries inside (IN / NOT IN) and (EXISTS / NOT EXISTS) were not supported, so LEFT SEMI JOIN had to be used instead. The Hive documentation describes this.
Building the basic test data
DROP TABLE IF EXISTS data_semi_a;
DROP TABLE IF EXISTS data_semi_b;
CREATE TABLE IF NOT EXISTS data_semi_a
(
user_id BIGINT
,sex_id BIGINT
);
CREATE TABLE IF NOT EXISTS data_semi_b
(
user_id BIGINT
,sex_id BIGINT
,age BIGINT
);
INSERT INTO TABLE data_semi_a VALUES
(NULL ,0)
,(1, 1)
,(1, 0)
,(2, 1)
,(3, 0)
,(4, 1)
;
INSERT INTO TABLE data_semi_b VALUES
(NULL, 0, 3)
,(1, 0, 12)
,(2, 1, 14)
;
Test data:
data_semi_a
+----------------------+---------------------+
| data_semi_a.user_id | data_semi_a.sex_id |
+----------------------+---------------------+
| NULL | 0 |
| 1 | 1 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 1 |
+----------------------+---------------------+
data_semi_b
+----------------------+---------------------+------------------+
| data_semi_b.user_id | data_semi_b.sex_id | data_semi_b.age |
+----------------------+---------------------+------------------+
| NULL | 0 | 3 |
| 1 | 0 | 12 |
| 2 | 1 | 14 |
+----------------------+---------------------+------------------+
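A single-condition LEFT SEMI JOIN is equivalent to IN
Note
LEFT SEMI JOIN is equivalent to IN because only the join KEY of the LEFT SEMI JOIN is passed along.
So for A LEFT SEMI JOIN B, the SELECT list must not contain columns from B. The following query fails for that reason: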
SELECT
a.user_id
,a.sex_id
,b.age
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
ON a.user_id = b.user_id
;
Error: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 4:1 Invalid table alias or column reference 'b': (possible column names are: user_id, sex_id) (state=42000,code=10004)
A single-condition LEFT SEMI JOIN is equivalent to IN, as in the following SQL.
SQL statement
SELECT
a.user_id
,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
ON a.user_id = b.user_id
;
Equivalent IN SQL
SELECT
a.user_id
,a.sex_id
FROM data_semi_a AS a
WHERE a.user_id IN (
SELECT b.user_id
FROM data_semi_b AS b
);
Let's compare the results of running the two SQL statements.
Execution result of LEFT SEMI JOIN
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO : 2020-04-12 10:53:09,591 Stage-1 map = 0%, reduce = 0%
INFO : 2020-04-12 10:53:17,849 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 3.12 sec
INFO : 2020-04-12 10:53:22,975 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.84 sec
INFO : 2020-04-12 10:53:29,141 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.77 sec
INFO : MapReduce Total cumulative CPU time: 7 seconds 770 msec
INFO : Ended Job = job_1586423165261_0087
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 7.77 sec HDFS Read: 16677 HDFS Write: 135 SUCCESS
INFO : Total MapReduce CPU Time Spent: 7 seconds 770 msec
INFO : Completed executing command(queryId=hive_20200412105301_9f643e42-c966-4140-8c72-330be6bdd73c); Time taken: 28.939 seconds
INFO : OK
+------------+-----------+
| a.user_id | a.sex_id |
+------------+-----------+
| 1 | 0 |
| 1 | 1 |
| 2 | 1 |
+------------+-----------+
3 rows selected (29.073 seconds)
Execution result of IN
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO : 2020-04-12 10:37:26,143 Stage-1 map = 0%, reduce = 0%
INFO : 2020-04-12 10:37:33,376 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 2.71 sec
INFO : 2020-04-12 10:37:39,510 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.6 sec
INFO : 2020-04-12 10:37:44,680 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.41 sec
INFO : MapReduce Total cumulative CPU time: 7 seconds 410 msec
INFO : Ended Job = job_1586423165261_0085
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 7.41 sec HDFS Read: 16726 HDFS Write: 135 SUCCESS
INFO : Total MapReduce CPU Time Spent: 7 seconds 410 msec
INFO : Completed executing command(queryId=hive_20200412103717_2ab604da-f301-4fee-b9bd-9c22ad6e65a1); Time taken: 27.796 seconds
INFO : OK
+------------+-----------+
| a.user_id | a.sex_id |
+------------+-----------+
| 1 | 0 |
| 1 | 1 |
| 2 | 1 |
+------------+-----------+
3 rows selected (27.902 seconds)
Now let's look at the EXPLAIN output of the two statements.
EXPLAIN output for LEFT SEMI JOIN:
INFO : Starting task [Stage-3:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20200412105949_53e51917-8c04-4f6f-b9fd-32ab71a2888b); Time taken: 0.005 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| filterExpr: user_id is not null (type: boolean) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: user_id is not null (type: boolean) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| value expressions: sex_id (type: bigint) |
| TableScan |
| alias: b |
| filterExpr: user_id is not null (type: boolean) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: user_id is not null (type: boolean) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: user_id (type: bigint) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: _col0 (type: bigint) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: bigint) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Semi Join 0 to 1 |
| keys: |
| 0 user_id (type: bigint) |
| 1 _col0 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
65 rows selected (0.136 seconds)
EXPLAIN output for IN:
INFO : Starting task [Stage-3:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20200412110229_81d9cf79-50e2-46f1-8152-a399038861c7); Time taken: 0.005 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| filterExpr: user_id is not null (type: boolean) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: user_id is not null (type: boolean) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| value expressions: sex_id (type: bigint) |
| TableScan |
| alias: b |
| filterExpr: user_id is not null (type: boolean) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: user_id is not null (type: boolean) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: user_id (type: bigint) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: _col0 (type: bigint) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: bigint) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Semi Join 0 to 1 |
| keys: |
| 0 user_id (type: bigint) |
| 1 _col0 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
65 rows selected (0.127 seconds)
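As you can see, the two statements are identical in both their execution results and their EXPLAIN output.
In fact, IN is internally executed as a LEFT SEMI JOIN: the plan for the IN query contains the same "Left Semi Join 0 to 1" operator.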
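The equivalence can also be checked outside Hive. The sketch below is an illustration under stated assumptions: SQLite (through Python's built-in `sqlite3` module) stands in for Hive to run the IN query, and the helper `semi_join_on_user_id` is a hypothetical plain-Python model of the semi join that drops NULL keys, mirroring the `user_id is not null` filters in the plans above. The data is the test data from this article.

```python
import sqlite3

# Test data from this article: data_semi_a and data_semi_b.
ROWS_A = [(None, 0), (1, 1), (1, 0), (2, 1), (3, 0), (4, 1)]
ROWS_B = [(None, 0, 3), (1, 0, 12), (2, 1, 14)]


def semi_join_on_user_id(left, right):
    """Model of LEFT SEMI JOIN on user_id: keep left rows whose key exists on the right."""
    keys = {r[0] for r in right if r[0] is not None}  # NULL keys never match
    return [l for l in left if l[0] is not None and l[0] in keys]


def in_subquery(left, right):
    """Run the IN query from this article on an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data_semi_a (user_id INTEGER, sex_id INTEGER)")
    conn.execute("CREATE TABLE data_semi_b (user_id INTEGER, sex_id INTEGER, age INTEGER)")
    conn.executemany("INSERT INTO data_semi_a VALUES (?, ?)", left)
    conn.executemany("INSERT INTO data_semi_b VALUES (?, ?, ?)", right)
    rows = conn.execute(
        "SELECT a.user_id, a.sex_id FROM data_semi_a AS a "
        "WHERE a.user_id IN (SELECT b.user_id FROM data_semi_b AS b)"
    ).fetchall()
    conn.close()
    return rows


semi_rows = sorted(semi_join_on_user_id(ROWS_A, ROWS_B))
in_rows = sorted(in_subquery(ROWS_A, ROWS_B))
print(semi_rows)
print(in_rows)
```

Both lists contain the same three rows that Hive returned for the two queries.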
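Implementing NOT IN with LEFT OUTER JOIN
Note that LEFT SEMI JOIN cannot implement NOT IN.
Root cause: Hive, at least in the version used here, does not support non-equi join conditions. Even where a != condition is accepted, it would only ask whether some row in B has a different user_id, which is not what NOT IN means.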
SELECT
a.user_id
,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
ON (a.user_id != b.user_id)
;
Error: Error while compiling statement: FAILED: SemanticException [Error 10017]: Line 6:4 Both left and right aliases encountered in JOIN 'user_id' (state=42000,code=10017)
The correct way is to write NOT IN directly:
SELECT
a.user_id
,a.sex_id
FROM data_semi_a AS a
WHERE a.user_id NOT IN (
SELECT b.user_id
FROM data_semi_b AS b
);
INFO : Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1
INFO : 2020-04-12 23:02:26,751 Stage-2 map = 0%, reduce = 0%
INFO : 2020-04-12 23:02:33,938 Stage-2 map = 50%, reduce = 0%, Cumulative CPU 1.76 sec
INFO : 2020-04-12 23:02:39,172 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 3.35 sec
INFO : 2020-04-12 23:02:47,688 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 7.88 sec
INFO : MapReduce Total cumulative CPU time: 7 seconds 880 msec
INFO : Ended Job = job_1586423165261_0106
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-4: Map: 1 Reduce: 1 Cumulative CPU: 6.49 sec HDFS Read: 8372 HDFS Write: 96 SUCCESS
INFO : Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 5.65 sec HDFS Read: 11974 HDFS Write: 96 SUCCESS
INFO : Stage-Stage-2: Map: 2 Reduce: 1 Cumulative CPU: 7.88 sec HDFS Read: 14131 HDFS Write: 87 SUCCESS
INFO : Total MapReduce CPU Time Spent: 20 seconds 20 msec
INFO : Completed executing command(queryId=hive_20200412230117_fef818dc-e433-4880-9c8d-f6a9d28a08a9); Time taken: 91.471 seconds
INFO : OK
+------------+-----------+
| a.user_id | a.sex_id |
+------------+-----------+
+------------+-----------+
No rows selected (91.674 seconds)
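The empty result follows from SQL's three-valued logic: data_semi_b contains a row whose user_id is NULL, so for any value x that is not in the subquery result, `x NOT IN (...)` evaluates to NULL rather than TRUE and the row is filtered out. Stage-4 of the NOT IN plan shown later reflects this by counting the NULL user_id values in b and requiring that count to be 0. The sketch below reproduces the behaviour with SQLite as a stand-in for Hive. The function `not_in_rows` and the short table names `a` and `b` are hypothetical.

```python
import sqlite3


def not_in_rows(b_user_ids):
    """Run the NOT IN query against table a, with b holding the given user_id values."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE a (user_id INTEGER, sex_id INTEGER)")
    conn.execute("CREATE TABLE b (user_id INTEGER)")
    conn.executemany(
        "INSERT INTO a VALUES (?, ?)",
        [(None, 0), (1, 1), (1, 0), (2, 1), (3, 0), (4, 1)],
    )
    conn.executemany("INSERT INTO b VALUES (?)", [(u,) for u in b_user_ids])
    rows = conn.execute(
        "SELECT user_id, sex_id FROM a "
        "WHERE user_id NOT IN (SELECT user_id FROM b) "
        "ORDER BY user_id"
    ).fetchall()
    conn.close()
    return rows


with_null = not_in_rows([None, 1, 2])   # same keys as data_semi_b
without_null = not_in_rows([1, 2])      # NULL key removed from b
print(with_null)
print(without_null)
```

With the NULL present the query returns nothing, as in the Hive run above. Once the NULL is removed from the subquery side, the rows for user_id 3 and 4 come back.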
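NOT IN cannot be implemented with LEFT SEMI JOIN, so a LEFT OUTER JOIN is used instead. The query below returns no rows on this data, the same as NOT IN. Note, however, that its ON clause requires b.user_id IS NULL while its WHERE clause requires b.user_id IS NOT NULL, so it can never return rows on any data. The usual anti-join pattern joins on a.user_id = b.user_id alone and keeps the rows WHERE b.user_id IS NULL; that pattern matches NOT IN only when the subquery column contains no NULL.
LEFT OUTER JOIN SQL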
SELECT
a.user_id
,a.sex_id
FROM data_semi_a AS a
LEFT OUTER JOIN data_semi_b AS b
ON a.user_id = b.user_id
AND b.user_id IS NULL
WHERE a.user_id IS NOT NULL
AND b.user_id IS NOT NULL
;
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO : 2020-04-12 23:04:47,896 Stage-1 map = 0%, reduce = 0%
INFO : 2020-04-12 23:04:55,176 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 2.91 sec
INFO : 2020-04-12 23:05:00,288 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.53 sec
INFO : 2020-04-12 23:05:06,449 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8.45 sec
INFO : MapReduce Total cumulative CPU time: 8 seconds 450 msec
INFO : Ended Job = job_1586423165261_0107
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 8.45 sec HDFS Read: 16358 HDFS Write: 87 SUCCESS
INFO : Total MapReduce CPU Time Spent: 8 seconds 450 msec
INFO : Completed executing command(queryId=hive_20200412230438_62ce326e-1b03-4c5a-a842-6816dc6feda3); Time taken: 28.871 seconds
INFO : OK
+------------+-----------+
| a.user_id | a.sex_id |
+------------+-----------+
+------------+-----------+
No rows selected (28.979 seconds)
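For comparison, the conventional anti-join pattern (LEFT OUTER JOIN on the key, then keep rows where the right-hand key IS NULL) can be tried on the same data. This is a sketch with SQLite standing in for Hive. The function `anti_join_rows` and its `skip_null_left` flag are hypothetical names used only for illustration.

```python
import sqlite3

ROWS_A = [(None, 0), (1, 1), (1, 0), (2, 1), (3, 0), (4, 1)]
ROWS_B = [(None, 0, 3), (1, 0, 12), (2, 1, 14)]


def anti_join_rows(skip_null_left=False):
    """LEFT OUTER JOIN anti-join: left rows that have no matching user_id in b."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data_semi_a (user_id INTEGER, sex_id INTEGER)")
    conn.execute("CREATE TABLE data_semi_b (user_id INTEGER, sex_id INTEGER, age INTEGER)")
    conn.executemany("INSERT INTO data_semi_a VALUES (?, ?)", ROWS_A)
    conn.executemany("INSERT INTO data_semi_b VALUES (?, ?, ?)", ROWS_B)
    sql = (
        "SELECT a.user_id, a.sex_id FROM data_semi_a AS a "
        "LEFT OUTER JOIN data_semi_b AS b ON a.user_id = b.user_id "
        "WHERE b.user_id IS NULL"
    )
    if skip_null_left:
        # Optionally drop left rows whose own key is NULL.
        sql += " AND a.user_id IS NOT NULL"
    rows = set(conn.execute(sql).fetchall())
    conn.close()
    return rows


plain = anti_join_rows()
no_null_left = anti_join_rows(skip_null_left=True)
print(plain)
print(no_null_left)
```

On this data the anti-join returns the unmatched left rows, whereas NOT IN returned nothing because of the NULL user_id in data_semi_b. The two forms are interchangeable only when the subquery column has no NULL values.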
Let's look at how these two SQL statements are executed.
EXPLAIN output for NOT IN:
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-4 is a root stage |
| Stage-1 depends on stages: Stage-4 |
| Stage-2 depends on stages: Stage-1 |
| Stage-0 depends on stages: Stage-2 |
| |
| STAGE PLANS: |
| Stage: Stage-4 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: b |
| filterExpr: user_id is null (type: boolean) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: user_id is null (type: boolean) |
| Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: count() |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| sort order: |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col0 (type: bigint) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(VALUE._col0) |
| mode: mergepartial |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: (_col0 = 0) (type: boolean) |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: 0 (type: bigint) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| sort order: |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| value expressions: user_id (type: bigint), sex_id (type: bigint) |
| TableScan |
| Reduce Output Operator |
| sort order: |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Semi Join 0 to 1 |
| keys: |
| 0 |
| 1 |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 6 Data size: 73 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: bigint) |
| Statistics: Num rows: 6 Data size: 73 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col1 (type: bigint) |
| TableScan |
| alias: b |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: user_id (type: bigint) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: bigint) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Outer Join0 to 1 |
| keys: |
| 0 _col0 (type: bigint) |
| 1 _col0 (type: bigint) |
| outputColumnNames: _col0, _col1, _col5 |
| Statistics: Num rows: 6 Data size: 80 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: _col5 is null (type: boolean) |
| Statistics: Num rows: 3 Data size: 40 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col0 (type: bigint), _col1 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 3 Data size: 40 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 3 Data size: 40 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
EXPLAIN output for LEFT OUTER JOIN:
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| filterExpr: user_id is not null (type: boolean) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: user_id is not null (type: boolean) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| value expressions: sex_id (type: bigint) |
| TableScan |
| alias: b |
| filterExpr: (user_id is null and user_id is not null) (type: boolean) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: (user_id is null and user_id is not null) (type: boolean) |
| Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Outer Join0 to 1 |
| keys: |
| 0 user_id (type: bigint) |
| 1 user_id (type: bigint) |
| outputColumnNames: _col0, _col1, _col5 |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: _col5 is not null (type: boolean) |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col0 (type: bigint), _col1 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
63 rows selected (0.143 seconds)
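Implementing multi-condition IN, i.e. EXISTS, with LEFT SEMI JOIN
Note: in the Hive version used here, IN works only on a single column. For multiple columns we need EXISTS.
The following IN SQL is invalid: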
SELECT
a.user_id
,a.sex_id
FROM data_semi_a AS a
WHERE (a.user_id, a.sex_id) IN (
SELECT
b.user_id
,b.sex_id
FROM data_semi_b AS b
)
;
Error: Error while compiling statement: FAILED: ParseException line 6:0 mismatched input 'SELECT' expecting ( near '(' in expression specification (state=42000,code=40000)
We need to use the following form instead:
SELECT
a.user_id
,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
ON a.user_id = b.user_id
AND a.sex_id = b.sex_id
;
or
SELECT
a.user_id
,a.sex_id
FROM data_semi_a AS a
WHERE EXISTS (
SELECT 1
FROM data_semi_b AS b
WHERE
a.user_id = b.user_id
AND a.sex_id = b.sex_id
)
;
Execution result
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO : 2020-04-12 23:46:16,157 Stage-1 map = 0%, reduce = 0%
INFO : 2020-04-12 23:46:24,375 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 3.04 sec
INFO : 2020-04-12 23:46:28,545 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.84 sec
INFO : 2020-04-12 23:46:35,732 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.85 sec
INFO : MapReduce Total cumulative CPU time: 7 seconds 850 msec
INFO : Ended Job = job_1586423165261_0110
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 7.85 sec HDFS Read: 17951 HDFS Write: 119 SUCCESS
INFO : Total MapReduce CPU Time Spent: 7 seconds 850 msec
INFO : Completed executing command(queryId=hive_20200412234607_8b6acba0-54bb-420f-80df-a5efd5dc9ae5); Time taken: 29.286 seconds
INFO : OK
+------------+-----------+
| a.user_id | a.sex_id |
+------------+-----------+
| 1 | 0 |
| 2 | 1 |
+------------+-----------+
2 rows selected (29.379 seconds)
Let's look at the EXPLAIN output of the two approaches:
LEFT SEMI JOIN
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: (user_id is not null and sex_id is not null) (type: boolean) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint), sex_id (type: bigint) |
| sort order: ++ |
| Map-reduce partition columns: user_id (type: bigint), sex_id (type: bigint) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| TableScan |
| alias: b |
| filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: (user_id is not null and sex_id is not null) (type: boolean) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: user_id (type: bigint), sex_id (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: _col0 (type: bigint), _col1 (type: bigint) |
| mode: hash |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint), _col1 (type: bigint) |
| sort order: ++ |
| Map-reduce partition columns: _col0 (type: bigint), _col1 (type: bigint) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Semi Join 0 to 1 |
| keys: |
| 0 user_id (type: bigint), sex_id (type: bigint) |
| 1 _col0 (type: bigint), _col1 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
64 rows selected (0.121 seconds)
EXISTS
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: (user_id is not null and sex_id is not null) (type: boolean) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint), sex_id (type: bigint) |
| sort order: ++ |
| Map-reduce partition columns: user_id (type: bigint), sex_id (type: bigint) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| TableScan |
| alias: b |
| filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: (user_id is not null and sex_id is not null) (type: boolean) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: user_id (type: bigint), sex_id (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: _col0 (type: bigint), _col1 (type: bigint) |
| mode: hash |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint), _col1 (type: bigint) |
| sort order: ++ |
| Map-reduce partition columns: _col0 (type: bigint), _col1 (type: bigint) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Semi Join 0 to 1 |
| keys: |
| 0 user_id (type: bigint), sex_id (type: bigint) |
| 1 _col0 (type: bigint), _col1 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
64 rows selected (0.147 seconds)
As you can see, the execution plans of the two approaches are identical.
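As a cross-check of the two-column case, the sketch below runs the EXISTS query on SQLite (a stand-in for Hive) and compares it with a plain-Python model of a semi join on the composite key (user_id, sex_id). The function names are hypothetical, and the model skips NULL keys to mirror the `is not null` filters in the plans above.

```python
import sqlite3

ROWS_A = [(None, 0), (1, 1), (1, 0), (2, 1), (3, 0), (4, 1)]
ROWS_B = [(None, 0, 3), (1, 0, 12), (2, 1, 14)]


def exists_two_columns():
    """Run the correlated EXISTS query from this article on SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data_semi_a (user_id INTEGER, sex_id INTEGER)")
    conn.execute("CREATE TABLE data_semi_b (user_id INTEGER, sex_id INTEGER, age INTEGER)")
    conn.executemany("INSERT INTO data_semi_a VALUES (?, ?)", ROWS_A)
    conn.executemany("INSERT INTO data_semi_b VALUES (?, ?, ?)", ROWS_B)
    rows = conn.execute(
        """
        SELECT a.user_id, a.sex_id
        FROM data_semi_a AS a
        WHERE EXISTS (
            SELECT 1 FROM data_semi_b AS b
            WHERE a.user_id = b.user_id AND a.sex_id = b.sex_id
        )
        ORDER BY a.user_id, a.sex_id
        """
    ).fetchall()
    conn.close()
    return rows


def semi_join_two_columns():
    """Model of LEFT SEMI JOIN on (user_id, sex_id); NULL keys never match."""
    keys = {(u, s) for u, s, _age in ROWS_B if u is not None and s is not None}
    return sorted(r for r in ROWS_A if None not in r and r in keys)


exists_rows = exists_two_columns()
semi_rows = semi_join_two_columns()
print(exists_rows)
print(semi_rows)
```

Both approaches return the same two rows as the Hive run above.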