Hive: LEFT SEMI JOIN / LEFT OUTER JOIN versus (IN / NOT IN) and (EXISTS / NOT EXISTS), an analysis

 

Reference article:  https://blog.csdn.net/happyrocking/article/details/79885071

 

In this article, we mainly focus on the LEFT SEMI JOIN and (IN / NOT IN), (EXISTS / NOT EXISTS) clause queries in Hive.

 

Basic knowledge of LEFT SEMI JOIN

First, we need to understand what LEFT SEMI JOIN is.

 

Features

1. A limitation of LEFT SEMI JOIN is that the right-hand table may only be referenced in the ON clause: filter conditions on it must be placed there, and referencing it in the WHERE clause, the SELECT clause, or anywhere else does not work.

2. LEFT SEMI JOIN passes only the join key of the right table to the map stage, so the final SELECT can only contain columns from the left table.

3. Because LEFT SEMI JOIN behaves like IN (keySet), when the right table contains duplicate keys Hive emits the left row once and skips the remaining matches, whereas a regular JOIN keeps traversing. With duplicate values in the right table, LEFT SEMI JOIN therefore produces one row where JOIN produces several, and LEFT SEMI JOIN also tends to perform better.

For example, when tables A and B are combined with JOIN versus LEFT SEMI JOIN and all fields are selected, the results differ as follows:

[Figure: comparison of the JOIN and LEFT SEMI JOIN results for tables A and B]

Note: the columns marked with a blue cross do not actually exist in the LEFT SEMI JOIN result, because the final SELECT may only contain columns from the left table.
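
The duplicate-key behaviour described in feature 3 can be sketched with two small tables. The table names demo_left and demo_right, their columns, and their rows are hypothetical and are not part of the original test data:

```sql
-- Hypothetical demo tables, used only for this illustration.
DROP TABLE IF EXISTS demo_left;
DROP TABLE IF EXISTS demo_right;

CREATE TABLE demo_left (id BIGINT, name STRING);
CREATE TABLE demo_right (id BIGINT, tag STRING);

INSERT INTO TABLE demo_left VALUES (1, 'x'), (2, 'y');
-- The key 1 appears twice in the right table.
INSERT INTO TABLE demo_right VALUES (1, 'p'), (1, 'q');

-- Inner JOIN: the left row with id = 1 matches two right rows,
-- so it appears twice in the result.
SELECT 
 l.id
 ,l.name
FROM demo_left AS l
JOIN demo_right AS r
 ON l.id = r.id
;

-- LEFT SEMI JOIN: the left row with id = 1 appears once,
-- no matter how many right rows share its key.
SELECT 
 l.id
 ,l.name
FROM demo_left AS l
LEFT SEMI JOIN demo_right AS r
 ON l.id = r.id
;
```

The row with id = 2 has no match in demo_right, so neither query returns it.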

 

 

 

 

 

In fact, you can think of LEFT SEMI JOIN as a join-based alternative to IN and EXISTS subqueries (the negated forms need a different rewrite, covered in the LEFT OUTER JOIN section).

Before Hive 0.13, (IN / NOT IN) and (EXISTS / NOT EXISTS) subqueries were not supported, so LEFT SEMI JOIN had to be used instead.


 

Building basic test data

DROP TABLE IF EXISTS data_semi_a;

DROP TABLE IF EXISTS data_semi_b;


CREATE TABLE IF NOT EXISTS data_semi_a 
(
 user_id BIGINT
 ,sex_id BIGINT 
);

CREATE TABLE IF NOT EXISTS data_semi_b
(
 user_id BIGINT
 ,sex_id BIGINT
 ,age BIGINT
);

INSERT INTO TABLE data_semi_a VALUES
(NULL ,0)
,(1, 1)
,(1, 0)
,(2, 1)
,(3, 0)
,(4, 1)
;

INSERT INTO TABLE data_semi_b VALUES
(NULL, 0, 3)
,(1, 0, 12)
,(2, 1, 14)
;

 

Test Data:

data_semi_a

+----------------------+---------------------+
| data_semi_a.user_id  | data_semi_a.sex_id  |
+----------------------+---------------------+
| NULL                 | 0                   |
| 1                    | 1                   |
| 1                    | 0                   |
| 2                    | 1                   |
| 3                    | 0                   |
| 4                    | 1                   |
+----------------------+---------------------+

 

data_semi_b

+----------------------+---------------------+------------------+
| data_semi_b.user_id  | data_semi_b.sex_id  | data_semi_b.age  |
+----------------------+---------------------+------------------+
| NULL                 | 0                   | 3                |
| 1                    | 0                   | 12               |
| 2                    | 1                   | 14               |
+----------------------+---------------------+------------------+

 

 

 

 

 

The single-condition LEFT SEMI JOIN is equivalent to (IN)

 

Note

LEFT SEMI JOIN is equivalent to IN; it works by passing only the join key of the right table.

Therefore, in A LEFT SEMI JOIN B, columns of B cannot appear in the SELECT list. For example, the following statement fails:

SELECT 
 a.user_id
 ,a.sex_id
 ,b.age
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON a.user_id = b.user_id
;

 

Error: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 4:1 Invalid table alias or column reference 'b': (possible column names are: user_id, sex_id) (state=42000,code=10004)
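
The same restriction applies to filters on the right table: they belong in the ON clause. The following is a minimal sketch against the test tables above; the age threshold of 12 is an arbitrary value chosen for illustration and does not come from the original article:

```sql
-- Right-table filter placed in the ON clause: allowed.
SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON a.user_id = b.user_id
 AND b.age > 12
;

-- The same filter written in the WHERE clause references b outside
-- the ON clause, so it is expected to be rejected at compile time:
--
-- SELECT a.user_id, a.sex_id
-- FROM data_semi_a AS a
-- LEFT SEMI JOIN data_semi_b AS b
--  ON a.user_id = b.user_id
-- WHERE b.age > 12;
```

On the test data only the data_semi_b row with user_id = 2 has age > 12, so the first query should keep only the data_semi_a row (2, 1).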

 

 

The single-condition LEFT SEMI JOIN is equivalent to (IN), for example the following SQL

SQL statement

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON a.user_id = b.user_id
;

IN SQL equivalent

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE a.user_id IN (
 SELECT b.user_id 
 FROM data_semi_b AS b
);

 

Let us compare the results of the two statements.

Execution results of LEFT SEMI JOIN

INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO  : 2020-04-12 10:53:09,591 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-04-12 10:53:17,849 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 3.12 sec
INFO  : 2020-04-12 10:53:22,975 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.84 sec
INFO  : 2020-04-12 10:53:29,141 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.77 sec
INFO  : MapReduce Total cumulative CPU time: 7 seconds 770 msec
INFO  : Ended Job = job_1586423165261_0087
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 7.77 sec   HDFS Read: 16677 HDFS Write: 135 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 7 seconds 770 msec
INFO  : Completed executing command(queryId=hive_20200412105301_9f643e42-c966-4140-8c72-330be6bdd73c); Time taken: 28.939 seconds
INFO  : OK
+------------+-----------+
| a.user_id  | a.sex_id  |
+------------+-----------+
| 1          | 0         |
| 1          | 1         |
| 2          | 1         |
+------------+-----------+
3 rows selected (29.073 seconds)

 

IN execution results

INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO  : 2020-04-12 10:37:26,143 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-04-12 10:37:33,376 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 2.71 sec
INFO  : 2020-04-12 10:37:39,510 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.6 sec
INFO  : 2020-04-12 10:37:44,680 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.41 sec
INFO  : MapReduce Total cumulative CPU time: 7 seconds 410 msec
INFO  : Ended Job = job_1586423165261_0085
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 7.41 sec   HDFS Read: 16726 HDFS Write: 135 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 7 seconds 410 msec
INFO  : Completed executing command(queryId=hive_20200412103717_2ab604da-f301-4fee-b9bd-9c22ad6e65a1); Time taken: 27.796 seconds
INFO  : OK
+------------+-----------+
| a.user_id  | a.sex_id  |
+------------+-----------+
| 1          | 0         |
| 1          | 1         |
| 2          | 1         |
+------------+-----------+
3 rows selected (27.902 seconds)

 

Now let us look at the EXPLAIN results of the two statements:
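
Plans like these can be obtained by prefixing each statement with EXPLAIN. The original does not show the exact commands, so the following is a sketch under that assumption:

```sql
-- EXPLAIN prints the execution plan instead of running the query.
EXPLAIN
SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON a.user_id = b.user_id
;

EXPLAIN
SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE a.user_id IN (
 SELECT b.user_id 
 FROM data_semi_b AS b
);
```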

EXPLAIN result of LEFT SEMI JOIN:

INFO  : Starting task [Stage-3:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200412105949_53e51917-8c04-4f6f-b9fd-32ab71a2888b); Time taken: 0.005 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             filterExpr: user_id is not null (type: boolean) |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is not null (type: boolean) |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: user_id (type: bigint) |
|                 Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|                 value expressions: sex_id (type: bigint) |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: user_id is not null (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is not null (type: boolean) |
|               Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: user_id (type: bigint) |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   keys: _col0 (type: bigint)       |
|                   mode: hash                       |
|                   outputColumnNames: _col0         |
|                   Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     key expressions: _col0 (type: bigint) |
|                     sort order: +                  |
|                     Map-reduce partition columns: _col0 (type: bigint) |
|                     Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Semi Join 0 to 1               |
|           keys:                                    |
|             0 user_id (type: bigint)               |
|             1 _col0 (type: bigint)                 |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
65 rows selected (0.136 seconds)

EXPLAIN results for IN:

INFO  : Starting task [Stage-3:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200412110229_81d9cf79-50e2-46f1-8152-a399038861c7); Time taken: 0.005 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             filterExpr: user_id is not null (type: boolean) |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is not null (type: boolean) |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: user_id (type: bigint) |
|                 Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|                 value expressions: sex_id (type: bigint) |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: user_id is not null (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is not null (type: boolean) |
|               Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: user_id (type: bigint) |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   keys: _col0 (type: bigint)       |
|                   mode: hash                       |
|                   outputColumnNames: _col0         |
|                   Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     key expressions: _col0 (type: bigint) |
|                     sort order: +                  |
|                     Map-reduce partition columns: _col0 (type: bigint) |
|                     Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Semi Join 0 to 1               |
|           keys:                                    |
|             0 user_id (type: bigint)               |
|             1 _col0 (type: bigint)                 |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
65 rows selected (0.127 seconds)

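You can see that the two statements are identical in both execution results and EXPLAIN output.

In fact, Hive executes IN as a LEFT SEMI JOIN internally. Both plans also filter user_id IS NOT NULL on each side, so rows with a NULL key never match.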
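
EXISTS with a correlated equality predicate expresses the same semi join. The statement below was not part of the original test run; it is a sketch that, on the test data, should return the same three rows as the two statements above:

```sql
-- Single-condition EXISTS, equivalent in result to the
-- LEFT SEMI JOIN and IN statements on a.user_id = b.user_id.
SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE EXISTS (
 SELECT 1
 FROM data_semi_b AS b
 WHERE a.user_id = b.user_id
);
```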

 

 

 

 

Implementing NOT IN with LEFT OUTER JOIN

 

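Note that LEFT SEMI JOIN cannot implement NOT IN.

Key point: the Hive version used here does not support non-equality join conditions. Even where such conditions are accepted, a != condition would match every left row that differs from at least one right row, which is not what NOT IN means.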

 

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON (a.user_id != b.user_id)
;

Error: Error while compiling statement: FAILED: SemanticException [Error 10017]: Line 6:4 Both left and right aliases encountered in JOIN 'user_id' (state=42000,code=10017)

 

 

The correct way to write it

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE a.user_id NOT IN (
 SELECT b.user_id 
 FROM data_semi_b AS b
);

Execution result of NOT IN

INFO  : Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1
INFO  : 2020-04-12 23:02:26,751 Stage-2 map = 0%,  reduce = 0%
INFO  : 2020-04-12 23:02:33,938 Stage-2 map = 50%,  reduce = 0%, Cumulative CPU 1.76 sec
INFO  : 2020-04-12 23:02:39,172 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 3.35 sec
INFO  : 2020-04-12 23:02:47,688 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 7.88 sec
INFO  : MapReduce Total cumulative CPU time: 7 seconds 880 msec
INFO  : Ended Job = job_1586423165261_0106
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-4: Map: 1  Reduce: 1   Cumulative CPU: 6.49 sec   HDFS Read: 8372 HDFS Write: 96 SUCCESS
INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 5.65 sec   HDFS Read: 11974 HDFS Write: 96 SUCCESS
INFO  : Stage-Stage-2: Map: 2  Reduce: 1   Cumulative CPU: 7.88 sec   HDFS Read: 14131 HDFS Write: 87 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 20 seconds 20 msec
INFO  : Completed executing command(queryId=hive_20200412230117_fef818dc-e433-4880-9c8d-f6a9d28a08a9); Time taken: 91.471 seconds
INFO  : OK
+------------+-----------+
| a.user_id  | a.sex_id  |
+------------+-----------+
+------------+-----------+
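No rows selected (91.674 seconds)

The result is empty because data_semi_b.user_id contains a NULL. For a row that has no match, NOT IN compares against that NULL, the predicate evaluates to NULL rather than true, and the row is filtered out. Stage-4 in the plan for this query counts the NULL keys in b and only lets rows through when that count is 0.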
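
With NULL keys excluded from the subquery, NOT IN can return rows again. The sketch below assumes that ignoring NULL keys in data_semi_b is acceptable for the use case; it was not run in the original article:

```sql
-- NOT IN with NULL keys removed from the subquery result.
SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE a.user_id NOT IN (
 SELECT b.user_id 
 FROM data_semi_b AS b
 WHERE b.user_id IS NOT NULL
);
```

On the test data the subquery then yields 1 and 2, so the rows (3, 0) and (4, 1) should be returned. The data_semi_a row with a NULL user_id is still dropped, because NULL NOT IN (1, 2) evaluates to NULL.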

 

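Because NOT IN cannot be implemented with LEFT SEMI JOIN, we use LEFT OUTER JOIN instead. Note that the statement below puts b.user_id IS NULL in the ON clause and b.user_id IS NOT NULL in the WHERE clause, so it can never return a row; it agrees with the NOT IN result here only because that result is empty for this test data.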

SQL equivalent of LEFT OUTER JOIN

SELECT 
  a.user_id
  ,a.sex_id 
FROM data_semi_a AS a
LEFT OUTER JOIN data_semi_b AS b
 ON a.user_id = b.user_id
 AND b.user_id IS NULL
WHERE a.user_id IS NOT NULL
 AND b.user_id IS NOT NULL
;

Execution result of LEFT OUTER JOIN

INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO  : 2020-04-12 23:04:47,896 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-04-12 23:04:55,176 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 2.91 sec
INFO  : 2020-04-12 23:05:00,288 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.53 sec
INFO  : 2020-04-12 23:05:06,449 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.45 sec
INFO  : MapReduce Total cumulative CPU time: 8 seconds 450 msec
INFO  : Ended Job = job_1586423165261_0107
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 8.45 sec   HDFS Read: 16358 HDFS Write: 87 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 8 seconds 450 msec
INFO  : Completed executing command(queryId=hive_20200412230438_62ce326e-1b03-4c5a-a842-6816dc6feda3); Time taken: 28.871 seconds
INFO  : OK
+------------+-----------+
| a.user_id  | a.sex_id  |
+------------+-----------+
+------------+-----------+
No rows selected (28.979 seconds)
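
A more common way to write an anti-join with LEFT OUTER JOIN keeps only the equality in the ON clause and tests for the unmatched side in the WHERE clause. The sketch below is not from the original article. It has NOT EXISTS semantics, which differ from NOT IN when NULL keys are present:

```sql
-- Anti-join: keep left rows that found no partner in b.
SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT OUTER JOIN data_semi_b AS b
 ON a.user_id = b.user_id
WHERE b.user_id IS NULL
;

-- The same result expressed with NOT EXISTS.
SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE NOT EXISTS (
 SELECT 1
 FROM data_semi_b AS b
 WHERE a.user_id = b.user_id
);
```

On the test data both statements should return (NULL, 0), (3, 0) and (4, 1): a NULL key never equals anything, so the NULL row of data_semi_a counts as unmatched. The NOT IN statement above, by contrast, returns nothing because of the NULL in data_semi_b.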

 

Let's take a look at the execution plans of these two statements.

EXPLAIN result of NOT IN:

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-4 is a root stage                          |
|   Stage-1 depends on stages: Stage-4               |
|   Stage-2 depends on stages: Stage-1               |
|   Stage-0 depends on stages: Stage-2               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-4                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: user_id is null (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is null (type: boolean) |
|               Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   aggregations: count()            |
|                   mode: hash                       |
|                   outputColumnNames: _col0         |
|                   Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     sort order:                    |
|                     Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|                     value expressions: _col0 (type: bigint) |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: count(VALUE._col0)         |
|           mode: mergepartial                       |
|           outputColumnNames: _col0                 |
|           Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|           Filter Operator                          |
|             predicate: (_col0 = 0) (type: boolean) |
|             Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 keys: 0 (type: bigint)             |
|                 mode: hash                         |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|                 File Output Operator               |
|                   compressed: false                |
|                   table:                           |
|                       input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                       output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                       serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Reduce Output Operator                 |
|               sort order:                          |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               value expressions: user_id (type: bigint), sex_id (type: bigint) |
|           TableScan                                |
|             Reduce Output Operator                 |
|               sort order:                          |
|               Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Semi Join 0 to 1               |
|           keys:                                    |
|             0                                      |
|             1                                      |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 6 Data size: 73 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-2                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             Reduce Output Operator                 |
|               key expressions: _col0 (type: bigint) |
|               sort order: +                        |
|               Map-reduce partition columns: _col0 (type: bigint) |
|               Statistics: Num rows: 6 Data size: 73 Basic stats: COMPLETE Column stats: NONE |
|               value expressions: _col1 (type: bigint) |
|           TableScan                                |
|             alias: b                               |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: user_id (type: bigint)  |
|               outputColumnNames: _col0             |
|               Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: _col0 (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: _col0 (type: bigint) |
|                 Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Outer Join0 to 1               |
|           keys:                                    |
|             0 _col0 (type: bigint)                 |
|             1 _col0 (type: bigint)                 |
|           outputColumnNames: _col0, _col1, _col5   |
|           Statistics: Num rows: 6 Data size: 80 Basic stats: COMPLETE Column stats: NONE |
|           Filter Operator                          |
|             predicate: _col5 is null (type: boolean) |
|             Statistics: Num rows: 3 Data size: 40 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: _col0 (type: bigint), _col1 (type: bigint) |
|               outputColumnNames: _col0, _col1      |
|               Statistics: Num rows: 3 Data size: 40 Basic stats: COMPLETE Column stats: NONE |
|               File Output Operator                 |
|                 compressed: false                  |
|                 Statistics: Num rows: 3 Data size: 40 Basic stats: COMPLETE Column stats: NONE |
|                 table:                             |
|                     input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                     output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+

 

EXPLAIN result of LEFT OUTER JOIN:

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             filterExpr: user_id is not null (type: boolean) |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is not null (type: boolean) |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: user_id (type: bigint) |
|                 Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|                 value expressions: sex_id (type: bigint) |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: (user_id is null and user_id is not null) (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (user_id is null and user_id is not null) (type: boolean) |
|               Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: user_id (type: bigint) |
|                 Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Outer Join0 to 1               |
|           keys:                                    |
|             0 user_id (type: bigint)               |
|             1 user_id (type: bigint)               |
|           outputColumnNames: _col0, _col1, _col5   |
|           Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|           Filter Operator                          |
|             predicate: _col5 is not null (type: boolean) |
|             Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: _col0 (type: bigint), _col1 (type: bigint) |
|               outputColumnNames: _col0, _col1      |
|               Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|               File Output Operator                 |
|                 compressed: false                  |
|                 Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|                 table:                             |
|                     input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                     output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
63 rows selected (0.143 seconds)

 

 

 

 

 

 

Using LEFT SEMI JOIN to implement a multi-column IN (that is, EXISTS)

 

Note: In Hive, an IN subquery can compare only a single column. To match on multiple columns, use EXISTS or LEFT SEMI JOIN instead.

 

The following multi-column IN query is invalid and fails to compile:

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE (a.user_id, a.sex_id) IN (
 SELECT  
 b.user_id
 ,b.sex_id
 FROM data_semi_b AS b
)
;

Error: Error while compiling statement: FAILED: ParseException line 6:0 mismatched input 'SELECT' expecting ( near '(' in expression specification (state=42000,code=40000)

 

 

Instead, we need to use one of the following forms:

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON a.user_id = b.user_id
 AND a.sex_id = b.sex_id
;

or

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE EXISTS (
 SELECT 1
 FROM data_semi_b AS b
 WHERE 
  a.user_id = b.user_id
  AND a.sex_id = b.sex_id
)
;

Execution result:

INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO  : 2020-04-12 23:46:16,157 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-04-12 23:46:24,375 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 3.04 sec
INFO  : 2020-04-12 23:46:28,545 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.84 sec
INFO  : 2020-04-12 23:46:35,732 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.85 sec
INFO  : MapReduce Total cumulative CPU time: 7 seconds 850 msec
INFO  : Ended Job = job_1586423165261_0110
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 7.85 sec   HDFS Read: 17951 HDFS Write: 119 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 7 seconds 850 msec
INFO  : Completed executing command(queryId=hive_20200412234607_8b6acba0-54bb-420f-80df-a5efd5dc9ae5); Time taken: 29.286 seconds
INFO  : OK
+------------+-----------+
| a.user_id  | a.sex_id  |
+------------+-----------+
| 1          | 0         |
| 2          | 1         |
+------------+-----------+
2 rows selected (29.379 seconds)

 

 

Let's compare the EXPLAIN output of the LEFT SEMI JOIN query and the EXISTS query:

LEFT SEMI JOIN

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (user_id is not null and sex_id is not null) (type: boolean) |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint), sex_id (type: bigint) |
|                 sort order: ++                     |
|                 Map-reduce partition columns: user_id (type: bigint), sex_id (type: bigint) |
|                 Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (user_id is not null and sex_id is not null) (type: boolean) |
|               Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: user_id (type: bigint), sex_id (type: bigint) |
|                 outputColumnNames: _col0, _col1    |
|                 Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   keys: _col0 (type: bigint), _col1 (type: bigint) |
|                   mode: hash                       |
|                   outputColumnNames: _col0, _col1  |
|                   Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     key expressions: _col0 (type: bigint), _col1 (type: bigint) |
|                     sort order: ++                 |
|                     Map-reduce partition columns: _col0 (type: bigint), _col1 (type: bigint) |
|                     Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Semi Join 0 to 1               |
|           keys:                                    |
|             0 user_id (type: bigint), sex_id (type: bigint) |
|             1 _col0 (type: bigint), _col1 (type: bigint) |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
64 rows selected (0.121 seconds)
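
Reading the plan above: table b is first filtered to rows whose keys are not null, then deduplicated on (user_id, sex_id) by the Group By Operator, and only then joined to a. For intuition, roughly the same result could be written by hand as an inner join against a deduplicated subquery. This is a sketch of what the plan does, not the query Hive generates internally, and the alias b_keys is made up for illustration:

```sql
-- Hand-written approximation of the LEFT SEMI JOIN plan.
-- b_keys is a hypothetical alias introduced only for this sketch.
SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
JOIN (
 -- Mirrors the Filter Operator + Group By Operator on table b:
 -- drop NULL keys, then keep one row per (user_id, sex_id) pair.
 SELECT DISTINCT
  user_id
  ,sex_id
 FROM data_semi_b
 WHERE user_id IS NOT NULL
  AND sex_id IS NOT NULL
) AS b_keys
 ON a.user_id = b_keys.user_id
 AND a.sex_id = b_keys.sex_id
;
```

Because b_keys holds at most one row per key pair, each row of a is returned at most once, which is the same guarantee LEFT SEMI JOIN gives when the right table contains duplicates.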

 

EXISTS 

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (user_id is not null and sex_id is not null) (type: boolean) |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint), sex_id (type: bigint) |
|                 sort order: ++                     |
|                 Map-reduce partition columns: user_id (type: bigint), sex_id (type: bigint) |
|                 Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (user_id is not null and sex_id is not null) (type: boolean) |
|               Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: user_id (type: bigint), sex_id (type: bigint) |
|                 outputColumnNames: _col0, _col1    |
|                 Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   keys: _col0 (type: bigint), _col1 (type: bigint) |
|                   mode: hash                       |
|                   outputColumnNames: _col0, _col1  |
|                   Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     key expressions: _col0 (type: bigint), _col1 (type: bigint) |
|                     sort order: ++                 |
|                     Map-reduce partition columns: _col0 (type: bigint), _col1 (type: bigint) |
|                     Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Semi Join 0 to 1               |
|           keys:                                    |
|             0 user_id (type: bigint), sex_id (type: bigint) |
|             1 _col0 (type: bigint), _col1 (type: bigint) |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
64 rows selected (0.147 seconds)

 

As you can see, the two queries produce exactly the same execution plan: Hive compiles the multi-column EXISTS into a Left Semi Join.
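
The negated case follows the same pattern. Since NOT IN is likewise limited to a single column, a multi-column "not in" can be sketched with NOT EXISTS, or with a LEFT OUTER JOIN that keeps only the unmatched rows. The queries below are an illustration that assumes Hive 0.13 or later (where EXISTS / NOT EXISTS subqueries are supported); they were not part of the original test run, so no output is shown:

```sql
-- Rows of a whose (user_id, sex_id) pair has no match in b.
SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE NOT EXISTS (
 SELECT 1
 FROM data_semi_b AS b
 WHERE 
  a.user_id = b.user_id
  AND a.sex_id = b.sex_id
)
;

-- Equivalent form with LEFT OUTER JOIN: unmatched rows of a
-- come back with NULL in every column of b.
SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT OUTER JOIN data_semi_b AS b
 ON a.user_id = b.user_id
 AND a.sex_id = b.sex_id
WHERE b.user_id IS NULL
;
```

In the second query, b.user_id can only be NULL when no row of b satisfied the ON clause, because a matched row must have b.user_id equal to a non-null a.user_id.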

 

 

 

 

Origin blog.csdn.net/u010003835/article/details/105476658