Hive: LEFT SEMI JOIN / LEFT OUTER JOIN versus (IN / NOT IN) and (EXISTS / NOT EXISTS)

 

Reference article: https://blog.csdn.net/happyrocking/article/details/79885071

 

This article focuses on Hive's LEFT SEMI JOIN and on queries that use the (IN / NOT IN) and (EXISTS / NOT EXISTS) clauses.

 

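LEFT SEMI JOIN basics

First, we need to understand what LEFT SEMI JOIN is.

 

Characteristics

1. The restriction of LEFT SEMI JOIN is that the right-hand table of the JOIN clause can only be filtered in the ON clause. Referencing it in the WHERE clause, the SELECT clause, or anywhere else does not work.

2. LEFT SEMI JOIN passes only the join key of the right table to the map stage, so the final SELECT list may contain only columns from the left table.

3. LEFT SEMI JOIN has IN (keySet) semantics: when the right table contains duplicate records for a key, the left-table row is emitted once and the remaining matches are skipped, whereas a regular JOIN keeps traversing them. As a result, when the right table has duplicate values, LEFT SEMI JOIN produces one row where JOIN produces several, which also tends to make LEFT SEMI JOIN faster.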
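
The duplicate handling described in point 3 can be illustrated with a minimal sketch. The tables demo_left and demo_right are hypothetical and are not part of this article's test data; they exist only to make the difference visible.

```sql
-- Hypothetical demo tables (assumption: not part of the article's data set).
DROP TABLE IF EXISTS demo_left;
DROP TABLE IF EXISTS demo_right;

CREATE TABLE demo_left (k BIGINT);
CREATE TABLE demo_right (k BIGINT);

INSERT INTO TABLE demo_left VALUES (1), (2);
-- The key 1 is duplicated on the right side.
INSERT INTO TABLE demo_right VALUES (1), (1), (3);

-- Inner join: the left row k = 1 matches two right rows, so it is returned twice.
SELECT l.k
FROM demo_left AS l
JOIN demo_right AS r
 ON l.k = r.k
;

-- Left semi join: the left row k = 1 is returned once,
-- no matter how many right rows share the key.
SELECT l.k
FROM demo_left AS l
LEFT SEMI JOIN demo_right AS r
 ON l.k = r.k
;
```

With this data the inner join returns k = 1 twice, while the semi join returns it once. Neither query returns k = 2, which has no match on the right side.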

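For example, take tables A and B below, combine them with JOIN or with LEFT SEMI JOIN, and select all fields. The results are as follows.

Note: the columns marked with a blue cross in the figure do not actually exist in the LEFT SEMI JOIN result, because the final SELECT may contain only left-table columns.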

 

 

 

 

 

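In fact, LEFT SEMI JOIN can be regarded as an alternative to the subquery forms (IN / NOT IN) and (EXISTS / NOT EXISTS). As discussed later in this article, the negated forms cannot be expressed with LEFT SEMI JOIN alone.

Before Hive 0.13, (IN / NOT IN) and (EXISTS / NOT EXISTS) subqueries were not supported. In those versions, LEFT SEMI JOIN has to be used instead.

The Hive documentation describes this (the screenshot is not included here).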

 

Creating the basic test data

DROP TABLE IF EXISTS data_semi_a;

DROP TABLE IF EXISTS data_semi_b;


CREATE TABLE IF NOT EXISTS data_semi_a 
(
 user_id BIGINT
 ,sex_id BIGINT 
);

CREATE TABLE IF NOT EXISTS data_semi_b
(
 user_id BIGINT
 ,sex_id BIGINT
 ,age BIGINT
);

INSERT INTO TABLE data_semi_a VALUES
(NULL ,0)
,(1, 1)
,(1, 0)
,(2, 1)
,(3, 0)
,(4, 1)
;

INSERT INTO TABLE data_semi_b VALUES
(NULL, 0, 3)
,(1, 0, 12)
,(2, 1, 14)
;

 

Test data:

data_semi_a

+----------------------+---------------------+
| data_semi_a.user_id  | data_semi_a.sex_id  |
+----------------------+---------------------+
| NULL                 | 0                   |
| 1                    | 1                   |
| 1                    | 0                   |
| 2                    | 1                   |
| 3                    | 0                   |
| 4                    | 1                   |
+----------------------+---------------------+

 

data_semi_b

+----------------------+---------------------+------------------+
| data_semi_b.user_id  | data_semi_b.sex_id  | data_semi_b.age  |
+----------------------+---------------------+------------------+
| NULL                 | 0                   | 3                |
| 1                    | 0                   | 12               |
| 2                    | 1                   | 14               |
+----------------------+---------------------+------------------+

 

 

 

 

 

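Single-condition LEFT SEMI JOIN is equivalent to IN

 

Note

LEFT SEMI JOIN is equivalent to IN because only the join key of the right table is passed to the join.

Therefore, in A LEFT SEMI JOIN B, the columns of B cannot appear in the SELECT list. The following statement fails: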

SELECT 
 a.user_id
 ,a.sex_id
 ,b.age
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON a.user_id = b.user_id
;

 

Error: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 4:1 Invalid table alias or column reference 'b': (possible column names are: user_id, sex_id) (state=42000,code=10004)
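
When a right-table column such as b.age is needed only as a filter, it can be placed in the ON clause, which is the one place where LEFT SEMI JOIN accepts it. The following is a minimal sketch against this article's test tables; the threshold 12 is an arbitrary value chosen for illustration.

```sql
-- Filter on the right table inside the ON clause.
-- Only data_semi_b rows with age > 12 take part in the semi join.
SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON a.user_id = b.user_id
 AND b.age > 12
;
```

With the test data, only the data_semi_b row (2, 1, 14) passes the filter, so the query should return the single data_semi_a row with user_id = 2. If b.age has to appear in the output, a regular JOIN is needed instead of LEFT SEMI JOIN.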

 

 

A single-condition LEFT SEMI JOIN is equivalent to IN, as the following SQL shows.

LEFT SEMI JOIN statement

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON a.user_id = b.user_id
;

Equivalent SQL using IN

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE a.user_id IN (
 SELECT b.user_id 
 FROM data_semi_b AS b
);
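
The same single-condition query can also be written with a correlated EXISTS subquery. This is a sketch of the equivalent form and assumes Hive 0.13 or later, where such subqueries are supported.

```sql
-- EXISTS form of the single-condition semi join.
-- The subquery only tests for a matching row, so it selects a constant.
SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE EXISTS (
 SELECT 1
 FROM data_semi_b AS b
 WHERE a.user_id = b.user_id
)
;
```

On the test data this should return the same rows as the LEFT SEMI JOIN and IN statements: the data_semi_a rows whose user_id is 1 or 2.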

 

Compare the results of the two statements.

LEFT SEMI JOIN execution result

INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO  : 2020-04-12 10:53:09,591 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-04-12 10:53:17,849 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 3.12 sec
INFO  : 2020-04-12 10:53:22,975 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.84 sec
INFO  : 2020-04-12 10:53:29,141 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.77 sec
INFO  : MapReduce Total cumulative CPU time: 7 seconds 770 msec
INFO  : Ended Job = job_1586423165261_0087
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 7.77 sec   HDFS Read: 16677 HDFS Write: 135 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 7 seconds 770 msec
INFO  : Completed executing command(queryId=hive_20200412105301_9f643e42-c966-4140-8c72-330be6bdd73c); Time taken: 28.939 seconds
INFO  : OK
+------------+-----------+
| a.user_id  | a.sex_id  |
+------------+-----------+
| 1          | 0         |
| 1          | 1         |
| 2          | 1         |
+------------+-----------+
3 rows selected (29.073 seconds)

 

IN execution result

INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO  : 2020-04-12 10:37:26,143 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-04-12 10:37:33,376 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 2.71 sec
INFO  : 2020-04-12 10:37:39,510 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.6 sec
INFO  : 2020-04-12 10:37:44,680 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.41 sec
INFO  : MapReduce Total cumulative CPU time: 7 seconds 410 msec
INFO  : Ended Job = job_1586423165261_0085
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 7.41 sec   HDFS Read: 16726 HDFS Write: 135 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 7 seconds 410 msec
INFO  : Completed executing command(queryId=hive_20200412103717_2ab604da-f301-4fee-b9bd-9c22ad6e65a1); Time taken: 27.796 seconds
INFO  : OK
+------------+-----------+
| a.user_id  | a.sex_id  |
+------------+-----------+
| 1          | 0         |
| 1          | 1         |
| 2          | 1         |
+------------+-----------+
3 rows selected (27.902 seconds)

 

Next, look at the EXPLAIN output of the two statements.

EXPLAIN output for LEFT SEMI JOIN:

INFO  : Starting task [Stage-3:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200412105949_53e51917-8c04-4f6f-b9fd-32ab71a2888b); Time taken: 0.005 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             filterExpr: user_id is not null (type: boolean) |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is not null (type: boolean) |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: user_id (type: bigint) |
|                 Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|                 value expressions: sex_id (type: bigint) |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: user_id is not null (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is not null (type: boolean) |
|               Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: user_id (type: bigint) |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   keys: _col0 (type: bigint)       |
|                   mode: hash                       |
|                   outputColumnNames: _col0         |
|                   Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     key expressions: _col0 (type: bigint) |
|                     sort order: +                  |
|                     Map-reduce partition columns: _col0 (type: bigint) |
|                     Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Semi Join 0 to 1               |
|           keys:                                    |
|             0 user_id (type: bigint)               |
|             1 _col0 (type: bigint)                 |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
65 rows selected (0.136 seconds)

EXPLAIN output for IN:

INFO  : Starting task [Stage-3:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200412110229_81d9cf79-50e2-46f1-8152-a399038861c7); Time taken: 0.005 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             filterExpr: user_id is not null (type: boolean) |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is not null (type: boolean) |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: user_id (type: bigint) |
|                 Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|                 value expressions: sex_id (type: bigint) |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: user_id is not null (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is not null (type: boolean) |
|               Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: user_id (type: bigint) |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   keys: _col0 (type: bigint)       |
|                   mode: hash                       |
|                   outputColumnNames: _col0         |
|                   Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     key expressions: _col0 (type: bigint) |
|                     sort order: +                  |
|                     Map-reduce partition columns: _col0 (type: bigint) |
|                     Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Semi Join 0 to 1               |
|           keys:                                    |
|             0 user_id (type: bigint)               |
|             1 _col0 (type: bigint)                 |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
65 rows selected (0.127 seconds)

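Both the execution results and the EXPLAIN plans of the two statements are identical.

In fact, the IN plan contains a Left Semi Join operator: IN is executed internally as a LEFT SEMI JOIN.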
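
The plans above can be reproduced by prefixing each statement with EXPLAIN. A short sketch follows; the exact operator tree and statistics depend on the Hive version and execution engine, so the output may differ from what is shown in this article.

```sql
-- Show the plan of the LEFT SEMI JOIN statement without executing it.
EXPLAIN
SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON a.user_id = b.user_id
;

-- Show the plan of the equivalent IN statement for comparison.
EXPLAIN
SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE a.user_id IN (
 SELECT b.user_id 
 FROM data_semi_b AS b
)
;
```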

 

 

 

 

Implementing NOT IN with LEFT OUTER JOIN

 

Note that LEFT SEMI JOIN cannot express NOT IN.

Remember: the Hive version used here does not support non-equality join conditions!

 

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON (a.user_id != b.user_id)
;

Error: Error while compiling statement: FAILED: SemanticException [Error 10017]: Line 6:4 Both left and right aliases encountered in JOIN 'user_id' (state=42000,code=10017)

 

 

Correct statement

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE a.user_id NOT IN (
 SELECT b.user_id 
 FROM data_semi_b AS b
);
INFO  : Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1
INFO  : 2020-04-12 23:02:26,751 Stage-2 map = 0%,  reduce = 0%
INFO  : 2020-04-12 23:02:33,938 Stage-2 map = 50%,  reduce = 0%, Cumulative CPU 1.76 sec
INFO  : 2020-04-12 23:02:39,172 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 3.35 sec
INFO  : 2020-04-12 23:02:47,688 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 7.88 sec
INFO  : MapReduce Total cumulative CPU time: 7 seconds 880 msec
INFO  : Ended Job = job_1586423165261_0106
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-4: Map: 1  Reduce: 1   Cumulative CPU: 6.49 sec   HDFS Read: 8372 HDFS Write: 96 SUCCESS
INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 5.65 sec   HDFS Read: 11974 HDFS Write: 96 SUCCESS
INFO  : Stage-Stage-2: Map: 2  Reduce: 1   Cumulative CPU: 7.88 sec   HDFS Read: 14131 HDFS Write: 87 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 20 seconds 20 msec
INFO  : Completed executing command(queryId=hive_20200412230117_fef818dc-e433-4880-9c8d-f6a9d28a08a9); Time taken: 91.471 seconds
INFO  : OK
+------------+-----------+
| a.user_id  | a.sex_id  |
+------------+-----------+
+------------+-----------+
No rows selected (91.674 seconds)
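
The empty result follows from SQL NULL semantics: data_semi_b contains a row whose user_id is NULL, and a NOT IN comparison against a set that contains NULL never evaluates to true. Stage-4 of the plan shown later in this article counts exactly those NULL rows. The following sketch removes NULLs on both sides to show the effect; the extra IS NOT NULL predicates are additions for illustration and are not part of the original query.

```sql
-- NOT IN with NULLs excluded on both sides.
-- The subquery now yields only the keys 1 and 2.
SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE a.user_id IS NOT NULL
 AND a.user_id NOT IN (
  SELECT b.user_id 
  FROM data_semi_b AS b
  WHERE b.user_id IS NOT NULL
 )
;
```

With the test data this should return the data_semi_a rows whose user_id is 3 or 4.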

 

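Note that NOT IN cannot be implemented with LEFT SEMI JOIN. LEFT OUTER JOIN has to be used instead. Be aware that the statement below matches the NOT IN result here only because that result is empty: its ON clause requires b.user_id IS NULL while its WHERE clause requires b.user_id IS NOT NULL, so it returns no rows for any input.

SQL using LEFT OUTER JOIN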

SELECT 
  a.user_id
  ,a.sex_id 
FROM data_semi_a AS a
LEFT OUTER JOIN data_semi_b AS b
 ON a.user_id = b.user_id
 AND b.user_id IS NULL
WHERE a.user_id IS NOT NULL
 AND b.user_id IS NOT NULL
;
INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO  : 2020-04-12 23:04:47,896 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-04-12 23:04:55,176 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 2.91 sec
INFO  : 2020-04-12 23:05:00,288 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.53 sec
INFO  : 2020-04-12 23:05:06,449 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.45 sec
INFO  : MapReduce Total cumulative CPU time: 8 seconds 450 msec
INFO  : Ended Job = job_1586423165261_0107
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 8.45 sec   HDFS Read: 16358 HDFS Write: 87 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 8 seconds 450 msec
INFO  : Completed executing command(queryId=hive_20200412230438_62ce326e-1b03-4c5a-a842-6816dc6feda3); Time taken: 28.871 seconds
INFO  : OK
+------------+-----------+
| a.user_id  | a.sex_id  |
+------------+-----------+
+------------+-----------+
No rows selected (28.979 seconds)
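
For comparison, a commonly used anti-join pattern keeps only the equality in the ON clause and tests for a missing match in the WHERE clause. The sketch below shows that pattern together with its NOT EXISTS counterpart. Both follow NOT EXISTS semantics, which differ from NOT IN as soon as NULL keys are involved, so they are not drop-in replacements for the NOT IN statement above.

```sql
-- Anti-join: keep the left rows that have no matching right row.
SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT OUTER JOIN data_semi_b AS b
 ON a.user_id = b.user_id
WHERE b.user_id IS NULL
;

-- The same logic as a correlated NOT EXISTS subquery (Hive 0.13 or later).
SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE NOT EXISTS (
 SELECT 1
 FROM data_semi_b AS b
 WHERE a.user_id = b.user_id
)
;
```

On the test data both statements should return the data_semi_a rows with user_id NULL, 3, and 4. The NULL row is kept because NULL = b.user_id never matches. Adding a.user_id IS NOT NULL to the WHERE clause drops that row.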

 

Let's look at how these two statements are executed.

EXPLAIN output for NOT IN:

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-4 is a root stage                          |
|   Stage-1 depends on stages: Stage-4               |
|   Stage-2 depends on stages: Stage-1               |
|   Stage-0 depends on stages: Stage-2               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-4                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: user_id is null (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is null (type: boolean) |
|               Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   aggregations: count()            |
|                   mode: hash                       |
|                   outputColumnNames: _col0         |
|                   Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     sort order:                    |
|                     Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|                     value expressions: _col0 (type: bigint) |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: count(VALUE._col0)         |
|           mode: mergepartial                       |
|           outputColumnNames: _col0                 |
|           Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|           Filter Operator                          |
|             predicate: (_col0 = 0) (type: boolean) |
|             Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 keys: 0 (type: bigint)             |
|                 mode: hash                         |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|                 File Output Operator               |
|                   compressed: false                |
|                   table:                           |
|                       input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                       output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                       serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Reduce Output Operator                 |
|               sort order:                          |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               value expressions: user_id (type: bigint), sex_id (type: bigint) |
|           TableScan                                |
|             Reduce Output Operator                 |
|               sort order:                          |
|               Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Semi Join 0 to 1               |
|           keys:                                    |
|             0                                      |
|             1                                      |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 6 Data size: 73 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-2                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             Reduce Output Operator                 |
|               key expressions: _col0 (type: bigint) |
|               sort order: +                        |
|               Map-reduce partition columns: _col0 (type: bigint) |
|               Statistics: Num rows: 6 Data size: 73 Basic stats: COMPLETE Column stats: NONE |
|               value expressions: _col1 (type: bigint) |
|           TableScan                                |
|             alias: b                               |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: user_id (type: bigint)  |
|               outputColumnNames: _col0             |
|               Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: _col0 (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: _col0 (type: bigint) |
|                 Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Outer Join0 to 1               |
|           keys:                                    |
|             0 _col0 (type: bigint)                 |
|             1 _col0 (type: bigint)                 |
|           outputColumnNames: _col0, _col1, _col5   |
|           Statistics: Num rows: 6 Data size: 80 Basic stats: COMPLETE Column stats: NONE |
|           Filter Operator                          |
|             predicate: _col5 is null (type: boolean) |
|             Statistics: Num rows: 3 Data size: 40 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: _col0 (type: bigint), _col1 (type: bigint) |
|               outputColumnNames: _col0, _col1      |
|               Statistics: Num rows: 3 Data size: 40 Basic stats: COMPLETE Column stats: NONE |
|               File Output Operator                 |
|                 compressed: false                  |
|                 Statistics: Num rows: 3 Data size: 40 Basic stats: COMPLETE Column stats: NONE |
|                 table:                             |
|                     input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                     output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+

 

EXPLAIN output for LEFT OUTER JOIN:

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             filterExpr: user_id is not null (type: boolean) |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: user_id is not null (type: boolean) |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: user_id (type: bigint) |
|                 Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|                 value expressions: sex_id (type: bigint) |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: (user_id is null and user_id is not null) (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (user_id is null and user_id is not null) (type: boolean) |
|               Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint) |
|                 sort order: +                      |
|                 Map-reduce partition columns: user_id (type: bigint) |
|                 Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Outer Join0 to 1               |
|           keys:                                    |
|             0 user_id (type: bigint)               |
|             1 user_id (type: bigint)               |
|           outputColumnNames: _col0, _col1, _col5   |
|           Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|           Filter Operator                          |
|             predicate: _col5 is not null (type: boolean) |
|             Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: _col0 (type: bigint), _col1 (type: bigint) |
|               outputColumnNames: _col0, _col1      |
|               Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|               File Output Operator                 |
|                 compressed: false                  |
|                 Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|                 table:                             |
|                     input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                     output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
63 rows selected (0.143 seconds)

 

 

 

 

 

 

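Implementing a multi-condition IN with LEFT SEMI JOIN

 

Note: in Hive, an IN subquery can be used with a single column only. With multiple columns, use EXISTS or LEFT SEMI JOIN.

 

The following IN statement is invalid: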

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE (a.user_id, a.sex_id) IN (
 SELECT  
 b.user_id
 ,b.sex_id
 FROM data_semi_b AS b
)
;

Error: Error while compiling statement: FAILED: ParseException line 6:0 mismatched input 'SELECT' expecting ( near '(' in expression specification (state=42000,code=40000)

 

 

Use one of the following forms instead.

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON a.user_id = b.user_id
 AND a.sex_id = b.sex_id
;

or

SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE EXISTS (
 SELECT 1
 FROM data_semi_b AS b
 WHERE 
  a.user_id = b.user_id
  AND a.sex_id = b.sex_id
)
;

Execution result

INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO  : 2020-04-12 23:46:16,157 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-04-12 23:46:24,375 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 3.04 sec
INFO  : 2020-04-12 23:46:28,545 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.84 sec
INFO  : 2020-04-12 23:46:35,732 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.85 sec
INFO  : MapReduce Total cumulative CPU time: 7 seconds 850 msec
INFO  : Ended Job = job_1586423165261_0110
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 7.85 sec   HDFS Read: 17951 HDFS Write: 119 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 7 seconds 850 msec
INFO  : Completed executing command(queryId=hive_20200412234607_8b6acba0-54bb-420f-80df-a5efd5dc9ae5); Time taken: 29.286 seconds
INFO  : OK
+------------+-----------+
| a.user_id  | a.sex_id  |
+------------+-----------+
| 1          | 0         |
| 2          | 1         |
+------------+-----------+
2 rows selected (29.379 seconds)
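
Note that the row (NULL, 0) from data_semi_a is missing from the result even though data_semi_b contains (NULL, 0, 3): with the plain = operator, NULL = NULL does not evaluate to true, so NULL keys never match. If NULL keys should be treated as equal, a minimal sketch using Hive's null-safe equality operator <=> might look like the following. This variant is an illustration and was not part of the original test run; with the test data above it would be expected to return the (NULL, 0) row in addition to the two rows shown.

```sql
-- Sketch (assumption: NULL keys should match each other).
-- <=> behaves like = for non-NULL operands and returns TRUE when both sides are NULL.
SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON a.user_id <=> b.user_id
 AND a.sex_id <=> b.sex_id
;
```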

 

 

Next, compare the EXPLAIN output of the two approaches.
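
The plans can be obtained by prefixing each of the two queries above with EXPLAIN, roughly as follows (the statements are the ones shown earlier, unchanged apart from the EXPLAIN keyword):

```sql
-- Plan for the LEFT SEMI JOIN form
EXPLAIN
SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
 ON a.user_id = b.user_id
 AND a.sex_id = b.sex_id
;

-- Plan for the EXISTS form
EXPLAIN
SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
WHERE EXISTS (
 SELECT 1
 FROM data_semi_b AS b
 WHERE 
  a.user_id = b.user_id
  AND a.sex_id = b.sex_id
)
;
```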

LEFT SEMI JOIN

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (user_id is not null and sex_id is not null) (type: boolean) |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint), sex_id (type: bigint) |
|                 sort order: ++                     |
|                 Map-reduce partition columns: user_id (type: bigint), sex_id (type: bigint) |
|                 Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (user_id is not null and sex_id is not null) (type: boolean) |
|               Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: user_id (type: bigint), sex_id (type: bigint) |
|                 outputColumnNames: _col0, _col1    |
|                 Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   keys: _col0 (type: bigint), _col1 (type: bigint) |
|                   mode: hash                       |
|                   outputColumnNames: _col0, _col1  |
|                   Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     key expressions: _col0 (type: bigint), _col1 (type: bigint) |
|                     sort order: ++                 |
|                     Map-reduce partition columns: _col0 (type: bigint), _col1 (type: bigint) |
|                     Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Semi Join 0 to 1               |
|           keys:                                    |
|             0 user_id (type: bigint), sex_id (type: bigint) |
|             1 _col0 (type: bigint), _col1 (type: bigint) |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
64 rows selected (0.121 seconds)

 

EXISTS

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
|             Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (user_id is not null and sex_id is not null) (type: boolean) |
|               Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: user_id (type: bigint), sex_id (type: bigint) |
|                 sort order: ++                     |
|                 Map-reduce partition columns: user_id (type: bigint), sex_id (type: bigint) |
|                 Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
|             Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (user_id is not null and sex_id is not null) (type: boolean) |
|               Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: user_id (type: bigint), sex_id (type: bigint) |
|                 outputColumnNames: _col0, _col1    |
|                 Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   keys: _col0 (type: bigint), _col1 (type: bigint) |
|                   mode: hash                       |
|                   outputColumnNames: _col0, _col1  |
|                   Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     key expressions: _col0 (type: bigint), _col1 (type: bigint) |
|                     sort order: ++                 |
|                     Map-reduce partition columns: _col0 (type: bigint), _col1 (type: bigint) |
|                     Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Left Semi Join 0 to 1               |
|           keys:                                    |
|             0 user_id (type: bigint), sex_id (type: bigint) |
|             1 _col0 (type: bigint), _col1 (type: bigint) |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
64 rows selected (0.147 seconds)

 

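As the output shows, the two approaches produce the same execution plan: Hive compiles the EXISTS subquery into a Left Semi Join as well.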
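
One way to read the shared plan: on the b side, rows with NULL keys are filtered out and a Group By Operator deduplicates (user_id, sex_id) before the join, so each row of a can match at most once. A hand-written query with roughly the same effect is sketched below. It is only an illustration of what the plan does, not something Hive generates, and the alias b_keys is a name chosen for this example.

```sql
-- Sketch: an inner join against a deduplicated, NULL-free key set from data_semi_b.
-- b_keys is an illustrative alias, not part of the original article.
SELECT 
 a.user_id
 ,a.sex_id
FROM data_semi_a AS a
JOIN (
 SELECT DISTINCT
  b.user_id
  ,b.sex_id
 FROM data_semi_b AS b
 WHERE b.user_id IS NOT NULL
  AND b.sex_id IS NOT NULL
) AS b_keys
 ON a.user_id = b_keys.user_id
 AND a.sex_id = b_keys.sex_id
;
```

In practice the LEFT SEMI JOIN or EXISTS form is shorter and lets Hive apply the deduplication itself.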

 

 

 

 


Reposted from: blog.csdn.net/u010003835/article/details/105476658