Reference article: https://blog.csdn.net/happyrocking/article/details/79885071
This article focuses on Hive's LEFT SEMI JOIN and how it relates to (IN / NOT IN) and (EXISTS / NOT EXISTS) subqueries.
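LEFT SEMI JOIN basics
First, we need to understand what LEFT SEMI JOIN is.
Characteristics
1. The restriction of a left semi join is that the table on the right side of the JOIN may only be referenced in the ON clause as a filter condition. Referencing it in the WHERE clause, the SELECT clause, or anywhere else does not work.
2. A left semi join passes only the join key of the right table to the map stage, so the final SELECT result can contain only columns from the left table.
3. Because a left semi join has IN (keySet) semantics, a left-table row is emitted once and then skipped when the right table contains duplicate records, whereas a regular join keeps traversing the matches. As a result, when the right table contains duplicate values, a left semi join produces one row where a join produces several, which also gives the left semi join better performance.
For example, joining or left-semi-joining the following tables A and B and then selecting all fields gives the results below.
Note: the columns marked with a blue cross do not actually exist in the left semi join result, because the final SELECT may contain only left-table columns.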
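The duplicate-handling difference described in point 3 can be sketched in plain Python. This is only an illustration of the semantics, not of how Hive executes the join; the function names `inner_join` and `left_semi_join` and the sample rows are made up for this sketch.

```python
# Illustrative only: models join semantics with lists of tuples, not Hive internals.

def inner_join(left, right):
    """Emit one combined row per matching (left, right) pair."""
    return [l + r for l in left for r in right
            if l[0] is not None and l[0] == r[0]]

def left_semi_join(left, right):
    """Emit each left row at most once if its key occurs in the right key set."""
    key_set = {r[0] for r in right if r[0] is not None}
    return [l for l in left if l[0] in key_set]

# Hypothetical sample data: (user_id, value); the right side repeats key 1.
left_rows = [(1, "a"), (2, "b"), (3, "c")]
right_rows = [(1, "x"), (1, "y"), (2, "z")]

joined = inner_join(left_rows, right_rows)
semi = left_semi_join(left_rows, right_rows)

print(joined)  # key 1 appears twice, once per matching right row
print(semi)    # key 1 appears once, and only left-side columns are present
```

The semi join never widens the row with right-side columns, which is why the right table cannot be referenced in SELECT.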
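In fact, LEFT SEMI JOIN can be regarded as an alternative to subqueries written with (IN / NOT IN) or (EXISTS / NOT EXISTS).
Before Hive 0.13, (IN / NOT IN) and (EXISTS / NOT EXISTS) subqueries were not supported, so LEFT SEMI JOIN had to be used in those versions.
This is described in the Hive documentation.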
Creating the basic test data
DROP TABLE IF EXISTS data_semi_a;
DROP TABLE IF EXISTS data_semi_b;
CREATE TABLE IF NOT EXISTS data_semi_a
(
user_id BIGINT
,sex_id BIGINT
);
CREATE TABLE IF NOT EXISTS data_semi_b
(
user_id BIGINT
,sex_id BIGINT
,age BIGINT
);
INSERT INTO TABLE data_semi_a VALUES
(NULL ,0)
,(1, 1)
,(1, 0)
,(2, 1)
,(3, 0)
,(4, 1)
;
INSERT INTO TABLE data_semi_b VALUES
(NULL, 0, 3)
,(1, 0, 12)
,(2, 1, 14)
;
Test data:
data_semi_a
+----------------------+---------------------+
| data_semi_a.user_id | data_semi_a.sex_id |
+----------------------+---------------------+
| NULL | 0 |
| 1 | 1 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 1 |
+----------------------+---------------------+
data_semi_b
+----------------------+---------------------+------------------+
| data_semi_b.user_id | data_semi_b.sex_id | data_semi_b.age |
+----------------------+---------------------+------------------+
| NULL | 0 | 3 |
| 1 | 0 | 12 |
| 2 | 1 | 14 |
+----------------------+---------------------+------------------+
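A single-condition LEFT SEMI JOIN is equivalent to IN
Note
LEFT SEMI JOIN is equivalent to IN. The principle is that only the join KEY of the right table is passed along.
Therefore, in A LEFT SEMI JOIN B, columns of B cannot appear in the SELECT list.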
SELECT
a.user_id
,a.sex_id
,b.age
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
ON a.user_id = b.user_id
;
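Error: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 4:1 Invalid table alias or column reference 'b': (possible column names are: user_id, sex_id) (state=42000,code=10004)
A single-condition LEFT SEMI JOIN is equivalent to IN. Take the following SQL as an example.
LEFT SEMI JOIN SQL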
SELECT
a.user_id
,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
ON a.user_id = b.user_id
;
Equivalent SQL using IN
SELECT
a.user_id
,a.sex_id
FROM data_semi_a AS a
WHERE a.user_id IN (
SELECT b.user_id
FROM data_semi_b AS b
);
Next, compare the results of the two SQL statements.
LEFT SEMI JOIN execution result
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO : 2020-04-12 10:53:09,591 Stage-1 map = 0%, reduce = 0%
INFO : 2020-04-12 10:53:17,849 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 3.12 sec
INFO : 2020-04-12 10:53:22,975 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.84 sec
INFO : 2020-04-12 10:53:29,141 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.77 sec
INFO : MapReduce Total cumulative CPU time: 7 seconds 770 msec
INFO : Ended Job = job_1586423165261_0087
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 7.77 sec HDFS Read: 16677 HDFS Write: 135 SUCCESS
INFO : Total MapReduce CPU Time Spent: 7 seconds 770 msec
INFO : Completed executing command(queryId=hive_20200412105301_9f643e42-c966-4140-8c72-330be6bdd73c); Time taken: 28.939 seconds
INFO : OK
+------------+-----------+
| a.user_id | a.sex_id |
+------------+-----------+
| 1 | 0 |
| 1 | 1 |
| 2 | 1 |
+------------+-----------+
3 rows selected (29.073 seconds)
IN execution result
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO : 2020-04-12 10:37:26,143 Stage-1 map = 0%, reduce = 0%
INFO : 2020-04-12 10:37:33,376 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 2.71 sec
INFO : 2020-04-12 10:37:39,510 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.6 sec
INFO : 2020-04-12 10:37:44,680 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.41 sec
INFO : MapReduce Total cumulative CPU time: 7 seconds 410 msec
INFO : Ended Job = job_1586423165261_0085
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 7.41 sec HDFS Read: 16726 HDFS Write: 135 SUCCESS
INFO : Total MapReduce CPU Time Spent: 7 seconds 410 msec
INFO : Completed executing command(queryId=hive_20200412103717_2ab604da-f301-4fee-b9bd-9c22ad6e65a1); Time taken: 27.796 seconds
INFO : OK
+------------+-----------+
| a.user_id | a.sex_id |
+------------+-----------+
| 1 | 0 |
| 1 | 1 |
| 2 | 1 |
+------------+-----------+
3 rows selected (27.902 seconds)
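The same equivalence can be checked outside Hive. The sketch below uses Python's built-in `sqlite3` module with the article's test data; SQLite is an assumption made here for convenience, and since it has no LEFT SEMI JOIN syntax, a correlated EXISTS stands in for the semi join.

```python
import sqlite3

# In-memory SQLite database loaded with the article's test data (not Hive).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_semi_a (user_id INTEGER, sex_id INTEGER);
    CREATE TABLE data_semi_b (user_id INTEGER, sex_id INTEGER, age INTEGER);
    INSERT INTO data_semi_a VALUES (NULL, 0), (1, 1), (1, 0), (2, 1), (3, 0), (4, 1);
    INSERT INTO data_semi_b VALUES (NULL, 0, 3), (1, 0, 12), (2, 1, 14);
""")

# IN form, as in the article.
in_rows = conn.execute("""
    SELECT a.user_id, a.sex_id
    FROM data_semi_a AS a
    WHERE a.user_id IN (SELECT b.user_id FROM data_semi_b AS b)
    ORDER BY a.user_id, a.sex_id
""").fetchall()

# EXISTS form, standing in for LEFT SEMI JOIN.
exists_rows = conn.execute("""
    SELECT a.user_id, a.sex_id
    FROM data_semi_a AS a
    WHERE EXISTS (SELECT 1 FROM data_semi_b AS b WHERE b.user_id = a.user_id)
    ORDER BY a.user_id, a.sex_id
""").fetchall()

print(in_rows)
print(exists_rows)
conn.close()
```

Both queries return the three rows shown in the Hive output; the row with a NULL user_id is dropped because NULL never compares equal to anything.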
Next, let's look at the EXPLAIN results of the two statements.
EXPLAIN result for LEFT SEMI JOIN:
INFO : Starting task [Stage-3:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20200412105949_53e51917-8c04-4f6f-b9fd-32ab71a2888b); Time taken: 0.005 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| filterExpr: user_id is not null (type: boolean) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: user_id is not null (type: boolean) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| value expressions: sex_id (type: bigint) |
| TableScan |
| alias: b |
| filterExpr: user_id is not null (type: boolean) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: user_id is not null (type: boolean) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: user_id (type: bigint) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: _col0 (type: bigint) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: bigint) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Semi Join 0 to 1 |
| keys: |
| 0 user_id (type: bigint) |
| 1 _col0 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
65 rows selected (0.136 seconds)
EXPLAIN result for IN:
INFO : Starting task [Stage-3:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20200412110229_81d9cf79-50e2-46f1-8152-a399038861c7); Time taken: 0.005 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| filterExpr: user_id is not null (type: boolean) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: user_id is not null (type: boolean) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| value expressions: sex_id (type: bigint) |
| TableScan |
| alias: b |
| filterExpr: user_id is not null (type: boolean) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: user_id is not null (type: boolean) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: user_id (type: bigint) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: _col0 (type: bigint) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: bigint) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Semi Join 0 to 1 |
| keys: |
| 0 user_id (type: bigint) |
| 1 _col0 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
65 rows selected (0.127 seconds)
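The execution results and the EXPLAIN results of the two statements are exactly the same.
In fact, IN is itself executed as a LEFT SEMI JOIN internally: the plan for the IN query contains a Left Semi Join operator.
Implementing NOT IN with LEFT OUTER JOIN
Note that LEFT SEMI JOIN cannot express NOT IN.
Keep in mind: Hive (at least the version used here) does not accept non-equality join conditions!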
SELECT
a.user_id
,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
ON (a.user_id != b.user_id)
;
Error: Error while compiling statement: FAILED: SemanticException [Error 10017]: Line 6:4 Both left and right aliases encountered in JOIN 'user_id' (state=42000,code=10017)
The correct way to write it
SELECT
a.user_id
,a.sex_id
FROM data_semi_a AS a
WHERE a.user_id NOT IN (
SELECT b.user_id
FROM data_semi_b AS b
);
INFO : Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1
INFO : 2020-04-12 23:02:26,751 Stage-2 map = 0%, reduce = 0%
INFO : 2020-04-12 23:02:33,938 Stage-2 map = 50%, reduce = 0%, Cumulative CPU 1.76 sec
INFO : 2020-04-12 23:02:39,172 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 3.35 sec
INFO : 2020-04-12 23:02:47,688 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 7.88 sec
INFO : MapReduce Total cumulative CPU time: 7 seconds 880 msec
INFO : Ended Job = job_1586423165261_0106
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-4: Map: 1 Reduce: 1 Cumulative CPU: 6.49 sec HDFS Read: 8372 HDFS Write: 96 SUCCESS
INFO : Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 5.65 sec HDFS Read: 11974 HDFS Write: 96 SUCCESS
INFO : Stage-Stage-2: Map: 2 Reduce: 1 Cumulative CPU: 7.88 sec HDFS Read: 14131 HDFS Write: 87 SUCCESS
INFO : Total MapReduce CPU Time Spent: 20 seconds 20 msec
INFO : Completed executing command(queryId=hive_20200412230117_fef818dc-e433-4880-9c8d-f6a9d28a08a9); Time taken: 91.471 seconds
INFO : OK
+------------+-----------+
| a.user_id | a.sex_id |
+------------+-----------+
+------------+-----------+
No rows selected (91.674 seconds)
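The empty result comes from the NULL user_id in data_semi_b: under SQL's three-valued logic, `x NOT IN (..., NULL)` is never true. The sketch below reproduces this with Python's built-in `sqlite3` module rather than Hive, which is an assumption made so the example can run anywhere; the variable names are chosen for this sketch.

```python
import sqlite3

# In-memory SQLite database loaded with the article's test data (not Hive).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_semi_a (user_id INTEGER, sex_id INTEGER);
    CREATE TABLE data_semi_b (user_id INTEGER, sex_id INTEGER, age INTEGER);
    INSERT INTO data_semi_a VALUES (NULL, 0), (1, 1), (1, 0), (2, 1), (3, 0), (4, 1);
    INSERT INTO data_semi_b VALUES (NULL, 0, 3), (1, 0, 12), (2, 1, 14);
""")

# The subquery contains a NULL, so no row can satisfy NOT IN.
not_in_all = conn.execute("""
    SELECT a.user_id, a.sex_id
    FROM data_semi_a AS a
    WHERE a.user_id NOT IN (SELECT b.user_id FROM data_semi_b AS b)
""").fetchall()

# Excluding NULL keys from the subquery gives the result most people expect.
not_in_non_null = conn.execute("""
    SELECT a.user_id, a.sex_id
    FROM data_semi_a AS a
    WHERE a.user_id NOT IN (
        SELECT b.user_id FROM data_semi_b AS b WHERE b.user_id IS NOT NULL
    )
    ORDER BY a.user_id
""").fetchall()

print(not_in_all)
print(not_in_non_null)
conn.close()
```

Filtering NULL keys out of the subquery is a common way to make NOT IN behave like a plain "not in this set" test.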
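Equivalent SQL. Note that NOT IN cannot be implemented with LEFT SEMI JOIN; a LEFT OUTER JOIN has to be used instead.
Be careful with the query below, though: its ON clause requires b.user_id IS NULL while its WHERE clause requires b.user_id IS NOT NULL, so it returns no rows for any input. It agrees with the NOT IN result here only because data_semi_b contains a NULL user_id, which by itself makes NOT IN return no rows. The usual anti-join form joins ON a.user_id = b.user_id alone and filters with WHERE a.user_id IS NOT NULL AND b.user_id IS NULL; that form matches NOT IN only when b.user_id contains no NULL values.
SQL using LEFT OUTER JOIN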
SELECT
a.user_id
,a.sex_id
FROM data_semi_a AS a
LEFT OUTER JOIN data_semi_b AS b
ON a.user_id = b.user_id
AND b.user_id IS NULL
WHERE a.user_id IS NOT NULL
AND b.user_id IS NOT NULL
;
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO : 2020-04-12 23:04:47,896 Stage-1 map = 0%, reduce = 0%
INFO : 2020-04-12 23:04:55,176 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 2.91 sec
INFO : 2020-04-12 23:05:00,288 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.53 sec
INFO : 2020-04-12 23:05:06,449 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8.45 sec
INFO : MapReduce Total cumulative CPU time: 8 seconds 450 msec
INFO : Ended Job = job_1586423165261_0107
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 8.45 sec HDFS Read: 16358 HDFS Write: 87 SUCCESS
INFO : Total MapReduce CPU Time Spent: 8 seconds 450 msec
INFO : Completed executing command(queryId=hive_20200412230438_62ce326e-1b03-4c5a-a842-6816dc6feda3); Time taken: 28.871 seconds
INFO : OK
+------------+-----------+
| a.user_id | a.sex_id |
+------------+-----------+
+------------+-----------+
No rows selected (28.979 seconds)
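For comparison, here is a sketch of the conventional anti-join patterns, again run through Python's built-in `sqlite3` module instead of Hive (an assumption for portability). It is not the query shown above: it joins on the key only and keeps the left rows that found no partner, and it also shows the NOT EXISTS form.

```python
import sqlite3

# In-memory SQLite database loaded with the article's test data (not Hive).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_semi_a (user_id INTEGER, sex_id INTEGER);
    CREATE TABLE data_semi_b (user_id INTEGER, sex_id INTEGER, age INTEGER);
    INSERT INTO data_semi_a VALUES (NULL, 0), (1, 1), (1, 0), (2, 1), (3, 0), (4, 1);
    INSERT INTO data_semi_b VALUES (NULL, 0, 3), (1, 0, 12), (2, 1, 14);
""")

# Anti-join: keep left rows for which the outer join found no match.
anti_join = conn.execute("""
    SELECT a.user_id, a.sex_id
    FROM data_semi_a AS a
    LEFT OUTER JOIN data_semi_b AS b
      ON a.user_id = b.user_id
    WHERE b.user_id IS NULL
    ORDER BY a.user_id
""").fetchall()

# NOT EXISTS expresses the same "no matching row" test.
not_exists = conn.execute("""
    SELECT a.user_id, a.sex_id
    FROM data_semi_a AS a
    WHERE NOT EXISTS (SELECT 1 FROM data_semi_b AS b WHERE b.user_id = a.user_id)
    ORDER BY a.user_id
""").fetchall()

# Adding a.user_id IS NOT NULL drops the left row whose key is NULL.
anti_join_non_null = conn.execute("""
    SELECT a.user_id, a.sex_id
    FROM data_semi_a AS a
    LEFT OUTER JOIN data_semi_b AS b
      ON a.user_id = b.user_id
    WHERE a.user_id IS NOT NULL
      AND b.user_id IS NULL
    ORDER BY a.user_id
""").fetchall()

print(anti_join)
print(not_exists)
print(anti_join_non_null)
conn.close()
```

Unlike NOT IN, both anti-join forms ignore the NULL key in data_semi_b, so they return rows even though the NOT IN query on the same data returns none.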
Let's look at the execution plans of these two SQL statements.
EXPLAIN result for NOT IN:
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-4 is a root stage |
| Stage-1 depends on stages: Stage-4 |
| Stage-2 depends on stages: Stage-1 |
| Stage-0 depends on stages: Stage-2 |
| |
| STAGE PLANS: |
| Stage: Stage-4 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: b |
| filterExpr: user_id is null (type: boolean) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: user_id is null (type: boolean) |
| Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: count() |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| sort order: |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col0 (type: bigint) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(VALUE._col0) |
| mode: mergepartial |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: (_col0 = 0) (type: boolean) |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: 0 (type: bigint) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| sort order: |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| value expressions: user_id (type: bigint), sex_id (type: bigint) |
| TableScan |
| Reduce Output Operator |
| sort order: |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Semi Join 0 to 1 |
| keys: |
| 0 |
| 1 |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 6 Data size: 73 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: bigint) |
| Statistics: Num rows: 6 Data size: 73 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col1 (type: bigint) |
| TableScan |
| alias: b |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: user_id (type: bigint) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: bigint) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Outer Join0 to 1 |
| keys: |
| 0 _col0 (type: bigint) |
| 1 _col0 (type: bigint) |
| outputColumnNames: _col0, _col1, _col5 |
| Statistics: Num rows: 6 Data size: 80 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: _col5 is null (type: boolean) |
| Statistics: Num rows: 3 Data size: 40 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col0 (type: bigint), _col1 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 3 Data size: 40 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 3 Data size: 40 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
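The three stages of the NOT IN plan can be paraphrased in plain Python. This is a reading of the EXPLAIN output above under simplifying assumptions, not Hive code; the function name `not_in_plan` and the tuple layout are invented for this sketch.

```python
# Illustrative model of the NOT IN plan: follows the EXPLAIN output literally.
a_rows = [(None, 0), (1, 1), (1, 0), (2, 1), (3, 0), (4, 1)]      # (user_id, sex_id)
b_rows = [(None, 0, 3), (1, 0, 12), (2, 1, 14)]                   # (user_id, sex_id, age)

def not_in_plan(a_rows, b_rows):
    # Stage-4: count right-side rows whose key is NULL and keep a single
    # marker row only when that count is 0 (predicate: _col0 = 0).
    null_count = sum(1 for b in b_rows if b[0] is None)
    gate = [0] if null_count == 0 else []

    # Stage-1: key-less left semi join against the marker; an empty marker
    # removes every left row.
    survivors = list(a_rows) if gate else []

    # Stage-2: left outer join on user_id, then keep the rows that found
    # no match (predicate: _col5 is null).
    b_keys = {b[0] for b in b_rows if b[0] is not None}
    return [a for a in survivors if a[0] not in b_keys]

with_null_key = not_in_plan(a_rows, b_rows)
without_null_key = not_in_plan(a_rows, [b for b in b_rows if b[0] is not None])

print(with_null_key)     # the NULL key in b closes the gate in Stage-4
print(without_null_key)  # with no NULL key in b, Stage-2 acts as an anti-join
```

Read this way, the extra Stage-4 and Stage-1 exist only to handle a NULL key on the right side, which is why NOT IN needs three jobs where the IN query needs one.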
EXPLAIN result for LEFT OUTER JOIN:
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| filterExpr: user_id is not null (type: boolean) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: user_id is not null (type: boolean) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| value expressions: sex_id (type: bigint) |
| TableScan |
| alias: b |
| filterExpr: (user_id is null and user_id is not null) (type: boolean) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: (user_id is null and user_id is not null) (type: boolean) |
| Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint) |
| sort order: + |
| Map-reduce partition columns: user_id (type: bigint) |
| Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Outer Join0 to 1 |
| keys: |
| 0 user_id (type: bigint) |
| 1 user_id (type: bigint) |
| outputColumnNames: _col0, _col1, _col5 |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: _col5 is not null (type: boolean) |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col0 (type: bigint), _col1 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
63 rows selected (0.143 seconds)
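Implementing a multi-condition IN with LEFT SEMI JOIN
Note: in Hive, an IN subquery can only be used with a single column. With multiple columns, use EXISTS (or LEFT SEMI JOIN) instead.
The following IN SQL is invalid: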
SELECT
a.user_id
,a.sex_id
FROM data_semi_a AS a
WHERE (a.user_id, a.sex_id) IN (
SELECT
b.user_id
,b.sex_id
FROM data_semi_b AS b
)
;
Error: Error while compiling statement: FAILED: ParseException line 6:0 mismatched input 'SELECT' expecting ( near '(' in expression specification (state=42000,code=40000)
Use one of the following forms instead.
SELECT
a.user_id
,a.sex_id
FROM data_semi_a AS a
LEFT SEMI JOIN data_semi_b AS b
ON a.user_id = b.user_id
AND a.sex_id = b.sex_id
;
or
SELECT
a.user_id
,a.sex_id
FROM data_semi_a AS a
WHERE EXISTS (
SELECT 1
FROM data_semi_b AS b
WHERE
a.user_id = b.user_id
AND a.sex_id = b.sex_id
)
;
Execution result
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO : 2020-04-12 23:46:16,157 Stage-1 map = 0%, reduce = 0%
INFO : 2020-04-12 23:46:24,375 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 3.04 sec
INFO : 2020-04-12 23:46:28,545 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.84 sec
INFO : 2020-04-12 23:46:35,732 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.85 sec
INFO : MapReduce Total cumulative CPU time: 7 seconds 850 msec
INFO : Ended Job = job_1586423165261_0110
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 7.85 sec HDFS Read: 17951 HDFS Write: 119 SUCCESS
INFO : Total MapReduce CPU Time Spent: 7 seconds 850 msec
INFO : Completed executing command(queryId=hive_20200412234607_8b6acba0-54bb-420f-80df-a5efd5dc9ae5); Time taken: 29.286 seconds
INFO : OK
+------------+-----------+
| a.user_id | a.sex_id |
+------------+-----------+
| 1 | 0 |
| 2 | 1 |
+------------+-----------+
2 rows selected (29.379 seconds)
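The two-row result can be reproduced outside Hive. The sketch below runs the EXISTS form through Python's built-in `sqlite3` module (an assumption for portability, since SQLite has no LEFT SEMI JOIN) and cross-checks it against a set of composite keys built in plain Python.

```python
import sqlite3

# In-memory SQLite database loaded with the article's test data (not Hive).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_semi_a (user_id INTEGER, sex_id INTEGER);
    CREATE TABLE data_semi_b (user_id INTEGER, sex_id INTEGER, age INTEGER);
    INSERT INTO data_semi_a VALUES (NULL, 0), (1, 1), (1, 0), (2, 1), (3, 0), (4, 1);
    INSERT INTO data_semi_b VALUES (NULL, 0, 3), (1, 0, 12), (2, 1, 14);
""")

# Multi-column membership test written as a correlated EXISTS.
exists_rows = conn.execute("""
    SELECT a.user_id, a.sex_id
    FROM data_semi_a AS a
    WHERE EXISTS (
        SELECT 1
        FROM data_semi_b AS b
        WHERE a.user_id = b.user_id
          AND a.sex_id = b.sex_id
    )
    ORDER BY a.user_id, a.sex_id
""").fetchall()

# Same check in plain Python: a set of (user_id, sex_id) keys with NULLs excluded,
# mirroring the "is not null" filters in the Hive plan.
a_rows = conn.execute("SELECT user_id, sex_id FROM data_semi_a").fetchall()
b_keys = {
    (u, s)
    for u, s in conn.execute("SELECT user_id, sex_id FROM data_semi_b")
    if u is not None and s is not None
}
set_rows = sorted(row for row in a_rows if row in b_keys)

print(exists_rows)
print(set_rows)
conn.close()
```

The row (NULL, 0) exists in both tables but is not returned, because a NULL key never satisfies an equality condition.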
Check the EXPLAIN results of the two approaches.
LEFT SEMI JOIN
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: (user_id is not null and sex_id is not null) (type: boolean) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint), sex_id (type: bigint) |
| sort order: ++ |
| Map-reduce partition columns: user_id (type: bigint), sex_id (type: bigint) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| TableScan |
| alias: b |
| filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: (user_id is not null and sex_id is not null) (type: boolean) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: user_id (type: bigint), sex_id (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: _col0 (type: bigint), _col1 (type: bigint) |
| mode: hash |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint), _col1 (type: bigint) |
| sort order: ++ |
| Map-reduce partition columns: _col0 (type: bigint), _col1 (type: bigint) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Semi Join 0 to 1 |
| keys: |
| 0 user_id (type: bigint), sex_id (type: bigint) |
| 1 _col0 (type: bigint), _col1 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
64 rows selected (0.121 seconds)
EXISTS
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: (user_id is not null and sex_id is not null) (type: boolean) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: user_id (type: bigint), sex_id (type: bigint) |
| sort order: ++ |
| Map-reduce partition columns: user_id (type: bigint), sex_id (type: bigint) |
| Statistics: Num rows: 6 Data size: 19 Basic stats: COMPLETE Column stats: NONE |
| TableScan |
| alias: b |
| filterExpr: (user_id is not null and sex_id is not null) (type: boolean) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: (user_id is not null and sex_id is not null) (type: boolean) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: user_id (type: bigint), sex_id (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: _col0 (type: bigint), _col1 (type: bigint) |
| mode: hash |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint), _col1 (type: bigint) |
| sort order: ++ |
| Map-reduce partition columns: _col0 (type: bigint), _col1 (type: bigint) |
| Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Left Semi Join 0 to 1 |
| keys: |
| 0 user_id (type: bigint), sex_id (type: bigint) |
| 1 _col0 (type: bigint), _col1 (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 6 Data size: 20 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
64 rows selected (0.147 seconds)
The execution plans of the two approaches are identical.