Index page for this article series:
https://blog.csdn.net/u010003835/article/details/105334641
Test table and test data
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE TABLE `datacube_salary_org`( |
| `company_name` string COMMENT '????', |
| `dep_name` string COMMENT '????', |
| `user_id` bigint COMMENT '??id', |
| `user_name` string COMMENT '????', |
| `salary` decimal(10,2) COMMENT '??', |
| `create_time` date COMMENT '????', |
| `update_time` date COMMENT '????') |
| PARTITIONED BY ( |
| `pt` string COMMENT '????') |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| WITH SERDEPROPERTIES ( |
| 'field.delim'=',', |
| 'serialization.format'=',') |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.mapred.TextInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION |
| 'hdfs://cdh-manager:8020/user/hive/warehouse/data_warehouse_test.db/datacube_salary_org' |
| TBLPROPERTIES ( |
| 'transient_lastDdlTime'='1586310488') |
+----------------------------------------------------+
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| datacube_salary_org.company_name | datacube_salary_org.dep_name | datacube_salary_org.user_id | datacube_salary_org.user_name | datacube_salary_org.salary | datacube_salary_org.create_time | datacube_salary_org.update_time | datacube_salary_org.pt |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| s.zh | engineer | 1 | szh | 28000.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| s.zh | engineer | 2 | zyq | 26000.00 | 2020-04-03 | 2020-04-03 | 20200405 |
| s.zh | tester | 3 | gkm | 20000.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | finance | 4 | pip | 13400.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | finance | 5 | kip | 24500.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | finance | 6 | zxxc | 13000.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| x.qx | kiccp | 7 | xsz | 8600.00 | 2020-04-07 | 2020-04-07 | 20200405 |
| s.zh | engineer | 1 | szh | 28000.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| s.zh | engineer | 2 | zyq | 26000.00 | 2020-04-03 | 2020-04-03 | 20200406 |
| s.zh | tester | 3 | gkm | 20000.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | finance | 4 | pip | 13400.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | finance | 5 | kip | 24500.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | finance | 6 | zxxc | 13000.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| x.qx | kiccp | 7 | xsz | 8600.00 | 2020-04-07 | 2020-04-07 | 20200406 |
| s.zh | enginer | 1 | szh | 28000.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| s.zh | enginer | 2 | zyq | 26000.00 | 2020-04-03 | 2020-04-03 | 20200407 |
| s.zh | tester | 3 | gkm | 20000.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | finance | 4 | pip | 13400.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | finance | 5 | kip | 24500.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | finance | 6 | zxxc | 13000.00 | 2020-04-07 | 2020-04-07 | 20200407 |
| x.qx | kiccp | 7 | xsz | 8600.00 | 2020-04-07 | 2020-04-07 | 20200407 |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
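Scenario 1. Deduplication
1) UNION vs. UNION ALL: differences and how to choose
2) GROUP BY as an alternative to DISTINCT
1) UNION vs. UNION ALL: differences and how to choose
UNION ALL and UNION behave differently in SQL.
UNION ALL does not deduplicate the merged data.
UNION deduplicates the merged data.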
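To make the semantic difference concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Hive. The table name `t` and its reduced column set are assumptions made for illustration; SQLite shows only what the two operators return, not how Hive executes them.

```python
import sqlite3

# In-memory stand-in for two partitions of the table; SQLite is used only
# to illustrate the SQL semantics, not Hive's execution plan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (user_id INTEGER, user_name TEXT, pt TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, "szh", "20200405"), (2, "zyq", "20200405"),
     (1, "szh", "20200406"), (2, "zyq", "20200406")],
)

# One SELECT per partition, joined by whichever set operator is being tested.
branch = "SELECT user_id, user_name FROM t WHERE pt = '{}'"
query = branch.format("20200405") + " {} " + branch.format("20200406")

union_all_rows = conn.execute(query.format("UNION ALL")).fetchall()
union_rows = conn.execute(query.format("UNION")).fetchall()

print(len(union_all_rows))  # every row from both branches, duplicates kept
print(len(union_rows))      # identical rows collapsed into one
```

Both branches return the same two users, so UNION ALL yields four rows while UNION yields two.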
Example:
EXPLAIN
SELECT
company_name
,dep_name
,user_id
,user_name
FROM datacube_salary_org
WHERE pt = '20200405'
UNION ALL -- replace with UNION to get the second plan below
SELECT
company_name
,dep_name
,user_id
,user_name
FROM datacube_salary_org
WHERE pt = '20200406'
;
EXPLAIN output for UNION ALL
INFO : Starting task [Stage-3:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20200409232517_c76f15cf-20cf-415d-8086-123953fffc75); Time taken: 0.006 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: datacube_salary_org |
| filterExpr: (pt = '20200405') (type: boolean) |
| Statistics: Num rows: 1 Data size: 342 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), user_name (type: string) |
| outputColumnNames: _col0, _col1, _col2, _col3 |
| Statistics: Num rows: 1 Data size: 342 Basic stats: COMPLETE Column stats: NONE |
| Union |
| Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| TableScan |
| alias: datacube_salary_org |
| filterExpr: (pt = '20200406') (type: boolean) |
| Statistics: Num rows: 1 Data size: 412 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), user_name (type: string) |
| outputColumnNames: _col0, _col1, _col2, _col3 |
| Statistics: Num rows: 1 Data size: 412 Basic stats: COMPLETE Column stats: NONE |
| Union |
| Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
EXPLAIN output for UNION
INFO : Starting task [Stage-3:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20200409232436_8c1754b6-36ef-4846-a6db-719211b6b6a8); Time taken: 0.022 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: datacube_salary_org |
| filterExpr: (pt = '20200405') (type: boolean) |
| Statistics: Num rows: 1 Data size: 342 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), user_name (type: string) |
| outputColumnNames: _col0, _col1, _col2, _col3 |
| Statistics: Num rows: 1 Data size: 342 Basic stats: COMPLETE Column stats: NONE |
| Union |
| Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
| mode: hash |
| outputColumnNames: _col0, _col1, _col2, _col3 |
| Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
| sort order: ++++ |
| Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
| Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
| TableScan |
| alias: datacube_salary_org |
| filterExpr: (pt = '20200406') (type: boolean) |
| Statistics: Num rows: 1 Data size: 412 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), user_name (type: string) |
| outputColumnNames: _col0, _col1, _col2, _col3 |
| Statistics: Num rows: 1 Data size: 412 Basic stats: COMPLETE Column stats: NONE |
| Union |
| Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
| mode: hash |
| outputColumnNames: _col0, _col1, _col2, _col3 |
| Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
| sort order: ++++ |
| Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
| Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Group By Operator |
| keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: bigint), KEY._col3 (type: string) |
| mode: mergepartial |
| outputColumnNames: _col0, _col1, _col2, _col3 |
| Statistics: Num rows: 1 Data size: 377 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 1 Data size: 377 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
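Comparing the two EXPLAIN outputs, it is easy to see that UNION adds a reduce phase (a Group By Operator inside a Reduce Operator Tree), whereas the UNION ALL plan is map-only. The conclusion follows directly: when there is no deduplication requirement, use UNION ALL instead of UNION.
It is also sometimes claimed that running UNION ALL and then deduplicating with GROUP BY is more efficient than UNION. That is, rewriting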
SELECT
company_name
,dep_name
,user_id
,user_name
FROM datacube_salary_org
WHERE pt = '20200405'
UNION
SELECT
company_name
,dep_name
,user_id
,user_name
FROM datacube_salary_org
WHERE pt = '20200406'
;
as
SELECT
company_name
,dep_name
,user_id
,user_name
FROM
(
SELECT
company_name
,dep_name
,user_id
,user_name
FROM datacube_salary_org
WHERE pt = '20200405'
UNION ALL
SELECT
company_name
,dep_name
,user_id
,user_name
FROM datacube_salary_org
WHERE pt = '20200406'
) tmp
GROUP BY
company_name
,dep_name
,user_id
,user_name
;
I expected the two to perform the same. Here is the EXPLAIN output of the rewritten query:
INFO : Starting task [Stage-3:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20200410020255_57b936d7-ffde-41a6-af6e-3d0dc0d3a007); Time taken: 0.015 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: datacube_salary_org |
| filterExpr: (pt = '20200405') (type: boolean) |
| Statistics: Num rows: 1 Data size: 342 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), user_name (type: string) |
| outputColumnNames: _col0, _col1, _col2, _col3 |
| Statistics: Num rows: 1 Data size: 342 Basic stats: COMPLETE Column stats: NONE |
| Union |
| Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
| mode: hash |
| outputColumnNames: _col0, _col1, _col2, _col3 |
| Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
| sort order: ++++ |
| Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
| Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
| TableScan |
| alias: datacube_salary_org |
| filterExpr: (pt = '20200406') (type: boolean) |
| Statistics: Num rows: 1 Data size: 412 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), user_name (type: string) |
| outputColumnNames: _col0, _col1, _col2, _col3 |
| Statistics: Num rows: 1 Data size: 412 Basic stats: COMPLETE Column stats: NONE |
| Union |
| Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
| mode: hash |
| outputColumnNames: _col0, _col1, _col2, _col3 |
| Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
| sort order: ++++ |
| Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
| Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Group By Operator |
| keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: bigint), KEY._col3 (type: string) |
| mode: mergepartial |
| outputColumnNames: _col0, _col1, _col2, _col3 |
| Statistics: Num rows: 1 Data size: 377 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 1 Data size: 377 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
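The EXPLAIN outputs of the two approaches are identical, so the rewrite can be considered to bring no optimization.
Runtime comparison (small data volume)
UNION ALL followed by GROUP BY
Cumulative CPU time: 5.2 seconds (28.112 seconds elapsed)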
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO : 2020-04-10 02:06:37,784 Stage-1 map = 0%, reduce = 0%
INFO : 2020-04-10 02:06:44,970 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 1.67 sec
INFO : 2020-04-10 02:06:49,094 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.23 sec
INFO : 2020-04-10 02:06:55,291 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.2 sec
INFO : MapReduce Total cumulative CPU time: 5 seconds 200 msec
INFO : Ended Job = job_1586423165261_0005
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 5.2 sec HDFS Read: 21685 HDFS Write: 304 SUCCESS
INFO : Total MapReduce CPU Time Spent: 5 seconds 200 msec
INFO : Completed executing command(queryId=hive_20200410020629_c216e339-181a-4b52-8a59-ac527963e32b); Time taken: 28.112 seconds
INFO : OK
+---------------+-----------+----------+------------+
| company_name | dep_name | user_id | user_name |
+---------------+-----------+----------+------------+
| s.zh | engineer | 1 | szh |
| s.zh | engineer | 2 | zyq |
| s.zh | tester | 3 | gkm |
| x.qx | finance | 4 | pip |
| x.qx | finance | 5 | kip |
| x.qx | finance | 6 | zxxc |
| x.qx | kiccp | 7 | xsz |
+---------------+-----------+----------+------------+
7 rows selected (28.31 seconds)
UNION
Cumulative CPU time: 5.04 seconds (27.033 seconds elapsed)
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO : 2020-04-10 02:09:24,102 Stage-1 map = 0%, reduce = 0%
INFO : 2020-04-10 02:09:31,308 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 1.78 sec
INFO : 2020-04-10 02:09:35,427 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.39 sec
INFO : 2020-04-10 02:09:41,582 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.04 sec
INFO : MapReduce Total cumulative CPU time: 5 seconds 40 msec
INFO : Ended Job = job_1586423165261_0006
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 5.04 sec HDFS Read: 21813 HDFS Write: 304 SUCCESS
INFO : Total MapReduce CPU Time Spent: 5 seconds 40 msec
INFO : Completed executing command(queryId=hive_20200410020915_477574a0-4763-4717-8f9c-25d9f4b04706); Time taken: 27.033 seconds
INFO : OK
+-------------------+---------------+--------------+----------------+
| _u2.company_name | _u2.dep_name | _u2.user_id | _u2.user_name |
+-------------------+---------------+--------------+----------------+
| s.zh | engineer | 1 | szh |
| s.zh | engineer | 2 | zyq |
| s.zh | tester | 3 | gkm |
| x.qx | finance | 4 | pip |
| x.qx | finance | 5 | kip |
| x.qx | finance | 6 | zxxc |
| x.qx | kiccp | 7 | xsz |
+-------------------+---------------+--------------+----------------+
From the comparison above, the two approaches can be considered equivalent in performance.
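As a quick sanity check that the two forms also return the same rows, the following sketch loads the two sample partitions into SQLite (again only a stand-in for Hive, with untyped columns as a simplifying assumption) and compares the results of both queries.

```python
import sqlite3

cols = "company_name, dep_name, user_id, user_name"
people = [("s.zh", "engineer", 1, "szh"), ("s.zh", "engineer", 2, "zyq"),
          ("s.zh", "tester", 3, "gkm"), ("x.qx", "finance", 4, "pip"),
          ("x.qx", "finance", 5, "kip"), ("x.qx", "finance", 6, "zxxc"),
          ("x.qx", "kiccp", 7, "xsz")]

# Load the same seven rows into partitions 20200405 and 20200406.
conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE datacube_salary_org ({cols}, pt)")
for pt in ("20200405", "20200406"):
    conn.executemany("INSERT INTO datacube_salary_org VALUES (?, ?, ?, ?, ?)",
                     [row + (pt,) for row in people])

def branch(pt):
    """One SELECT over a single partition."""
    return f"SELECT {cols} FROM datacube_salary_org WHERE pt = '{pt}'"

union_sql = f"{branch('20200405')} UNION {branch('20200406')}"
group_by_sql = (f"SELECT {cols} FROM ({branch('20200405')} UNION ALL "
                f"{branch('20200406')}) tmp GROUP BY {cols}")

# Sort both results because neither query guarantees an output order.
union_rows = sorted(conn.execute(union_sql).fetchall())
group_by_rows = sorted(conn.execute(group_by_sql).fetchall())
print(union_rows == group_by_rows, len(union_rows))
```

Both queries return the same seven distinct rows, which matches the result tables in the job output above.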
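2) GROUP BY as an alternative to DISTINCT
In real deduplication scenarios, DISTINCT is the usual first choice.
In practice, however, GROUP BY is often more efficient. The experiment below compares the two.
First, the less efficient COUNT(DISTINCT) approach: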
SQL
SELECT
COUNT(DISTINCT company_name, dep_name, user_id)
FROM datacube_salary_org
;
EXPLAIN output
INFO : Starting task [Stage-2:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20200410023914_3ed9bbfc-9b01-4351-b559-a797b8ae2c85); Time taken: 0.007 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: datacube_salary_org |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint) |
| outputColumnNames: company_name, dep_name, user_id |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: count(DISTINCT company_name, dep_name, user_id) |
| keys: company_name (type: string), dep_name (type: string), user_id (type: bigint) |
| mode: hash |
| outputColumnNames: _col0, _col1, _col2, _col3 |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint) |
| sort order: +++ |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(DISTINCT KEY._col0:0._col0, KEY._col0:0._col1, KEY._col0:0._col2) |
| mode: mergepartial |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
Runtime on a small data set
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO : 2020-04-10 03:06:39,390 Stage-1 map = 0%, reduce = 0%
INFO : 2020-04-10 03:06:46,735 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.94 sec
INFO : 2020-04-10 03:06:52,969 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.72 sec
INFO : MapReduce Total cumulative CPU time: 4 seconds 720 msec
INFO : Ended Job = job_1586423165261_0010
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.72 sec HDFS Read: 12863 HDFS Write: 101 SUCCESS
INFO : Total MapReduce CPU Time Spent: 4 seconds 720 msec
INFO : Completed executing command(queryId=hive_20200410030629_7b6df91e-a78a-4bc1-b558-abbb8d506596); Time taken: 24.023 seconds
INFO : OK
+------+
| _c0 |
+------+
| 9 |
+------+
====================
Next, the more efficient GROUP BY approach:
SQL
SELECT COUNT(1)
FROM (
SELECT
company_name
,dep_name
,user_id
FROM datacube_salary_org
GROUP BY
company_name
,dep_name
,user_id
) AS tmp
;
EXPLAIN output
INFO : Starting task [Stage-3:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20200410024128_fc60e84d-be8d-4b4d-aad8-a53466fa1559); Time taken: 0.005 seconds
INFO : OK
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-2 depends on stages: Stage-1 |
| Stage-0 depends on stages: Stage-2 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: datacube_salary_org |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint) |
| outputColumnNames: company_name, dep_name, user_id |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: company_name (type: string), dep_name (type: string), user_id (type: bigint) |
| mode: hash |
| outputColumnNames: _col0, _col1, _col2 |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint) |
| sort order: +++ |
| Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: bigint) |
| Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Group By Operator |
| keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: bigint) |
| mode: mergepartial |
| outputColumnNames: _col0, _col1, _col2 |
| Statistics: Num rows: 3 Data size: 145 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| Statistics: Num rows: 3 Data size: 145 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: count(1) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Reduce Output Operator |
| sort order: |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col0 (type: bigint) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(VALUE._col0) |
| mode: mergepartial |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
Runtime on a small data set
INFO : Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
INFO : 2020-04-10 03:09:34,476 Stage-2 map = 0%, reduce = 0%
INFO : 2020-04-10 03:09:40,662 Stage-2 map = 100%, reduce = 0%
INFO : 2020-04-10 03:09:47,850 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 4.3 sec
INFO : MapReduce Total cumulative CPU time: 4 seconds 300 msec
INFO : Ended Job = job_1586423165261_0014
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.11 sec HDFS Read: 11827 HDFS Write: 114 SUCCESS
INFO : Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 4.3 sec HDFS Read: 5111 HDFS Write: 101 SUCCESS
INFO : Total MapReduce CPU Time Spent: 8 seconds 410 msec
INFO : Completed executing command(queryId=hive_20200410030859_f89c708b-e76a-44fc-9e99-a6f9a404200f); Time taken: 49.78 seconds
INFO : OK
+------+
| _c0 |
+------+
| 9 |
+------+
Optimization rationale
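Let's look at why, on large data sets, GROUP BY followed by COUNT is more efficient than a direct COUNT(DISTINCT ...).
With COUNT(DISTINCT ...), the columns involved are combined into the key and passed to the reducer, which is the count(DISTINCT KEY._col0:0._col0, KEY._col0:0._col1, KEY._col0:0._col2) seen in the plan. The full sort and deduplication therefore have to be completed inside a single reducer.
With GROUP BY first and COUNT second, the GROUP BY can distribute different keys across multiple reducers and complete the deduplication during the GROUP BY stage. The data is not funneled into one reducer, so the deduplication benefits from distributed execution and is more efficient. In the following COUNT stage, the keys that survived the GROUP BY deduplication are simply counted.
Therefore, with a large amount of data, GROUP BY followed by COUNT is more efficient than COUNT(DISTINCT).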
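The reasoning above can be sketched as a toy simulation in plain Python. This is not how Hive is implemented: the function names, the reducer count of 3, and the CRC32-based partitioner are all assumptions chosen for illustration. The point is only that hash partitioning sends every copy of a key to the same reducer, so each reducer can deduplicate its own share independently.

```python
import zlib
from collections import defaultdict

# (company_name, dep_name, user_id) for every row in the three sample partitions.
day = [("s.zh", "engineer", 1), ("s.zh", "engineer", 2), ("s.zh", "tester", 3),
       ("x.qx", "finance", 4), ("x.qx", "finance", 5), ("x.qx", "finance", 6),
       ("x.qx", "kiccp", 7)]
# Partition 20200407 spells the department "enginer" for users 1 and 2.
day_0407 = [("s.zh", "enginer", 1), ("s.zh", "enginer", 2)] + day[2:]
rows = day + day + day_0407

def count_distinct_single_reducer(rows):
    """COUNT(DISTINCT ...): one reducer receives and deduplicates every key."""
    return len(set(rows))

def group_by_then_count(rows, num_reducers=3):
    """GROUP BY then COUNT: keys are hash-partitioned so each reducer
    deduplicates only its own share; a second stage sums the partial counts."""
    buckets = defaultdict(set)
    for key in rows:
        # Hypothetical partitioner: a deterministic hash of the key.
        reducer = zlib.crc32(repr(key).encode()) % num_reducers
        buckets[reducer].add(key)                        # stage 1: distributed dedup
    return sum(len(keys) for keys in buckets.values())   # stage 2: COUNT

print(count_distinct_single_reducer(rows), group_by_then_count(rows))
```

Both functions return 9 for the sample data, the same value as the two Hive queries. The difference is that the first keeps the whole key set in one place, while the second splits it across reducers.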
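Now let's compare the results of the two runs above.
EXPLAIN: COUNT(DISTINCT) has one stage fewer than GROUP BY followed by COUNT, because the GROUP BY is already one MapReduce stage and the COUNT is another.
Runtime: GROUP BY followed by COUNT shows no advantage here. Every stage has to request resources, run, and release them again, and all of that costs time. On a small data set, the extra time spent by GROUP BY followed by COUNT therefore goes mainly into requesting resources and creating containers.
In total elapsed time, COUNT(DISTINCT) took 24.023 seconds versus 49.78 seconds for GROUP BY followed by COUNT.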
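The reason for this result is again the size of the data set. It is a trade-off between the time cost of a global sort in a single reducer and the cost of acquiring resources for multiple job stages.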
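A rough back-of-the-envelope reading of the job logs supports this. The figures below are copied from the two runs above. Treating "elapsed minus cumulative CPU" as overhead is an assumption, and only a loose one, since cumulative CPU time is summed over tasks rather than measured on the wall clock.

```python
# Figures copied from the job logs above (seconds).
runs = {
    "COUNT(DISTINCT)": {"stages": 1, "elapsed": 24.023, "cpu": 4.72},
    "GROUP BY + COUNT": {"stages": 2, "elapsed": 49.78, "cpu": 8.41},
}

for name, r in runs.items():
    r["per_stage"] = r["elapsed"] / r["stages"]   # wall-clock time per MR stage
    r["non_cpu"] = r["elapsed"] - r["cpu"]        # time not spent on CPU work
    print(f"{name}: {r['per_stage']:.2f} s per stage, "
          f"{r['non_cpu']:.2f} s outside CPU work")
```

Each stage costs roughly 24 to 25 seconds of elapsed time no matter how little data it processes, so on this data set the second stage roughly doubles the total.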
So choose between the two according to the actual data volume.