Example of mapjoin optimization in Hive

1 Basic information

Three tables: one fact table and two dimension tables.
Fact table: test_fact (mid string, sex_id string, age_id string)
Dimension table: dim_user_demography_age (age_id string, age_name string)
Dimension table: dim_user_demography_sex (sex_id string, sex_name string)

Test SQL:

select mid,sex_name,age_name from test_fact   f
join dim_user_demography_age d1 on d1.age_id=f.age_id
join dim_user_demography_sex d2 on d2.sex_id=f.sex_id ;

d1 and d2 are dimension tables that hold only a few rows each; their data volume is under 10k.

2 Parameters and descriptions used in the test

Parameters used in the test:

set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;

2.1 hive.auto.convert.join

When this parameter is enabled, hive automatically converts an ordinary join into a mapjoin based on table size.

2.2 hive.auto.convert.join.noconditionaltask

When this parameter is enabled, hive converts an ordinary join directly into a mapjoin, without generating a conditional task, when the tables are small enough. For an n-way join, if the combined data size of n-1 of the tables or partitions is below a threshold (hive.auto.convert.join.noconditionaltask.size), the ordinary join is converted into a mapjoin.
The test SQL can be regarded as a 3-way join:

select mid,sex_name,age_name from test_fact   f
join dim_user_demography_age d1 on d1.age_id=f.age_id
join dim_user_demography_sex d2 on d2.sex_id=f.sex_id ;

2.3 hive.auto.convert.join.noconditionaltask.size

The default value is 10000000 bytes (about 10 MB). Taking the SQL above as an example, the combined data size of the two small tables, d1 and d2, must be below this threshold. The setting only takes effect when hive.auto.convert.join.noconditionaltask=true.
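
The size condition can be sketched roughly as follows. This is a simplified illustration in Python, not Hive's planner code: the function name can_convert_directly and the table sizes are hypothetical, and the sketch assumes the largest table is the one that gets streamed while all the others must fit under the threshold together.

```python
# Simplified model of the n-1 tables size check; not Hive's actual planner code.
NOCONDITIONALTASK_SIZE = 10_000_000  # bytes, the value set in this test


def can_convert_directly(table_sizes, threshold=NOCONDITIONALTASK_SIZE):
    """Return True if every table except the largest fits under the threshold combined."""
    if len(table_sizes) < 2:
        return False  # nothing to join
    sizes = list(table_sizes.values())
    small_total = sum(sizes) - max(sizes)  # combined size of the n-1 smaller tables
    return small_total < threshold


# Hypothetical sizes in bytes, chosen only for illustration.
sizes = {
    "test_fact": 5_000_000_000,
    "dim_user_demography_age": 8_000,
    "dim_user_demography_sex": 6_000,
}
print(can_convert_directly(sizes))
```

With two tiny dimension tables the check passes, which matches the single-job behaviour described for this setting.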

3 Test Instructions

The following cases show how turning the parameter on and off affects the job.

3.1 Enable hive.auto.convert.join.noconditionaltask

With this parameter enabled, the SQL runs as a single job.
As the log below shows, the dimension tables d1 and d2 are each converted into a hash table locally and uploaded to HDFS. The MR task then uses both hash tables to perform the mapjoin. This is the fastest variant.

Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/xitong/software/hadoop-2.7.2U7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/xitong/software/spark-1.6.0-U22-bin-2.7.2U6/lib/spark-assembly-1.6.0-U22-hadoop2.7.2U6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2018-04-24 16:17:09     Starting to launch local task to process map join;      maximum memory = 514523136
2018-04-24 16:17:10     Dump the side-table for tag: 1 with group count: 2 into file: file:/tmp/hdp-xxx/7cc4814c-60ef-4a3d-8b2b-402bb9cc317f/hive_2018-04-24_16-17-03_272_6992344947047046417-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile01--.hashtable
2018-04-24 16:17:10     Uploaded 1 File to: file:/tmp/hdp-xxx/7cc4814c-60ef-4a3d-8b2b-402bb9cc317f/hive_2018-04-24_16-17-03_272_6992344947047046417-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile01--.hashtable (308 bytes)
2018-04-24 16:17:10     Dump the side-table for tag: 1 with group count: 5 into file: file:/tmp/hdp-xxx/7cc4814c-60ef-4a3d-8b2b-402bb9cc317f/hive_2018-04-24_16-17-03_272_6992344947047046417-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile11--.hashtable
2018-04-24 16:17:10     Uploaded 1 File to: file:/tmp/hdp-xxx/7cc4814c-60ef-4a3d-8b2b-402bb9cc317f/hive_2018-04-24_16-17-03_272_6992344947047046417-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile11--.hashtable (387 bytes)
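
The single-job behaviour above can be pictured with a small Python sketch. It is only an illustration of the idea, not Hive's implementation: the sample rows and the helper names build_hash_table and single_pass_mapjoin are made up. Both small tables are loaded into in-memory hash tables, and the fact table is streamed once, probing both.

```python
# Toy rows; all values are invented for illustration.
fact = [("m1", "1", "10"), ("m2", "2", "20"), ("m3", "1", "99")]  # (mid, sex_id, age_id)
dim_age = [("10", "18-24"), ("20", "25-34")]  # (age_id, age_name)
dim_sex = [("1", "male"), ("2", "female")]    # (sex_id, sex_name)


def build_hash_table(rows):
    """Map each join key to the list of values that share it."""
    table = {}
    for key, value in rows:
        table.setdefault(key, []).append(value)
    return table


def single_pass_mapjoin(fact_rows, age_rows, sex_rows):
    """Inner-join the fact rows against both dimensions in one pass."""
    age_ht = build_hash_table(age_rows)
    sex_ht = build_hash_table(sex_rows)
    out = []
    for mid, sex_id, age_id in fact_rows:
        # A fact row is emitted only if it matches in both hash tables.
        for age_name in age_ht.get(age_id, []):
            for sex_name in sex_ht.get(sex_id, []):
                out.append((mid, sex_name, age_name))
    return out


result = single_pass_mapjoin(fact, dim_age, dim_sex)
print(result)
```

The row with age_id 99 has no match in the age dimension, so the inner join drops it.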

3.2 Do not enable hive.auto.convert.join.noconditionaltask

With the parameter disabled, the log below shows two separate mapjoin passes, and performance is worse:
1) Build a hash table for d1 locally and upload it to HDFS
2) Mapjoin f with d1 and keep the intermediate result
3) Build a hash table for d2 locally and upload it to HDFS
4) Mapjoin the intermediate result with the hash table of d2
5) Return the query results

hive> select mid,sex_name,age_name from test_fact   f
    > join dim_user_demography_age d1 on d1.age_id=f.age_id
    > join dim_user_demography_sex d2 on d2.sex_id=f.sex_id ;
Query ID = hdp-ads-audit_20180424161919_0663afcb-af1f-4b0f-b17b-23117c257969
Total jobs = 5
Stage-13 is selected by condition resolver.
Stage-1 is filtered out by condition resolver.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/xitong/software/hadoop-2.7.2U7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/xitong/software/spark-1.6.0-U22-bin-2.7.2U6/lib/spark-assembly-1.6.0-U22-hadoop2.7.2U6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2018-04-24 16:19:25     Starting to launch local task to process map join;      maximum memory = 514523136
2018-04-24 16:19:26     Dump the side-table for tag: 1 with group count: 5 into file: file:/tmp/dddddd/1363bd2b-f5bf-44f8-a3cb-c9427c70cbab/hive_2018-04-24_16-19-19_514_7098521400798527233-1/-local-10008/HashTable-Stage-8/MapJoin-mapfile21--.hashtable
2018-04-24 16:19:26     Uploaded 1 File to: file:/tmp/dddddd/1363bd2b-f5bf-44f8-a3cb-c9427c70cbab/hive_2018-04-24_16-19-19_514_7098521400798527233-1/-local-10008/HashTable-Stage-8/MapJoin-mapfile21--.hashtable (387 bytes)
2018-04-24 16:19:26     End of local task; Time Taken: 0.984 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1524125616287_138024, Tracking URL = http://mhdp12.namenodetest:8888/proxy/application_1524125616287_138024/
Kill Command = /usr/bin/hadoop/software/yarn//bin/hadoop job  -kill job_1524125616287_138024
Hadoop job information for Stage-8: number of mappers: 1; number of reducers: 0
2018-04-24 16:19:40,865 Stage-8 map = 0%,  reduce = 0%
2018-04-24 16:19:59,570 Stage-8 map = 100%,  reduce = 0%, Cumulative CPU 15.58 sec
MapReduce Total cumulative CPU time: 15 seconds 580 msec
Ended Job = job_1524125616287_138024
Moved to trash: /home/dddddd/hive/scratchdir/dddddd/1363bd2b-f5bf-44f8-a3cb-c9427c70cbab/hive_2018-04-24_16-19-19_514_7098521400798527233-1/-mr-10013/bd115834-c22a-4d5e-b960-d09ddaf5a361/map.xml
Moved to trash: /home/dddddd/hive/scratchdir/dddddd/1363bd2b-f5bf-44f8-a3cb-c9427c70cbab/hive_2018-04-24_16-19-19_514_7098521400798527233-1/_task_tmp.-mr-10003
Stage-11 is selected by condition resolver.
Stage-12 is filtered out by condition resolver.
Stage-2 is filtered out by condition resolver.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/xitong/software/hadoop-2.7.2U7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/xitong/software/spark-1.6.0-U22-bin-2.7.2U6/lib/spark-assembly-1.6.0-U22-hadoop2.7.2U6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2018-04-24 16:20:05     Starting to launch local task to process map join;      maximum memory = 514523136
2018-04-24 16:20:06     Dump the side-table for tag: 1 with group count: 2 into file: file:/tmp/dddddd/1363bd2b-f5bf-44f8-a3cb-c9427c70cbab/hive_2018-04-24_16-19-19_514_7098521400798527233-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile01--.hashtable
2018-04-24 16:20:06     Uploaded 1 File to: file:/tmp/dddddd/1363bd2b-f5bf-44f8-a3cb-c9427c70cbab/hive_2018-04-24_16-19-19_514_7098521400798527233-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile01--.hashtable (308 bytes)
2018-04-24 16:20:06     End of local task; Time Taken: 0.95 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 4 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1524125616287_138033, Tracking URL = http://mhdp12.namenodetest:8888/proxy/application_1524125616287_138033/
Kill Command = /usr/bin/hadoop/software/yarn//bin/hadoop job  -kill job_1524125616287_138033
Interrupting... Be patient, this might take some time.
Press Ctrl+C again to kill JVM
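
The two-pass flow in this log can be sketched in Python as two chained map-only joins with an intermediate result between them. This is a toy model under assumed data, not Hive's execution engine: the helper mapjoin and the sample rows are hypothetical.

```python
def mapjoin(big_rows, small_rows, key):
    """One map-only inner join: hash the small side, stream the big side."""
    hash_table = {}
    for row in small_rows:
        hash_table.setdefault(row[key], []).append(row)
    joined = []
    for big in big_rows:
        for small in hash_table.get(big[key], []):
            joined.append({**big, **small})  # merge the matching rows
    return joined


# Toy rows; all values are invented for illustration.
fact = [
    {"mid": "m1", "sex_id": "1", "age_id": "10"},
    {"mid": "m2", "sex_id": "2", "age_id": "20"},
]
d1 = [{"age_id": "10", "age_name": "18-24"}, {"age_id": "20", "age_name": "25-34"}]
d2 = [{"sex_id": "1", "sex_name": "male"}, {"sex_id": "2", "sex_name": "female"}]

intermediate = mapjoin(fact, d1, "age_id")    # first pass: f joined with d1
final = mapjoin(intermediate, d2, "sex_id")   # second pass: intermediate joined with d2
rows = [(r["mid"], r["sex_name"], r["age_name"]) for r in final]
print(rows)
```

The output rows are the same as a single-pass join would give; the difference is that the intermediate result has to be produced and read back, and each pass is launched as its own job.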
