Difference between bucket mapjoin and SMB (Sort-Merge-Bucket) join in Hive

1 Bucket mapjoin

1.1 Conditions

1) set hive.optimize.bucketmapjoin = true;
2) The number of buckets in one table is an integer multiple of the number of buckets in the other table
3) Bucket column == join column
4) The join must run as a map join (bucket mapjoin only applies in map join scenarios)
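The conditions above can be sketched as follows. This is a minimal illustration, not a tuned setup: the table names `orders_big` and `users_small`, their columns, and the bucket counts are hypothetical and do not come from the original notes.

```sql
-- Hypothetical tables; names, columns and bucket counts are illustrative only.
CREATE TABLE users_small (user_id STRING, name STRING)
CLUSTERED BY (user_id) INTO 4 BUCKETS;

CREATE TABLE orders_big (order_id STRING, user_id STRING)
CLUSTERED BY (user_id) INTO 8 BUCKETS;   -- 8 is an integer multiple of 4

-- Condition 1: enable the optimization.
SET hive.optimize.bucketmapjoin = true;

-- Conditions 3 and 4: join on the bucket column, as a map join.
-- The MAPJOIN hint asks Hive to load users_small on the map side.
SELECT /*+ MAPJOIN(u) */ o.order_id, u.name
FROM orders_big o
JOIN users_small u ON o.user_id = u.user_id;
```

With these bucket counts, a mapper reading bucket i of `orders_big` only needs bucket (i mod 4) of `users_small` rather than the whole small table. Depending on the Hive version, the hint may be ignored in favor of automatic map join conversion (hive.auto.convert.join), so checking the plan with EXPLAIN is advisable.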

1.2 Attention

1) If a table is not bucketed, the optimization does not apply and Hive just does a normal join.

2 SMB join (an optimization of bucket mapjoin)

2.1 Conditions

1) Set the following parameters:
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
2) Number of buckets in the small table == number of buckets in the large table
3) Bucket column == join column == sort column
4) The bucket mapjoin conditions must also be met, since SMB join builds on bucket mapjoin
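A minimal sketch of an SMB join under these conditions is shown below. The tables `smb_left` and `smb_right` and their columns are hypothetical names chosen for illustration. Both tables are bucketed and sorted on the join column and have the same number of buckets.

```sql
-- Hypothetical tables; both are bucketed AND sorted on the join column,
-- with the same number of buckets (conditions 2 and 3).
CREATE TABLE smb_left (mid STRING, age_id STRING)
CLUSTERED BY (mid) SORTED BY (mid) INTO 500 BUCKETS;

CREATE TABLE smb_right (mid STRING, city STRING)
CLUSTERED BY (mid) SORTED BY (mid) INTO 500 BUCKETS;

-- Condition 1: the three settings listed above.
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

-- Join on the bucket/sort column; matching buckets are merged in sorted order.
SELECT /*+ MAPJOIN(r) */ l.mid, l.age_id, r.city
FROM smb_left l
JOIN smb_right r ON l.mid = r.mid;
```

Because each pair of matching buckets is already sorted on `mid`, the join can be done as a merge over the two buckets instead of building an in-memory hash table. Running EXPLAIN on the query is a reasonable way to confirm that the plan actually uses a sorted merge bucket join.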

2.2 Attention

Hive does not check whether the data in the two joined tables is actually bucketed and sorted. Users must ensure this themselves, otherwise the join results may be incorrect. There are two ways to do so:

1) Set hive.enforce.sorting to true.
2) Manually generate qualifying data by using distribute by c1 sort by c1, or cluster by c1, in the SQL that loads the table.
In either case, the table must be declared CLUSTERED BY and SORTED BY when it is created, as follows
create table test_smb_2(mid string, age_id string)
CLUSTERED BY(mid) SORTED BY (mid) INTO 500 BUCKETS;
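
The two ways of loading correctly bucketed and sorted data into test_smb_2 might look like the sketch below. The source table name `source_table` is hypothetical, and hive.enforce.bucketing is an assumption added here (it is the companion setting to hive.enforce.sorting in older Hive releases; newer releases enforce both automatically).

```sql
-- Option 1: let Hive enforce bucketing and sorting on insert (older Hive versions).
set hive.enforce.bucketing = true;   -- assumption: not mentioned in the notes above
set hive.enforce.sorting = true;
INSERT OVERWRITE TABLE test_smb_2
SELECT mid, age_id FROM source_table;   -- source_table is a hypothetical name

-- Option 2: shape the data manually.
-- The number of reducers should match the number of buckets (500 here).
set mapred.reduce.tasks = 500;
INSERT OVERWRITE TABLE test_smb_2
SELECT mid, age_id FROM source_table
CLUSTER BY mid;   -- shorthand for DISTRIBUTE BY mid SORT BY mid
```

Only one of the two options is needed for a given load. With option 2, a mismatch between the reducer count and the declared bucket count would leave the files out of line with the table definition, which is exactly the situation Hive does not detect.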

Referenced from
https://stackoverflow.com/questions/20199077/hive-efficient-join-of-two-tables
https://cwiki.apache.org/confluence/download/attachments/27362054/Hive+Summit+2011-join.pdf
