Difference between left join and left semi join

Article from: https://www.cnblogs.com/zzhangyuhang/p/9792794.html

// Background: in MaxCompute, an EXISTS subquery cannot access external table data, and rewriting it as a JOIN makes the computation take too long.

1. What they have in common

Both are join methods in Hive. JOIN ON is a common join (shuffle join / reduce join), while LEFT SEMI JOIN is a variant of the map join (broadcast join). The names already hint at the difference in how they are implemented.

2. Differences

(1) Semi join, also called a half join, is a method borrowed from distributed databases. Its motivation: in a reduce-side join the volume of data transferred across machines is very large, which becomes the bottleneck of the join. If data that will not take part in the join can be filtered out on the map side, network I/O drops considerably and execution becomes more efficient.
The implementation is simple. Pick the small table, say File1, extract the keys that take part in the join, and save them to a file, File3. File3 is usually small enough to fit in memory. In the map phase, use DistributedCache to copy File3 to every TaskTracker, then filter out the records of File2 whose keys are not in File3. The remaining reduce-phase work is the same as in a reduce-side join.
Because older versions of Hive have no IN / EXISTS clause (newer versions support it), clauses of that kind have to be rewritten as LEFT SEMI JOIN. LEFT SEMI JOIN passes only the table's join keys to the map phase: if the keys are small enough a map join is still performed, otherwise a common join is used. For how the common join (shuffle join / reduce join) works, see the reference at the end of the article.
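
The map-side filtering step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Hive's actual implementation: the lists `file1` and `file2` and the helper names `extract_join_keys` and `map_side_filter` are hypothetical stand-ins for the files named in the text, and the broadcast via DistributedCache is reduced to passing a set around.

```python
# Hypothetical in-memory stand-ins for File1 (small) and File2 (large).
# Each record is (join_key, payload).
file1 = [(1, "a"), (2, "b"), (2, "b-dup")]
file2 = [(1, "x"), (2, "y"), (3, "z"), (4, "w")]


def extract_join_keys(small_table):
    """Build File3: the distinct join keys of the small table."""
    return {key for key, _ in small_table}


def map_side_filter(large_table, key_set):
    """Map phase: drop records of the large table whose key is not in File3."""
    return [record for record in large_table if record[0] in key_set]


file3 = extract_join_keys(file1)            # small enough to hold in memory
survivors = map_side_filter(file2, file3)   # only these go on to the shuffle
print(survivors)
```

Only the surviving records would then be shuffled to the reducers, which is where the saving in network I/O comes from.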

(2) In a LEFT SEMI JOIN, filter conditions on the right table can only be set in the ON clause; filtering on it in the WHERE clause, the SELECT clause, or anywhere else does not work.

(3) Duplicate keys in the right table are handled differently. Because LEFT SEMI JOIN has IN (keySet) semantics, when the right table contains duplicate records the left-table row moves on after the first match, whereas JOIN ON keeps traversing all matches.

The end result is a difference in performance as well as in the join result.
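
To make the duplicate-key behavior concrete, here is a minimal Python sketch of the two semantics. The rows and the function names `inner_join` and `left_semi_join` are made up for illustration, and nested loops are used only for clarity; they say nothing about how Hive executes either join.

```python
# Hypothetical rows: (join_key, payload). Key "k1" is duplicated on the right.
left = [("k1", "L1"), ("k2", "L2"), ("k3", "L3")]
right = [("k1", "R1"), ("k1", "R1-dup"), ("k2", "R2")]


def inner_join(left_rows, right_rows):
    """JOIN ON: emit one output row for every matching (left, right) pair."""
    out = []
    for lkey, lval in left_rows:
        for rkey, rval in right_rows:
            if lkey == rkey:
                out.append((lkey, lval, rval))
    return out


def left_semi_join(left_rows, right_rows):
    """LEFT SEMI JOIN: emit each left row at most once, left columns only."""
    out = []
    for lkey, lval in left_rows:
        # any() stops at the first matching right row.
        if any(lkey == rkey for rkey, _ in right_rows):
            out.append((lkey, lval))
    return out


joined = inner_join(left, right)
semi = left_semi_join(left, right)
print(len(joined), len(semi))
```

With these rows the inner join yields three rows (the "k1" row appears twice) while the semi join yields two.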

(4) In a LEFT SEMI JOIN the final SELECT may only reference columns of the left table, because only the join key of the right table takes part in the join; with JOIN ON, by default the whole relation takes part.

3. The "pit" in the two joins

Since all joins in Hive are equi-joins, there are two ways of writing a JOIN that in theory achieve the same effect. In practice, however, differences in the data of the sub-table can make the results differ.

Version 1: left semi join

select
    a.bucket_id,
    a.search_type,
    a.level1,
    a.name1,
    a.level2,
    a.name2,
    cast(a.alipay_fee as double) as zhuliu_alipay,
    cast(0 as double) as total_alipay
from tmall_data_fdi_search_zhuliu_alipay_cocerage_bucket_1 a
left semi join tmall_data_fdi_dim_main_auc b
    on (a.level2 = b.cat_id2
        and a.brand_id = b.brand_id
        and b.cat_id2 > 0
        and b.brand_id > 0
        and b.max_price = 0
    )

The result is 3121.

Version 2: join on

select
    a.bucket_id,
    a.search_type,
    a.level1,
    a.name1,
    a.level2,
    a.name2,
    cast(a.alipay_fee as double) as zhuliu_alipay,
    cast(0 as double) as total_alipay
from tmall_data_fdi_search_zhuliu_alipay_cocerage_bucket_1 a
join tmall_data_fdi_dim_main_auc b
    on (a.level2 = b.cat_id2
        and a.brand_id = b.brand_id)
where b.cat_id2 > 0
    and b.brand_id > 0
    and b.max_price = 0

The result is 3142.

The two versions actually produce different values. I had always assumed the two ways of writing the query were equivalent, yet the counts differ.
After digging through it layer by layer, the cause turned out to be duplicate data in the sub-table (tmall_data_fdi_dim_main_auc). With JOIN ON, a row of table A that matches a duplicated key in table B is joined to both B records, because both satisfy the ON condition;
with LEFT SEMI JOIN, as soon as a record of table A finds a matching record in table B it is returned and table B is not searched further, so duplicates in table B do not produce multiple output records.

JOIN ON and LEFT SEMI JOIN are equivalent in most cases, but when the right table contains duplicate records as above, JOIN ON produces duplicate rows and the results differ. It is best to understand how both work before using them, to avoid falling into this "pit".
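
One way to check whether a count gap like this comes from duplicate keys is to reproduce it on a tiny dataset. The sketch below uses Python's built-in `sqlite3` with made-up tables `a` and `b` and made-up rows. SQLite has no LEFT SEMI JOIN, so an EXISTS subquery stands in for it here as an assumption about equivalent semantics. The sketch also deduplicates the right table first, which is one possible way (not something the original article prescribes) to make the plain join agree with the semi join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (level2 INTEGER, brand_id INTEGER);
    CREATE TABLE b (cat_id2 INTEGER, brand_id INTEGER, max_price INTEGER);
    INSERT INTO a VALUES (10, 1), (20, 2), (30, 3);
    -- (10, 1, 0) appears twice in b: this duplicate causes the gap.
    INSERT INTO b VALUES (10, 1, 0), (10, 1, 0), (20, 2, 0), (30, 3, 5);
""")


def count(sql):
    """Run a COUNT(*) query and return the single number."""
    return conn.execute(sql).fetchone()[0]


# Semi-join semantics, emulated with EXISTS.
semi_count = count("""
    SELECT COUNT(*) FROM a
    WHERE EXISTS (SELECT 1 FROM b
                  WHERE a.level2 = b.cat_id2
                    AND a.brand_id = b.brand_id
                    AND b.max_price = 0)
""")

# Plain inner join: every matching pair becomes a row.
join_count = count("""
    SELECT COUNT(*) FROM a
    JOIN b ON a.level2 = b.cat_id2 AND a.brand_id = b.brand_id
    WHERE b.max_price = 0
""")

# Diagnostic: which join keys occur more than once in the right table?
duplicate_keys = conn.execute("""
    SELECT cat_id2, brand_id, COUNT(*) FROM b
    WHERE max_price = 0
    GROUP BY cat_id2, brand_id
    HAVING COUNT(*) > 1
""").fetchall()

# Inner join against a deduplicated right table.
dedup_join_count = count("""
    SELECT COUNT(*) FROM a
    JOIN (SELECT DISTINCT cat_id2, brand_id FROM b WHERE max_price = 0) d
      ON a.level2 = d.cat_id2 AND a.brand_id = d.brand_id
""")

print(semi_count, join_count, duplicate_keys, dedup_join_count)
```

On this toy data the plain join counts one row more than the EXISTS form, the diagnostic query points at the duplicated key, and the deduplicated join matches the EXISTS count again.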

Origin www.cnblogs.com/maple-q/p/12518347.html