MaxCompute JOIN Optimization Summary

Abstract:  Join is the most basic syntax in MaxCompute, but data volume and skew make it prone to performance problems. In general, join problems fall into two categories: data skew, where rows sharing a hot key are all sent to one instance and make it run far longer than the others; and sheer data volume, where two very large tables are joined.

Original address: http://click.aliyun.com/m/43804/

Join is the most basic syntax in MaxCompute, but due to data volume and skew issues, performance problems are very likely to occur. In general, the problems caused by joins fall into two categories:

  • Data skew problem: a join distributes rows with the same key to the same instance for processing. If one key carries a particularly large amount of data, that instance runs much longer than the others. This is what we commonly call data skew, and it is the main culprit behind join performance problems;
  • Data volume problem: there is basically no hot key between the two joined tables, but the sheer size of both tables still hurts performance, for example commodity or inventory tables with billions of records;

       Although MaxCompute provides some general optimization algorithms, it is often more precise and effective to solve performance problems from a business perspective. A lot of experience on MaxCompute SQL optimization has accumulated in the Yunqi community; this article mainly summarizes the performance problems caused by join and their solutions.

Joining on keys of different data types

Example

       Consider browsing logs (IPV logs) joined with the commodity table on the commodity id. Suppose the commodity id field in the log table is of string type while the commodity id in the commodity table is bigint. During the join, both keys are converted to double for comparison. The problem is that the commodity id column in the log table contains a lot of non-numeric dirty data: after conversion to double, those values become NULL or are truncated, which causes data skew during the join and can even produce wrong results.

Solution

      Convert the data type explicitly in the join condition. In this case, the bigint key is usually converted to string:

select a.* 
from ipv_log_table a 
left outer join item_table b 
on a.item_id = cast(b.item_id as string)

       Think about it: what if you instead cast the string key to bigint? What if most of the commodity ids in the IPV log table are invalid values (such as 0)? And what if there are no invalid values in the IPV log table but there is a hot key? The following examples answer these questions.

Small table joins large table

         When one side of a join is a small table, generally within 100 MB, you can use mapjoin to avoid the long tail caused by data distribution. Take the example above: suppose the commodity table has only tens of thousands of records (this is just an example; in real business the commodity table is usually very large), while the IPV log table has billions of records and 80% of its commodity id values are the invalid value 0. With the SQL written as above, the data skew is obvious, but mapjoin solves the problem effectively:

select /*+ MAPJOIN(b) */a.* 
from ipv_log_table a 
left outer join item_table b 
on a.item_id = cast(b.item_id as string)

The principle of mapjoin       

       The small table is broadcast to every join task instance, which then does a hash join directly against the large table. Simply put, the join operation is moved forward to the map side instead of the reduce side.

Notes on using mapjoin

  • In a left outer join, the small table can only be the right table; in a right outer join, it can only be the left table; in an inner join there is no restriction; full outer join does not support mapjoin;
  • mapjoin supports at most 8 small tables; exceeding this limit causes a syntax error;
  • The total memory occupied by all small tables in a mapjoin is limited: the default is 512 MB and it can be raised to at most 2 GB;
  • mapjoin supports using a subquery as the small table;
  • mapjoin supports non-equi join conditions and multiple conditions combined with OR (see the sketch below);
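
A minimal sketch of the last two points, assuming a hypothetical small dimension table dim_table that carries both a numeric item_id and a legacy string old_item_id: the small table is a subquery, and the join condition combines two equalities with OR. The set flag is shown only as an illustration of raising the small-table memory limit; verify its name and limits against the current MaxCompute documentation.

-- optionally raise the small-table memory limit (flag name assumed; check the docs)
set odps.sql.mapjoin.memory.max=1024;

select /*+ MAPJOIN(d) */
       a.visitor_id
      ,d.item_id
from ipv_log_table a
left outer join (
      -- the small table can itself be a subquery
      select item_id
            ,old_item_id
      from dim_table
      where item_id > 0
) d
on a.item_id = cast(d.item_id as string)
or a.item_id = d.old_item_id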

Large table joins large table with invalid values

       From the small-table case we learned that mapjoin solves skew by loading the entire small table onto the map side. But what if the 'small table' is not small enough and mapjoin cannot be used? Taking the first scenario of this article again as an example: 80% of the commodity ids in the IPV log table are the invalid value 0 (the MaxCompute runtime has already been optimized for NULL values, so NULLs themselves no longer cause skew). Joining this log against a commodity table with records on the order of a billion is a disaster.

Solution 1 - Divide and Conquer:

       We know in advance that the invalid values cannot match anything in the commodity table, so they do not need to participate in the join at all. We can therefore handle the invalid values and the valid values separately:

select a.visitor_id
      ,b.seller_id
from (
      select visitor_id
            ,item_id
      from ipv_log_table
      where item_id > 0
) a 
left outer join item_table b 
on a.item_id = b.item_id

union all

select visitor_id
      ,cast(null as bigint) as seller_id
from ipv_log_table
where item_id = 0

Solution 2 - Scatter with random values:

       We can also replace the invalid values with random values as the join key. The skewed data that would otherwise be handled by a single reducer is then processed by multiple reducers in parallel. Because the invalid values cannot match anything anyway, distributing them across different reducers does not affect the final result:

select a.visitor_id
      ,b.seller_id 
from ipv_log_table a 
left outer join item_table b 
on if(a.item_id > 0, cast(a.item_id as string), concat('rand',cast(rand() as string))) = cast(b.item_id as string)

Solution 3 - Convert to mapjoin:

      Although the commodity table has more than a billion records and cannot be fed to mapjoin directly, in the actual business we know that the number of distinct commodities users visit in a single day is limited. Based on this, we can transform the query into a mapjoin with some preprocessing:

select /*+ MAPJOIN(b) */
       a.visitor_id
      ,b.seller_id 
from ipv_log_table a 
left outer join (
   select /*+ MAPJOIN(log) */
         itm.seller_id 
        ,itm.item_id
   from (
         select item_id 
         from ipv_log_table 
         where item_id > 0
         group by item_id
   ) log join item_table itm
   on log.item_id = itm.item_id
) b 
on a.item_id = b.item_id

Solution comparison

       Solutions 1 and 2 are general-purpose. Solution 1 reads the log table twice, while solution 2 reads it only once; in addition, solution 2 generates fewer tasks than solution 1, so in general solution 2 is better than solution 1. Solution 3 relies on an assumption that may break as the business evolves or in special cases (for example, crawler traffic can push the number of visited commodities close to the full catalog), which would make the mapjoin fail; whether to use it must be assessed against the specific business situation.

An old example

       Finally, I want to share an old optimization case. Although it dates back a long way and the underlying problem no longer exists, the optimization idea is still worth learning from. The situation was this: historically there were two generations of commodity dimension tables, the old one keyed by a string id and the new one, whose primary key is still in use today, keyed by a numeric id. The string id and the numeric id map to each other, and both exist as two fields in the commodity table. So in use, one had to filter on the numeric id and the string id separately, join each against the commodity table, and finally union the two results to get the final output.
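
For reference, here is a sketch of what the pre-optimization query might have looked like, assuming the commodity table is auctions with a numeric auction_id and a string auction_string_id (the same columns used in the optimized query below); the select list is only illustrative.

-- pre-optimization sketch (hypothetical): two separate joins plus a union all,
-- so the IPV log table is read and joined twice, across two MR jobs
select a.visitor_id
      ,b.auction_id
from ipv_log_table a 
join (
      select auction_id 
      from auctions
) b
on a.auction_id = b.auction_id

union all

select a.visitor_id
      ,b.auction_id
from ipv_log_table a 
join (
      select auction_string_id as auction_id 
      from auctions 
      where auction_string_id is not null
) b
on a.auction_id = b.auction_id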

      Now think about it: would it be better to rewrite it with the following approach?

select ... 
from ipv_log_table a 
join (
      select auction_id as auction_id 
      from auctions
      union all
      select auction_string_id as auction_id 
      from auctions 
      where auction_string_id is not null
) b
on a.auction_id = b.auction_id

 

The answer is yes. After the optimization, the commodity table is read once instead of twice, the same is true of the IPV log table, and the number of MR jobs drops from 2 to 1.

Summary

       The most effective way to optimize MaxCompute SQL is to start from the business perspective and to reason about the SQL in terms of the MapReduce program it is translated into. This article has organized the performance problems and optimizations for the various join scenarios; in practice you will often face a combination of these scenarios, and you will need to apply the corresponding techniques flexibly and extrapolate from these cases.

 

Author: Song Zhi
