Hive data skew optimization (joining two large tables)

Business background

The performance bottleneck of the user-track project has always been extract_track_info, and its most time-consuming step is the join between trackinfo and pm_info. Both tables are GB-scale. The original left join was:

[mw_shl_code=sql,true]from trackinfo a 
left outer join pm_info b 
on (a.ext_field7 = b.id) [/mw_shl_code]

This version took 1.5 hours.

Optimization process

First optimization

The id column of pm_info is of type bigint, while ext_field7 in trackinfo is of type string, so the join compares mismatched types. The reasoning was that the default hash would be computed on the bigint id, which would send every string-typed ext_field7 value to a single reducer. The join was therefore changed to:

[mw_shl_code=sql,true]from trackinfo a 
left outer join pm_info b 
on (cast(a.ext_field7 as bigint) = b.id) [/mw_shl_code]

The result was still unsatisfactory: the job again took 1.5 hours.

Second optimization

The ext_field7 field of trackinfo has a high rate of unusable values (null, zero-length, or filled with non-integers). Joining on such values is pointless, so when the left table's join field ext_field7 is invalid, the row does not need to be matched against the right table at all. The join was changed to:






[mw_shl_code=sql,true]from trackinfo a 
left outer join pm_info b 
on (a.ext_field7 is not null 
and length(a.ext_field7) > 0 
and a.ext_field7 rlike '^[0-9]+$' 
and cast(a.ext_field7 as bigint) = b.id)[/mw_shl_code]


With this join condition, a row whose ext_field7 is invalid (null, zero-length, or filled with non-integers) is never matched against the right table. The result is unaffected, because in a left join an invalid key finds no match anyway and the right-table columns come back null either way.
This version was still not ideal: it took 50 minutes.
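To make that argument concrete, here is a minimal Python sketch (not Hive code) of a left outer join in which invalid keys are never looked up. The sample rows, the helper names `is_valid_key` and `left_outer_join`, and the `pm-a`/`pm-b` values are all made up for illustration; only the three validity checks come from the join condition above.

```python
import re


def is_valid_key(value):
    """Mirror the three checks: not null, non-empty, digits only."""
    return (
        value is not None
        and len(value) > 0
        and re.fullmatch(r"[0-9]+", value) is not None
    )


def left_outer_join(left_rows, right_by_id):
    """Keep every left row; look up the right side only for valid keys."""
    result = []
    for row_id, ext_field7 in left_rows:
        if is_valid_key(ext_field7):
            match = right_by_id.get(int(ext_field7))  # None when no such id
        else:
            match = None  # invalid key: skip the lookup entirely
        result.append((row_id, ext_field7, match))
    return result


# Hypothetical sample data standing in for trackinfo and pm_info.
trackinfo = [(1, "100"), (2, None), (3, ""), (4, "abc"), (5, "200"), (6, "999")]
pm_info = {100: "pm-a", 200: "pm-b"}

joined = left_outer_join(trackinfo, pm_info)
for row in joined:
    print(row)
```

Every left row survives, and the rows with invalid keys carry a null right side, which is the same result an unfiltered left join would give for them.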


Third optimization

After thinking about it for a long time, the reason the second optimization fell short became clear. In the left join, rows whose join field is invalid are no longer matched against the right table, but all of those unmatched left-table rows (ext_field7 is empty) are still gathered into a single reducer for processing, which is why the reduce progress sits at 99% for a long time.
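The effect can be illustrated with a toy partitioner in Python. This is a simplified stand-in, not Hive's actual partitioning code: it assumes a key is sent to reducer `key % NUM_REDUCERS` and that a null key always lands on reducer 0. The reducer count and row counts are hypothetical.

```python
from collections import Counter

NUM_REDUCERS = 8  # hypothetical reducer count


def reducer_for(key, num_reducers=NUM_REDUCERS):
    """Toy partitioner: equal keys always land on the same reducer."""
    if key is None:
        return 0  # assumption: every null key goes to the same reducer
    return key % num_reducers


def reducer_load(keys, num_reducers=NUM_REDUCERS):
    """Count how many rows each reducer receives."""
    counts = Counter(reducer_for(k, num_reducers) for k in keys)
    return [counts.get(r, 0) for r in range(num_reducers)]


valid_keys = list(range(1, 2001))  # 2,000 rows with distinct valid ids
invalid_keys = [None] * 8000       # 8,000 rows with an unusable join key

load = reducer_load(valid_keys + invalid_keys)
print(load)  # [8250, 250, 250, 250, 250, 250, 250, 250]
```

One reducer receives 8,250 of the 10,000 rows while the others receive 250 each, so the job finishes only when that one reducer does.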
Looked at another way, the key to the solution is to spread the keys of the unmatched left-table rows as widely as possible. So: if the join field in the left table is invalid (null, zero-length, or filled with non-integers), replace it with a random number before the join, and then join with the right table. That way even the unmatched rows of the left table have very evenly distributed keys:


[mw_shl_code=sql,true]from trackinfo a 
left outer join pm_info b 
on (
    case when (a.ext_field7 is not null 
               and length(a.ext_field7) > 0 
               and a.ext_field7 rlike '^[0-9]+$') 
    then 
        -- valid key: join on the numeric value
        cast(a.ext_field7 as bigint) 
    else 
        -- invalid key: use a random non-positive number to spread rows across reducers
        cast(ceiling(rand() * -65535) as bigint) 
    end = b.id
) [/mw_shl_code]

After the third change, the run time dropped from 50 minutes to 1 minute 32 seconds, a significant improvement!
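Why the random key helps can be sketched with a small Python simulation. It is an illustration under stated assumptions, not a model of Hive internals: the partitioner is simply `key % NUM_REDUCERS`, and the reducer count, the row counts, and the helper names `salted_key` and `reducer_load` are hypothetical. The key substitution itself mirrors the CASE expression in the join condition.

```python
import math
import random
import re
from collections import Counter

NUM_REDUCERS = 8  # hypothetical reducer count


def salted_key(ext_field7, rng):
    """Mirror the CASE expression: keep valid keys, randomize invalid ones."""
    if ext_field7 is not None and re.fullmatch(r"[0-9]+", ext_field7):
        return int(ext_field7)
    # Same shape as cast(ceiling(rand() * -65535) as bigint):
    # an integer between -65534 and 0.
    return math.ceil(rng.random() * -65535)


def reducer_load(keys, num_reducers=NUM_REDUCERS):
    """Toy partitioner: key modulo reducer count (non-negative in Python)."""
    counts = Counter(k % num_reducers for k in keys)
    return [counts.get(r, 0) for r in range(num_reducers)]


rng = random.Random(42)  # fixed seed so the run is repeatable

# 2,000 valid keys plus 8,000 invalid ones (null, empty, non-integer).
raw = [str(i) for i in range(1, 2001)] + [None] * 4000 + [""] * 2000 + ["abc"] * 2000
keys = [salted_key(v, rng) for v in raw]

load = reducer_load(keys)
print(load)  # roughly 1,250 rows per reducer instead of one overloaded reducer
```

The random keys fall between -65534 and 0, so the trick relies on pm_info.id never taking such values (for example, ids that are always positive). Otherwise a randomized row could match a right-table row by accident.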


http://wsq.discuz.com/?siteid=264104844&c=index&a=viewthread&tid=13077&source=pcscan
