Slow Hive map stage

Data skew caused by joining columns of different data types

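Scenario: in the users table the user_id column is int, while in the log table the user_id column holds both string-style and integer-style values. When the two tables are joined on user_id, the default hash partitioning treats the id as an int. Every log record whose id is a string and cannot be read as a number therefore gets the same key, and all of those records are sent to a single reducer.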
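To gauge how many log rows would collapse onto one key, a quick check along the following lines may help. It is only a sketch: it assumes the logs table and user_id column named in the scenario, and it relies on Hive's cast returning NULL for a string that is not a valid number.

```sql
-- Sketch: count log rows whose user_id cannot be read as an integer.
-- Assumes the logs table and user_id column from the scenario above.
-- Rows where user_id is already NULL are counted as well.
select
  count(*) as total_rows,
  sum(case when cast(user_id as int) is null then 1 else 0 end) as non_numeric_or_null_ids
from logs;
```

If the second number is large relative to the first, the join is likely to be skewed in the way described.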

Solution: cast the numeric type to string.

select * from users a
  left outer join logs b
  on cast(a.user_id as string) = b.user_id

Real-world case (map join is used by default):

In the fin_ihotel_ceq_external_ctrip table (large table), countryid and cityid are of type int,

while country_id in dim_ihotel_country_ctrip (small table) and city_id in dim_ihotel_city_ctrip (small table) are both of type string.

from fin_ihotel_ceq_external_ctrip t1 
left join dim_ihotel_country_ctrip t2 
on t1.countryid=t2.country_id
left join dim_ihotel_city_ctrip t3 
on t1.cityid=t3.city_id and t1.countryid=t2.country_id 

Changed to:

from fin_ihotel_ceq_external_ctrip t1
left join dim_ihotel_country_ctrip t2
on cast(t1.countryid as string)=t2.country_id
left join dim_ihotel_city_ctrip t3
on cast(t1.cityid as string)=t3.city_id and cast(t1.countryid as string)=t2.country_id

Run time dropped from one and a half hours to 3 minutes.
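One way to check whether a join still compares mismatched types is to look at the query plan with explain. The sketch below reuses the tables from this case. The select list is an assumption, since the original query's column list is not shown, and the redundant country condition on the second join is left out for brevity.

```sql
-- Sketch: inspect the plan of the join with explicit casts.
-- The selected columns are illustrative; the original select list is not shown.
explain
select t1.countryid, t1.cityid
from fin_ihotel_ceq_external_ctrip t1
left join dim_ihotel_country_ctrip t2
  on cast(t1.countryid as string) = t2.country_id
left join dim_ihotel_city_ctrip t3
  on cast(t1.cityid as string) = t3.city_id;
```

The join key expressions in the plan output should indicate whether Hive applies a conversion to either side before comparing.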

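Going forward, columns that will be used as join keys are best defined as string, so that both sides of a join share the same type.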
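As a minimal illustration of that advice, the sketch below declares the join key as string on both sides. The table names and the non-key columns are hypothetical and do not come from the original case.

```sql
-- Sketch: hypothetical tables whose join key has the same type on both sides.
-- Table names and non-key columns are illustrative only.
create table if not exists users_example (
  user_id   string,
  user_name string
);

create table if not exists logs_example (
  user_id    string,
  event_time string
);

-- With matching key types, the join condition needs no cast.
select a.user_id, b.event_time
from users_example a
left outer join logs_example b
  on a.user_id = b.user_id;
```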

Other references:

Slow Hive map stage: a detailed analysis of the optimization process

http://bigdata.51cto.com/art/201703/535606.htm

Reposted from blog.csdn.net/hellojoy/article/details/86625859