A Hive Data Skew Problem

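A colleague imported records for visitors who were not logged in to the site into the daily access log table. The table previously held 700 million rows and now holds 1.3 billion, so the data volume has nearly doubled. As a result, the job that computes unique IPs, UV, PV, and unique cookie counts ran into data skew: the reduce phase stalled at 99%. The cause is that every visitor who is not logged in has a user ID of 0, so all rows with user ID 0, about 600 million of them, were sent to a single reducer. For now, the simple workaround is shown in the two queries below.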
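As a rough illustration of why one key can stall a single reducer, here is a small Python sketch. It is not Hive code: `reducer_for` is a hypothetical stand-in for a hash partitioner (the same key always maps to the same reducer), and the row counts are made-up numbers scaled down from the 700 million / 600 million split described above.

```python
from collections import Counter

NUM_REDUCERS = 10  # hypothetical reducer count, chosen only for illustration


def reducer_for(user_id: int, num_reducers: int) -> int:
    """Stand-in for a hash partitioner: equal keys always land on the same reducer."""
    return user_id % num_reducers


def reducer_loads(user_ids, num_reducers):
    """Count how many rows each reducer would receive for the given join keys."""
    loads = Counter({r: 0 for r in range(num_reducers)})
    for uid in user_ids:
        loads[reducer_for(uid, num_reducers)] += 1
    return loads


# Toy data (assumed, scaled down): 700 rows from logged-in users with
# distinct ids 1..700, plus 600 rows from anonymous visitors, all with id 0.
logged_in = list(range(1, 701))
anonymous = [0] * 600
all_rows = logged_in + anonymous

# Every id=0 row goes to the same reducer, so that reducer is overloaded.
skewed = reducer_loads(all_rows, NUM_REDUCERS)

# Dropping id=0 before partitioning spreads the remaining rows evenly.
filtered = reducer_loads([uid for uid in all_rows if uid != 0], NUM_REDUCERS)

print("with id=0:   ", dict(skewed))
print("without id=0:", dict(filtered))
```

In this toy setup one reducer receives far more rows than the others until the id 0 rows are removed, which mirrors the job that sat at 99% waiting for its last reducer.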

The original join query:


         insert overwrite local directory '$HIVE_RESULT'
         select sum(case when d.pv_flag=1 then 1 else 0 end) as pv,
                count(distinct d.id) as uv, count(distinct d.ip) as ip,
                sum(d.stime), count(distinct d.cookie),
                d.product, u.friendcount_level
         from user u join access_dap d
           on (d.log_date='$YESTERDAY' and u.id=d.id)
         group by d.product, u.friendcount_level;


The modified query:


         insert overwrite local directory '$HIVE_RESULT'
         select sum(case when d.pv_flag=1 then 1 else 0 end) as pv,
                count(distinct d.id) as uv, count(distinct d.ip) as ip,
                sum(d.stime), count(distinct d.cookie),
                d.product, u.friendcount_level
         from user u join access_dap d
           on (d.log_date='$YESTERDAY' and u.id=d.id) and d.id!=0
         group by d.product, u.friendcount_level;

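The change filters out the rows in access_dap whose id is 0. Because the user table contains no row with id = 0, those rows could never match in the join, so dropping them does not change the result. This solves the problem for now; further optimization will follow.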
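The claim that the extra filter leaves the result unchanged can be checked on a small scale. The sketch below is only an approximation: it runs both join conditions in SQLite through Python's `sqlite3` module rather than in Hive, the column types and sample rows are assumptions, and the date `2012-12-15` is a made-up stand-in for `$YESTERDAY`. It says nothing about reducer load; it only compares the query output.

```python
import sqlite3

# Same shape as the queries above; {extra_filter} is where "and d.id!=0" goes.
QUERY = """
select sum(case when d.pv_flag=1 then 1 else 0 end) as pv,
       count(distinct d.id) as uv, count(distinct d.ip) as ip,
       sum(d.stime), count(distinct d.cookie),
       d.product, u.friendcount_level
from "user" u join access_dap d
  on (d.log_date=? and u.id=d.id) {extra_filter}
group by d.product, u.friendcount_level
order by d.product, u.friendcount_level
"""


def run(conn, extra_filter=""):
    """Run the aggregate query for the assumed date, with an optional extra join filter."""
    sql = QUERY.format(extra_filter=extra_filter)
    return conn.execute(sql, ("2012-12-15",)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute('create table "user" (id integer, friendcount_level text)')
conn.execute(
    "create table access_dap (id integer, ip text, cookie text, "
    "pv_flag integer, stime integer, product text, log_date text)"
)

# The user table has no row with id = 0, as stated in the post.
conn.executemany('insert into "user" values (?, ?)', [(1, "low"), (2, "high")])

# Two logged-in users plus two anonymous visits (id = 0); all values are made up.
conn.executemany(
    "insert into access_dap values (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "10.0.0.1", "c1", 1, 5, "web", "2012-12-15"),
        (1, "10.0.0.1", "c1", 0, 3, "web", "2012-12-15"),
        (2, "10.0.0.2", "c2", 1, 7, "web", "2012-12-15"),
        (0, "10.0.0.3", "c3", 1, 2, "web", "2012-12-15"),
        (0, "10.0.0.4", "c4", 1, 4, "web", "2012-12-15"),
    ],
)

original_rows = run(conn)
filtered_rows = run(conn, "and d.id!=0")
print(original_rows == filtered_rows)
```

Both runs return the same rows because the inner join already discards id 0; the practical difference in Hive is that the filtered version no longer ships those rows to a reducer.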

Reposted from wrn19851021-163-com.iteye.com/blog/1748843