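I. Data cleaning: finding hot keys
1. How to find a key value that occurs too often:
Step 1: create a table from a query (create table ... as select)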
create table tmp.tableA as select id,count(*) as num from tableA group by id;
Step 2: sort by count and look at the top keys
select id,num from tmp.tableA order by num desc limit 10;
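The raw counts are easier to judge next to the total row count. The following is a minimal sketch, assuming tmp.tableA (id, num) was built in step 1; the share column and the aliases t and s are illustrative names, not part of the original notes. Note that some clusters run in strict mode and reject cross joins.

```sql
-- Show the 10 most frequent ids together with their share of all rows.
select t.id,
       t.num as num,
       t.num / s.total as share   -- fraction of all rows that carry this id
from tmp.tableA t
cross join (select sum(num) as total from tmp.tableA) s   -- single-row total
order by num desc
limit 10;
```

An id whose share is far above 1 / (number of distinct ids) is a candidate for the skew handling described in the next section.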
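II. Data skew
1. Data skew caused by join
Suppose there are two tables, user and order.
user table
order table
When the two tables are joined on uid and the uid field is skewed:
Option 1:
If this is only a large table joined with a small table, with no aggregation afterwards, Hive converts it to a map join by default.
Option 2: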
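As a rough illustration of option 1, the sketch below shows the settings that control automatic map join conversion. It assumes that user is the small table and that order has a column named order_id; both are assumptions for the example, since the notes do not give the schemas. The threshold value shown is illustrative.

```sql
-- Let Hive convert a join with a small side into a map join automatically.
set hive.auto.convert.join=true;
-- Size threshold in bytes below which a table counts as small (illustrative value).
set hive.mapjoin.smalltable.filesize=25000000;

-- user and order collide with SQL keywords, so they are quoted with backticks.
-- order_id is a hypothetical column.
select o.uid, o.order_id
from `order` o
join `user` u on o.uid = u.uid;
```

With a map join the small table is loaded into memory on every mapper, so no shuffle on uid takes place and a hot uid cannot overload a single reducer.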
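Build a small number table dim_n that stores only the integers 0 to 99 (0, 1, 2, 3, ..., 97, 98, 99; 100 rows in total). The value 0 must be included, because the join condition below uses num%100, which ranges over 0 to 99.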
create table tmp.dim_n (n int);
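The create statement leaves the number table empty. One possible way to fill it, sketched here under the assumption that tmp.dim_n has a single int column n and that the Hive version allows a select without a from clause, is to explode a string of 99 spaces into 100 positions:

```sql
-- split(space(99), ' ') yields 100 empty strings; posexplode numbers them 0..99.
insert overwrite table tmp.dim_n
select e.pos as n
from (select 1 as x) d                                  -- one-row driver
lateral view posexplode(split(space(99), ' ')) e as pos, val;
```

A quick check such as select count(*), min(n), max(n) from tmp.dim_n should then report 100 rows ranging from 0 to 99.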
select t1.imei,t1.package,t2.name
from
(select imei,package,ceiling(rand()*100) num
from edw.app_list_fact
where data_date=20191126
) t1
join
(select n,package,name
from public.package_info cross join tmp.dim_n -- Cartesian product: each dimension row is replicated once per n
) t2
on (t1.package=t2.package and t1.num%100 = t2.n);
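The dimension table is enlarged with a Cartesian product (100 times in the example above) before the join. In effect, the rows (imei records) that share a skewed package value are spread across 100 distinct join keys.
The dimension table must not be too large: once it is enlarged 100 times the data volume becomes excessive, and you may then have to filter out the skewed data and process it separately.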
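Filtering out the skewed data and handling it separately could look roughly like the sketch below. It assumes a hypothetical table tmp.hot_package(package) that lists the skewed keys (for example, taken from the top-N query in section I); that table and the aliases are not part of the original notes. Top-level union all also requires a reasonably recent Hive version.

```sql
-- Part 1: rows whose package is NOT hot, joined the ordinary way.
select f.imei, f.package, p.name
from (
  select a.imei, a.package
  from edw.app_list_fact a
  left join tmp.hot_package h on a.package = h.package
  where a.data_date = 20191126 and h.package is null
) f
join public.package_info p on f.package = p.package
union all
-- Part 2: rows whose package IS hot. Only the few matching dimension rows
-- are needed, so this side is small enough to be a map join candidate.
select f.imei, f.package, p.name
from (
  select a.imei, a.package
  from edw.app_list_fact a
  left semi join tmp.hot_package h on a.package = h.package
  where a.data_date = 20191126
) f
join (
  select d.package, d.name
  from public.package_info d
  left semi join tmp.hot_package h on d.package = h.package
) p on f.package = p.package;
```

The design choice here is that the hot keys never go through a shuffle join against the full dimension table, while the remaining keys are evenly distributed and need no salting.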
2. Data skew caused by group by
When the uid field in group by uid is skewed, you can use two rounds of group by.
The first group by is:
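select
uid,
salt
from (select uid, ceiling(rand()*10) as salt from dual) s -- dual stands in for the real source table; salt is a random bucket from 1 to 10
group by uid, salt;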
Suppose the result of this step is table t1.
Then run the second group by:
select
uid
from t1 group by uid;
The complete statement:
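select
uid
from (
select
uid,
salt
from (select uid, ceiling(rand()*10) as salt from dual) s -- dual stands in for the real source table
group by uid, salt
) t1 -- Hive requires an alias on the subquery
group by uid;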
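The statements above only deduplicate uid. When an aggregate is needed, the same two-round pattern applies: compute a partial result per (uid, salt) first, then merge the partials per uid. The following is a minimal sketch for a row count, assuming a hypothetical source table tmp.events with a uid column (the notes use dual as a placeholder); cnt and salt are illustrative names.

```sql
-- Round 1: partial counts per (uid, salt); a hot uid is split into up to 10 groups.
-- Round 2: merge the partial counts per uid.
select uid,
       sum(cnt) as cnt
from (
  select uid,
         salt,
         count(*) as cnt
  from (
    select uid,
           ceiling(rand()*10) as salt   -- random bucket, 1 to 10
    from tmp.events                     -- hypothetical source table
  ) s
  group by uid, salt
) t1
group by uid;
```

This works for aggregates that can be merged from partial results, such as count, sum, min and max. Hive also offers the setting hive.groupby.skewindata=true, which asks the planner to apply a similar two-stage aggregation automatically; whether it helps depends on the query and the Hive version.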