Optimizing a Spark Job

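This is a job from an earlier audience-profiling project that processed user information tags, at a time when cluster resources were very limited. The end result: the job's run time went from about 30 minutes to under four minutes on roughly the same resources.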

CREATE EXTERNAL TABLE `user_info_tag`(
  `mac` string,
  `tags` map<string,string>)
PARTITIONED BY (
  `day` string)
STORED AS ORC
LOCATION
  '/apps/external/hive/user_info_tag';

ALTER TABLE user_info_tag CHANGE mac mac string COMMENT 'device only id';

ALTER TABLE user_info_tag CHANGE tags tags map<string,string> COMMENT 'if one device has more than one account and tag has more than one value ,multiple values separated by commas';

select d_device_id,count(*) as cnt from caos_tv.user_db where day = "2017-02-27" group by d_device_id order by cnt desc limit 20;

FFFFFFFFFFFF 2195
001020304050 1175
fca38681ade6 979
001a9a000000 935
001a9ae2ff79 765
001A9AE2FF99 318
00301bba02db 292
001A9AE2FF79 255
001a9ae2ff66 241
fca38681c5d5 212
fca3867569da 110
001A9AE2FF66 110
fca38681c721 110
001A9AE2FF16 90
bcec23e0ee7e 85
ffffffffffff 81
bca789dfceda 73
bcec2376768a 66

Record count: about 3 million

Data size: 600 to 700 MB

The source RDD has three partitions

The job needs a groupByKey on mac

First attempt:

--executor-memory=5g \
--driver-memory=2g \
--num-executors=10 \

Write partitions: repartition(1)

The job never finished and failed.

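Second attempt:

--executor-memory=5g \
--driver-memory=2g \
--num-executors=10 \

Write partitions: repartition(3)

This succeeded in spark-shell, but under spark-submit it only succeeded occasionally, taking about 30 minutes.

Morning runs were faster, possibly because Spark had cleared its cache by then.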

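Third attempt:

--executor-memory=5g \
--driver-memory=2g \
--num-executors=10 \

repartition(6)

Replaced groupByKey with reduceByKey.

A morning run succeeded in 14 minutes. At other times the first two map steps finished quickly, but the third step (the insert) reached 4/6 or 5/6 tasks and then failed with a timeout, succeeding only occasionally.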
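The difference between the two operators can be sketched in plain Python, without Spark. This is only an illustration under assumptions: `merge_tags`, `group_then_merge`, and `reduce_by_key` are hypothetical names, and the rule of de-duplicating and sorting comma-separated values is an assumption based on the `tags` column comment, not the original job's code. The groupByKey style collects every value for a key before merging, while the reduceByKey style folds each record into one running result per key, so a hot key never needs all of its values held at once.

```python
from collections import defaultdict
from functools import reduce


def merge_tags(a, b):
    """Merge two tag maps; several values for one tag are comma-joined.

    Assumption: values are de-duplicated and sorted, which makes the merge
    associative and commutative (what a reduceByKey function needs).
    """
    merged = dict(a)
    for tag, value in b.items():
        values = set(value.split(","))
        if tag in merged:
            values |= set(merged[tag].split(","))
        merged[tag] = ",".join(sorted(values))
    return merged


def group_then_merge(records):
    """groupByKey style: gather all tag maps per mac, then merge them."""
    groups = defaultdict(list)
    for mac, tags in records:
        groups[mac].append(tags)
    return {mac: reduce(merge_tags, maps) for mac, maps in groups.items()}


def reduce_by_key(records):
    """reduceByKey style: keep one running merged map per mac."""
    acc = {}
    for mac, tags in records:
        acc[mac] = merge_tags(acc[mac], tags) if mac in acc else tags
    return acc


# Hypothetical sample records: (mac, tags)
records = [
    ("aa", {"city": "bj"}),
    ("bb", {"age": "20"}),
    ("aa", {"city": "sh", "age": "30"}),
    ("aa", {"city": "bj"}),
]
print(reduce_by_key(records))
```

In PySpark a function like `merge_tags` would be passed to `rdd.reduceByKey(merge_tags)`, where it is also applied on the map side before the shuffle.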

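This blog post describes some data skew scenarios:

http://blog.csdn.net/pengych_321/article/details/52260361

To guard against data skew, I filtered out the keys that have more than 500 values.

It made no real difference.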
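The filter described above could look roughly like the following plain-Python sketch. The function name `split_skewed` and the sample data are hypothetical; only the threshold of 500 comes from the text, and the original job would have done this on an RDD rather than a list.

```python
from collections import Counter

SKEW_THRESHOLD = 500  # from the text: drop keys with more than 500 values


def split_skewed(records, threshold=SKEW_THRESHOLD):
    """Return (kept_records, skewed_keys) for a list of (mac, value) pairs."""
    counts = Counter(mac for mac, _ in records)
    skewed = {mac for mac, n in counts.items() if n > threshold}
    kept = [(mac, value) for mac, value in records if mac not in skewed]
    return kept, skewed


# Hypothetical sample: one hot key and one normal key
records = [("FFFFFFFFFFFF", i) for i in range(2195)]
records += [("bcec2376768a", i) for i in range(66)]
kept, skewed = split_skewed(records)
print(len(kept), skewed)
```

Returning the skewed keys separately makes it easy to process them in a second pass, or to add them back later and compare run times.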

This blog post discusses shuffle FetchFailed problems:

http://blog.csdn.net/lsshlsw/article/details/51213610

--executor-memory=6g \
--driver-memory=2g \
--num-executors=9 \

repartition(9)

Run time: 3 minutes 50 seconds

Fourth optimization:

--executor-memory=4g \
--driver-memory=2g \
--num-executors=9 \

repartition(9)

Run time: 9 minutes 37 seconds

Reducing executor memory made the job slower.

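Fifth optimization:

--executor-memory=3g \
--driver-memory=1g \
--num-executors=12 \

repartition(12)

Same total executor memory as the fourth run, but with more executors and more partitions.

Run time: 8 minutes 4 seconds

Increasing the partition count further made little difference.

Adding back the keys that originally had very many values:

Run time: 9 minutes 5 seconds, so that had little impact either.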
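To compare the runs side by side, the total executor memory of each configuration can be worked out with a few lines of Python. The numbers are taken from the flags and timings above; the run labels (including "third, retuned" for the 6g run) and the helper name `total_executor_memory_gb` are mine, and the totals ignore driver memory and any per-executor overhead.

```python
# (label, executor memory in GB, number of executors, partitions, outcome)
RUNS = [
    ("first", 5, 10, 1, "failed"),
    ("second", 5, 10, 3, "about 30 min, unreliable"),
    ("third", 5, 10, 6, "14 min, unreliable"),
    ("third, retuned", 6, 9, 9, "3 min 50 s"),
    ("fourth", 4, 9, 9, "9 min 37 s"),
    ("fifth", 3, 12, 12, "8 min 4 s"),
]


def total_executor_memory_gb(memory_gb, executors):
    """Total memory requested across all executors, in GB."""
    return memory_gb * executors


for label, memory_gb, executors, partitions, outcome in RUNS:
    total = total_executor_memory_gb(memory_gb, executors)
    print(f"{label}: {executors} x {memory_gb}g = {total}g, "
          f"{partitions} partitions, {outcome}")
```

Seen this way, the fourth and fifth runs request the same total, and the fastest run is the one with the most memory per executor and one partition per executor.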
