How to deal with the problem of data skew in Spark tasks?

Foreword

Recently, a data warehouse student asked me how to handle data skew in Spark tasks. So today, let's talk about how to deal with the data skew problem.

1) Definition of data skew

Data skew in Spark mainly refers to skew that occurs during the shuffle process: different keys correspond to different amounts of data, so different tasks end up processing very different volumes of data.
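To make the definition concrete, here is a minimal plain-Python sketch (the key names and counts are made up, not from the article). During a shuffle, all records with the same key go to the same task, so one "hot" key means one overloaded task:

```python
from collections import Counter

# Hypothetical key distribution: one hot key dominates.
# In a Spark shuffle (groupByKey, reduceByKey, join, ...), all records
# with the same key land on the same task, so the task owning the hot
# key does far more work than the others.
keys = ["user_0"] * 1_000_000 + [f"user_{i}" for i in range(1, 101)] * 100

counts = Counter(keys)
hot_key, hot_count = counts.most_common(1)[0]
avg_count = sum(counts.values()) / len(counts)

# The hot key holds 100x the average number of records per key.
print(hot_key, hot_count, round(hot_count / avg_count))
```

The same ratio (hottest key vs. average records per key) is a useful quick measure of how severe the skew is.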

2) The performance of data skew

Most of the job's tasks execute quickly, but one or a few tasks run very slowly. The job can complete, but it runs very slowly; data skew is likely.
Most of the job's tasks execute quickly, but some tasks suddenly report OOM while running, and keep failing after several retries. In this case data skew is likely, and the job cannot run normally. In the Spark Web UI screenshot below, the other tasks have similar execution times and record counts, but the first task processes nearly 100 million records — a clear sign of data skew.
[Screenshot: Spark Web UI task list — one task processes nearly 100 million records (about 15 GB) while the other tasks process far less]
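The straggler in such a screenshot can be spotted mechanically: compare each task's record count (or duration) against the median across the stage. A small sketch with made-up numbers, loosely modeled on the Web UI "Tasks" table:

```python
# Hypothetical per-task record counts for one stage, as read off the
# Spark Web UI (these numbers are illustrative, not from a real job).
task_records = [52_000, 48_000, 50_000, 51_000, 99_000_000]

median = sorted(task_records)[len(task_records) // 2]
# Flag any task processing more than 10x the median as a straggler.
stragglers = [i for i, n in enumerate(task_records) if n > 10 * median]

print(stragglers)
```

A healthy stage has counts clustered around the median; one index standing out by an order of magnitude is the skewed task.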

3) Positioning data skew problem

Step 1: Identify the Stage in which the data skew occurs
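Once the skewed Stage is found, the usual next step is to find the skewed key itself by sampling the data and counting records per key (in Spark, typically something like `rdd.sample(False, 0.1).countByKey()`). A plain-Python stand-in with invented data shows the idea:

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

# Made-up dataset: one hot key ("hot_order") holds 90% of the records.
records = [("hot_order", None)] * 90_000 + \
          [(f"order_{i}", None) for i in range(1000)] * 10

# Sample a fraction of the data, then count records per key; the
# skewed key should dominate the sampled counts.
sample = random.sample(records, 5_000)
top = Counter(k for k, _ in sample).most_common(3)
print(top)
```

In practice you run this sampling against the input of the skewed Stage; the key (or keys) at the top of the list are the ones the later fixes (filtering, salting, two-stage aggregation, etc.) are applied to.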


Origin blog.csdn.net/u011109589/article/details/131965685