Foreword
Recently, a student on the data warehouse team asked me how to deal with data skew in a Spark task. So today, let's talk about how to handle the data skew problem.
1) Definition of data skew
Data skew in Spark mainly refers to skew that occurs during the shuffle process: because different keys have different amounts of associated data, different tasks end up processing very different volumes of data.
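To make this concrete, here is a minimal plain-Python sketch (not Spark itself, and with made-up key names and counts) that simulates how hash partitioning in a shuffle distributes records across tasks when one key dominates: all records with the same key land in the same partition, so that one task does almost all the work.

```python
from collections import Counter

# Simulated dataset: the key "hot" dominates, mimicking a skewed
# groupBy/join key. Key names and counts are purely illustrative.
records = ["hot"] * 1_000_000 + ["a"] * 100 + ["b"] * 100 + ["c"] * 100

NUM_PARTITIONS = 4

# Hash partitioning, as a shuffle would do: every record with the same
# key hashes to the same partition, so "hot" piles onto a single task.
partition_sizes = Counter(hash(key) % NUM_PARTITIONS for key in records)

for pid in range(NUM_PARTITIONS):
    print(f"partition {pid}: {partition_sizes[pid]} records")
```

Whichever partition the hot key hashes to ends up with over a million records while the others hold at most a few hundred, which is exactly the imbalance between tasks that data skew produces.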
2) Symptoms of data skew

Most of a Spark job's tasks execute quickly, but one or a few tasks run very slowly. In this case there may be data skew: the job can run, but it runs very slowly.

Most of a Spark job's tasks execute quickly, but some tasks suddenly report an OOM error at a certain point while running, and the error recurs after several retries. In this case there may be data skew, and the job cannot run normally.

In the screenshot above, the execution speed and number of processed records of the other tasks are similar, but the first task processes nearly 100 million records (about 15GB of data), so data skew is present.
3) Locating the data skew problem

Step 1: Identify the stage in which the data skew occurs
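In practice this means checking the Spark Web UI's stage page for tasks whose duration or shuffle-read size is far above the rest. A related check, sketched below in plain Python with a hypothetical key stream (rather than against a live Spark cluster, where you would sample the DataFrame's shuffle key column), is to sample the records and count key frequencies to confirm which key is hot:

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical skewed key stream, standing in for a sampled shuffle
# key column: "user_0" accounts for ~90% of all records.
keys = ["user_0"] * 9000 + [f"user_{i}" for i in range(1, 101)] * 10

# Sample a subset of the records (as DataFrame.sample() would in Spark)
# and count key frequencies to spot the hot keys.
sample = random.sample(keys, k=1000)
top_keys = Counter(sample).most_common(3)
print(top_keys)
```

The most frequent key in even a modest sample stands out immediately, which tells you which key the slow task is choking on.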