Spark data skew: causes and solutions in a production environment

        In a recent project, data skew occurred while joining and smoothing historical and real-time data: the join degenerated into a Cartesian product. The symptoms were as follows: the job ran with 175 GB of memory and 64 cores, yet with only about 1 GB of input data, the active job in the Spark UI made no progress; the stage stayed at 0/120 completed tasks. Investigation confirmed that a Cartesian product had indeed occurred.
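To see why a Cartesian product can stall a job even with small input, consider the row counts involved: a join with no (or a non-matching) equality condition pairs every left row with every right row, so output size is |left| × |right| rather than roughly |left|. The following is a minimal plain-Python sketch of the difference (not Spark API; the tables and keys are made up for illustration):

```python
# Two small "tables": joining without an equality condition degenerates
# into a Cartesian product, so the output has |left| * |right| rows.
left = [("u1", 10), ("u2", 20), ("u3", 30)]
right = [("u1", "a"), ("u2", "b"), ("u4", "c")]

# Cartesian product: every left row pairs with every right row.
cartesian = [(l, r) for l in left for r in right]
print(len(cartesian))  # 3 * 3 = 9 rows

# Keyed (equi-)join: rows pair only when the join keys match.
joined = [(l, r) for l in left for r in right if l[0] == r[0]]
print(len(joined))  # 2 rows (only u1 and u2 have a match)
```

With millions of rows on each side, the Cartesian case explodes to trillions of output rows, which is why a 1 GB input can produce a stage that never completes a task. In Spark, the usual fix is to make sure the join has a correct equality condition on the intended keys.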

Causes and solutions to Spark data skew:

        Spark data skew mainly arises during the shuffle: different keys carry very different amounts of data, so the tasks that receive the heavy keys process far more data than the others. Two common mitigations follow from this. First, if a small number of keys cause the skew and their rows are not actually needed by the business logic, filter those keys out before the shuffle, so the skewed data never reaches the job. Second, increase the reduce-side parallelism (the number of reduce tasks), so that each task receives less data; note that this helps when the skew comes from many keys colliding into the same task, but not when a single hot key dominates, since all rows with the same key still go to one task.

Reference articles:

How Spark handles data skew (CSDN Blog)

Origin blog.csdn.net/qq_52128187/article/details/134434587