Troubleshooting abnormal CPU usage of Spark Driver

Author: Jamie, father of Guanyuan Data's in-house Spark and a senior engineer in the ABI field.

At the beginning of the year, we received feedback from a customer that their server's CPU usage was abnormal. After connecting to the server remotely to investigate, we found that the Spark driver was consuming most of the CPU. For this kind of CPU problem, jstack is a quick way to see what the JVM is actually executing. The jstack output showed that most of the CPU-hungry threads were running an optimizer rule called transpose window (TransposeWindow), and all of them were inside one method of that rule:

private def compatiblePartitions(ps1: Seq[Expression], ps2: Seq[Expression]): Boolean = {
  ps1.length < ps2.length && ps2.take(ps1.length).permutations.exists(
    ps1.zip(_).forall {
      case (l, r) => l.semanticEquals(r)
    })
}

This logic does not look complicated, but notice the call to permutations in this method: it returns every permutation of the given sequence. For a sequence of n distinct elements there are n! permutations, so the time complexity of this check can reach O(n!). The problem we were seeing was very likely related to this, but we still had to find out which SQL statement triggered it.
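
To get a feel for how quickly this blows up, here is a small, self-contained sketch (plain Scala, not taken from the rule itself) that counts the permutations produced for growing sequence sizes:

// A minimal sketch: Seq.permutations yields n! orderings for n distinct
// elements, which is what compatiblePartitions enumerates in the worst case.
(1 to 8).foreach { n =>
  val count = (1 to n).permutations.size  // materializes all n! permutations
  println(s"$n columns -> $count permutations")
}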

The monitoring showed the driver's CPU usage rising in steps, so each step should correspond to the submission of a problematic job. Since the call stack pointed at window-related optimization, we focused on jobs submitted at those time points that involved window functions. We soon located an ETL: every time it ran, the Spark driver occupied one more CPU core and did not release it for a long time.

The original ETL logic is fairly complicated. After simplifying it, we found that only two calculated fields with window functions mattered; their common trait is that the PARTITION BY clause uses a large number of columns. For convenience of debugging, we reproduced the issue in spark-shell:

val df = spark.range(10).selectExpr(
  "id AS a1", "id AS a2", "id AS a3", "id AS a4", "id AS a5", "id AS a6",
  "id AS a7", "id AS a8", "id AS a9", "id AS a10", "id AS a11", "id AS a12",
  "id AS a13", "id AS a14", "id AS a15", "id AS a16")
df.selectExpr(
  "sum(`a16`) OVER(PARTITION BY `a1`,`a2`,`a3`,`a4`,`a5`,`a6`,`a7`,`a8`,`a9`,`a10`,`a11`,`a12`,`a13`,`a14`,`a15`) as p1", 
  "sum(`a16`) OVER(PARTITION BY `a14`,`a2`,`a3`,`a4`,`a5`,`a6`,`a7`,`a8`,`a9`,`a10`,`a11`,`a12`,`a13`,`a1`) as p2"
  ).explain

If you run the above code in a spark-shell of version 3.0 or later, you will find that the shell gets stuck, and the place it is stuck is precisely the compatiblePartitions method.

That is to say, if a query has multiple calculated fields with window functions and the PARTITION BY clauses contain many columns, it is easy to trigger this problem. In the example above, you can work out how many permutations of the 14-column prefix the rule may have to enumerate. So, is there any way to optimize this?
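
As a rough check of the numbers (a simple calculation, not part of the original ETL): the second window above partitions by 14 columns, so in the worst case the rule enumerates every permutation of a 14-element prefix:

// 14 partition columns => up to 14! candidate orderings to test
val permutationsOf14 = (1 to 14).map(BigInt(_)).product
println(permutationsOf14)  // 87178291200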

Usually, when we hit a Spark-related problem, we first search JIRA for similar issues: Spark is so widely used that the problem has often already been reported or even fixed. This time, however, we could not find anything relevant with any combination of keywords, so it seemed we had to solve the problem ourselves.

First, we need to understand when this logic was introduced and why. The commit history shows that it was added to address [SPARK-20636] Eliminate unnecessary shuffle with adjacent Window expressions - ASF JIRA: transposing two adjacent Window operators can save one shuffle. The compatiblePartitions method is what decides whether two windows can be transposed.
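
As a quick illustration of what the rule buys us (a hedged sketch, not the customer's ETL; the columns a, b, c are made up), two adjacent windows whose partition columns nest should end up needing only one exchange in the physical plan once the transpose has been applied:

// w1 partitions by a subset of w2's partition columns; after the transpose,
// hash-partitioning on `a` alone should satisfy both windows, so only one
// shuffle is expected in the physical plan.
val demo = spark.range(10).selectExpr("id AS a", "id % 2 AS b", "id % 3 AS c")
demo.selectExpr(
  "sum(c) OVER (PARTITION BY a) AS w1",
  "sum(c) OVER (PARTITION BY a, b) AS w2"
).explain(true)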

Deducing the definition of "compatible" from the existing code: the partition columns of window1 must be a permutation of a prefix of window2's partition columns. A few examples make this clearer. If window2 is partition by('a', 'b', 'c', 'd'), then window1 can be partition by('a'), partition by('a', 'b'), partition by('b', 'a'), partition by('c', 'a', 'b') and so on, but not partition by('b') or partition by('a', 'c').
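
To make these examples concrete, here is a simplified stand-in for the original check (plain strings and == instead of Catalyst Expressions and semanticEquals):

// Same shape as the original compatiblePartitions, but on strings.
def prefixPermutationCompatible(ps1: Seq[String], ps2: Seq[String]): Boolean =
  ps1.length < ps2.length && ps2.take(ps1.length).permutations.exists(
    ps1.zip(_).forall { case (l, r) => l == r })

val window2 = Seq("a", "b", "c", "d")
prefixPermutationCompatible(Seq("c", "a", "b"), window2)  // true: a permutation of the prefix (a, b, c)
prefixPermutationCompatible(Seq("a", "c"), window2)       // false: not a permutation of any prefix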

But this definition is not entirely reasonable. For one thing, the cost of enumerating permutations is far too high; for another, some cases that could be transposed are not covered, such as partition by('b') or partition by('a', 'c') above. There is also the case of repeated columns, such as partition by('a', 'a'), which the original algorithm cannot handle. So "compatible" can instead be defined as: every partition column of window1 can be found among the partition columns of window2. If that holds, we can transpose the windows and save the shuffle. Expressed in code:

private def compatiblePartitions(ps1 : Seq[Expression], ps2: Seq[Expression]): Boolean = {
  ps1.length < ps2.length && ps1.forall { expr1 =>
    ps2.exists(expr1.semanticEquals)
  }
}

In this way, we not only avoid the expensive permutation enumeration, but also widen the range of cases the optimization applies to.
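
With the same string-based stand-in as before, a sketch of the relaxed check (not the exact Spark source) now accepts the cases that used to be rejected:

// Every partition column of window1 just has to appear among window2's columns.
def subsetCompatible(ps1: Seq[String], ps2: Seq[String]): Boolean =
  ps1.length < ps2.length && ps1.forall(e1 => ps2.contains(e1))

val window2 = Seq("a", "b", "c", "d")
subsetCompatible(Seq("b"), window2)       // true, previously rejected
subsetCompatible(Seq("a", "c"), window2)  // true, previously rejected
subsetCompatible(Seq("a", "a"), window2)  // true, repeated columns are handled now
subsetCompatible(Seq("a", "e"), window2)  // false: "e" is not a partition column of window2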

After a series of tests we confirmed that the change works, so we submitted an issue and a related PR to the community; interested readers can look them up for the specifics.

Although the PR sank into the ocean of open PRs at first, it was picked up by contributors abroad half a year later and was eventually accepted by the community, so now upgrading to an official release is enough to fix the problem. Guanyuan Data also added another name to the list of Spark contributors. Going forward, we will keep paying attention to Spark/Delta issues we encounter in practice and continue contributing to these open source projects.

Origin blog.csdn.net/GUANDATA_/article/details/126469737