Spark crossJoin method optimization

Scenario

The problem: DF1.crossJoin(DF2) was extremely slow. Both DataFrames held on the order of tens of millions of rows, so at first I assumed the sheer data volume was to blame. But the same crossJoin ran fast on another batch of data of the same order of magnitude, so something else was going on, and it was worth taking the time to investigate.


Cause

It's all partitioning's fault. Spark stores data in partitions; during computation each partition becomes a task, and tasks run in parallel to improve throughput. The number of partitions therefore has a large impact on speed, but more partitions does not always mean faster execution: too many partitions mean many more tasks, and managing all of those tasks carries enough overhead to reduce overall efficiency.

As long as the available resources are still fully utilized, the fewer partitions the input DataFrames have, the faster crossJoin executes.
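A quick way to inspect partitioning in spark-shell (a minimal sketch, assuming a running `spark` session; all names below are standard Spark APIs):

// Inspect the partitioning of a DataFrame and the relevant defaults.
val df = spark.range(0, 1000000).toDF("id")
println(df.rdd.getNumPartitions)                        // partitions of this DataFrame
println(spark.sparkContext.defaultParallelism)          // default for parallelized data
println(spark.conf.get("spark.sql.shuffle.partitions")) // partitions after a shuffle (default 200)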

The number of partitions in the crossJoin result

Experiment 1: Small data volume

When the input DataFrames are small, the number of partitions after crossJoin equals the partition count of one of the inputs.

scala> val xDF = (1 to 1000).toList.toDF("x")
scala> xDF.rdd.partitions.size
res11: Int = 64
scala> val yDF = (1 to 1000).toList.toDF("y")
scala> yDF.rdd.partitions.size
res12: Int = 64
scala> val crossJoinDF = xDF.crossJoin(yDF)
scala> crossJoinDF.rdd.partitions.size
res13: Int = 64
scala> val zDF = yDF.repartition(128)
scala> zDF.rdd.partitions.size
res5: Int = 128
scala> val xzcrossJoinDF = xDF.crossJoin(zDF)
scala> xzcrossJoinDF.rdd.partitions.size
res6: Int = 64  
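Why does the result keep 64 partitions even when one input has 128? A plausible explanation (worth verifying on your own Spark version): with inputs this small, Spark plans the cross join as a BroadcastNestedLoopJoin, broadcasting one side to every executor, so the output inherits the partitioning of the streamed side, in this case xDF's 64 partitions. The physical plan Spark chose can be checked directly:

// Show which physical join strategy was actually selected.
xzcrossJoinDF.explain()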

Experiment 2: Large data volume

When the input DataFrames are large, the number of partitions after crossJoin equals the product of the two inputs' partition counts.

scala> val xDF = (1 to 1000000).toList.toDF("x")
scala> xDF.rdd.partitions.size
res15: Int = 2
scala> val yDF = (1 to 1000000).toList.toDF("y")
scala> yDF.rdd.partitions.size
res16: Int = 2
scala> val crossJoinDF = xDF.crossJoin(yDF)
scala> crossJoinDF.rdd.partitions.size
res17: Int = 4

What determines the number of partitions after crossJoin?

My understanding: Spark tries to keep the amount of data in each partition small enough for tasks to finish quickly. When the inputs themselves are large, the Cartesian product blows the row count up to M*N, so Spark increases the number of partitions to keep the per-partition data volume down.
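The mechanism behind this, hedged on how recent Spark versions plan cross joins: when one input is small enough to broadcast, Spark uses a broadcast nested loop join and the output keeps the streamed side's partition count; otherwise it falls back to a Cartesian product whose underlying RDD has M * N partitions. The broadcast decision is governed by spark.sql.autoBroadcastJoinThreshold:

// Default threshold is 10 MB; inputs below it are eligible for broadcasting.
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))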


What impact does the number of partitions have on computational efficiency?

The number of partitions of a DataFrame affects computation efficiency in two ways:

  1. With too few partitions, parallelism is low, the allocated resources are underutilized, and the job never reaches peak efficiency.
  2. With too many partitions, the overhead of scheduling and managing a flood of tiny tasks reduces computing efficiency.

A DataFrame produced by crossJoin falls into the second category: with far too many partitions, any operation on it becomes very slow, and the following exception may also occur:

org.apache.spark.SparkException: Job aborted due to stage failure:
Total size of serialized results of 147936 tasks (1024.0 MB) is bigger than
spark.driver.maxResultSize (1024.0 MB)

The cause of this exception is simple. The driver keeps metadata for every task and tracks its execution; when an executor finishes a task, it sends the task's status and result data back to the driver. A huge number of partitions generates an equally huge amount of status information, and the accumulated results sent back to the driver exceed the default limit of spark.driver.maxResultSize.

The fix for the exception itself is equally simple: raise spark.driver.maxResultSize. But this only silences the error; it does nothing to improve computational efficiency.
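spark.driver.maxResultSize is a driver-side setting, so it must be in place before the driver starts. A sketch of two common ways to raise it (the 4g value and the app name are just examples):

// On the command line when launching the shell or submitting a job:
//   spark-shell --conf spark.driver.maxResultSize=4g
// Or when building the session programmatically:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("crossjoin-demo")                  // hypothetical app name
  .config("spark.driver.maxResultSize", "4g") // default is 1g
  .getOrCreate()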


How to speed up the computation?

This is also very simple: reduce the number of partitions of the input DataFrames before the cross join, as in the sketch below.
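A minimal sketch, reusing the df1/df2 from the experiment that follows (the partition count 40 is the one chosen there; coalesce is an alternative to repartition that avoids a full shuffle when only shrinking):

// Shrink both inputs before the cross join.
val smallDF1 = df1.coalesce(40)             // narrow dependency, no shuffle
val smallDF2 = df2.coalesce(40)
val joined   = smallDF1.crossJoin(smallDF2) // 40 * 40 = 1600 partitions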

Continue experimenting:

variable    row count    partitions
df1         17,000       200
df2         15,000       200
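The time helper used in the transcripts below is not shown in the original session; a minimal sketch of what it presumably looks like:

// Hypothetical helper: measures wall-clock time of a block in nanoseconds.
def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block
  println(s"Elapsed time: ${System.nanoTime() - t0}ns")
  result
}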
scala> df1.count()
res73: Long = 17000
scala> df1.rdd.partitions.size
res74: Int = 200
scala> df2.count()
res75: Long = 15000
scala> df2.rdd.partitions.size
res76: Int = 200
scala> val finalDF = df1.crossJoin(df2)
scala> finalDF.rdd.partitions.size 
res77: Int = 40000
scala> time { finalDF.count() }
Elapsed time: 285163427988ns
res78: Long = 255000000

Adjust the number of partitions of df1 and df2 to 40

scala> val df1 = df1.repartition(40)
scala> df1.rdd.partitions.size 
res80: Int = 40
scala> val df2 = df2.repartition(40)
scala> df2.rdd.partitions.size 
res81: Int = 40
scala> val finalDF = df1.crossJoin(df2)
scala> finalDF.rdd.partitions.size 
res82: Int = 1600
scala> time { finalDF.count() }
Elapsed time: 47178149994ns
res86: Long = 255000000

As the timings show, the run before adjustment took about 285 seconds versus about 47 seconds after, so the unadjusted version was roughly 6 times slower.


Reference

https://towardsdatascience.com/make-computations-on-large-cross-joined-spark-dataframes-faster-6cc36e61a222

Origin: blog.csdn.net/yy_diego/article/details/128568343