Optimization based on Spark Shuffle Block




org.apache.spark.storage.DiskStore

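  if (length < minMemoryMapBytes) {
    // Small block: copy it into an on-heap buffer.
    val buf = ByteBuffer.allocate(length.toInt)
    channel.position(offset)
    while (buf.remaining() != 0) {
      if (channel.read(buf) == -1) {
        throw new IOException("Reached EOF before filling buffer\n" +
          s"offset=$offset\nfile=${file.getAbsolutePath}\nbuf.remaining=${buf.remaining}")
      }
    }
    buf.flip()
    Some(buf)
  } else {
    // Large block: memory-map the file region.
    Some(channel.map(MapMode.READ_ONLY, offset, length))
  }

In both branches the block is returned as a ByteBuffer, whose capacity is an Int, so a single block read this way cannot exceed 2 GB.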


When Spark SQL performs an aggregation (i.e., a shuffle), it uses 200 partitions by default.

  Controlled by the parameter spark.sql.shuffle.partitions

The fewer the partitions, the larger each shuffle block.

With a very large amount of data, the default of 200 partitions may not be enough.

When the data is skewed, the blocks of a few partitions become too large.



Solution

In Spark SQL, increase the number of partitions so that each shuffle block is smaller

  Increase the value of spark.sql.shuffle.partitions

Avoid data skew

In Spark RDD, set the number of partitions with repartition or coalesce

  rdd.repartition()
  rdd.coalesce()
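
As a rough illustration of the settings above, the following sketch changes the partition count in both Spark SQL and the RDD API. It assumes Spark is on the classpath and runs in local mode; the object name, the application name, and the values 400 and 100 are placeholders chosen for the example, not recommendations.

```scala
import org.apache.spark.sql.SparkSession

object ShufflePartitionsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-partitions-example") // hypothetical name
      .master("local[*]")
      .getOrCreate()

    // Spark SQL: raise the shuffle partition count from the default of 200.
    spark.conf.set("spark.sql.shuffle.partitions", "400")

    // RDD: repartition always shuffles and can raise or lower the count.
    val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 8)
    val more = rdd.repartition(400)

    // coalesce avoids a shuffle by default, so it is used to lower the count.
    val fewer = more.coalesce(100)

    println(s"${rdd.getNumPartitions} -> ${more.getNumPartitions} -> ${fewer.getNumPartitions}")
    spark.stop()
  }
}
```

Because coalesce does not shuffle unless asked to, it suits reducing the number of partitions; to increase it, repartition is the usual choice.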

How to determine the number of partitions

Rule of thumb: aim for about 128 MB per partition
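
The rule of thumb can be turned into a small calculation. The sketch below is only an illustration: the object name PartitionSizing and the helper partitionsFor are hypothetical, and it assumes the total size of the shuffled data is already known or estimated.

```scala
object PartitionSizing {
  // Target size per partition, taken from the ~128 MB rule of thumb.
  val TargetPartitionBytes: Long = 128L * 1024 * 1024

  /** Number of partitions so that each one holds at most about targetBytes. */
  def partitionsFor(totalBytes: Long, targetBytes: Long = TargetPartitionBytes): Int = {
    require(totalBytes >= 0, "totalBytes must be non-negative")
    require(targetBytes > 0, "targetBytes must be positive")
    // Ceiling division without risking overflow on very large inputs.
    val full = totalBytes / targetBytes
    val n = if (totalBytes % targetBytes == 0) full else full + 1
    // Always use at least one partition and stay within Int range.
    math.min(math.max(1L, n), Int.MaxValue.toLong).toInt
  }
}
```

For example, 100 GB of shuffle data gives PartitionSizing.partitionsFor(100L * 1024 * 1024 * 1024) = 800 partitions.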


During a shuffle, Spark uses different data structures to record the shuffle output sizes depending on whether the number of partitions is greater than 2000 or not.

  org.apache.spark.scheduler.MapStatus

  def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
    if (uncompressedSizes.length > 2000) {
      HighlyCompressedMapStatus(loc, uncompressedSizes)
    } else {
      new CompressedMapStatus(loc, uncompressedSizes)
    }
  }



  Number of partitions > 2000 (HighlyCompressedMapStatus) vs. number of partitions <= 2000 (CompressedMapStatus)

Suggestion: when the number of partitions in a Spark application is below 2000 but very close to it, raise the number of partitions to slightly above 2000.
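
A minimal sketch of this suggestion follows. The cut-off of 2000 comes from the MapStatus excerpt above; the object name, the adjust helper, and the margin that defines "very close" (200 here) are assumptions made for illustration only.

```scala
object MapStatusThreshold {
  // Cut-off hard-coded in the MapStatus.apply excerpt.
  val Threshold = 2000
  // Hypothetical definition of "very close": within 200 below the cut-off.
  val NearMargin = 200

  /** Bumps a partition count just over the cut-off when it sits slightly below or at it. */
  def adjust(numPartitions: Int): Int =
    if (numPartitions > Threshold - NearMargin && numPartitions <= Threshold) Threshold + 1
    else numPartitions
}
```

With these assumed values, a job planned for 1950 partitions would be run with 2001 instead, while a job with 500 partitions is left unchanged.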




