org.apache.spark.storage.DiskStore
if (length < minMemoryMapBytes) {
  // Small blocks: read the bytes directly into a heap buffer.
  val buf = ByteBuffer.allocate(length.toInt)
  channel.position(offset)
  while (buf.remaining() != 0) {
    if (channel.read(buf) == -1) {
      throw new IOException("Reached EOF before filling buffer\n" +
        s"offset=$offset\nfile=${file.getAbsolutePath}\nbuf.remaining=${buf.remaining}")
    }
  }
  buf.flip()
  Some(buf)
} else {
  // Large blocks: memory-map the file region instead of copying it.
  // minMemoryMapBytes is controlled by spark.storage.memoryMapThreshold.
  Some(channel.map(MapMode.READ_ONLY, offset, length))
}
When Spark SQL performs an aggregation (i.e., a shuffle), it uses 200 partitions by default,
controlled by the parameter spark.sql.shuffle.partitions.
The smaller the number of partitions, the larger each shuffle block becomes, so:
With a very large amount of data, the default 200 partitions may not be enough.
If the data is skewed, the blocks of a few partitions become excessively large.
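As a minimal sketch (the application name is a placeholder), the current value can be inspected through the SparkSession configuration:

import org.apache.spark.sql.SparkSession

// Minimal sketch; the application name is a placeholder.
val spark = SparkSession.builder().appName("shuffle-demo").getOrCreate()

// Prints "200" unless the default has been overridden.
println(spark.conf.get("spark.sql.shuffle.partitions"))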
==== Optimization based on Spark Shuffle Block ====
Solution:
In Spark SQL, increase the number of partitions to reduce the shuffle block size:
increase the value of spark.sql.shuffle.partitions.
Avoid data skew.
In Spark RDD, use repartition() or coalesce():
rdd.repartition() or rdd.coalesce()
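A minimal sketch of both approaches (the partition counts of 800 and 200 are illustrative, not prescriptive):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("shuffle-tuning").getOrCreate()

// Spark SQL: raise the shuffle partition count above the default 200.
spark.conf.set("spark.sql.shuffle.partitions", "800")

// Spark RDD: repartition() always shuffles and can grow the count;
// coalesce() avoids a shuffle and can only shrink it (unless shuffle = true is passed).
val rdd = spark.sparkContext.parallelize(1 to 1000000)
val widened = rdd.repartition(800)
val narrowed = widened.coalesce(200)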
How to determine the number of partitions
Rule of thumb: aim for roughly 128 MB of data per partition.
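A back-of-the-envelope example of that rule (the 100 GB shuffle volume is hypothetical):

// Hypothetical sizing: ~100 GB of shuffle data at ~128 MB per partition.
val totalShuffleBytes = 100L * 1024 * 1024 * 1024 // ~100 GB
val targetPartitionBytes = 128L * 1024 * 1024     // ~128 MB rule of thumb
val numPartitions = math.ceil(totalShuffleBytes.toDouble / targetPartitionBytes).toInt
// numPartitions == 800 for this example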
During shuffle, Spark uses different data structures to record shuffle block sizes,
depending on whether the number of partitions is greater than 2000 or not.
org.apache.spark.scheduler.MapStatus
def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
  // One entry per reduce partition; above 2000 the compact form is used.
  if (uncompressedSizes.length > 2000) {
    HighlyCompressedMapStatus(loc, uncompressedSizes)
  } else {
    new CompressedMapStatus(loc, uncompressedSizes)
  }
}
Number of partitions > 2000 vs number of partitions <= 2000:
CompressedMapStatus records a compressed size for every block, while
HighlyCompressedMapStatus keeps only the average block size plus a bitmap of empty blocks,
which is far cheaper for the driver to hold.
Suggestion: when the number of partitions of a Spark application is less than 2000
but very close to it, adjust it to be slightly larger than 2000.
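A sketch of that suggestion (assumes a SparkSession named spark; 2001 is simply the smallest count past the threshold in MapStatus.apply above):

// Nudging the count past 2000 switches Spark to HighlyCompressedMapStatus.
spark.conf.set("spark.sql.shuffle.partitions", "2001")
// or, for an RDD:
// rdd.repartition(2001)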