Understanding Spark's partitioning strategy from the perspective of operators

Table of contents

1. Overview

2. Understanding Spark partitioning from the perspective of operators

1. Source operators

2. Transformation operators

① repartition & coalesce

② groupBy & groupByKey & partitionBy(new HashPartitioner(num)) & reduceByKey... & repartitionAndSortWithinPartitions(new HashPartitioner(10)) & join...

③ sortBy & sortByKey & partitionBy(new RangePartitioner(num, rdd)) & repartitionAndSortWithinPartitions(new RangePartitioner(num, rdd))

3. Action operators


1. Overview

       First of all, my understanding of a "partitioning strategy" is simply: where does each record go? I have read a lot online about Spark's partitioning strategy. Although I knew that Spark has defaultPartitioner, HashPartitioner, and RangePartitioner, I was still confused when I returned to actual work. For example: when is a partitioning strategy used at all? And which specific strategy applies in each case? So I did a little research and summarized it here.

2. Understanding Spark partitioning from the perspective of operators

1. Source operators

       In daily work, most source operators read data from HDFS, MySQL, local files, and so on. Here, the partition a record lands in has essentially nothing to do with any partitioning strategy; it depends on the external system. For example, when reading an HDFS file, Spark computes the number of partitions from the block size and the partition size you configure, then reads the data block by block, so a record's partition depends on which block it belongs to. Another example is MySQL: Spark usually reads MySQL with a single task, and if you read with multiple tasks, the partition a record goes to depends on the split conditions you configure.
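The file-source case above can be illustrated with a small sketch. This is a simplified model of the idea (partition count derived from file size and split size), not Spark's exact internal formula; the object and method names are my own:

```scala
// Simplified sketch: how a file source's partition count can be derived
// from the file size and a split size. Illustrative only, not Spark's
// exact computation.
object FileSplitSketch {
  def numPartitions(fileSizeBytes: Long,
                    blockSizeBytes: Long,
                    maxPartitionBytes: Long): Long = {
    // The split size is capped by both the HDFS block size and the
    // configured maximum partition size.
    val splitSize = math.min(blockSizeBytes, maxPartitionBytes)
    // Round up so a trailing, smaller chunk still gets its own partition.
    (fileSizeBytes + splitSize - 1) / splitSize
  }
}
```

For instance, a 1 GiB file with 128 MiB blocks and a 128 MiB max partition size yields 8 partitions, and each record's partition is simply determined by which split contains it.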

2. Transformation operators

      Transformation operators differ in the partitioning strategies they use, but they can be summarized into the following categories:

① repartition & coalesce

       Under the hood, repartition calls coalesce with shuffle enabled (coalesce without shuffle is not discussed here). When either is called, whether or not the data is of key/value type, the records in each upstream partition are traversed and distributed round-robin to the downstream partitions: each upstream partition starts at a random downstream partition ID, which is then incremented by 1 for each record.
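This placement logic can be sketched as follows. It is a paraphrase of the shuffle path inside RDD.coalesce (which seeds a random starting position per upstream partition and then increments it per record); the object and method names here are illustrative, not Spark's:

```scala
import scala.util.Random

// Sketch of repartition's round-robin placement: each upstream partition
// picks a random starting target partition, then assigns its records
// round-robin from there.
object RoundRobinSketch {
  def targetPartitions[T](upstreamIndex: Int,
                          items: Seq[T],
                          numPartitions: Int): Seq[Int] = {
    // A random start keeps different upstream partitions from all piling
    // their first records into partition 0.
    var position = new Random(upstreamIndex).nextInt(numPartitions)
    items.map { _ =>
      position += 1                // advance by 1 per record
      position % numPartitions     // wrap around the partition count
    }
  }
}
```

Running this over one upstream partition's records shows consecutive records landing in consecutive downstream partitions, which is why repartition spreads data evenly regardless of keys.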

② groupBy & groupByKey & partitionBy(new HashPartitioner(num)) & reduceByKey... & repartitionAndSortWithinPartitions(new HashPartitioner(10)) & join...

     Under the hood, these operators all use HashPartitioner. Here are a few examples:

def groupBy[K](
      f: T => K,
      numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
    groupBy(f, new HashPartitioner(numPartitions))
  }

def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
    // defaultPartitioner also uses HashPartitioner under the hood
    groupByKey(defaultPartitioner(self))
  }

def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {
    reduceByKey(new HashPartitioner(numPartitions), func)
  }

// join, cartesian, and intersection are all implemented on top of cogroup
def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
    cogroup(other, defaultPartitioner(self, other))
  }

       As groupBy shows, HashPartitioner is not limited to key/value data; non-key/value data can use it too, although under the hood the operator derives a key from each record and then calls groupByKey.
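The placement rule all of these operators share is HashPartitioner's getPartition. A minimal sketch of that logic (the key's hashCode modulo the partition count, adjusted so negative hash codes still map to a valid partition, mirroring Spark's Utils.nonNegativeMod; the object name here is my own):

```scala
// Sketch of HashPartitioner.getPartition: hashCode modulo numPartitions,
// shifted into the non-negative range when the hash code is negative.
object HashPartitionSketch {
  def getPartition(key: Any, numPartitions: Int): Int = {
    val rawMod = key.hashCode % numPartitions
    // Java's % can return a negative value for a negative hash code;
    // add numPartitions once to land back in [0, numPartitions).
    rawMod + (if (rawMod < 0) numPartitions else 0)
  }
}
```

Because the result depends only on the key, equal keys always land in the same partition, which is exactly what groupByKey and reduceByKey need in order to see all values for a key together.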

③ sortBy & sortByKey & partitionBy(new RangePartitioner(num, rdd)) & repartitionAndSortWithinPartitions(new RangePartitioner(num, rdd))

      Under the hood, sortBy is sortByKey, which uses RangePartitioner:

def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)] = self.withScope
  {
    val part = new RangePartitioner(numPartitions, self, ascending)
    new ShuffledRDD[K, V, V](self, part)
      .setKeyOrdering(if (ascending) ordering else ordering.reverse)
  }

        I haven't read the RangePartitioner source code in depth yet, but from testing, identical keys still land in the same partition; its "as balanced as possible" behavior refers to spreading different keys across partitions. HashPartitioner, by contrast, places a key by its hash value % numPartitions, which is why HashPartitioner is said to be more prone to data skew.
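The core idea of range partitioning can be sketched like this: a sorted list of boundary keys (in Spark these come from sampling the RDD) splits the key space into contiguous ranges, and a key goes to the first range whose upper bound is at least the key. This is a simplified illustration with my own names, not Spark's implementation, which binary-searches the bounds:

```scala
// Sketch of the RangePartitioner idea: sorted boundary keys define
// contiguous ranges; a key's partition is the first range whose upper
// bound is >= the key. Equal keys therefore always land together.
object RangePartitionSketch {
  // `bounds` holds numPartitions - 1 sorted upper bounds.
  def getPartition(key: Int, bounds: Seq[Int]): Int = {
    val idx = bounds.indexWhere(key <= _)  // linear scan for clarity
    if (idx < 0) bounds.length else idx    // past the last bound -> last partition
  }
}
```

Because partitions correspond to key ranges, the output is globally ordered across partitions, which is what sortByKey relies on.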

3. Action operators

        Action operators have no partitioning strategy: results are either returned to the driver or written to an external system.


Origin blog.csdn.net/m0_64640191/article/details/129711861