Spark Operators: repartitionAndSortWithinPartitions

Spark provides the repartitionAndSortWithinPartitions operator. Let's start with what this operator is for:

This operator repartitions the data with a specified partitioner and sorts the records within each resulting partition.

It therefore covers requirements such as the following:

   Example 1: send students in the same class to the same partition and sort them by score in descending order.

   Example 2: send records with the same composite key to the same partition, sorted first by the primary key and, when the primary key ties, by another field.

First, the function description from the official documentation:

URL: http://spark.apache.org/docs/latest/rdd-programming-guide.html#working-with-key-value-pairs

repartitionAndSortWithinPartitions(partitioner) Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.

  As you can see, repartitionAndSortWithinPartitions uses the given partitioner to send elements with the same key to the designated partition, and sorts the records within each partition by key. Tip: by supplying a custom ordering, we can also implement a secondary sort.

  In addition, repartitionAndSortWithinPartitions is an efficient operator: it is faster than calling repartition and then sorting within each partition, because the sort is pushed down into the shuffle machinery and records are sorted as they are shuffled (see the shuffle read path in Spark for details). A short comparison sketch follows.
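
To make the contrast concrete, here is a minimal sketch (the helper names and the RDD[(String, Int)] type are made up for illustration). Both functions produce the same partitioning and the same per-partition key order, but the first sorts each partition in user code after the shuffle, while the second pushes the sort into the shuffle itself:

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Approach 1: shuffle first, then sort every partition in user code
// (each partition is materialized as a Seq before it can be sorted).
def sortAfterShuffle(rdd: RDD[(String, Int)], numPartitions: Int): RDD[(String, Int)] =
  rdd.partitionBy(new HashPartitioner(numPartitions))
    .mapPartitions(iter => iter.toSeq.sortBy(_._1).iterator, preservesPartitioning = true)

// Approach 2: a single operator; records are sorted while the shuffle data is written and read.
def sortDuringShuffle(rdd: RDD[(String, Int)], numPartitions: Int): RDD[(String, Int)] =
  rdd.repartitionAndSortWithinPartitions(new HashPartitioner(numPartitions))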

Let's take a quick look at the source (sortByKey is shown for comparison; both operators build a ShuffledRDD with a key ordering):

def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)] = self.withScope
  {
    val part = new RangePartitioner(numPartitions, self, ascending)
    new ShuffledRDD[K, V, V](self, part)
      .setKeyOrdering(if (ascending) ordering else ordering.reverse)
  }

  /**
   * Repartition the RDD according to the given partitioner and, within each resulting partition,
   * sort records by their keys.
   *
   * This is more efficient than calling `repartition` and then sorting within each partition
   * because it can push the sorting down into the shuffle machinery.
   */
 def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
    new ShuffledRDD[K, V, V](self, partitioner).setKeyOrdering(ordering)
  }
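
One detail worth noting: the ordering passed to setKeyOrdering above is the implicit Ordering[K] that OrderedRDDFunctions captures for the key type, so repartitionAndSortWithinPartitions is only available when such an ordering is in scope. For a custom key type, the usual place to provide it is the companion object. A minimal sketch (CustomKey is a hypothetical type used only for illustration):

// repartitionAndSortWithinPartitions comes from OrderedRDDFunctions, which Spark
// only applies when an implicit Ordering exists for the key type K.
case class CustomKey(group: String, rank: Int)

object CustomKey {
  // Sort by group first, then by rank, both ascending.
  implicit val ordering: Ordering[CustomKey] = Ordering.by(k => (k.group, k.rank))
}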

Now let's walk through a concrete requirement to see how the operator is used:

 Example 1: send students in the same class to the same partition and sort them by score in descending order.

Implementation:

package com.gaosi.spark.demo

/**
  * Created by szh on 2019/9/19.
  */

import org.apache.spark.{SparkConf, SparkContext}


class Student {

}


//Key class: the composite key is (grade, score)
case class StudentKey(grade: String, score: Int)
//  extends Ordered[StudentKey] {
//  def compare(that: StudentKey): Int = {
//    var result: Int = this.grade.compareTo(that.grade)
//    if (result == 0) {
//      result = that.score.compareTo(this.score)
//    }
//    result
//  }
//}

object StudentKey {
  implicit def orderingByGradeStudentScore[A <: StudentKey]: Ordering[A] = {
    Ordering.by(fk => (fk.grade, fk.score * -1))
  }
}

//Partitioner class
import org.apache.spark.Partitioner

class StudentPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[StudentKey]
    Math.abs(k.grade.hashCode()) % numPartitions
  }
}

object Student {


  def main(args: Array[String]) {


    //Field indices within each input record
    val grade_idx: Int = 0
    val student_idx: Int = 1
    val course_idx: Int = 2
    val score_idx: Int = 3

    //Conversion helper: values that cannot be parsed as Int default to 0
    def safeInt(s: String): Int = try {
      s.toInt
    } catch {
      case _: Throwable => 0
    }

    //Build the composite key from a record
    def createKey(data: Array[String]): StudentKey = {
      StudentKey(data(grade_idx), safeInt(data(score_idx)))
    }

    //Build the value: all fields of the record as a list
    def listData(data: Array[String]): List[String] = {
      List(data(grade_idx), data(student_idx), data(course_idx), data(score_idx))
    }

    def createKeyValueTuple(data: Array[String]): (StudentKey, List[String]) = {
      (createKey(data), listData(data))
    }



    //Set master to local for local debugging
    val conf = new SparkConf().setAppName("Student_partition_sort").setMaster("local")
    val sc = new SparkContext(conf)

    //The student records are deliberately out of order
    val student_array = Array(
      "c001,n003,chinese,59",
      "c002,n004,english,79",
      "c002,n004,chinese,13",
      "c001,n001,english,88",
      "c001,n002,chinese,10",
      "c002,n006,chinese,29",
      "c001,n001,chinese,54",
      "c001,n002,english,32",
      "c001,n003,english,43",
      "c002,n005,english,80",
      "c002,n005,chinese,48",
      "c002,n006,english,69"
    )
    //Parallelize the student records into an RDD
    val student_rdd = sc.parallelize(student_array)
    //Build a key-value RDD of (StudentKey, List[String])
    val student_rdd2 = student_rdd.map(line => line.split(",")).map(createKeyValueTuple)
    //Partition by the grade field of StudentKey and sort by score in descending order within each partition
    val student_rdd3 = student_rdd2.repartitionAndSortWithinPartitions(new StudentPartitioner(10))
    //Print the results
    student_rdd3.collect.foreach(println)
  }
}

Output:

(StudentKey(c001,88),List(c001, n001, english, 88))
(StudentKey(c001,59),List(c001, n003, chinese, 59))
(StudentKey(c001,54),List(c001, n001, chinese, 54))
(StudentKey(c001,43),List(c001, n003, english, 43))
(StudentKey(c001,32),List(c001, n002, english, 32))
(StudentKey(c001,10),List(c001, n002, chinese, 10))
(StudentKey(c002,80),List(c002, n005, english, 80))
(StudentKey(c002,79),List(c002, n004, english, 79))
(StudentKey(c002,69),List(c002, n006, english, 69))
(StudentKey(c002,48),List(c002, n005, chinese, 48))
(StudentKey(c002,29),List(c002, n006, chinese, 29))
(StudentKey(c002,13),List(c002, n004, chinese, 13))
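
The output above shows the sort order but not which partition each record landed in. To verify the partitioning as well, a quick check like the following can be appended to the example (a small sketch; mapPartitionsWithIndex tags every record with the index of its partition):

//Tag each record with the index of the partition it ended up in
student_rdd3
  .mapPartitionsWithIndex((idx, iter) => iter.map(kv => (idx, kv)))
  .collect
  .foreach(println)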

There are two points to note here.

First, the partitioner:

import org.apache.spark.Partitioner

class StudentPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[StudentKey]
    Math.abs(k.grade.hashCode()) % numPartitions
  }
}

Note that hashCode may be negative, so Math.abs must be applied, as the small illustration below shows.
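
A quick illustration in plain Scala (no Spark needed): the JVM's % operator keeps the sign of the dividend, so a negative hashCode would yield a negative, and therefore invalid, partition index.

val h = -7           // stand-in for a negative hashCode
h % 10               // -7: not a valid partition index
Math.abs(h) % 10     // 7: valid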

Second, the ordering implementation:

object StudentKey {
  implicit def orderingByGradeStudentScore[A <: StudentKey]: Ordering[A] = {
    Ordering.by(fk => (fk.grade, fk.score * -1))
  }
}
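
As a quick sanity check of this ordering (a small sketch reusing StudentKey from the example above), sorting a few keys directly shows grades ascending and, within the same grade, scores descending:

//Negating the score turns the ascending tuple sort into a descending sort on score
val keys = List(StudentKey("c002", 79), StudentKey("c001", 59), StudentKey("c001", 88))
keys.sorted
// => List(StudentKey(c001,88), StudentKey(c001,59), StudentKey(c002,79))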
Reposted from blog.csdn.net/u010003835/article/details/101000077