Spark RDD Operator Summary

1. map

  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (_, _, iter) => iter.map(cleanF))
  }

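map processes elements one at a time: one element in, one element out. The number of elements does not change, but the element type may.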

    // map
    val dataRDD: RDD[Int] = sc.makeRDD(List(1,2,3,4))
    val mapRDD = dataRDD.map(_*2)
    mapRDD.collect().foreach(println(_))

    -------------print--------------
    2
    4
    6
    8

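map runs partition by partition. Within a single partition, execution is effectively single-threaded and ordered; across partitions, no order is guaranteed. Records are also processed one at a time: a record passes through all of the downstream logic before the next record starts.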

    val dataRDD: RDD[Int] = sc.makeRDD(List(1,2,3,4), 2)
    dataRDD.saveAsTextFile("data")
    dataRDD.map(
      num => {
        println(">>>>>> " + num * 2)
        num * 2
      }
    ).map(
      num => {
        println("!!!!!! " + num * 2)
        num * 2
      }
    ).collect()
    -------------print--------------
    >>>>>> 2
    !!!!!! 4
    >>>>>> 4
    !!!!!! 8
    >>>>>> 6
    !!!!!! 12
    >>>>>> 8
    !!!!!! 16

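Here 1 and 2 are in one partition, and 3 and 4 are in the other. Because the two partitions run independently, lines from different partitions may interleave differently from run to run.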

Question: if map is followed by a mapPartitions operator, does the next piece of logic wait until all the data has been processed?
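
One way to reason about this question is to model a single partition with a plain Scala Iterator, since the map source above simply wraps the partition's iterator with iter.map(cleanF). The sketch below is not Spark code; PipelineSketch and its log buffer are hypothetical names used only for illustration. Under that assumption, the answer depends on what the mapPartitions function does with its iterator: if it returns a lazy view, records still flow one at a time, and if it materializes the iterator, the upstream map runs for the whole partition first.

```scala
import scala.collection.mutable.ArrayBuffer

object PipelineSketch {
  // Models one partition as a lazy Iterator and records the execution order.
  def run(materialize: Boolean): List[String] = {
    val log = ArrayBuffer.empty[String]
    val partition = Iterator(1, 2)

    // Stand-in for rdd.map(...): lazy, nothing runs until an element is pulled.
    val mapped = partition.map { n => log += s"map $n"; n * 2 }

    // Stand-in for the function passed to mapPartitions.
    val out: Iterator[Int] =
      if (materialize) {
        val all = mapped.toList // forces the upstream map for the whole partition
        log += "mapPartitions body"
        all.iterator.map { n => log += s"next $n"; n + 1 }
      } else {
        log += "mapPartitions body"
        mapped.map { n => log += s"next $n"; n + 1 } // stays lazy
      }

    out.foreach(_ => ()) // stand-in for an action such as collect()
    log.toList
  }
}
```

With materialize = false the log alternates between "map" and "next" entries; with materialize = true both "map" entries come before the first "next" entry.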

2. mapPartitions

  def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (_: TaskContext, _: Int, iter: Iterator[T]) => cleanedF(iter),
      preservesPartitioning)
  }

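Takes an iterator and returns an iterator. The number of output elements may differ from the number of input elements. The function runs once per partition, which gives batch-like processing and can be somewhat more efficient than map. However, if the function materializes the partition or keeps references to its data, that memory cannot be released until the partition is finished, which risks an out-of-memory error.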

    val dataRDD: RDD[Int] = sc.makeRDD(List(1,2,3,4),2)
    val mapPartRDD: RDD[Int] = dataRDD.mapPartitions(
      it => {
        println("######## ")
        it.map(_ + 1)
      }
    )
    mapPartRDD.collect().foreach(println(_))

    -------------print--------------
    ######## 
    ######## 
    2
    3
    4
    5

The output shows that println ran twice in total, once per partition.

    // Find the maximum value in each partition
    val rdd = sc.makeRDD(List(1,2,3,4), 2)
    val mpRDD = rdd.mapPartitions(
      iter => {
        // Note: iter.max throws on an empty partition
        List(iter.max).iterator
      }
    )
    mpRDD.collect().foreach(println)

    -------------print--------------
    2
    4
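
To illustrate both points made about mapPartitions (the output count may differ from the input count, and holding onto partition data is what creates the memory risk), here is a minimal sketch that reduces each partition to a single (sum, count) pair while consuming the iterator one element at a time. The local[2] master, the app name, and the PartitionStats object are assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionStats {
  // Returns one (sum, count) pair per partition.
  def run(): Array[(Int, Int)] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("mapPartitions-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)
      rdd.mapPartitions { iter =>
        // foldLeft pulls elements one by one, so the partition is
        // never held in memory as a whole (unlike iter.toList).
        val (sum, count) = iter.foldLeft((0, 0)) {
          case ((s, c), n) => (s + n, c + 1)
        }
        Iterator((sum, count))
      }.collect()
    } finally {
      sc.stop()
    }
  }
}
```

Four input elements become two output elements here, (3,2) and (7,2), one per partition.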

3. mapPartitionsWithIndex

Takes a function of two arguments, the partition index and the partition's iterator, and returns an iterator.

  def mapPartitionsWithIndex[U: ClassTag](
      f: (Int, Iterator[T]) => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (_: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter),
      preservesPartitioning)
  }

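The computation runs per partition; each iterator covers all the data in one partition.
demo: keep only the data in partition 1 (partition indices start at 0)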

    val rdd = sc.makeRDD(List(1,2,3,3,3,4,3,4,3,2,1,4), 2)
    rdd.saveAsTextFile("data")
    val mpiRDD = rdd.mapPartitionsWithIndex(
      (index, iter) => {
        if ( index == 1 ) {
          iter
        } else {
          Nil.iterator
        }
      }
    )
    mpiRDD.collect().foreach(println)

    -------------print--------------
    3
    4
    3
    2
    1
    4
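
Another common use of the partition index is to tag every element with the partition it lives in, which makes the data layout visible without writing files. The following is a minimal sketch under the assumption of a local SparkContext; the PartitionTagging object and the local[2] master are illustrative choices, not part of the original demos.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionTagging {
  // Pairs each element with the index of the partition that holds it.
  def run(): Array[(Int, Int)] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("withIndex-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)
      rdd.mapPartitionsWithIndex { (index, iter) =>
        iter.map(value => (index, value))
      }.collect()
    } finally {
      sc.stop()
    }
  }
}
```

For this input the result is (0,1), (0,2), (1,3), (1,4), which matches the earlier observation that 1 and 2 share a partition and 3 and 4 share the other.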

4. flatMap

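Flattens the data: the function maps each element to a collection, and the resulting collections are concatenated into a single RDD.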

  /**
   *  Return a new RDD by first applying a function to all elements of this
   *  RDD, and then flattening the results.
   */
  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (_, _, iter) => iter.flatMap(cleanF))
  }

demo1: splitting strings

    val rdd: RDD[String] = sc.makeRDD(List(
      "Hello Scala", "Hello Spark"
    ))
    val flatRDD: RDD[String] = rdd.flatMap(
      s => {
        s.split(" ")
      }
    )
    flatRDD.collect().foreach(println)

    -------------print--------------
    Hello
    Scala
    Hello
    Spark

demo2: pattern matching

    val rdd: RDD[Any] = sc.makeRDD(List(
      List(1, 2), List(3, 4), "hello"
    ))
    val flatRDD: RDD[Any] = rdd.flatMap(
      data => {
        data match {
          case list: List[_] => list
          case dat => List(dat)
        }
      }
    )
    flatRDD.collect().foreach(println)

    -------------print--------------
    1
    2
    3
    4
    hello

5. glom

Converts the data in each partition into an array.

  /**
   * Return an RDD created by coalescing all elements within each partition into an array.
   */
  def glom(): RDD[Array[T]] = withScope {
    new MapPartitionsRDD[Array[T], T](this, (_, _, iter) => Iterator(iter.toArray))
  }

demo1

    val rdd: RDD[Int] = sc.makeRDD(List(1,2,3,4), 2)

    val glomRDD: RDD[Array[Int]] = rdd.glom()

    glomRDD.collect().foreach(data => println(data.mkString(",")))

    -------------print--------------
    1,2
    3,4

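Question: what happens if the elements in a partition are not all of the same type?
The RDD's element type is then Any, so glom produces arrays of type Array[Any]: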

    val rdd: RDD[Any] = sc.makeRDD(List(1,3,"a",2,4,"C"), 2)
    val glomRDD: RDD[Array[Any]] = rdd.glom()
    glomRDD.collect().foreach(
      data => {
        data.foreach(
          data1 => {
            data1 match {
              case a: Int => println(a + 1)
              case b: String => println(b)
              case _ => println("unknown")
            }
          }
        )
      }
    )
    -------------print--------------
    2
    4
    a
    3
    5
    C
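
Because glom exposes each partition as an array, it is convenient for per-partition computations followed by a combination step on the driver. As a minimal sketch (the GlomMaxSum object and the local[2] master are assumptions made for the example), the code below takes the maximum of each partition and then sums those maxima.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GlomMaxSum {
  // Sum of the per-partition maxima.
  def run(): Int = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("glom-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)
      rdd.glom()          // RDD[Array[Int]]: Array(1, 2) and Array(3, 4)
        .map(_.max)       // maximum of each partition: 2 and 4
        .collect()
        .sum              // combined on the driver
    } finally {
      sc.stop()
    }
  }
}
```

With the partitions [1,2] and [3,4] the maxima are 2 and 4, so the result is 6. Note that _.max would fail on an empty partition, so this sketch assumes every partition holds at least one element.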

6. groupBy

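Groups the data by the given rule. The number of partitions stays the same by default, but the data is shuffled and regrouped, so data skew is possible.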

  /**
   * Return an RDD of grouped items. Each group consists of a key and a sequence of elements
   * mapping to that key. The ordering of elements within each group is not guaranteed, and
   * may even differ each time the resulting RDD is evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   */
  def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
    groupBy[K](f, defaultPartitioner(this))
  }

As the source comment notes, groupBy only groups; it does not aggregate. If the goal is an aggregation, reduceByKey or aggregateByKey will perform much better.
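
To make the source comment concrete, the sketch below computes the same per-key sum in two ways: by grouping and then summing each group, and by mapping to key/value pairs and calling reduceByKey, which can combine values within each partition before the shuffle. The GroupVsReduce object and the local[2] master are assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GroupVsReduce {
  // Returns the per-key sums computed via groupBy and via reduceByKey.
  def run(): (Map[Int, Int], Map[Int, Int]) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("groupBy-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)

      // Every element is shuffled, then each group is summed.
      val viaGroupBy = rdd.groupBy(_ % 2).mapValues(_.sum).collect().toMap

      // Values are pre-aggregated per partition before the shuffle.
      val viaReduceByKey = rdd.map(n => (n % 2, n)).reduceByKey(_ + _).collect().toMap

      (viaGroupBy, viaReduceByKey)
    } finally {
      sc.stop()
    }
  }
}
```

Both approaches yield 0 -> 6 and 1 -> 4; the difference is in how much data crosses the shuffle, not in the result.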

groupBy takes a function and returns an RDD of tuples, RDD[(K, Iterable[T])].
demo1

    val rdd: RDD[Int] = sc.makeRDD(List(1,2,3,4), 2)
    val grpRDD: RDD[(Int, Iterable[Int])] = rdd.groupBy(
      data => data % 2
    )
    grpRDD.collect().foreach(println(_))

    -------------print--------------
    (0,CompactBuffer(2, 4))
    (1,CompactBuffer(1, 3))

The type parameter K in RDD[(K, Iterable[T])] is the type of the grouping key.
demo2

    val rdd: RDD[Int] = sc.makeRDD(List(1,2,3,4), 2)
    val grpRDD: RDD[(Int, Iterable[Int])] = rdd.groupBy(
      data => data % 2
    )
    grpRDD.flatMap(
      data => {
        data._2
      }
    ).collect().foreach(println(_))
    -------------print--------------
    2
    4
    1
    3

7. filter

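Takes a predicate: elements for which it returns true are kept, and the rest are discarded. Partitioning is unchanged, so if some partitions lose far more data than others, the partition sizes can become unbalanced (data skew).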

  /**
   * Return a new RDD containing only the elements that satisfy a predicate.
   */
  def filter(f: T => Boolean): RDD[T] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[T, T](
      this,
      (_, _, iter) => iter.filter(cleanF),
      preservesPartitioning = true)
  }

demo

    val dataRDD = sc.makeRDD(List(
      1,2,3,4
    ),1)
    val dataRDD1 = dataRDD.filter(_%2 == 0)
    dataRDD1.collect().foreach(println)
    -------------print--------------
    2
    4
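
The skew that filter can cause is easy to observe by counting the elements left in each partition. The sketch below is one possible illustration, assuming a local SparkContext; the FilterSkew object, the local[2] master, and the choice of coalesce(1) as a remedy are assumptions for the example rather than a recommendation from the original text.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FilterSkew {
  // Returns partition sizes right after filter, and after merging partitions.
  def run(): (List[Int], List[Int]) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("filter-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.makeRDD(1 to 8, 2)   // partitions: [1..4] and [5..8]
      val filtered = rdd.filter(_ > 4)  // empties the first partition entirely

      val sizesAfterFilter = filtered.glom().map(_.length).collect().toList
      val sizesAfterCoalesce = filtered.coalesce(1).glom().map(_.length).collect().toList

      (sizesAfterFilter, sizesAfterCoalesce)
    } finally {
      sc.stop()
    }
  }
}
```

After the filter, one partition is empty and the other still holds four elements; merging them with coalesce removes the empty partition.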

8. distinct

Removes duplicate elements.

  /**
   * Return a new RDD containing the distinct elements in this RDD.
   */
  def distinct(): RDD[T] = withScope {
    distinct(partitions.length)
  }

demo

    val rdd = sc.makeRDD(List(1,2,3,4,1,2,3,4))
    val rdd1: RDD[Int] = rdd.distinct()
    rdd1.collect().foreach(println)

    -------------print--------------
    1
    2
    3
    4

Question: does distinct shuffle the data, and does it reduce the number of partitions?

    val rdd = sc.makeRDD(List(1,2,2,3,2,1,3,4), 2)
    rdd.saveAsTextFile("data1")
    val distinctRDD = rdd.distinct()
    distinctRDD.saveAsTextFile("data2")
    distinctRDD.collect().foreach(println(_))
    -------------print--------------
    4
    2
    1
    3


The results show that distinct keeps the number of partitions unchanged, but the data is shuffled during execution.
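
The same conclusion can be checked without inspecting output files. The sketch below assumes that distinct behaves like a reduceByKey over (value, null) pairs, which is how it is commonly described; it also assumes that the lineage printed by toDebugString names a ShuffledRDD for such a job. The DistinctSketch object and the local[2] master are hypothetical choices for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DistinctSketch {
  // Returns (partition count, lineage mentions a shuffle, distinct result, manual result).
  def run(): (Int, Boolean, List[Int], List[Int]) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("distinct-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.makeRDD(List(1, 2, 2, 3, 2, 1, 3, 4), 2)
      val distinctRDD = rdd.distinct()

      // Assumed equivalent: key by the value, keep one entry per key, drop the dummy value.
      val manual = rdd.map(x => (x, null)).reduceByKey((x, _) => x).map(_._1)

      (
        distinctRDD.getNumPartitions,
        distinctRDD.toDebugString.contains("ShuffledRDD"),
        distinctRDD.collect().sorted.toList,
        manual.collect().sorted.toList
      )
    } finally {
      sc.stop()
    }
  }
}
```

The results are sorted before comparison because the order of elements after a shuffle is not guaranteed.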

9. coalesce

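Shrinks or grows the number of partitions. The first parameter is the new number of partitions. The second is a Boolean that controls whether to shuffle; it defaults to false and may be omitted.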

  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null): RDD[T]

demo

    val rdd = sc.makeRDD(List(1,2,3,4,5,6), 3)
    val newRDD1: RDD[Int] = rdd.coalesce(2)
    val newRDD2: RDD[Int] = rdd.coalesce(2,true)
    newRDD1.saveAsTextFile("output1")
    newRDD2.saveAsTextFile("output2")


To increase the number of partitions, shuffle must be enabled; otherwise the partition count cannot grow.

    val rdd = sc.makeRDD(List(1,2,3,4,5,6), 3)
    val newRDD1: RDD[Int] = rdd.coalesce(4)
    newRDD1.saveAsTextFile("output1")


    val rdd = sc.makeRDD(List(1,2,3,4,5,6), 3)
    val newRDD1: RDD[Int] = rdd.coalesce(4,true)
    newRDD1.saveAsTextFile("output2")

As the two output directories show, the number of partitions does not change unless true is passed to enable the shuffle.

10. repartition

Also changes the number of partitions; see the repartition source:

  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }

repartition is implemented on top of coalesce and always performs a shuffle.
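
The partition counts described for coalesce and repartition can be checked directly with getNumPartitions instead of counting output files. The sketch below is one way to do that, assuming a local SparkContext; the PartitionCounts object and the local[2] master are illustrative choices.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCounts {
  // Returns the partition counts produced by four ways of changing 3 partitions.
  def run(): List[Int] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("coalesce-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6), 3)
      List(
        rdd.coalesce(2).getNumPartitions,                 // shrink without shuffle
        rdd.coalesce(4).getNumPartitions,                 // grow without shuffle: stays at 3
        rdd.coalesce(4, shuffle = true).getNumPartitions, // grow with shuffle
        rdd.repartition(4).getNumPartitions               // same as coalesce(4, shuffle = true)
      )
    } finally {
      sc.stop()
    }
  }
}
```

For this input the counts are 2, 3, 4, and 4: only the shuffled variants are able to go above the original three partitions.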

11. sortBy

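Sorts the data by the key that the function in the first parameter extracts. The second parameter sets the sort direction and defaults to ascending (true). The third parameter sets the number of partitions of the result; if omitted, the partition count is unchanged.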

  /**
   * Return this RDD sorted by the given key function.
   */
  def sortBy[K](
      f: (T) => K,
      ascending: Boolean = true,
      numPartitions: Int = this.partitions.length)
      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
    this.keyBy[K](f)
        .sortByKey(ascending, numPartitions)
        .values
  }

demo1

    val rdd = sc.makeRDD(List(6,2,4,5,3,1), 2)

    val newRDD = rdd.sortBy(
      num => num,
      false
    )
    newRDD.saveAsTextFile("output")

The result has two partitions.
demo2

    val rdd = sc.makeRDD(List(6,2,4,5,3,1), 2)
    val newRDD = rdd.sortBy(
      num => num,
      false,
      1
    )
    newRDD.saveAsTextFile("output")

The result has one partition.
These results show that sortBy performs a shuffle.
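
Since sortBy sorts by whatever key the function extracts, it also works on elements that are not themselves the sort key. The sketch below sorts tuples by their second field and repeats the descending sort with an explicit partition count, using named arguments for readability. The SortBySketch object and the local[2] master are assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SortBySketch {
  // Returns (pairs sorted by value, numbers sorted descending, resulting partition count).
  def run(): (List[(String, Int)], List[Int], Int) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("sortBy-sketch")
    val sc = new SparkContext(conf)
    try {
      // Sort by a derived key: the second field of each tuple.
      val pairs = sc.makeRDD(List(("a", 3), ("b", 1), ("c", 2)), 2)
      val byValue = pairs.sortBy(_._2).collect().toList

      // Descending order, written into a single partition.
      val nums = sc.makeRDD(List(6, 2, 4, 5, 3, 1), 2)
      val desc = nums.sortBy(num => num, ascending = false, numPartitions = 1)

      (byValue, desc.collect().toList, desc.getNumPartitions)
    } finally {
      sc.stop()
    }
  }
}
```

collect returns partitions in order, so the collected list is globally sorted even when the result spans more than one partition.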
