Spark Core: map and mapPartitions

Contents

map and mapPartitions
Source code
Case study
foreach and foreachPartition
About the textFile operator

map and mapPartitions

Source code

① Source of map:

  /**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }

Calling map on an RDD applies the given function to every element of that RDD and returns a new RDD.

② Source of mapPartitions:

  /**
   * Return a new RDD by applying a function to each partition of this RDD.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
   * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
   */
  def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
      preservesPartitioning)
  }

Calling mapPartitions on an RDD applies the given function once to each partition of that RDD (the function receives the partition's iterator) and returns a new RDD.
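The two listings differ only in what the user function receives: one element for map, one iterator per partition for mapPartitions. The following is a minimal plain-Scala model of that relationship, not Spark code. `Partitioned` is a hypothetical stand-in for an RDD, introduced only for illustration; its `map` is written through `mapPartitions` in the same way the Spark source above uses `iter.map(cleanF)`.

```scala
object MapVsMapPartitionsModel {
  // Hypothetical model of an RDD: a fixed list of partitions, each a list of elements.
  final case class Partitioned[T](parts: List[List[T]]) {
    // Like RDD.mapPartitions: f is called once per partition with that partition's iterator.
    def mapPartitions[U](f: Iterator[T] => Iterator[U]): Partitioned[U] =
      Partitioned(parts.map(p => f(p.iterator).toList))

    // Like RDD.map: expressed through mapPartitions by mapping over the iterator.
    def map[U](f: T => U): Partitioned[U] =
      mapPartitions(iter => iter.map(f))
  }

  def main(args: Array[String]): Unit = {
    val data = Partitioned(List(List(1, 2, 3), List(4, 5)))
    println(data.map(_ * 2))                                // element-wise transformation
    println(data.mapPartitions(iter => Iterator(iter.sum))) // one result per partition
  }
}
```

In this model, `map` can only produce one output per input element, while `mapPartitions` may return any number of outputs per partition, for example a single sum.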

Case study

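Suppose an RDD has 10 partitions with 1 million records each, and it has to be saved to a database such as MySQL. How would you do that with map and with mapPartitions, and is there a catch?

With map, the function runs once per element, so if the connection is opened inside the function, MySQL sees 10 × 1 million = 10 million connections.

With mapPartitions, the function runs once per partition, so opening the connection inside it yields only 10 connections.

Opening 10 million connections is obviously far more expensive than opening 10.

First, a test with map: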

package com.ruozedata.spark.com.ruozedata.spark.core
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer
import scala.util.Random

object MapPartitionApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("MapPartitionApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    val stus = new ListBuffer[String]
    for(i <- 1.to(100)){
      stus += "student" + i
    }
    val rdd = sc.parallelize(stus)  // turn stus into an RDD
    rdd.map(x =>{
      val conn = DB.getConn()   // pretend to connect to the database
      //TODO....save to DB
      println(conn + "....")

      DB.returnConnect(conn)    // release the connection
    }).collect()

    sc.stop()
  }
}

// for testing only: a random Int stands in for a database connection
object DB{
  def getConn() = {
    new Random().nextInt(10)
  }
  def returnConnect(conn:Int): Unit ={
  }
}

Output:

5....
8....
3....
5....
7....
9....
... 100 connections in total

Now the same test with mapPartitions:

package com.ruozedata.spark.com.ruozedata.spark.core
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer
import scala.util.Random

object MapPartitionApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("MapPartitionApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    val stus = new ListBuffer[String]
    for(i <- 1.to(100)){
      stus += "student" + i
    }
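    val rdd = sc.parallelize(stus).repartition(5)  // repartition into 5 partitions

    // check how many partitions the RDD has
    println("how many partitions in this RDD : " + rdd.partitions.length)

    rdd.mapPartitions(partition => {
      val conn = DB.getConn()  // get one connection per partition

      partition.foreach(x =>{  // save every record of this partition to the database
        //TODO....save to DB
      })

      println(conn + "....")
      DB.returnConnect(conn) // release the connection
      partition  // already drained by foreach above, so the resulting RDD is empty; collect() only triggers the job
    }).collect()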

    sc.stop()
  }
}

object DB{
  def getConn() = {
    new Random().nextInt(10)
  }
  def returnConnect(conn:Int): Unit ={
  }
}

Output:

how many partitions in this RDD : 5
0....
7....
2....
5....
5....

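From the above:

map: with 10 partitions of 1 million records each, your function is invoked 10 million times.
mapPartitions: the function is invoked only once per partition and receives that partition's data as an iterator, so 10 invocations are enough. Per-invocation setup cost (such as opening a connection) is paid 10 times instead of 10 million times.
When writing an RDD to a database over JDBC, map has to create a connection for every element, whereas mapPartitions creates one per partition, so mapPartitions is far more efficient.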

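However:

map: the function handles one record at a time; processed records can be garbage-collected soon afterwards, so it generally does not cause an OOM (out of memory) error.
mapPartitions: the function receives a whole partition as an Iterator[T]. The iterator itself is lazy, but if the function buffers the partition (for example with toList, or by collecting results into a ListBuffer) and the partition is large, everything sits in memory at once and an OOM becomes likely.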
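Whether a per-partition function holds the whole partition in memory depends on how it uses the iterator it is given. The plain-Scala sketch below (no Spark) contrasts a streaming function with a buffering one. `Counting` is a hypothetical wrapper added only to make visible how many elements have been pulled from the source.

```scala
object StreamingVsBufferingSketch {
  // Streaming: returns a lazy iterator; an element is transformed only when the consumer pulls it.
  def streaming(iter: Iterator[Int]): Iterator[Int] = iter.map(_ * 2)

  // Buffering: toList pulls the entire partition into memory before anything is returned.
  def buffering(iter: Iterator[Int]): Iterator[Int] = iter.toList.map(_ * 2).iterator

  // Hypothetical helper: wraps an iterator and counts how many elements were pulled from it.
  final class Counting(underlying: Iterator[Int]) extends Iterator[Int] {
    var pulled = 0
    def hasNext: Boolean = underlying.hasNext
    def next(): Int = { pulled += 1; underlying.next() }
  }

  def main(args: Array[String]): Unit = {
    val s = new Counting(Iterator(1, 2, 3, 4))
    streaming(s)
    println("pulled by streaming before consumption: " + s.pulled)
    val b = new Counting(Iterator(1, 2, 3, 4))
    buffering(b)
    println("pulled by buffering before consumption: " + b.pulled)
  }
}
```

A function of the first shape keeps roughly one element in flight at a time, while the second needs memory proportional to the partition size.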

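Which one to choose? It depends on your data. When each call needs expensive setup, mapPartitions is usually the first choice because it performs better; if it runs into memory problems, fall back to map.

Also, map and mapPartitions are both transformations, so both are lazy.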

foreach and foreachPartition

There is a similar pair of operators, foreach and foreachPartition:

  /**
   * Applies a function f to all elements of this RDD.
   */
  def foreach(f: T => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
  }

  /**
   * Applies a function f to each partition of this RDD.
   */
  def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
  }

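As the source shows:
foreach: applies a function to every element of an RDD. It is an action.
foreachPartition: applies a function to each partition of an RDD. It is an action.

map and mapPartitions, by contrast, are transformations; they are lazy and need an action to trigger them.
To write to external storage such as files or a database, use foreachPartition.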
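A per-partition writer usually opens one connection, writes every record through it, and releases it even if a write fails. The sketch below shows that shape in plain Scala. The `Conn` trait and `writePartition` are hypothetical names for illustration; a real job would use something like a JDBC connection. The function has the type `Iterator[String] => Unit`, which is the shape foreachPartition expects.

```scala
object PartitionWriterSketch {
  // Hypothetical minimal connection interface, standing in for a real database connection.
  trait Conn {
    def write(record: String): Unit
    def close(): Unit
  }

  // Opens one connection for the whole partition and always closes it afterwards.
  def writePartition(open: () => Conn)(records: Iterator[String]): Unit = {
    val conn = open()
    try records.foreach(r => conn.write(r))
    finally conn.close()
  }

  def main(args: Array[String]): Unit = {
    val open: () => Conn = () => new Conn {
      def write(record: String): Unit = println("saved " + record)
      def close(): Unit = println("connection closed")
    }
    // With Spark this could be used as: rdd.foreachPartition(iter => writePartition(open)(iter)),
    // assuming `open` can be serialized and shipped to the executors.
    writePartition(open)(Iterator("student1", "student2"))
  }
}
```

Because the connection is created inside the function, it is opened on the executor that processes the partition rather than on the driver.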

About the textFile operator

  /**
   * Read a text file from HDFS, a local file system (available on all nodes), or any
   * Hadoop-supported file system URI, and return it as an RDD of Strings.
   * @param path path to the text file on a supported file system
   * @param minPartitions suggested minimum number of partitions for the resulting RDD
   * @return RDD of lines of the text file
   */
  def textFile(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
    assertNotStopped()
    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      minPartitions).map(pair => pair._2.toString).setName(path)
  }

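textFile reads a text file from HDFS, the local file system, or any Hadoop-supported file system URI, and returns an RDD whose elements are Strings.

Its first parameter is the file path; the second is a suggested minimum number of partitions. If omitted, it defaults to defaultMinPartitions, which is min(defaultParallelism, 2), so at most 2.

Internally it calls hadoopFile, which in turn creates a HadoopRDD.

The source above contains map(pair => pair._2.toString) because hadoopFile produces <key, value> records: the key is the byte offset of the line within the file, of type LongWritable, and the value is the content of the line, of type Text.
So map(pair => pair._2.toString) keeps only the value and discards the offset.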
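To make the <key, value> shape concrete, the plain-Scala sketch below builds (offset, line) pairs for a small text and then drops the offsets, mirroring `pair._2`. It is only a model, not how Hadoop reads files: it assumes UTF-8 text with "\n" line endings, and the helper names are hypothetical.

```scala
object OffsetLinePairsSketch {
  // Pairs each line with the byte offset at which it starts (assumes UTF-8 and "\n" endings).
  def withOffsets(text: String): List[(Long, String)] = {
    val lines = text.split("\n").toList
    // Each starting offset is the previous one plus the line's byte length plus 1 for "\n".
    val offsets = lines.scanLeft(0L)((off, line) => off + line.getBytes("UTF-8").length + 1)
    offsets.zip(lines) // zip drops the extra trailing offset
  }

  // Same idea as map(pair => pair._2.toString): keep the line, drop the offset.
  def linesOnly(pairs: List[(Long, String)]): List[String] = pairs.map(pair => pair._2)

  def main(args: Array[String]): Unit = {
    val pairs = withOffsets("a\nbb\nccc")
    println(pairs)
    println(linesOnly(pairs))
  }
}
```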

Side note:
A classic interview question:
How many type parameters do Mapper and Reducer have in MapReduce? Four.
class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
How many parameters do the corresponding map and reduce methods take? Three.
map(KEYIN key, VALUEIN value, Context context)
reduce(KEYIN key, Iterable<VALUEIN> values, Context context)

For more on this topic, see:
MapReduce computing model explained: https://www.e-learn.cn/content/qita/605318