Spark Source Code Reading Notes, RDD (Part 1): Basic Concepts of RDD




What is an RDD?

From reference 1 we know that Dr. Matei Zaharia defines an RDD as:

Formally, an RDD is a read-only, partitioned collection of records. The keywords are read-only, partitioned, and records: the RDD we operate on is a collection that can be read but not modified, and that has already been divided into marked partitions. Below we use the source code to show why an RDD is a read-only, partitioned collection of records (see Source 1 in the appendix). To summarize:
1. Every RDD carries five pieces of information:

(1) The collection of data partitions: partitions

(2) The preferred locations for fast, locality-aware access to the data: preferredLocations

(3) The list of this RDD's dependencies: dependencies

(4) The number of partitions of this RDD: getNumPartitions

(5) The iterator that computes this RDD's data: iterator

2. All five are declared final, which means they cannot be modified: they are read-only

3. The information of the individual RDDs within a job
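
A minimal sketch touching all five of the members listed in point 1 above (the object name FiveFinalMembers and the local master are illustrative only, not part of Spark):

import org.apache.spark.{SparkConf, SparkContext}

object FiveFinalMembers {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-members").setMaster("local"))
    val rdd = sc.parallelize(1 to 100, 4)

    println(rdd.partitions.length)                      // (1) partitions: 4 partition objects
    println(rdd.preferredLocations(rdd.partitions(0)))  // (2) preferredLocations: empty for in-memory data
    println(rdd.dependencies)                           // (3) dependencies: empty, this RDD has no parent
    println(rdd.getNumPartitions)                       // (4) getNumPartitions: 4
    println(rdd.count())                                // (5) iterator is invoked internally by the tasks that count() launches: 100

    sc.stop()
  }
}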








How is an RDD created?

An RDD can only be created from (1) data in stable storage (for example HDFS) or (2) deterministic operations on the data of other RDDs (that is, transformations). Because an RDD is a collection of records derived this way, every transformation records how the new RDD was obtained from its parents. The most powerful consequence is that when an RDD is lost, the program can use this recorded lineage to bring the lost RDD back by recomputing it.
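
A minimal sketch of the two creation paths (the object name and the HDFS path are placeholders); the lineage printed at the end is what makes recomputation of a lost RDD possible:

import org.apache.spark.{SparkConf, SparkContext}

object RddCreation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-creation").setMaster("local"))

    // (1) created from stable storage, e.g. a file on HDFS (placeholder path)
    val fromStorage = sc.textFile("hdfs://namenode:9000/data/lpsa.data")

    // (2) created by a deterministic operation (transformation) on another RDD
    val transformed = fromStorage.map(_.split(",")(0))

    // the lineage records how `transformed` was derived, so a lost partition
    // can be recomputed from its parent instead of being restored from a replica
    println(transformed.toDebugString)

    sc.stop()
  }
}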


How is an RDD implemented?

Below is the RDD class declaration from the source:

/*
 * RDD extends Serializable (so RDD objects can be serialized) and Logging (which creates an SLF4J logger).
 *   _sc  : a SparkContext carrying the environment (configuration, scheduler, and so on)
 *   deps : Seq[Dependency[_]], recording the dependencies on parent RDDs
 */
abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging { /* class body */ }
From the RDD class we can see that it carries:

1. The properties of the cluster (or other deployment mode), via the SparkContext

2. A record of the steps of the computation

3. A record of the dependencies between RDDs
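
For the common single-parent case, the Spark source also provides an auxiliary constructor that reuses the parent's SparkContext and wraps the parent in a OneToOneDependency:

/** Construct an RDD with just a one-to-one dependency on one parent */
def this(@transient oneParent: RDD[_]) =
  this(oneParent.context, List(new OneToOneDependency(oneParent)))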


The five main properties of an RDD

* Internally, each RDD is characterized by five main properties:
*
*  - A list of partitions
*  - A function for computing each split
*  - A list of dependencies on other RDDs
*  - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
*  - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
*    an HDFS file)
1) A list of partitions: the partition is the basic building block of the dataset, and each partition is processed by one task, so the partitioning determines the granularity of the computation. The number of partitions can be set explicitly or left to the system default (derived from the number of CPU cores, roughly 2-4 partitions per core); when you set it yourself, a single partition should not exceed the maximum size of one block.

Side note: the basic unit of storage is the block. A partition is logically mapped (by the BlockManager) to a block and is computed by one task.

2) A function for computing each split: every RDD implements the compute function, which produces the data of each partition.

3) A list of dependencies on other RDDs: every transformation produces a new RDD, and that new RDD records information about its parent RDD(s). RDDs therefore form pipeline-like chains of dependencies.

4) A Partitioner (the RDD's partitioning function): Spark provides two partitioner implementations, HashPartitioner and RangePartitioner. Only key-value RDDs carry a Partitioner; for non-key-value RDDs its value defaults to None.

5) A list storing the preferred locations of each partition; in other words, this list holds the location of the block that stores each partition (for example, the block locations of an HDFS file).
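
To make these five properties concrete, here is a minimal sketch of a custom RDD (SliceRDD and SlicePartition are made-up names for illustration, not Spark classes) that shows where each property lives:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// a hypothetical partition holding one slice of an in-memory sequence
class SlicePartition(val idx: Int, val values: Seq[Int]) extends Partition {
  override def index: Int = idx
}

// a toy RDD over an in-memory sequence, split into numSlices partitions
class SliceRDD(sc: SparkContext, data: Seq[Int], numSlices: Int)
  extends RDD[Int](sc, Nil) {                      // Nil: (3) no dependencies, this is a root RDD

  // (1) the list of partitions
  override protected def getPartitions: Array[Partition] = {
    val sliceSize = math.max(1, math.ceil(data.length.toDouble / numSlices).toInt)
    data.grouped(sliceSize).zipWithIndex
      .map { case (slice, i) => new SlicePartition(i, slice): Partition }
      .toArray
  }

  // (2) the function that computes each split
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    split.asInstanceOf[SlicePartition].values.iterator

  // (5) preferred locations: none here, the data lives in memory on the driver
  override protected def getPreferredLocations(split: Partition): Seq[String] = Nil

  // (4) partitioner: the inherited default None, because this is not a key-value RDD
}

With these overrides in place, new SliceRDD(sc, 1 to 10, 3).collect() returns the original elements, and all the usual transformations work on it.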



Experiment summary

The data used in the program has the form shown in the figure below:


import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by legotime on 2016/4/21.
  */
object WorkSheet {
  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("Basic understanding of RDD").setMaster("local")
    val sc = new SparkContext(conf)

    // Load the data
    val data1 = sc.textFile("E:\\SparkCore2\\data\\mllib\\ridge-data\\lpsa.data")
    println("data1 type: "+data1)
    //MapPartitionsRDD[1] at textFile at WorkSheet.scala:15
    println("data1 partitions: " + data1.partitions.size)//1
    println("data1 length: " +data1.collect.length)//67
    println("data1 count: " +data1.count())//67
    println("cache: "+data1.cache())
    //MapPartitionsRDD[1] at textFile at WorkSheet.scala:15
    println("data1 name: "+data1.name)
    //data1 name: null
    println("data1 id: "+data1.id)
    //data1 id: 1

    data1.partitions.foreach { partition =>
      println("index:" + partition.index + "  hashCode:" + partition.hashCode())
    }//index:0  hashCode:1681
    println("data1 father dependency: " + data1.dependencies)
    //data1 father dependency: List(org.apache.spark.OneToOneDependency@36480b2d)

    data1.dependencies.foreach { dep =>
           println("dependency type:" + dep.getClass)
             println("dependency RDD:" + dep.rdd)
             println("dependency partitions:" + dep.rdd.partitions)
             println("dependency partitions size:" + dep.rdd.partitions.length)
           }
    //dependency type:class org.apache.spark.OneToOneDependency
    //dependency RDD:E:\SparkCore2\data\mllib\ridge-data\lpsa.data HadoopRDD[0] at textFile at WorkSheet.scala:15
    //dependency partitions:[Lorg.apache.spark.Partition;@3c3c4a71
    //dependency partitions size:1
    //
    val data1Map = data1.map(_+1)
    //after one transformation
    data1Map.dependencies.foreach { dep =>
      println("dependency type:" + dep.getClass)
      println("dependency RDD:" + dep.rdd)
      println("dependency partitions:" + dep.rdd.partitions)
      println("dependency partitions size:" + dep.rdd.partitions.length)
    }
    //dependency type:class org.apache.spark.OneToOneDependency
    //dependency RDD:MapPartitionsRDD[1] at textFile at WorkSheet.scala:15
    //dependency partitions:[Lorg.apache.spark.Partition;@3c3c4a71
    //dependency partitions size:1
    println("data1Map father dependency: " + data1Map.dependencies)
    //data1Map father dependency: List(org.apache.spark.OneToOneDependency@b887730)
    data1Map.dependencies.foreach(x =>
      println("data1Map的依赖:"+x)
    )
    //data1Map的依赖:org.apache.spark.OneToOneDependency@b887730


    val data2 = sc.textFile("E:\\SparkCore2\\data\\mllib\\ridge-data\\lpsa.data",2)
    println("data2 type: "+data2)
    //data2 type: MapPartitionsRDD[4] at textFile at WorkSheet.scala:45
    println("data2 partitions: " + data2.partitions.size)//2
    println("data2 length: " +data2.collect.length)//67
    println("data2 count: " +data2.count())//67
    println("cache: "+data2.cache())
    //cache: MapPartitionsRDD[4] at textFile at WorkSheet.scala:45
    println("data2 name: "+data2.name)
    //data2 name: null
    data2.setName("huhu!!")

    println("data2 new name: "+data2.name)
    //data2 new name: huhu!!

    println("data2 id: "+data2.id)
    //data2 id: 4
    data2.partitions.foreach { partition =>
      println("index:" + partition.index + "  hashCode:" + partition.hashCode())
    }
    //index:0  hashCode:1804
    //index:1  hashCode:1805
    println(data2.first())
    //-0.4307829,-1.63735562648104 -2.00621178480549 -1.86242597251066 -1.02470580167082 -0.522940888712441 -0.863171185425945 -1.04215728919298 -0.864466507337306
    //println(data2.take(0)) // prints only the array reference, e.g. [Ljava.lang.String;@5f2bd6d9
    println(data2.take(2).mkString("\n"))

    sc.stop()
  }
}
Analysis:

1. A partition is simply one chunk of a large dataset. Each partition is fed to the compute function; within a partition the basic element is the smallest unit the function processes (here an element is one line of the file).


2. On dependencies: an RDD knows its parent RDD but not its grandparent or anything earlier. To reach the grandparent it goes through the parent, and so on, so the chain can be traced back to the very first RDD.


3. An RDD's default name is null; you can set a name yourself, but it is of little practical use.


4. Every RDD has an id, and each time the lineage is traced one step back the system can check whether the parent RDD still exists (for example, kept around by cache or persist).
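
Rather than walking dependencies one level at a time as in point 2, the whole lineage can be printed at once with toDebugString. A small sketch assuming the data1Map RDD from the listing above (the output shape is indicative only):

println(data1Map.toDebugString)
// roughly:
// (1) MapPartitionsRDD[2] at map at WorkSheet.scala:...
//  |  MapPartitionsRDD[1] at textFile at WorkSheet.scala:15
//  |  E:\SparkCore2\data\mllib\ridge-data\lpsa.data HadoopRDD[0] at textFile at WorkSheet.scala:15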


Appendix:

Source 1

/**
 * Get the list of dependencies of this RDD, taking into account whether the
 * RDD is checkpointed or not
 */
final def dependencies: Seq[Dependency[_]] = {
  checkpointRDD.map(r => List(new OneToOneDependency(r))).getOrElse {
    if (dependencies_ == null) {
      dependencies_ = getDependencies
    }
    dependencies_
  }
}

/**
 * Get the array of partitions of this RDD, taking into account whether the
 * RDD is checkpointed or not.
 */
/**
  * Returns an array containing the partition information of the RDD on which
  * partitions is called, first checking whether this RDD has been checkpointed.
  */


final def partitions: Array[Partition] = {
  checkpointRDD.map(_.partitions).getOrElse {
    if (partitions_ == null) {
      partitions_ = getPartitions
      partitions_.zipWithIndex.foreach { case (partition, index) =>
        require(partition.index == index,
          s"partitions($index).partition == ${partition.index}, but it should equal $index")
      }
    }
    partitions_
  }
}

/**
  * Returns the number of partitions of this RDD.
  */
@Since("1.6.0")
final def getNumPartitions: Int = partitions.length

/**
 * Get the preferred locations of a partition, taking into account whether the
 * RDD is checkpointed.
 */
final def preferredLocations(split: Partition): Seq[String] = {
  checkpointRDD.map(_.getPreferredLocations(split)).getOrElse {
    getPreferredLocations(split)
  }
}

/**
 * Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
 * This should ''not'' be called by users directly, but is available for implementors of custom
 * subclasses of RDD.
 */
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    getOrCompute(split, context)
  } else {
    computeOrReadCheckpoint(split, context)
  }
}
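
Which branch iterator takes depends on the RDD's storage level; a minimal sketch (assuming an existing SparkContext sc) of how cache() flips it from NONE to MEMORY_ONLY:

import org.apache.spark.storage.StorageLevel

val nums = sc.parallelize(1 to 10)
println(nums.getStorageLevel == StorageLevel.NONE)         // true: iterator() computes the split or reads a checkpoint
nums.cache()                                                // storage level becomes MEMORY_ONLY
println(nums.getStorageLevel == StorageLevel.MEMORY_ONLY)  // true: iterator() now goes through getOrCompute
nums.count()  // the first action computes the partitions and stores the blocks
nums.count()  // the second action is served from the cached blocks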


References

1. Matei Zaharia's PhD dissertation (UC Berkeley): http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf

