There are three ways to create an RDD: textFile(), makeRDD(), and parallelize().
Since makeRDD() simply calls parallelize(), you could also say there are two ways: one reads a Seq, the other reads a file. This post covers reading a file.
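For orientation, here is a minimal usage sketch; the SparkContext sc, the slice count, and the input path are made up for illustration:
val fromSeq  = sc.parallelize(Seq(1, 2, 3, 4), 2)  // read a Seq into an RDD with 2 slices
val fromSeq2 = sc.makeRDD(Seq(1, 2, 3, 4), 2)      // same effect: makeRDD() delegates to parallelize()
val fromFile = sc.textFile("hdfs://namenode/data/input.txt")  // read a file, one element per line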
1. The textFile() method
// takes two parameters: the file path and the number of splits
def textFile(path: String, minSplits: Int = defaultMinSplits): RDD[String] = {
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], minSplits)
.map(pair => pair._2.toString)
}
When the user does not specify the number of splits, minSplits defaults to defaultMinSplits, which is the smaller of defaultParallelism and 2:
def defaultMinSplits: Int = math.min(defaultParallelism, 2)
defaultParallelism is the scheduler's default parallelism:
def defaultParallelism: Int = scheduler.defaultParallelism
Because the source examined here is early Spark code (see Spark源码《一》RDD for background), there is no YARN scheduler yet, only the local and Mesos schedulers.
For the local scheduler, defaultParallelism is the number of threads, set via local[n] (1 if not specified; local[*] uses the number of cores):
override def defaultParallelism() = threads
For the Mesos scheduler, the value defaults to 8:
override def defaultParallelism() =
System.getProperty("spark.default.parallelism", "8").toInt
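Putting these pieces together, a quick worked example; the master strings and the path are assumptions for illustration:
// under local[4]: defaultParallelism = 4, so defaultMinSplits = math.min(4, 2) = 2
// under plain local: defaultParallelism = 1, so defaultMinSplits = math.min(1, 2) = 1
val lines = sc.textFile("hdfs://namenode/logs/app.log")      // no minSplits passed, uses defaultMinSplits
val finer = sc.textFile("hdfs://namenode/logs/app.log", 16)  // explicitly ask for at least 16 splits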
The textFile() method simply delegates to the hadoopFile() method.
2. hadoopFile()
def hadoopFile[K, V](
path: String,
inputFormatClass: Class[_ <: InputFormat[K, V]],
keyClass: Class[K],
valueClass: Class[V],
minSplits: Int = defaultMinSplits
) : RDD[(K, V)] = {
val conf = new JobConf() // create a new MapReduce JobConf
FileInputFormat.setInputPaths(conf, path) // set the input path
val bufferSize = System.getProperty("spark.buffer.size", "65536") // Spark buffer size
conf.set("io.file.buffer.size", bufferSize) // use it as the Hadoop io.file.buffer.size
new HadoopRDD(this, conf, inputFormatClass, keyClass, valueClass, minSplits)
}
This method returns an RDD of key-value pairs by constructing a HadoopRDD; back in textFile(), the map then takes out only the value (the line content) of each pair.
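As a rough sketch of that relationship, assuming the early SparkContext API shown above (the path is made up):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// calling hadoopFile() directly yields the raw (offset, line) pairs
val pairs = sc.hadoopFile("hdfs://namenode/logs/app.log",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 4)
// textFile() is just this plus a map that keeps only the line text
val lines = pairs.map(pair => pair._2.toString)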
3. The HadoopRDD class
It extends RDD:
class HadoopRDD[K, V](
sc: SparkContext,
@transient conf: JobConf,
inputFormatClass: Class[_ <: InputFormat[K, V]],
keyClass: Class[K],
valueClass: Class[V],
minSplits: Int)
extends RDD[(K, V)](sc) {
val serializableConf = new SerializableWritable(conf) // wrap the JobConf so it can be serialized
// an array whose length equals the number of splits, holding this RDD's partitions of the data
@transient
val splits_ : Array[Split] = {
val inputFormat = createInputFormat(conf)
val inputSplits = inputFormat.getSplits(conf, minSplits)
val array = new Array[Split](inputSplits.size)
for (i <- 0 until inputSplits.size) {
array(i) = new HadoopSplit(id, i, inputSplits(i))
}
array
}
// create an InputFormat instance
def createInputFormat(conf: JobConf): InputFormat[K, V] = {
ReflectionUtils.newInstance(inputFormatClass.asInstanceOf[Class[_]], conf)
.asInstanceOf[InputFormat[K, V]]
}
override def splits = splits_
// compute() returns an iterator over the split's key-value records
override def compute(theSplit: Split) = new Iterator[(K, V)] {
val split = theSplit.asInstanceOf[HadoopSplit]
var reader: RecordReader[K, V] = null
val conf = serializableConf.value
val fmt = createInputFormat(conf)
reader = fmt.getRecordReader(split.inputSplit.value, conf, Reporter.NULL)
val key: K = reader.createKey()
val value: V = reader.createValue()
var gotNext = false
var finished = false
// check whether the split still has records to read
override def hasNext: Boolean = {
if (!gotNext) {
try {
finished = !reader.next(key, value)
} catch {
case eof: EOFException =>
finished = true
}
gotNext = true
}
if (finished) {
reader.close()
}
!finished
}
// read the next record
override def next: (K, V) = {
if (!gotNext) {
finished = !reader.next(key, value)
}
if (finished) {
throw new NoSuchElementException("End of stream")
}
gotNext = false
(key, value)
}
}
override def preferredLocations(split: Split) = {
// TODO: Filtering out "localhost" in case of file:// URLs
val hadoopSplit = split.asInstanceOf[HadoopSplit]
hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
}
// no dependencies
override val dependencies: List[Dependency[_]] = Nil
}
This class touches a good deal of MapReduce source code, which we will not go into here. In HadoopRDD's compute() method, the default RecordReader is LineRecordReader (used by TextInputFormat, for example): it takes the byte offset of each line as the key and the line content as the value, and returns the pairs as an iterator.
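To make that concrete, a small illustrative example; the file contents are assumed:
// for a file whose two lines are "spark" and "rdd", LineRecordReader emits (byte offset, line content) pairs:
val expected = Seq(
  (0L, "spark"), // "spark" plus the newline occupies bytes 0-5
  (6L, "rdd")
)
// textFile() then maps each pair to pair._2.toString, so the resulting RDD holds just "spark" and "rdd"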