There are three ways to create an RDD: textFile(), makeRDD(), and parallelize().
Since makeRDD() simply calls parallelize(), you could also say there are two ways: one reads a Seq, the other reads a file. This post covers reading a file.
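For orientation, here is a minimal usage sketch; the SparkContext sc, the slice count, and the input path are made up for illustration:
val fromSeq  = sc.parallelize(Seq(1, 2, 3, 4), 2)  // read a Seq into an RDD with 2 slices
val fromSeq2 = sc.makeRDD(Seq(1, 2, 3, 4), 2)      // same effect: makeRDD() delegates to parallelize()
val fromFile = sc.textFile("hdfs://namenode/data/input.txt")  // read a file, one element per line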
1. The textFile() method
// takes two parameters: the file path and the number of splits
def textFile(path: String, minSplits: Int = defaultMinSplits): RDD[String] = {
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], minSplits)
.map(pair => pair._2.toString)
}
When the user does not specify the number of splits, minSplits defaults to defaultMinSplits, which is the smaller of defaultParallelism and 2:
def defaultMinSplits: Int = math.min(defaultParallelism, 2)
defaultParallelism is the scheduler's default parallelism:
def defaultParallelism: Int = scheduler.defaultParallelism
Because the source examined here is early Spark code (see Spark源码《一》RDD for background), there is no YARN scheduler yet, only the local and Mesos schedulers.
For the local scheduler, defaultParallelism is the number of threads, set via local[n] (1 if not specified; local[*] uses the number of cores):
override def defaultParallelism() = threads
For the Mesos scheduler, the value defaults to 8:
override def defaultParallelism() =
System.getProperty("spark.default.parallelism", "8").toInt
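Putting these pieces together, a quick worked example; the master strings and the path are assumptions for illustration:
// under local[4]: defaultParallelism = 4, so defaultMinSplits = math.min(4, 2) = 2
// under plain local: defaultParallelism = 1, so defaultMinSplits = math.min(1, 2) = 1
val lines = sc.textFile("hdfs://namenode/logs/app.log")      // no minSplits passed, uses defaultMinSplits
val finer = sc.textFile("hdfs://namenode/logs/app.log", 16)  // explicitly ask for at least 16 splits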
The textFile() method simply delegates to the hadoopFile() method.
2. hadoopFile()
def hadoopFile[K, V](
path: String,
inputFormatClass: Class[_ <: InputFormat[K, V]],
keyClass: Class[K],
valueClass: Class[V],
minSplits: Int = defaultMinSplits
) : RDD[(K, V)] = {
val conf = new JobConf() // create a new MapReduce JobConf
FileInputFormat.setInputPaths(conf, path) // set the input path
val bufferSize = System.getProperty("spark.buffer.size", "65536") // Spark buffer size
conf.set("io.file.buffer.size", bufferSize) // use it as the Hadoop io.file.buffer.size
new HadoopRDD(this, conf, inputFormatClass, keyClass, valueClass, minSplits)
}
This method returns an RDD of key-value pairs by constructing a HadoopRDD; back in textFile(), the map then takes out only the value (the line content) of each pair.
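As a rough sketch of that relationship, assuming the early SparkContext API shown above (the path is made up):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// calling hadoopFile() directly yields the raw (offset, line) pairs
val pairs = sc.hadoopFile("hdfs://namenode/logs/app.log",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 4)
// textFile() is just this plus a map that keeps only the line text
val lines = pairs.map(pair => pair._2.toString)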
3. The HadoopRDD class
It extends RDD:
class HadoopRDD[K, V](
sc: SparkContext,
@transient conf: JobConf,
inputFormatClass: Class[_ <: InputFormat[K, V]],
keyClass: Class[K],
valueClass: Class[V],
minSplits: Int)
extends RDD[(K, V)](sc) {
val serializableConf = new SerializableWritable(conf) // wrap the JobConf so it can be serialized
// an array whose length equals the number of splits, holding this RDD's partitions of the data
@transient
val splits_ : Array[Split] = {
val inputFormat = createInputFormat(conf)
val inputSplits = inputFormat.getSplits(conf, minSplits)
val array = new Array[Split](inputSplits.size)
for (i <- 0 until inputSplits.size) {
array(i) = new HadoopSplit(id, i, inputSplits(i))
}
array
}
// create an InputFormat instance
def createInputFormat(conf: JobConf): InputFormat[K, V] = {
ReflectionUtils.newInstance(inputFormatClass.asInstanceOf[Class[_]], conf)
.asInstanceOf[InputFormat[K, V]]
}
override def splits = splits_
// compute() returns an iterator over the split's key-value records
override def compute(theSplit: Split) = new Iterator[(K, V)] {
val split = theSplit.asInstanceOf[HadoopSplit]
var reader: RecordReader[K, V] = null
val conf = serializableConf.value
val fmt = createInputFormat(conf)
reader = fmt.getRecordReader(split.inputSplit.value, conf, Reporter.NULL)
val key: K = reader.createKey()
val value: V = reader.createValue()
var gotNext = false
var finished = false
// check whether the split still has records to read
override def hasNext: Boolean = {
if (!gotNext) {
try {
finished = !reader.next(key, value)
} catch {
case eof: EOFException =>
finished = true
}
gotNext = true
}
if (finished) {
reader.close()
}
!finished
}
// read the next record
override def next: (K, V) = {
if (!gotNext) {
finished = !reader.next(key, value)
}
if (finished) {
throw new NoSuchElementException("End of stream")
}
gotNext = false
(key, value)
}
}
override def preferredLocations(split: Split) = {
// TODO: Filtering out "localhost" in case of file:// URLs
val hadoopSplit = split.asInstanceOf[HadoopSplit]
hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
}
// no dependencies
override val dependencies: List[Dependency[_]] = Nil
}
This class touches a good deal of MapReduce source code, which we will not go into here. In HadoopRDD's compute() method, the default RecordReader is LineRecordReader (used by TextInputFormat, for example): it takes the byte offset of each line as the key and the line content as the value, and returns the pairs as an iterator.
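To make that concrete, a small illustrative example; the file contents are assumed:
// for a file whose two lines are "spark" and "rdd", LineRecordReader emits (byte offset, line content) pairs:
val expected = Seq(
  (0L, "spark"), // "spark" plus the newline occupies bytes 0-5
  (6L, "rdd")
)
// textFile() then maps each pair to pair._2.toString, so the resulting RDD holds just "spark" and "rdd"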