A small note on the source code analysis process -------- how a Spark job is triggered

Jobs are executed serially: one job runs to completion before the next one starts.
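
As a minimal sketch (assuming an active SparkContext named sc; the output path is made up), each action call below submits its own job, and the second job is only submitted once the first has finished:

val nums = sc.parallelize(1 to 100, 4)
val total = nums.reduce(_ + _)        // action 1 -> job 0 runs to completion first
nums.saveAsTextFile("/tmp/nums-out")  // action 2 -> job 1 starts only after job 0 has finished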
 
e.g. the WordCount case:
val lines = sc.textFile("an HDFS or local file URL")         // see Code 1
val words = lines.flatMap(line => line.split(" "))           // also returns a MapPartitionsRDD
val pairs = words.map(word => (word, 1))                     // also returns a MapPartitionsRDD
val counts = pairs.reduceByKey(_ + _)                        // see Code 2
counts.foreach(count => println(count._1 + ": " + count._2)) // see Code 4

 

Source location:
SparkContext class: spark-core_2.11-2.1.0-sources.jar > org.apache.spark.SparkContext.scala
Code 1
/**
 * Read a text file from HDFS, a local file system (available on all nodes), or any
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 */
/**
 * First, hadoopFile() is called to create a HadoopRDD whose elements are actually (key, value) pairs:
 * the key is the byte offset of each line in the HDFS or local text file, and the value is the line of text.
 * Then map() is called on the HadoopRDD to drop the key and keep only the value, which yields a
 * MapPartitionsRDD whose elements are the individual lines of text.
 */
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
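
As a sketch of what textFile() composes under the hood (the path below is made up; sc is an active SparkContext): hadoopFile() yields (offset, line) pairs, and the subsequent map() keeps only the text:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val raw = sc.hadoopFile("hdfs:///data/words.txt", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text])         // HadoopRDD of (offset, line) pairs
val lines = raw.map(pair => pair._2.toString)   // MapPartitionsRDD of plain text lines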
 
Code 2
// Because the RDD class (RDD.scala) has no reduceByKey method, calling reduceByKey triggers a Scala
// implicit conversion: the compiler searches the implicit scope, finds the rddToPairRDDFunctions()
// conversion in the RDD companion object, and the call is then dispatched to reduceByKey in PairRDDFunctions.
implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
  (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
  new PairRDDFunctions(rdd) // see Code 3
}
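
To make the conversion visible, the same call can be written out by hand; a sketch (assuming an active SparkContext sc), where both calls produce the same result:

import org.apache.spark.rdd.{PairRDDFunctions, RDD}

val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("a", 1), ("b", 1)))
val viaImplicit = pairs.reduceByKey(_ + _)                       // the compiler inserts rddToPairRDDFunctions(pairs)
val writtenOut  = new PairRDDFunctions(pairs).reduceByKey(_ + _) // the same call, spelled out explicitly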
 
Code 3
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
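
In other words, the reduceByKey(_ + _) from the WordCount example amounts to the following call (a sketch reusing the pairs RDD from the example; the default-partitioner overload is used here):

val counts = pairs.combineByKeyWithClassTag[Int](
  (v: Int) => v,                  // createCombiner: the first value seen for a key becomes the initial combiner
  (c: Int, v: Int) => c + v,      // mergeValue: fold further values into the combiner within a partition
  (c1: Int, c2: Int) => c1 + c2   // mergeCombiners: merge per-partition combiners after the shuffle
)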
 
 
Code 4
// foreach eventually reaches this runJob method by going through several overloaded runJob methods
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  // Call the runJob method of the DAGScheduler that was created while the SparkContext was being initialized.
  // This hands the RDD on which the current action was invoked over to the DAGScheduler.
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
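
For reference, the path from foreach to this overload looks roughly as follows (a simplified sketch based on RDD.scala of the same version; the intermediate runJob overloads are omitted):

// RDD.foreach wraps the user function and submits a job over all partitions of this RDD
def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}
// SparkContext.runJob(rdd, func) forwards to runJob(rdd, func, 0 until rdd.partitions.length),
// which in turn reaches the runJob(rdd, func, partitions, resultHandler) method shown above.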

Origin www.cnblogs.com/yzqyxq/p/12232685.html