jobs are executed serially: each job runs to completion before the next one starts.
e.g. the WordCount case:

val lines = sc.textFile("path to a local file or an HDFS URL")   // see Code 1
val words = lines.flatMap(line => line.split(" "))               // returns a MapPartitionsRDD
val pairs = words.map(word => (word, 1))                         // also returns a MapPartitionsRDD
val counts = pairs.reduceByKey(_ + _)                            // see Code 2
counts.foreach(count => println(count._1 + ": " + count._2))     // see Code 4
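The snippet above assumes an existing SparkContext named sc (as in spark-shell). For readers who want to run it as a standalone application, a minimal sketch might look like the following; the object name WordCountApp, the local[*] master and the input path input.txt are illustrative assumptions, not part of the original example.

import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  def main(args: Array[String]): Unit = {
    // The local[*] master and the input path are assumptions for illustration only.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("input.txt")                          // Code 1: HadoopRDD -> map -> MapPartitionsRDD
    val words = lines.flatMap(line => line.split(" "))            // MapPartitionsRDD
    val pairs = words.map(word => (word, 1))                      // MapPartitionsRDD
    val counts = pairs.reduceByKey(_ + _)                         // Code 2/3: via the PairRDDFunctions implicit conversion
    counts.foreach(count => println(count._1 + ": " + count._2))  // Code 4: action, triggers SparkContext.runJob

    sc.stop()
  }
}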
Source Location:
SparkContext class: spark-core_2.11-2.1.0-sources.jar > org.apache.spark.SparkContext.scala
Code 1

/**
 * Read a text file from HDFS, a local file system (available on all nodes), or any
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 */
// First, hadoopFile() creates a HadoopRDD whose elements are in fact (key, value) pairs:
// the key is the byte offset of each line in the HDFS or local text file, and the value is the line of text.
// Then map() is called on that HadoopRDD to drop the key and keep only the value, producing a
// MapPartitionsRDD whose elements are the individual lines of text.
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

Code 2

// The RDD class (RDD.scala) has no reduceByKey method, so calling reduceByKey triggers a Scala
// implicit conversion: the compiler searches the scope for an applicable conversion, finds
// rddToPairRDDFunctions() in the RDD companion object, and then calls reduceByKey on the resulting
// PairRDDFunctions instance.
implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
  (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
  new PairRDDFunctions(rdd) // see Code 3
}

Code 3

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}

Code 4

// foreach is an action; it ends up in SparkContext.runJob. Several overloaded runJob methods
// eventually delegate to this one.
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  // The DAGScheduler was created and initialized when the SparkContext was constructed,
  // i.e. before runJob is ever called. runJob hands the action on the current RDD over to
  // DAGScheduler.runJob.
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
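To make the implicit conversion of Code 2 concrete: calling pairs.reduceByKey(_ + _) is equivalent to wrapping the pair RDD in PairRDDFunctions by hand. A minimal sketch, assuming an RDD[(String, Int)] such as pairs from the WordCount example (the helper name wordCountExplicit is hypothetical):

import org.apache.spark.rdd.{PairRDDFunctions, RDD}

// Explicit form of what the compiler does implicitly when reduceByKey is called on an RDD[(K, V)]:
// it applies rddToPairRDDFunctions and then invokes reduceByKey on the wrapper.
def wordCountExplicit(pairs: RDD[(String, Int)]): RDD[(String, Int)] = {
  val pairFunctions = new PairRDDFunctions(pairs) // what rddToPairRDDFunctions(pairs) returns (Code 2)
  pairFunctions.reduceByKey(_ + _)                // eventually delegates to combineByKeyWithClassTag (Code 3)
}

In normal user code the wrapper never appears explicitly: since Spark 1.3 the conversion lives in the RDD companion object, so the compiler finds it without any extra import.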