What is a DAG
A DAG is a directed acyclic graph. When Spark runs an application (Application), it first builds a DAG in which every node is an operation. Spark operations fall into two categories: transformations and actions. During execution, only an action triggers the submission of a job (Job), so a single application can contain multiple jobs. Once a job is submitted, Spark first uses the DAG to determine which stages (Stage) the job consists of, and then decomposes each stage into a set of tasks (Task).
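A minimal Scala sketch of this behavior (the word-count pipeline, file path, and object name here are illustrative, not from the original text):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object DagDemo {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("DagDemo").setMaster("local")
    val sc = new SparkContext(conf)

    // Transformations only add nodes to the DAG; nothing executes yet.
    val words  = sc.textFile("file:///D:/words")          // transformation
    val pairs  = words.flatMap(_.split(" ")).map((_, 1))  // transformations
    val counts = pairs.reduceByKey(_ + _)                 // transformation (introduces a shuffle)

    // Each action submits one job; Spark splits the job's DAG into stages
    // at shuffle boundaries, and each stage into tasks (one per partition).
    counts.collect().foreach(println)                     // action: triggers job 0
    println(counts.count())                               // action: triggers job 1

    sc.stop()
  }
}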
SparkContext, SparkConf, and SparkEnv
Instantiating SparkContext also instantiates SparkEnv, and to build the SparkEnv, Spark brings up a number of subsystems. This is evident from SparkEnv's constructor:
new SparkEnv(
  executorId,
  actorSystem,
  serializer,
  closureSerializer,
  cacheManager,
  mapOutputTracker,
  shuffleManager,
  broadcastManager,
  blockTransferService,
  blockManager,
  securityManager,
  httpFileServer,
  sparkFilesDir,
  metricsSystem,
  shuffleMemoryManager,
  conf)
Each of these arguments corresponds to one facet of Spark; the type of each is shown in the class declaration:
class SparkEnv (
    val executorId: String,
    val actorSystem: ActorSystem,
    val serializer: Serializer,
    val closureSerializer: Serializer,
    val cacheManager: CacheManager,
    val mapOutputTracker: MapOutputTracker,
    val shuffleManager: ShuffleManager,
    val broadcastManager: BroadcastManager,
    val blockTransferService: BlockTransferService,
    val blockManager: BlockManager,
    val securityManager: SecurityManager,
    val httpFileServer: HttpFileServer,
    val sparkFilesDir: String,
    val metricsSystem: MetricsSystem,
    val shuffleMemoryManager: ShuffleMemoryManager,
    val conf: SparkConf) extends Logging {
  // method body
}

Spark's ScalaDoc for SparkEnv reads:
/**
 * :: DeveloperApi ::
 * Holds all the runtime environment objects for a running Spark instance (either master or worker),
 * including the serializer, Akka actor system, block manager, map output tracker, etc. Currently
 * Spark code finds the SparkEnv through a global variable, so all the threads can access the same
 * SparkEnv. It can be accessed by SparkEnv.get (e.g. after creating a SparkContext).
 *
 * NOTE: This is not intended for external use. This is exposed for Shark and may be made private
 * in a future release.
 */
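As the ScalaDoc says, once a SparkContext exists the environment is reachable through the global accessor SparkEnv.get. A minimal sketch (the object name, app name, and master setting are illustrative):

import org.apache.spark.{SparkConf, SparkContext, SparkEnv}

object SparkEnvPeek {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkEnvPeek").setMaster("local")
    val sc = new SparkContext(conf)  // instantiating SparkContext also builds the SparkEnv

    val env = SparkEnv.get           // the global accessor mentioned in the ScalaDoc
    println(env.serializer)          // the Serializer field from the constructor above
    println(env.blockManager)       // the BlockManager field
    println(env.mapOutputTracker)   // the MapOutputTracker field

    sc.stop()
  }
}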
Suppose val rdd = sc.textFile("file:///D:/words") and words is a directory containing N text files. Does the final output then consist of N files, part-00000 through part-0000X (X = N-1), meaning that Spark partitioned the N files into N partitions, with each partition corresponding to one task? In theory, yes; in practice, the partition count also depends on how many blocks the files are split into.
package spark.examples.rdd

import org.apache.spark.{SparkContext, SparkConf}

object SparkSaveMultiFiles {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkRDDJoin").setMaster("local")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("file:///D:/wordcount")
    val result = rdd.filter(_.contains("WordCount"))
    result.foreach(println)
  }
}

In the code above, the directory d:/wordcount holds several text files.
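To verify the claim about partition counts, the example above could be extended inside main() with rdd.partitions (a public member of the RDD API); the output path below is illustrative:

// One partition per Hadoop input split (roughly one per file block),
// which may differ from the number of files in the directory.
println(rdd.partitions.length)

// Writing the RDD back out produces one part-0000X file per partition.
rdd.saveAsTextFile("file:///D:/wordcount-out")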