【Spark 23】Miscellaneous notes

What is a DAG

A DAG is a directed acyclic graph. When Spark runs an application (Application), it first builds a DAG in which every node is an operation. Spark operations fall into two categories: transformations and actions. During execution, only an action triggers the submission of a job (Job), so a single application can contain multiple jobs. After a job is submitted, Spark first walks the DAG to work out which stages the job contains, and then breaks each stage down into tasks.
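
As a minimal sketch of this behavior (assuming a SparkContext named sc and a local input path of your own), the transformations below only add nodes to the DAG; nothing runs until the count() action submits a job:

    // Transformations (filter, map) are lazy: they only add nodes to the DAG.
    val lines  = sc.textFile("file:///D:/words")     // transformation: no job submitted
    val longer = lines.filter(_.length > 10)         // transformation: no job submitted
    val upper  = longer.map(_.toUpperCase)           // transformation: no job submitted

    // count() is an action: only now is a job submitted, split into stages and tasks.
    val n = upper.count()
    println(s"lines longer than 10 characters: $n")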

SparkContext, SparkConf and SparkEnv

While a SparkContext is being instantiated, it also instantiates a SparkEnv. To build that SparkEnv, Spark starts up a whole set of components, which is evident from the arguments passed to SparkEnv's constructor:

    new SparkEnv(
      executorId,
      actorSystem,
      serializer,
      closureSerializer,
      cacheManager,
      mapOutputTracker,
      shuffleManager,
      broadcastManager,
      blockTransferService,
      blockManager,
      securityManager,
      httpFileServer,
      sparkFilesDir,
      metricsSystem,
      shuffleMemoryManager,
      conf)

Each of these variables corresponds to one aspect of Spark. Their types are declared in the SparkEnv class:

class SparkEnv (
    val executorId: String,
    val actorSystem: ActorSystem,
    val serializer: Serializer,
    val closureSerializer: Serializer,
    val cacheManager: CacheManager,
    val mapOutputTracker: MapOutputTracker,
    val shuffleManager: ShuffleManager,
    val broadcastManager: BroadcastManager,
    val blockTransferService: BlockTransferService,
    val blockManager: BlockManager,
    val securityManager: SecurityManager,
    val httpFileServer: HttpFileServer,
    val sparkFilesDir: String,
    val metricsSystem: MetricsSystem,
    val shuffleMemoryManager: ShuffleMemoryManager,
    val conf: SparkConf) extends Logging {
        // method body
    }


Spark's ScalaDoc description of SparkEnv is:
/**
 * :: DeveloperApi ::
 * Holds all the runtime environment objects for a running Spark instance (either master or worker),
 * including the serializer, Akka actor system, block manager, map output tracker, etc. Currently
 * Spark code finds the SparkEnv through a global variable, so all the threads can access the same
 * SparkEnv. It can be accessed by SparkEnv.get (e.g. after creating a SparkContext).
 *
 * NOTE: This is not intended for external use. This is exposed for Shark and may be made private
 *       in a future release.
 */
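
Following the ScalaDoc's hint, the environment can be retrieved through the global accessor once a SparkContext exists. A minimal sketch (the app name and master are arbitrary placeholders):

    import org.apache.spark.{SparkConf, SparkContext, SparkEnv}

    val conf = new SparkConf().setAppName("SparkEnvDemo").setMaster("local")
    val sc = new SparkContext(conf)

    // Creating the SparkContext has already constructed the driver-side SparkEnv;
    // SparkEnv.get returns that instance.
    val env = SparkEnv.get
    println(env.conf.get("spark.app.name"))   // SparkEnvDemo
    println(env.blockManager)                 // the driver's BlockManager

    sc.stop()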

If you call val rdd = sc.textFile("file:///D:/words") and words is a directory containing N text files, the final output contains N files, named part-00000 through part-0000X (X = N-1). This suggests that Spark created one partition per input file, with each partition handled by one task. That is the theory; in practice the partition count also depends on how many blocks the files are split into.
package spark.examples.rdd

import org.apache.spark.{SparkContext, SparkConf}

object SparkSaveMultiFiles {

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkSaveMultiFiles").setMaster("local")
    val sc = new SparkContext(conf)
    // D:/wordcount is a directory containing several text files
    val rdd = sc.textFile("file:///D:/wordcount")
    // Keep only the lines that contain the string "WordCount" and print them
    val result = rdd.filter(_.contains("WordCount"))
    result.foreach(println)
    sc.stop()
  }
}
In the code above, the D:/wordcount directory holds multiple text files.
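
To check the relationship between input files, partitions, and output part files described above, a rough sketch (D:/wordcount and D:/wordcount-out are placeholder local paths):

    val rdd = sc.textFile("file:///D:/wordcount")

    // The partition count is at least the number of input files, but large files
    // are further split by block/split size, so it can be higher.
    println(s"partitions: ${rdd.partitions.length}")

    // saveAsTextFile writes one part-0000X file per partition (one task per partition).
    rdd.saveAsTextFile("file:///D:/wordcount-out")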

Reposted from bit1129.iteye.com/blog/2176364