SparkContext Internals

1 SparkContext Overview

SparkContext is the entry point of Spark, comparable to the main function of an application. Multiple SparkContext instances may be created within a single JVM process, but only one of them can be active at a time. To create a new SparkContext instance, you must first call stop() on the currently active one.
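
A minimal sketch of this lifecycle; the master URL and application name below are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder configuration values for illustration only.
val conf = new SparkConf()
  .setMaster("local[2]")           // placeholder master URL
  .setAppName("context-demo")      // placeholder application name

val sc = new SparkContext(conf)
// ... run jobs ...
sc.stop()                          // stop the active context before creating another one

val sc2 = new SparkContext(conf)   // a new active SparkContext is now allowed
sc2.stop()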

The figure from the official Spark website (not reproduced here) shows that SparkContext sits at the core of the Driver Program: every interaction with the Cluster Manager and the Worker Nodes goes through SparkContext.

2 SparkContext Components

  • SparkConf: Spark's configuration class. Settings are stored as key-value pairs, wrapped in a ConcurrentHashMap instance named settings that holds all of Spark's configuration.
  • SparkEnv: the Spark runtime environment. Executors, which execute tasks, rely on the runtime environment provided by SparkEnv; the Driver also contains a SparkEnv so that tasks can run in local mode.
  • LiveListenerBus: the event bus inside SparkContext. It receives events from the various producers and asynchronously matches each event to the corresponding SparkListener methods.
  • SparkUI: Spark's web UI. SparkUI indirectly depends on the compute engine, the scheduling system and the storage system; monitoring data for Jobs, Stages, storage, Executors and other components is posted to the LiveListenerBus as SparkListenerEvents, and SparkUI reads the data from the various SparkListeners and renders it in the web interface.
  • SparkStatusTracker: provides monitoring information about Jobs, Stages and so on (a usage sketch follows this list).
  • DAGScheduler: the DAG scheduler, one of the key components of the scheduling system. It creates Jobs, splits the RDDs in a DAG into different Stages, submits Stages, and so on. The Job and Stage monitoring data shown in SparkUI all comes from DAGScheduler.
  • TaskScheduler: the task scheduler, another key component of the scheduling system. TaskScheduler applies its scheduling algorithm to re-schedule the resources already allocated to the application by the cluster manager and assigns them to tasks. The Tasks it schedules are created by DAGScheduler, so DAGScheduler acts as the scheduler upstream of TaskScheduler.
  • HeartbeatReceiver: the heartbeat receiver. Every Executor sends heartbeats to HeartbeatReceiver; on receiving one, it first updates the Executor's last-seen time and then hands the information to TaskScheduler for further processing.
  • ContextCleaner: the context cleaner. ContextCleaner asynchronously cleans up RDDs, ShuffleDependencys, Broadcasts and other state that has gone out of the application's scope.
  • ExecutorAllocationManager: the Executor dynamic allocation manager. It adjusts the number of Executors according to the workload, and is enabled when spark.dynamicAllocation.enabled is true and either the deployment is non-local or spark.dynamicAllocation.testing is true.
  • ShutdownHookManager: the manager for shutdown hooks. It lets the application register shutdown hooks so that cleanup work runs when the JVM process exits.
  • HadoopConfiguration: the Hadoop configuration, which differs depending on whether the environment is plain Hadoop (before Hadoop 2.0) or Hadoop YARN (Hadoop 2.0+). If the system property SPARK_YARN_MODE or the environment variable SPARK_YARN_MODE is true, a YARN configuration is built; otherwise a plain Hadoop configuration is used.
  • ExecutorMemory: the Executor memory size, 1024 MB by default. It can be specified through the environment variables SPARK_MEM or SPARK_EXECUTOR_MEMORY, or through the spark.executor.memory property. spark.executor.memory has the highest priority, followed by SPARK_EXECUTOR_MEMORY; SPARK_MEM is a legacy setting from older Spark versions and will eventually be removed.
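
As a hedged usage sketch of the SparkStatusTracker mentioned above (the job, names and sleep below are only there to give the tracker something to observe):

import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("status-demo"))

// Launch a job in the background so there is something to observe.
Future { sc.parallelize(1 to 1000000, 8).map(_ + 1).count() }
Thread.sleep(500)  // crude: give the job a moment to start

val tracker = sc.statusTracker
for (jobId <- tracker.getActiveJobIds(); info <- tracker.getJobInfo(jobId)) {
  println(s"job $jobId is ${info.status} and runs stages ${info.stageIds.mkString(",")}")
}
sc.stop()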

3 Code Walkthrough

SparkContext.markPartiallyConstructed(this, allowMultipleContexts)
This call ensures the uniqueness of the SparkContext instance.

try {
  _conf = config.clone()
  _conf.validateSettings()
  if (!_conf.contains("spark.master")) {
    throw new SparkException("A master URL must be set in your configuration")
  }
  if (!_conf.contains("spark.app.name")) {
    throw new SparkException("An application name must be set in your configuration")
  }
  if (master == "yarn" && deployMode == "cluster" && !_conf.contains("spark.yarn.app.id")) {
    throw new SparkException("Detected yarn cluster mode, but isn't running on a cluster. " +
      "Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.")
  }
  if (_conf.getBoolean("spark.logConf", false)) {
    logInfo("Spark configuration:\n" + _conf.toDebugString)
  }
  _conf.setIfMissing("spark.driver.host", Utils.localHostName())
  _conf.setIfMissing("spark.driver.port", "0")
  _conf.set("spark.executor.id", SparkContext.DRIVER_IDENTIFIER)
  _jars = Utils.getUserJars(_conf)
  _files = _conf.getOption("spark.files").map(_.split(",")).map(_.filter(_.nonEmpty))
    .toSeq.flatten
  _eventLogDir =
    if (isEventLogEnabled) {
      val unresolvedDir = conf.get("spark.eventLog.dir", EventLoggingListener.DEFAULT_LOG_DIR)
        .stripSuffix("/")
      Some(Utils.resolveURI(unresolvedDir))
    } else {
      None
    }
  _eventLogCodec = {
    val compress = _conf.getBoolean("spark.eventLog.compress", false)
    if (compress && isEventLogEnabled) {
      Some(CompressionCodec.getCodecName(_conf)).map(CompressionCodec.getShortName)
    } else {
      None
    }
  }
  if (master == "yarn" && deployMode == "client") System.setProperty("SPARK_YARN_MODE", "true")
  _jobProgressListener = new JobProgressListener(_conf)
  listenerBus.addListener(jobProgressListener)
  _env = createSparkEnv(_conf, isLocal, listenerBus)
  SparkEnv.set(_env)
  _conf.getOption("spark.repl.class.outputDir").foreach { path =>
    val replUri = _env.rpcEnv.fileServer.addDirectory("/classes", new File(path))
    _conf.set("spark.repl.class.uri", replUri)
  }
  _statusTracker = new SparkStatusTracker(this)
  _progressBar =
    if (_conf.getBoolean("spark.ui.showConsoleProgress", true) && !log.isInfoEnabled) {
      Some(new ConsoleProgressBar(this))
    } else {
      None
    }
  _ui =
    if (conf.getBoolean("spark.ui.enabled", true)) {
      Some(SparkUI.createLiveUI(this, _conf, listenerBus, _jobProgressListener,
        _env.securityManager, appName, startTime = startTime))
    } else {
      None
    }
  _ui.foreach(_.bind())
  _hadoopConfiguration = SparkHadoopUtil.get.newConfiguration(_conf)
  if (jars != null) {
    jars.foreach(addJar)
  }
  if (files != null) {
    files.foreach(addFile)
  }
  _executorMemory = _conf.getOption("spark.executor.memory")
    .orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
    .orElse(Option(System.getenv("SPARK_MEM"))
    .map(warnSparkMem))
    .map(Utils.memoryStringToMb)
    .getOrElse(1024)
  for { (envKey, propKey) <- Seq(("SPARK_TESTING", "spark.testing"))
    value <- Option(System.getenv(envKey)).orElse(Option(System.getProperty(propKey)))} {
    executorEnvs(envKey) = value
  }
  Option(System.getenv("SPARK_PREPEND_CLASSES")).foreach { v =>
    executorEnvs("SPARK_PREPEND_CLASSES") = v
  }
  executorEnvs("SPARK_EXECUTOR_MEMORY") = executorMemory + "m"
  executorEnvs ++= _conf.getExecutorEnv
  executorEnvs("SPARK_USER") = sparkUser
  _heartbeatReceiver = env.rpcEnv.setupEndpoint(
    HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))
  val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
  _schedulerBackend = sched
  _taskScheduler = ts
  _dagScheduler = new DAGScheduler(this)
  _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
  _taskScheduler.start()
  _applicationId = _taskScheduler.applicationId()
  _applicationAttemptId = taskScheduler.applicationAttemptId()
  _conf.set("spark.app.id", _applicationId)
  _ui.foreach(_.setAppId(_applicationId))
  _env.blockManager.initialize(_applicationId)
  _env.metricsSystem.start()
  _env.metricsSystem.getServletHandlers.foreach(handler => ui.foreach(_.attachHandler(handler)))

  _eventLogger =
    if (isEventLogEnabled) {
      val logger =
        new EventLoggingListener(_applicationId, _applicationAttemptId, _eventLogDir.get,
          _conf, _hadoopConfiguration)
      logger.start()
      listenerBus.addListener(logger)
      Some(logger)
    } else {
      None
    }
  val dynamicAllocationEnabled = Utils.isDynamicAllocationEnabled(_conf)
  _executorAllocationManager =
    if (dynamicAllocationEnabled) {
      Some(new ExecutorAllocationManager(this, listenerBus, _conf))
    } else {
      None
    }
  _executorAllocationManager.foreach(_.start())
  _cleaner =
    if (_conf.getBoolean("spark.cleaner.referenceTracking", true)) {
      Some(new ContextCleaner(this))
    } else {
      None
    }
  _cleaner.foreach(_.start())
  setupAndStartListenerBus()
  postEnvironmentUpdate()
  postApplicationStart()
  _taskScheduler.postStartHook()
  _env.metricsSystem.registerSource(_dagScheduler.metricsSource)
  _env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager))
  _executorAllocationManager.foreach { e =>
    _env.metricsSystem.registerSource(e.executorAllocationManagerSource)
  }
  _shutdownHookRef = ShutdownHookManager.addShutdownHook(
    ShutdownHookManager.SPARK_CONTEXT_SHUTDOWN_PRIORITY) { () =>
    logInfo("Invoking stop() from shutdown hook")
    stop()
  }
} catch {
  case NonFatal(e) =>
    logError("Error initializing SparkContext.", e)
    try {
      stop()
    } catch {
      case NonFatal(inner) =>
        logError("Error stopping SparkContext after init error.", inner)
    } finally {
      throw e
    }
}

1. Validate the configuration and set the Spark Driver's host and port

2. Initialize the event log directory and compression type

3. Initialize the application status store and the event bus LiveListenerBus

4. Create the Spark execution environment SparkEnv

5. Initialize the status tracker SparkStatusTracker

6. Create the ConsoleProgressBar if configured

7. Create and initialize the Spark UI

8. Set up the Hadoop-related configuration and the Executor environment variables

9. Register the HeartbeatReceiver

10. Create the TaskScheduler

11. Create the DAGScheduler

12. Start the TaskScheduler

13. Initialize the block manager BlockManager

14. Start the metrics system MetricsSystem

15. Create the event logging listener

16. Create and start the Executor allocation manager ExecutorAllocationManager

17. Create and start the ContextCleaner

18. Register extra SparkListeners and start the event bus (setupAndStartListenerBus)

19. Post the Spark environment update (postEnvironmentUpdate)

20. Create DAGSchedulerSource, BlockManagerSource and ExecutorAllocationManagerSource

SparkContext.setActiveContext(this, allowMultipleContexts)
This call marks the SparkContext as active.

3.1 Initial Setup

The constructor first records the current CallSite information and then determines whether multiple SparkContext instances are allowed, based on the spark.driver.allowMultipleContexts property, which defaults to false.

class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationClient {
  //Capture the call site of this SparkContext: the user class nearest the top of the stack and the Scala/Spark core class nearest the bottom
  private val creationSite: CallSite = Utils.getCallSite()
  //A JVM normally has only one SparkContext instance. If allowMultipleContexts is set to true in SparkConf,
  //Spark only logs a warning when multiple active SparkContext instances exist, instead of throwing an exception.
  //If it is not configured, it defaults to false.
  private val allowMultipleContexts: Boolean =
    config.getBoolean("spark.driver.allowMultipleContexts", false)
  //Ensures the uniqueness of the SparkContext instance and marks the current one as partially constructed,
  //so that multiple SparkContext instances cannot become active at the same time
  SparkContext.markPartiallyConstructed(this, allowMultipleContexts)
  ....
}

Next, the SparkConf is cloned and the settings are validated. Most importantly, the SparkConf must specify the spark.master (deployment mode) and spark.app.name (application name) properties; otherwise an exception is thrown.

private var _conf: SparkConf = _

_conf = config.clone()
_conf.validateSettings()
if (!_conf.contains("spark.master")) {
  throw new SparkException("A master URL must be set in your configuration")
}
if (!_conf.contains("spark.app.name")) {
  throw new SparkException("An application name must be set in your configuration")
}

3.2 Creating the Execution Environment SparkEnv

SparkEnv is Spark's execution environment object and contains many objects that Executors need at runtime. In local mode the Driver itself creates the Executor; in local-cluster or Standalone deployments the Executor is created inside the CoarseGrainedExecutorBackend process started by a Worker. SparkEnv therefore lives either in the Driver process or in a CoarseGrainedExecutorBackend process.

SparkEnv is created mainly through SparkEnv's createDriverEnv method, which takes four arguments: conf, isLocal, listenerBus, and the number of cores the driver needs to run executors in local mode.

private var _env: SparkEnv = _
  
  
def isLocal: Boolean = Utils.isLocalMaster(_conf)
private[spark] def listenerBus: LiveListenerBus = _listenerBus
  
// Create the Spark execution environment (cache, map output tracker, etc)
_env = createSparkEnv(_conf, isLocal, listenerBus)
SparkEnv.set(_env)
  
private[spark] def createSparkEnv(
    conf: SparkConf,
    isLocal: Boolean,
    listenerBus: LiveListenerBus): SparkEnv = {
  SparkEnv.createDriverEnv(conf, isLocal, listenerBus, SparkContext.numDriverCores(master))
}
  
/**
 * Returns the number of cores needed to run the program in local mode; otherwise 0, since the driver is not used for execution.
 */
private[spark] def numDriverCores(master: String): Int = {
  def convertToInt(threads: String): Int = {
    if (threads == "*") Runtime.getRuntime.availableProcessors() else threads.toInt
  }
  master match {
    case "local" => 1
    case SparkMasterRegex.LOCAL_N_REGEX(threads) => convertToInt(threads)
    case SparkMasterRegex.LOCAL_N_FAILURES_REGEX(threads, _) => convertToInt(threads)
    case _ => 0 // driver is not used for execution
  }
}

3.3 Creating the Spark UI

The Spark UI provides browser-accessible pages with styling and layout that expose rich monitoring data. It is driven by an event-listening mechanism: posted events are buffered, then picked up by a timer-driven dispatcher and delivered to the registered listeners, which update the monitoring data. If the Spark UI is not needed, set spark.ui.enabled to false.

private var _ui: Option[SparkUI] = None

_ui =
  if (conf.getBoolean("spark.ui.enabled", true)) {
    Some(SparkUI.createLiveUI(this, _conf, listenerBus, _jobProgressListener,
      _env.securityManager, appName, startTime = startTime))
  } else {
    None
  }
//Bind the UI before the TaskScheduler starts, so that the bound port can be communicated to the cluster manager
_ui.foreach(_.bind())

3.4 Hadoop-Related Configuration

By default, Spark uses HDFS as its distributed file system, so it needs to obtain the Hadoop-related configuration:

private var _hadoopConfiguration: Configuration = _

_hadoopConfiguration = SparkHadoopUtil.get.newConfiguration(_conf)

3.5 Executor Environment Variables

The environment variables held in executorEnvs are sent to the Master while the application registers; after the Master dispatches the scheduling decision to the Workers, each Worker finally uses the information in executorEnvs to launch its Executors. The Executor memory size is specified with spark.executor.memory, or alternatively with the environment variables SPARK_EXECUTOR_MEMORY or SPARK_MEM.

private var _executorMemory: Int = _

_executorMemory = _conf.getOption("spark.executor.memory")
  .orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
  .orElse(Option(System.getenv("SPARK_MEM"))
  .map(warnSparkMem))
  .map(Utils.memoryStringToMb)
  .getOrElse(1024)

executorEnvs is backed by a HashMap:

private[spark] val executorEnvs = HashMap[String, String]()

for { (envKey, propKey) <- Seq(("SPARK_TESTING", "spark.testing"))
  value <- Option(System.getenv(envKey)).orElse(Option(System.getProperty(propKey)))} {
  executorEnvs(envKey) = value
}
Option(System.getenv("SPARK_PREPEND_CLASSES")).foreach { v =>
  executorEnvs("SPARK_PREPEND_CLASSES") = v
}
// The Mesos scheduler backend relies on this environment variable to set executor memory.
// TODO: Set this only in the Mesos scheduler.
executorEnvs("SPARK_EXECUTOR_MEMORY") = executorMemory + "m"
executorEnvs ++= _conf.getExecutorEnv
executorEnvs("SPARK_USER") = sparkUser

3.6 Registering the HeartbeatReceiver

In a real production environment, Executors run on different nodes. In local mode the Driver and the Executor belong to the same process, so they can interact through direct local calls, and the Driver easily notices when an Executor runs into trouble, for example by catching an exception. In production, however, the Driver and the Executors usually run in different processes, possibly on different machines or even in different data centers, so the Driver loses direct visibility into the Executors. To keep track of them, the Driver creates this heartbeat receiver.

_heartbeatReceiver = env.rpcEnv.setupEndpoint(HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))

//Package: org.apache.spark.rpc.netty
//Class: NettyRpcEnv
override def setupEndpoint(name: String, endpoint: RpcEndpoint): RpcEndpointRef = {
  dispatcher.registerRpcEndpoint(name, endpoint)
}

The code above uses the setupEndpoint method of NettyRpcEnv, a sub-component of SparkEnv. This method registers the HeartbeatReceiver with the RpcEnv's Dispatcher and returns a NettyRpcEndpointRef that references the HeartbeatReceiver.
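
To make the pattern concrete, below is a minimal sketch of an endpoint registered the same way. Note that Spark's RPC classes are private[spark], so this only illustrates the mechanism rather than a user-facing API; the endpoint name and message type are invented for the example.

import org.apache.spark.rpc.{RpcCallContext, RpcEndpoint, RpcEnv}

case object Ping  // hypothetical message type

class PingEndpoint(override val rpcEnv: RpcEnv) extends RpcEndpoint {
  // Like HeartbeatReceiver, answer ask-style messages with a reply.
  override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case Ping => context.reply(true)
  }
}

// Registration mirrors what SparkContext does for HeartbeatReceiver:
// val pingRef = rpcEnv.setupEndpoint("ping", new PingEndpoint(rpcEnv))
// val alive = pingRef.askSync[Boolean](Ping)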

3.7 Creating the Task Scheduler TaskScheduler

TaskScheduler is another key part of SparkContext. It is responsible for submitting tasks, requesting scheduling from the cluster manager, sending tasks to the cluster and running them, retrying failed tasks, and re-launching slow (straggler) tasks on other nodes. Allocating Executors to an application and starting them is the first level of scheduling; assigning tasks to Executors and running them is the second level. TaskScheduler can also be seen as the client of task scheduling. Its main responsibilities are:

  • Create and maintain a TaskSetManager for each TaskSet, tracking task locality and failure information
  • Re-submit straggler tasks to other nodes for retry
  • Report execution status to DAGScheduler, including fetch-failed errors when shuffle output is lost

TaskScheduler handles task scheduling and resource assignment, while the SchedulerBackend handles communication with the Master and Workers, collecting information about the resources that the Workers have allocated to this application.

private var _schedulerBackend: SchedulerBackend = _
private var _taskScheduler: TaskScheduler = _

val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts

The createTaskScheduler method matches the deployment mode from the master setting, creates a TaskSchedulerImpl, and instantiates the corresponding SchedulerBackend.

private def createTaskScheduler(
    sc: SparkContext,
    master: String,
    deployMode: String): (SchedulerBackend, TaskScheduler) = {
  import SparkMasterRegex._
    
  master match {
    case "local" =>
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalSchedulerBackend(sc.getConf, scheduler, 1)
      scheduler.initialize(backend)
      (backend, scheduler)
  
    case LOCAL_N_REGEX(threads) =>
      def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
      // local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
      val threadCount = if (threads == "*") localCpuCount else threads.toInt
      if (threadCount <= 0) {
        throw new SparkException(s"Asked to run locally with $threadCount threads")
      }
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalSchedulerBackend(sc.getConf, scheduler, threadCount)
      scheduler.initialize(backend)
      (backend, scheduler)
  
    case LOCAL_N_FAILURES_REGEX(threads, maxFailures) =>
      def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
      // local[*, M] means the number of cores on the computer with M failures
      // local[N, M] means exactly N threads with M failures
      val threadCount = if (threads == "*") localCpuCount else threads.toInt
      val scheduler = new TaskSchedulerImpl(sc, maxFailures.toInt, isLocal = true)
      val backend = new LocalSchedulerBackend(sc.getConf, scheduler, threadCount)
      scheduler.initialize(backend)
      (backend, scheduler)
  
    case SPARK_REGEX(sparkUrl) =>
      val scheduler = new TaskSchedulerImpl(sc)
      val masterUrls = sparkUrl.split(",").map("spark://" + _)
      val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      (backend, scheduler)
  
    case LOCAL_CLUSTER_REGEX(numSlaves, coresPerSlave, memoryPerSlave) =>
      // Check to make sure memory requested <= memoryPerSlave. Otherwise Spark will just hang.
      val memoryPerSlaveInt = memoryPerSlave.toInt
      if (sc.executorMemory > memoryPerSlaveInt) {
        throw new SparkException(
          "Asked to launch cluster with %d MB RAM / worker but requested %d MB/worker".format(
            memoryPerSlaveInt, sc.executorMemory))
      }
  
      val scheduler = new TaskSchedulerImpl(sc)
      val localCluster = new LocalSparkCluster(
        numSlaves.toInt, coresPerSlave.toInt, memoryPerSlaveInt, sc.conf)
      val masterUrls = localCluster.start()
      val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      backend.shutdownCallback = (backend: StandaloneSchedulerBackend) => {
        localCluster.stop()
      }
      (backend, scheduler)
  
    case masterUrl =>
      val cm = getClusterManager(masterUrl) match {
        case Some(clusterMgr) => clusterMgr
        case None => throw new SparkException("Could not parse Master URL: '" + master + "'")
      }
      try {
        val scheduler = cm.createTaskScheduler(sc, masterUrl)
        val backend = cm.createSchedulerBackend(sc, masterUrl, scheduler)
        cm.initialize(scheduler, backend)
        (backend, scheduler)
      } catch {
        case se: SparkException => throw se
        case NonFatal(e) =>
          throw new SparkException("External scheduler cannot be instantiated", e)
      }
  }
}

3.8 Creating and Starting the DAGScheduler

DAGScheduler does the preparatory work before tasks are formally handed to TaskScheduler for submission: creating Jobs, splitting the RDDs of a DAG into different Stages, submitting Stages, and so on.

@volatile private var _dagScheduler: DAGScheduler = _

_dagScheduler = new DAGScheduler(this)

The data structures inside DAGScheduler mainly maintain the mapping between jobIds and stageIds, the Stages, the ActiveJobs, and the locations of cached RDD partitions.
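
For reference, here is an abridged sketch of those fields as they appear in the Spark 2.x DAGScheduler; names and types can differ slightly between versions, so consult the source of the version you are reading.

// Abridged; see DAGScheduler.scala in your Spark version for the full set.
private[scheduler] val nextJobId = new AtomicInteger(0)
private[scheduler] val nextStageId = new AtomicInteger(0)
private[scheduler] val jobIdToStageIds = new HashMap[Int, HashSet[Int]]   // jobId -> stageIds
private[scheduler] val stageIdToStage = new HashMap[Int, Stage]
private[scheduler] val shuffleIdToMapStage = new HashMap[Int, ShuffleMapStage]
private[scheduler] val jobIdToActiveJob = new HashMap[Int, ActiveJob]
private[scheduler] val waitingStages = new HashSet[Stage]                 // waiting on parent stages
private[scheduler] val runningStages = new HashSet[Stage]
private[scheduler] val failedStages = new HashSet[Stage]
private[scheduler] val activeJobs = new HashSet[ActiveJob]
private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]]   // cached partition locations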

3.9 Starting the TaskScheduler

When TaskScheduler starts, it actually calls the start method of its backend:

_taskScheduler.start()

override def start() {
  backend.start()
  if (!isLocal && conf.getBoolean("spark.speculation", false)) {
    logInfo("Starting speculative execution thread")
    speculationScheduler.scheduleWithFixedDelay(new Runnable {
      override def run(): Unit = Utils.tryOrStopSparkContext(sc) {
        checkSpeculatableTasks()
      }
    }, SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
  }
}

3.10 Starting the MetricsSystem

MetricsSystem involves three concepts:

  • Instance: identifies who is using the metrics system; Spark distinguishes Master, Worker, Application, Driver and Executor instances
  • Source: identifies where metrics are collected from; there are two kinds of Source: Spark internal sources such as MasterSource and WorkerSource, and common sources such as JvmSource
  • Sink: identifies where metrics are sent to; Spark currently provides sinks such as ConsoleSink, CsvSink, JmxSink, MetricsServlet and GraphiteSink, and uses MetricsServlet as the default Sink (a configuration sketch follows at the end of this section)

Starting MetricsSystem involves:

  • 1. Registering the Sources
  • 2. Registering the Sinks
  • 3. Attaching the Sinks' ServletContextHandlers to Jetty

After MetricsSystem has started, the ServletContextHandlers associated with the Sinks are iterated over and attachHandler is called to bind them to the SparkUI:

_env.metricsSystem.start()

_env.metricsSystem.getServletHandlers.foreach(handler => ui.foreach(_.attachHandler(handler)))
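
As a hedged example of the Sink configuration mentioned above: metrics can be configured through a metrics.properties file pointed to by spark.metrics.conf, or through spark.metrics.conf.*-prefixed properties on the SparkConf. The sink instance name and output directory below are placeholders.

import org.apache.spark.SparkConf

// Placeholder sink configuration; CsvSink is one of the bundled sinks.
val conf = new SparkConf()
  .set("spark.metrics.conf.*.sink.csv.class", "org.apache.spark.metrics.sink.CsvSink")
  .set("spark.metrics.conf.*.sink.csv.period", "10")
  .set("spark.metrics.conf.*.sink.csv.unit", "seconds")
  .set("spark.metrics.conf.*.sink.csv.directory", "/tmp/spark-metrics")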

3.11 Creating the Event Logging Listener

EventLoggingListener is the listener that persists events to storage. It is an optional component of SparkContext, enabled when the spark.eventLog.enabled property is true (the default is false).
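
On the application side, a hedged configuration sketch for enabling event logging looks like this; the log directory is a placeholder and must already exist (and be readable by the history server if one is used).

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///spark-events")   // placeholder directory
  .set("spark.eventLog.compress", "true")              // optional; uses the configured CompressionCodec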

private var _eventLogDir: Option[URI] = None
private var _eventLogCodec: Option[String] = None
private var _eventLogger: Option[EventLoggingListener] = None

private[spark] def isEventLogEnabled: Boolean = _conf.getBoolean("spark.eventLog.enabled", false)
private[spark] def eventLogDir: Option[URI] = _eventLogDir
private[spark] def eventLogCodec: Option[String] = _eventLogCodec

_eventLogDir =
  if (isEventLogEnabled) {
    val unresolvedDir = conf.get("spark.eventLog.dir", EventLoggingListener.DEFAULT_LOG_DIR)
      .stripSuffix("/")
    Some(Utils.resolveURI(unresolvedDir))
  } else {
    None
  }
_eventLogCodec = {
  val compress = _conf.getBoolean("spark.eventLog.compress", false)
  if (compress && isEventLogEnabled) {
    Some(CompressionCodec.getCodecName(_conf)).map(CompressionCodec.getShortName)
  } else {
    None
  }
}
_eventLogger =
  if (isEventLogEnabled) {
    val logger =
      new EventLoggingListener(_applicationId, _applicationAttemptId, _eventLogDir.get,
        _conf, _hadoopConfiguration)
    logger.start()
    listenerBus.addToEventLogQueue(logger)
    Some(logger)
  } else {
    None
  }

3.12 Creating and Starting the ExecutorAllocationManager

ExecutorAllocationManager manages the Executors allocated to the application.

val dynamicAllocationEnabled = Utils.isDynamicAllocationEnabled(_conf)
_executorAllocationManager =
  if (dynamicAllocationEnabled) {
    schedulerBackend match {
      case b: ExecutorAllocationClient =>
        Some(new ExecutorAllocationManager(
          schedulerBackend.asInstanceOf[ExecutorAllocationClient], listenerBus, _conf,
          _env.blockManager.master))
      case _ =>
        None
    }
  } else {
    None
  }
_executorAllocationManager.foreach(_.start())

By default the ExecutorAllocationManager is not created; set spark.dynamicAllocation.enabled to true to enable it. ExecutorAllocationManager reads and validates configuration such as the minimum and maximum number of Executors and the number of tasks each Executor may run. Its start method adds an ExecutorAllocationListener to the listenerBus; by listening to events on the bus, the listener dynamically adds and removes Executors, and it periodically scans the Executors to kill and remove those that have been idle past the timeout.
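
A hedged configuration sketch for enabling dynamic allocation; the bounds are placeholders, and most cluster managers also require the external shuffle service so that shuffle data survives Executor removal.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")            // placeholder bounds
  .set("spark.dynamicAllocation.maxExecutors", "20")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .set("spark.shuffle.service.enabled", "true")                // usually required with dynamic allocation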

3.13 Creating and Starting the ContextCleaner

ContextCleaner cleans up RDDs, ShuffleDependencys and Broadcast objects that have gone out of scope.

_cleaner =
  if (_conf.getBoolean("spark.cleaner.referenceTracking", true)) {
    Some(new ContextCleaner(this))
  } else {
    None
  }
_cleaner.foreach(_.start())

ContextCleaner consists of:

  • referenceQueue: the reference queue that buffers references to top-level AnyRef objects
  • referenceBuffer: caches the weak references to those AnyRef objects
  • listeners: the array of listeners for cleanup work
  • cleaningThread: the thread that performs the actual cleanup

3.14 Extra SparkListeners and Starting the Event Bus

SparkContext also provides a way to register user-defined SparkListeners:

//Register the listeners specified by the spark.extraListeners property and start the listener bus
setupAndStartListenerBus()

private def setupAndStartListenerBus(): Unit = {
  try {
    // Get the class names of the user-defined SparkListeners
    conf.get(EXTRA_LISTENERS).foreach { classNames =>
      // Instantiate each user-defined SparkListener via reflection and add it to the event bus's listener list
      val listeners = Utils.loadExtensions(classOf[SparkListenerInterface], classNames, conf)
      listeners.foreach { listener =>
        listenerBus.addToSharedQueue(listener)
        logInfo(s"Registered listener ${listener.getClass().getName()}")
      }
    }
  } catch {
    case e: Exception =>
      try {
        stop()
      } finally {
        throw new SparkException(s"Exception when registering SparkListener", e)
      }
  }
  // Start the event bus and set _listenerBusStarted to true
  listenerBus.start(this, _env.metricsSystem)
  _listenerBusStarted = true
}

From the code, setupAndStartListenerBus proceeds as follows (a sketch of a user-defined listener appears after the list):

  • 1. Read the class names of the user-defined SparkListeners from the spark.extraListeners property; multiple listeners can be specified, separated by commas.
  • 2. Instantiate each user-defined SparkListener via reflection and add it to the event bus's listener list.
  • 3. Start the event bus and set _listenerBusStarted to true.
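
A hedged sketch of such a listener; the class and its output are made up for illustration, and the class must be on the driver's classpath with a no-argument (or SparkConf) constructor so it can be instantiated via reflection.

import org.apache.spark.SparkConf
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

// Hypothetical listener that just prints job boundaries.
class JobLoggingListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    println(s"job ${jobStart.jobId} started with stages ${jobStart.stageIds.mkString(",")}")

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"job ${jobEnd.jobId} finished: ${jobEnd.jobResult}")
}

// Registered declaratively, so setupAndStartListenerBus picks it up via reflection:
// new SparkConf().set("spark.extraListeners", "com.example.JobLoggingListener")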

3.15 Updating the Spark Environment

Initialization of SparkContext may have changed the environment, so an environment update needs to be posted:

postEnvironmentUpdate()

During SparkContext initialization, if the spark.jars property is set, the jars it specifies are added by the addJar method to the jar directory of the file server. Every time a jar is added, postEnvironmentUpdate is called to refresh the environment. Adding a file works the same way and also triggers postEnvironmentUpdate.

private var _jars: Seq[String] = _
private var _files: Seq[String] = _

//Step 1: resolve the variables
_jars = Utils.getUserJars(_conf)
_files = _conf.getOption("spark.files").map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten

def getUserJars(conf: SparkConf): Seq[String] = {
  val sparkJars = conf.getOption("spark.jars")
  sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
}

//Step 2: add each jar and each file
if (jars != null) {
  jars.foreach(addJar)
}
if (files != null) {
  files.foreach(addFile)
}

def addJar(path: String) {
  def addJarFile(file: File): String = {
    try {
      if (!file.exists()) {
        throw new FileNotFoundException(s"Jar ${file.getAbsolutePath} not found")
      }
      if (file.isDirectory) {
        throw new IllegalArgumentException(
          s"Directory ${file.getAbsoluteFile} is not allowed for addJar")
      }
      env.rpcEnv.fileServer.addJar(file)
    } catch {
      case NonFatal(e) =>
        logError(s"Failed to add $path to Spark environment", e)
        null
    }
  }
  
  if (path == null) {
    logWarning("null specified as parameter to addJar")
  } else {
    val key = if (path.contains("\\")) {
      // For local paths with backslashes on Windows, URI throws an exception
      addJarFile(new File(path))
    } else {
      val uri = new URI(path)
      // SPARK-17650: Make sure this is a valid URL before adding it to the list of dependencies
      Utils.validateURL(uri)
      uri.getScheme match {
        // A JAR file which exists only on the driver node
        case null =>
          // SPARK-22585 path without schema is not url encoded
          addJarFile(new File(uri.getRawPath))
        // A JAR file which exists only on the driver node
        case "file" => addJarFile(new File(uri.getPath))
        // A JAR file which exists locally on every worker node
        case "local" => "file:" + uri.getPath
        case _ => path
      }
    }
    if (key != null) {
      val timestamp = System.currentTimeMillis
      if (addedJars.putIfAbsent(key, timestamp).isEmpty) {
        logInfo(s"Added JAR $path at $key with timestamp $timestamp")
        postEnvironmentUpdate()
      }
    }
  }
}

def addFile(path: String): Unit = {
  addFile(path, false)
}

def addFile(path: String, recursive: Boolean): Unit = {
  val uri = new Path(path).toUri
  val schemeCorrectedPath = uri.getScheme match {
    case null | "local" => new File(path).getCanonicalFile.toURI.toString
    case _ => path
  }
  
  val hadoopPath = new Path(schemeCorrectedPath)
  val scheme = new URI(schemeCorrectedPath).getScheme
  if (!Array("http", "https", "ftp").contains(scheme)) {
    val fs = hadoopPath.getFileSystem(hadoopConfiguration)
    val isDir = fs.getFileStatus(hadoopPath).isDirectory
    if (!isLocal && scheme == "file" && isDir) {
      throw new SparkException(s"addFile does not support local directories when not running " +
        "local mode.")
    }
    if (!recursive && isDir) {
      throw new SparkException(s"Added file $hadoopPath is a directory and recursive is not " +
        "turned on.")
    }
  } else {
    // SPARK-17650: Make sure this is a valid URL before adding it to the list of dependencies
    Utils.validateURL(uri)
  }
  
  val key = if (!isLocal && scheme == "file") {
    env.rpcEnv.fileServer.addFile(new File(uri.getPath))
  } else {
    schemeCorrectedPath
  }
  val timestamp = System.currentTimeMillis
  if (addedFiles.putIfAbsent(key, timestamp).isEmpty) {
    logInfo(s"Added file $path at $key with timestamp $timestamp")
    // Fetch the file locally so that closures which are run on the driver can still use the
    // SparkFiles API to access files.
    Utils.fetchFile(uri.toString, new File(SparkFiles.getRootDirectory()), conf,
      env.securityManager, hadoopConfiguration, timestamp, useCache = false)
    postEnvironmentUpdate()
  }
}
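
From the application side these methods are used as shown below; the paths are placeholders, and files added this way can be resolved on the executors through SparkFiles.

import org.apache.spark.SparkFiles

// Placeholder paths for illustration.
sc.addJar("hdfs:///libs/my-udfs.jar")      // shipped to executors and added to their classpath
sc.addFile("hdfs:///data/lookup.csv")      // downloaded into each executor's working directory

sc.parallelize(1 to 4).map { _ =>
  // Resolve the local copy of the added file on the executor.
  SparkFiles.get("lookup.csv")
}.collect()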

The postEnvironmentUpdate method proceeds in two steps:

  • 1. Call SparkEnv's environmentDetails method to assemble the environment details: JVM parameters, Spark properties, system properties, classpath entries and so on.
  • 2. Create a SparkListenerEnvironmentUpdate event carrying these details and post it to the listenerBus; the event is eventually handled by EnvironmentListener and determines what the EnvironmentPage shows.

private def postEnvironmentUpdate() {
  if (taskScheduler != null) {
    val schedulingMode = getSchedulingMode.toString
    val addedJarPaths = addedJars.keys.toSeq
    val addedFilePaths = addedFiles.keys.toSeq
    val environmentDetails = SparkEnv.environmentDetails(conf, schedulingMode, addedJarPaths,
      addedFilePaths)
    val environmentUpdate = SparkListenerEnvironmentUpdate(environmentDetails)
    listenerBus.post(environmentUpdate)
  }
}

//Class: org.apache.spark.SparkEnv
private[spark]
def environmentDetails(
    conf: SparkConf,
    schedulingMode: String,
    addedJars: Seq[String],
    addedFiles: Seq[String]): Map[String, Seq[(String, String)]] = {
  
  import Properties._
  val jvmInformation = Seq(
    ("Java Version", s"$javaVersion ($javaVendor)"),
    ("Java Home", javaHome),
    ("Scala Version", versionString)
  ).sorted
  
  // Spark properties
  // This includes the scheduling mode whether or not it is configured (used by SparkUI)
  val schedulerMode =
    if (!conf.contains("spark.scheduler.mode")) {
      Seq(("spark.scheduler.mode", schedulingMode))
    } else {
      Seq.empty[(String, String)]
    }
  val sparkProperties = (conf.getAll ++ schedulerMode).sorted
  
  // System properties that are not java classpaths
  val systemProperties = Utils.getSystemProperties.toSeq
  val otherProperties = systemProperties.filter { case (k, _) =>
    k != "java.class.path" && !k.startsWith("spark.")
  }.sorted
  
  // Class paths including all added jars and files
  val classPathEntries = javaClassPath
    .split(File.pathSeparator)
    .filterNot(_.isEmpty)
    .map((_, "System Classpath"))
  val addedJarsAndFiles = (addedJars ++ addedFiles).map((_, "Added By User"))
  val classPaths = (addedJarsAndFiles ++ classPathEntries).sorted
  
  Map[String, Seq[(String, String)]](
    "JVM Information" -> jvmInformation,
    "Spark Properties" -> sparkProperties,
    "System Properties" -> otherProperties,
    "Classpath Entries" -> classPaths)
}

3.16 Posting the Application Start Event

The postApplicationStart method simply posts a SparkListenerApplicationStart event to the listenerBus:

postApplicationStart()

private def postApplicationStart() {
  listenerBus.post(SparkListenerApplicationStart(appName, Some(applicationId), startTime, sparkUser, applicationAttemptId, schedulerBackend.getDriverLogUrls))
}

3.17 Creating DAGSchedulerSource, BlockManagerSource and ExecutorAllocationManagerSource

First, taskScheduler's postStartHook method is called; its purpose is to wait until the backend is ready.

_taskScheduler.postStartHook()
_env.metricsSystem.registerSource(_dagScheduler.metricsSource)
_env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager))
_executorAllocationManager.foreach { e =>
  _env.metricsSystem.registerSource(e.executorAllocationManagerSource)
}

3.18 Marking the SparkContext as Active

At the end of initialization, the state of the current SparkContext is changed from contextBeingConstructed (under construction) to activeContext (active):

SparkContext.setActiveContext(this, allowMultipleContexts)

Reposted from blog.csdn.net/LINBE_blazers/article/details/88143813