Spark Startup Flow: Source Code Analysis

A Simple Example

import org.apache.spark.{SparkConf, SparkContext}

object Sum {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SUM")
    conf.setMaster("local[3]")
    val sc = new SparkContext(conf)
    val data = sc.parallelize(1 to 10000)
    val d = data.reduce(sum(_, _))
    println(d)
  }

  def sum(x: Int, y: Int): Int = {
    val s = x + y
    Thread.sleep(1000) // slow each combine down so the job is easy to observe in the UI
    s
  }
}

The Spark Driver submits the user program; in effect it can be regarded as Spark's client.
Driver initialization revolves around the initialization of SparkContext, which is essentially the engine of a Spark application: only after SparkContext has been fully initialized can jobs be submitted to the Spark cluster.

Instantiating SparkConf

class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {

  import SparkConf._

  /** Create a SparkConf that loads defaults from system properties and the classpath */
  def this() = this(true)

  private val settings = new ConcurrentHashMap[String, String]()

  if (loadDefaults) {
    // Load any spark.* system properties
    for ((key, value) <- Utils.getSystemProperties if key.startsWith("spark.")) {
      set(key, value)
    }
  }

When SparkConf is instantiated as conf, every system property whose key starts with "spark." is loaded into it.
conf.setAppName() and conf.setMaster() are mandatory settings; both can also be supplied through command-line parameters instead.
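As a quick illustration of this default-loading behaviour, a minimal sketch (not from the original post; the property name is only an example):

System.setProperty("spark.executor.memory", "2g")   // any "spark."-prefixed JVM property...
val conf = new SparkConf()                          // ...is picked up because loadDefaults = true
assert(conf.get("spark.executor.memory") == "2g")
conf.setAppName("SUM").setMaster("local[3]")        // the mandatory settings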
The SparkConf instance is validated and further configured when SparkContext is instantiated:

_conf = config.clone()
_conf.validateSettings()

if (!_conf.contains("spark.master")) {
  throw new SparkException("A master URL must be set in your configuration")
}
if (!_conf.contains("spark.app.name")) {
  throw new SparkException("An application name must be set in your configuration")
}

// System property spark.yarn.app.id must be set if user code ran by AM on a YARN cluster
// yarn-standalone is deprecated, but still supported
if ((master == "yarn-cluster" || master == "yarn-standalone") &&
    !_conf.contains("spark.yarn.app.id")) {
  throw new SparkException("Detected yarn-cluster mode, but isn't running on a cluster. " +
    "Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.")
}

if (_conf.getBoolean("spark.logConf", false)) {
  logInfo("Spark configuration:\n" + _conf.toDebugString)
}

// Set Spark driver host and port system properties
_conf.setIfMissing("spark.driver.host", Utils.localHostName())
_conf.setIfMissing("spark.driver.port", "0")
...................................

Instantiating SparkContext

Core code blocks executed in the SparkContext class body

The key fields of SparkContext below make up the object's internal state and must be assigned during instantiation:

/* ------------------------------------------------------------------------------------- *
 | Private variables. These variables keep the internal state of the context, and are    |
 | not accessible by the outside world. They're mutable since we want to initialize all  |
 | of them to some neutral value ahead of time, so that calling "stop()" while the       |
 | constructor is still running is safe.                                                 |
 * ------------------------------------------------------------------------------------- */

private var _conf: SparkConf = _
private var _eventLogDir: Option[URI] = None
private var _eventLogCodec: Option[String] = None
private var _env: SparkEnv = _
private var _metadataCleaner: MetadataCleaner = _
private var _jobProgressListener: JobProgressListener = _
private var _statusTracker: SparkStatusTracker = _
private var _progressBar: Option[ConsoleProgressBar] = None
private var _ui: Option[SparkUI] = None
private var _hadoopConfiguration: Configuration = _
private var _executorMemory: Int = _
private var _schedulerBackend: SchedulerBackend = _
private var _taskScheduler: TaskScheduler = _
private var _heartbeatReceiver: RpcEndpointRef = _
@volatile private var _dagScheduler: DAGScheduler = _
private var _applicationId: String = _
private var _applicationAttemptId: Option[String] = None
private var _eventLogger: Option[EventLoggingListener] = None
private var _executorAllocationManager: Option[ExecutorAllocationManager] = None
private var _cleaner: Option[ContextCleaner] = None
private var _listenerBusStarted: Boolean = false
private var _jars: Seq[String] = _
private var _files: Seq[String] = _
private var _shutdownHookRef: AnyRef = _

/* ------------------------------------------------------------------------------------- *
 | Accessors and public fields. These provide access to the internal state of the        |
 | context.                                                                              |
 * ------------------------------------------------------------------------------------- */

1. JobProgressListener

JobProgressListener listens for events, collects the related job and stage information, and keeps it globally so the Spark UI can display it in real time.

// :: DeveloperApi ::
// Tracks task-level information to be displayed in the UI.

// Application:
@volatile var startTime = -1L
@volatile var endTime = -1L

// Jobs:
val activeJobs = new HashMap[JobId, JobUIData]
val completedJobs = ListBuffer[JobUIData]()
val failedJobs = ListBuffer[JobUIData]()
val jobIdToData = new HashMap[JobId, JobUIData]
val jobGroupToJobIds = new HashMap[JobGroupId, HashSet[JobId]]

// Stages:
val pendingStages = new HashMap[StageId, StageInfo]
val activeStages = new HashMap[StageId, StageInfo]
val completedStages = ListBuffer[StageInfo]()
val skippedStages = ListBuffer[StageInfo]()
val failedStages = ListBuffer[StageInfo]()
val stageIdToData = new HashMap[(StageId, StageAttemptId), StageUIData]
val stageIdToInfo = new HashMap[StageId, StageInfo]
val stageIdToActiveJobIds = new HashMap[StageId, HashSet[JobId]]
val poolToActiveStages = HashMap[PoolName, HashMap[StageId, StageInfo]]()
// Total of completed and failed stages that have ever been run.  These may be greater than
// `completedStages.size` and `failedStages.size` if we have run more stages or jobs than
// JobProgressListener's retention limits.
var numCompletedStages = 0
var numFailedStages = 0
var numCompletedJobs = 0
var numFailedJobs = 0

// Misc:
val executorIdToBlockManagerId = HashMap[ExecutorId, BlockManagerId]()
def blockManagerIds: Seq[BlockManagerId] = executorIdToBlockManagerId.values.toSeq
var schedulingMode: Option[SchedulingMode] = None
// To limit the total memory usage of JobProgressListener, we only track information for a fixed
// number of non-active jobs and stages (there is no limit for active jobs and stages):
val retainedStages = conf.getInt("spark.ui.retainedStages", SparkUI.DEFAULT_RETAINED_STAGES)
val retainedJobs = conf.getInt("spark.ui.retainedJobs", SparkUI.DEFAULT_RETAINED_JOBS)
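The two retention limits above come straight from the application configuration; they can be tuned like any other property (a minimal sketch):

val conf = new SparkConf()
  .set("spark.ui.retainedJobs", "200")     // keep UI data for the last 200 finished jobs
  .set("spark.ui.retainedStages", "500")   // and for the last 500 finished stages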

2. LiveListenerBus

All listeners in Spark are managed centrally through the listenerBus.

listenerBus = new LiveListenerBus
listenerBus.addListener(jobProgressListener)

The class hierarchy:

class LiveListenerBus extends AsynchronousListenerBus[SparkListener, SparkListenerEvent]("SparkListenerBus")
  with SparkListenerBus
abstract class AsynchronousListenerBus[L <: AnyRef, E](name: String) extends ListenerBus[L, E]

The trait ListenerBus defines the postToAll(~) method, and the trait SparkListenerBus implements ListenerBus's onPostEvent(~), dispatching each event to the matching listener callback by pattern-matching on the event type:

/**
 * Post the event to all registered listeners. The `postToAll` caller should guarantee calling
 * `postToAll` in the same thread for all events.
 */
final def postToAll(event: E): Unit = {
  // JavaConverters can create a JIterableWrapper if we use asScala.
  // However, this method will be called frequently. To avoid the wrapper cost, here we use
  // Java Iterator directly.
  val iter = listeners.iterator
  while (iter.hasNext) {
    val listener = iter.next()
    try {
      onPostEvent(listener, event)
    } catch {
      case NonFatal(e) =>
        logError(s"Listener ${Utils.getFormattedClassName(listener)} threw an exception", e)
    }
  }
}
The setupAndStartListenerBus() method registers user-defined listeners on the listenerBus via the spark.extraListeners parameter and then calls listenerBus.start().
spark.extraListeners is a comma-separated list of class names; the listener instances are created through reflection (a custom-listener sketch follows the snippet below).

conf.get("spark.extraListeners", "").split(',').map(_.trim).filter(_ != "")
...........
listenerBus.addListener(listener)
listenerBus.start(this)
_listenerBusStarted = true
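A minimal sketch of a custom listener (the class and package names are hypothetical); it could be registered from code with sc.addSparkListener(new MyJobListener) or, as described above, via --conf spark.extraListeners=com.example.MyJobListener:

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

class MyJobListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    println(s"job ${jobStart.jobId} started with ${jobStart.stageInfos.size} stage(s)")

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"job ${jobEnd.jobId} finished with result ${jobEnd.jobResult}")
}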

The postEnvironmentUpdate() method posts a SparkListenerEnvironmentUpdate event carrying the current environment details:

/** Post the environment update event once the task scheduler is ready */
private def postEnvironmentUpdate() {
  if (taskScheduler != null) {
    val schedulingMode = getSchedulingMode.toString
    val addedJarPaths = addedJars.keys.toSeq
    val addedFilePaths = addedFiles.keys.toSeq
    val environmentDetails = SparkEnv.environmentDetails(conf, schedulingMode, addedJarPaths,
      addedFilePaths)
    val environmentUpdate = SparkListenerEnvironmentUpdate(environmentDetails)
    listenerBus.post(environmentUpdate)
  }
}

The postApplicationStart() method posts the application-start event:

/** Post the application start event */
private def postApplicationStart() {
  // Note: this code assumes that the task scheduler has been initialized and has contacted
  // the cluster manager to get an application ID (in case the cluster manager provides one).
  listenerBus.post(SparkListenerApplicationStart(appName, Some(applicationId),
    startTime, sparkUser, applicationAttemptId, schedulerBackend.getDriverLogUrls))
}

3. SparkEnv

// Create the Spark execution environment (cache, map output tracker, etc)
//Helper method to create a SparkEnv for a driver or an executor.
_env = createSparkEnv(_conf, isLocal, listenerBus)
SparkEnv.set(_env)

SparkEnv holds the Spark runtime environment; every driver and every executor has its own SparkEnv instance.
Instances are created through SparkEnv.create(~), which ends with a call to the SparkEnv constructor:

val envInstance = new SparkEnv(
  executorId,
  rpcEnv,
  actorSystem,
  serializer,
  closureSerializer,
  cacheManager,
  mapOutputTracker,
  shuffleManager,
  broadcastManager,
  blockTransferService,
  blockManager,
  securityManager,
  sparkFilesDir,
  metricsSystem,
  memoryManager,
  outputCommitCoordinator,
  conf)
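Once registered with SparkEnv.set(_env), the environment can be looked up from driver or executor code through the companion accessor; a minimal sketch of the assumed usage:

val env = SparkEnv.get                     // the SparkEnv registered for this JVM
println(env.conf.get("spark.app.name"))    // the per-process SparkConf
println(env.blockManager.blockManagerId)   // runtime components such as the BlockManager hang off it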

4. _metadataCleaner

// this.cleanup: called by MetadataCleaner to clean up the persistentRdds map periodically
// A Timer is created that periodically cleans persisted RDDs out of memory
_metadataCleaner = new MetadataCleaner(MetadataCleanerType.SPARK_CONTEXT, this.cleanup, _conf)

5. _statusTracker

_statusTracker = new SparkStatusTracker(this)
/**
 * Low-level status reporting APIs for monitoring job and stage progress.
 *
 * These APIs intentionally provide very weak consistency semantics; consumers of these APIs should
 * be prepared to handle empty / missing information.  For example, a job's stage ids may be known
 * but the status API may not have any information about the details of those stages, so
 * `getStageInfo` could potentially return `None` for a valid stage id.
 *
 * To limit memory usage, these APIs only provide information on recent jobs / stages.  These APIs
 * will provide information for the last `spark.ui.retainedStages` stages and
 * `spark.ui.retainedJobs` jobs.
 */
SparkStatusTracker holds a private reference jobProgressListener = sc.jobProgressListener;
the status tracker obtains all of its job and stage information through that jobProgressListener.
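A minimal sketch of how these low-level APIs might be polled from user code (assumed usage; as the scaladoc warns, empty or missing results must be tolerated):

val tracker = sc.statusTracker
for (jobId <- tracker.getActiveJobIds(); jobInfo <- tracker.getJobInfo(jobId)) {
  println(s"job $jobId: ${jobInfo.status()}")
  for (stageId <- jobInfo.stageIds(); stageInfo <- tracker.getStageInfo(stageId)) {
    println(s"  stage $stageId: ${stageInfo.numCompletedTasks()}/${stageInfo.numTasks()} tasks done")
  }
}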

6. _progressBar

It shows a progress bar in the console. The bar is printed on the line after the last output and keeps overwriting itself so that it stays on a single line.
/**
 * ConsoleProgressBar shows the progress of stages in the next line of the console. It poll the
 * status of active stages from `sc.statusTracker` periodically, the progress bar will be showed
 * up after the stage has ran at least 500ms. If multiple stages run in the same time, the status
 * of them will be combined together, showed in one line.
 */
  _progressBar =
  if (_conf.getBoolean("spark.ui.showConsoleProgress", true) && !log.isInfoEnabled) {
    Some(new ConsoleProgressBar(this))
  } else {
    None
  }
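In other words, the bar only appears when spark.ui.showConsoleProgress is true and the driver's log level is above INFO; to suppress it explicitly (a trivial sketch):

conf.set("spark.ui.showConsoleProgress", "false")   // never create a ConsoleProgressBar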

7. SparkUI

_ui =
  if (conf.getBoolean("spark.ui.enabled", true)) {
    Some(SparkUI.createLiveUI(this, _conf, listenerBus, _jobProgressListener,
      _env.securityManager, appName, startTime = startTime))
  } else {
    // For tests, do not enable the UI
    None
  }
// Bind the UI before starting the task scheduler to communicate
// the bound port to the cluster manager properly
_ui.foreach(_.bind())

SparkUI.create(~) builds the instance; inside it the various UI listeners are instantiated and added to the SparkListenerBus, so all events are monitored in one place.

/**
 * Create a new Spark UI.
 *
 * @param sc optional SparkContext; this can be None when reconstituting a UI from event logs.
 * @param jobProgressListener if supplied, this JobProgressListener will be used; otherwise, the
 *                            web UI will create and register its own JobProgressListener.
 */
private def create(
    sc: Option[SparkContext],
    conf: SparkConf,
    listenerBus: SparkListenerBus,
    securityManager: SecurityManager,
    appName: String,
    basePath: String = "",
    jobProgressListener: Option[JobProgressListener] = None,
    startTime: Long): SparkUI = {

  val _jobProgressListener: JobProgressListener = jobProgressListener.getOrElse {
    val listener = new JobProgressListener(conf)
    listenerBus.addListener(listener)
    listener
  }

  val environmentListener = new EnvironmentListener
  val storageStatusListener = new StorageStatusListener
  val executorsListener = new ExecutorsListener(storageStatusListener)
  val storageListener = new StorageListener(storageStatusListener)
  val operationGraphListener = new RDDOperationGraphListener(conf)

  listenerBus.addListener(environmentListener)
  listenerBus.addListener(storageStatusListener)
  listenerBus.addListener(executorsListener)
  listenerBus.addListener(storageListener)
  listenerBus.addListener(operationGraphListener)

  new SparkUI(sc, conf, securityManager, environmentListener, storageStatusListener,
    executorsListener, _jobProgressListener, storageListener, operationGraphListener,
    appName, basePath, startTime)
}
// initialize() is invoked from the SparkUI class body, i.e. it runs as part of instantiation.
/** Initialize all components of the server. */
def initialize() {
  attachTab(new JobsTab(this))
  attachTab(stagesTab)
  attachTab(new StorageTab(this))
  attachTab(new EnvironmentTab(this))
  attachTab(new ExecutorsTab(this))
  attachHandler(createStaticHandler(SparkUI.STATIC_RESOURCE_DIR, "/static"))
  attachHandler(createRedirectHandler("/", "/jobs/", basePath = basePath))
  attachHandler(ApiRootResource.getServletHandler(this))
  // This should be POST only, but, the YARN AM proxy won't proxy POSTs
  attachHandler(createRedirectHandler(
    "/stages/stage/kill", "/stages/", stagesTab.handleKillRequest,
    httpMethods = Set("GET", "POST")))
}
initialize()

WebUI.bind() calls JettyUtils.startJettyServer("0.0.0.0", port, handlers, conf, name).
startJettyServer starts the Spark UI server; if the requested port is already taken, it retries with port + 1 (and so on) until a free port is found:

  /**
   * Attempt to start a Jetty server bound to the supplied hostName:port using the given
   * context handlers.
   *
   * If the desired port number is contended, continues incrementing ports until a free port is
   * found. Return the jetty Server object, the chosen port, and a mutable collection of handlers.
   */
  def startJettyServer(
      hostName: String,
      port: Int,
      handlers: Seq[ServletContextHandler],
      conf: SparkConf,
      serverName: String = ""): ServerInfo = {

    addFilters(handlers, conf)

    val collection = new ContextHandlerCollection
    val gzipHandlers = handlers.map { h =>
      val gzipHandler = new GzipHandler
      gzipHandler.setHandler(h)
      gzipHandler
    }
    collection.setHandlers(gzipHandlers.toArray)

    // Bind to the given port, or throw a java.net.BindException if the port is occupied
    def connect(currentPort: Int): (Server, Int) = {
      val server = new Server(new InetSocketAddress(hostName, currentPort))
      val pool = new QueuedThreadPool
      pool.setDaemon(true)
      server.setThreadPool(pool)
      val errorHandler = new ErrorHandler()
      errorHandler.setShowStacks(true)
      server.addBean(errorHandler)
      server.setHandler(collection)
      try {
        server.start()
        (server, server.getConnectors.head.getLocalPort)
      } catch {
        case e: Exception =>
          server.stop()
          pool.stop()
          throw e
      }
    }

    val (server, boundPort) = Utils.startServiceOnPort[Server](port, connect, conf, serverName)
    ServerInfo(server, boundPort, collection)
  }
}

8. _hadoopConfiguration: Configuration

_hadoopConfiguration = SparkHadoopUtil.get.newConfiguration(_conf)

// Inside newConfiguration: the Hadoop settings embedded in the Spark configuration are extracted and used as the Hadoop configuration.
// Copy any "spark.hadoop.foo=bar" system properties into conf as "foo=bar"
conf.getAll.foreach { case (key, value) =>
  if (key.startsWith("spark.hadoop.")) {
    hadoopConf.set(key.substring("spark.hadoop.".length), value)
  }
}
val bufferSize = conf.get("spark.buffer.size", "65536")
hadoopConf.set("io.file.buffer.size", bufferSize)
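So a property set with the spark.hadoop. prefix surfaces in the resulting Hadoop Configuration with the prefix stripped; a minimal sketch (the property name is only an example):

conf.set("spark.hadoop.dfs.replication", "2")                   // set on the Spark side ...
val sc = new SparkContext(conf)
assert(sc.hadoopConfiguration.get("dfs.replication") == "2")    // ... visible to Hadoop code
println(sc.hadoopConfiguration.get("io.file.buffer.size"))      // "65536" unless spark.buffer.size overrides it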

9. _taskScheduler

// We need to register "HeartbeatReceiver" before "createTaskScheduler" because Executor will
// retrieve "HeartbeatReceiver" in the constructor. (SPARK-6640)
_heartbeatReceiver = env.rpcEnv.setupEndpoint(
  HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))

// Create and start the scheduler
val (sched, ts) = SparkContext.createTaskScheduler(this, master)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
_taskScheduler.start()

_applicationId = _taskScheduler.applicationId()
_applicationAttemptId = taskScheduler.applicationAttemptId()
_conf.set("spark.app.id", _applicationId)
_ui.foreach(_.setAppId(_applicationId))
_env.blockManager.initialize(_applicationId)

env.rpcEnv.setupEndpoint(~) registers the HeartbeatReceiver endpoint and returns the _heartbeatReceiver reference.
The TaskScheduler is then created and started, the applicationId is obtained from it, and the related components are configured with that id.
SparkContext.createTaskScheduler(~) looks like this:

/**
   * Create a task scheduler based on a given master URL.
   * Return a 2-tuple of the scheduler backend and the task scheduler.
   */
  private def createTaskScheduler(
      sc: SparkContext,
      master: String): (SchedulerBackend, TaskScheduler) = {
    import SparkMasterRegex._

    // When running locally, don't try to re-execute tasks on failure.
    val MAX_LOCAL_TASK_FAILURES = 1

    master match {
      case "local" =>
        val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
        val backend = new LocalBackend(sc.getConf, scheduler, 1)
        scheduler.initialize(backend)
        (backend, scheduler)

      case LOCAL_N_REGEX(threads) =>
        def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
        // local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
        val threadCount = if (threads == "*") localCpuCount else threads.toInt
        if (threadCount <= 0) {
          throw new SparkException(s"Asked to run locally with $threadCount threads")
        }
        val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
        val backend = new LocalBackend(sc.getConf, scheduler, threadCount)
        scheduler.initialize(backend)
        (backend, scheduler)

      case LOCAL_N_FAILURES_REGEX(threads, maxFailures) =>
        def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
        // local[*, M] means the number of cores on the computer with M failures
        // local[N, M] means exactly N threads with M failures
        val threadCount = if (threads == "*") localCpuCount else threads.toInt
        val scheduler = new TaskSchedulerImpl(sc, maxFailures.toInt, isLocal = true)
        val backend = new LocalBackend(sc.getConf, scheduler, threadCount)
        scheduler.initialize(backend)
        (backend, scheduler)

      case SPARK_REGEX(sparkUrl) =>
        val scheduler = new TaskSchedulerImpl(sc)
        val masterUrls = sparkUrl.split(",").map("spark://" + _)
        val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
        scheduler.initialize(backend)
        (backend, scheduler)

      case LOCAL_CLUSTER_REGEX(numSlaves, coresPerSlave, memoryPerSlave) =>
        // Check to make sure memory requested <= memoryPerSlave. Otherwise Spark will just hang.
        val memoryPerSlaveInt = memoryPerSlave.toInt
        if (sc.executorMemory > memoryPerSlaveInt) {
          throw new SparkException(
            "Asked to launch cluster with %d MB RAM / worker but requested %d MB/worker".format(
              memoryPerSlaveInt, sc.executorMemory))
        }

        val scheduler = new TaskSchedulerImpl(sc)
        val localCluster = new LocalSparkCluster(
          numSlaves.toInt, coresPerSlave.toInt, memoryPerSlaveInt, sc.conf)
        val masterUrls = localCluster.start()
        val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
        scheduler.initialize(backend)
        backend.shutdownCallback = (backend: SparkDeploySchedulerBackend) => {
          localCluster.stop()
        }
        (backend, scheduler)

      case "yarn-standalone" | "yarn-cluster" =>
        if (master == "yarn-standalone") {
          logWarning(
            "\"yarn-standalone\" is deprecated as of Spark 1.0. Use \"yarn-cluster\" instead.")
        }
        val scheduler = try {
          val clazz = Utils.classForName("org.apache.spark.scheduler.cluster.YarnClusterScheduler")
          val cons = clazz.getConstructor(classOf[SparkContext])
          cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]
        } catch {
          // TODO: Enumerate the exact reasons why it can fail
          // But irrespective of it, it means we cannot proceed !
          case e: Exception => {
            throw new SparkException("YARN mode not available ?", e)
          }
        }
        val backend = try {
          val clazz =
            Utils.classForName("org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend")
          val cons = clazz.getConstructor(classOf[TaskSchedulerImpl], classOf[SparkContext])
          cons.newInstance(scheduler, sc).asInstanceOf[CoarseGrainedSchedulerBackend]
        } catch {
          case e: Exception => {
            throw new SparkException("YARN mode not available ?", e)
          }
        }
        scheduler.initialize(backend)
        (backend, scheduler)

      case "yarn-client" =>
        val scheduler = try {
          val clazz = Utils.classForName("org.apache.spark.scheduler.cluster.YarnScheduler")
          val cons = clazz.getConstructor(classOf[SparkContext])
          cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]

        } catch {
          case e: Exception => {
            throw new SparkException("YARN mode not available ?", e)
          }
        }

        val backend = try {
          val clazz =
            Utils.classForName("org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend")
          val cons = clazz.getConstructor(classOf[TaskSchedulerImpl], classOf[SparkContext])
          cons.newInstance(scheduler, sc).asInstanceOf[CoarseGrainedSchedulerBackend]
        } catch {
          case e: Exception => {
            throw new SparkException("YARN mode not available ?", e)
          }
        }

        scheduler.initialize(backend)
        (backend, scheduler)

      case MESOS_REGEX(mesosUrl) =>
        MesosNativeLibrary.load()
        val scheduler = new TaskSchedulerImpl(sc)
        val coarseGrained = sc.conf.getBoolean("spark.mesos.coarse", defaultValue = true)
        val backend = if (coarseGrained) {
          new CoarseMesosSchedulerBackend(scheduler, sc, mesosUrl, sc.env.securityManager)
        } else {
          new MesosSchedulerBackend(scheduler, sc, mesosUrl)
        }
        scheduler.initialize(backend)
        (backend, scheduler)

      case SIMR_REGEX(simrUrl) =>
        val scheduler = new TaskSchedulerImpl(sc)
        val backend = new SimrSchedulerBackend(scheduler, sc, simrUrl)
        scheduler.initialize(backend)
        (backend, scheduler)

      case zkUrl if zkUrl.startsWith("zk://") =>
        logWarning("Master URL for a multi-master Mesos cluster managed by ZooKeeper should be " +
          "in the form mesos://zk://host:port. Current Master URL will stop working in Spark 2.0.")
        createTaskScheduler(sc, "mesos://" + zkUrl)

      case _ =>
        throw new SparkException("Could not parse Master URL: '" + master + "'")
    }
  }
}
/**
 * A collection of regexes for extracting information from the master string.
 */
private object SparkMasterRegex {
  // Regular expression used for local[N] and local[*] master formats
  val LOCAL_N_REGEX = """local\[([0-9]+|\*)\]""".r
  // Regular expression for local[N, maxRetries], used in tests with failing tasks
  val LOCAL_N_FAILURES_REGEX = """local\[([0-9]+|\*)\s*,\s*([0-9]+)\]""".r
  // Regular expression for simulating a Spark cluster of [N, cores, memory] locally
  val LOCAL_CLUSTER_REGEX = """local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*]""".r
  // Regular expression for connecting to Spark deploy clusters
  val SPARK_REGEX = """spark://(.*)""".r
  // Regular expression for connection to Mesos cluster by mesos:// or mesos://zk:// url
  val MESOS_REGEX = """mesos://(.*)""".r
  // Regular expression for connection to Simr cluster
  val SIMR_REGEX = """simr://(.*)""".r
}

A quick look at the master = "local" case:
val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
val backend = new LocalBackend(sc.getConf, scheduler, 1)
scheduler.initialize(backend)
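For reference, these are the kinds of master strings the regexes above match (host names and numbers are only illustrative):

conf.setMaster("local")                          // 1 thread, no task retries (MAX_LOCAL_TASK_FAILURES = 1)
conf.setMaster("local[4]")                       // 4 threads; local[*] uses all available cores
conf.setMaster("local[*, 3]")                    // all cores, up to 3 failures per task
conf.setMaster("local-cluster[2, 2, 1024]")      // in-process cluster: 2 workers x 2 cores x 1024 MB
conf.setMaster("spark://host1:7077,host2:7077")  // standalone cluster (SparkDeploySchedulerBackend)
conf.setMaster("yarn-client")                    // YARN, driver running in the client process
conf.setMaster("mesos://zk://zkhost:2181/mesos") // Mesos cluster managed by ZooKeeper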

10. _executorAllocationManager

The executor allocation manager dynamically adjusts the number of executors occupied by the application.

Dynamic allocation is not covered here; for details see:
https://blog.csdn.net/zisheng_wang_data/article/details/51737008

11. metricsSystem

// The metrics system for Driver need to be set spark.app.id to app ID.
// So it should start after we get app ID from the task scheduler and set spark.app.id.
metricsSystem.start()
// Attach the driver metrics servlet handler to the web ui after the metrics system is started.
metricsSystem.getServletHandlers.foreach(handler => ui.foreach(_.attachHandler(handler)))

......................

setupAndStartListenerBus()
postEnvironmentUpdate()
postApplicationStart()

// Post init
_taskScheduler.postStartHook()
_env.metricsSystem.registerSource(_dagScheduler.metricsSource)
_env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager))

MetricsSystem is the subsystem that measures the various metrics of the system; conceptually it is a collection of key-value metrics.
As a simple example: how would you expose information about the current JVM? There are many ways, but going through the MetricsSystem keeps it standardized. It has three parts (see the sketch after this list):

  1. Source: where the data comes from, e.g. org.apache.spark.metrics.source.JvmSource.

  2. Sink: where the data is sent. Sinks are either active or passive; active sinks push their output on a timer (e.g. CsvSink), while passive ones such as MetricsServlet wait to be called by the user.

  3. MetricRegistry: the bridge between Sources and Sinks.
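Spark's MetricsSystem is built on the Dropwizard (Codahale) metrics library, so the three roles map onto plain Dropwizard classes roughly as in the following sketch (illustrative only, not Spark's internal API):

import java.util.concurrent.TimeUnit
import com.codahale.metrics.{ConsoleReporter, Gauge, MetricRegistry}

val registry = new MetricRegistry()                    // 3. the bridge between sources and sinks
registry.register("jvm.heap.used", new Gauge[Long] {   // 1. a "source": where the values come from
  override def getValue: Long =
    Runtime.getRuntime.totalMemory() - Runtime.getRuntime.freeMemory()
})
val reporter = ConsoleReporter.forRegistry(registry)   // 2. an active "sink": where the values go
  .convertDurationsTo(TimeUnit.MILLISECONDS)
  .build()
reporter.start(10, TimeUnit.SECONDS)                   // pushed on a timer, like Spark's CsvSink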

Processing the Computation Logic

RDDs are obtained through the SparkContext instance, and the computation logic is expressed as RDD operations.
The withScope wrapper exists for DAG visualization on the Spark UI: every RDD-creating method is wrapped in it, and RDDOperationScope records the operation history and the relationships between RDDs so the UI can render the job's DAG.

The reduce method on RDD (an action) ends up calling SparkContext.runJob(~), which starts dagScheduler.runJob(~).

The DAGScheduler splits the computation into stages, producing a series of TaskSets that are handed to the TaskScheduler, which in turn hands the individual tasks to the thread pools of the Executors on the worker nodes; the worker threads read and write their data through the BlockManager.
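As a tiny concrete example of such a stage split (a sketch, not from the original post): one shuffle boundary produces one ShuffleMapStage plus the final ResultStage.

val words  = sc.parallelize(Seq("a", "b", "a", "c"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // reduceByKey introduces a ShuffleDependency
counts.collect()                                         // the action: the part after the shuffle is the ResultStage,
                                                         // the part before it becomes a ShuffleMapStage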
DAGScheduler.runJob(~) calls DAGScheduler.submitJob(~).

The DAGScheduler is event driven: submitJob posts a JobSubmitted event (carrying the job information) to the DAGSchedulerEventProcessLoop, which handles it in onReceive(~)/doOnReceive(~):

private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)

  case MapStageSubmitted(jobId, dependency, callSite, listener, properties) =>
    dagScheduler.handleMapStageSubmitted(jobId, dependency, callSite, listener, properties)

  case StageCancelled(stageId) =>
    dagScheduler.handleStageCancellation(stageId)

  case JobCancelled(jobId) =>
    dagScheduler.handleJobCancellation(jobId)

  case JobGroupCancelled(groupId) =>
    dagScheduler.handleJobGroupCancelled(groupId)

  case AllJobsCancelled =>
    dagScheduler.doCancelAllJobs()

  case ExecutorAdded(execId, host) =>
    dagScheduler.handleExecutorAdded(execId, host)

  case ExecutorLost(execId) =>
    dagScheduler.handleExecutorLost(execId, fetchFailed = false)

  case BeginEvent(task, taskInfo) =>
    dagScheduler.handleBeginEvent(task, taskInfo)

  case GettingResultEvent(taskInfo) =>
    dagScheduler.handleGetTaskResult(taskInfo)

  case completion @ CompletionEvent(task, reason, _, _, taskInfo, taskMetrics) =>
    dagScheduler.handleTaskCompletion(completion)

  case TaskSetFailed(taskSet, reason, exception) =>
    dagScheduler.handleTaskSetFailed(taskSet, reason, exception)

  case ResubmitFailedStages =>
    dagScheduler.resubmitFailedStages()
}

dagScheduler.handleJobSubmitted(~): the job is submitted and stage division starts. A ResultStage is first created from the finalRDD (the RDD on which the action was triggered); from there the lineage is traced backwards and stages are split wherever a shuffle operation is found.
reduce (and similar actions) always correspond to a ResultStage.

private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: ResultStage = null
  try {
    // New stage creation may throw an exception if, for example, jobs are run on a
    // HadoopRDD whose underlying HDFS files have been deleted.
    finalStage = newResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch {
    case e: Exception =>
      logWarning("Creating new stage failed due to exception - job: " + jobId, e)
      listener.jobFailed(e)
      return
  }

  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  clearCacheLocs()
  logInfo("Got job %s (%s) with %d output partitions".format(
    job.jobId, callSite.shortForm, partitions.length))
  logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
  logInfo("Parents of final stage: " + finalStage.parents)
  logInfo("Missing parents: " + getMissingParentStages(finalStage))

  val jobSubmissionTime = clock.getTimeMillis()
  jobIdToActiveJob(jobId) = job
  activeJobs += job
  finalStage.setActiveJob(job)
  val stageIds = jobIdToStageIds(jobId).toArray
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  submitStage(finalStage)

  submitWaitingStages()
}

Building the ResultStage object requires a stageId and its parentStages, which are obtained through getParentStagesAndId(~):
getParentStages(~) finds the parent stages of the finalRDD, and the stageId is generated by an atomic increment.

/**
 * Helper function to eliminate some code re-use when creating new stages.
 */
private def getParentStagesAndId(rdd: RDD[_], firstJobId: Int): (List[Stage], Int) = {
  val parentStages = getParentStages(rdd, firstJobId)
  val id = nextStageId.getAndIncrement()
  (parentStages, id)
}

The DAGScheduler splits stages based on the dependencies between RDDs, with ShuffleDependency (wide dependency) as the boundary:
every stage keeps the stages it depends on (parentStages);
the first stage of a job has Nil as its parentStages.

DAGScheduler.getShuffleMapStage(~) turns a shufDep into its corresponding stage.
DAGScheduler.shuffleToMapStage keeps the stages of the current application.
If shuffleToMapStage already holds a stage for the given shufDep, it is returned directly.
If it does not, two steps are performed:
    1. getAncestorShuffleDependencies(shuffleDep.rdd) collects all earlier shufDeps that are not yet registered in shuffleToMapStage, and newOrUsedShuffleStage(~) is called for each of them to create a stage and store it in shuffleToMapStage.
    2. The stage for the current shufDep is created, stored in shuffleToMapStage, and returned.

/**
 * Get or create a shuffle map stage for the given shuffle dependency's map side.
 */
private def getShuffleMapStage(
    shuffleDep: ShuffleDependency[_, _, _],
    firstJobId: Int): ShuffleMapStage = {
  shuffleToMapStage.get(shuffleDep.shuffleId) match {
    case Some(stage) => stage
    case None =>
      // We are going to register ancestor shuffle dependencies
      getAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
        shuffleToMapStage(dep.shuffleId) = newOrUsedShuffleStage(dep, firstJobId)
      }
      // Then register current shuffleDep
      val stage = newOrUsedShuffleStage(shuffleDep, firstJobId)
      shuffleToMapStage(shuffleDep.shuffleId) = stage
      stage
  }
}

getAncestorShuffleDependencies(rdd) walks backwards from rdd and pushes every shufDep not yet registered in shuffleToMapStage onto a Stack, ordered so that from top to bottom the stack goes from parent to child.
As an optimization, getShuffleMapStage(~) then traverses this stack and creates the stages in forward order, storing each in shuffleToMapStage.
Reaching a source RDD means getAncestorShuffleDependencies(~) has exhausted one dependency chain of the job and found one of its earliest shuffle dependencies (firstShufDep);
tracing back from firstShufDep there is no further ShuffleDependency, all the way to the initial RDD (the one created directly), whose deps is Nil.

/** Find ancestor shuffle dependencies that are not registered in shuffleToMapStage yet */
private def getAncestorShuffleDependencies(rdd: RDD[_]): Stack[ShuffleDependency[_, _, _]] = {
  val parents = new Stack[ShuffleDependency[_, _, _]]
  val visited = new HashSet[RDD[_]]
  // We are manually maintaining a stack here to prevent StackOverflowError
  // caused by recursively visiting
  val waitingForVisit = new Stack[RDD[_]]
  def visit(r: RDD[_]) {
    if (!visited(r)) {
      visited += r
      for (dep <- r.dependencies) {
        dep match {
          case shufDep: ShuffleDependency[_, _, _] =>
            if (!shuffleToMapStage.contains(shufDep.shuffleId)) {
              parents.push(shufDep)
            }
          case _ =>
        }
        waitingForVisit.push(dep.rdd)
      }
    }
  }

  waitingForVisit.push(rdd)
  while (waitingForVisit.nonEmpty) {
    visit(waitingForVisit.pop())
  }
  parents
}

newOrUsedShuffleStage(~):
    1. Calls newShuffleMapStage(~) to create the stage.
    2. The mapOutputTracker (application level) checks whether a shuffle map has already been run for this ShuffleDependency:
         a. If it has, the existing map-output information for the valid partitions is copied into the new stage (the original author suspects this covers tasks recovered after a failure).
         b. If it has not, the shuffle is registered, and its per-partition outputs Array[MapStatus](partitionsNum) are all empty.

/**
 * Create a shuffle map Stage for the given RDD.  The stage will also be associated with the
 * provided firstJobId.  If a stage for the shuffleId existed previously so that the shuffleId is
 * present in the MapOutputTracker, then the number and location of available outputs are
 * recovered from the MapOutputTracker
 */
private def newOrUsedShuffleStage(
    shuffleDep: ShuffleDependency[_, _, _],
    firstJobId: Int): ShuffleMapStage = {
  val rdd = shuffleDep.rdd
  val numTasks = rdd.partitions.length
  val stage = newShuffleMapStage(rdd, numTasks, shuffleDep, firstJobId, rdd.creationSite)
  if (mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
    val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
    val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
    (0 until locs.length).foreach { i =>
      if (locs(i) ne null) {
        // locs(i) will be null if missing
        stage.addOutputLoc(i, locs(i))
      }
    }
  } else {
    // Kind of ugly: need to register RDDs with the cache and map output tracker here
    // since we can't do it in the RDD constructor because # of partitions is unknown
    logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
    mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
  }
  stage
}

The newShuffleMapStage(~) method is almost identical to newResultStage (only the concrete Stage class differs).
The call to getParentStagesAndId(~) inside these new*Stage methods forms a backward recursion;
it bottoms out at the job's forward-first shuffle dependency (firstShufDep), before which there is no ShuffleDependency:
there getParentStages(~) returns Nil, nextStageId.getAndIncrement() produces the stageId, and new*Stage(~) creates the stage instance.

/**
 * Create a ShuffleMapStage as part of the (re)-creation of a shuffle map stage in
 * newOrUsedShuffleStage.  The stage will be associated with the provided firstJobId.
 * Production of shuffle map stages should always use newOrUsedShuffleStage, not
 * newShuffleMapStage directly.
 */
private def newShuffleMapStage(
    rdd: RDD[_],
    numTasks: Int,
    shuffleDep: ShuffleDependency[_, _, _],
    firstJobId: Int,
    callSite: CallSite): ShuffleMapStage = {
  val (parentStages: List[Stage], id: Int) = getParentStagesAndId(rdd, firstJobId)
  val stage: ShuffleMapStage = new ShuffleMapStage(id, rdd, numTasks, parentStages,
    firstJobId, callSite, shuffleDep)

  stageIdToStage(id) = stage
  updateJobIdStageIdMaps(firstJobId, stage)
  stage
}

 
To summarize:
1. Starting from the RDD on which the action is called, newResultStage(~) traces backwards and finds the first ShuffleDependency encountered (lastShufDep; the red box in the original post's diagram).
2. getAncestorShuffleDependencies(lastShufDep.rdd) keeps tracing backwards, collecting all remaining ShuffleDependencies and pushing them onto a stack from bottom to top.
3. The stack is traversed from the top down so the stages are instantiated in forward order and stored in shuffleToMapStage; after this, getShuffleMapStage(~) returns its result directly (without entering the registration branch). A stage's constructor needs the parentStages and a stageId.
4. The red box is the dividing line: everything at and below it is reached by the backward recursion, everything above it is instantiated forwards. Once step 3 is done, lastShufStage is instantiated from lastShufDep (the parent stages of lastShufDep.rdd are already in shuffleToMapStage and can be fetched directly) and stored in shuffleToMapStage as well; the recursion then returns and newResultStage(~) produces the ResultStage instance.
(Why this two-direction design? A purely one-direction design would arguably be simpler and clearer.)

Submitting stages: submitStage(finalStage)
finalStage is the ResultStage triggered by the action, i.e. the last stage to run.
getMissingParentStages(~) identifies, by walking backwards, the parent stages of the current stage that have not completed successfully; submitStage is then called recursively on those missing parents, all the way back to the job's first stage (whose parentStages is empty).

/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing.isEmpty) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) {
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}

The getMissingParentStages(~) method:
1. rddHasUncachedPartitions: if all partitions of an RDD are already cached, nothing further up that branch of the lineage needs to be considered.
2. Otherwise the RDD must trace upwards for its data until a shufDep is reached; getShuffleMapStage(~) then returns the parent stage directly (stage division has already been done).
3. parentStage.isAvailable() checks whether the parent stage's tasks all completed successfully; if not, that parent stage is returned as missing (there may be several).

The DAGScheduler.submitMissingTasks(stage, jobId) method:

1. Finds the tasks of the stage that have not run yet, identified as partitions whose map output is still missing.

2. Initializes the TaskAttemptNumber of every task in the stage (default: -1).

3. Computes the preferred TaskLocations; each partition may have several candidate locations. The stage information is also published so the Spark UI can monitor it.

4. Serializes stage.rdd and stage.shuffleDep and broadcasts the serialized bytes as taskBinary.

5. Instantiates one task per partition of the stage.

6. Calls taskScheduler.submitTasks(~) to submit the stage's TaskSet.

TaskScheduler Scheduling

The taskScheduler.submitTasks(~) method:
1. Creates a TaskSetManager instance, which tracks the life cycle of the tasks.
2. Checks whether another TaskSet for the same stage is already running; conflicts are not allowed.
3. Adds the TaskSetManager to the application-level scheduling pool via schedulableBuilder.
4. Keeps printing a warning ("Initial job has not accepted any resources") until resources have been allocated or the job is cancelled.
5. Calls reviveOffers() on the TaskScheduler's SchedulerBackend (the same backend held by the Driver/SparkContext).

override def submitTasks(taskSet: TaskSet) {
  val tasks = taskSet.tasks
  logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
  this.synchronized {
    val manager = createTaskSetManager(taskSet, maxTaskFailures)
    val stage = taskSet.stageId
    val stageTaskSets =
      taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
    stageTaskSets(taskSet.stageAttemptId) = manager
    val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
      ts.taskSet != taskSet && !ts.isZombie
    }
    if (conflictingTaskSet) {
      throw new IllegalStateException(s"more than one active taskSet for stage $stage:" +
        s" ${stageTaskSets.toSeq.map{_._2.taskSet.id}.mkString(",")}")
    }
    schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)

    if (!isLocal && !hasReceivedTask) {
      // keep printing a warning until the job gets resources
      starvationTimer.scheduleAtFixedRate(new TimerTask() {
        override def run() {
          if (!hasLaunchedTask) {
            logWarning("Initial job has not accepted any resources; " +
              "check your cluster UI to ensure that workers are registered " +
              "and have sufficient resources")
          } else {
            this.cancel()
          }
        }
      }, STARVATION_TIMEOUT_MS, STARVATION_TIMEOUT_MS)
    }
    hasReceivedTask = true
  }
  backend.reviveOffers()   // the main logic is handled asynchronously in the backend
}

CoarseGrainedSchedulerBackend is an implementation of SchedulerBackend, used for coarse-grained resource scheduling such as on YARN.
The flow continues in the CoarseGrainedSchedulerBackend.reviveOffers() method.


Reposted from blog.csdn.net/stuliper/article/details/81219286