Analysis of Spark's SparkContext Internals

The code is on GitHub, on branch origin/branch-2.4.

When the Driver process starts, it instantiates a SparkContext object, and the SparkContext then builds the DAGScheduler and TaskScheduler objects. This sentence comes up in basically every set of notes on Spark scheduling; here we analyze it from the source-code point of view.
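
For orientation, here is a minimal sketch of a driver program that triggers this initialization; the app name and master URL are placeholders, not values from the article:

import org.apache.spark.{SparkConf, SparkContext}

object MinimalDriver {
  def main(args: Array[String]): Unit = {
    // Creating the SparkContext is what kicks off the initialization discussed below:
    // it builds the TaskScheduler, the SchedulerBackend, and the DAGScheduler.
    val conf = new SparkConf()
      .setAppName("sparkcontext-demo")        // placeholder app name
      .setMaster("spark://master-host:7077")  // placeholder standalone master URL
    val sc = new SparkContext(conf)

    // Any action triggers a job that flows through DAGScheduler -> TaskScheduler.
    println(sc.parallelize(1 to 100).reduce(_ + _))

    sc.stop()
  }
}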

First, start from the SparkContext source:

-- SparkContext.scala
// Initialize the TaskScheduler
 val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
    _schedulerBackend = sched
    _taskScheduler = ts
    // Initialize the DAGScheduler
    _dagScheduler = new DAGScheduler(this)
    _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

Take a look at how the TaskScheduler is initialized:

-- SparkContext.scala
/*
Create the task scheduler for the given master URL. This returns two objects, a SchedulerBackend and a TaskScheduler; in other words, the SchedulerBackend and the TaskScheduler are each instantiated here. Only the standalone branch of the match is shown below.
*/
  private def createTaskScheduler(
      sc: SparkContext,
      master: String,
      deployMode: String): (SchedulerBackend, TaskScheduler) = {
    import SparkMasterRegex._
    // This is the commonly used standalone mode
    master match {
      case SPARK_REGEX(sparkUrl) =>
        val scheduler = new TaskSchedulerImpl(sc)
        val masterUrls = sparkUrl.split(",").map("spark://" + _)
        val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
        scheduler.initialize(backend)
        (backend, scheduler)
    }
  }

When the SchedulerBackend and TaskScheduler are instantiated, a scheduler pool is also created; depending on the configured scheduling mode, either a FIFO or a FAIR pool builder is used.

-- TaskSchedulerImpl.scala

  def initialize(backend: SchedulerBackend) {
    this.backend = backend
    schedulableBuilder = {
      schedulingMode match {
        case SchedulingMode.FIFO =>
          new FIFOSchedulableBuilder(rootPool)
        case SchedulingMode.FAIR =>
          new FairSchedulableBuilder(rootPool, conf)
        case _ =>
          throw new IllegalArgumentException(s"Unsupported $SCHEDULER_MODE_PROPERTY: " +
          s"$schedulingMode")
      }
    }
    schedulableBuilder.buildPools()
  }
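
The scheduling mode comes from the spark.scheduler.mode setting (FIFO by default). As a minimal sketch of switching to fair scheduling before the SparkContext is created (the allocation-file path is a placeholder):

import org.apache.spark.SparkConf

// Sketch: spark.scheduler.mode selects which SchedulableBuilder initialize() creates above.
// FIFO is the default; FAIR builds its pools from an optional allocation file.
val conf = new SparkConf()
  .set("spark.scheduler.mode", "FAIR")
  // Optional: point the fair scheduler at a pool definition file (placeholder path).
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")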

TaskScheduler is a low-level scheduling interface; at runtime the actual implementation is org.apache.spark.scheduler.TaskSchedulerImpl. Underneath it, a SchedulerBackend takes care of scheduling tasks on the different kinds of clusters (standalone, YARN, Mesos).
Clients first call initialize and start, and then submit task sets through the runTasks method.
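
To make the division of labor concrete, here is a heavily simplified sketch of the two abstractions; these are illustrative traits, not Spark's actual interfaces:

// Simplified illustration: the TaskScheduler decides what to run,
// the SchedulerBackend talks to a specific cluster manager for resources.
trait SketchSchedulerBackend {
  def start(): Unit          // connect to the cluster manager (standalone, YARN, Mesos, ...)
  def reviveOffers(): Unit   // request resource offers so pending tasks can be launched
}

trait SketchTaskScheduler {
  def initialize(backend: SketchSchedulerBackend): Unit // wire in the backend, build the pools
  def start(): Unit                                     // starts the backend as well
  def submitTasks(tasks: Seq[Runnable]): Unit           // task sets arrive here from the DAGScheduler
}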

After the SchedulerBackend and TaskScheduler have been initialized, the TaskScheduler is started (_taskScheduler.start()), which eventually calls StandaloneSchedulerBackend.start():

-- StandaloneSchedulerBackend.scala

 override def start() {
    // In client mode, the scheduler backend should only try to connect to the launcher;
    // in cluster mode, the application code submitted to the master node connects to the launcher itself.
    if (sc.deployMode == "client") {
      launcherBackend.connect()
    }
  // Package the application's name, the requested cores and memory, and other information
    val appDesc = ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
      webUrl, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor, initialExecutorLimit)
    // Build a StandaloneAppClient instance from the packaged description, the conf, and so on
    client = new StandaloneAppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
    // Start the StandaloneAppClient
    client.start()
    launcherBackend.setState(SparkAppHandle.State.SUBMITTED)
    waitForRegistration()
    launcherBackend.setState(SparkAppHandle.State.RUNNING)
  }

StandaloneAppClient is an interface that allows an application to talk to a Spark standalone cluster. It takes the Spark master URLs, an ApplicationDescription describing the application, and a listener for cluster events, and it calls back into that listener when various cluster events occur.

-- StandaloneAppClient.scala

  def start() {
    // Launch an RpcEndpoint; it will call back into the listener as cluster events arrive.
    endpoint.set(rpcEnv.setupEndpoint("AppClient", new ClientEndpoint(rpcEnv)))
  }


    /**
     * Register with all the masters and return an Array[Future].
     */
    private def tryRegisterAllMasters(): Array[JFuture[_]] = {
      for (masterAddress <- masterRpcAddresses) yield {
        registerMasterThreadPool.submit(new Runnable {
          override def run(): Unit = try {
            if (registered.get) {
              return
            }
            logInfo("Connecting to master " + masterAddress.toSparkURL + "...")
            val masterRef = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)
            // Send RegisterApplication to the remote master node; registration completes once the master accepts it.
            masterRef.send(RegisterApplication(appDescription, self))
          } catch {
            case ie: InterruptedException => // Cancelled
            case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
          }
        })
      }
    }
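
The multi-master registration pattern above can be illustrated with a small self-contained sketch; the master URLs are placeholders, and the actual RPC send is replaced by a simulated acknowledgement:

import java.util.concurrent.{Executors, Future => JFuture}
import java.util.concurrent.atomic.AtomicBoolean

// Sketch of the "register with every master, first success wins" pattern
// used by tryRegisterAllMasters above.
object RegistrationSketch {
  private val registered = new AtomicBoolean(false)
  private val pool = Executors.newFixedThreadPool(2)

  def tryRegisterAll(masters: Seq[String]): Seq[JFuture[_]] =
    masters.map { master =>
      pool.submit(new Runnable {
        override def run(): Unit = {
          if (registered.get) return            // another attempt already succeeded
          println(s"Connecting to master $master ...")
          // In Spark this is where RegisterApplication is sent over RPC;
          // here we simply simulate a successful acknowledgement.
          registered.set(true)
        }
      })
    }

  def main(args: Array[String]): Unit = {
    tryRegisterAll(Seq("spark://master-1:7077", "spark://master-2:7077")).foreach(_.get())
    pool.shutdown()
  }
}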

So the flow above is how the driver registers with the master. The call path is roughly Driver -> SparkContext -> TaskScheduler -> StandaloneSchedulerBackend -> StandaloneAppClient, i.e. the TaskScheduler registering the application with the Master.

Next, let's look at the DAGScheduler.

DAGScheduler is the high-level scheduling layer that implements stage-oriented scheduling. It computes a DAG of stages for each job, keeps track of which RDDs and stage outputs are materialized, and finds an optimal schedule to run the job; it then submits the stages as TaskSets to the underlying TaskScheduler, which runs them on the cluster. In addition, it determines the preferred locations to run each task on, based on the current cache status, and passes them to the low-level TaskScheduler. It also handles failures caused by lost shuffle output files, in which case old stages may need to be resubmitted. Failures within a stage that are not caused by lost shuffle files are handled by the TaskScheduler, which retries each task a small number of times before cancelling the whole stage.
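
As a concrete illustration of stage-oriented scheduling, consider the small word-count job below (the data is made up, and sc is assumed to be an existing SparkContext). reduceByKey introduces a shuffle dependency, so the DAGScheduler cuts the DAG there: the map side becomes one stage and the reduce side another, each handed to the TaskScheduler as its own TaskSet.

// Assuming `sc` is an existing SparkContext; the input data is made up.
val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

// Stage 1: parallelize + map are pipelined into one stage (no shuffle needed).
val pairs = words.map(w => (w, 1))

// reduceByKey introduces a shuffle dependency, so the DAG is cut here:
// Stage 2 reads the shuffled map output and aggregates the counts.
val counts = pairs.reduceByKey(_ + _)

// collect() is the action that makes the DAGScheduler build and submit the stages.
counts.collect().foreach(println)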

Looking at the code, the DAGScheduler is driven under the hood by a DAGSchedulerEventProcessLoop:
private[spark] val eventProcessLoop = new DAGSchedulerEventProcessLoop(this)
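
As a rough sketch of that pattern (simplified; not the real DAGSchedulerEvent hierarchy or EventLoop class), events such as a job submission are posted to a queue and handled one at a time on a dedicated thread:

import java.util.concurrent.LinkedBlockingDeque

// Simplified sketch of the event-loop pattern behind DAGSchedulerEventProcessLoop:
// callers post events, a single background thread consumes and dispatches them.
sealed trait SchedulerEvent
case class JobSubmittedEvent(jobId: Int) extends SchedulerEvent
case object StopEvent extends SchedulerEvent

class SimpleEventLoop {
  private val queue = new LinkedBlockingDeque[SchedulerEvent]()
  private val thread = new Thread("simple-event-loop") {
    override def run(): Unit = {
      var running = true
      while (running) {
        queue.take() match {
          case JobSubmittedEvent(id) => println(s"handling job $id") // would dispatch to a handler
          case StopEvent             => running = false
        }
      }
    }
  }
  def start(): Unit = thread.start()
  def post(event: SchedulerEvent): Unit = queue.put(event)
  def stop(): Unit = queue.put(StopEvent)
}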

The Spark UI is created by instantiating SparkUI inside SparkContext:

-- SparkContext.scala

    _ui =
      if (conf.getBoolean("spark.ui.enabled", true)) {
        Some(SparkUI.create(Some(this), _statusStore, _conf, _env.securityManager, appName, "",
          startTime))
      } else {
        // For tests, do not enable the UI
        None
      }
    // Bind the communication port before tasks start executing
    _ui.foreach(_.bind())

-- SparkUI.scala
  /**
   * Create a SparkUI based on the stored application status.
   */
  def create(
      sc: Option[SparkContext],
      store: AppStatusStore,
      conf: SparkConf,
      securityManager: SecurityManager,
      appName: String,
      basePath: String,
      startTime: Long,
      appSparkVersion: String = org.apache.spark.SPARK_VERSION): SparkUI = {

    new SparkUI(store, sc, conf, securityManager, appName, basePath, startTime, appSparkVersion)
  }
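
Whether the UI is created at all, and which port it binds to, is configuration driven. A minimal sketch, with placeholder values:

import org.apache.spark.SparkConf

// Sketch: spark.ui.enabled gates the `if` above, spark.ui.port picks the bind port.
val conf = new SparkConf()
  .set("spark.ui.enabled", "true")   // set to "false" to skip creating the SparkUI
  .set("spark.ui.port", "4040")      // 4040 is the usual default; placeholder here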


Original post: blog.csdn.net/dec_sun/article/details/90694426