SparkCore — Task Allocation Algorithm

        In the previous article, on computing the best locations for tasks, we analyzed the submitMissingTasks() method. It does two important things: it computes the best location for each task, and it submits the TaskSet to the TaskScheduler. The following analyzes how the tasks in the TaskSet submitted to the TaskScheduler are allocated to executors.
  In standalone mode, the default implementation is TaskSchedulerImpl; TaskScheduler itself is just a trait. Find the submitTasks() method in TaskSchedulerImpl. The source code is as follows:
 

override def submitTasks(taskSet: TaskSet) {
    val tasks = taskSet.tasks
    logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
    this.synchronized {
      // Create a TaskSetManager for the TaskSet; it monitors and manages the
      // execution of the tasks in that TaskSet.
      // The TaskSetManager keeps tracking the TaskSet it manages; if a task
      // fails, it also retries the task, and so on
      val manager = createTaskSetManager(taskSet, maxTaskFailures)
      // Extract and record the TaskSet's stage information
      val stage = taskSet.stageId
      val stageTaskSets =
        taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
      stageTaskSets(taskSet.stageAttemptId) = manager
      val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
        ts.taskSet != taskSet && !ts.isZombie
      }
      // (the check that throws when conflictingTaskSet is true is omitted here)

      // Put the TaskSetManager into the scheduling pool that was created during
      // initialization (FIFO by default). Placing the TaskSet into the pool is
      // what sorts the tasks
      schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)
      // ... (some code omitted)
    }
    // Call reviveOffers() of SparkDeploySchedulerBackend, which in turn
    // extends CoarseGrainedSchedulerBackend
    backend.reviveOffers()
  }

 reviveOffers() in CoarseGrainedSchedulerBackend sends a ReviveOffers message to the DriverEndpoint, and the handler for that message calls the makeOffers() method. Abridged from the Spark source of this vintage (details vary slightly across versions), that path looks roughly like this:
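override def reviveOffers() {
  // Sent to the DriverEndpoint running inside the driver
  driverEndpoint.send(ReviveOffers)
}

// Inside DriverEndpoint:
override def receive: PartialFunction[Any, Unit] = {
  case ReviveOffers =>
    makeOffers()
  // ... other messages omitted
}

 The makeOffers() method is analyzed next: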

private def makeOffers() {
      // Filter out executors that have been killed
      val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
      // Wrap every available executor of the Application into a WorkerOffer;
      // each WorkerOffer represents the number of free CPU cores on one executor
      val workOffers = activeExecutors.map { case (id, executorData) =>
        new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
      }.toSeq
      // Call resourceOffers() to run the task allocation algorithm and assign each
      // task to an executor. Once tasks have been assigned, launchTasks() sends a
      // LaunchTask message for each assigned task to the corresponding executor,
      // which starts and runs the task
      launchTasks(scheduler.resourceOffers(workOffers))
    }

       makeOffers() first wraps the free resources of the registered executors into WorkerOffers, then calls the resourceOffers() method of TaskSchedulerImpl to run the task allocation algorithm and assign each task to an executor; finally it calls its own launchTasks() method, which sends a LaunchTask message for each assigned task to the corresponding executor, where the task is started and executed.
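 For completeness, launchTasks() essentially serializes each TaskDescription and sends it to the chosen executor's RPC endpoint. A rough, abridged sketch of its shape (based on the Spark source of this era; the message-size check is omitted, and exact field names may differ between versions):

private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
  for (task <- tasks.flatten) {
    // Serialize the task description for shipping to the executor
    val serializedTask = ser.serialize(task)
    // Look up the executor this task was assigned to
    val executorData = executorDataMap(task.executorId)
    // Account for the cores the task will occupy on that executor
    executorData.freeCores -= scheduler.CPUS_PER_TASK
    // Send the LaunchTask message; the executor deserializes and runs the task
    executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
  }
}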
  Now look at the resourceOffers() method of TaskSchedulerImpl.

def resourceOffers(offers: Seq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {
    // Record each executor's info and check whether any new executor has
    // appeared; new ones are written into the caches as well
    var newExecAvail = false
    for (o <- offers) {
      executorIdToHost(o.executorId) = o.host
      executorIdToTaskCount.getOrElseUpdate(o.executorId, 0)
      if (!executorsByHost.contains(o.host)) {
        executorsByHost(o.host) = new HashSet[String]()
        executorAdded(o.executorId, o.host)
        newExecAvail = true
      }
      for (rack <- getRackForHost(o.host)) {
        hostsByRack.getOrElseUpdate(rack, new HashSet[String]()) += o.host
      }
    }

    // First shuffle the available offers to spread tasks out and balance the load
    val shuffledOffers = Random.shuffle(offers)
    // Build a list of tasks to assign to each worker.
    // tasks holds one ArrayBuffer[TaskDescription] per offer: each inner buffer
    // collects the tasks assigned to that executor, pre-sized to its free core count
    val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores))
    // Number of CPU cores available on each worker
    val availableCpus = shuffledOffers.map(o => o.cores).toArray
    // Take the sorted TaskSets out of rootPool.
    // The TaskSetManagers created earlier were placed into the scheduling pool,
    // which sorts the submitted TaskSets
    val sortedTaskSets = rootPool.getSortedTaskSetQueue
    for (taskSet <- sortedTaskSets) {
      logDebug("parentName: %s, name: %s, runningTasks: %s".format(
        taskSet.parent.name, taskSet.name, taskSet.runningTasks))
      if (newExecAvail) {
        // Recompute the TaskSet's locality levels, because a new executor joined
        taskSet.executorAdded()
      }
    }
    // Here is the core of the task allocation algorithm:
    // a double for loop over all TaskSets and every locality level
    var launchedTask = false
    // For each TaskSet, iterate from the best locality level outward
    for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
      do {
        // For the current TaskSet, first try the most local level and launch its
        // tasks on the executors. If nothing can be launched, leave this loop and
        // move to the next (less local) level, i.e. relax the locality level, and
        // so on, until the TaskSet's tasks have all been launched on executors at
        // some set of locality levels
        launchedTask = resourceOfferSingleTaskSet(
            taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
      } while (launchedTask)
    }

    if (tasks.size > 0) {
      hasLaunchedTask = true
    }
    return tasks
  }

    The core of the method above is the double for loop: that is the heart of the task allocation algorithm. First, the locality levels. There are five:
 PROCESS_LOCAL: process-local. The RDD partition and the task are in the same executor process, which is fastest.
 NODE_LOCAL: node-local. The RDD partition and the task are not in the same executor, but are on the same worker node.
 NO_PREF: no preference. The data is equally fast to access from anywhere, for example when it lives in an external relational database, so any node will do.
 RACK_LOCAL: rack-local. The RDD partition and the task are on the same rack, but on different worker nodes.
 ANY: any node in the cluster. This is the loosest level, used only when none of the others is feasible.
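  In the Spark source these levels are an ordered enumeration, TaskLocality, so "smaller is better" is a plain comparison on the enum values. A minimal self-contained sketch of that ordering:

// Mirrors Spark's TaskLocality enumeration: declaration order defines the
// ordering, so a smaller value means better locality
object TaskLocality extends Enumeration {
  val PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY = Value
}

// Enumeration values are Ordered, so this holds:
assert(TaskLocality.PROCESS_LOCAL < TaskLocality.NODE_LOCAL)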
  These levels run from best to worst, from small to large: the smaller the level, the better the locality. The double for loop above starts, for each TaskSet, from its best locality level: it first tries to launch the TaskSet's tasks on the executors at that level; if nothing can be launched, it leaves the inner loop and moves on to the next level, that is, it relaxes the locality level, and so on, until the TaskSet's tasks have all been launched on executors at some combination of locality levels.
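  To make the escalation concrete, here is a toy, self-contained simulation of that loop structure (not Spark source; Task, localityOn and offerRound are invented for illustration, the spark.locality.wait timing is elided, and the TaskLocality enumeration sketched above is assumed to be in scope):

import scala.collection.mutable

object LocalityEscalationDemo {
  import TaskLocality._

  // A toy task: the locality it would achieve on each executor
  case class Task(id: Int, localityOn: Map[String, TaskLocality.Value])

  val freeCores = mutable.Map("exec-1" -> 1, "exec-2" -> 1)
  val pending = mutable.ArrayBuffer(
    Task(0, Map("exec-1" -> PROCESS_LOCAL, "exec-2" -> ANY)),
    Task(1, Map("exec-1" -> ANY,           "exec-2" -> NODE_LOCAL)),
    Task(2, Map("exec-1" -> ANY,           "exec-2" -> ANY)))

  // Analogue of resourceOfferSingleTaskSet: one round of offers at a level
  // no worse than maxLocality; returns whether anything was launched
  def offerRound(maxLocality: TaskLocality.Value): Boolean = {
    var launched = false
    for (exec <- freeCores.keys.toSeq if freeCores(exec) > 0) {
      pending.find(t => t.localityOn(exec) <= maxLocality).foreach { t =>
        println(s"launch task ${t.id} on $exec at ${t.localityOn(exec)}")
        pending -= t
        freeCores(exec) -= 1
        launched = true
      }
    }
    launched
  }

  def main(args: Array[String]): Unit = {
    // The double loop from resourceOffers: only when a whole round launches
    // nothing do we relax (enlarge) the locality level
    for (maxLocality <- TaskLocality.values) {
      var launchedTask = false
      do {
        launchedTask = offerRound(maxLocality)
      } while (launchedTask)
    }
    println(s"still pending: ${pending.map(_.id)}")
  }
}

  Running this launches task 0 on exec-1 at PROCESS_LOCAL, then, only after a round at PROCESS_LOCAL launches nothing, relaxes to NODE_LOCAL and launches task 1 on exec-2; task 2 stays pending because no cores remain.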
  The method that implements this is resourceOfferSingleTaskSet(): given a locality level, it works out which tasks of the TaskSet can be launched on each executor at that level. Internally it calls the resourceOffer() method of TaskSetManager, which decides whether a task at that locality level may start on the executor. It uses delay scheduling: it tracks how long the TaskSet has waited since it last launched a task at each level, and only once the wait at a level exceeds the configured delay is the next, less local level permitted.
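  The wait-time check itself can be sketched like this (a simplification of the idea behind TaskSetManager.getAllowedLocalityLevel; localityWaits stands in for the per-level spark.locality.wait settings, and the real method also re-checks which tasks are still pending):

// For each level we remember when we last launched a task; once the wait at
// the current level exceeds its configured delay, the next level is allowed
def allowedLocality(curTime: Long, lastLaunchTime: Long, currentIndex: Int,
                    localityWaits: Array[Long],
                    levels: Array[TaskLocality.Value]): TaskLocality.Value = {
  var index = currentIndex
  var launchTime = lastLaunchTime
  while (index < levels.length - 1 && curTime - launchTime >= localityWaits(index)) {
    // We have waited long enough at this level: move to the next one
    launchTime += localityWaits(index)
    index += 1
  }
  levels(index)
}

  resourceOffer() then takes the smaller of this allowed level and the maxLocality passed in from the double loop, so a task is never launched at a level worse than the one the loop currently permits.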
 
