[Spark] Data locality and delay scheduling

Foreword

Spark data locality follows the principle of moving the computation to the data rather than moving the data, but reality is harsh: the nodes holding a task's data blocks do not always have enough free resources for the computation. So that tasks can still start at the best possible locality level (Locality Level), Spark introduced delay scheduling: if resources at a level are not available, keep retrying at that level within the wait time configured for it; if the task still cannot start once that wait time is exceeded, downgrade to the next locality level and try again, and so on.

Locality Levels

  • PROCESS_LOCAL: process-local. The code and the data are in the same process, i.e. the same executor: the task that computes the data runs in the executor whose BlockManager holds the data. Best performance.
  • NODE_LOCAL: node-local. The code and the data are on the same node. For example, the data is an HDFS block on the node and the task runs in an executor on that node, or the data and the task are in different executors on the same node; the data has to be transferred between processes.
  • NO_PREF: for these tasks the data is equally accessible from anywhere, so no location is better than another; for example, Spark SQL reading data from MySQL.
  • RACK_LOCAL: rack-local. The data and the task are on two different nodes in the same rack; the data has to be transferred between nodes over the network.
  • ANY: the data and the task may be anywhere in the cluster, not even in the same rack. Worst performance.

These locality levels essentially describe the positional relationship between the computation and the data. How is that relationship finally determined? The rest of this article walks through the whole process in detail.
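In the Spark source these levels are defined as an ordered Scala Enumeration, org.apache.spark.scheduler.TaskLocality, where a smaller value means better locality. The short spark-shell sketch below (it only assumes that spark-core is on the classpath) makes the ordering used throughout this article concrete:

import org.apache.spark.scheduler.TaskLocality
import org.apache.spark.scheduler.TaskLocality._

// The values iterate from best to worst locality:
// PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY
TaskLocality.values.foreach(println)

// Enumeration values are ordered, so "at least as good as" checks are plain comparisons:
PROCESS_LOCAL < ANY          // true
NODE_LOCAL <= RACK_LOCAL     // true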

DAGScheduler submits tasks

After the DAGScheduler has split the job into stages, it submits each Stage to the TaskScheduler as a TaskSet via the submitMissingTasks method. Let's look at the parts of this method that deal with preferred locations:

...
// Get the ids of the partitions that have not yet been executed, or not executed successfully
val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
...
// Get the preferred locations of this rdd partition via the getPreferredLocs method
val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
      stage match {
        case s: ShuffleMapStage =>
          partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
        case s: ResultStage =>
          val job = s.activeJob.get
          partitionsToCompute.map { id =>
            val p = s.partitions(id)
            (id, getPreferredLocs(stage.rdd, p))
          }.toMap
      }
    } catch {
      // ... error handling elided ...
    }
...
// Build the Tasks from the preferred locations and other information
val tasks: Seq[Task[_]] = try {
      stage match {
        case stage: ShuffleMapStage =>
          partitionsToCompute.map { id =>
            val locs = taskIdToLocations(id)
            val part = stage.rdd.partitions(id)
            new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
              taskBinary, part, locs, stage.latestInfo.taskMetrics, properties)
          }

        case stage: ResultStage =>
          val job = stage.activeJob.get
          partitionsToCompute.map { id =>
            val p: Int = stage.partitions(id)
            val part = stage.rdd.partitions(p)
            val locs = taskIdToLocations(id)
            new ResultTask(stage.id, stage.latestInfo.attemptId,
              taskBinary, part, locs, id, properties, stage.latestInfo.taskMetrics)
          }
      }
    } catch {
      // ... error handling elided ...
    }
...
// Submit all the tasks to the TaskScheduler as a TaskSet
taskScheduler.submitTasks(new TaskSet(
        tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))

Note that each Task in the TaskSet submitted here already carries its preferred locations, obtained through the getPreferredLocs method. Its implementation is easy to follow:

private def getPreferredLocsInternal(
      rdd: RDD[_],
      partition: Int,
      visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation] = {
    ...
    // Get them from the cache
    val cached = getCacheLocs(rdd)(partition)
    if (cached.nonEmpty) {
      return cached
    }
    // Get them directly via the rdd's preferredLocations method
    val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
    if (rddPrefs.nonEmpty) {
      return rddPrefs.map(TaskLocation(_))
    }
    // Recursively get them from the parent RDDs (narrow dependencies only)
    rdd.dependencies.foreach {
      case n: NarrowDependency[_] =>
        for (inPart <- n.getParents(partition)) {
          val locs = getPreferredLocsInternal(n.rdd, inPart, visited)
          if (locs != Nil) {
            return locs
          }
        }
      case _ =>
    }
    Nil
  }

Whichever branch produces the preferred locations of an RDD partition, for the first computation they ultimately come from the RDD's preferredLocations method. Different RDDs implement preferredLocations differently, but the data lives in one of three places: cached in memory, on HDFS, or on disk, and TaskLocation has a concrete implementation for each of the three:

// Data is in memory (cached in an executor)
private [spark] case class ExecutorCacheTaskLocation(override val host: String, executorId: String)
  extends TaskLocation {
  override def toString: String = s"${TaskLocation.executorLocationTag}${host}_$executorId"
}
// Data is on disk (not on HDFS)
private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation {
  override def toString: String = host
}
// Data is on HDFS
private [spark] case class HDFSCacheTaskLocation(override val host: String) extends TaskLocation {
  override def toString: String = TaskLocation.inMemoryLocationTag + host
}

Therefore, the preferred locations passed in when a Task is instantiated are instances of one of these three classes.
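To get a feel for what these preferred locations look like in practice, here is a hedged spark-shell sketch (the HDFS path is a placeholder, not taken from the article) that prints what preferredLocations reports for each partition of a file-backed RDD:

// spark-shell sketch: inspect the preferred locations of each partition.
// "hdfs:///tmp/some-file" is a placeholder path.
val rdd = sc.textFile("hdfs:///tmp/some-file")
rdd.partitions.foreach { p =>
  val locs = rdd.preferredLocations(p)
  println(s"partition ${p.index} -> ${locs.mkString(", ")}")
}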

Generating the locality levels

After the DAGScheduler submits the TaskSet to the TaskScheduler, the TaskScheduler creates a TaskSetManager for each TaskSet to manage its tasks. When the TaskSetManager is initialized, the locality levels contained in the TaskSet are computed by computeValidLocalityLevels:

private def computeValidLocalityLevels(): Array[TaskLocality.TaskLocality] = {
    import TaskLocality.{PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY}
    val levels = new ArrayBuffer[TaskLocality.TaskLocality]
    if (!pendingTasksForExecutor.isEmpty && getLocalityWait(PROCESS_LOCAL) != 0 &&
        pendingTasksForExecutor.keySet.exists(sched.isExecutorAlive(_))) {
      levels += PROCESS_LOCAL
    }
    if (!pendingTasksForHost.isEmpty && getLocalityWait(NODE_LOCAL) != 0 &&
        pendingTasksForHost.keySet.exists(sched.hasExecutorsAliveOnHost(_))) {
      levels += NODE_LOCAL
    }
    if (!pendingTasksWithNoPrefs.isEmpty) {
      levels += NO_PREF
    }
    if (!pendingTasksForRack.isEmpty && getLocalityWait(RACK_LOCAL) != 0 &&
        pendingTasksForRack.keySet.exists(sched.hasHostAliveOnRack(_))) {
      levels += RACK_LOCAL
    }
    levels += ANY
    logDebug("Valid locality levels for " + taskSet + ": " + levels.mkString(", "))
    levels.toArray
  }

The method checks in turn whether the TaskSetManager contains tasks at each level. The logic is similar for all of them, so let's take a closer look at the first one, and at how pendingTasksForExecutor is defined and populated:

// key is an executorId, value is the array of task ids whose data blocks are cached on that executor
private val pendingTasksForExecutor = new HashMap[String, ArrayBuffer[Int]]
...
// Iterate over all tasks of this TaskSet and add each of them
for (i <- (0 until numTasks).reverse) {
    addPendingTask(i)
  }
...
private def addPendingTask(index: Int) {
    for (loc <- tasks(index).preferredLocations) {
      loc match {
        case e: ExecutorCacheTaskLocation =>
          pendingTasksForExecutor.getOrElseUpdate(e.executorId, new ArrayBuffer) += index
        case e: HDFSCacheTaskLocation =>
          val exe = sched.getExecutorsAliveOnHost(loc.host)
          exe match {
            case Some(set) =>
              for (e <- set) {
                pendingTasksForExecutor.getOrElseUpdate(e, new ArrayBuffer) += index
              }
              logInfo(s"Pending task $index has a cached location at ${e.host} " +
                ", where there are executors " + set.mkString(","))
            case None => logDebug(s"Pending task $index has a cached location at ${e.host} " +
                ", but there are no executors alive there.")
          }
        case _ =>
      }
      pendingTasksForHost.getOrElseUpdate(loc.host, new ArrayBuffer) += index
      for (rack <- sched.getRackForHost(loc.host)) {
        pendingTasksForRack.getOrElseUpdate(rack, new ArrayBuffer) += index
      }
    }

    if (tasks(index).preferredLocations == Nil) {
      pendingTasksWithNoPrefs += index
    }

    allPendingTasks += index  // No point scanning this whole list to find the old task there
  }

Note that addPendingTask traverses the preferred locations (resolved above) of every Task managed by the TaskSetManager. If a location is an ExecutorCacheTaskLocation (the block is cached in an executor's memory), the executorId and the task index are added to pendingTasksForExecutor, and the task index is also added to the lower-level pendingTasksForHost and pendingTasksForRack. In other words, if the best locality of a task is X, the task is also registered at every locality level worse than X, as the small sketch below illustrates.
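A minimal, self-contained sketch (not Spark code; the executor, host and rack names are made up) of how a single cached-in-executor location registers a task at three levels at once:

import scala.collection.mutable.{ArrayBuffer, HashMap}

// Toy versions of the TaskSetManager pending lists.
val pendingTasksForExecutor = new HashMap[String, ArrayBuffer[Int]]
val pendingTasksForHost     = new HashMap[String, ArrayBuffer[Int]]
val pendingTasksForRack     = new HashMap[String, ArrayBuffer[Int]]

// Assume task 0's block is cached in executor "exec-3" on "host1" in "rack-A".
val (taskIndex, executorId, host, rack) = (0, "exec-3", "host1", "rack-A")

pendingTasksForExecutor.getOrElseUpdate(executorId, new ArrayBuffer) += taskIndex // PROCESS_LOCAL candidate
pendingTasksForHost.getOrElseUpdate(host, new ArrayBuffer) += taskIndex           // NODE_LOCAL candidate
pendingTasksForRack.getOrElseUpdate(rack, new ArrayBuffer) += taskIndex           // RACK_LOCAL candidate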
Back to the locality-level check above:

if (!pendingTasksForExecutor.isEmpty && getLocalityWait(PROCESS_LOCAL) != 0 &&
        pendingTasksForExecutor.keySet.exists(sched.isExecutorAlive(_))) {
      levels += PROCESS_LOCAL
    }

Only the third condition needs a closer look: pendingTasksForExecutor.keySet.exists(sched.isExecutorAlive(_)). pendingTasksForExecutor.keySet holds executorIds, and sched.isExecutorAlive(_) checks whether the given executorId is currently alive. So the whole line asks: among the executors that have some task's data block cached, is at least one still alive? If so, the PROCESS_LOCAL level is added to the TaskSet's locality levels.

The logic for the other locality levels is the same and is not repeated here; the only differences are in what is checked, e.g. whether a host holding a task's data block still has alive executors, whether a rack still has alive hosts, and so on.

At this point, the locality levels contained in the TaskSet (myLocalityLevels) have been computed.

Delay scheduling strategy

If Spark runs on YARN, there are really two layers of locality-aware scheduling: the first layer tries to allocate Spark executors on the NodeManagers that hold the data. If that layer already fails to achieve data locality, the Spark layer has no chance of achieving it either.

The purpose of delay scheduling is to reduce network and disk I/O overhead; the benefit is most obvious when the data volume is large and the computation is simple, i.e. when a task's execution time is shorter than the time it would take to transfer its data.

Spark always tries to launch every task at its highest locality level. When a task should be launched at level X but all the nodes corresponding to that level have no free resources, the task is not immediately launched at a lower level; instead, within a certain wait time, Spark keeps trying to launch it at level X.

When offered available executor resources, the TaskSetManager walks through them and tries to launch its pending tasks on each executor at the locality levels contained in the TaskSet. So how does it decide whether a task may be launched on a given executor? First, the highest locality level currently allowed for the not-yet-executed tasks of the TaskSetManager is computed by getAllowedLocalityLevel(curTime):

private def getAllowedLocalityLevel(curTime: Long): TaskLocality.TaskLocality = {
    // Remove the scheduled or finished tasks lazily
    def tasksNeedToBeScheduledFrom(pendingTaskIds: ArrayBuffer[Int]): Boolean = {
      var indexOffset = pendingTaskIds.size
      while (indexOffset > 0) {
        indexOffset -= 1
        val index = pendingTaskIds(indexOffset)
        if (copiesRunning(index) == 0 && !successful(index)) {
          return true
        } else {
          pendingTaskIds.remove(indexOffset)
        }
      }
      false
    }
    // Walk through the list of tasks that can be scheduled at each location and returns true
    // if there are any tasks that still need to be scheduled. Lazily cleans up tasks that have
    // already been scheduled.
    def moreTasksToRunIn(pendingTasks: HashMap[String, ArrayBuffer[Int]]): Boolean = {
      val emptyKeys = new ArrayBuffer[String]
      val hasTasks = pendingTasks.exists {
        case (id: String, tasks: ArrayBuffer[Int]) =>
          if (tasksNeedToBeScheduledFrom(tasks)) {
            true
          } else {
            emptyKeys += id
            false
          }
      }
      // The key could be executorId, host or rackId
      emptyKeys.foreach(id => pendingTasks.remove(id))
      hasTasks
    }

    while (currentLocalityIndex < myLocalityLevels.length - 1) {
      val moreTasks = myLocalityLevels(currentLocalityIndex) match {
        case TaskLocality.PROCESS_LOCAL => moreTasksToRunIn(pendingTasksForExecutor)
        case TaskLocality.NODE_LOCAL => moreTasksToRunIn(pendingTasksForHost)
        case TaskLocality.NO_PREF => pendingTasksWithNoPrefs.nonEmpty
        case TaskLocality.RACK_LOCAL => moreTasksToRunIn(pendingTasksForRack)
      }
      if (!moreTasks) {
        // This is a performance optimization: if there are no more tasks that can
        // be scheduled at a particular locality level, there is no point in waiting
        // for the locality wait timeout (SPARK-4939).
        lastLaunchTime = curTime
        logDebug(s"No tasks for locality level ${myLocalityLevels(currentLocalityIndex)}, " +
          s"so moving to locality level ${myLocalityLevels(currentLocalityIndex + 1)}")
        currentLocalityIndex += 1
      } else if (curTime - lastLaunchTime >= localityWaits(currentLocalityIndex)) {
        // Jump to the next locality level, and reset lastLaunchTime so that the next locality
        // wait timer doesn't immediately expire
        lastLaunchTime += localityWaits(currentLocalityIndex)
        logDebug(s"Moving to ${myLocalityLevels(currentLocalityIndex + 1)} after waiting for " +
          s"${localityWaits(currentLocalityIndex)}ms")
        currentLocalityIndex += 1
      } else {
        return myLocalityLevels(currentLocalityIndex)
      }
    }
    myLocalityLevels(currentLocalityIndex)
  }

The currentLocalityIndex in the loop condition is the index into myLocalityLevels returned by the previous call to getAllowedLocalityLevel (its initial value is 0), and myLocalityLevels is the set of locality levels contained in all tasks of the TaskSetManager. The loop body then does the following:

  • Check whether the level myLocalityLevels(currentLocalityIndex) still has unexecuted tasks, via the moreTasksToRunIn method (the logic is simple: tasks that have finished or are currently running are lazily removed from the corresponding list, and it returns true as soon as an unexecuted task is found).
  • If there are none, increment currentLocalityIndex and continue the loop (downgrade).
  • If there are, first check whether the time since the last launch at this level exceeds the wait time this level can tolerate. If it does not, return the corresponding locality level directly; if it does, increment currentLocalityIndex and continue the loop (downgrade).

So far, the highest locality level currently allowed for the unexecuted tasks of the TaskSetManager has been obtained; together with the maxLocality offered by the scheduler, the stricter (better) of the two is taken as the final allowedLocality, as sketched below.
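A minimal sketch (not the actual TaskSetManager code) of how that final allowedLocality can be derived; the toy Locality enumeration only mirrors the ordering of Spark's locality levels, where a smaller value means better locality:

// Toy enumeration mirroring the ordering of Spark's TaskLocality levels.
object Locality extends Enumeration {
  val PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY = Value
}

// The level allowed by the delay-scheduling timer (getAllowedLocalityLevel)
// and the maxLocality offered by the scheduler are combined by keeping the
// stricter (better) of the two.
def resolveAllowedLocality(timerAllowed: Locality.Value,
                           maxLocality: Locality.Value): Locality.Value =
  if (timerAllowed < maxLocality) timerAllowed else maxLocality

// Example: the timer would already allow RACK_LOCAL, but this particular
// resource offer only permits levels up to NODE_LOCAL.
resolveAllowedLocality(Locality.RACK_LOCAL, Locality.NODE_LOCAL)  // NODE_LOCAL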

The final decision about whether a task can be launched on a given executor is made by dequeueTask(execId, host, allowedLocality):

private def dequeueTask(execId: String, host: String, maxLocality: TaskLocality.Value)
    : Option[(Int, TaskLocality.Value, Boolean)] =
  {
    for (index <- dequeueTaskFromList(execId, getPendingTasksForExecutor(execId))) {
      return Some((index, TaskLocality.PROCESS_LOCAL, false))
    }

    if (TaskLocality.isAllowed(maxLocality, TaskLocality.NODE_LOCAL)) {
      for (index <- dequeueTaskFromList(execId, getPendingTasksForHost(host))) {
        return Some((index, TaskLocality.NODE_LOCAL, false))
      }
    }

    if (TaskLocality.isAllowed(maxLocality, TaskLocality.NO_PREF)) {
      // Look for noPref tasks after NODE_LOCAL for minimize cross-rack traffic
      for (index <- dequeueTaskFromList(execId, pendingTasksWithNoPrefs)) {
        return Some((index, TaskLocality.PROCESS_LOCAL, false))
      }
    }
    ...
  }

The TaskLocality.isAllowed method ensures that tasks are only launched at locality levels that are at least as good as (equal to or better than) allowedLocality. Because a task registered at its best locality X is also registered at every level worse than X, this guarantees that each task is launched at the best locality level currently possible.
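A hedged sketch of what that check boils down to, reusing the same kind of toy Locality enumeration as above (the real method lives in org.apache.spark.scheduler.TaskLocality):

// Toy enumeration again, so the snippet runs on its own.
object Locality extends Enumeration {
  val PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY = Value
}

// "condition" (the level a task would be launched at) is allowed only if it
// is at least as good as the "constraint" (the allowedLocality passed in).
def isAllowedSketch(constraint: Locality.Value, condition: Locality.Value): Boolean =
  condition <= constraint

isAllowedSketch(Locality.RACK_LOCAL, Locality.NODE_LOCAL)  // true
isAllowedSketch(Locality.NODE_LOCAL, Locality.RACK_LOCAL)  // false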

Optimization suggestions

You can check the locality levels of a job's tasks in the Spark UI and adjust the data-locality wait times to match your workload; a configuration sketch follows the list:

  • spark.locality.wait: global default applied to every locality level, defaults to 3s
  • spark.locality.wait.process: wait time for the PROCESS_LOCAL level
  • spark.locality.wait.node: wait time for the NODE_LOCAL level
  • spark.locality.wait.rack: wait time for the RACK_LOCAL level
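For example, a SparkConf sketch (the values are illustrative, not recommendations; tune them against what the Spark UI shows):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("locality-tuning-demo")        // illustrative app name
  .set("spark.locality.wait", "6s")          // global default for every level
  .set("spark.locality.wait.process", "6s")  // PROCESS_LOCAL wait
  .set("spark.locality.wait.node", "3s")     // NODE_LOCAL wait
  .set("spark.locality.wait.rack", "0s")     // 0 disables waiting at this level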
