How much memory does Spark allocate to each task?

nohup spark-submit --master yarn --deploy-mode cluster --driver-memory 1g --executor-memory 1g --executor-cores 4 --num-executors 8 --conf "spark.default.parallelism=72" --conf "spark.streaming.kafka.maxRatePerPartition=2000" --class com.sjw.sync.spark.DealDataFlowMain DataSync-1.0-SNAPSHOT-jar-with-dependencies.jar >job.log 2>&1 &

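Take this job as an example.

Each executor has 1 GB of memory and 4 cores.

Spark exposes a configurable parameter, val CPUS_PER_TASK = conf.getInt("spark.task.cpus", 1). With the default value of 1, each task occupies one CPU core, so cores and tasks map one to one.

It follows that an executor runs 4 tasks at a time, and each task gets roughly 1/4 GB of memory. This is an upper-bound estimate, since not all of the executor heap is available as execution memory.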
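The arithmetic above can be sketched as follows. This is only an illustration: the object TaskMemoryEstimate and its method names are hypothetical, and the 300 MB reserved memory and 0.6 memory fraction are assumed to be the unified memory manager defaults of Spark 2.x and later. The heap the JVM actually reports is usually a little smaller than the value passed to --executor-memory, so treat the results as estimates.

```scala
// Rough per-task memory estimate for one executor (an illustrative sketch, not Spark code).
object TaskMemoryEstimate {
  // Assumption: unified memory manager defaults (300 MB reserved, spark.memory.fraction = 0.6).
  val ReservedBytes: Long = 300L * 1024 * 1024
  val DefaultMemoryFraction: Double = 0.6

  // Number of tasks one executor can run at the same time.
  def concurrentTasks(executorCores: Int, taskCpus: Int): Int = {
    require(taskCpus > 0 && executorCores >= taskCpus, "need at least one task slot")
    executorCores / taskCpus
  }

  // The estimate from the text: executor memory divided evenly among concurrent tasks.
  def naivePerTaskBytes(executorMemoryBytes: Long, executorCores: Int, taskCpus: Int): Long =
    executorMemoryBytes / concurrentTasks(executorCores, taskCpus)

  // Size of the unified (execution + storage) region under the assumed defaults.
  def unifiedPoolBytes(heapBytes: Long, memoryFraction: Double = DefaultMemoryFraction): Long =
    ((heapBytes - ReservedBytes) * memoryFraction).toLong

  def main(args: Array[String]): Unit = {
    val oneGb = 1024L * 1024 * 1024
    val slots = concurrentTasks(executorCores = 4, taskCpus = 1)
    println(s"concurrent tasks per executor: $slots")
    println(s"naive per-task bytes: ${naivePerTaskBytes(oneGb, 4, 1)}")
    println(s"unified pool per task (upper bound): ${unifiedPoolBytes(oneGb) / slots}")
  }
}
```

The second figure is noticeably smaller than 1/4 GB, which is why the 1/4 GB number is best read as a ceiling rather than a guarantee.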

The relevant source code is located here:

package org.apache.spark.memory

ExecutionMemoryPool.scala

 private[memory] def acquireMemory(
      numBytes: Long,
      taskAttemptId: Long,
      maybeGrowPool: Long => Unit = (additionalSpaceNeeded: Long) => Unit,
      computeMaxPoolSize: () => Long = () => poolSize): Long = lock.synchronized {
    assert(numBytes > 0, s"invalid number of bytes requested: $numBytes")

    // TODO: clean up this clunky method signature

    // Add this task to the taskMemory map just so we can keep an accurate count of the number
    // of active tasks, to let other tasks ramp down their memory in calls to `acquireMemory`
    if (!memoryForTask.contains(taskAttemptId)) {
      memoryForTask(taskAttemptId) = 0L
      // This will later cause waiting tasks to wake up and check numTasks again
      lock.notifyAll()
    }

    // Keep looping until we're either sure that we don't want to grant this request (because this
    // task would have more than 1 / numActiveTasks of the memory) or we have enough free
    // memory to give it (we always let each task get at least 1 / (2 * numActiveTasks)).
    // TODO: simplify this to limit each task to its own slot
    while (true) {
      val numActiveTasks = memoryForTask.keys.size
      val curMem = memoryForTask(taskAttemptId)

      // In every iteration of this loop, we should first try to reclaim any borrowed execution
      // space from storage. This is necessary because of the potential race condition where new
      // storage blocks may steal the free execution memory that this task was waiting for.
      maybeGrowPool(numBytes - memoryFree)

      // Maximum size the pool would have after potentially growing the pool.
      // This is used to compute the upper bound of how much memory each task can occupy. This
      // must take into account potential free memory as well as the amount this pool currently
      // occupies. Otherwise, we may run into SPARK-12155 where, in unified memory management,
      // we did not take into account space that could have been freed by evicting cached blocks.
      val maxPoolSize = computeMaxPoolSize()
      val maxMemoryPerTask = maxPoolSize / numActiveTasks
      val minMemoryPerTask = poolSize / (2 * numActiveTasks)

      // How much we can grant this task; keep its share within 0 <= X <= 1 / numActiveTasks
      val maxToGrant = math.min(numBytes, math.max(0, maxMemoryPerTask - curMem))
      // Only give it as much memory as is free, which might be none if it reached 1 / numTasks
      val toGrant = math.min(maxToGrant, memoryFree)

      // We want to let each task get at least 1 / (2 * numActiveTasks) before blocking;
      // if we can't give it this much now, wait for other tasks to free up memory
      // (this happens if older tasks allocated lots of memory before N grew)
      if (toGrant < numBytes && curMem + toGrant < minMemoryPerTask) {
        logInfo(s"TID $taskAttemptId waiting for at least 1/2N of $poolName pool to be free")
        lock.wait()
      } else {
        memoryForTask(taskAttemptId) += toGrant
        return toGrant
      }
    }
    0L  // Never reached
  }
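To make the grant-or-wait decision easier to follow, here is a minimal sketch that reproduces only the arithmetic of one loop iteration as a pure function. The names GrantSketch, Decision, Grant, Wait and decide are hypothetical and do not exist in Spark; locking, the memoryForTask map and pool growth are deliberately left out.

```scala
// A simplified, lock-free restatement of one iteration of acquireMemory (illustrative only).
object GrantSketch {
  sealed trait Decision
  final case class Grant(bytes: Long) extends Decision // the request returns with this many bytes
  case object Wait extends Decision                    // the task would block on lock.wait()

  def decide(
      numBytes: Long,      // bytes requested
      curMem: Long,        // bytes the task already holds
      memoryFree: Long,    // free bytes in the pool
      poolSize: Long,      // current pool size
      maxPoolSize: Long,   // pool size after potentially reclaiming from storage
      numActiveTasks: Int): Decision = {
    require(numBytes > 0 && numActiveTasks > 0)
    val maxMemoryPerTask = maxPoolSize / numActiveTasks
    val minMemoryPerTask = poolSize / (2 * numActiveTasks)
    // Cap the grant so the task never exceeds 1 / numActiveTasks of the pool.
    val maxToGrant = math.min(numBytes, math.max(0L, maxMemoryPerTask - curMem))
    // Never grant more than what is currently free.
    val toGrant = math.min(maxToGrant, memoryFree)
    // Block only if the request is not fully met AND the task is still below its 1/(2N) floor.
    if (toGrant < numBytes && curMem + toGrant < minMemoryPerTask) Wait else Grant(toGrant)
  }

  def main(args: Array[String]): Unit = {
    // Pool of 400 bytes shared by 4 tasks: ceiling 100 per task, floor 50 per task.
    println(decide(numBytes = 150, curMem = 0, memoryFree = 400, poolSize = 400, maxPoolSize = 400, numActiveTasks = 4))
    println(decide(numBytes = 60, curMem = 0, memoryFree = 10, poolSize = 400, maxPoolSize = 400, numActiveTasks = 4))
    println(decide(numBytes = 10, curMem = 100, memoryFree = 100, poolSize = 400, maxPoolSize = 400, numActiveTasks = 4))
  }
}
```

In the first call the request is trimmed to the per-task ceiling, in the second the task waits because it cannot yet reach its floor, and in the third a task already at its ceiling gets a grant of 0 bytes rather than blocking.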

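That covers the execution memory part. Once the executor memory and the number of cores per task are fixed, the memory each task can obtain depends strongly on the number of tasks running concurrently, that is, executor cores / spark.task.cpus: with N active tasks, a task can hold between 1/(2N) and 1/N of the execution pool.

For storage memory, the amount used on behalf of a task depends on the size of the blocks.

Memory is managed in a unified way, covering both JVM heap memory and off-heap memory (system memory) allocated through sun.misc.Unsafe.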
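As a small worked example of how the per-task share shrinks as more tasks become active, the sketch below computes the floor and ceiling for a given pool size. PerTaskBounds and bounds are hypothetical names, and the sketch assumes a single pool size for both bounds, whereas the real code uses maxPoolSize for the ceiling and poolSize for the floor.

```scala
// Per-task floor and ceiling of execution memory for N active tasks (illustrative sketch).
object PerTaskBounds {
  // Returns (floor, ceiling) = (poolSize / 2N, poolSize / N).
  def bounds(poolSize: Long, numActiveTasks: Int): (Long, Long) = {
    require(numActiveTasks > 0, "need at least one active task")
    (poolSize / (2L * numActiveTasks), poolSize / numActiveTasks)
  }

  def main(args: Array[String]): Unit = {
    val poolSize = 400L // hypothetical pool size in bytes, chosen for easy arithmetic
    for (n <- 1 to 4) {
      val (floor, ceiling) = bounds(poolSize, n)
      println(s"N=$n floor=$floor ceiling=$ceiling")
    }
  }
}
```

With 4 executor cores and spark.task.cpus left at 1, N reaches at most 4, so a task ends up with between 1/8 and 1/4 of the pool.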

哥伦布112
