Spark Storage Management: Disk Storage -- DiskStore

DiskStore

In this post we analyze DiskStore, the class that implements disk storage. The class itself is fairly simple, but before diving in it is worth sketching the background of BlockManager, i.e. the environment DiskStore operates in. At runtime every node runs a BlockManager instance, in both the driver and the executor processes: broadcast variables created on the driver side, shuffle blocks written by executors, task results that need to be transferred, and RDD caching all go through BlockManager. In other words, on every node the reads and writes of local memory and disk are managed by BlockManager. To summarize: every process (driver and executor) has a BlockManager instance, and each instance is uniquely identified by a BlockManagerId, which essentially encapsulates the physical location of that process.
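
For reference, BlockManagerId roughly looks like the sketch below (simplified; the real class is not a plain case class and also carries optional topology information):

  // Simplified sketch: just enough information to uniquely locate the
  // BlockManager of one process in the cluster.
  case class BlockManagerId(
      executorId: String,  // "driver" for the driver process
      host: String,
      port: Int)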

DiskStore.put

First, let's look at the most commonly used write method.

def put(blockId: BlockId)(writeFunc: WritableByteChannel => Unit): Unit = {
    // Check via the DiskBlockManager whether a file already exists for this blockId's file name
    if (contains(blockId)) {
      throw new IllegalStateException(s"Block $blockId is already present in the disk store")
    }
    logDebug(s"Attempting to put block $blockId")
    val startTime = System.currentTimeMillis
    // Get a file from the DiskBlockManager to write the data into
    val file = diskManager.getFile(blockId)
    // Wrap it in a CountingWritableChannel so the number of bytes written is recorded
    val out = new CountingWritableChannel(openForWrite(file))
    var threwException: Boolean = true
    try {
      writeFunc(out)
      // Key step: record the block's size in the internal map
      blockSizes.put(blockId, out.getCount)
      threwException = false
    } finally {
      try {
        out.close()
      } catch {
        case ioe: IOException =>
          if (!threwException) {
            threwException = true
            throw ioe
          }
      } finally {
         if (threwException) {
          remove(blockId)
        }
      }
    }
    val finishTime = System.currentTimeMillis
    logDebug("Block %s stored as %s file on disk in %d ms".format(
      file.getName,
      Utils.bytesToString(file.length()),
      finishTime - startTime))
  }
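
As a quick illustration, a call site might look like the following (hypothetical: diskStore is an already-constructed DiskStore, and TestBlockId plus the payload are made up for the example). The caller only supplies the write function; put takes care of file allocation, byte counting, and cleanup on failure.

  import java.nio.ByteBuffer
  import java.nio.charset.StandardCharsets
  import org.apache.spark.storage.TestBlockId

  val blockId = TestBlockId("demo")
  diskStore.put(blockId) { channel =>
    val payload = ByteBuffer.wrap("some block data".getBytes(StandardCharsets.UTF_8))
    // WritableByteChannel.write may write fewer bytes than requested, so loop.
    while (payload.hasRemaining) {
      channel.write(payload)
    }
  }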

This method is quite simple; the more interesting piece is the DiskBlockManager class it calls. DiskBlockManager manages directories and files on disk: it creates directories and subdirectories according to certain rules, and when allocating a file it tries to spread file names evenly across those directories and subdirectories.

DiskStore.putBytes

This method needs little explanation: it simply delegates to the put method.

  def putBytes(blockId: BlockId, bytes: ChunkedByteBuffer): Unit = {
    put(blockId) { channel =>
      bytes.writeFully(channel)
    }
  }
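
For example (hypothetical, reusing the diskStore instance from above), a caller that already holds a ChunkedByteBuffer can hand it over directly:

  import java.nio.ByteBuffer
  import org.apache.spark.storage.TestBlockId
  import org.apache.spark.util.io.ChunkedByteBuffer

  // Wrap a single chunk and store it under a test block id.
  val bytes = new ChunkedByteBuffer(ByteBuffer.wrap(Array[Byte](1, 2, 3, 4)))
  diskStore.putBytes(TestBlockId("demo-bytes"), bytes)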

DiskStore.getBytes

Now let's look at the read path. This method first obtains the file for the block from the DiskBlockManager, then wraps it in a BlockData object; there are two variants, encrypted and unencrypted.

  def getBytes(blockId: BlockId): BlockData = {
    val file = diskManager.getFile(blockId.name)
    val blockSize = getSize(blockId)
  
    securityManager.getIOEncryptionKey() match {
      case Some(key) =>
        // Encrypted blocks cannot be memory mapped; return a special object that does decryption
        // and provides InputStream / FileRegion implementations for reading the data.
        new EncryptedBlockData(file, blockSize, conf, key)
  
      case _ =>
        // See the DiskBlockData section below
        new DiskBlockData(minMemoryMapBytes, maxMemoryMapBytes, file, blockSize)
    }
  }
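
Reading a block back might look like this (hypothetical; besides toByteBuffer, BlockData also offers toInputStream for streaming reads):

  // Read the block back and release any resources when done.
  val blockData = diskStore.getBytes(TestBlockId("demo"))
  try {
    val buffer = blockData.toByteBuffer()
    println(s"read ${blockData.size} bytes")
  } finally {
    blockData.dispose()
  }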

DiskBlockData

This class is a wrapper around a disk file. Its main job is to provide a few convenient interfaces that read data out of the file and produce buffer objects.
It has two important methods, toChunkedByteBuffer and toByteBuffer. toByteBuffer is straightforward: it reads the file data with ReadableByteChannel.read(ByteBuffer dst). toChunkedByteBuffer is covered in the next section.
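
For completeness, here is a sketch of what toByteBuffer does (reconstructed, not a verbatim copy of the Spark source): small blocks are read into a regular buffer, while blocks at or above minMemoryMapBytes are memory-mapped instead of copied.

  // Sketch of DiskBlockData.toByteBuffer (simplified).
  def toByteBuffer(): ByteBuffer = {
    Utils.tryWithResource(open()) { channel =>
      if (blockSize < minMemoryMapBytes) {
        // Small file: read it into an on-heap buffer.
        val buf = ByteBuffer.allocate(blockSize.toInt)
        JavaUtils.readFully(channel, buf)
        buf.flip()
        buf
      } else {
        // Large file: memory-map it instead of copying.
        channel.map(FileChannel.MapMode.READ_ONLY, 0, blockSize)
      }
    }
  }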

DiskBlockData.toChunkedByteBuffer

This method is also simple. When the amount of data is large, since each requested memory chunk is capped at maxMemoryMapBytes, the data has to be cut into multiple chunks.

  override def toChunkedByteBuffer(allocator: (Int) => ByteBuffer): ChunkedByteBuffer = {
    // Utils.tryWithResource makes sure the resource is closed after use,
    // roughly equivalent to try{}finally{} in Java
    Utils.tryWithResource(open()) { channel =>
      var remaining = blockSize
      val chunks = new ListBuffer[ByteBuffer]()
      while (remaining > 0) {
        // Take the smaller of the remaining size and maxMemoryMapBytes,
        // i.e. each requested chunk is at most maxMemoryMapBytes
        val chunkSize = math.min(remaining, maxMemoryMapBytes)
        val chunk = allocator(chunkSize.toInt)
        remaining -= chunkSize
        JavaUtils.readFully(channel, chunk)
        chunk.flip()
        chunks += chunk
      }
      new ChunkedByteBuffer(chunks.toArray)
    }
  }
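
The caller supplies the allocator, which decides where the chunks live (hypothetical usage):

  // On-heap chunks; pass ByteBuffer.allocateDirect for off-heap instead.
  // No single chunk will exceed maxMemoryMapBytes.
  val chunked = diskBlockData.toChunkedByteBuffer(ByteBuffer.allocate)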

DiskBlockManager

This class was analyzed in an earlier post. It is mainly used to manage the directories and the temporary files that a Spark process writes while running.

  • First, the local directories are created according to configuration (the value may contain multiple comma-separated directories). The priority order is: if running on YARN, the YARN LOCAL_DIRS parameter is used; otherwise the value of the SPARK_LOCAL_DIRS environment variable; otherwise the spark.local.dir parameter; and if none of these is configured, the Java system property java.io.tmpdir is used as the temporary directory.

  • Second, files are partitioned between directories by taking the hash of the file name modulo the number of directories, so that files end up distributed evenly across them (see the sketch after this list).

  • Finally, file naming: names are derived from the block's role. For example, the id of an RDD cache block is an RDDBlockId, and its file name is built as "rdd_" + rddId + "_" + splitIndex.
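
Below is a sketch of the placement and naming rules just described (simplified from the real code; localDirs and subDirsPerLocalDir stand for the configured root directories and the configured number of subdirectories per root):

  import java.io.File
  import org.apache.spark.util.Utils

  // Simplified sketch: hash the file name, pick a root dir, then a sub-dir.
  def getFile(localDirs: Array[File], subDirsPerLocalDir: Int, filename: String): File = {
    val hash = Utils.nonNegativeHash(filename)
    val dirId = hash % localDirs.length
    val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
    new File(new File(localDirs(dirId), "%02x".format(subDirId)), filename)
  }

  // Naming example: RDDBlockId(42, 3).name == "rdd_42_3"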

Origin www.cnblogs.com/zhuge134/p/11007328.html