Big Data study notes --HDFS writing process source code analysis (2)

Reading & writing process annotated source code analysis HDFS

 Cipian content on a blog undertake not finished, we will focus on an analysis after Namenode get to the metadata, specifically how the real data is written to the node datanode

1 shows the frame of FIG.

Before we begin, or the old rules, first on the map

In this part will highlight writing HDFS 4,5 seven steps in the process step, i.e., the data writing and data packet acknowledgment packet (write packet & ack packet)

2. Source analysis

2.1 Documentation Comments Translation

When explaining this part be related to the class is DFSPacket

DFSPacket class

DFSPacket is used by DataStreamer and DFSOutputStream.
DFSOutputStream generates packets and then ask DataStreamer
to send them to datanodes.

DFSPacket be used DataStreamer and DFSOutputStream, DFSOutputStream then produces a packet transmission request to DataStreamer it sends the data packet up to datanodes

buf is pointed into like follows:
(C is checksum data, D is payload data)
[_________CCCCCCCCC________________DDDDDDDDDDDDDDDD___]
^ ^ ^ ^
| checksumPos dataStart dataPos
checksumStart
Right before sending, we move the checksum data to immediately precede
the actual data, and then insert the header into the buffer immediately
preceding the checksum data, so we make sure to keep enough space in
front of the checksum data to support the largest conceivable header.

The figure vivid demonstration of what a buffer containing part, in front of a checksum, while the back is the real data, before sending the data, we first check will immediately move to the front of the data and real data, and then head immediately inserted into the front of the checksum data, so we can ensure that there is enough space before the checksum data to support the greatest imaginable head space

2.2 source code analysis process to establish a packet

Due to the size of the HDFS file system in a block is 128M, so in order to test the effect, we have prepared a file larger than 128M, and will hit the breakpoint position in the figure below, and then enter debug mode to view the source code!

After calling the write method of two different classes, we saw the first of the more important classes, FSOutputSummer class, write code for this class method is as follows, in bold code, we can actually find the write method write1 is achieved by calling the method, and put it in a for loop, therefore, can guess that, in the method must have a write1 buffer write cycle each time a data buffer length of the array, is constantly cycle, the final completion of all tasks of writing

@Override
  public synchronized void write(byte b[], int off, int len)
      throws IOException {
    
    checkClosed();
    
    if (off < 0 || len < 0 || off > b.length - len) {
      throw new ArrayIndexOutOfBoundsException();
    }

    for (int n=0;n<len;n+=write1(b, off+n, len-n)) {
    }
  }

仔细查找,发现write1方法就在write方法的正下方,现在,我们就来好好解读一下这个write1方法:首先看第一个if判断,count是成员变量,表示的是缓冲区字节个数,初始状态必定为0,而buf是另一个成员变量,它的大小是每一个Chunk的大小乘以Buffer中Chunk的数量,即512 bytes x 9 = 4608 bytes,而我们在IOUtils.copyBytes方法中设置的每次拷贝的缓冲区长度是1024 bytes,因此很显然,第一个if判断并不能满足,走的是下面的代码,在不断的写入1024个字节的过程中,计数器count在不断增加,当buf.length达到4608个字节时(计算逻辑是4608整除1024,也就是在第5次循环时buf被填满)就会触发flushBuffer方法

private static final int BUFFER_NUM_CHUNKS = 9;
  
  protected FSOutputSummer(DataChecksum sum) {
    this.sum = sum;
    this.buf = new byte[sum.getBytesPerChecksum() * BUFFER_NUM_CHUNKS];
    this.checksum = new byte[getChecksumSize() * BUFFER_NUM_CHUNKS];
    this.count = 0;
  }
/**
   * Write a portion of an array, flushing to the underlying
   * stream at most once if necessary.
   */
  private int write1(byte b[], int off, int len) throws IOException {
    if(count==0 && len>=buf.length) {
      // local buffer is empty and user buffer size >= local buffer size, so
      // simply checksum the user buffer and send it directly to the underlying
      // stream
      final int length = buf.length;
      writeChecksumChunks(b, off, length);
      return length;
    }
    
    // copy user data to local buffer
    int bytesToCopy = buf.length-count;
    bytesToCopy = (len<bytesToCopy) ? len : bytesToCopy;
    System.arraycopy(b, off, buf, count, bytesToCopy);
    count += bytesToCopy;
    if (count == buf.length) {
      // local buffer is full
      flushBuffer();
    } 
    return bytesToCopy;
  }

接下去,我们来仔细研究一下flushBuffer方法中到底干了什么,代码如下所示:首先我们看到了两个计算式,第一个变量partialLen计算的是在flush缓冲之后还余留了多少未缓冲的数据,相当于是一个小尾巴,而lenToFlush变量就是需要缓冲的字节数,当这个lenToFlush变量不等于0时,就会调用关键方法writeChecksumChunks,根据方法名顾名思义,我们可以知道,这个方法中会以校验和和真实数据的方式写出buf中的4608个字节

protected synchronized int flushBuffer(boolean keep,
      boolean flushPartial) throws IOException {
    int bufLen = count;
    int partialLen = bufLen % sum.getBytesPerChecksum();
    int lenToFlush = flushPartial ? bufLen : bufLen - partialLen;
    if (lenToFlush != 0) {
      writeChecksumChunks(buf, 0, lenToFlush);
      if (!flushPartial || keep) {
        count = partialLen;
        System.arraycopy(buf, bufLen - count, buf, 0, count);
      } else {
        count = 0;
      }
}

writeChecksumChunks方法的具体代码如下所示:首先,我们发现代码调用了sum的calculateChunkedSums方法,而sum变量是DataChecksum类的一个实例,这个类包含了校验和的类型,在此例中,这个类型为CRC32C,大小为4个字节,而checksum数组的大小也就很明显地可以被计算出来,9 x 4 = 36 bytes,最终将buf数组长度(4608字节),chunk长度,校验和数组,校验和偏移量等变量作为参数传入了关键的方法writeChunk中去

public static enum Type {
    NULL  (CHECKSUM_NULL, 0),
    CRC32 (CHECKSUM_CRC32, 4),
    CRC32C(CHECKSUM_CRC32C, 4),
    DEFAULT(CHECKSUM_DEFAULT, 0), // This cannot be used to create DataChecksum
    MIXED (CHECKSUM_MIXED, 0); // This cannot be used to create DataChecksum

    public final int id;
    public final int size;
    
    private Type(int id, int size) {
      this.id = id;
      this.size = size;
}
private void writeChecksumChunks(byte b[], int off, int len)
  throws IOException {
    sum.calculateChunkedSums(b, off, len, checksum, 0);
    for (int i = 0; i < len; i += sum.getBytesPerChecksum()) {
      int chunkLen = Math.min(sum.getBytesPerChecksum(), len - i);
      int ckOffset = i / sum.getBytesPerChecksum() * getChecksumSize();
      writeChunk(b, off + i, chunkLen, checksum, ckOffset, getChecksumSize());
    }
}

继续调试,我们发现writeChunk方法是通过调用writeChunkImpl方法实现的

// @see FSOutputSummer#writeChunk()
  @Override
  protected synchronized void writeChunk(byte[] b, int offset, int len,
      byte[] checksum, int ckoff, int cklen) throws IOException {
    TraceScope scope =
        dfsClient.getPathTraceScope("DFSOutputStream#writeChunk", src);
    try {
      writeChunkImpl(b, offset, len, checksum, ckoff, cklen);
    } finally {
      scope.close();
    }
}

进入writeChunkImpl方法中,我们终于看到了建包的语句!currentPacket变量是通过调用createPacket方法创建出来的,查看createPacket方法,我们发现,buf的大小为头部的大小加上packetSize的大小,即33 bytes + 65016 bytes = 65049 bytes,而chunksPerPacket的值为126,也就是说,如果将一个chunk和一个checksum作为一个组合,这样的组合在一个packet中共有126对,计算一下,126 x ( 512 + 4 ) = 65016,正好是packetSize的大小!!!既然数据包以及创建出来了,那么接下去就可以开始数组拷贝了,需要拷贝的有两个,校验和以及真实数据,对应的方法就是writeChecksum以及writeData方法,将chunk中的以及checksum数组中的数据依次拷贝到packet数组中去,由于之前的buf数组是每9个chunks一缓冲,因此总共需要缓冲126 / 9 = 14次即可将packet拷满

private DFSPacket createPacket(int packetSize, int chunksPerPkt, long offsetInBlock,
      long seqno, boolean lastPacketInBlock) throws InterruptedIOException {
    final byte[] buf;
    final int bufferSize = PacketHeader.PKT_MAX_HEADER_LEN + packetSize;

    try {
      buf = byteArrayManager.newByteArray(bufferSize);
    } catch (InterruptedException ie) {
      final InterruptedIOException iioe = new InterruptedIOException(
          "seqno=" + seqno);
      iioe.initCause(ie);
      throw iioe;
    }

    return new DFSPacket(buf, chunksPerPkt, offsetInBlock, seqno,
                         getChecksumSize(), lastPacketInBlock);
}
synchronized void writeChecksum(byte[] inarray, int off, int len)
      throws ClosedChannelException {
    checkBuffer();
    if (len == 0) {
      return;
    }
    if (checksumPos + len > dataStart) {
      throw new BufferOverflowException();
    }
    System.arraycopy(inarray, off, buf, checksumPos, len);
    checksumPos += len;
}
synchronized void writeData(byte[] inarray, int off, int len)
      throws ClosedChannelException {
    checkBuffer();
    if (dataPos + len > buf.length) {
      throw new BufferOverflowException();
    }
    System.arraycopy(inarray, off, buf, dataPos, len);
    dataPos += len;
}

当chunk的数量达到最大chunk数量126即一个packet被填满了之后,这个packet会被放入dataQueue即数据队列中去,最终调用关键方法waitAndQueueCurrentPacket

// If packet is full, enqueue it for transmission
    //
    if (currentPacket.getNumChunks() == currentPacket.getMaxChunks() ||
        bytesCurBlock == blockSize) {
      if (DFSClient.LOG.isDebugEnabled()) {
        DFSClient.LOG.debug("DFSClient writeChunk packet full seqno=" +
            currentPacket.getSeqno() +
            ", src=" + src +
            ", bytesCurBlock=" + bytesCurBlock +
            ", blockSize=" + blockSize +
            ", appendChunk=" + appendChunk);
      }
      waitAndQueueCurrentPacket();

直接将断点打到waitAndQueueCurrentPacket方法处,我们查看以下这个方法的源码,发现数据队列以及确认队列都使用了LinkedList这一数据结构,并且使用了Java多线程技术的等待唤醒机制,packet已填满但还未放入数据队列时处于等待状态,而一旦将packet加入队列末尾这一动作完成后就使用了notifyAll方法通知到其他packet也能加入数据队列中去了

// both dataQueue and ackQueue are protected by dataQueue lock
  private final LinkedList<DFSPacket> dataQueue = new LinkedList<DFSPacket>();
  private final LinkedList<DFSPacket> ackQueue = new LinkedList<DFSPacket>();
private void waitAndQueueCurrentPacket() throws IOException {
    synchronized (dataQueue) {
      try {
      // If queue is full, then wait till we have enough space
        boolean firstWait = true;
        try {
          while (!isClosed() && dataQueue.size() + ackQueue.size() >
              dfsClient.getConf().writeMaxPackets) {
            if (firstWait) {
              Span span = Trace.currentSpan();
              if (span != null) {
                span.addTimelineAnnotation("dataQueue.wait");
              }
              firstWait = false;
            }
            try {
              dataQueue.wait();
            } catch (InterruptedException e) {
              // If we get interrupted while waiting to queue data, we still need to get rid
              // of the current packet. This is because we have an invariant that if
              // currentPacket gets full, it will get queued before the next writeChunk.
              //
              // Rather than wait around for space in the queue, we should instead try to
              // return to the caller as soon as possible, even though we slightly overrun
              // the MAX_PACKETS length.
              Thread.currentThread().interrupt();
              break;
            }
          }
        } finally {
          Span span = Trace.currentSpan();
          if ((span != null) && (!firstWait)) {
            span.addTimelineAnnotation("end.wait");
          }
        }
        checkClosed();
        queueCurrentPacket();
      } catch (ClosedChannelException e) {
      }
    }
}
private void queueCurrentPacket() {
    synchronized (dataQueue) {
      if (currentPacket == null) return;
      currentPacket.addTraceParent(Trace.currentSpan());
      dataQueue.addLast(currentPacket);
      lastQueuedSeqno = currentPacket.getSeqno();
      if (DFSClient.LOG.isDebugEnabled()) {
        DFSClient.LOG.debug("Queued packet " + currentPacket.getSeqno());
      }
      currentPacket = null;
      dataQueue.notifyAll();
    }
}

综上所述,我们可以整理出一个packet的构成应该如下图所示:

2.3 数据包发送过程源码分析

调用DataStreamer的run方法,我们可以看到setPipeline方法设置了管线并将数据包通过管线发送到三个datanodes上去,在该方法中,我们找到了记录datanodes信息的数组,截图如下所示,分别对应了s102-s104三个从节点

private void setPipeline(LocatedBlock lb) {
  setPipeline(lb.getLocations(), lb.getStorageTypes(), lb.getStorageIDs());
}
private void setPipeline(DatanodeInfo[] nodes, StorageType[] storageTypes,
    String[] storageIDs) {
  this.nodes = nodes;
  this.storageTypes = storageTypes;
  this.storageIDs = storageIDs;
}

将packet从数据队列移动至确认队列的过程如下:

synchronized (dataQueue) {
  // move packet from dataQueue to ackQueue
  if (!one.isHeartbeatPacket()) {
    span = scope.detach();
    one.setTraceSpan(span);
    dataQueue.removeFirst();
    ackQueue.addLast(one);
    dataQueue.notifyAll();
  }
}

Guess you like

Origin www.cnblogs.com/w950219/p/11911010.html