Introduction to DFSOutputStream

DFSOutputStream Fact Sheet

This section covers the client side of writing data to HDFS. The client does most of its work through the DFSOutputStream object which, as the name suggests, wraps the output stream of the DFS file system. Let's take a detailed look at the important classes and variables it uses.

The class comment already states the main functions of DFSOutputStream clearly, so let's read it first.

/****************************************************************
 * DFSOutputStream creates files from a stream of bytes.
 *
 * The client application writes data that is cached internally by
 * this stream. Data is broken up into packets, each packet is
 * typically 64K in size. A packet consists of chunks. Each chunk
 * is typically 512 bytes and has an associated checksum with it.
 *
 * When a client application fills up the currentPacket, it is
 * enqueued into dataQueue.  The DataStreamer thread picks up
 * packets from the dataQueue, sends it to the first datanode in
 * the pipeline and moves it from the dataQueue to the ackQueue.
 * The ResponseProcessor receives acks from the datanodes. When a
 * successful ack for a packet is received from all datanodes, the
 * ResponseProcessor removes the corresponding packet from the
 * ackQueue.
 *
 *
 * In case of error, all outstanding packets are moved from
 * ackQueue. A new pipeline is setup by eliminating the bad
 * datanode from the original pipeline. The DataStreamer now
 * starts sending packets from the dataQueue.
****************************************************************/
@InterfaceAudience.Private
public class DFSOutputStream extends FSOutputSummer
    implements Syncable, CanSetDropBehind { }
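
To make the sizes in the comment concrete, here is a minimal sketch of how many chunks fit into one packet. The 4-byte checksum per chunk and the 33 bytes reserved for the packet header are assumptions chosen for illustration, and the class and method names are hypothetical; none of them come from the text above.

```java
/** Hypothetical helper: how many checksummed chunks fit into one packet. */
final class PacketMath {
    private PacketMath() {}

    /**
     * @param packetSize       total packet size in bytes (typically 64K per the comment)
     * @param bytesPerChecksum data bytes per chunk (typically 512 per the comment)
     * @param checksumSize     checksum bytes per chunk (assumed, e.g. 4 for a CRC32)
     * @param headerSize       bytes reserved for the packet header (assumed)
     */
    static int chunksPerPacket(int packetSize, int bytesPerChecksum,
                               int checksumSize, int headerSize) {
        int bodySize = packetSize - headerSize;           // room left for chunks
        int chunkSize = bytesPerChecksum + checksumSize;  // data plus its checksum
        return Math.max(bodySize / chunkSize, 1);         // always at least one chunk
    }

    /** Data bytes (checksums excluded) carried by one full packet. */
    static int dataBytesPerPacket(int chunksPerPacket, int bytesPerChecksum) {
        return chunksPerPacket * bytesPerChecksum;
    }
}
```

With these assumed numbers, chunksPerPacket(64 * 1024, 512, 4, 33) yields 126 chunks, that is 126 * 512 = 64512 data bytes per packet.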

DFSOutputStream important variables

The two most important fields are the queues dataQueue and ackQueue, each a typical producer-consumer arrangement. For dataQueue, the producer is the client and the consumer is the DataStreamer. For ackQueue, the producer is the DataStreamer and the consumer is the ResponseProcessor.

/**
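   * dataQueue and ackQueue are two very important variables; both are linked
   * lists of DFSPacket objects.
   * dataQueue holds the packets waiting to be sent: data written by the client
   * is staged in this queue first.
   * ackQueue holds the packets that have been sent and are waiting for acks
   * from the datanodes.
   *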
   */
  // both dataQueue and ackQueue are protected by dataQueue lock
  private final LinkedList<DFSPacket> dataQueue = new LinkedList<DFSPacket>();
  private final LinkedList<DFSPacket> ackQueue = new LinkedList<DFSPacket>();
  private DFSPacket currentPacket = null; // the packet currently being filled
  private DataStreamer streamer;
  private long currentSeqno = 0;
  private long lastQueuedSeqno = -1;
  private long lastAckedSeqno = -1;
  private long bytesCurBlock = 0; // bytes written in current block
  private int packetSize = 0; // write packet size, not including the header.
  private int chunksPerPacket = 0;
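
The way a packet travels through the two queues can be sketched in a few lines. This is a simplified, hypothetical model rather than the real class: packets are reduced to their sequence numbers, and the names TwoQueueModel, enqueue, send and ack are invented for illustration. As in the fields above, both queues are guarded by the dataQueue monitor.

```java
import java.util.LinkedList;

/** Minimal single-lock model of dataQueue/ackQueue; a packet is just its seqno. */
final class TwoQueueModel {
    private final LinkedList<Long> dataQueue = new LinkedList<>();
    private final LinkedList<Long> ackQueue = new LinkedList<>();
    private long currentSeqno = 0;
    private long lastQueuedSeqno = -1;
    private long lastAckedSeqno = -1;

    /** Producer side (the client): queue a new packet and return its seqno. */
    long enqueue() {
        synchronized (dataQueue) {
            long seqno = currentSeqno++;
            dataQueue.addLast(seqno);
            lastQueuedSeqno = seqno;
            dataQueue.notifyAll();
            return seqno;
        }
    }

    /** Streamer side: move the head of dataQueue to the tail of ackQueue. */
    Long send() {
        synchronized (dataQueue) {
            if (dataQueue.isEmpty()) return null;
            long seqno = dataQueue.removeFirst();
            ackQueue.addLast(seqno);
            dataQueue.notifyAll();
            return seqno;
        }
    }

    /** Responder side: the oldest in-flight packet has been acknowledged. */
    Long ack() {
        synchronized (dataQueue) {
            if (ackQueue.isEmpty()) return null;
            long seqno = ackQueue.removeFirst();
            lastAckedSeqno = seqno;
            dataQueue.notifyAll();
            return seqno;
        }
    }

    long lastQueuedSeqno() { synchronized (dataQueue) { return lastQueuedSeqno; } }
    long lastAckedSeqno()  { synchronized (dataQueue) { return lastAckedSeqno; } }
    int pending()  { synchronized (dataQueue) { return dataQueue.size(); } }
    int inFlight() { synchronized (dataQueue) { return ackQueue.size(); } }
}
```

A packet is therefore in exactly one place at any time: waiting in dataQueue, in flight in ackQueue, or acknowledged and gone.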

Data processing thread class DataStreamer

DataStreamer is the core class for sending data. Let's look at the explanation in its class comment:

/**
   * The DataStreamer class is responsible for sending data packets to the
   * datanodes in the pipeline. It retrieves a new blockid and block locations
   * from the namenode, and starts streaming packets to the pipeline of
   * Datanodes. Every packet has a sequence number associated with
   * it. When all the packets for a block are sent out and acks for each
   * of them are received, the DataStreamer closes the current block.
   */
  class DataStreamer extends Daemon {
      
    private volatile boolean streamerClosed = false;
    private volatile ExtendedBlock block; // its length is number of bytes acked
    private Token<BlockTokenIdentifier> accessToken;
    private DataOutputStream blockStream; // output stream used to send data
    private DataInputStream blockReplyStream; // input stream on which acks are received
    private ResponseProcessor response = null;
    private volatile DatanodeInfo[] nodes = null; // list of targets for current block
    private volatile StorageType[] storageTypes = null;
    private volatile String[] storageIDs = null;
      
    ......................  
      
  }

Response processing class ResponseProcessor

ResponseProcessor is an inner class of DataStreamer that processes the acks received from the datanodes.

    // Processes responses from the datanodes.  A packet is removed
    // from the ackQueue when its response arrives.
    //
    private class ResponseProcessor extends Daemon {}

Process flow

Client sends data to dataQueue

Creating the file returns an FSDataOutputStream object. The client calls its write method to write data, and the call eventually reaches org.apache.hadoop.fs.FSOutputSummer.write(byte[], int, int).

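write calls write1() in a loop until len bytes have been written. Whenever a full chunk has been buffered, the abstract method writeChunk is called to write it out; the concrete implementation is the method of the same name in org.apache.hadoop.hdfs.DFSOutputStream.

The actual write happens in writeChunkImpl, shown below: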

private synchronized void writeChunkImpl(byte[] b, int offset, int len,
          byte[] checksum, int ckoff, int cklen) throws IOException {
    dfsClient.checkOpen();
    checkClosed();

    if (len > bytesPerChecksum) {
      throw new IOException("writeChunk() buffer size is " + len +
                            " is larger than supported  bytesPerChecksum " +
                            bytesPerChecksum);
    }
    if (cklen != 0 && cklen != getChecksumSize()) {
      throw new IOException("writeChunk() checksum size is supposed to be " +
                            getChecksumSize() + " but found to be " + cklen);
    }

    if (currentPacket == null) {
      currentPacket = createPacket(packetSize, chunksPerPacket, 
          bytesCurBlock, currentSeqno++, false);
      if (DFSClient.LOG.isDebugEnabled()) {
        DFSClient.LOG.debug("DFSClient writeChunk allocating new packet seqno=" + 
            currentPacket.getSeqno() +
            ", src=" + src +
            ", packetSize=" + packetSize +
            ", chunksPerPacket=" + chunksPerPacket +
            ", bytesCurBlock=" + bytesCurBlock);
      }
    }

    currentPacket.writeChecksum(checksum, ckoff, cklen);
    currentPacket.writeData(b, offset, len);
    currentPacket.incNumChunks();
    bytesCurBlock += len;

    // If packet is full, enqueue it for transmission
    // (waitAndQueueCurrentPacket adds it to the dataQueue)
    if (currentPacket.getNumChunks() == currentPacket.getMaxChunks() ||
        bytesCurBlock == blockSize) {
      if (DFSClient.LOG.isDebugEnabled()) {
        DFSClient.LOG.debug("DFSClient writeChunk packet full seqno=" +
            currentPacket.getSeqno() +
            ", src=" + src +
            ", bytesCurBlock=" + bytesCurBlock +
            ", blockSize=" + blockSize +
            ", appendChunk=" + appendChunk);
      }
      waitAndQueueCurrentPacket();

      // If the reopened file did not end at chunk boundary and the above
      // write filled up its partial chunk. Tell the summer to generate full 
      // crc chunks from now on.
      if (appendChunk && bytesCurBlock%bytesPerChecksum == 0) {
        appendChunk = false;
        resetChecksumBufSize();
      }

      if (!appendChunk) {
        int psize = Math.min((int)(blockSize-bytesCurBlock), dfsClient.getConf().writePacketSize);
        computePacketChunkSize(psize, bytesPerChecksum);
      }
      //
      // if encountering a block boundary, send an empty packet to 
      // indicate the end of block and reset bytesCurBlock.
      //
      if (bytesCurBlock == blockSize) {
        currentPacket = createPacket(0, 0, bytesCurBlock, currentSeqno++, true);
        currentPacket.setSyncBlock(shouldSyncBlock);
        waitAndQueueCurrentPacket();
        bytesCurBlock = 0;
        lastFlushOffset = 0;
      }
    }
  }
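
The two conditions that close a packet (packet full, or block boundary reached) can be replayed with a small simulation. This is a deliberately simplified sketch: it ignores checksums, the appendChunk case and the recomputation of the packet size near the end of a block, and the class ChunkWriterSketch and its string encoding of packets are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified replay of the "packet full" and "block boundary" checks. */
final class ChunkWriterSketch {
    private final int bytesPerChecksum;
    private final int chunksPerPacket;
    private final long blockSize;
    private int chunksInCurrentPacket = 0;
    private long bytesCurBlock = 0;
    private long currentSeqno = 0;

    /** Queued packets as "seqno:numChunks", or "seqno:LAST" for the empty end-of-block packet. */
    final List<String> queued = new ArrayList<>();

    ChunkWriterSketch(int bytesPerChecksum, int chunksPerPacket, long blockSize) {
        this.bytesPerChecksum = bytesPerChecksum;
        this.chunksPerPacket = chunksPerPacket;
        this.blockSize = blockSize;
    }

    /** Write one chunk of len bytes, where len must not exceed bytesPerChecksum. */
    void writeChunk(int len) {
        if (len > bytesPerChecksum) {
            throw new IllegalArgumentException("chunk too large: " + len);
        }
        chunksInCurrentPacket++;
        bytesCurBlock += len;

        // Packet is full, or the block is full: queue the current packet.
        if (chunksInCurrentPacket == chunksPerPacket || bytesCurBlock == blockSize) {
            queued.add((currentSeqno++) + ":" + chunksInCurrentPacket);
            chunksInCurrentPacket = 0;

            // Block boundary: queue an empty packet that marks the end of the block.
            if (bytesCurBlock == blockSize) {
                queued.add((currentSeqno++) + ":LAST");
                bytesCurBlock = 0;
            }
        }
    }
}
```

For example, with 512-byte chunks, 4 chunks per packet and a 3072-byte block, writing six full chunks queues one 4-chunk packet, one 2-chunk packet, and then the empty end-of-block packet.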

When the packet is full, waitAndQueueCurrentPacket is called to put it into dataQueue. The method first checks the queue sizes: while the combined size of dataQueue and ackQueue is greater than writeMaxPackets (80 by default), it waits until there is enough room.

private void waitAndQueueCurrentPacket() throws IOException {
    synchronized (dataQueue) {
      try {
      // If queue is full, then wait till we have enough space
        boolean firstWait = true;
        try {
         // wait while there is not enough room in the queues
          while (!isClosed() && dataQueue.size() + ackQueue.size() >
              dfsClient.getConf().writeMaxPackets) {
                    ..................
            try {
              dataQueue.wait();
            } catch (InterruptedException e) {
                ..............
            }
          }
        } finally {
         ...............
        }
        checkClosed();
        // enqueue the packet
        queueCurrentPacket();
      } catch (ClosedChannelException e) {
      }
    }
  }
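
The back-pressure in this method can be tried out in isolation. The sketch below keeps only the wait/notify skeleton; the class BackpressureSketch, its methods, and the use of plain integers as packets are hypothetical, and a single thread plays both the streamer and the responder.

```java
import java.util.LinkedList;

/** Sketch of the back-pressure in waitAndQueueCurrentPacket; packets are plain ints. */
final class BackpressureSketch {
    private final LinkedList<Integer> dataQueue = new LinkedList<>();
    private final LinkedList<Integer> ackQueue = new LinkedList<>();
    private final int writeMaxPackets;
    private int maxOutstanding = 0; // largest dataQueue + ackQueue size observed

    BackpressureSketch(int writeMaxPackets) { this.writeMaxPackets = writeMaxPackets; }

    /** Client side: block while too many packets are outstanding, then enqueue. */
    void waitAndQueue(int packet) throws InterruptedException {
        synchronized (dataQueue) {
            while (dataQueue.size() + ackQueue.size() > writeMaxPackets) {
                dataQueue.wait();
            }
            dataQueue.addLast(packet);
            maxOutstanding = Math.max(maxOutstanding, dataQueue.size() + ackQueue.size());
            dataQueue.notifyAll();
        }
    }

    /** Streamer side: wait for a packet, then move it to the ackQueue. */
    int sendNext() throws InterruptedException {
        synchronized (dataQueue) {
            while (dataQueue.isEmpty()) dataQueue.wait();
            int packet = dataQueue.removeFirst();
            ackQueue.addLast(packet);
            dataQueue.notifyAll();
            return packet;
        }
    }

    /** Responder side: acknowledge the oldest in-flight packet, freeing space. */
    void ackOldest() {
        synchronized (dataQueue) {
            ackQueue.removeFirst();
            dataQueue.notifyAll();
        }
    }

    int maxOutstanding() { synchronized (dataQueue) { return maxOutstanding; } }
    int outstanding() { synchronized (dataQueue) { return dataQueue.size() + ackQueue.size(); } }

    /** Runs one producer against one combined streamer/responder thread. */
    static BackpressureSketch run(int packets, int writeMaxPackets) {
        BackpressureSketch q = new BackpressureSketch(writeMaxPackets);
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < packets; i++) {
                    q.sendNext();
                    q.ackOldest();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();
        try {
            for (int i = 0; i < packets; i++) q.waitAndQueue(i);
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted", e);
        }
        return q;
    }
}
```

Because the check uses ">" rather than ">=", the producer may add one more packet when exactly writeMaxPackets are outstanding, so the observed maximum in this sketch is at most writeMaxPackets + 1.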

Finally, queueCurrentPacket is called, and this is where the packet is actually put into the queue.

private void queueCurrentPacket() {
    synchronized (dataQueue) {
      if (currentPacket == null) return;
      currentPacket.addTraceParent(Trace.currentSpan());
      dataQueue.addLast(currentPacket); // append the packet to the tail of the queue
      lastQueuedSeqno = currentPacket.getSeqno();
      if (DFSClient.LOG.isDebugEnabled()) {
        DFSClient.LOG.debug("Queued packet " + currentPacket.getSeqno());
      }
      currentPacket = null; // clear the current packet so the next one can be filled
      dataQueue.notifyAll(); // wake every thread waiting on dataQueue
    }
  }

In short, queueCurrentPacket appends the DFSPacket to dataQueue with dataQueue.addLast(currentPacket);

and then dataQueue.notifyAll(); wakes every thread waiting on dataQueue so that the data can be processed.

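DataStreamer processes the data in dataQueue

The core logic DataStreamer uses to send data is in its run method.

Handling errors

At the start of each iteration, it first checks whether an error has occurred.

The actual handling is done by the private processDatanodeError method: if an error is found, all packets in the ack queue are put back into dataQueue, and a new stream is then created to resend the data.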
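
Putting the un-acked packets back can be sketched as follows. It assumes the packets are re-inserted at the front of dataQueue so that they are resent before anything newer; the helper RequeueSketch.requeueUnacked is hypothetical.

```java
import java.util.LinkedList;

/** Hypothetical helper showing how un-acked packets can be queued for resending. */
final class RequeueSketch {
    private RequeueSketch() {}

    /** Move every un-acked packet to the front of dataQueue, keeping the original order. */
    static <T> void requeueUnacked(LinkedList<T> dataQueue, LinkedList<T> ackQueue) {
        synchronized (dataQueue) {
            dataQueue.addAll(0, ackQueue); // oldest un-acked packet becomes the new head
            ackQueue.clear();
            dataQueue.notifyAll();         // wake anyone waiting for work on dataQueue
        }
    }
}
```

Inserting at index 0 rather than appending matters: the datanodes expect packets in sequence order, so the resent packets must go out before the ones that were never sent.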

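Creating the output stream and sending data

The nextBlockOutputStream() method establishes the output stream to the datanode.

Requesting a block from the namenode

The locateFollowingBlock method requests the block; the relevant code is: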
dfsClient.namenode.addBlock(src, dfsClient.clientName, block, excludedNodes, fileId, favoredNodes);

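dfsClient obtains the namenode proxy and requests a new block through its addBlock method. While requesting the new block, addBlock also commits the previous block, which is the block argument.
The excludedNodes argument lists the datanodes that must be excluded when the block is allocated.
The favoredNodes argument lists the datanodes that should be preferred.

Connecting to the first datanode

When the block has been allocated successfully, a LocatedBlock object is returned that contains the datanode information.

The createBlockOutputStream method then connects to the first datanode: it creates a new DataOutputStream for the connection and constructs a Sender object, which sends a write-block request (op code 80) to the datanode. On the datanode side, the data is received and handled by DataXceiver.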

new Sender(out).writeBlock(blockCopy, nodeStorageTypes[0], accessToken,
      dfsClient.clientName, nodes, nodeStorageTypes, null, bcs, 
      nodes.length, block.getNumBytes(), bytesSent, newGS,
      checksum4WriteBlock, cachingStrategy.get(), isLazyPersistFile,
    (targetPinnings == null ? false :targetPinnings[0]), targetPinnings);

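Requesting the block and establishing the connection to the datanode both happen inside a do-while loop; if they fail, the client retries, three times by default.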
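
The shape of such a retry loop can be sketched as below. Everything here is hypothetical: datanodes are plain strings, the "namenode" simply returns the first candidate that is not excluded, and the loop condition `connected == null && --count >= 0` is an assumption about how the retries are counted.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Sketch of a "request block, connect, retry on failure" loop. All names are hypothetical. */
final class BlockSetupRetrySketch {
    final Set<String> excludedNodes = new LinkedHashSet<>();
    int attempts = 0;

    /**
     * @param candidates datanodes the imaginary namenode may hand out, in preference order
     * @param canConnect stands in for the connection attempt to the first datanode
     * @param retries    number of retries allowed after the first attempt
     * @return the datanode that accepted the connection, or null if every attempt failed
     */
    String setupPipeline(List<String> candidates, Predicate<String> canConnect, int retries) {
        String connected = null;
        int count = retries;
        do {
            attempts++;
            String first = pickFirstNode(candidates);
            if (first == null) break;          // nothing left to try
            if (canConnect.test(first)) {
                connected = first;
            } else {
                excludedNodes.add(first);      // ask for a block without this node next time
            }
        } while (connected == null && --count >= 0);
        return connected;
    }

    /** Imaginary allocation: the first candidate that has not been excluded. */
    private String pickFirstNode(List<String> candidates) {
        for (String node : candidates) {
            if (!excludedNodes.contains(node)) return node;
        }
        return null;
    }
}
```

Under this loop shape, a retry count of 3 allows up to four attempts in total (the first try plus three retries), and every failed datanode ends up in excludedNodes for the next allocation request.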

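Setting up the pipeline

Once nextBlockOutputStream has successfully returned the datanode information, setPipeline records the pipeline to those datanodes. The method is simple: it just assigns the allocated datanodes to the corresponding fields.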

private void setPipeline(LocatedBlock lb) {
      setPipeline(lb.getLocations(), lb.getStorageTypes(), lb.getStorageIDs());
    }
    private void setPipeline(DatanodeInfo[] nodes, StorageType[] storageTypes,
        String[] storageIDs) {
      this.nodes = nodes;
      this.storageTypes = storageTypes;
      this.storageIDs = storageIDs;
    }

Initializing data streaming

The initDataStreaming method mainly creates a ResponseProcessor for the datanode list, starts it by calling start, and sets the stage to DATA_STREAMING.

/**
     * Initialize for data streaming
     */
    private void initDataStreaming() {
      this.setName("DataStreamer for file " + src +
          " block " + block);
      response = new ResponseProcessor(nodes);
      response.start();
      stage = BlockConstructionStage.DATA_STREAMING;
    }

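Sending packets

Once everything is ready, the streamer takes a packet from the head of dataQueue, appends it to the tail of ackQueue, wakes all threads waiting on dataQueue, and sends the packet with one.writeTo(blockStream);. Heartbeat packets are sent without being moved to ackQueue.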

// send the packet
          Span span = null;
          synchronized (dataQueue) {
            // move packet from dataQueue to ackQueue
            if (!one.isHeartbeatPacket()) {
              span = scope.detach();
              one.setTraceSpan(span);
              dataQueue.removeFirst();
              ackQueue.addLast(one);
              dataQueue.notifyAll();
            }
          }

          if (DFSClient.LOG.isDebugEnabled()) {
            DFSClient.LOG.debug("DataStreamer block " + block +
                " sending packet " + one);
          }

          // write out data to remote datanode
          TraceScope writeScope = Trace.startSpan("writeTo", span);
          try {
            one.writeTo(blockStream);
            blockStream.flush();   
          } catch (IOException e) {
            // HDFS-3398 treat primary DN is down since client is unable to 
            // write to primary DN. If a failed or restarting node has already
            // been recorded by the responder, the following call will have no 
            // effect. Pipeline recovery can handle only one node error at a
            // time. If the primary node fails again during the recovery, it
            // will be taken out then.
            tryMarkPrimaryDatanodeFailed();
            throw e;
          } finally {
            writeScope.close();
          }

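Closing the data stream

When the last packet of a block has been sent and ackQueue has drained, meaning every packet has been acked, writing of that block is finished, and endBlock is called to close the corresponding streams: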

// Is this block full?
          if (one.isLastPacketInBlock()) {
            // wait for the close packet has been acked
            synchronized (dataQueue) {
              while (!streamerClosed && !hasError && 
                  ackQueue.size() != 0 && dfsClient.clientRunning) {
                dataQueue.wait(1000);// wait for acks to arrive from datanodes
              }
            }
            if (streamerClosed || hasError || !dfsClient.clientRunning) {
              continue;
            }

            endBlock();
          }

endBlock closes the responder, closes the stream, clears the pipeline, and sets the stage back to PIPELINE_SETUP_CREATE.

private void endBlock() {
      if(DFSClient.LOG.isDebugEnabled()) {
        DFSClient.LOG.debug("Closing old block " + block);
      }
      this.setName("DataStreamer for file " + src);
      closeResponder();
      closeStream();
      setPipeline(null, null, null);
      stage = BlockConstructionStage.PIPELINE_SETUP_CREATE;
    }
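
The stage changes made by initDataStreaming and endBlock form a small cycle, which the following sketch models. Only the two stages mentioned in the text are included (the real enum has more values), datanodes are plain strings, and the class StreamerStageSketch with its methods is hypothetical.

```java
/** Sketch of the per-block stage cycle; only the two stages named in the text are modeled. */
final class StreamerStageSketch {
    enum Stage { PIPELINE_SETUP_CREATE, DATA_STREAMING }

    private Stage stage = Stage.PIPELINE_SETUP_CREATE;
    private String[] nodes = null; // pipeline targets, null while no block is open
    private int blocksClosed = 0;

    /** Roughly setPipeline + initDataStreaming: a pipeline exists, start streaming. */
    void startBlock(String[] pipeline) {
        if (stage != Stage.PIPELINE_SETUP_CREATE) {
            throw new IllegalStateException("a block is already open");
        }
        nodes = pipeline.clone();
        stage = Stage.DATA_STREAMING;
    }

    /** Roughly endBlock: drop the pipeline and return to the setup stage. */
    void endBlock() {
        if (stage != Stage.DATA_STREAMING) {
            throw new IllegalStateException("no block is open");
        }
        nodes = null;
        blocksClosed++;
        stage = Stage.PIPELINE_SETUP_CREATE;
    }

    Stage stage() { return stage; }
    boolean hasPipeline() { return nodes != null; }
    int blocksClosed() { return blocksClosed; }
}
```

Returning to PIPELINE_SETUP_CREATE is what makes the streamer request a fresh block and build a new pipeline before the next packet is sent.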

ResponseProcessor processes the acks

This logic is relatively simple.

@Override
      public void run() {

        setName("ResponseProcessor for block " + block);
        PipelineAck ack = new PipelineAck();

        TraceScope scope = NullScope.INSTANCE;
        while (!responderClosed && dfsClient.clientRunning && !isLastPacketInBlock) {
          // process responses from datanodes.
          try {
            // read an ack from the pipeline
            long begin = Time.monotonicNow();
            ack.readFields(blockReplyStream);
             ..............

                
            // once everything has been processed successfully, remove the packet from the ack queue
            synchronized (dataQueue) {
              scope = Trace.continueSpan(one.getTraceSpan());
              one.setTraceSpan(null);
              lastAckedSeqno = seqno;
              pipelineRecoveryCount = 0;
              ackQueue.removeFirst();
              dataQueue.notifyAll();

              one.releaseBuffer(byteArrayManager);
            }
          } catch (Exception e) {
          // if an exception occurs it is not handled here; it is stored in an AtomicReference instead
            if (!responderClosed) {
              if (e instanceof IOException) {
                setLastException((IOException)e);
              }
                ............
            }
          } finally {
            scope.close();
          }
        }
      }
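
The two ideas in this loop, removing the head of ackQueue on a successful ack and parking an exception in an AtomicReference instead of handling it in place, can be sketched together. The class AckSketch is hypothetical, packets are reduced to sequence numbers, a private lock object stands in for the dataQueue monitor, and the check that acks arrive in order is an assumption about the elided part of the code.

```java
import java.io.IOException;
import java.util.LinkedList;
import java.util.concurrent.atomic.AtomicReference;

/** Sketch of the ack bookkeeping; a packet is represented by its seqno. */
final class AckSketch {
    private final Object lock = new Object(); // stands in for the dataQueue monitor
    private final LinkedList<Long> ackQueue = new LinkedList<>();
    private final AtomicReference<IOException> lastException = new AtomicReference<>();
    private long lastAckedSeqno = -1;

    /** Called when a packet has been sent and now awaits its ack. */
    void sent(long seqno) {
        synchronized (lock) { ackQueue.addLast(seqno); }
    }

    /** Handle one ack; a failure is recorded for later instead of being thrown. */
    void onAck(long seqno) {
        try {
            synchronized (lock) {
                // Assumed check: acks must arrive in the order packets were sent.
                if (ackQueue.isEmpty() || ackQueue.getFirst() != seqno) {
                    throw new IOException("unexpected ack for seqno " + seqno);
                }
                lastAckedSeqno = seqno;
                ackQueue.removeFirst();
                lock.notifyAll(); // wake threads waiting for acks or for queue space
            }
        } catch (IOException e) {
            lastException.compareAndSet(null, e); // keep only the first error
        }
    }

    long lastAckedSeqno() { synchronized (lock) { return lastAckedSeqno; } }
    int inFlight() { synchronized (lock) { return ackQueue.size(); } }
    IOException lastException() { return lastException.get(); }
}
```

Storing the exception rather than throwing it lets the responder thread stop quietly while another thread inspects lastException and decides how to recover.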

Analysis of the whole process of writing data in hdfs of hadoop source code analysis---client processing