[Big Data Hadoop] ZKFC (DFSZKFailoverController) high-availability active-standby switching mechanism in HDFS-HA mode

Overview

When a NameNode successfully switches to the Active state, it creates an ephemeral znode in ZooKeeper that holds information about the current Active NameNode, such as its host name. When the Active NameNode fails or its session times out, the ephemeral znode is removed (either by ZKFC or automatically by ZooKeeper when the session expires), and the deletion event triggers the election of the next Active NameNode.

Because ZooKeeper is strongly consistent, it guarantees that at most one node can successfully create the znode and become the current Active NameNode. This is why the community uses ZooKeeper for automatic HDFS HA switching.
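To make the mechanism concrete, here is a minimal, hypothetical sketch of ephemeral-znode mutual exclusion using the plain ZooKeeper Java client. It is not the ActiveStandbyElector code; the class name, connection string, lock path, and payload are illustrative placeholders (in a real cluster the parent path is created by hdfs zkfc -formatZK):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class EphemeralLockSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder quorum address; the real elector reads ha.zookeeper.quorum.
    ZooKeeper zk = new ZooKeeper("zk-host:2181", 10000, event -> { });

    String lockPath = "/hadoop-ha/mycluster/ActiveStandbyElectorLock";
    byte[] myInfo = "nn1@host-a:8020".getBytes("UTF-8");

    try {
      // Only one client can create the ephemeral node; the winner acts as active.
      zk.create(lockPath, myInfo, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
      System.out.println("Won the election, acting as active");
    } catch (KeeperException.NodeExistsException e) {
      // Someone else holds the lock: set a watch so that its deletion
      // (session expiry or graceful exit) triggers a new election attempt.
      Stat stat = zk.exists(lockPath, true);
      System.out.println("Standby; watching the active znode, version="
          + (stat == null ? -1 : stat.getVersion()));
    }
    zk.close();
  }
}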

Component principle

Inside the ZKFC process, three objects work together:

  • ZKFailoverController: coordinates the HealthMonitor and ActiveStandbyElector objects, handles the state-change events they send, and drives the automatic switching process
  • HealthMonitor: monitors the service status of the local NameNode
  • ActiveStandbyElector: manages and watches the election znodes on ZooKeeper

The interaction among these three objects is shown in Figure 1-1.

The startup log shows clues

In ZKFC, the RPC service is started first; the log then prints Entering state SERVICE_HEALTHY, the znode /hadoop-ha-cdp-cluster/cdp-cluster/ActiveBreadCrumb is written to ZooKeeper, and finally the local NameNode is transitioned to the active state.

ZKFC log

2023-03-22 08:45:47,258 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server spark-31/10.253.128.31:2181. Will not attempt to authenticate using SASL (unknown error)
2023-03-22 08:45:47,266 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.253.128.31:50782, server: spark-31/10.253.128.31:2181
2023-03-22 08:45:47,285 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server spark-31/10.253.128.31:2181, sessionid = 0x1000e3e5f9a0005, negotiated timeout = 10000
2023-03-22 08:45:47,289 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2023-03-22 08:45:47,299 INFO org.apache.hadoop.ha.ZKFailoverController: ZKFC RpcServer binding to /10.253.128.31:8019
2023-03-22 08:45:47,331 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue, queueCapacity: 300, scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler, ipcBackoff: false.
2023-03-22 08:45:47,378 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8019
2023-03-22 08:45:47,484 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2023-03-22 08:45:47,484 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8019: starting
2023-03-22 08:45:47,724 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_HEALTHY
2023-03-22 08:45:47,725 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at /10.253.128.31:8020 entered state: SERVICE_HEALTHY
2023-03-22 08:45:47,745 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2023-03-22 08:45:47,756 INFO org.apache.hadoop.ha.ActiveStandbyElector: No old node to fence
2023-03-22 08:45:47,756 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /hadoop-ha-cdp-cluster/cdp-cluster/ActiveBreadCrumb to indicate that the local node is the most recent active...
2023-03-22 08:45:47,763 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at spark-31/10.253.128.31:8020 active...
2023-03-22 08:45:49,271 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at spark-31/10.253.128.31:8020 to active state
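The ActiveBreadCrumb znode mentioned in the log can be inspected directly. Below is a small, hypothetical helper (the class name is made up) that reads it with the plain ZooKeeper client; the quorum host and znode path are the ones from this cluster's log and would differ in other deployments:

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ReadBreadCrumb {
  public static void main(String[] args) throws Exception {
    // Quorum and path taken from the log above; adjust for your cluster.
    ZooKeeper zk = new ZooKeeper("spark-31:2181", 10000, event -> { });

    Stat stat = new Stat();
    // The breadcrumb stores a small serialized record describing the most
    // recent active NameNode (nameservice id, namenode id, host, port).
    byte[] data = zk.getData(
        "/hadoop-ha-cdp-cluster/cdp-cluster/ActiveBreadCrumb", false, stat);

    System.out.println("breadcrumb version = " + stat.getVersion());
    // The payload is a protobuf record; the hostname is readable in plain text.
    System.out.println(new String(data, "UTF-8"));
    zk.close();
  }
}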

After a NameNode starts, it enters the standby state by default. When ZKFC detects that the local NameNode is up, it sends a monitorHealth probe and then takes part in the election. If it wins the election, it sends an RPC to transition the NameNode to the active state. The NameNode then runs stopStandbyServices to stop the standby-side threads such as standbyCheckpointer and editLogTailer, and finally runs startActiveServices to start the threads required by the active role.
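The "transition to active" step is an HAServiceProtocol RPC (transitionToActive). The sketch below shows that call using Hadoop's public HA classes; it is a simplified illustration, not the ZKFC code path. The nameservice and NameNode ids ("cdp-cluster", "nn1") are assumptions taken from dfs.nameservices / dfs.ha.namenodes.*, and ZKFC additionally performs fencing and breadcrumb bookkeeping around this call:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ha.HAServiceProtocol;
import org.apache.hadoop.ha.HAServiceProtocol.RequestSource;
import org.apache.hadoop.ha.HAServiceProtocol.StateChangeRequestInfo;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.tools.NNHAServiceTarget;

public class TransitionSketch {
  public static void main(String[] args) throws Exception {
    // Loads hdfs-site.xml / core-site.xml from the classpath.
    Configuration conf = new HdfsConfiguration();

    // Nameservice and NameNode ids are assumptions; the target resolves the
    // NameNode RPC address from dfs.namenode.rpc-address.<ns>.<nn>.
    NNHAServiceTarget target = new NNHAServiceTarget(conf, "cdp-cluster", "nn1");

    // Same kind of proxy ZKFC uses to talk to the local NameNode.
    HAServiceProtocol proxy = target.getProxy(conf, 15000);

    // Health probe: throws HealthCheckFailedException if the NN is unhealthy
    // (for example while it is still in safemode, as in the log below).
    proxy.monitorHealth();

    // Ask the NameNode to stop standby services and start active services.
    // When automatic failover is enabled, the NameNode only accepts this
    // from ZKFC or from a forced manual request.
    proxy.transitionToActive(
        new StateChangeRequestInfo(RequestSource.REQUEST_BY_USER_FORCED));
  }
}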

NameNode log

monitorHealth from 10.253.128.31:45708: org.apache.hadoop.ha.HealthCheckFailedException: The NameNode is configured to report UNHEALTHY to ZKFC in Safemode.
2023-03-26 10:33:21,493 INFO org.apache.hadoop.hdfs.StateChange: STATE* Safe mode is OFF
2023-03-26 10:33:21,493 INFO org.apache.hadoop.hdfs.StateChange: STATE* Leaving safe mode after 33 secs
2023-03-26 10:33:21,493 INFO org.apache.hadoop.hdfs.StateChange: STATE* Network topology has 1 racks and 3 datanodes
2023-03-26 10:33:21,493 INFO org.apache.hadoop.hdfs.StateChange: STATE* UnderReplicatedBlocks has 0 blocks
2023-03-26 10:33:21,749 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started for standby state
2023-03-26 10:33:21,751 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Edit log tailer interrupted: sleep interrupted
2023-03-26 10:33:21,755 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for active state
2023-03-26 10:33:21,766 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Starting recovery process for unclosed journal segments...
2023-03-26 10:33:21,803 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Successfully started new epoch 8
2023-03-26 10:33:21,804 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Beginning recovery of unclosed segment starting at txid 102544
2023-03-26 10:33:21,850 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Recovery prepare phase complete. Responses:
10.253.128.33:8485: segmentState { startTxId: 102544 endTxId: 102544 isInProgress: true } lastWriterEpoch: 7 lastCommittedTxId: 102543
10.253.128.31:8485: segmentState { startTxId: 102544 endTxId: 102544 isInProgress: true } lastWriterEpoch: 7 lastCommittedTxId: 102543
10.253.128.32:8485: segmentState { startTxId: 102544 endTxId: 102544 isInProgress: true } lastWriterEpoch: 7 lastCommittedTxId: 102543

ZKFailoverController

This process does the following things:

  • Initialize the ZooKeeper connection information and create the ActiveStandbyElector
  • Create the RPC service that connects to the local NameNode and checks its service status
  • Create the HealthMonitor and start its thread

First, look at the class's member variables; they already give a good idea of what it does.
public abstract class ZKFailoverController {

  // omitted...

  // ZooKeeper connection string
  private String zkQuorum;
  // the local NameNode target
  protected final HAServiceTarget localTarget;
  // health-check object
  private HealthMonitor healthMonitor;
  // election object
  private ActiveStandbyElector elector;
  // RPC server object
  protected ZKFCRpcServer rpcServer;
  // initial health state
  private State lastHealthState = State.INITIALIZING;

  private volatile HAServiceState serviceState = HAServiceState.INITIALIZING;

  // ...

Work done by the core thread


private int doRun(String[] args) throws Exception {
  try {
    initZK();
  } catch (KeeperException ke) {
    LOG.error("Unable to start failover controller. Unable to connect "
        + "to ZooKeeper quorum at " + zkQuorum + ". Please check the "
        + "configured value for " + ZK_QUORUM_KEY + " and ensure that "
        + "ZooKeeper is running.", ke);
    return ERR_CODE_NO_ZK;
  }
  // argument parsing omitted...
  try {
    // create the RPC server that connects to the local NameNode RPC
    initRPC();
    // create the HealthMonitor
    initHM();
    // start the RPC server
    startRPC();
    // block the main thread here
    mainLoop();
  } catch (Exception e) {
    LOG.error("The failover controller encounters runtime error: ", e);
    throw e;
  } finally {
    rpcServer.stopAndJoin();

    elector.quitElection(true);
    healthMonitor.shutdown();
    healthMonitor.join();
  }
  return 0;
}

HealthMonitor

HealthMonitor is the service-monitoring object; it monitors the status of the NameNode running on the local node. Five states are maintained internally:

public enum State {
    /**
     * The health monitor is still starting up.
     */
    INITIALIZING,

    /**
     * The service is not responding to health check RPCs.
     */
    SERVICE_NOT_RESPONDING,

    /**
     * The service is connected and healthy.
     */
    SERVICE_HEALTHY,

    /**
     * The service is running but unhealthy.
     */
    SERVICE_UNHEALTHY,

    /**
     * The health monitor itself failed unrecoverably and can
     * no longer provide accurate information.
     */
    HEALTH_MONITOR_FAILED;
  }

Take a look at the HealthMonitor initialization process:

private void initHM() {
  // create the HealthMonitor object
  healthMonitor = new HealthMonitor(conf, localTarget);
  // register callbacks so that state changes can trigger them
  healthMonitor.addCallback(new HealthCallbacks());
  healthMonitor.addServiceStateCallback(new ServiceStateCallBacks());
  healthMonitor.start();
}

The logic the HealthMonitor uses to detect the NameNode's health is actually very simple: send an RPC request and see whether it gets a response. The relevant code is as follows:

public void run() {
  // keep checking in a loop
  while (shouldRun) {
    try {
      // try to connect to the NameNode; exit the loop once connected
      loopUntilConnected();
      // run the health checks
      doHealthChecks();
    } catch (InterruptedException ie) {
      Preconditions.checkState(!shouldRun,
          "Interrupted but still supposed to run");
    }
  }
}

Continue into doHealthChecks:

private void doHealthChecks() throws InterruptedException {
  while (shouldRun) {
    HAServiceStatus status = null;
    boolean healthy = false;
    try {
      // RPC to the NameNode to fetch its service status
      status = proxy.getServiceStatus();
      // health probe
      proxy.monitorHealth();
      // no exception thrown, so mark as healthy
      healthy = true;
    } catch (Throwable t) {
      if (isHealthCheckFailedException(t)) {
        LOG.warn("Service health check failed for {}", targetToMonitor, t);
        // the probe failed: enter the unhealthy state
        enterState(State.SERVICE_UNHEALTHY);
      } else {
        LOG.warn("Transport-level exception trying to monitor health of {}",
            targetToMonitor, t);
        RPC.stopProxy(proxy);
        proxy = null;
        // transport-level failure: enter the not-responding state
        enterState(State.SERVICE_NOT_RESPONDING);
        Thread.sleep(sleepAfterDisconnectMillis);
        return;
      }
    }
    // record the last service status
    if (status != null) {
      setLastServiceStatus(status);
    }
    // everything succeeded: enter the healthy state
    if (healthy) {
      enterState(State.SERVICE_HEALTHY);
    }
    // sleep until the next check interval
    Thread.sleep(checkIntervalMillis);
  }
}

Whenever a different state is detected, the enterState method is called; inside it, the callbacks registered for state changes are fired. These callbacks are handled in the ZKFailoverController class.
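For reference, enterState itself (shown here in simplified, paraphrased form rather than verbatim source) just records the new state and fans the change out to every registered callback:

// paraphrased sketch of HealthMonitor.enterState
private synchronized void enterState(State newState) {
  if (newState != state) {
    LOG.info("Entering state " + newState);
    state = newState;
    // notify every callback registered via addCallback();
    // ZKFailoverController registered HealthCallbacks in initHM()
    synchronized (callbacks) {
      for (Callback cb : callbacks) {
        cb.enteredState(newState);
      }
    }
  }
}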

The callback registered by ZKFailoverController is HealthCallbacks; its enteredState method is as follows:

class HealthCallbacks implements HealthMonitor.Callback {
  @Override
  public void enteredState(HealthMonitor.State newState) {
    // record the latest health state
    setLastHealthState(newState);
    recheckElectability();
  }
}

Then look at the recheckElectability method:

private void recheckElectability() {
  // Maintain lock ordering of elector -> ZKFC
  synchronized (elector) {
    synchronized (this) {
      boolean healthy = lastHealthState == State.SERVICE_HEALTHY;
      // part of the code omitted...

      switch (lastHealthState) {
      // if the current state is healthy, join this round of the election
      case SERVICE_HEALTHY:
        if (serviceState != HAServiceState.OBSERVER) {
          elector.joinElection(targetToData(localTarget));
        }
        if (quitElectionOnBadState) {
          quitElectionOnBadState = false;
        }
        break;

      case INITIALIZING:
        LOG.info("Ensuring that " + localTarget + " does not " +
            "participate in active master election");
        // still initializing, so do not join the election yet
        elector.quitElection(false);
        serviceState = HAServiceState.INITIALIZING;
        break;

      case SERVICE_UNHEALTHY:
      case SERVICE_NOT_RESPONDING:
        LOG.info("Quitting master election for " + localTarget +
            " and marking that fencing is necessary");
        // unhealthy or not responding: quit the election and require fencing
        elector.quitElection(true);
        serviceState = HAServiceState.INITIALIZING;
        break;

      case HEALTH_MONITOR_FAILED:
        fatalError("Health monitor failed!");
        break;

      default:
        throw new IllegalArgumentException("Unhandled state:"
                                             + lastHealthState);
      }
    }
  }
}

ActiveStandbyElector

The ActiveStandbyElector object is mainly responsible for the interaction with ZooKeeper. For example, when a node successfully switches to Active NameNode, the ActiveStandbyElector creates a znode on ZooKeeper. Two key methods in this class relate to the Active NameNode election: joinElection() and quitElection().

The joinElection method is called to indicate that the local NameNode is ready to participate in the election of the Active NameNode and is a candidate node. The quitElection method is called to indicate that the local node quits this election.

Both methods are called during HDFS HA automatic switching; naturally, quitElection is called on the node where the original Active NameNode is located.
Inside joinElection, a temporary (ephemeral) znode is created on ZooKeeper. The code is as follows:

private void joinElectionInternal() {
  Preconditions.checkState(appData != null,
      "trying to join election without any app data");
  if (zkClient == null) {
    if (!reEstablishSession()) {
      fatalError("Failed to reEstablish connection with ZooKeeper");
      return;
    }
  }

  createRetryCount = 0;
  wantToBeInElection = true;
  createLockNodeAsync();
}

private void createLockNodeAsync() {
  zkClient.create(zkLockFilePath, appData, zkAcl, CreateMode.EPHEMERAL,
      this, zkClient);
}

The quitElection method will delete the zk node that has been created, as follows:

public synchronized void quitElection(boolean needFence) {
  LOG.info("Yielding from election");
  if (!needFence && state == State.ACTIVE) {
    // If active is gracefully going back to standby mode, remove
    // our permanent znode so no one fences us.
    tryDeleteOwnBreadCrumbNode();
  }
  reset();
  wantToBeInElection = false;
}

private void tryDeleteOwnBreadCrumbNode() {
  assert state == State.ACTIVE;
  LOG.info("Deleting bread-crumb of active node...");

  // Sanity check the data. This shouldn't be strictly necessary,
  // but better to play it safe.
  Stat stat = new Stat();
  byte[] data = null;
  try {
    data = zkClient.getData(zkBreadCrumbPath, false, stat);

    if (!Arrays.equals(data, appData)) {
      throw new IllegalStateException(
          "We thought we were active, but in fact " +
          "the active znode had the wrong data: " +
          StringUtils.byteToHexString(data) + " (stat=" + stat + ")");
    }

    deleteWithRetries(zkBreadCrumbPath, stat.getVersion());
  } catch (Exception e) {
    LOG.warn("Unable to delete our own bread-crumb of being active at {}." +
        ". Expecting to be fenced by the next active.", zkBreadCrumbPath, e);
  }
}
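The warning above ("Expecting to be fenced by the next active") ties back to the fencing configuration: when the old active's breadcrumb is still present, the new active's ZKFC runs the configured fencing methods against it before taking over. A typical hdfs-site.xml fragment looks roughly like this (the key path and the method list are illustrative and must match your environment; the ZooKeeper quorum itself is configured separately via ha.zookeeper.quorum in core-site.xml):

<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- fencing methods tried in order against the previous active -->
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
shell(/bin/true)</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hdfs/.ssh/id_rsa</value>
</property>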

I hope this article is helpful to you. Remember to follow, comment, and favorite. Thank you!

Origin blog.csdn.net/u013412066/article/details/129777627