HDFS cluster rolling upgrade, downgrade, and rollback

Table of contents

1. HDFS cluster rolling upgrade

1.1 Introduction

1.2 Non-stop rolling upgrade

1.2.1 Non-federated HA cluster

1.2.1.1 Rolling upgrade preparation

1.2.1.2 Upgrade the Active and Standby NNs

1.2.1.3 Upgrade DNs

1.2.1.4 Complete the rolling upgrade

1.2.2 Federated HA cluster

1.3 Downtime upgrade

1.3.1 Non-HA cluster

2. HDFS cluster downgrade and rollback

2.1 The difference between downgrade and rollback

2.2 HA cluster downgrade

2.2.1 Downgrade DataNodes

2.2.2 Downgrade the Active and Standby NameNodes

2.2.3 Finalize the downgrade

2.2.4 HA cluster downgrade precautions

2.3 Cluster rollback


 

1. HDFS cluster rolling upgrade

1.1 Introduction

In Hadoop v2, HDFS supports NameNode High Availability (HA), which makes it feasible to upgrade HDFS without downtime. Note that rolling upgrades are supported only from Hadoop 2.4.0 onwards. To upgrade an HDFS cluster without downtime, the cluster must be set up with HA.

In an HA cluster there are two or more NameNodes (NNs), many DataNodes (DNs), a few JournalNodes (JNs), and a few ZooKeeper nodes (ZKNs). JNs are relatively stable and, in most cases, do not need to be upgraded when HDFS is upgraded.

The rolling upgrade described here covers only NNs and DNs, not JNs or ZKNs. Upgrading JNs and ZKNs may cause cluster downtime.

1.2 Non-stop rolling upgrade

1.2.1 Non-federated HA cluster

Suppose there are two NameNodes, NN1 and NN2, where NN1 is in the Active state and NN2 is in the Standby state.

1.2.1.1 Rolling upgrade preparation 

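# Create a new fsimage file to be used for rollback
hdfs dfsadmin -rollingUpgrade prepare

# Run the following command repeatedly to check whether the rollback fsimage has been created.
# When the output shows "Proceeding with Rolling Upgrade", preparation is complete.
hdfs dfsadmin -rollingUpgrade query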
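
The polling step above can be scripted. The following is a minimal sketch rather than part of the original procedure: the function name `wait_for_rollback_image`, its interval and retry parameters, and the `HDFS_CMD` override (a hypothetical hook so the loop can be exercised without a cluster) are all assumptions. It relies only on the `Proceeding with Rolling Upgrade` message mentioned above.

```shell
#!/bin/sh
# Sketch: poll until the rollback fsimage is reported as ready.
# HDFS_CMD is a hypothetical override for testing; it defaults to the real hdfs CLI.

wait_for_rollback_image() {
  interval="${1:-10}"    # seconds between polls (assumed default)
  max_tries="${2:-60}"   # give up after this many polls (assumed default)
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    # Ready once the query output contains the message below.
    if "${HDFS_CMD:-hdfs}" dfsadmin -rollingUpgrade query 2>&1 |
        grep -q "Proceeding with Rolling Upgrade"; then
      return 0
    fi
    tries=$((tries + 1))
    sleep "$interval"
  done
  echo "rollback image not ready after $max_tries polls" >&2
  return 1
}
```

A possible invocation, once the function is defined, is `hdfs dfsadmin -rollingUpgrade prepare && wait_for_rollback_image 10 60`.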

1.2.1.2 Upgrade the Active and Standby NNs

# Stop NN2:
hdfs --daemon stop namenode
# Upgrade NN2 and start it:
hdfs --daemon start namenode -rollingUpgrade started

# Perform a failover so that NN2 becomes the Active node and NN1 becomes the Standby node.
# Stop NN1:
hdfs --daemon stop namenode
# Upgrade NN1 and start it:
hdfs --daemon start namenode -rollingUpgrade started
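
The failover between the two NameNode restarts is not spelled out as a command above. One way to do it is with `hdfs haadmin`, sketched below under stated assumptions: the NameNode IDs `nn1` and `nn2` are hypothetical (the real IDs come from `dfs.ha.namenodes.<nameservice>` in your configuration), the function name `promote_standby` is illustrative, and `HDFS_CMD` is a hypothetical override used only to dry-test the logic. Behavior may differ in clusters with automatic failover, so treat this as a starting point.

```shell
#!/bin/sh
# Sketch: make TO_ID the Active NameNode, then verify the result.
# HDFS_CMD is a hypothetical override for testing; it defaults to the real hdfs CLI.

promote_standby() {
  from_id="$1"   # NameNode ID that is currently Active (e.g. nn1, hypothetical)
  to_id="$2"     # NameNode ID that should become Active (e.g. nn2, hypothetical)

  state="$("${HDFS_CMD:-hdfs}" haadmin -getServiceState "$to_id")" || return 1
  if [ "$state" = "active" ]; then
    echo "$to_id is already active"
    return 0
  fi

  # Hand the Active role from from_id over to to_id.
  "${HDFS_CMD:-hdfs}" haadmin -failover "$from_id" "$to_id" || return 1

  # Succeed only if to_id now reports itself as active.
  [ "$("${HDFS_CMD:-hdfs}" haadmin -getServiceState "$to_id")" = "active" ]
}
```

For the procedure above, `promote_standby nn1 nn2` would be run after NN2 has been restarted on the new version and before NN1 is stopped.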

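1.2.1.3 Upgrade DNs

# Choose a small subset of the DataNodes to upgrade (for example, all DataNodes in one rack).
# Shut down each selected DN for upgrade. IPC_PORT is set by dfs.datanode.ipc.address; the default is 9867.
hdfs dfsadmin -shutdownDatanode <DATANODE_HOST:IPC_PORT> upgrade

# Check whether the DataNode has stopped serving. If node information is still returned, the node has not actually shut down yet.
hdfs dfsadmin -getDatanodeInfo <DATANODE_HOST:IPC_PORT>

# Start the DN (running the upgraded software).
hdfs --daemon start datanode

# Perform the steps above on every selected DN, then repeat with the next subset until all DNs in the cluster are upgraded.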
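
The shutdown-and-wait part of this loop lends itself to a small helper. The sketch below is an illustration under stated assumptions: it assumes `hdfs dfsadmin -getDatanodeInfo` exits with a non-zero status once the DataNode no longer answers, and the function name, polling parameters, and `HDFS_CMD` override (a hypothetical testing hook) are not part of the original procedure.

```shell
#!/bin/sh
# Sketch: ask one DataNode to shut down for upgrade and wait until it is gone.
# HDFS_CMD is a hypothetical override for testing; it defaults to the real hdfs CLI.

shutdown_datanode_for_upgrade() {
  dn="$1"                # DATANODE_HOST:IPC_PORT
  interval="${2:-5}"     # seconds between checks (assumed default)
  max_tries="${3:-60}"   # maximum number of checks (assumed default)

  "${HDFS_CMD:-hdfs}" dfsadmin -shutdownDatanode "$dn" upgrade || return 1

  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    # Assumption: a failing getDatanodeInfo means the node has really stopped.
    if ! "${HDFS_CMD:-hdfs}" dfsadmin -getDatanodeInfo "$dn" >/dev/null 2>&1; then
      return 0
    fi
    tries=$((tries + 1))
    sleep "$interval"
  done
  echo "$dn still answering after $max_tries checks" >&2
  return 1
}
```

Once the helper returns successfully for a node, the new software can be put in place on that host and the DataNode started there with `hdfs --daemon start datanode`.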

1.2.1.4 Complete the rolling upgrade

# Finalize the rolling upgrade
hdfs dfsadmin -rollingUpgrade finalize

1.2.2 Federated HA cluster

A federated cluster is a cluster with multiple namespaces. Each namespace is served by a pair of Active and Standby NameNodes. Such a setup is commonly known as a federation + HA cluster.

The upgrade process for a federated cluster is essentially the same as for a non-federated cluster, except that the NameNode operations must be repeated for each namespace.

# 1. Run the upgrade preparation for each namespace
hdfs dfsadmin -rollingUpgrade prepare

# 2. Upgrade the Active/Standby nodes of each namespace
# 2.1 Stop NN2:
hdfs --daemon stop namenode
# 2.2 Upgrade NN2 and start it:
hdfs --daemon start namenode -rollingUpgrade started
# 2.3 Perform a failover so that NN2 becomes the Active node and NN1 becomes the Standby node.
# 2.4 Stop NN1:
hdfs --daemon stop namenode
# 2.5 Upgrade NN1 and start it:
hdfs --daemon start namenode -rollingUpgrade started

# 3. Upgrade every DataNode
# 3.1 Shut down each selected DN for upgrade. IPC_PORT is set by dfs.datanode.ipc.address; the default is 9867.
hdfs dfsadmin -shutdownDatanode <DATANODE_HOST:IPC_PORT> upgrade
# 3.2 Check whether the DataNode has stopped serving. If node information is still returned, the node has not actually shut down yet.
hdfs dfsadmin -getDatanodeInfo <DATANODE_HOST:IPC_PORT>
# 3.3 Start the DN (running the upgraded software).
hdfs --daemon start datanode

# 4. When the upgrade is done, run the finalize command for each namespace
hdfs dfsadmin -rollingUpgrade finalize
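
Running the same `-rollingUpgrade` action for each namespace can be wrapped in a loop. The sketch below assumes each namespace is addressed by passing the generic `-fs hdfs://<nameservice>` option to `hdfs dfsadmin`, and that the nameservice IDs resolve from the client configuration. The IDs `ns1` and `ns2`, the `NAMESERVICES` variable, the function name, and the `HDFS_CMD` testing override are all hypothetical.

```shell
#!/bin/sh
# Sketch: run one rolling-upgrade action against every namespace in turn.
# NAMESERVICES: space-separated nameservice IDs (hypothetical default: ns1 ns2).
# HDFS_CMD is a hypothetical override for testing; it defaults to the real hdfs CLI.

rolling_upgrade_all_namespaces() {
  action="$1"   # one of: prepare, query, finalize
  case "$action" in
    prepare|query|finalize) ;;
    *) echo "unsupported action: $action" >&2; return 2 ;;
  esac

  for ns in ${NAMESERVICES:-ns1 ns2}; do
    echo "== $ns: rollingUpgrade $action"
    # Stop at the first namespace that fails so it can be inspected.
    "${HDFS_CMD:-hdfs}" dfsadmin -fs "hdfs://$ns" -rollingUpgrade "$action" || return 1
  done
}
```

With this helper, steps 1 and 4 above would become `rolling_upgrade_all_namespaces prepare` and `rolling_upgrade_all_namespaces finalize`.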

1.3 Downtime upgrade

1.3.1 Non-HA cluster

During the upgrade, a short service outage is unavoidable, because the NameNode has to be restarted and no standby node is available in the meantime. The overall process is similar to the four steps for the non-federated HA cluster, but step 2 needs to be modified slightly:

# Step 1: Prepare the rolling upgrade

# Step 2: Upgrade the NN and SNN
# 1. Stop the NN
hdfs --daemon stop namenode
# 2. Upgrade the NN and start it
hdfs --daemon start namenode -rollingUpgrade started
# 3. Stop the SNN
hdfs --daemon stop secondarynamenode
# 4. Upgrade the SNN and start it normally (-rollingUpgrade is a NameNode startup option)
hdfs --daemon start secondarynamenode

# Step 3: Upgrade the DNs

# Step 4: Finalize the rolling upgrade
hdfs dfsadmin -rollingUpgrade finalize

2. HDFS cluster downgrade and rollback

2.1 The difference between downgrade and rollback

  • Common ground:

Both return the software to the version that was running before the upgrade;

Neither is allowed once the upgrade has been finalized.

  • Differences:

A downgrade can be done in a rolling fashion, whereas a rollback requires stopping the service for a period of time;

A downgrade restores only the software version to the pre-upgrade one and keeps the user's current data state;

A rollback also restores user data to its pre-upgrade state, so the current data state is not preserved.

Friendly reminder: be cautious about upgrading, and even more cautious about downgrading and rolling back.

In a production environment, thorough research is required before a cluster upgrade to evaluate how compatible the new version is with existing services. Rehearse the complete upgrade process in a test environment, and back up the cluster state before upgrading to guard against an unexpected outage. Do not count on rollback or downgrade to rescue the cluster when an upgrade fails.

2.2 HA cluster downgrade

If the upgraded version is not wanted or, in some unlikely cases, the upgrade fails (for example, because of a bug in the newer version), administrators can either downgrade HDFS to the pre-upgrade version, or roll HDFS back to both the pre-upgrade version and the pre-upgrade state.

Note that a downgrade can be done in a rolling fashion, but a rollback cannot. A rollback requires cluster downtime.

Note also that downgrade and rollback are possible only after a rolling upgrade has been started and before it has been terminated. An upgrade can be terminated by finalizing it, downgrading, or rolling back. Therefore, it may not be possible to roll back after a finalize or a downgrade, or to downgrade after a finalize.

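2.2.1 Downgrade DataNodes

# 1. Choose a subset of the DataNodes (for example, by rack).
# Shut down each selected DN. IPC_PORT is set by dfs.datanode.ipc.address; the default is 9867.
hdfs dfsadmin -shutdownDatanode <DATANODE_HOST:IPC_PORT> upgrade

# Check whether the node has stopped completely
hdfs dfsadmin -getDatanodeInfo <DATANODE_HOST:IPC_PORT>

# Restore the pre-upgrade software on the node, then start the DN
hdfs --daemon start datanode

# Repeat the steps above on the other DataNodes in the selected subset, then on the remaining subsets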

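2.2.2 Downgrade the Active and Standby NameNodes

# Stop and downgrade the Standby NameNode.
# Start the Standby NameNode normally.
# Trigger a failover so that the Active and Standby roles are swapped.
# Stop and downgrade the former Active NameNode (which is now the Standby).
# Start it normally as the Standby node.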
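
The steps above are listed without commands. A minimal sketch of how they might map to commands follows, reusing the `hdfs --daemon` invocations used earlier in this article. It is an illustration under stated assumptions: `install_previous_release` stands in for whatever site-specific step restores the old software and is hypothetical, as are the function names and the `DRY_RUN` switch, which defaults to only printing the commands.

```shell
#!/bin/sh
# Sketch: downgrade one NameNode at a time, Standby first.
# With DRY_RUN=1 (the default in this sketch) commands are printed, not executed.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Run on the NameNode host being downgraded; it must currently be the Standby.
downgrade_local_namenode() {
  run hdfs --daemon stop namenode
  # Hypothetical site-specific step that puts the pre-upgrade release back.
  run install_previous_release
  # Normal start: no -rollingUpgrade option is passed when downgrading.
  run hdfs --daemon start namenode
}

# Run between the two NameNode downgrades to swap the Active and Standby roles.
# $1 = currently Active NameNode ID, $2 = NameNode ID to make Active.
swap_active() {
  run hdfs haadmin -failover "$1" "$2"
}
```

The intended order would be `downgrade_local_namenode` on the Standby host, then `swap_active`, then `downgrade_local_namenode` on the former Active host.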

2.2.3 Finalize the downgrade

# Finalize the downgrade
hdfs dfsadmin -rollingUpgrade finalize

2.2.4 HA cluster downgrade precautions

Downgrading and upgrading have one thing in common in HA mode: NameNode operations start with the Standby node. Once the Standby has been upgraded or downgraded, a failover is performed so that the other node can go through the same operation. Throughout the process, there is always one Active node serving clients.

The order in which NameNodes and DataNodes are handled during a downgrade is the reverse of the order during an upgrade, and there is a reason behind this seemingly simple reversal. A new version is generally compatible with the old version's protocols and APIs, but not the other way round. If the NNs were downgraded first, the DNs would be running the new version while the NNs run the old one, and many protocols used by the new DNs may not be supported by the old NNs. Therefore, the DNs must be downgraded first, and the NNs afterwards.

The downgrade of federated clusters and non-HA clusters mirrors the corresponding upgrade procedure; just substitute the corresponding commands.

2.3 Cluster rollback

Notes on rollback: a rollback cannot be done in a rolling fashion. While it is in progress, the cluster must stop serving clients.

A rollback returns not only the software to the pre-upgrade version, but also the user data to its pre-upgrade state.

Rollback steps:

# 1. Stop all NameNodes and DataNodes
# 2. Restore the pre-upgrade software version on all nodes
# 3. On NN1, start the NameNode with the -rollingUpgrade rollback option, so that NN1 becomes the Active node
# 4. On NN2, run the -bootstrapStandby command and then start NN2 normally as the Standby node
# 5. Start all DataNodes with the -rollback option
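
The rollback steps are given only as descriptions. The sketch below shows one possible mapping to commands, following the `hdfs --daemon` style used throughout this article. It is an assumption-laden illustration: the step names, the `rollback_step` function, and the `DRY_RUN` switch are hypothetical, and step 2 (restoring the old software) is site-specific and left out. Each step has to be run on the host it refers to.

```shell
#!/bin/sh
# Sketch: per-host commands for a rollback, selected by step name.
# With DRY_RUN=1 (the default in this sketch) commands are printed, not executed.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

rollback_step() {
  case "$1" in
    stop-namenode) run hdfs --daemon stop namenode ;;
    stop-datanode) run hdfs --daemon stop datanode ;;
    # On NN1: start with the rollback option so it comes up as Active.
    nn1) run hdfs --daemon start namenode -rollingUpgrade rollback ;;
    # On NN2: re-sync from NN1, then start normally as Standby.
    nn2) run hdfs namenode -bootstrapStandby && run hdfs --daemon start namenode ;;
    # On every DataNode: start with the rollback option.
    datanode) run hdfs --daemon start datanode -rollback ;;
    *) echo "unknown step: $1" >&2; return 2 ;;
  esac
}
```

In dry-run mode, `rollback_step nn2` prints the two commands for NN2 so they can be reviewed before anything is executed with `DRY_RUN=0`.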

Origin blog.csdn.net/weixin_46560589/article/details/132674027