How to upgrade a Hadoop cluster?

Foreword

This article belongs to the column "Big Data Technology System", which is original work by the author. Please credit the source when citing it, and point out any deficiencies or mistakes in the comments. Thank you!

For the column's table of contents and references, please refer to Big Data Technology System.


Main text


Upgrading a Hadoop cluster requires careful planning, especially for an HDFS upgrade. If the filesystem layout version changes, the upgrade automatically migrates the filesystem data and metadata to a format compatible with the new version. Like any process involving data migration, an upgrade carries a risk of data loss, so make sure that both data and metadata are backed up.

The planning process should ideally include a dry run on a small test cluster to assess whether any (possible) data loss is acceptable. A dry run also makes users more familiar with the upgrade process and with how to configure the cluster and toolset, removing technical barriers before upgrading the production cluster. In addition, a test cluster is useful for testing the client upgrade process.

If the filesystem layout has not changed, upgrading the cluster is very easy: install the new version of Hadoop on the cluster (and on the clients at the same time), shut down the old daemons, update the configuration files, start the new daemons, and have the clients use the new libraries. The whole process is reversible; in other words, it is just as easy to revert to the old version.
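As a rough sketch only, the whole sequence might look like the following; the tarball name and installation paths are assumptions, script names and locations vary between Hadoop releases, and OLD_HADOOP_HOME and NEW_HADOOP_HOME are the environment variables defined later in this article:

% tar -xzf hadoop-x.y.z.tar.gz -C /opt                   # unpack the new release (assumed tarball and path)
% $OLD_HADOOP_HOME/bin/stop-all.sh                       # shut down the old daemons
% cp $OLD_HADOOP_HOME/conf/*.xml $NEW_HADOOP_HOME/conf/  # carry the configuration over, then review it
% $NEW_HADOOP_HOME/bin/start-all.sh                      # start the new daemons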

After a successful version upgrade, there are two more cleanup steps that need to be performed.

  1. Remove old installation and configuration files from the cluster.
  2. Fix "deprecated" warning messages in code and configuration files.

Upgrade capabilities are a hallmark of Hadoop cluster management tools such as Cloudera Manager and Apache Ambari. They simplify the upgrade process and make rolling upgrades easy. Nodes are upgraded in batches (or, for master nodes, one at a time) so that clients experience no interruption of service.

If the method described above is used to upgrade HDFS but the old and new filesystem layouts differ, the namenode will not start normally and will produce a message like the following in its log file:

File system image contains an old layout version -16.
An upgrade to version -18 is required.
Please restart NameNode with -upgrade option.

The most reliable way to determine whether a filesystem upgrade is necessary is to experiment on a test cluster.

Upgrading HDFS keeps a copy of the metadata and data from the previous version, but this does not mean twice the storage overhead, because the datanodes use hard links so that two references (the current version and the previous version) point to the same block data, allowing you to easily roll back to the previous version if needed. It should be emphasized that after the system rolls back to the old version, any changes made since the upgrade are undone.
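For example, on a datanode you can check the link count of a block file shared between current/ and previous/; the storage directory below is an assumption, take the real value from dfs.datanode.data.dir:

% BLOCK=$(find /data/dfs/dn/previous -name 'blk_*' ! -name '*.meta' | head -1)
% stat -c %h "$BLOCK"        # a link count of 2 means current/ and previous/ share the same block file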

Users can keep only the previous version of the filesystem; it is not possible to roll back through multiple versions. Before performing another upgrade of HDFS data and metadata, the previous version must be removed, a process known as "finalizing the upgrade". Once an upgrade has been finalized, there is no way to roll back to the previous version.

In general, the upgrade process can ignore intermediate versions. However, in some cases it is still necessary to upgrade to an intermediate version first, and this situation will be clearly indicated in the release notes document.

The filesystem can only be upgraded if it is healthy, so it is necessary to run the fsck tool before upgrading. It is also a good idea to keep the fsck output report, which lists all file and block information; after the upgrade, run fsck again to produce a new report and compare the contents of the two reports.
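For example (the report file names are arbitrary):

% $OLD_HADOOP_HOME/bin/hdfs fsck / -files -blocks > fsck-before.log
(perform the upgrade)
% $NEW_HADOOP_HOME/bin/hdfs fsck / -files -blocks > fsck-after.log
% diff fsck-before.log fsck-after.log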

Before the upgrade, it is best to clear the temporary files, including the HDFS MapReduce system directory and local temporary files.
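The exact locations depend on your configuration (for example hadoop.tmp.dir and the MapReduce system directory); the paths below are assumptions, purely for illustration:

% $OLD_HADOOP_HOME/bin/hdfs dfs -rm -r /tmp/hadoop/mapred/system    # assumed HDFS MapReduce system directory
% rm -rf /tmp/hadoop-$USER/*                                        # assumed local temporary files, on every node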

To sum up, if upgrading the cluster will change the filesystem layout, use the following procedure to upgrade.


  1. Before performing an upgrade task, make sure that the previous upgrade has been finalized.
  2. Shut down YARN and MapReduce daemons.
  3. Shut down HDFS and back up the namenode directories (see the backup sketch after this list).
  4. Install a new version of Hadoop on the cluster and clients.
  5. Start HDFS with the -upgrade option.
  6. Wait until the upgrade is complete.
  7. Verify that HDFS is functioning properly.
  8. Start the YARN and MapReduce daemons.
  9. Rollback or finalize the upgrade task (optional).
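For step 3, a minimal sketch of backing up the namenode metadata; the source directory and the backup destination are assumptions, take the real value from dfs.namenode.name.dir (which may list several directories):

% $OLD_HADOOP_HOME/bin/stop-dfs.sh                                        # HDFS must be down before copying the metadata
% tar -czf /backup/namenode-dir-$(date +%Y%m%d).tar.gz -C /data/dfs/nn .  # assumed dfs.namenode.name.dir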

When running the upgrade, it's a good idea to remove the Hadoop scripts from the PATH environment variable so that you don't accidentally run a script from the wrong version. It is useful to define an environment variable for each installation directory; in the instructions that follow, we use two environment variables, OLD_HADOOP_HOME and NEW_HADOOP_HOME.
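For example (the installation paths are assumptions):

% export OLD_HADOOP_HOME=/opt/hadoop-a.b.c    # assumed path of the existing installation
% export NEW_HADOOP_HOME=/opt/hadoop-x.y.z    # assumed path of the newly installed version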

Start the upgrade. To perform it, run the following command:

% $NEW_HADOOP_HOME/bin/start-dfs.sh -upgrade 

The result of this command is that the namenode upgrades its metadata, placing the previous version in a new directory named previous under dfs.namenode.name.dir. Similarly, the datanodes upgrade their storage directories, keeping a copy of the previous version in a directory named previous.
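You can confirm this by listing the storage directories; the paths below are assumptions taken from dfs.namenode.name.dir and dfs.datanode.data.dir:

% ls /data/dfs/nn    # should now contain both current/ and previous/
% ls /data/dfs/dn    # previous/ holds the hard-linked pre-upgrade copy of the blocks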

Wait until the upgrade is complete. The upgrade is not instantaneous; you can use dfsadmin to check its progress. Upgrade events also appear in the daemons' log files:

% $NEW_HADOOP_HOME/bin/hdfs dfsadmin -upgradeProgress status 
Upgrade for version -18 has been completed.
Upgrade is not finalized.

This shows that the upgrade is complete. In this phase, users can check the state of the filesystem, for example by using fsck to verify files and blocks (a basic file operation). While checking the state of the system, it is best to put HDFS into safe mode (all data read-only) to prevent other users from modifying the data. If the new version does not work correctly, you can roll back to the previous version, provided the upgrade has not yet been finalized.
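A possible sequence for the check described above (standard HDFS commands; the fsck options are one reasonable choice):

% $NEW_HADOOP_HOME/bin/hdfs dfsadmin -safemode enter    # make the namespace read-only while checking
% $NEW_HADOOP_HOME/bin/hdfs fsck / -files -blocks       # verify files and blocks
% $NEW_HADOOP_HOME/bin/hdfs dfsadmin -safemode leave    # resume normal operation once satisfied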

To roll back, first shut down the new daemons:

% $NEW_HADOOP_HOME/bin/stop-dfs.sh

Second, start the older version of HDFS with the -rollback option:

% $OLD_HADOOP_HOME/bin/start-dfs.sh -rollback 

This command causes the namenode and datanodes to replace their current storage directories with the pre-upgrade copies, returning the filesystem to its previous state.

Finalize the upgrade (optional). If you are satisfied with the new version of HDFS, you can finalize the upgrade to remove the pre-upgrade storage directories.

This step must be performed before any future upgrade:

% $NEW_HADOOP_HOME/bin/hdfs dfsadmin -finalizeUpgrade 
% $NEW_HADOOP_HOME/bin/hdfs dfsadmin -upgradeProgress status 
There are no upgrades in progress.

Now HDFS has been fully upgraded to the new version.


Mind map



Summary

This article describes the steps and precautions for upgrading Hadoop clusters, especially HDFS upgrades.

  1. Upgrading Hadoop clusters requires planning, especially when upgrading HDFS. The upgrade operation may introduce a risk of data loss, so be sure to back up your data and metadata before starting the upgrade.
  2. It is recommended to test the upgrade on a small test cluster to assess possible data loss risks. The testing process can help users become familiar with the upgrade process and configuration, remove technical obstacles, and test the upgrade process of clients.
  3. Upgrading a cluster is relatively easy if the filesystem layout has not changed. Steps include installing a new version of Hadoop, shutting down old daemons, upgrading configuration files, starting new daemons, and having clients use the new libraries.
  4. After a successful upgrade, cleanup steps need to be performed, including removing old installation and configuration files, and fixing "deprecated" warning messages.
  5. Hadoop cluster management tools (Cloudera Manager and Apache Ambari) simplify the upgrade process, making rolling upgrades possible. They support batch upgrades of nodes to reduce service interruption to clients.
  6. If the HDFS filesystem layout changes, additional upgrade steps are required. Before upgrading, you need to ensure that the previous upgrade has been finalized, shut down the related daemons, back up the namenode directory, install the new version of Hadoop, start HDFS with the -upgrade option, and so on.
  7. Before upgrading, it is recommended to use the fsck tool to check the status of the file system and to keep and compare the output reports. Also, it is good practice to empty temporary files.
  8. Finalizing the upgrade means that, once you are satisfied with the new version of HDFS, you remove the pre-upgrade storage directories. This step is required before performing another upgrade.

To sum up, this article provides detailed upgrade steps and precautions to help users upgrade Hadoop clusters smoothly and protect data security.
