Some things to consider before a Hadoop upgrade

Preface


Hadoop has now moved into the 3.x line, and the community is still iterating on it quickly. Compared with an old version, a new version brings many benefits: new features, a large number of bug fixes, and some major performance improvements. Keeping an outdated system running indefinitely means we will sooner or later hit bugs that the community has already fixed, which consumes a considerable amount of a system administrator's daily working time, especially when maintaining a large-scale distributed system. Therefore, for a complex system, the most cost-effective way to optimize it is often to upgrade it to a newer stable version. In this article, the author talks about some things that need to be considered and done before a Hadoop upgrade.

Selecting the target version for the Hadoop upgrade


First of all, before we upgrade Hadoop, one question has to be answered: which Hadoop version do we want to upgrade to? The Hadoop community puts out several releases every year, so which one should we choose as the upgrade target?

There are generally the following considerations:

  • 1) The target version has an important feature we need, such as HDFS Standby NameNode reads or HDFS erasure coding (EC).
  • 2) The target version is relatively new but has already been out for some time, so it can be confirmed to be both reasonably up to date and stable in operation.

Choice 1) is a very targeted one, but the author personally prefers the second approach. Picking a relatively recent but proven release as the upgrade target is the safer option. After all, nobody wants to upgrade to the very latest community version and then, unluckily, run into a problem that no one else has encountered; at that point the cost of fixing the problem is very high.

Preparatory work required for the Hadoop upgrade process


Once we have settled on the target version, we can start the pre-upgrade preparation. The more thorough this preparation is, the lower the probability of problems during the upgrade process.

Some readers may think this preparation is just a matter of replacing the old version with the new one in a test environment, restarting the service, and checking that the system behaves normally. For a simple service those few steps may well be enough, but for a complex distributed system such as Hadoop, that test process is clearly not sufficient.

Here the author briefly lists the things to consider and do before a Hadoop upgrade (most of them also apply to other complex systems):

The first point is compatibility testing between the old and new versions. Compatibility is undoubtedly the first thing we have to test. For a complex system, the upgrade must proceed gradually, in stages: for example, first upgrade the master service, then the slave services, and finally, if necessary, replace the client package on the client side. During this process, old and new versions of the master service, slave services, and clients will be deployed side by side, and there are many possible combinations, so our tests need to simulate every mixed-version scenario that can occur. If we do run into a compatibility problem, we can then solve it ahead of time, for example by adding extra configuration or cleaning up state data. Compatibility problems mostly arise in the non-stop (rolling) upgrade mode; if we stop the service to upgrade, the probability of such problems is much smaller. A minimal mixed-version smoke test is sketched below.
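
As a concrete starting point, a small read/write smoke test can be compiled against the old client libraries and run against the new-version cluster (and vice versa) to catch obvious RPC or API incompatibilities early. This is only a sketch using the standard `org.apache.hadoop.fs.FileSystem` API; the cluster URI and test path are placeholders for your environment.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Mixed-version smoke test: run it with the OLD client jars against the NEW
// cluster, and with the NEW client jars against the OLD cluster.
public class UpgradeSmokeTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder nameservice URI; replace with your cluster's address.
        FileSystem fs = FileSystem.get(new URI("hdfs://mycluster"), conf);
        Path testPath = new Path("/tmp/upgrade-smoke-test");

        // Write a small file, exercising the create/write RPC path.
        try (FSDataOutputStream out = fs.create(testPath, true)) {
            out.writeUTF("compat-check");
        }
        // Read it back and verify the round trip.
        try (FSDataInputStream in = fs.open(testPath)) {
            System.out.println("read back: " + in.readUTF());
        }
        fs.delete(testPath, false);
        fs.close();
    }
}
```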

Let me say a bit more about what compatibility means here, since it differs slightly by system type. For stateless distributed systems, such as distributed computing systems, compatibility mainly shows up in aspects such as RPC communication and API calls. For distributed storage systems that hold state data, a major aspect of compatibility is data compatibility: whether the new version can load the state data of the old version, and, for the upgrade-rollback case, whether the old version can load state data generated by the new version. One simple sanity check for HDFS is sketched below.
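
For HDFS in particular, a quick check is to compare the metadata layout version recorded in the NameNode storage directory before and after a test upgrade, since a layout version change is a strong hint that the old binaries cannot simply load the new on-disk state on rollback. The sketch below only parses the `VERSION` file (a Java properties file) in a storage directory; the path is a placeholder for your `dfs.namenode.name.dir`.

```java
import java.io.FileInputStream;
import java.util.Properties;

// Prints the layoutVersion (and namespaceID) from an HDFS NameNode storage
// directory's VERSION file, so values can be compared across versions.
public class LayoutVersionCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder path; point at <dfs.namenode.name.dir>/current/VERSION.
        String versionFile = "/data/hadoop/name/current/VERSION";
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(versionFile)) {
            props.load(in);
        }
        System.out.println("layoutVersion = " + props.getProperty("layoutVersion"));
        System.out.println("namespaceID   = " + props.getProperty("namespaceID"));
    }
}
```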

The second point is checking and synchronizing configuration and code logic between the old and new versions. Sometimes we make internal, version-specific optimizations to the old version that do not exist in the official release from the community. In that case, after getting the new version, we need to diff the code and backport those internal optimizations into it. In addition, the new version usually brings configuration updates, such as settings for new features or renamed configuration keys (generally the new configuration name remains compatible with the old name). A small check for renamed keys is sketched below.
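
On the Hadoop side, the `Configuration` class keeps a table of deprecated (renamed) keys, so a small sketch like the following can flag old configuration names that the new version has renamed. The keys listed here are only examples; in practice you would feed in the keys from your own `*-site.xml` files.

```java
import org.apache.hadoop.conf.Configuration;

// Flags configuration keys from the old site files that the new Hadoop
// version marks as deprecated (i.e. renamed). The keys are illustrative.
public class DeprecatedKeyCheck {
    public static void main(String[] args) {
        String[] oldKeys = {
            "fs.default.name",   // long since renamed to fs.defaultFS
            "dfs.block.size",    // renamed to dfs.blocksize
            "dfs.replication"    // still current; should not be flagged
        };
        for (String key : oldKeys) {
            if (Configuration.isDeprecated(key)) {
                System.out.println("DEPRECATED: " + key);
            } else {
                System.out.println("ok        : " + key);
            }
        }
    }
}
```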

The third point is the gray-release (canary) testing process. In addition to testing the upgrade of the new version in a test environment, it is best to carry out some gray-scale testing in the production environment before the formal upgrade. First, because gray-scale testing runs in a real production environment, it can expose hidden problems that the test environment covers up. Second, it also serves as a rehearsal for the formal upgrade operation.

The fourth point is user communication and coordination before the upgrade. Strictly speaking, user communication is outside the scope of the system upgrade itself, but it is still very important: minimizing the impact of the upgrade on users is our top concern. Through this communication, users should learn clearly what the background of the upgrade is, how it affects them, and what they need to do to cooperate with it. It also helps to hand users a document that writes down all of the points mentioned here; in the best case, a new user can understand everything they need to do just by reading that document.

The fifth point is the phased formal upgrade process. Once all the pre-upgrade tests have been completed, we come to the final upgrade operation. How should it be carried out? The author's personal suggestion is to split the entire upgrade into fine-grained steps, upgrading a little at a time and observing for one to two days after each step. The advantages of this approach are: 1) each step involves fewer change operations, making operational errors less likely, and 2) if a problem appears during the upgrade, rolling back is easier. So how do we upgrade in stages? As a simple example, for a system to be upgraded, we can first upgrade the master service and observe it for several days, then upgrade one rack of slave services and observe for several more days; if no problems appear in between, we continue upgrading the remaining services. A sketch of such a staged rollout loop follows.
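
To make the staged idea concrete, here is a minimal orchestration sketch. The `upgradeRack` and `clusterHealthy` helpers are hypothetical placeholders for whatever deployment tooling and health checks (for example, NameNode missing-block or under-replicated-block metrics) your environment actually provides, and the observation window is illustrative.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical staged-rollout driver: upgrade one rack of slave nodes at a
// time, then hold and watch cluster health before touching the next rack.
public class StagedUpgradeDriver {
    public static void main(String[] args) throws InterruptedException {
        List<String> racks = List.of("rack-01", "rack-02", "rack-03");
        for (String rack : racks) {
            upgradeRack(rack);        // placeholder: invoke your deploy tool
            TimeUnit.HOURS.sleep(24); // observation window (illustrative)
            if (!clusterHealthy()) {
                System.err.println("Health check failed after " + rack
                        + "; stopping the rollout to investigate or roll back.");
                return;               // stop; only the last step needs rollback
            }
            System.out.println(rack + " upgraded and stable, continuing.");
        }
    }

    // Placeholders: in practice these would wrap your deployment system and
    // your monitoring (e.g. fsck results or NameNode JMX metrics).
    static void upgradeRack(String rack) { /* deployment tooling goes here */ }
    static boolean clusterHealthy() { return true; }
}
```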

In addition, there is the assignment of personnel roles for the upgrade operation. The upgrade is a large system change involving many services, so more people need to take part, coordinating and cooperating with one another to complete the entire upgrade process.

That is the whole content of this article: mainly the points the author thought of while upgrading the Hadoop version at work. Most of this content also applies to other complex distributed systems like Hadoop.

Origin: blog.csdn.net/Androidlushangderen/article/details/114179797