Apache Hadoop 3.x latest status and upgrade guide

Past Memory Big Data
This article is based on a talk given at the Strata Data Conference 2019, held in New York from September 23 to 26. The speakers were Wangda Tan and Wei-Chiu Chuang of Cloudera. Conference page: https://conferences.oreilly.com/strata/strata-ny-2019/public/schedule/detail/77506
To get the download address for the slides, follow the Past Memory Big Data WeChat official account and reply with the keyword hadoop_3.
First, let's look at the latest state of the Hadoop community.
Throughout 2019, many people outside the community claimed that Hadoop is dead. Is that true?
Let the data speak. Among the ten Apache Software Foundation projects with the most resolved issues as of August 2019, Hadoop ranks first.
Activity across the sub-projects within Hadoop shows that the project is still very active.
The number of issues resolved by the Hadoop project over the past 13 years shows that, after all this time, the volume has not decreased.
The trend in Hadoop contributors tells the same story: the number of contributors has not decreased in the past 5 years.
So the Hadoop project is not dead; on the contrary, it is still very active. For an earlier rebuttal of the "Hadoop is dead" claim, see the Past Memory Big Data article "Is Hadoop exhausted?".

With that background, let's take a brief look at Hadoop 3.x.
Hadoop can now support long-running big data services well.
Hadoop 3.x now focuses more on scalability, containerization, cost, cloud-native deployment, and machine learning.
Let's start with the YARN module.
YARN is now paying more attention to containerization. Apache Hadoop 3.1.0 provides production-ready Docker container support, and Hadoop 3.3.0 adds an interactive Docker shell.
YARN is also adding support for cloud-native environments, mainly auto-scaling, intelligent scheduling, and scaling nodes down.
Hadoop 3.0.0 also enhances the scheduling capabilities of YARN.
Other enhancements include node attributes in version 3.2.0, which let us schedule containers based on node attributes, and placement constraints and dynamic queue creation in version 3.1.0.
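As a rough illustration of attribute-based scheduling, the sketch below filters candidate nodes by the attributes a container requires. It is a simplified model and not YARN's API: the `Node` class, the `matching_nodes` function, and the attribute names `os` and `java_version` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A cluster node with free-form key/value attributes (hypothetical model)."""
    name: str
    attributes: dict = field(default_factory=dict)


def matching_nodes(nodes, required):
    """Return the nodes whose attributes contain every required key/value pair."""
    return [
        node for node in nodes
        if all(node.attributes.get(key) == value for key, value in required.items())
    ]


nodes = [
    Node("node-1", {"os": "centos7", "java_version": "8"}),
    Node("node-2", {"os": "centos7", "java_version": "11"}),
    Node("node-3", {"os": "ubuntu18", "java_version": "11"}),
]

# A container that needs JDK 11 on CentOS 7 can only be placed on node-2.
candidates = matching_nodes(nodes, {"os": "centos7", "java_version": "11"})
print([node.name for node in candidates])
```

A real scheduler would also weigh free resources and queue limits; this sketch only shows the attribute-matching step.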
The second module is Submarine.
Many people may not know Submarine. It is a brand-new Hadoop module, started in 2018, that provides a machine learning solution. It makes full use of Hadoop's GPU and Docker features so that data scientists can run deep learning applications on Hadoop. For more about Submarine, see the Past Memory Big Data article "{Submarine} Running Deep Learning Framework in Apache Hadoop".
Version 0.3.0 of Submarine brings many new features.
Companies that use Hadoop Submarine in production include NetEase, LinkedIn, and Ke.com.
Now let's look at Hadoop's third major module: storage.
Reads from the Standby NameNode (NN) can now be consistent, so we can improve overall system performance by serving reads from the Standby NN. If a client has already seen a certain transaction ID and the Standby NN has synchronized the data up to that transaction ID, the Standby NN allows the client to read that data. This feature is used in production at Uber and LinkedIn.
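The transaction ID rule can be sketched as a toy model. This is not HDFS code: the `StandbyNameNode` class and its methods are hypothetical, and the only thing it shows is the check that the standby has applied edits at least up to the transaction ID the client has already seen.

```python
class StandbyNameNode:
    """Toy standby that replays edits in order and tracks the last applied txid."""

    def __init__(self):
        self.applied_txid = 0
        self.namespace = {}

    def apply_edit(self, txid, path, value):
        """Replay one edit received from the active NameNode's edit log."""
        self.namespace[path] = value
        self.applied_txid = txid

    def read(self, path, client_seen_txid):
        """Serve the read only if this standby has caught up with the client."""
        if self.applied_txid < client_seen_txid:
            raise RuntimeError("standby is behind the client's transaction ID")
        return self.namespace.get(path)


standby = StandbyNameNode()
standby.apply_edit(1, "/data/file", "v1")

# The client has seen txid 2 (for example, its own write), but the standby
# has only applied txid 1, so a consistent read is not possible yet.
try:
    standby.read("/data/file", client_seen_txid=2)
except RuntimeError as err:
    print("read rejected:", err)

# Once the standby replays txid 2, the same read succeeds.
standby.apply_edit(2, "/data/file", "v2")
print(standby.read("/data/file", client_seen_txid=2))
```

In this sketch the standby rejects a stale read outright; a real system could instead wait for the standby to catch up or fall back to the active node.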
Router-based Federation has also been greatly improved, for example with security support, better scalability, and handling of slow sub-clusters. For more on Router-based Federation, see the Past Memory Big Data article on the past and present of HDFS Federation in Apache Hadoop.
There are also some other important HDFS features.
Next are the updates to the cloud connectors.
ABFS stands for Azure Blob FileSystem, the connector for Azure Blob storage, which is Azure's object storage solution for the cloud. Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on Azure Blob storage. Hadoop 3.2.0 supports connecting to Azure Data Lake Storage Gen2, and version 3.2.1 is more stable.
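For orientation, paths on Azure Data Lake Storage Gen2 are addressed through the ABFS connector with URIs of the form `abfs://<container>@<account>.dfs.core.windows.net/<path>`, or `abfss://` when TLS is used. The helper below is a minimal sketch that only builds such a URI; the function name `abfs_uri` and the container and account names are made up for illustration.

```python
def abfs_uri(container, account, path="", secure=True):
    """Build an ABFS URI for a path in an ADLS Gen2 storage account."""
    scheme = "abfss" if secure else "abfs"
    return f"{scheme}://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Hypothetical container and account names.
print(abfs_uri("data", "myaccount", "/warehouse/events"))
# prints abfss://data@myaccount.dfs.core.windows.net/warehouse/events
```

A URI like this can then be used wherever Hadoop accepts a filesystem path, assuming the connector and the account credentials are configured on the cluster.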
The fourth module of Hadoop is Hadoop Common.
Work in this module includes TLS support for RPC, JDK 11 support, and so on.
Finally, let's look at Hadoop's Ozone module.
Ozone is a scalable distributed object storage system designed specifically for Hadoop. It is mainly contributed by Hortonworks, Cloudera, and Tencent, and it has made significant progress in the past year.
Hadoop version release plan.

Submarine version release plan
Hadoop 3.0.x and Hadoop 2.7.x and earlier versions are no longer supported.
There are two ways to upgrade Hadoop: Express and Rolling. An Express upgrade stops the existing services and then starts them again on the new version. A Rolling upgrade upgrades the nodes in turn while the service keeps running, so users do not notice any interruption.
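The difference between the two approaches can be sketched with a small simulation. This is a toy model rather than an upgrade tool: the cluster is just a hypothetical mapping from node name to version, and each function reports the minimum number of nodes that were serving at any point during the upgrade.

```python
def express_upgrade(cluster, new_version):
    """Stop every node, switch versions, then start everything again."""
    serving = set()  # the whole service is stopped first
    min_serving = len(serving)
    upgraded = {node: new_version for node in cluster}
    serving = set(upgraded)  # restart all nodes on the new version
    return upgraded, min_serving


def rolling_upgrade(cluster, new_version):
    """Upgrade one node at a time while the others keep serving."""
    upgraded = dict(cluster)
    serving = set(cluster)
    min_serving = len(serving)
    for node in cluster:
        serving.discard(node)  # take a single node out of service
        min_serving = min(min_serving, len(serving))
        upgraded[node] = new_version
        serving.add(node)  # bring it back before touching the next node
    return upgraded, min_serving


# Hypothetical three-node cluster on a 2.x release.
cluster = {"node-1": "2.8.5", "node-2": "2.8.5", "node-3": "2.8.5"}

_, express_min = express_upgrade(cluster, "3.1.2")
_, rolling_min = rolling_upgrade(cluster, "3.1.2")
print("express: minimum serving nodes =", express_min)
print("rolling: minimum serving nodes =", rolling_min)
```

The Express path drops to zero serving nodes, which is the downtime users see, while the Rolling path never goes below the cluster size minus one.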
Rolling upgrades across major versions still face some challenges and problems. The community has done a lot of work toward upgrading clusters without downtime, and this work will soon be released with the latest Hadoop version. For now, an Express upgrade is recommended for moving from Hadoop 2 to 3. However, Didi recently shared how it upgraded its Hadoop cluster from 2 to 3 without downtime; see the Past Memory Big Data article on Didi's non-stop upgrade from Hadoop 2.7 to 3.2.
In terms of compatibility, Hadoop 3 remains compatible with Hadoop 2 clients, and DistCp and WebHDFS compatibility is also maintained.

APIs and tools marked as deprecated have been removed, and the shell scripts have been extensively rewritten.

We recommend upgrading from Apache Hadoop 2.8.x to Apache Hadoop 3.1.x, because most companies have deployed Apache Hadoop 2.8.x in production. If your Hadoop version is 2.6.x or 2.7.x, do more verification before upgrading. That said, we have also seen direct upgrades from Hadoop 2.7.x to 3.x.
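The advice above can be written down as a small lookup, for example as a pre-flight check in an upgrade runbook. This is only a sketch of the recommendation as stated here; the function name `upgrade_advice` and the wording of the returned messages are made up, and versions the text does not mention are reported as not covered rather than guessed.

```python
def upgrade_advice(version):
    """Map a Hadoop version string to the upgrade advice given in the text."""
    major, minor = (int(part) for part in version.split(".")[:2])
    if (major, minor) == (2, 8):
        return "recommended: upgrade to 3.1.x"
    if (major, minor) in {(2, 6), (2, 7)}:
        return "do more verification before upgrading"
    return "not covered by the advice above"


for current in ("2.8.5", "2.7.3", "2.10.0"):
    print(current, "->", upgrade_advice(current))
```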
Other upgrade-related documents can be found at https://dataworkssummit.com/san-jose-2018/session/ease-of-migration-from-hadoop-2-to-hadoop-3-clusters/

Many companies have already upgraded to Hadoop 3.x.
To summarize the Hadoop 3.x upgrade: Hadoop 3 has many new features and optimizations (for details, see the Past Memory Big Data articles "Apache Hadoop 3.0.0 GA officially released and ready for production deployment" and "Apache Hadoop 3.1.0 officially released, with native support for GPU and FPGA"). Many companies use Hadoop 3 in production, and an Express upgrade is recommended. If you haven't upgraded yet, now is a good time.

Origin blog.51cto.com/15127589/2677107