Big Data: New Features in Hadoop 3.0

Overview

(Summarized from the official Hadoop website)

  1. Minimum required Java version raised from Java 7 to Java 8
  2. Support for erasure coding in HDFS
  3. YARN Timeline Service v.2
  4. Shell script rewrite
  5. Support for opportunistic containers and distributed scheduling
  6. MapReduce task-level native optimization
  7. Support for more than two NameNodes
  8. Default ports of multiple services changed
  9. Filesystem connectors for Microsoft Azure Data Lake and the Aliyun object storage system
  10. Intra-DataNode disk balancer
  11. Reworked daemon and task heap management
  12. S3Guard: consistency and metadata caching for the S3A filesystem client
  13. HDFS Router-based federation
  14. API-based configuration of Capacity Scheduler queues
  15. YARN resource types

Introduction

1. Minimum required Java version raised from Java 7 to Java 8

All Hadoop JARs are now compiled to target Java 8.

2. Support for erasure coding in HDFS (HDFS Erasure Coding)

Erasure coding (EC) is a data-protection and fault-tolerance technique that originated in the data-communications industry, where it was first used to recover data lost in transmission. By adding parity data to the original data, EC allows any portion of the data, within a certain range of error conditions, to be reconstructed from the remaining parts. In HDFS, EC both guards against data loss and addresses the problem of multiplied storage space.

When a file is created, it inherits the EC policy of its nearest ancestor directory, which determines how its blocks are stored. Compared with 3-way replication, the default EC policy can save about 50% of storage space while tolerating more storage failures.

EC is recommended for cold data. Cold data is typically voluminous, so cutting the number of replicas significantly reduces storage cost; and because cold data is stable and rarely needs to be recovered, using EC has little impact on the business.

The purpose of erasure coding
Replication is expensive: HDFS's default 3x replication scheme imposes 200% overhead in storage space and in other resources such as network bandwidth. Yet for cold datasets with relatively low I/O activity, the extra replicas are rarely accessed during normal operation while still consuming the same amount of resources as the first replica.

Therefore, a natural improvement is to use erasure coding (EC) in place of replication: it provides the same level of fault tolerance with much less storage. In a typical EC setup, the storage overhead is no more than 50%. The replication factor of an EC file is meaningless: it is always 1 and cannot be changed with the -setrep command.
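
To illustrate, a minimal sketch of managing EC policies with the hdfs ec subcommand (the /cold-data path is a placeholder; RS-6-3-1024k is one of the built-in policies):

    # list the erasure coding policies the cluster supports
    hdfs ec -listPolicies
    # enable a built-in policy, then apply it to a directory
    hdfs ec -enablePolicy -policy RS-6-3-1024k
    hdfs ec -setPolicy -path /cold-data -policy RS-6-3-1024k
    # confirm which policy the directory now uses
    hdfs ec -getPolicy -path /cold-data

Files written under /cold-data are then stored as EC blocks; files that already existed keep their original layout until they are rewritten.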

3. YARN Timeline Service v.2

Scalability
v.1 is limited to a single instance of the writer/reader and storage, and does not scale well beyond small clusters. v.2 uses a more scalable distributed writer architecture and a scalable backend store.

Timeline Service v.2 separates data collection (writes) from data serving (reads). It uses distributed collectors, essentially one collector per YARN application; the reader is a separate, dedicated instance that serves queries through a REST API.

YARN Timeline Service v.2 selects Apache HBase as the primary backing store, because HBase scales well to large sizes while maintaining good read and write response times.

Usability improvements
In many cases, users are interested in information at the level of a YARN application "flow", i.e., a logical group of applications; launching a set or series of YARN applications to complete one logical task is common. Timeline Service v.2 supports the notion of flows explicitly, including aggregated flow-level metrics.
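
A minimal yarn-site.xml sketch for switching the Timeline Service to v.2 (shown as property = value; an HBase cluster must already be prepared as the backing store):

    yarn.timeline-service.enabled = true
    yarn.timeline-service.version = 2.0f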

4. Shell script rewrite (Unix Shell Guide)

The Hadoop shell scripts have been rewritten to fix many long-standing bugs and add some new features. Although compatibility has been a goal throughout, some changes may break existing installations.

  • Parameter conflict detection added, avoiding duplicate and redundant parameter definitions
  • CLASSPATH, JAVA_LIBRARY_PATH, and LD_LIBRARY_PATH entries are de-duplicated, shrinking these environment variables
  • Hadoop shell scripts now support a --debug option, which reports basic information on how environment variables, java options, the classpath, and so on are constructed, to help with debugging
  • distch and jnipath subcommands added to the hadoop command
  • Operations that trigger ssh connections can now use pdsh (if installed); ${HADOOP_SSH_OPTS} is still applied
  • A new option, --buildpaths, tries to add developer build directories to the classpath, allowing tests to run from the source tree
  • Daemon management has moved from the *-daemon.sh scripts to a --daemon option on the bin commands; simply use --daemon start to start a daemon (see the sketch below)
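
For example, the old and new ways of starting and stopping the NameNode daemon (a sketch):

    # Hadoop 2.x style, now deprecated
    sbin/hadoop-daemon.sh start namenode
    # Hadoop 3.x style: the --daemon option on the bin command
    bin/hdfs --daemon start namenode
    bin/hdfs --daemon stop namenode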

5. Support for opportunistic containers (Opportunistic Containers) and distributed scheduling

The notion of ExecutionType is introduced: an application can now request containers with execution type Opportunistic. A container of this type can be dispatched to an NM for execution even when no schedulable resources are available at that moment; in that case, the container is queued at the NM, waiting for resources to become available. Opportunistic containers have lower priority than the default Guaranteed containers, so guaranteed containers can preempt opportunistic ones to make room when needed. This improves cluster utilization.

By default, opportunistic containers are allocated centrally by the RM; support has also been added for a distributed scheduler, implemented as an AMRMProtocol interceptor, to allocate opportunistic containers.
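
A minimal yarn-site.xml sketch for enabling centrally allocated opportunistic containers (shown as property = value; the queue length is illustrative):

    yarn.resourcemanager.opportunistic-container-allocation.enabled = true
    yarn.nodemanager.opportunistic-containers-max-queue-length = 20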

6. MapReduce task-level native optimization

MapReduce adds support for a native implementation of the map output collector.
The native library is built automatically with -Pnative.
Users can enable it by setting:

    mapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator

For shuffle-intensive jobs, this can improve performance by 30% or more.
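
For example, enabling the native collector for a single job on the command line (the wordcount job and paths are illustrative):

    hadoop jar hadoop-mapreduce-examples.jar wordcount \
      -Dmapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator \
      /input /output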

7. Support for more than two NameNodes

The initial implementation of HDFS NameNode high availability provided for a single active NameNode and a single standby NameNode. By replicating edits to a quorum of three JournalNodes, that architecture can tolerate the failure of any one node in the system. However, some deployments require a higher degree of fault tolerance. This new feature allows users to run multiple standby NameNodes.

For example, by configuring three NameNodes and five JournalNodes, the cluster can tolerate the failure of two nodes rather than just one.
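
A minimal hdfs-site.xml sketch for three NameNodes (shown as property = value; the nameservice ID mycluster and the hosts are placeholders):

    dfs.nameservices = mycluster
    dfs.ha.namenodes.mycluster = nn1,nn2,nn3
    dfs.namenode.rpc-address.mycluster.nn1 = nn1.example.com:9820
    dfs.namenode.rpc-address.mycluster.nn2 = nn2.example.com:9820
    dfs.namenode.rpc-address.mycluster.nn3 = nn3.example.com:9820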

Reference: HDFS high availability documentation

8. Default ports of multiple services have changed

Reference: HDFS should not default to ephemeral ports
The patch updates the HDFS default HTTP/RPC ports to non-ephemeral ports. The changes are listed below:
Namenode ports: 50470 --> 9871, 50070 --> 9870, 8020 --> 9820
Secondary NN ports: 50091 --> 9869, 50090 --> 9868
Datanode ports: 50020 --> 9867, 50010 --> 9866, 50475 --> 9865, 50075 --> 9864

In most cases the change simply replaces the leading 50 with 9:

    Service               Hadoop 2.x port   Hadoop 3.x port
    NameNode RPC          8020              9820
    NameNode HTTP UI      50070             9870
    NameNode HTTPS UI     50470             9871
    Secondary NN HTTPS    50091             9869
    Secondary NN HTTP UI  50090             9868
    DataNode IPC          50020             9867
    DataNode transfer     50010             9866
    DataNode HTTP UI      50075             9864
    DataNode HTTPS UI     50475             9865

9. Filesystem connectors for Microsoft Azure Data Lake and the Aliyun object storage system
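
Both stores are exposed as ordinary Hadoop filesystems; a usage sketch (the account, bucket, and the credential properties in core-site.xml are placeholders):

    # Azure Data Lake via the adl:// scheme
    hadoop fs -ls adl://myaccount.azuredatalakestore.net/data
    # Aliyun OSS via the oss:// scheme (requires fs.oss.endpoint and access keys in core-site.xml)
    hadoop fs -ls oss://my-bucket/data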

10. Intra-DataNode disk balancer

A single DataNode can manage multiple disks. During normal write operation, the disks fill up evenly. However, adding or replacing disks can cause significant skew inside a DataNode. The existing HDFS balancer cannot handle this situation: it deals with skew between DataNodes, not within a single DataNode.
This case is handled by the new intra-DataNode balancing function, which is invoked via the hdfs diskbalancer CLI.
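
A typical workflow (the DataNode host is a placeholder, and the plan path shown is illustrative):

    # generate a balancing plan for one DataNode
    hdfs diskbalancer -plan dn1.example.com
    # execute the plan file written by the previous step
    hdfs diskbalancer -execute /system/diskbalancer/<date>/dn1.example.com.plan.json
    # check progress
    hdfs diskbalancer -query dn1.example.com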

Reference: HDFS Commands Guide

11. Reworked daemon and task heap management

Heap size can now be auto-tuned based on host memory, and the HADOOP_HEAPSIZE variable is deprecated (superseded by HADOOP_HEAPSIZE_MAX and HADOOP_HEAPSIZE_MIN).
The desired heap size no longer needs to be specified both in the task configuration and as a Java option; existing configurations that already specify both are not affected by this change.
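
A hadoop-env.sh sketch (the values are illustrative; units such as g are now accepted):

    # in etc/hadoop/hadoop-env.sh
    export HADOOP_HEAPSIZE_MAX=4g
    export HADOOP_HEAPSIZE_MIN=1g
    # if left unset, the JVM's own defaults, derived from host memory, apply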

Reference:
daemon heap configuration
map and reduce task heap configuration

12. S3Guard: consistency and metadata caching for the S3A filesystem client

S3Guard adds a consistent metadata store on top of the S3A client (typically backed by Amazon DynamoDB), compensating for S3's eventual consistency on listings and serving as a metadata cache.
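
A minimal sketch of turning S3Guard on for a bucket (the table and bucket names are placeholders):

    # in core-site.xml (property = value form)
    fs.s3a.metadatastore.impl = org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore
    # initialize the DynamoDB metadata store, then inspect the bucket
    hadoop s3guard init -meta dynamodb://my-table s3a://my-bucket
    hadoop s3guard bucket-info s3a://my-bucket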

13. HDFS Router-based federation

HDFS Router-based federation adds an RPC routing layer that provides a federated view over multiple HDFS namespaces.
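
With Routers and the State Store in place, mount-table entries map paths in the federated view to individual namespaces; a sketch (the namespace ID ns1 and the paths are placeholders):

    # map /data in the federated namespace to /data in subcluster ns1
    hdfs dfsrouteradmin -add /data ns1 /data
    # list the current mount table
    hdfs dfsrouteradmin -ls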

Reference:
HDFS Router-based Federation

14. API-based configuration of Capacity Scheduler queues

The OrgQueue extension of the Capacity Scheduler provides a programmatic way to change queue configurations: users can modify them by invoking a REST API.
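
A sketch of enabling the mutable configuration store that backs this API (memory is one of the supported store types; the REST path is the documented scheduler-conf endpoint):

    # in yarn-site.xml (property = value form)
    yarn.scheduler.configuration.store.class = memory
    # queue configuration can then be changed through the RM REST API, e.g.
    #   PUT http://<rm-host>:8088/ws/v1/cluster/scheduler-conf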

Reference:
YARN-5734
Hadoop Capacity Scheduler documentation

15. YARN resource types

The YARN resource model has been generalized to support user-defined countable resource types beyond CPU and memory. For example, a cluster administrator can define resources such as GPUs, software licenses, or locally-attached storage, and YARN tasks can then be scheduled based on the availability of those resources.
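
A sketch of declaring a custom countable resource (the resource name resource1 is a placeholder):

    # in resource-types.xml (property = value form)
    yarn.resource-types = resource1
    yarn.resource-types.resource1.units = G
    # per-node capacity for it, in node-resources.xml
    yarn.nodemanager.resource-type.resource1 = 5G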

Reference:
YARN-3926
YARN resource model documentation


Source: www.cnblogs.com/yin1361866686/p/11642559.html