Hadoop overview

Table of contents

1. What is Hadoop

1.1. Hadoop in a narrow sense:

1.2. Hadoop in a broad sense:

1.3. Hadoop core components:

2. Features and advantages of Hadoop

3. Changes in Hadoop architecture

4. Hadoop cluster

HDFS cluster and YARN cluster are logically separated and physically together

1. What is Hadoop

1.1. Hadoop in a narrow sense:

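In the narrow sense, Hadoop refers to an open-source software project of the Apache Software Foundation.

It is implemented in Java and is open source.

It lets users apply a simple programming model to process massive data sets in a distributed way across clusters of machines.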

1.2. Hadoop in a broad sense:

In a broad sense, Hadoop refers to the big data ecosystem built around Hadoop.

HDFS, the distributed file storage system, sits at the bottom of the ecosystem and is its core.

YARN, a general-purpose distributed cluster resource manager and task scheduling platform, can run many different computing engines, which secures Hadoop's position in the ecosystem.

MapReduce is the first-generation distributed computing engine of the big data ecosystem. Because of limitations in its own design, companies now rarely program against MapReduce directly, but many software packages still use the MapReduce engine underneath to process data.
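The MapReduce model described above can be illustrated with a minimal single-process sketch of the classic word count. This is not the Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical, and everything runs in one Python process instead of across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # On a real cluster each input split would be mapped on a different node;
    # here the three phases simply run one after another.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["hadoop stores data", "hadoop processes data"])
```

The point of the model is that the map and reduce functions contain no distribution logic, so a framework is free to run many copies of them in parallel on different machines.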

1.3. Hadoop core components:

HDFS (distributed file storage system): solves massive data storage

YARN (cluster resource management and task scheduling framework): solves resource management and task scheduling

MapReduce (distributed computing framework): solves massive data computation

###################################################### 

2. Features and advantages of Hadoop

Strong scalability

Hadoop distributes data and computing tasks across the computers in a cluster, and the cluster can be scaled out easily and flexibly to thousands of nodes.

Low cost

Hadoop processes big data on clusters built from ordinary, inexpensive machines, so the cost is very low. What matters is the overall capability of the cluster rather than the power of any single machine.

High efficiency

Hadoop works on data concurrently and can move data between nodes dynamically and in parallel, which makes processing very fast.

Reliability

Hadoop automatically keeps multiple copies of data and automatically reschedules computing tasks after they fail, which is why people trust its data storage and processing capabilities.
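The reliability point, keeping multiple copies and recovering after a failure, can be sketched as a toy simulation. It assumes a replication factor of 3 (the usual HDFS default) and a simple round-robin placement, which is not HDFS's actual rack-aware policy. The node names, block names and both functions are hypothetical.

```python
REPLICATION = 3  # assumed; the usual HDFS default for dfs.replication


def place_blocks(block_ids, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = {nodes[(i + r) % len(nodes)] for r in range(replication)}
    return placement


def handle_node_failure(placement, failed, live_nodes, replication=REPLICATION):
    """Drop the failed node and copy under-replicated blocks to live nodes."""
    for holders in placement.values():
        holders.discard(failed)
        for candidate in live_nodes:
            if len(holders) >= replication:
                break
            holders.add(candidate)  # adding an existing holder changes nothing
    return placement


nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(["blk_a", "blk_b"], nodes)
live = [n for n in nodes if n != "node2"]
placement = handle_node_failure(placement, "node2", live)
```

After the simulated loss of `node2`, every block is back to three copies on the remaining machines, which is the behaviour the paragraph above describes in words.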

######################################################  

3. Changes in Hadoop architecture

Hadoop 1.0:

HDFS (distributed file storage)
MapReduce (resource management and distributed data processing)

Hadoop 2.0:

HDFS (distributed file storage)
MapReduce (distributed data processing)
YARN (cluster resource management, task scheduling)

Hadoop 3.0:

General:
Streamlined kernel, classpath isolation, shell script refactoring

Hadoop HDFS:
Erasure coding (EC), support for multiple NameNodes

Hadoop MapReduce:
Task-level native optimization, automatic inference of memory parameters

Hadoop YARN:
Timeline Service v2, queue configuration
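To see why erasure coding matters for HDFS, it helps to compare raw storage cost. The sketch below assumes 3-way replication against a Reed-Solomon layout with 6 data units and 3 parity units (the shape of Hadoop 3's built-in RS-6-3 policy). The 100 TB figure and the function names are made up for illustration.

```python
def replication_overhead(replicas):
    """Raw bytes stored per byte of user data with N-way replication."""
    return float(replicas)


def erasure_coding_overhead(data_units, parity_units):
    """Raw bytes stored per byte of user data with data + parity units."""
    return (data_units + parity_units) / data_units


user_data_tb = 100  # hypothetical amount of user data, in TB

# 3 full copies of every block
raw_replicated_tb = user_data_tb * replication_overhead(3)

# 6 data units plus 3 parity units per stripe
raw_erasure_tb = user_data_tb * erasure_coding_overhead(6, 3)
```

Under these assumptions replication needs 300 TB of raw disk and erasure coding needs 150 TB, at the price of extra computation when data has to be reconstructed.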

######################################################  

4. Hadoop cluster

A Hadoop cluster consists of two clusters: an HDFS cluster and a YARN cluster.

The two clusters are logically separate but usually deployed on the same physical machines.
Both are standard master-slave architectures.

HDFS cluster:
Master role: NameNode
Slave role: DataNode
Auxiliary role to the master: SecondaryNameNode

YARN cluster:
Master role: ResourceManager
Slave role: NodeManager

HDFS cluster and YARN cluster are logically separated and physically together

Logical separation means that the HDFS cluster and the YARN cluster do not depend on each other: neither has to be running for the other to start, and they do not affect each other's operation.
Physically, however, the processes of both clusters can be deployed on the same machines.
MapReduce is a computing framework, a code-level component, so there is no such thing as a MapReduce cluster.
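The idea of "logically separated, physically together" can be modelled as a small data structure. The three-machine layout below is entirely hypothetical (host names and daemon placement are made up); only the role names come from the lists above.

```python
# Hypothetical three-machine layout; host names are made up for illustration.
layout = {
    "node1": ["NameNode", "DataNode", "ResourceManager", "NodeManager"],
    "node2": ["SecondaryNameNode", "DataNode", "NodeManager"],
    "node3": ["DataNode", "NodeManager"],
}

HDFS_ROLES = {"NameNode", "SecondaryNameNode", "DataNode"}
YARN_ROLES = {"ResourceManager", "NodeManager"}


def hosts_for(roles, layout):
    """Return the hosts that run at least one daemon of the given cluster."""
    return {host for host, daemons in layout.items() if roles & set(daemons)}


def daemons_of(roles, layout):
    """Return the (host, daemon) pairs that make up one logical cluster."""
    return sorted(
        (host, daemon)
        for host, daemons in layout.items()
        for daemon in daemons
        if daemon in roles
    )


hdfs_hosts = hosts_for(HDFS_ROLES, layout)
yarn_hosts = hosts_for(YARN_ROLES, layout)
shared_hosts = hdfs_hosts & yarn_hosts  # machines that serve both clusters
```

No daemon belongs to both role sets, which is the logical separation, while `shared_hosts` contains every machine, which is the physical co-location. Note that no MapReduce entry appears anywhere in the layout.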

Origin blog.csdn.net/qq_48391148/article/details/129813242