Table of contents
1. What is Hadoop
2. Features and advantages of Hadoop
3. Changes in Hadoop architecture
4. Hadoop cluster
1. What is Hadoop
1.1. Hadoop in a narrow sense:
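In the narrow sense, Hadoop refers to an open-source software project of the Apache Software Foundation
It is implemented in Java and is open source
It lets users apply a simple programming model to run distributed computations over massive data sets across clusters of machines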
1.2. Hadoop in a broad sense:
Hadoop in a broad sense refers to the big data ecosystem built around Hadoop
HDFS, the distributed file storage system, sits at the bottom of the ecosystem and is its core
YARN, a general-purpose distributed cluster resource management system and task scheduling platform, supports the operation of various computing engines and secures Hadoop's position in the ecosystem
MapReduce is the first-generation distributed computing engine of the big data ecosystem. Because of shortcomings in its own design, front-line enterprises almost never program directly against MapReduce anymore, but many tools still use the MapReduce engine under the hood to process data
1.3. Hadoop core components:
HDFS (distributed file storage system): solves massive data storage
YARN (cluster resource management and task scheduling framework): solves resource and task scheduling
MapReduce (distributed computing framework): solves massive data computing
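The MapReduce programming model mentioned above can be illustrated with a small word count. This is only a sketch in plain Python, assuming nothing about Hadoop's real Java API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and everything runs in one process instead of across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

In a real job the map and reduce calls would run as separate tasks on different nodes, with HDFS holding the input and output and YARN allocating the resources.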
######################################################
2. Features and advantages of Hadoop
Strong scalability
Hadoop distributes data and computing tasks across the available machines in a cluster, which can be scaled out to thousands of nodes conveniently and flexibly
Low cost
Hadoop processes big data on clusters built from ordinary, inexpensive machines, so the cost is very low; the focus is on the overall capability of the cluster rather than of any single machine
High efficiency
By working on data concurrently, Hadoop can move data between nodes dynamically and in parallel, which makes processing very fast
Reliability
It automatically maintains multiple copies of the data and automatically reschedules computing tasks after a failure,
so Hadoop's bit-level storage and data processing capabilities can be trusted
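The reliability point about multiple copies can be made concrete with a toy model. The sketch below assumes a replication factor of 3 (the HDFS default for dfs.replication) and a simple round-robin placement; real HDFS uses a rack-aware placement policy, and the node names, block ids and function names here are hypothetical.

```python
REPLICATION = 3  # assumed replication factor; 3 is the HDFS default


def place_replicas(block_ids, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy, not rack-aware)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = {nodes[(i + r) % len(nodes)] for r in range(replication)}
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return {block for block, holders in placement.items() if holders - failed}


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4"]  # hypothetical host names
    placement = place_replicas(["blk_0", "blk_1", "blk_2"], nodes)
    # With three replicas per block, losing any two nodes leaves every block readable.
    print(sorted(readable_blocks(placement, ["node1", "node2"])))
```

With three copies on distinct nodes, a block becomes unreadable only when all three of its holders fail at once, which is why two simultaneous node failures are tolerated in this model.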
######################################################
3. Changes in Hadoop architecture
Hadoop 1.0:
HDFS (distributed file storage)
MapReduce (resource management and distributed data processing)
Hadoop 2.0:
HDFS (distributed file storage)
MapReduce (distributed data processing)
YARN (cluster resource management and task scheduling)
Hadoop 3.0:
General aspects:
Streamlined kernel, classpath isolation, shell script refactoring
Hadoop HDFS:
Erasure coding (EC), support for multiple NameNodes
Hadoop MapReduce:
Task-level native optimization, automatic inference of memory parameters
Hadoop YARN:
Timeline Service v2, queue configuration
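To show why erasure coding matters for HDFS in Hadoop 3.0, the short calculation below compares its storage overhead with plain replication. It is a back-of-the-envelope sketch: the function names are made up, and RS(6,3), meaning 6 data units plus 3 parity units, is used as the example scheme.

```python
def replication_overhead(replicas):
    """Extra storage, as a fraction of the raw data size, for N-way replication."""
    return float(replicas - 1)


def erasure_coding_overhead(data_units, parity_units):
    """Extra storage, as a fraction of the raw data size, for Reed-Solomon(data, parity)."""
    return parity_units / data_units


if __name__ == "__main__":
    # 3 copies: 2 extra copies of every byte; survives the loss of 2 copies.
    print(f"3x replication: {replication_overhead(3):.0%} extra storage")
    # 6 data + 3 parity units: half as much extra space; survives the loss of any 3 units.
    print(f"RS(6,3):        {erasure_coding_overhead(6, 3):.0%} extra storage")
```

The trade-off is that reading erasure-coded data after a failure needs reconstruction work, so the saved disk space is paid for with extra CPU and network traffic.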
######################################################
4. Hadoop cluster
A Hadoop cluster consists of two clusters: the HDFS cluster and the YARN cluster
The two clusters are logically separate but usually physically co-located
Both are standard master-slave architecture clusters
HDFS cluster:
Master role: NameNode
Slave role: DataNode
Auxiliary role to the master: SecondaryNameNode
YARN cluster:
Master role: ResourceManager
Slave role: NodeManager
HDFS cluster and YARN cluster are logically separated and physically together
Logical separation means that the HDFS cluster and the YARN cluster do not depend on each other: neither has to be started for the other to start, and they do not affect each other.
At the physical level, however, processes of both clusters are often deployed on the same machine.
MapReduce is a computing framework, a code-level component, so there is no such thing as a MapReduce cluster
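The idea of "logically separated, physically together" can be sketched as a mapping from machines to the daemon processes they run. The three-host layout and host names below are hypothetical; only the role names (NameNode, SecondaryNameNode, DataNode, ResourceManager, NodeManager) come from the text.

```python
HDFS_ROLES = {"NameNode", "SecondaryNameNode", "DataNode"}
YARN_ROLES = {"ResourceManager", "NodeManager"}

# Hypothetical three-machine layout; host names are made up for illustration.
LAYOUT = {
    "host1": {"NameNode", "ResourceManager"},
    "host2": {"SecondaryNameNode", "DataNode", "NodeManager"},
    "host3": {"DataNode", "NodeManager"},
}


def cluster_members(layout, roles):
    """Map each host to the processes it runs that belong to one logical cluster."""
    return {host: procs & roles for host, procs in layout.items() if procs & roles}


def shared_hosts(layout):
    """Hosts that run processes of both the HDFS cluster and the YARN cluster."""
    hdfs_hosts = set(cluster_members(layout, HDFS_ROLES))
    yarn_hosts = set(cluster_members(layout, YARN_ROLES))
    return sorted(hdfs_hosts & yarn_hosts)


if __name__ == "__main__":
    print("HDFS cluster:", cluster_members(LAYOUT, HDFS_ROLES))
    print("YARN cluster:", cluster_members(LAYOUT, YARN_ROLES))
    print("Machines shared by both:", shared_hosts(LAYOUT))
```

Each logical cluster is just a view over the same machines, filtered by role. No host lists a MapReduce process, which matches the point that MapReduce is code run on the cluster, not a cluster of its own.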