[Basic knowledge] Brief introduction to big data component HDFS

HDFS uses a classic master/slave architecture. Each HDFS cluster consists of a NameNode and multiple DataNodes.
The NameNode manages the metadata of all files and handles client requests. Each DataNode manages the file data stored on its own node. Every file uploaded to HDFS is divided into one or more data blocks. These blocks are distributed across DataNodes according to the cluster's replication policy, and the NameNode keeps track of where each block is located.

NameNode

The NameNode manages the file system namespace: it maintains the directory tree and file metadata, and records which file each written data block (Block) belongs to.
This information is persisted on the local disk as a namespace image (FSImage) and an edit log (EditLog).
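
The relationship between the two persisted structures can be illustrated with a toy model: a snapshot of the namespace plus a log of later changes that is replayed on top of it. This is only a minimal sketch; the dictionary layout, the operation names, and the `replay` function are assumptions made for illustration and do not reflect the real FSImage or edit log formats.

```python
# Toy model: a namespace snapshot plus an append-only log of later changes.
# The data layout and operation names are illustrative assumptions,
# not the real on-disk formats used by HDFS.

def replay(fsimage, edit_log):
    """Return the namespace obtained by applying edit_log to a copy of fsimage."""
    namespace = {path: list(blocks) for path, blocks in fsimage.items()}
    for op, path, arg in edit_log:
        if op == "create":
            namespace[path] = []
        elif op == "add_block":
            namespace[path].append(arg)
        elif op == "delete":
            namespace.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


# Snapshot: one file made of two blocks.
fsimage = {"/logs/a.txt": ["blk_1", "blk_2"]}

# Changes recorded after the snapshot was taken.
edit_log = [
    ("create", "/logs/b.txt", None),
    ("add_block", "/logs/b.txt", "blk_3"),
    ("delete", "/logs/a.txt", None),
]

current = replay(fsimage, edit_log)
print(current)  # {'/logs/b.txt': ['blk_3']}
```

The snapshot itself is left unchanged; only the replayed copy reflects the logged operations.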

DataNode

A DataNode is where file data is actually stored.
It stores or serves data blocks as instructed by the NameNode or a client, and periodically reports the blocks it holds to the NameNode.
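
As a rough illustration of what the NameNode can do with these periodic reports, the sketch below merges per-DataNode block lists into a block-to-location map and flags blocks with too few copies. The function names, node names, and block IDs are hypothetical, and the real report protocol is considerably more involved.

```python
from collections import defaultdict


def apply_block_reports(reports):
    """Build a block -> set-of-DataNodes map from per-DataNode block reports."""
    locations = defaultdict(set)
    for datanode, block_ids in reports.items():
        for block_id in block_ids:
            locations[block_id].add(datanode)
    return dict(locations)


def under_replicated(locations, replication=3):
    """Return the sorted IDs of blocks held by fewer than `replication` DataNodes."""
    return sorted(b for b, nodes in locations.items() if len(nodes) < replication)


# Hypothetical reports: each DataNode lists the blocks it currently stores.
reports = {
    "dn1": ["blk_1", "blk_2"],
    "dn2": ["blk_1", "blk_2"],
    "dn3": ["blk_1"],
}

locations = apply_block_reports(reports)
print(under_replicated(locations))  # ['blk_2']
```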

Blocks

HDFS splits files into data blocks of 128 MB by default, and the blocks of one file may be stored on different nodes. As a result, a single file in HDFS can be larger than any single disk in the cluster. Each Block is stored as 3 replicas by default (2 replicas if the EMR Core node uses a cloud disk), and the replicas are placed on different nodes at Block granularity. This improves data reliability and also lets distributed jobs compute on local data, which reduces network transmission.
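
The block arithmetic can be worked through with a short sketch, assuming the 128 MB block size and 3 replicas mentioned above. The helper names `block_layout` and `raw_storage` are made up for this example.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB   # block size stated above
REPLICATION = 3         # replica count stated above


def block_layout(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file is split into."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds the leftover bytes
    return sizes


def raw_storage(file_size, replication=REPLICATION):
    """Return the total bytes stored across the cluster for one file."""
    return file_size * replication


file_size = 300 * MB
print([size // MB for size in block_layout(file_size)])  # [128, 128, 44]
print(raw_storage(file_size) // MB)                       # 900
```

A 300 MB file therefore occupies three blocks, the last one only partly filled, and about 900 MB of raw storage once all replicas are counted.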

High availability

In a high-availability cluster, two NameNodes are started by default: an Active NameNode and a Standby NameNode. The two assume different roles.
The Active NameNode handles requests from DataNodes and clients. The Standby NameNode keeps the same up-to-date metadata as the Active NameNode and is ready to take over its services when a failure occurs. If the Active NameNode fails, the Standby NameNode detects this and switches to the Active role to handle DataNode and client requests.
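
The role switch can be pictured as a very small state change. The sketch below is a toy model only: the `NameNode` class, the `healthy` flag, and `serving_namenode` are hypothetical, and how a real cluster detects the failure and coordinates the switch is outside the scope of this example.

```python
class NameNode:
    """Toy NameNode with a role and a health flag (illustrative only)."""

    def __init__(self, name, role):
        self.name = name
        self.role = role      # "active", "standby", or "failed"
        self.healthy = True


def serving_namenode(active, standby):
    """Return the NameNode that should serve requests, promoting the standby if needed."""
    if active.healthy:
        return active
    if not standby.healthy:
        raise RuntimeError("no healthy NameNode available")
    active.role = "failed"
    standby.role = "active"   # the standby takes over the active role
    return standby


nn1 = NameNode("nn1", "active")
nn2 = NameNode("nn2", "standby")

print(serving_namenode(nn1, nn2).name)  # nn1

nn1.healthy = False                     # simulate a failure of the active node
print(serving_namenode(nn1, nn2).name)  # nn2
print(nn2.role)                         # active
```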

Reposted from: blog.csdn.net/weixin_44325637/article/details/135067947