Basic theoretical knowledge of Hadoop
- HDFS (Hadoop Distributed File System) was designed and developed based on the Google File System (GFS) paper published by Google
- In addition to the characteristics it shares with other distributed file systems, HDFS has unique features:
  - High fault tolerance: hardware is assumed to be unreliable, so failures are expected, detected, and handled automatically
  - High throughput: provides high aggregate throughput for applications that access large amounts of data
  - Large file storage: supports storing data at the TB to PB level
- What is HDFS suitable for?
  - Large file storage and access
  - Streaming data access (illustrated by the sketch after this list)
- What is HDFS not suitable for?
  - Mass storage of small files
  - Random writes
  - Low-latency reads
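This access pattern maps directly onto Hadoop's Java FileSystem API: files are written sequentially and read back as streams, and there is no random-write call at all. Below is a minimal sketch, assuming a reachable HDFS cluster configured via core-site.xml on the classpath; the path /data/events.log is a hypothetical example.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsStreamingExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/events.log"); // hypothetical path

        // Write once, sequentially: the pattern HDFS is optimized for.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("user=42 action=click\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the whole file back as a stream.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```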
Examples of HDFS application scenarios:
- HDFS is the distributed file system in the Hadoop technical framework; it manages files stored across independently deployed physical machines.
- It can be used in a variety of scenarios, such as:
  - Website user behavior data storage
  - Ecosystem data storage
  - Meteorological data storage
Basic system architecture
HDFS architecture key design
HDFS high reliability (HA)
Metadata persistence
Configure HDFS data storage strategy
- By default, the HDFS NameNode automatically selects the DataNodes that store data replicas. In actual business, however, the following scenarios exist:
  - A DataNode contains different storage devices, and data needs to be placed on a suitable device so it can be stored in tiers (a policy-setting sketch follows this list)
  - Data in different HDFS directories differs in importance, and a directory label is used to select suitable DataNodes to store it
  - The DataNode cluster uses heterogeneous servers, and critical data needs to be stored in a highly reliable node group
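Tiered storage is exposed through named storage policies (e.g., HOT, ALL_SSD, COLD). Below is a minimal sketch of pinning a directory to SSD, assuming the Hadoop 2.8+ FileSystem API (older releases expose the same calls on DistributedFileSystem) and a cluster whose DataNode volumes are tagged with the SSD storage type in dfs.datanode.data.dir; the directory /hot-data is a hypothetical example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StoragePolicyExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path hotDir = new Path("/hot-data"); // hypothetical directory
        fs.mkdirs(hotDir);

        // Ask the NameNode to place all replicas of new blocks under
        // this directory on volumes tagged [SSD].
        fs.setStoragePolicy(hotDir, "ALL_SSD");

        System.out.println("Policy now: " + fs.getStoragePolicy(hotDir).getName());
    }
}
```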
HDFS data integrity guarantee
- HDFS primarily aims to guarantee the integrity of stored data and to remain reliable when individual components fail.
- Rebuilding replicas on failed data disks: when a DataNode stops sending its periodic heartbeat to the NameNode, the NameNode initiates replica reconstruction to recover the lost replicas.
- Cluster data balancing: the data balancing mechanism ensures that data is evenly distributed across all DataNodes.
- Metadata reliability guarantee:
  - Metadata operations are recorded through a logging (edit log) mechanism, and the metadata is stored on both the active and standby NameNode
  - The snapshot mechanism, the snapshot feature common to file systems, ensures that data can be restored promptly after accidental operations (a snapshot sketch follows this list)
- Safe mode: when DataNodes or hard disks fail, safe mode prevents the failure from spreading.
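As a concrete illustration of the snapshot mechanism, here is a minimal sketch of taking a point-in-time snapshot via the Java API, assuming an administrator has already allowed snapshots on the directory (e.g., with `hdfs dfsadmin -allowSnapshot /important`); the path and snapshot name are hypothetical examples.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical snapshottable directory; snapshots must first be
        // allowed on it by an administrator.
        Path dir = new Path("/important");

        // Take a read-only, point-in-time snapshot before risky changes.
        Path snapshot = fs.createSnapshot(dir, "before-cleanup");
        System.out.println("Snapshot created at: " + snapshot);

        // If data is later deleted by mistake, it can be copied back from
        // /important/.snapshot/before-cleanup/
    }
}
```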