Getting Started with Big Data Hadoop 03 - HDFS Distributed File System Foundation

1. Introduction to HDFS:

  • HDFS stands for Hadoop Distributed File System.
  • It is one of the core components of Apache Hadoop and serves as the lowest-level distributed storage service in the big data ecosystem. Indeed, the first problem big data must solve is how to store massive amounts of data.
  • HDFS primarily solves the problem of how to store big data. "Distributed" means that HDFS is a storage system spanning multiple computers.
  • HDFS is a distributed file system that runs on ordinary commodity hardware and is suited to storing very large datasets (on the order of terabytes to petabytes).
  • HDFS stores files across many machines while providing a unified access interface, so clients can use the distributed file system much as they would an ordinary local file system (see the sketch below).
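
As a minimal sketch of that unified interface, the Java snippet below writes and reads a file through Hadoop's FileSystem API. The namenode address hdfs://namenode:8020 and the path /demo/hello.txt are placeholder values for illustration, not from the original text:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHelloWorld {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; substitute your cluster's host and port.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/demo/hello.txt");
        // Writing looks just like writing to a local file system.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hdfs");
        }
        // Reading goes through the same unified interface.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```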

2. The origin, development and design goals of HDFS

  • Doug Cutting led the development of the Nutch project, whose goal was to build a web-scale search engine covering crawling, indexing, querying, and related functions.
  • As the number of pages crawled grew, a serious scalability problem arose: how to store and index billions of web pages. In 2003, a paper published by Google offered a feasible solution to this problem:
  • The Google File System (GFS), a distributed file system that could handle the storage of massive numbers of web pages.
  • The Nutch developers completed the corresponding open-source implementation, HDFS, which was later separated from Nutch (together with MapReduce) to become the independent project Hadoop.
  • Hardware failure (Hardware Failure) is the norm rather than the exception: an HDFS deployment may consist of hundreds or thousands of servers, any component of which can fail. Failure detection and automatic, fast recovery are therefore core architectural goals of HDFS.
  • Streaming data access (Streaming Data Access): applications on HDFS mainly read data in a streaming fashion. HDFS is designed for batch processing rather than interactive use, so it emphasizes high throughput of data access over low response time (latency).
  • Large data sets (Large Data Sets): a typical HDFS file is gigabytes to terabytes in size, so HDFS is tuned to support large files. It should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support tens of millions of files per cluster.
  • Most HDFS applications need a write-once-read-many access model for files: once a file is created, written, and closed, it does not need to be modified. This assumption simplifies data consistency and enables high-throughput data access.
  • "Moving computation is cheaper than moving data": a computation requested by an application is more efficient when it executes near the data it operates on. It is clearly better to migrate the computation closer to the data than to move the data to where the application runs.
  • HDFS is designed to be easily portable from one platform to another, which has helped it become the storage platform of choice for a large set of applications.

3. HDFS Application Scenarios

4. Important features of HDFS:

  1. Overall overview
  2. Master-slave architecture
  3. Block storage
  4. Replica mechanism
  5. Metadata management
  6. Abstract unified directory tree structure (namespace)

(1) Master-slave architecture

  • An HDFS cluster is a standard master/slave architecture cluster.
  • In general, an HDFS cluster consists of one Namenode and a certain number of Datanodes.
  • The Namenode is the master node of HDFS and the Datanodes are the slave nodes. The two roles each perform their own duties and coordinate with each other to deliver the distributed file storage service.
  • The official architecture diagram shows a one-master, five-slave layout, with the five slave roles placed on different servers across two racks (Rack).

(2) Block storage

  • Files in HDFS are physically stored as blocks. The default block size is 128 MB (134217728 bytes); a file (or the final portion of a file) smaller than 128 MB still forms a block of its own.
  • The block size can be specified by a configuration parameter in hdfs-default.xml: dfs.blocksize. A small sketch of the block arithmetic follows.
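
The block split itself is simple arithmetic; this standalone sketch (the 300 MB file size is an invented example) shows how a file maps onto 128 MB blocks:

```java
public class BlockCount {
    public static void main(String[] args) {
        // Default HDFS block size: 128 MB = 134217728 bytes (dfs.blocksize).
        final long BLOCK_SIZE = 134217728L;
        long fileSize = 300L * 1024 * 1024; // a hypothetical 300 MB file

        long fullBlocks = fileSize / BLOCK_SIZE;
        long lastBlock = fileSize % BLOCK_SIZE;
        long totalBlocks = fullBlocks + (lastBlock > 0 ? 1 : 0);

        // 300 MB -> two full 128 MB blocks plus one 44 MB block; the last
        // block occupies only its actual size, not a padded 128 MB.
        System.out.printf("blocks = %d, last block = %d bytes%n", totalBlocks, lastBlock);
    }
}
```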

(3) Replica mechanism

  • All blocks of a file are replicated for fault tolerance.
  • The replication factor can be specified when a file is created and changed later by command.
  • The number of replicas is controlled by the parameter dfs.replication, with a default of 3: two additional copies are made, giving 3 replicas in total including the original. Both ways of setting it are sketched below.
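
Here is a small sketch of both approaches using the Hadoop Java API; the namenode address and file path are placeholders:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Replication factor applied to files this client creates.
        conf.set("dfs.replication", "2");
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Change the factor later for an existing file (path is a placeholder).
        boolean scheduled = fs.setReplication(new Path("/demo/hello.txt"), (short) 3);
        System.out.println("replication change scheduled: " + scheduled);
        fs.close();
    }
}
```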

(4) Metadata management

In HDFS, the Namenode manages two types of metadata (a sketch that reads the first type follows the list):

  • Attribute information of the file itself: file name, permissions, modification time, file size, replication factor, and block size.
  • File block location mapping: the mapping between file blocks and DataNodes, i.e., which block resides on which node.
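
A sketch of reading the attribute metadata through the Java API (the address and path are placeholders); every field below is answered by the Namenode alone, without contacting any Datanode:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());

        // Attribute metadata of the file itself, served by the Namenode.
        FileStatus st = fs.getFileStatus(new Path("/demo/hello.txt"));
        System.out.println("name        : " + st.getPath().getName());
        System.out.println("permissions : " + st.getPermission());
        System.out.println("modified    : " + st.getModificationTime());
        System.out.println("size        : " + st.getLen());
        System.out.println("replication : " + st.getReplication());
        System.out.println("block size  : " + st.getBlockSize());
        fs.close();
    }
}
```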

(5) Namespace

  • HDFS supports the traditional hierarchical file organization structure. Users can create directories and save files in those directories. The hierarchy of the filesystem namespace is similar to that of most existing filesystems: users can create, delete, move, or rename files. 
  • The Namenode is responsible for maintaining the file system namespace; any modification to the namespace or its properties is recorded by the Namenode.
  • HDFS presents clients with a unified abstract directory tree, and clients access files by path, e.g. hdfs://namenode:port/dir-a/dir-b/dir-c/file.data. Common namespace operations are sketched below.
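
The operations below sketch how a client manipulates that directory tree via the Java API; all paths and the namenode address are placeholders:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());

        // Create a directory hierarchy, as on a local file system.
        fs.mkdirs(new Path("/dir-a/dir-b/dir-c"));

        // Rename (move) and delete are pure namespace operations,
        // recorded by the Namenode.
        fs.rename(new Path("/dir-a/dir-b/dir-c"), new Path("/dir-a/dir-b/dir-d"));
        fs.delete(new Path("/dir-a"), true); // true = recursive
        fs.close();
    }
}
```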

(6) Data block storage 


        The concrete storage and management of each block of a file is handled by the DataNodes, and each block can be stored on multiple DataNodes. The sketch below asks the Namenode where each block of a file lives.
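
As a sketch (placeholders again for the address and path), a client can ask the Namenode which Datanodes hold each block of a file:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());

        FileStatus st = fs.getFileStatus(new Path("/demo/big.dat"));
        // Block-location metadata: which Datanodes store each block.
        BlockLocation[] blocks = fs.getFileBlockLocations(st, 0, st.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
        }
        fs.close();
    }
}
```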
