HDFS Distributed File System Basics

File system definition

  • A file system is a method of storing and organizing data. It provides operations such as storage, hierarchical organization, access, and retrieval, making it easier for users to find and work with files.
  • A file system uses the abstract logical concept of a directory tree in place of the data-block concept used by physical devices such as hard disks. Users do not need to care where on the disk the data physically resides; they only need the file's directory path and name.
  • File systems typically use storage devices such as hard disks and optical discs, and maintain the physical location of each file on the device.

Traditional common file system

  • The so-called traditional common file system mostly refers to a single-machine file system, that is, one whose underlying storage is not implemented across multiple machines.
  • Common characteristics of these file systems include:
  1. An abstract directory tree structure that starts at the root directory / and branches downward
  2. The nodes in the tree fall into two categories: directories and files
  3. Starting from the root directory, the path to any node is unique

Data and metadata

  • Data
    refers to the stored content itself, such as files, videos, and pictures. This data ultimately lives on storage media such as disks; generally, users do not need to care about that. They simply create, delete, modify, and query data through the directory tree, and the file system carries out the actual operations on the underlying storage.
  • Metadata
    Metadata, also called descriptive data, is data that describes other data. File system metadata generally refers to file size, last modification time, underlying storage location, attributes, owner, permissions, and similar information.

Problems encountered in massive data storage

  • High cost
    Traditional storage hardware has poor versatility, and the costs of initial investment, later maintenance, upgrades, and expansion are very high.
  • Hard to support efficient computing and analysis
    Traditional storage keeps storage and computing strictly separate: storage is storage, computing is computing, and data has to be moved whenever it needs to be processed.
    Programs and data are supplied by different technology vendors and cannot be organically integrated.
  • Low performance
    The I/O bottleneck of a single node cannot be overcome, making it difficult to support high-concurrency, high-throughput scenarios over massive data.
  • Poor scalability
    Rapid deployment and elastic expansion cannot be achieved; dynamically scaling out or in is costly and technically difficult.

The core attributes and functions of distributed storage systems

Advantages of Distributed Storage

Problem: the data volume is huge, and single-machine storage hits a bottleneck

  • Solution:
    Single-machine vertical scaling: add disks when disks run out; this has an upper-limit bottleneck
    Multi-machine horizontal scaling: add machines when machines run out; in theory this scales without limit

The role of metadata records

Problem: files are distributed across different machines, which makes them hard to locate

  • Solution: metadata records each file and its storage location, so the file can be located quickly

Benefits of block storage

Problem: a file may be too large to store on one machine, and uploading or downloading it as a whole is slow

  • Solution: split the file into blocks stored on different machines, so blocks can be operated on in parallel for better efficiency. For example, under the default 128M block size, a 300M file is split into blocks of 128M + 128M + 44M.

The role of the copy mechanism

  • Problem: hardware failure is inevitable, and data is easily lost
  • Solution: keep backup copies of each block on different machines; redundant storage keeps the data safe

HDFS introduction and design goals

Introduction to HDFS

  • HDFS is short for Hadoop Distributed File System
  • It is one of the core components of Apache Hadoop and serves as the lowest-level distributed storage service in the big data ecosystem. It can be said that the primary problem big data must solve is the storage of massive data.
  • HDFS mainly solves the problem of how to store big data. "Distributed" means that HDFS is a storage system that spans multiple computers.
  • HDFS is a distributed file system that can run on commodity hardware. It is highly fault-tolerant, suited to applications with large data sets, and well suited to storing large volumes of data (TB, PB scale).
  • HDFS uses multiple computers to store files and provides a unified access interface, so clients can use the distributed file system as if it were an ordinary file system (see the sketch below).
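
To make the "unified access interface" concrete, here is a minimal Java sketch using Hadoop's FileSystem API. The Namenode address hdfs://namenode:8020 and the path /demo/hello.txt are placeholder assumptions, not values from this article.

```java
// A minimal sketch: writing a file to HDFS through the unified FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the cluster; the URI is an assumed placeholder.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Write a file exactly as one would on an ordinary file system.
        Path file = new Path("/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```

The same FileSystem handle can point at a local file system or at HDFS depending on the URI, which is exactly the "access a distributed file system like an ordinary file system" idea.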

HDFS origin and development

  • Doug Cutting led the development of the Nutch project. Nutch's design goal was to build a large-scale whole-web search engine, including web crawling, indexing, and querying
  • As the number of crawled web pages grew, the project ran into serious scalability problems: how to store and index billions of web pages
  • In 2003, a paper published by Google offered a feasible solution: the Google File System (GFS), a distributed file system that could handle the storage of massive numbers of web pages
  • The Nutch developers completed the corresponding open-source implementations, which, together with MapReduce, were later split out of Nutch to become the independent project Hadoop

HDFS Design Goals

  • Hardware failure (Hardware Failure) is the norm. An HDFS cluster may consist of hundreds or thousands of servers, and any component may fail; therefore failure detection and automatic fast recovery are core architectural goals of HDFS
  • Applications on HDFS mainly read data in a streaming fashion (Streaming Data Access). HDFS is designed for batch processing rather than interactive use; high throughput of data access matters more than low response time.
  • Typical HDFS file sizes range from GB to TB, so HDFS is tuned to support large files (Large Data Sets). It should provide high aggregate data bandwidth, scale to hundreds of nodes per cluster, and support tens of millions of files in a cluster.
  • Most HDFS applications need a write-once-read-many access model: once a file is created, written, and closed, it need not be modified. This assumption simplifies data consistency and makes high-throughput access possible.
  • Moving computation is cheaper than moving data. The closer the computation an application requests is to the data it operates on, the more efficient it is; it is clearly better to move the computation to the data than to move the data to where the application runs.
  • HDFS is designed to be easily portable from one platform to another, which promotes its widespread adoption as the platform of choice for a large set of applications

HDFS application scenarios

Suitable scenarios:

  • Large files
  • Streaming data access
  • Write once, read many
  • Low-cost deployment on cheap commodity machines
  • High fault tolerance

Unsuitable scenarios:

  • Small files
  • Interactive data access
  • Frequent arbitrary modification
  • Low-latency processing

Important features of HDFS

Overall overview

  • master-slave architecture
  • block storage
  • copy mechanism
  • metadata record
  • Abstract unified directory tree structure (namespace)

master-slave architecture

  • An HDFS cluster is a standard master/slave architecture cluster
  • Generally, an HDFS cluster consists of one Namenode and a certain number of Datanodes
  • The Namenode is the master node of HDFS and the Datanodes are the slave nodes; the two roles each perform their own duties and coordinate to deliver the distributed file storage service
  • The official architecture diagram shows a one-master, five-slave layout, in which the five slave roles sit on different servers across two racks (Rack)

block storage

  • Files in HDFS are physically stored in blocks; the default block size is 128M (134217728 bytes), and a file smaller than 128M forms a single block on its own
  • The block size can be specified by a configuration parameter, dfs.blocksize, defined in hdfs-default.xml (see the sketch below)
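
As a small illustration of dfs.blocksize, this Java sketch reads the effective client-side value and overrides it for files created by this client. The fallback 134217728 mirrors the default quoted above; this is a sketch, assuming a normally configured Hadoop client.

```java
// Reading and overriding dfs.blocksize programmatically.
import org.apache.hadoop.conf.Configuration;

public class BlockSizeDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Read the effective block size; the fallback is the 128M default
        // (134217728 bytes) from hdfs-default.xml. getLongBytes also accepts
        // unit-suffixed values such as "128m".
        long blockSize = conf.getLongBytes("dfs.blocksize", 134217728L);
        System.out.println("dfs.blocksize = " + blockSize);

        // Override the block size for files created through this client (256M here).
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
    }
}
```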

copy mechanism

  • Every block of a file has replicas. The replication factor can be specified when the file is created, and can be changed later by command (see the sketch below).
  • The number of replicas is controlled by the parameter dfs.replication, with a default value of 3: two extra copies are made, which together with the original give 3 replicas.
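
A brief Java sketch of both knobs just described: setting dfs.replication for newly created files, and changing the replication factor of an existing file by API call (the shell equivalent is hdfs dfs -setrep). The path and factor of 2 are assumptions for illustration.

```java
// Controlling the replication factor per client and per existing file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 2); // replication for files this client creates

        FileSystem fs = FileSystem.get(conf);
        // Change the replication factor of an already-written file.
        fs.setReplication(new Path("/demo/hello.txt"), (short) 2);
        fs.close();
    }
}
```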

metadata management

In HDFS, there are two types of metadata managed by Namenode:

  • Attribute information of the file itself
    File name, permission, modification time, file size, replication factor, data block size
  • File block location mapping information
    Records the mapping between file blocks and Datanodes, i.e. which block is located on which node (the sketch below reads back both kinds)
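
Both kinds of metadata can be read back through the Java API, as in this sketch (the path is an assumed placeholder): FileStatus carries the file's own attributes, and getFileBlockLocations exposes the block-to-Datanode mapping.

```java
// Reading both kinds of Namenode metadata for one file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus st = fs.getFileStatus(new Path("/demo/hello.txt"));

        // Attribute information of the file itself.
        System.out.println("size        = " + st.getLen());
        System.out.println("permission  = " + st.getPermission());
        System.out.println("modified    = " + st.getModificationTime());
        System.out.println("replication = " + st.getReplication());
        System.out.println("block size  = " + st.getBlockSize());

        // File block location mapping: which block lives on which Datanode.
        for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
            System.out.println(loc.getOffset() + "+" + loc.getLength()
                    + " on " + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}
```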

namespace

  • HDFS supports the traditional hierarchical file organization structure. Users can create directories and save files in those directories. The hierarchy of the filesystem namespace is similar to that of most existing filesystems: users can create, delete, move, or rename files.
  • The Namenode is responsible for maintaining the file system namespace; any modification to the namespace or its properties is recorded by the Namenode
  • HDFS provides clients with a unified abstract directory tree, and clients access files by path (see the sketch below).
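
The namespace operations listed above (create, delete, move/rename) map directly onto FileSystem calls. A minimal sketch, with assumed paths:

```java
// Namespace operations on the abstract directory tree.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/demo/reports"));            // create a directory
        fs.rename(new Path("/demo/hello.txt"),
                  new Path("/demo/reports/hello.txt"));  // move/rename a file
        fs.delete(new Path("/demo/reports"), true);      // recursive delete
        fs.close();
    }
}
```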

Data block storage

  • The concrete storage management of each block of a file is undertaken by the Datanode nodes
  • Each block can be stored on multiple Datanodes

Origin: blog.csdn.net/JAX_fire/article/details/126029676