The Road to Java Big Data--HDFS Detailed Explanation (1)--Overview

HDFS (Hadoop Distributed File System)--Overview

Table of contents

HDFS (Hadoop Distributed File System)--Overview

I. Overview

II. Features

Advantages

Disadvantages


I. Overview

  1. The full name is Hadoop Distributed File System, Hadoop's distributed file storage system
  2. HDFS was designed according to Google's paper "The Google File System"
  3. It is itself a distributed, scalable, and reliable file system
  4. HDFS consists of three main processes: NameNode, DataNode, and SecondaryNameNode. These processes generally run on different hosts, so it is customary to refer to a node by the name of the process it runs
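
The division of labor among these roles can be illustrated with a small, purely hypothetical Java sketch (this is not the Hadoop API): the "NameNode" side holds only metadata (which block IDs make up each file), while the "DataNode" side holds the actual block bytes. A client write splits the payload into fixed-size blocks; a read asks the name-node side for the block list, then fetches each block.

```java
import java.io.ByteArrayOutputStream;
import java.util.*;

// Hypothetical sketch of the HDFS role split (NOT the real Hadoop API):
// the "NameNode" keeps only metadata (file name -> block ids), while the
// "DataNode" side keeps the actual block bytes.
public class HdfsRolesSketch {
    static final int BLOCK_SIZE = 4; // bytes, tiny for demonstration (real HDFS blocks are 128 MB)

    static Map<String, List<Integer>> nameNodeMeta = new HashMap<>(); // file -> block ids
    static Map<Integer, byte[]> dataNodeBlocks = new HashMap<>();     // block id -> block data
    static int nextBlockId = 0;

    // "Client write": split the payload into fixed-size blocks, store the blocks
    // on the data-node side, and record only the block ids on the name-node side.
    static void write(String file, byte[] payload) {
        List<Integer> ids = new ArrayList<>();
        for (int off = 0; off < payload.length; off += BLOCK_SIZE) {
            byte[] block = Arrays.copyOfRange(payload, off,
                    Math.min(off + BLOCK_SIZE, payload.length));
            dataNodeBlocks.put(nextBlockId, block);
            ids.add(nextBlockId++);
        }
        nameNodeMeta.put(file, ids);
    }

    // "Client read": ask the name node for the block ids, then fetch each block.
    static byte[] read(String file) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int id : nameNodeMeta.get(file)) {
            byte[] block = dataNodeBlocks.get(id);
            out.write(block, 0, block.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        write("/logs/a.txt", "hello hdfs".getBytes());
        System.out.println(nameNodeMeta.get("/logs/a.txt").size()); // 10 bytes / 4-byte blocks -> 3
        System.out.println(new String(read("/logs/a.txt")));        // hello hdfs
    }
}
```

The key point of the design survives even in this toy: the metadata server never touches file contents, which is what lets a single NameNode coordinate thousands of DataNodes.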

II. Features

Advantages:

  1. Support for very large files. "Very large" here means files of hundreds of MB, hundreds of GB, or even several TB. Hadoop's file system typically stores TB- or even PB-scale data, so enterprise deployments may contain thousands of data nodes
  2. Detects and responds quickly to hardware failure. In a cluster of thousands of interconnected servers, hardware failure is routine, so failure detection and automatic recovery (via a heartbeat mechanism) is an explicit design goal of HDFS
  3. Streaming data access. HDFS workloads are large in scale, and applications access large amounts of data at a time, generally as batch jobs rather than interactive sessions. Applications read datasets as streams; the priority is sustained throughput, not access latency
  4. Simplified consistency model. Most HDFS files are written once and read many times: once a file is created, written, and closed, it generally does not need to be modified. This simple model helps maximize throughput
  5. High fault tolerance. Data is automatically stored in multiple replicas, and lost replicas are automatically restored
  6. Can be built on cheap machines. Because it runs on commodity hardware, the cluster's storage capacity can be increased almost linearly simply by adding machines
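
The heartbeat-based failure detection mentioned in point 2 can be sketched as follows. The timeout value here is illustrative only; in real HDFS, DataNodes heartbeat every few seconds and the NameNode declares a silent node dead after a much longer configurable interval, then re-replicates its blocks.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of heartbeat-based failure detection. DataNodes report
// to the NameNode periodically; a node that stays silent past a timeout is
// declared dead, and (in real HDFS) its blocks are re-replicated elsewhere.
public class HeartbeatSketch {
    static final long TIMEOUT_MS = 10_000; // illustrative; the real default is far longer

    // Last heartbeat timestamp per DataNode.
    static Map<String, Long> lastSeen = new HashMap<>();

    static void heartbeat(String node, long nowMs) {
        lastSeen.put(node, nowMs);
    }

    // A node is considered dead once its last heartbeat is older than the timeout.
    static boolean isDead(String node, long nowMs) {
        Long seen = lastSeen.get(node);
        return seen == null || nowMs - seen > TIMEOUT_MS;
    }

    public static void main(String[] args) {
        heartbeat("dn1", 0);
        heartbeat("dn2", 0);
        heartbeat("dn1", 9_000);                   // dn1 keeps reporting; dn2 goes silent
        System.out.println(isDead("dn1", 12_000)); // false
        System.out.println(isDead("dn2", 12_000)); // true
    }
}
```

Because every liveness decision is driven by the DataNodes pushing heartbeats rather than the NameNode polling, the scheme scales to thousands of nodes with minimal NameNode effort.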

Disadvantages:

  1. No low-latency data access. Interactive applications need responses within milliseconds or seconds, but Hadoop is optimized for throughput over massive data at the cost of latency, so it is a poor fit for low-latency use cases
  2. Not suitable for storing large numbers of small files. HDFS supports very large files by distributing the data across data nodes while keeping the metadata on the name node, so the name node's memory bounds the number of files the file system can hold. Even with today's large memories, a large number of small files will still degrade name-node performance
  3. No support for concurrent multi-user writes or file modification. An HDFS file can only be written once; in-place modification is not supported, and appending was only added in version 2.0. This restriction is part of what keeps data throughput high
  4. No strong transaction support. Unlike a relational database, HDFS offers no strong transactional guarantees; given the sheer volume of data, the loss of a single block does not compromise the whole dataset
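
The small-files limitation in point 2 can be made concrete with a back-of-the-envelope estimate. A commonly cited approximation is that each namespace object (file, directory, or block) costs roughly 150 bytes of NameNode heap; the exact figure varies by Hadoop version, so treat the numbers below as illustrative.

```java
// Back-of-the-envelope sketch of why small files strain the NameNode.
// Assumption (approximate, version-dependent): ~150 bytes of NameNode heap
// per namespace object (file, directory, or block).
public class SmallFilesEstimate {
    static final long BYTES_PER_OBJECT = 150;

    // One file of n blocks costs (1 file object + n block objects) on the NameNode.
    static long metadataBytes(long files, long blocksPerFile) {
        return files * (1 + blocksPerFile) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // Roughly the same 1 GB of data, with a 128 MB block size:
        long oneBigFile = metadataBytes(1, 8);         // one 1 GB file -> 8 blocks
        long manySmall  = metadataBytes(1_000_000, 1); // a million 1 KB files -> 1 block each
        System.out.println(oneBigFile); // 1350
        System.out.println(manySmall);  // 300000000 (~300 MB of NameNode heap)
    }
}
```

The data volume is identical in both cases, yet the small-file layout costs the NameNode over five orders of magnitude more memory, which is exactly why the metadata server becomes the bottleneck long before the disks do.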

Origin: blog.csdn.net/a34651714/article/details/102812441