Hadoop-HDFS summary (1)

HDFS advantages

1. High fault tolerance:
(1) Data is automatically stored as multiple replicas, which improves fault tolerance.
(2) When a replica is lost, it is automatically restored.
2. Suitable for processing large amounts of data
(1) Data scale: It can handle GB, TB, or even PB level data.
(2) File scale: It can handle a very large number of files (on the order of millions or more).
3. It can be built on cheap machines, with reliability provided by the multi-replica mechanism.

HDFS disadvantages

1. Not suitable for low-latency data access, such as access at the millisecond level.
2. Unable to store a large number of small files efficiently.
(1) Every file, however small, requires directory and block information to be held in NameNode memory. Storing a large number of small files therefore consumes a large amount of NameNode memory, which is limited.
(2) For small files, the addressing time exceeds the time spent reading the data, which goes against the design goal of HDFS.
3. Does not support concurrent writes to a file or random modification of files.
(1) A file can have only one writer at a time; multiple threads cannot write to it simultaneously.
(2) Only appending data to a file is supported; random modification of file contents is not.
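
To make the small-file drawback in item 2 concrete, here is a rough back-of-the-envelope sketch. It assumes about 150 bytes of NameNode heap per namespace object (one per file and one per block). That figure is a commonly quoted rule of thumb, not a number from this article, and the function name is hypothetical.

```python
# Rough estimate of NameNode memory consumed by file and block metadata.
# ASSUMPTION: ~150 bytes per namespace object (file entry or block entry).
# This is a widely quoted rule of thumb, not a value taken from this article.
BYTES_PER_OBJECT = 150
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the 2.x default


def namenode_metadata_bytes(total_bytes: int, file_size: int) -> int:
    """Estimate NameNode memory needed to store total_bytes as files of file_size bytes."""
    num_files = -(-total_bytes // file_size)       # ceiling division
    blocks_per_file = -(-file_size // BLOCK_SIZE)  # every file needs at least one block
    objects = num_files * (1 + blocks_per_file)    # one file entry plus its block entries
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    one_tb = 1024 ** 4
    small = namenode_metadata_bytes(one_tb, 1024 ** 2)  # 1 TB stored as 1 MB files
    large = namenode_metadata_bytes(one_tb, 1024 ** 3)  # 1 TB stored as 1 GB files
    print(f"1 MB files: {small / 1024 ** 2:.1f} MiB of metadata")
    print(f"1 GB files: {large / 1024 ** 2:.2f} MiB of metadata")
```

Under these assumptions the same amount of data costs a few hundred times more NameNode memory when it is stored as 1 MB files instead of 1 GB files.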

HDFS composition architecture

1. NameNode: The master, which acts as the supervisor and manager.
(1) Manage the HDFS name space.
(2) Configure the replica policy.
(3) Manage data block mapping information.
(4) Handle client read and write requests.
2. DataNode: The slave. The NameNode issues commands and the DataNode carries out the actual operations.
(1) Store the actual data blocks.
(2) Perform data block read/write operations.
3. Client: The client through which users access HDFS.
(1) File segmentation. When a file is uploaded to HDFS, the Client divides it into blocks and uploads them.
(2) Interact with the NameNode to obtain file location information.
(3) Interact with DataNodes to read or write data.
(4) The Client provides commands for accessing HDFS, such as creating, deleting, modifying, and querying files.
(5) The Client provides commands for managing HDFS, such as formatting the NameNode.
4. Secondary NameNode: Not a hot standby of the NameNode. When the NameNode goes down, it cannot immediately take over and provide services.
(1) Assist the NameNode and share its workload, for example by periodically merging the fsimage and the edit log.
(2) In an emergency, it can assist in restoring the NameNode.
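
The file segmentation step described under the Client can be sketched as follows. This is a toy illustration of the arithmetic only, not the real HDFS client; the function name is hypothetical and the block size assumes the 128 MB default.

```python
from typing import List, Tuple

# Toy illustration of client-side file segmentation (not the real HDFS client).
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, assumed default block size


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    return [
        (offset, min(block_size, file_size - offset))  # the last block may be shorter
        for offset in range(0, file_size, block_size)
    ]


if __name__ == "__main__":
    mb = 1024 * 1024
    for offset, length in split_into_blocks(300 * mb):
        print(f"block at offset {offset // mb} MB, length {length // mb} MB")
```

A 300 MB file is split into two full 128 MB blocks and one final 44 MB block.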

HDFS file block size

Files in HDFS are physically stored in blocks (Block). The block size can be set with the configuration parameter dfs.blocksize. The default is 128 MB in Hadoop 2.x and 64 MB in older versions.
Suppose the addressing time, that is, the time needed to locate the target block, is about 10 ms.
The ideal state is for the addressing time to be 1% of the transmission time. The transmission time is therefore 10 ms / 0.01 = 1000 ms = 1 s.
Current disk transmission rates are generally around 100 MB/s.
The block size is therefore 1 s * 100 MB/s = 100 MB, which is close to the 128 MB default.
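
The derivation above can be reproduced with a few lines of Python. The function names are hypothetical, and rounding up to the next power of two is only an illustrative assumption for relating the result to the 128 MB default; the article itself does not state that step.

```python
def block_size_mb(seek_time_ms: float, seek_ratio: float, transfer_rate_mb_s: float) -> float:
    """Block size (MB) for which seek time is seek_ratio of the transmission time."""
    transfer_time_s = (seek_time_ms / seek_ratio) / 1000.0  # 10 ms / 0.01 = 1000 ms = 1 s
    return transfer_time_s * transfer_rate_mb_s


def next_power_of_two(n: float) -> int:
    """Smallest power of two >= n (illustrative rounding step, an assumption)."""
    power = 1
    while power < n:
        power *= 2
    return power


if __name__ == "__main__":
    size = block_size_mb(seek_time_ms=10, seek_ratio=0.01, transfer_rate_mb_s=100)
    print(f"computed block size: {size:.0f} MB")
    print(f"next power of two:   {next_power_of_two(size)} MB")
```

With a faster disk, the same calculation gives a larger block size, for example a transfer rate of 200 MB/s yields 200 MB.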
! If the HDFS block size is set too small, the addressing overhead increases, and the program spends most of its time locating the start of blocks.
! If the block size is set too large, the time to transfer the data from disk becomes much longer than the time to locate the start of the block, and the program is very slow when processing that block.
! The HDFS block size setting depends mainly on the disk transfer rate.
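
The trade-off in the notes above can be seen by comparing addressing time with transmission time for several block sizes. This is a minimal sketch that reuses the article's example figures (10 ms addressing time, 100 MB/s transfer rate); the constant and function names are hypothetical.

```python
SEEK_TIME_S = 0.010          # 10 ms addressing time, as in the example above
TRANSFER_RATE_MB_S = 100.0   # 100 MB/s disk transmission rate, as in the example above


def transfer_time_s(block_size_mb: float) -> float:
    """Time to read one full block from disk, in seconds."""
    return block_size_mb / TRANSFER_RATE_MB_S


def seek_to_transfer_ratio(block_size_mb: float) -> float:
    """Addressing time divided by transmission time for one block."""
    return SEEK_TIME_S / transfer_time_s(block_size_mb)


if __name__ == "__main__":
    for size_mb in (1, 16, 100, 128, 1024):
        ratio = seek_to_transfer_ratio(size_mb)
        print(f"{size_mb:>5} MB block: transfer {transfer_time_s(size_mb):6.2f} s, "
              f"seek/transfer = {ratio:.2%}")
```

Very small blocks push the ratio toward 100%, so addressing dominates, while very large blocks drive the ratio toward zero but make each block take many seconds to read.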

Origin blog.csdn.net/qq_45092505/article/details/104913046