Hadoop HDFS distributed file system theory (1)

Why HDFS was developed

To better support distributed computing.
The Hadoop Distributed File System (HDFS) is a distributed file system that operates at the granularity of files: creation and deletion happen at the file level.

Storage model

  • A file is split linearly, by bytes, into blocks, each identified by an id and an offset.
    The offset is the block's starting position within the file; with a block size of 10, for example, the offsets are 0, 10, 20, 30, and so on.
    The id is the block's name, such as block1, block2, and so on.
  • Block size can differ between files.
    For example, file A may use a block size of 10 bytes while file B uses 20 bytes.
  • Within one file, all blocks have the same size except possibly the last.
    Splitting inevitably cuts data apart; the file is recovered later by splicing its blocks back together in order.
  • Block size should be tuned to the I/O characteristics of the underlying hardware.
  • Blocks are scattered across the nodes of the cluster.
    A block's location records which node in the cluster stores it.
  • Blocks have replicas (replication), and replicas are peers: there is no master-slave relationship among them, and two replicas of the same block never reside on the same node.
    With 3 replicas, for example, each block is stored in three copies, and any of the three can be read.
  • Replicas are the key to both reliability and performance.
    Multiple replicas allow multiple programs to compute in parallel, moving the computation to the data instead of pulling files across the network.
  • Block size and replica count can be specified at upload time; after upload, only the replica count can be changed (see the sketch after this list).
    The block size cannot be adjusted once the file is uploaded, but the number of replicas per block can.
  • Write once, read many: in-place modification is not supported.
    Files can only be read, never modified, but data can be appended.
    Why: modifying a block's data would shift the offsets of all subsequent blocks, causing a cascade of updates that wastes resources (CPU, network bandwidth, I/O, and so on). This is a deliberate trade-off: HDFS exists to support distributed computing well, so it avoids flooding the cluster with work just to allow in-place modification.
  • Appending data is supported.
    An append does not shift the offsets of any other block; only the last block changes, which causes no cascade of updates, so appends are allowed.
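As a concrete illustration of these knobs, here is a minimal sketch using the standard Hadoop Java client; the path and the chosen values are hypothetical, and on older versions append must be enabled (dfs.support.append):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadWithBlockSettings {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/demo/data.txt");        // hypothetical path

        // Block size and replica count are chosen per file at creation time:
        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(file, true, 4096,
                (short) 3,                             // 3 replicas
                128L * 1024 * 1024)) {                 // 128 MB blocks
            out.writeBytes("hello hdfs\n");
        }

        // After upload only the replica count can change; block size is fixed.
        fs.setReplication(file, (short) 2);

        // Appends are allowed because they touch only the last block.
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("appended line\n");
        }
    }
}
```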

Architecture design

  • HDFS is a master/slaves architecture, in which the master and the slaves
    collaborate to complete work.
    Master-slave is not the same as master-backup: in a master-backup (active-standby) pair, only one node works at a time, and the standby switches to master when the active master fails; the standby is really just a backup.
  • It consists of a NameNode and some DataNodes.
    NameNode is the master and DataNode is the slave.
  • A file comprises two things: its data (data) and its metadata (metadata).
  • NameNode is responsible for storing and managing file metadata, and maintaining a hierarchical file directory tree
    similar to the file directory tree of Linux.
  • DataNodes are responsible for storing file data (blocks) and serving block reads
    and writes. A client first asks the NameNode where to go, and then accesses the corresponding DataNode directly.
    The NameNode acts like a switchboard, directing requests to the right nodes; it returns locations rather than relaying the data itself.
  • Each DataNode maintains a heartbeat with the NameNode and reports the block information it holds.
    The NameNode uses these reports to confirm whether a file's blocks are fully stored.
  • The client exchanges file metadata with the NameNode, and exchanges file block data with the DataNodes (see the sketch below).
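A small sketch of this division of labor, using the standard Hadoop Java client (the path is hypothetical): the client asks the NameNode for block metadata, then streams the bytes themselves from DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataVsData {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/demo/data.txt");   // hypothetical path

        // Metadata from the NameNode: which blocks exist, and on which DataNodes.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    loc.getOffset(), loc.getLength(), String.join(",", loc.getHosts()));
        }

        // The data itself is streamed from DataNodes; the NameNode never relays it.
        byte[] buf = new byte[16];
        try (FSDataInputStream in = fs.open(file)) {
            int n = in.read(buf);
            System.out.println("read " + n + " bytes from a DataNode");
        }
    }
}
```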

Role functions

NameNode

  • Stores file metadata, the directory tree structure, and the file-to-block mapping entirely in memory.
    Metadata lives in memory for fast access; memory I/O is on the order of 100,000 times faster than disk I/O.
  • Because memory is volatile, a persistence scheme is needed to keep the metadata reliable.
  • Provides the replica placement strategy.

DataNode

  • Stores blocks on its local disk, each block as an ordinary file.
    If a file is split into 10 blocks, those become 10 small files distributed across different DataNodes.
  • Also stores a checksum for each block to guarantee the block's integrity, ensuring the data
    is not corrupted in transit: after downloading a block, the client computes its checksum and compares it with the one kept on the DataNode; if they match, the data is intact (see the sketch after this list).
  • Maintains a heartbeat with the NameNode and reports the status of its block list.
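HDFS checksums data in 512-byte chunks (the dfs.bytes-per-checksum default); the standalone sketch below only illustrates the verify-on-read idea with plain CRC32 and is not the actual HDFS implementation.

```java
import java.util.zip.CRC32;

public class ChunkChecksums {
    static final int BYTES_PER_CHECKSUM = 512;   // HDFS default chunk size

    // One CRC per 512-byte chunk, as computed when the block is written.
    static long[] compute(byte[] block) {
        int chunks = (block.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            CRC32 crc = new CRC32();
            int off = i * BYTES_PER_CHECKSUM;
            crc.update(block, off, Math.min(BYTES_PER_CHECKSUM, block.length - off));
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // On read, recompute and compare: any mismatch means a corrupt chunk.
    static boolean verify(byte[] block, long[] stored) {
        long[] fresh = compute(block);
        if (fresh.length != stored.length) return false;
        for (int i = 0; i < fresh.length; i++) {
            if (fresh[i] != stored[i]) return false;
        }
        return true;
    }
}
```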

Metadata persistence

  • Any operation that modifies file system metadata is recorded by the NameNode in a transaction log called the EditLog.
  • The FsImage stores a snapshot of the entire in-memory metadata state.
  • Both the EditLog and the FsImage are kept on local disk.
  • The EditLog is complete and loses little data, but recovery from it is slow and it keeps growing over time.
  • The FsImage recovers quickly and is about the size of the in-memory data, but it cannot be written in real time, so on its own it would lose a lot of data.
  • The NameNode therefore combines FsImage and EditLog:
    the incremental EditLog is rolled into the FsImage on a regular basis, keeping the FsImage close to the current point in time and the EditLog small.

The role of log files

A log records each create, delete, and update operation as it happens, and earlier state can be restored by replaying the log.
Advantage: because it is written in real time, the log's integrity is good.
Disadvantage: replaying it to restore data is slow, and it occupies a lot of space.

The role of mirrors, snapshots, dumps, and db files

A snapshot writes the entire in-memory state to disk at a chosen point in time, essentially serializing all of memory to disk.
The interval between snapshots cannot be too short, because disk I/O is slow.
Advantage: usually stored as a binary file, so recovery is fast.
Disadvantage: because it is saved only at intervals, any data written since the last snapshot is easily lost.

How HDFS metadata is persisted

Through a combination of image (FsImage) and log (EditLog).
How does HDFS use them together?
EditLog: keep the log file as small as possible, holding as few records as possible.
FsImage: keep the image as fresh as possible by regenerating FsImage files regularly (rolling checkpoints).

Metadata is persisted as the latest FsImage plus the incremental EditLog.
For example, if it is now 10 o'clock,
recovery uses the 9 o'clock FsImage plus the incremental EditLog
covering 9 o'clock to 10 o'clock:
1. Load the FsImage.
2. Replay the EditLog.
3. Memory then holds the full state as it was before shutdown.
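A toy sketch of this checkpoint-plus-log recovery pattern (not the NameNode's actual code; the Edit type and the string-valued state here are invented purely for illustration):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CheckpointRecovery {
    // Hypothetical stand-in for one logged operation.
    record Edit(String op, String path) {}

    static Map<String, String> recover(Map<String, String> fsImage, List<Edit> editLog) {
        Map<String, String> state = new HashMap<>(fsImage);  // 1. load the FsImage snapshot
        for (Edit e : editLog) {                             // 2. replay the EditLog in order
            switch (e.op()) {
                case "create" -> state.put(e.path(), "");
                case "delete" -> state.remove(e.path());
                default -> { /* other operations elided */ }
            }
        }
        return state;                                        // 3. full pre-shutdown state restored
    }
}
```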

Safe mode

  • HDFS is formatted when the cluster is first built, and formatting generates an empty FsImage.

  • When the NameNode starts, it reads the EditLog and FsImage from disk,

  • applies all transactions from the EditLog to the in-memory FsImage,

  • saves this new version of the FsImage from memory to local disk,

  • and then deletes the old EditLog, because its transactions have already been applied to the FsImage.

  • After starting, the NameNode enters a special state called safe mode.

  • A NameNode in safe mode does not replicate data blocks.
    For example, file metadata includes both the file's attributes and the list of DataNodes holding each of its blocks.
    During persistence, the file attributes are persisted, but the locations of the file's blocks are not. So when the NameNode restarts and recovers, it has no block location information.
    This is by design: in distributed storage, data consistency is critical. If locations were persisted but some DataNode happened to be down when the cluster started, clients would hit errors when downloading blocks from it. Instead, each DataNode reports the blocks it holds to the NameNode at startup.

  • The NameNode receives heartbeat signals and block status reports from all DataNodes.

  • Whenever the NameNode detects that a data block's replica count has reached the configured minimum, that block is considered safely replicated.

  • Once a certain (configurable) percentage of data blocks has been confirmed safe, plus an additional 30-second wait, the NameNode exits safe mode.

  • It then determines which data blocks are still under-replicated and copies those blocks to other DataNodes.
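A client can ask whether the NameNode is still in safe mode; a minimal sketch, assuming the default filesystem is HDFS:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SafeModeCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        if (fs instanceof DistributedFileSystem dfs) {
            // True while the NameNode is still in safe mode (metadata is read-only).
            System.out.println("in safe mode: " + dfs.isInSafeMode());
        }
    }
}
```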

SNN in HDFS

SecondaryNameNode (SNN)

  • In non-HA mode, the SNN usually runs as an independent node; it periodically merges the NN's EditLog into its FsImage, reducing both the EditLog's size and the NN's startup time.
  • The checkpoint interval is set by fs.checkpoint.period in the configuration file and defaults to 3600 seconds (one hour).
  • The edits log size threshold is set by fs.checkpoint.size; by default, a checkpoint is also triggered when the edits file reaches 64 MB.

Block replica placement strategy

  • First replica: placed on the DataNode where the uploading client runs; if the upload is submitted from outside the cluster, a node whose disk is not too full and whose CPU is not too busy is chosen at random.
  • Second replica: Placed on a node on a different rack than the first replica.
  • Third replica: placed on a node in the same rack as the second replica.
    This reduces cost: with replicas 2 and 3 in the same rack, the write pipeline does not have to cross racks again, reducing network I/O.
  • More copies: random nodes.

Reading and writing process


  • The client first gets a distance-sorted list of DNs from the NN.
  • The client opens a TCP connection to the nearest DN; the first DN opens a TCP connection to the second DN, and the second DN opens one to the third, forming a pipeline. In other words, the client holds a TCP connection to only one DN. The client cuts the file into blocks and streams a block to the first DN; once that block is sent, it starts on the next block without waiting for the second and third DNs to finish, because each downstream DN is fed by the DN before it. The pipeline is, in effect, a variant of parallelism.

Detailed HDFS writing process

The client asks the NN for the replica locations of one block at a time: after it finishes sending a block, it asks the NN again for the locations of the next block.
The following is the detailed process by which the client sends one block.

  • The client contacts the NN to create the file's metadata.
  • The NN checks whether the metadata is valid.
  • The NN runs the replica placement strategy and returns an ordered list of DNs.
  • The client establishes a pipeline connection through the DNs.
  • The client splits the block into packets (64 KB) and fills each packet with chunks (512 B) plus checksums (4 B each);
    that is, the client sends one packet at a time.
    64 KB ≈ (512 B + 4 B) × n: each 64 KB packet is divided into n chunk+checksum pairs (see the sketch after this list).
  • The client places each packet on its send queue (the dataQueue) and transmits it to the first DN.
  • After receiving the packet, the first DN saves it locally and sends it to the second DN.
  • After receiving the packet, the second DN saves it locally and sends it to the third DN.
  • While this happens, the upstream node is already sending the next packet.
  • Like an assembly line in a factory; conclusion: streaming is really a variant of parallel computing.
  • HDFS uses this transmission method, and the number of copies is transparent to the client.
  • When the block transmission is completed, the DNs each report to the NN, and the client continues to transmit the next block.
  • Therefore, client transmission and block reporting are also parallel.
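To make the packet arithmetic concrete, a tiny sketch using the 64 KB / 512 B / 4 B figures quoted above:

```java
public class PacketMath {
    public static void main(String[] args) {
        int packetSize = 64 * 1024;  // 64 KB per packet
        int chunk = 512;             // data bytes per chunk
        int checksum = 4;            // checksum bytes per chunk

        int pairs = packetSize / (chunk + checksum);          // chunk+checksum pairs per packet
        int payload = pairs * chunk;                          // data bytes actually carried
        System.out.println("pairs per packet: " + pairs);     // 127
        System.out.println("payload bytes:    " + payload);   // 65024 of 65536
    }
}
```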

HDFS reading process

  • To reduce overall bandwidth consumption and read latency, HDFS tries to have the reader read the replica closest to it.
  • If there is a replica on the same rack as the reader, then that replica is read.
  • If an HDFS cluster spans multiple data centers, the client will also read the local data center copy first.
  • Semantics of downloading a file:
    • The client exchanges file metadata with the NN to obtain
      the locations (fileBlockLocation) of all of the file's blocks.
    • The NN returns the locations sorted by its distance strategy.
    • The client attempts to download the blocks and verifies data integrity.
  • Semantics: downloading a file really means fetching all of the blocks that make it up, so fetching only a subset of the blocks should also be possible.
    • HDFS lets the client supply a file offset, choose which blocks (and therefore which DNs) to connect to, and fetch exactly the data it needs (see the sketch below).
    • This is the core capability that enables divide-and-conquer and parallel computing in the computing layer.
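A minimal sketch of offset-based reading with the standard Hadoop Java client (the path and offset are hypothetical): each compute task can seek to its own block boundary and read just that slice.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadOneBlockSlice {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/demo/data.txt");   // hypothetical path
        long blockStart = 128L * 1024 * 1024;     // e.g. where the second 128 MB block begins

        byte[] buf = new byte[4096];
        try (FSDataInputStream in = fs.open(file)) {
            in.seek(blockStart);                  // jump straight to this task's block
            int n = in.read(buf);                 // served by a DN that holds that block
            System.out.println("read " + n + " bytes at offset " + blockStart);
        }
    }
}
```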

Origin blog.csdn.net/tianzhonghaoqing/article/details/131364448