1. The overall architecture of HDFS
- Quick vocabulary:
  - Client: any end that accesses HDFS through the API or HDFS commands can be regarded as a client.
  - Rack: the replica placement strategy is rack-aware.
  - Block size: 128 MB by default in Hadoop 2.x (e.g. 2.7.3); 64 MB by default in Hadoop 1.x.
2. The relationship between block, packet, and chunk
- Block, packet, and chunk are all data storage units involved in HDFS.
- They can be configured in Hadoop's own XML files: core-site.xml and hdfs-site.xml. When you are not sure which item to change, consult core-default.xml, hdfs-default.xml, and the other *-default.xml files, which list every property and its default value.
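As a concrete illustration, overriding the sizes discussed below might look like the fragment here. The property names are those documented in the Hadoop 2.x *-default.xml files; the values shown are the defaults, so check your release's documentation before relying on them:

```xml
<!-- hdfs-site.xml -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- block size: 128 MB -->
</property>
<property>
  <name>dfs.client-write-packet-size</name>
  <value>65536</value> <!-- packet size: 64 KB -->
</property>

<!-- core-site.xml -->
<property>
  <name>io.bytes.per.checksum</name>
  <value>512</value> <!-- chunk size: 512 bytes -->
</property>
```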
① block
- Block is the unit of file partitioning in HDFS. A file smaller than the block size still occupies a block, but only its actual length on disk: size is the real size of the file, while blocksize is the configured block size.
- You can modify the default block size through the dfs.blocksize (formerly dfs.block.size) configuration item in hdfs-site.xml.
- The relationship between block size, disk addressing time, and transmission time:
  - The larger the block, the smaller the share of total time spent on disk addressing, and the longer each block takes to transmit.
  - The smaller the block, the larger the share of total time spent on disk addressing, and the shorter each block takes to transmit.
- If the block size is set too small:
  - NameNode memory overload: a large number of small blocks means a large amount of small-file metadata held in NameNode memory, which can overload it.
  - Addressing time too long: with small blocks, the disk spends proportionally more time seeking to the beginning of each block instead of transferring data.
- If the block size is set too large:
  - Map task time too long: in MapReduce, a Map task usually processes one data block at a time, so an oversized block makes each Map task run too long.
  - Data transmission time too long: the time to transmit a single block far exceeds its addressing time, and moving such large units around affects data processing speed.
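The addressing-versus-transfer trade-off above can be made concrete with rough, assumed numbers (a 10 ms seek and a 100 MB/s disk are illustrative figures, not measurements):

```python
# Illustrative, assumed hardware numbers (not measurements).
SEEK_MS = 10.0          # time for one disk seek
TRANSFER_MB_S = 100.0   # sustained disk transfer rate

def read_time_ms(file_mb, block_mb):
    """Rough total read time: one seek per block plus pure transfer time."""
    seeks = file_mb / block_mb
    return seeks * SEEK_MS + (file_mb / TRANSFER_MB_S) * 1000

# Reading 1 GB with 128 MB blocks: 8 seeks, transfer-dominated.
print(read_time_ms(1024, 128))   # 10320.0 ms
# Reading 1 GB with 1 MB blocks: 1024 seeks double the total time.
print(read_time_ms(1024, 1))     # 20480.0 ms
```

With large blocks the seek cost is almost invisible; with tiny blocks it rivals the transfer time itself, which is the "addressing time is too long" problem above.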
② packet
- Packet is the second-largest unit. It is the basic unit of data transmission between the DFSClient and a DataNode, or between DataNodes in a pipeline. The default size is 64 KB.
- You can modify the default packet size through the dfs.write.packet.size (dfs.client-write-packet-size in newer releases) configuration item in hdfs-site.xml.
③ chunk
- Chunk is the smallest unit. It is the basic unit of data checksumming between the DFSClient and a DataNode, or between DataNodes in a pipeline. The default size is 512 bytes.
- You can modify the default chunk size through the io.bytes.per.checksum configuration item in core-site.xml.
- As the basic unit of data verification, each chunk carries 4 bytes of checksum information. So a chunk actually written into a packet occupies 516 bytes, and the ratio of real data to checksum data is 512 : 4 = 128 : 1.
- Example: a 128 MB file is divided into 128 MB / 512 B = 262,144 chunks, which together carry 262,144 × 4 B = 1 MB of checksum data.
- Summary of the three:
  - chunk is the basic unit of data checksumming between the DFSClient and a DataNode, or between DataNodes in the pipeline; each chunk carries 4 bytes of checksum information.
  - packet is the basic unit of data transmission between the DFSClient and a DataNode, or between DataNodes in the pipeline; each chunk written into a packet actually occupies 516 bytes.
  - block is the unit of file partitioning; many packets make up one block. A small file takes less than one block of data but still occupies a metadata slot, which can overload NameNode memory.
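The sizes quoted in this section can be checked with a few lines of arithmetic:

```python
CHUNK = 512                 # io.bytes.per.checksum: payload bytes per chunk
CHECKSUM = 4                # checksum bytes carried by each chunk
BLOCK = 128 * 1024 * 1024   # dfs.blocksize: 128 MB

chunks_per_block = BLOCK // CHUNK                 # chunks in one full block
checksum_per_block = chunks_per_block * CHECKSUM  # checksum bytes per block

print(CHUNK + CHECKSUM)     # 516 bytes actually written per chunk
print(CHUNK // CHECKSUM)    # 128 : 1 ratio of real data to checksum data
print(chunks_per_block)     # 262144 chunks in a 128 MB block
print(checksum_per_block)   # 1048576 bytes = exactly 1 MB of checksums
```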
④ Three-layer buffer in the writing process
- The writing process involves three buffers at three granularities: the chunk buffer, the packet, and the DataQueue:
  - As data flows into DFSOutputStream, it first fills a chunk-sized buffer. When the buffer fills up, or a forced flush() occurs, a checksum is calculated.
  - The chunk and its checksum are written into the current packet together. When enough chunks have filled the packet, the packet enters the DataQueue.
  - The DataStreamer thread takes packets from the DataQueue and sends them to the DataNode; each packet not yet confirmed as written is moved to the AckQueue to await confirmation.
  - If the DataNode's ack arrives (write successful), the ResponseProcessor removes the packet from the AckQueue; otherwise the packet is moved back to the DataQueue to be rewritten.
(Figure: the three-layer buffer in the write path)
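The queue handoffs above can be sketched as a toy model. This is an illustration only, not the real DFSOutputStream: the checksum, packet size, and threading are all simplified, and the class name is invented:

```python
from collections import deque

CHUNK = 512               # bytes of payload per chunk
CHUNKS_PER_PACKET = 126   # roughly 64 KB / 516 B per stored chunk (simplified)

class ToyOutputStream:
    """Toy chunk -> packet -> DataQueue/AckQueue model (illustration only)."""

    def __init__(self):
        self.chunk_buf = bytearray()   # chunk-sized buffer
        self.packet = []               # chunks accumulated for the current packet
        self.data_queue = deque()      # packets waiting to be sent
        self.ack_queue = deque()       # packets sent, awaiting acknowledgment

    def write(self, data: bytes):
        self.chunk_buf.extend(data)
        while len(self.chunk_buf) >= CHUNK:
            chunk = bytes(self.chunk_buf[:CHUNK])
            del self.chunk_buf[:CHUNK]
            checksum = sum(chunk) & 0xFFFFFFFF   # stand-in for the real CRC32
            self.packet.append((chunk, checksum))
            if len(self.packet) == CHUNKS_PER_PACKET:
                self.data_queue.append(self.packet)  # packet full: queue it
                self.packet = []

    def send_one(self):
        """DataStreamer's job: send a packet, park it on the ack queue."""
        self.ack_queue.append(self.data_queue.popleft())

    def on_ack(self, ok: bool):
        """ResponseProcessor's job: drop on success, requeue on failure."""
        pkt = self.ack_queue.popleft()
        if not ok:
            self.data_queue.appendleft(pkt)   # rewrite the failed packet
```

Writing exactly 126 × 512 bytes fills one packet and pushes it onto the data queue; a failed ack sends it back for rewriting, mirroring the last bullet above.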
3. Basic knowledge
NameNode
- Manage metadata information (Metadata). Note that only metadata information is stored.
- The NameNode keeps a copy of the metadata in memory for fast access and queries. The metadata is also persisted to disk through the fsimage and edits files.
- Hadoop 1.0 uses the SecondaryNameNode to merge the fsimage and edits files, but this mechanism is not a hot backup: the NameNode in Hadoop 1.0 is a single point of failure.
- Metadata is roughly divided into two layers: the namespace management layer, responsible for the tree-like directory structure and the mapping between files and data blocks; and the block management layer, responsible for the mapping (BlocksMap) between data blocks and their actual storage locations.
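The two layers can be pictured as two maps. The paths, block IDs, and node names here are invented for illustration; the real FSNamesystem internals are far more involved:

```python
# Namespace layer: file path -> ordered list of block IDs (toy example).
namespace = {
    "/logs/app.log": ["blk_1", "blk_2"],
}
# Block management layer (BlocksMap): block ID -> DataNodes holding a replica.
blocks_map = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}

def locate(path):
    """Resolve a file through both layers to its replica locations."""
    return [(blk, blocks_map[blk]) for blk in namespace[path]]

print(locate("/logs/app.log"))
```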
DataNode
- Data node, used to store file blocks.
- To prevent data loss when a DataNode goes down, file blocks must be replicated; each block has three replicas by default.
rack
- Rack, HDFS uses rack awareness strategy to place replicas.
- First replica: if the writer is itself a DataNode, place the replica locally; otherwise pick a random DataNode.
- Second replica: a DataNode on a remote rack.
- Third replica: a different DataNode on the same remote rack as the second replica.
- This placement strategy reduces the write traffic between racks and improves write performance.
- More than 3 replicas: the remaining replicas are placed subject to the following conditions:
  - a DataNode holds at most one replica of a given block;
  - the maximum number of replicas of a block is the total number of DataNodes in the cluster.
Reference link: HDFS Replica Placement Policy
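A minimal sketch of the three-replica choice described above. The topology shape and names are invented; the real logic lives in Hadoop's default block placement policy and handles many more constraints (load, free space, decommissioning):

```python
import random

def place_replicas(writer, topology):
    """Pick nodes for the first three replicas, rack-aware (toy sketch).

    topology: dict mapping rack name -> list of DataNode names.
    Assumes at least two racks and two nodes on the chosen remote rack.
    """
    node_rack = {n: r for r, nodes in topology.items() for n in nodes}
    # 1st replica: local if the writer is a DataNode, otherwise random
    first = writer if writer in node_rack else random.choice(list(node_rack))
    # 2nd replica: any node on a different (remote) rack
    remote_rack = random.choice([r for r in topology if r != node_rack[first]])
    second = random.choice(topology[remote_rack])
    # 3rd replica: a different node on the same remote rack as the 2nd
    third = random.choice([n for n in topology[remote_rack] if n != second])
    return [first, second, third]
```

Only one of the three transfers crosses a rack boundary (writer's rack to the remote rack), which is why this policy reduces inter-rack write traffic.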
client
- Client: any end that operates HDFS through the API or commands can be regarded as a client.
blockSize
- Data blocks have a default size, which can be configured through the dfs.blocksize item in hdfs-site.xml.
- Hadoop 1.x: 64 MB. Hadoop 2.x: 128 MB.
- On block size: from a big-data-processing perspective, larger blocks reduce the number of disk seeks and therefore total addressing time, so as disks and networks have grown faster, default block sizes have tended to grow as well.
4. HDFS read and write process
① Reading process of HDFS
- The client calls the DistributedFileSystem.open() method to obtain an input stream (FSDataInputStream) for the file to be read.
- Inside open(), DistributedFileSystem uses RPC to ask the NameNode for the DataNode addresses of every replica of the file's blocks. open() then returns the FSDataInputStream object, which wraps a DFSInputStream.
- The client calls FSDataInputStream.read(); DFSInputStream automatically connects to the most suitable DataNode (the closest one in the network topology) to read the data.
- The read() method is called in a loop to transfer data from the dataNode to the client.
- After reading the current block, close the connection with the current dataNode. Establish a connection with the dataNode of the next block to continue reading the next block.
This process is transparent to the client. From the client's perspective, it seems that only one continuous stream is read.
- After the client finishes reading all the blocks, it calls FSDataInputStream.close() to close the input stream, thus ending the file reading process.
- Read error:
- If an error occurs while reading, DFSInputStream tries another DataNode that holds a replica of the block. It also records the DataNode that failed and avoids it in subsequent data requests.
- Every time a block is read, DFSInputStream verifies its checksum. If the data is corrupt, the client notifies the NameNode and continues by reading the replica from another DataNode.
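The failover behaviour in the two bullets above can be sketched as a loop. The fetch callable and node names are placeholders, not the real DFSInputStream API:

```python
def read_block(replicas, fetch, dead_nodes):
    """Try replicas closest-first, remembering nodes that fail (toy sketch).

    replicas: DataNode names ordered by network distance, closest first.
    fetch: callable(node) -> bytes; raises IOError on a read problem.
    dead_nodes: set updated in place, so later reads skip bad nodes.
    """
    for node in replicas:
        if node in dead_nodes:
            continue              # never retry a node that already failed
        try:
            return fetch(node)
        except IOError:
            dead_nodes.add(node)  # record the problem node, try the next one
    raise IOError("all replicas failed")
```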
② HDFS writing process
- The client calls the DistributedFileSystem.create() method to send a file-creation request to the NameNode.
- Inside create(), DistributedFileSystem sends an RPC request to the NameNode, which runs its pre-creation checks. If they pass, the NameNode first records the write operation in the EditLog and then returns the output stream object FSDataOutputStream (which internally wraps a DFSOutputStream).
- The client calls FSDataOutputStream.write() to write data to the file.
- While writing, DFSOutputStream splits the data into packets and pushes them onto the DataQueue. The DataStreamer manages the DataQueue and asks the NameNode to allocate suitable new blocks for storing the replicas. The chosen DataNodes form a pipeline, and packets are transmitted along it.
- The DataStreamer streams each packet to DataNode1 through the pipeline.
- DataNode1 forwards the received packet to DataNode2.
- DataNode2 forwards the received packet to DataNode3, completing triple-replica storage of the packet.
- To ensure replica consistency, every DataNode that receives a packet returns an ack to its sender. Once enough acks have been received, the packet is deleted from the internal (ack) queue.
- After the file is written, the client calls FSDataOutputStream.close() to close the output stream.
- DistributedFileSystem.complete() is then called to notify the NameNode that the file was written successfully.
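The pipeline steps above can be sketched as a recursive hand-off. This is a toy model only; real DataNodes speak a socket protocol, stream packets concurrently, and ack asynchronously:

```python
def pipeline_write(packet, pipeline, stores):
    """Forward a packet down the pipeline; the ack flows back up (toy model).

    pipeline: ordered DataNode names, e.g. ["dn1", "dn2", "dn3"].
    stores: dict mapping DataNode name -> list of packets it has stored.
    """
    if not pipeline:
        return True                      # past the tail: ack travels back
    head, downstream = pipeline[0], pipeline[1:]
    stores[head].append(packet)          # this node persists the packet...
    return pipeline_write(packet, downstream, stores)  # ...then forwards it
```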