3. Hadoop series: HDFS architecture and file upload and download

In this article, we cover the HDFS architecture, its advantages and disadvantages, the file block size, and uploading and downloading files through shell commands in Hadoop.

1. HDFS usage scenarios

HDFS is suitable for write-once, read-many access. Once a file is created, written, and closed, it does not need to be changed.

2. HDFS advantages and disadvantages
2.1 Advantages of HDFS
  • High fault tolerance
    • Data is automatically stored as multiple replicas, and fault tolerance improves as more replicas are added
    • A lost replica can be recovered automatically
  • Suitable for processing big data
    • Data scale: able to handle data at the GB, TB, or even PB level
    • File count: able to handle more than one million files
  • Can be built on cheap machines, with reliability improved through the multi-replica mechanism
2.2 HDFS Disadvantages
  • Not suitable for low-latency data access; for example, millisecond-level access to stored data cannot be achieved
  • Unable to efficiently store a large number of small files
    • Storing a large number of small files consumes a large amount of NameNode memory for directory and block metadata, which is not advisable, since NameNode memory is always limited
    • For small files, the seek time exceeds the read time, which violates the HDFS design goals
  • Concurrent writing and random modification of files are not supported
    • A file can have only one writer at a time; multiple threads are not allowed to write to it concurrently.
    • Only appending data is supported; random modification of files is not supported.
3. HDFS architecture

  • NameNode (nn): the Master
    • Manages the HDFS namespace
    • Configures the replication policy
    • Manages data block (Block) mapping information (the commands after this list show how to inspect it)
    • Handles client read and write requests
  • DataNode: the Slave. The NameNode issues commands, and the DataNode performs the actual operations.
    • Stores the actual data blocks
    • Performs read and write operations on data blocks
  • Secondary NameNode: not a hot standby for the NameNode. When the NameNode goes down, it cannot immediately take over and provide service.
    • Assists the NameNode and shares its workload, for example by periodically merging the Fsimage and Edits files (both explained in detail in later articles) and pushing the result to the NameNode
    • In an emergency, it can help recover the NameNode
  • Client: the HDFS client
    • File splitting: when a file is uploaded to HDFS, the client splits it into blocks and then uploads them
    • Interacts with the NameNode to obtain file location information
    • Interacts with DataNodes to read or write data
    • Provides commands to manage HDFS, such as formatting the NameNode
    • Provides commands to access HDFS, such as create, delete, update, and query operations
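These roles can also be observed from the command line. As a minimal sketch (standard HDFS admin commands; output omitted, and /shenjian is the directory used later in this article): hdfs dfsadmin -report lists the DataNodes registered with the NameNode, and hdfs fsck prints the block-to-DataNode mapping that the NameNode maintains for a path.

// List the DataNodes registered with the NameNode, with their capacity and usage
# hdfs dfsadmin -report
// Show the blocks under a path and which DataNodes hold each replica
# hdfs fsck /shenjian -files -blocks -locations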
4. HDFS file block size

Files in HDFS are physically stored in blocks (Block). The block size is controlled by the configuration parameter dfs.blocksize, and the default is 128 MB.

  • If the seek (addressing) time is about 10 ms, it takes about 10 ms to locate the target block.
  • As a rule of thumb, it is optimal when the seek time is about 1% of the transfer time, so the transfer time should be about 10 ms / 0.01 = 1000 ms = 1 s.
  • Current disk transfer rates are typically around 100 MB/s.
  • Block size = 1 s × 100 MB/s = 100 MB, so a block size of roughly 100 MB is appropriate, which is why the default is 128 MB.
Why can't the block size be set too small or too large?
  • If the HDFS block size is set too small, seek overhead increases: the program spends its time locating the start of blocks.
  • If the HDFS block size is set too large, the disk transfer time becomes much greater than the seek time, and the program becomes very slow when processing a block.
  • Summary: the HDFS block size setting mainly depends on the disk transfer rate. The sketch below shows how to check the configured value and override it for a single upload.
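As a hedged sketch of working with this setting (output omitted; the 256 MB value, localfile.txt, and /targetdir are placeholders for illustration only):

// Print the configured default block size in bytes (134217728 = 128 MB if the default is unchanged)
# hdfs getconf -confKey dfs.blocksize
// Override the block size for a single upload (256 MB here; file and target path are placeholders)
# hadoop fs -D dfs.blocksize=268435456 -put localfile.txt /targetdir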
5. HDFS file upload and download
  • We open the web UI of the previously deployed NameNode at http://localhost:9870/ . The figure shows the files present right after deployment.

  • We enter the namenode container and run the shell commands below.
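A minimal way to get a shell in that container, assuming the container from the earlier deployment is named namenode (an assumption; adjust the name to your own setup):

// Open a shell inside the namenode container (the container name is an assumption)
# docker exec -it namenode bash

The following commands are then run inside this shell.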
// Create the directory /shenjian
# hadoop fs -mkdir /shenjian
# echo '算法小生' > wechat_upload.txt
// Upload the file
# hadoop fs -put wechat_upload.txt /shenjian
2023-01-02 10:51:48,467 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
// Download the file
# hadoop fs -get /shenjian/wechat_upload.txt wechat_download.txt 
2023-01-02 10:52:14,348 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
# cat wechat_download.txt
算法小生
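The upload can also be checked from the shell; a small hedged sketch (output omitted):

// List the directory; the second column of the listing shows each file's replication factor
# hadoop fs -ls /shenjian
// Print the file contents directly from HDFS
# hadoop fs -cat /shenjian/wechat_upload.txt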

  • We try to append to the file
# hadoop fs -appendToFile wechat_download.txt /shenjian/wechat_upload.txt
2023-01-02 11:11:12,480 WARN hdfs.DataStreamer: DataStreamer Exception
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[172.29.0.2:9866,DS-1e5040af-3a7b-47f3-8f18-1c1ec13464d3,DISK]], original=[DatanodeInfoWithStorage[172.29.0.2:9866,DS-1e5040af-3a7b-47f3-8f18-1c1ec13464d3,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1304)
        at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1372)
        at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1598)
        at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1499)
        at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:720)
appendToFile: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[172.29.0.2:9866,DS-1e5040af-3a7b-47f3-8f18-1c1ec13464d3,DISK]], original=[DatanodeInfoWithStorage[172.29.0.2:9866,DS-1e5040af-3a7b-47f3-8f18-1c1ec13464d3,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration

As the figure above shows, the replication factor is 3, but we have started only one datanode, which is why the write fails. Since this is only a local development setup, we can simply disable the datanode replacement policy: modify the HDFS configuration in hadoop.env by adding the setting below, then restart.

# This is written into the container file /opt/hadoop-3.2.1/etc/hadoop/hdfs-site.xml as dfs.client.block.write.replace-datanode-on-failure.policy=NEVER

HDFS_CONF_dfs_client_block_write_replace___datanode___on___failure_policy=NEVER
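For reference, the image's startup script writes this variable into hdfs-site.xml inside the container; the resulting entry should look roughly like the following (a sketch, not copied from the container):

<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>NEVER</value>
</property>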

Append again

# echo 'shenjian' > test.txt
# hadoop fs -appendToFile test.txt /shenjian/wechat_upload.txt
2023-01-02 11:47:46,448 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
# Check the end of the file: the append succeeded
# hadoop fs -tail /shenjian/wechat_upload.txt
2023-01-02 11:49:17,188 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
算法小生 
shenjian
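As an optional hedged follow-up, the file's size and replication factor can be checked, and the test data cleaned up:

// Show the file size (bytes) and replication factor after the append
# hadoop fs -stat "size=%b replication=%r" /shenjian/wechat_upload.txt
// Optional cleanup of the test directory; -skipTrash deletes immediately
# hadoop fs -rm -r -skipTrash /shenjian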

Welcome to follow the WeChat public account 算法小生.

