Hadoop (3): The Basic Process of Writing Data to HDFS

HDFS write data flow

 

 

  • Example: upload a 300 MB file, a.txt, with the HDFS shell (e.g. hdfs dfs -put a.txt /)

  1. The file is split into blocks; the default block size is 128 MB, so a.txt becomes three blocks (128 MB + 128 MB + 44 MB; see the first sketch after this list).

  2. The shell client sends an upload request to the NameNode.

  3. The NameNode checks the filesystem directory tree to see whether the upload is allowed (e.g. the target path exists and the file does not already exist).

  4. The NameNode notifies the shell client that the upload may proceed.

  5. The shell client asks the NameNode where to upload block1, requesting 3 replicas.

  6. The NameNode searches its pool of DataNode information for the IPs of three DataNodes, chosen by the following rules (a toy version is sketched after this list):

    • Nearest in the network topology (fewest switches traversed).

    • If the shell client itself runs on a DataNode, one replica is stored locally.

    • One replica on another node in the same rack (rack awareness; see below).

    • One replica on a node in a different rack.

  7. The NameNode returns the chosen IPs to the shell client.

  8. The shell client picks the nearest of the returned IPs, say DataNode1, and sends it a request to establish a data-transfer connection, building the pipeline:

    • In Hadoop, the pipeline is the object used to transfer data; each node hands data on to the next, like an assembly line.

    • DataNode1 builds a pipeline to DataNode2.

    • DataNode2 builds a pipeline to DataNode3.

  9. DataNode3 returns a pipeline-established acknowledgment to DataNode2, which passes it back to DataNode1, which passes it back to the shell client.

  10. Through an OutputStream, the shell client sends data to DataNode1 in packets (64 KB each), which are passed down the pipeline level by level (simulated in the third sketch after this list).

    • Each DataNode in the pipeline stores the data locally after receiving it.

  11. After storing the data, the DataNodes cascade packet acknowledgments with data checks back up the pipeline, verifying that the transfer completed.

  12. When the block transfer is complete, the pipeline is closed; steps 5-11 repeat for each remaining block.
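
To make steps 1 and 10 concrete, here is a minimal sketch (plain Python; the sizes match the defaults above, but the helper is ours, not an HDFS API) of how the 300 MB a.txt breaks down into blocks and packets:

# Sketch: split a 300 MB file into 128 MB blocks (step 1) and 64 KB packets (step 10).
BLOCK_SIZE = 128 * 1024 * 1024   # default HDFS block size
PACKET_SIZE = 64 * 1024          # packet unit used on the pipeline

def split(total, unit):
    # Return the sizes of the pieces `total` bytes split into `unit`-byte chunks.
    full, rest = divmod(total, unit)
    return [unit] * full + ([rest] if rest else [])

file_size = 300 * 1024 * 1024                    # the 300 MB a.txt
blocks = split(file_size, BLOCK_SIZE)
print([b // (1024 * 1024) for b in blocks])      # [128, 128, 44]
print(len(split(blocks[0], PACKET_SIZE)))        # 2048 packets per full block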
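
Step 6 can be sketched as a toy placement function, assuming the NameNode already knows each DataNode's rack (the node names and the function are hypothetical, not Hadoop's real BlockPlacementPolicy):

# Toy replica placement following the rules in step 6: prefer the client's own
# node, then another node in the same rack, then a node in a different rack.
racks = {"dn1": "rack1", "dn2": "rack1", "dn3": "rack2", "dn4": "rack2"}

def place_replicas(client, racks, replication=3):
    nodes = list(racks)
    # First replica: on the client itself if it is a DataNode, else any node.
    first = client if client in racks else nodes[0]
    chosen = [first]
    # Second replica: another node in the same rack as the first.
    for n in nodes:
        if n not in chosen and racks[n] == racks[first]:
            chosen.append(n)
            break
    # Third replica: a node in a different rack.
    for n in nodes:
        if n not in chosen and racks[n] != racks[first]:
            chosen.append(n)
            break
    return chosen[:replication]

print(place_replicas("dn1", racks))   # ['dn1', 'dn2', 'dn3']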
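
Finally, steps 8-11 amount to a relay with acknowledgments. A minimal simulation (plain Python lists stand in for DataNode storage; this is not the real DataNode transfer protocol) of one packet flowing down a three-node pipeline and the acknowledgment cascading back:

# Simulate steps 8-11: a packet flows client -> DN1 -> DN2 -> DN3, and an
# acknowledgment (here, a checksum) cascades back DN3 -> DN2 -> DN1 -> client.
import zlib

def send_through_pipeline(packet, pipeline):
    if not pipeline:
        # Past the end of the pipeline: start the ack cascade with a checksum.
        return zlib.crc32(packet)
    head, rest = pipeline[0], pipeline[1:]
    head.append(packet)                          # this DataNode stores the data locally
    return send_through_pipeline(packet, rest)   # forward, then relay the ack back up

dn1, dn2, dn3 = [], [], []                       # stand-ins for local DataNode storage
ack = send_through_pipeline(b"64 KB of data", [dn1, dn2, dn3])
assert ack == zlib.crc32(b"64 KB of data")       # the client verifies the transfer
print("ack:", ack)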

Rack Awareness

The discussion below is based on the content of these posts:

https://blog.csdn.net/w182368851/article/details/53729790

https://www.cnblogs.com/zwgblog/p/7096875.html

Rack awareness actually comes down to a single option in the core-site.xml configuration file:

<property> 
 <name>topology.script.file.name</name> 
 <value>/home/bigdata/apps/hadoop-talkyun/etc/hadoop/topology.sh</value> 
</property> 

 

  • The value of this option names an executable program, usually a script.

  • The script accepts a parameter and produces an output value.

  • The parameter is usually a DataNode's IP address (or hostname), and the output is usually the rack on which that DataNode sits.

  • Process:

    • When the NameNode starts, it checks whether this option is empty; if it is not, rack awareness is enabled.

    • The NameNode then looks up the script at the configured path.

    • Whenever it receives a heartbeat (Heartbeat) from a DataNode, it passes that DataNode's IP address to the script as a parameter, obtains the DataNode's rack, and stores the mapping in an in-memory map, so it can tell whether any two machines are on the same rack.

  • A simple example script:

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import sys

# Map each DataNode hostname or IP address to its rack.
rack = {"NN01":"rack2",
        "NN02":"rack3",
        "DN01":"rack4",
        "DN02":"rack4",
        "DN03":"rack1",
        "DN04":"rack3",
        "DN05":"rack1",
        "DN06":"rack4",
        "DN07":"rack1",
        "DN08":"rack2",
        "DN09":"rack1",
        "DN10":"rack2",
        "172.16.145.32":"rack2",
        "172.16.145.33":"rack3",
        "172.16.145.34":"rack4",
        "172.16.145.35":"rack4",
        "172.16.145.36":"rack1",
        "172.16.145.37":"rack3",
        "172.16.145.38":"rack1",
        "172.16.145.39":"rack4",
        "172.16.145.40":"rack1",
        "172.16.145.41":"rack2",
        "172.16.145.42":"rack1",
        "172.16.145.43":"rack2",
        }

if __name__ == "__main__":
    # Hadoop may pass several hosts in one invocation; print one rack per
    # argument, falling back to a default rack for unknown hosts.
    for host in sys.argv[1:]:
        print("/" + rack.get(host, "rack0"))
 

 
