HDFS shell example: uploading a 300 MB file, a.txt

- The shell sends an upload request to the NameNode.
- The NameNode checks the file system directory tree to see whether the file can be uploaded.
- The NameNode sends the shell a notification allowing the upload.
- The shell asks the NameNode to upload block1, specifying 3 replicas.
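The text says the 300 MB file is uploaded block by block but never states the block size. As a rough sketch, assuming the Hadoop 2.x default block size of 128 MB (older versions default to 64 MB), the split works out like this; `split_into_blocks` is a hypothetical helper, not a Hadoop API:

```python
# Hypothetical sketch: how a 300 MB file is split into HDFS blocks.
# Assumes the Hadoop 2.x default block size of 128 MB (not stated in
# the text) and the replication factor of 3 mentioned above.

def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file is split into."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

blocks = split_into_blocks(300)
print(blocks)            # [128, 128, 44]
print(sum(blocks) * 3)   # 900 MB of raw storage with 3 replicas
```

Each of these blocks goes through the DataNode-selection and pipeline steps below in turn.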
- The NameNode searches its pool of DataNode information for the IPs of three DataNodes, chosen by the following rules:
  - Nearest in the network topology (fewest switches to cross).
  - If the shell itself runs on a DataNode, one replica is stored locally.
  - One replica on the same rack (determined by rack awareness; see below).
  - One replica on a different rack.
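The selection rules above can be sketched in a few lines. This is an illustrative toy, not the NameNode's actual placement algorithm; the node names, racks, and the `pick_replica_nodes` helper are all made up:

```python
# Hypothetical sketch of the replica-placement rules described above:
# local replica first (if the client is a DataNode), then a same-rack
# replica, then an off-rack replica. Names and racks are illustrative.

def pick_replica_nodes(datanodes, client, replication=3):
    """datanodes maps node name -> rack; returns the chosen nodes."""
    chosen = []
    if client in datanodes:
        chosen.append(client)                      # local replica
    client_rack = datanodes.get(client)
    for node, rack in datanodes.items():           # same-rack replica
        if len(chosen) >= replication:
            break
        if node not in chosen and rack == client_rack:
            chosen.append(node)
    for node, rack in datanodes.items():           # off-rack replica
        if len(chosen) >= replication:
            break
        if node not in chosen and rack != client_rack:
            chosen.append(node)
    return chosen

datanodes = {"DN01": "rack1", "DN02": "rack1", "DN03": "rack2", "DN04": "rack2"}
print(pick_replica_nodes(datanodes, client="DN01"))  # ['DN01', 'DN02', 'DN03']
```

Note that real HDFS placement also weighs node load and available space; this sketch only encodes the rack rules listed in the notes.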
- The NameNode returns the selected IPs to the shell.
- The shell picks the nearest of those IPs, say DataNode1, and sends it a request to establish a data-transfer connection, setting up a pipeline.
  - In Hadoop, a pipeline is the object used to transmit data, much like passing items along a pipe.
- DataNode1 extends the pipeline to DataNode2.
- DataNode2 extends the pipeline to DataNode3.
- DataNode3 returns a pipeline-established notification to DataNode2, which passes it back to DataNode1, and DataNode1 passes it back to the shell.
- Through an OutputStream, the shell sends data to DataNode1 in packets (64 KB each), which are forwarded down the pipeline level by level.
- As each DataNode receives a packet, it stores the data locally.
- After saving the data, the DataNodes cascade checksum acknowledgments back up the pipeline in reverse, verifying that the transfer completed intact.
- Once the transfer completes, the pipeline is closed, and the steps from block upload through transfer are repeated for each remaining block.
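The packet flow above can be simulated in miniature. This is a toy model of the pipeline, not Hadoop's actual transfer code; `send_through_pipeline` and its byte-string "stores" are invented for illustration:

```python
# Hypothetical simulation of the write pipeline described above: the
# client splits data into 64 KB packets, each node in the pipeline
# stores a copy and forwards it, and one ack cascades back per packet.

PACKET_SIZE = 64 * 1024  # 64 KB packets, as stated in the notes

def send_through_pipeline(data, pipeline):
    """Push `data` packet by packet down `pipeline`; return ack count."""
    stores = {node: b"" for node in pipeline}
    acks = 0
    for off in range(0, len(data), PACKET_SIZE):
        packet = data[off:off + PACKET_SIZE]
        for node in pipeline:          # forwarded level by level
            stores[node] += packet     # each node stores it locally
        acks += 1                      # ack cascades back to the client
    # Checksum-style verification: every replica matches the original.
    assert all(stores[n] == data for n in pipeline)
    return acks

data = b"x" * (200 * 1024)             # 200 KB of test data
print(send_through_pipeline(data, ["DataNode1", "DataNode2", "DataNode3"]))
# 4 acks: packets of 64 + 64 + 64 + 8 KB
```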
Rack Awareness
The following discussion is based on the content at these URLs:
https://blog.csdn.net/w182368851/article/details/53729790
https://www.cnblogs.com/zwgblog/p/7096875.html
The rack-awareness mechanism is actually driven by one option in the core-site.xml configuration file:

<property>
    <name>topology.script.file.name</name>
    <value>/home/bigdata/apps/hadoop-talkyun/etc/hadoop/topology.sh</value>
</property>
- The value of this option names an executable program, usually a script.
- The script takes parameters and produces output.
- The parameter is usually a DataNode's IP address, and the output is generally the rack (Rack) on which the DataNode with that IP sits.
Process:

- When the NameNode starts, it checks whether this configuration option is empty; if it is not, rack awareness is enabled.
- The NameNode then locates the script at the configured path.
- Whenever the NameNode receives a heartbeat (Heartbeat) from a DataNode, it passes that DataNode's IP address to the script as a parameter. It thus obtains each DataNode's rack and stores the mapping in memory, so it can tell whether any two machines are on the same rack.
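The heartbeat-driven cache described above might look like the following sketch. A real NameNode invokes the configured topology script; here a plain function, `resolve_rack`, stands in for that script, and the IP addresses are just illustrative:

```python
# Hypothetical sketch of the NameNode's in-memory rack map, filled in
# as DataNode heartbeats arrive. `resolve_rack` stands in for the
# external topology script; addresses are illustrative.

rack_map = {}  # ip -> rack path, built up over time

def resolve_rack(ip):
    """Stand-in for the topology script configured in core-site.xml."""
    table = {"172.16.145.32": "/rack2",
             "172.16.145.36": "/rack1",
             "172.16.145.40": "/rack1"}
    return table.get(ip, "/rack0")

def on_heartbeat(ip):
    """On each DataNode heartbeat, resolve and cache the node's rack."""
    if ip not in rack_map:
        rack_map[ip] = resolve_rack(ip)

def same_rack(ip_a, ip_b):
    """Answer the question the map exists for: same rack or not?"""
    return rack_map.get(ip_a) == rack_map.get(ip_b)

for ip in ["172.16.145.32", "172.16.145.36", "172.16.145.40"]:
    on_heartbeat(ip)
print(same_rack("172.16.145.36", "172.16.145.40"))  # True: both /rack1
print(same_rack("172.16.145.32", "172.16.145.36"))  # False
```

This same-rack test is what the replica-placement rules earlier rely on.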
A simple example of such a script:

#!/usr/bin/python
# -*- coding: UTF-8 -*-
# Topology script: maps a hostname or IP address to its rack.
import sys

rack = {
    "NN01": "rack2", "NN02": "rack3",
    "DN01": "rack4", "DN02": "rack4", "DN03": "rack1",
    "DN04": "rack3", "DN05": "rack1", "DN06": "rack4",
    "DN07": "rack1", "DN08": "rack2", "DN09": "rack1",
    "DN10": "rack2",
    "172.16.145.32": "rack2", "172.16.145.33": "rack3",
    "172.16.145.34": "rack4", "172.16.145.35": "rack4",
    "172.16.145.36": "rack1", "172.16.145.37": "rack3",
    "172.16.145.38": "rack1", "172.16.145.39": "rack4",
    "172.16.145.40": "rack1", "172.16.145.41": "rack2",
    "172.16.145.42": "rack1", "172.16.145.43": "rack2",
}

if __name__ == "__main__":
    # Hadoop may pass several hosts at once; print one rack per argument,
    # falling back to a default rack for unknown hosts.
    for host in sys.argv[1:]:
        print("/" + rack.get(host, "rack0"))