Hadoop rack awareness and network topology design

Reposted from https://blog.csdn.net/haboop/article/details/89786422

1. Data blocks
HDFS is the distributed file system in Hadoop and was designed specifically to support MapReduce. Besides the high reliability that any distributed file system must offer, HDFS also has to provide efficient read and write performance for MapReduce. How does HDFS achieve this?

First, HDFS splits the data of every file into blocks and stores several copies of each block, distributing the replicas across different nodes. This block-replica storage strategy is the key to both the reliability and the performance of HDFS, because:

First, storing a file as blocks and reading it block by block improves the efficiency of concurrent reads and of random reads within the file.

Second, storing copies of each block on different nodes provides machine-level reliability and also improves the efficiency of concurrent reads of the same block.

Third, the block is closely tied to how MapReduce splits its input into tasks. In short, the replica placement policy of HDFS is the key to achieving both high reliability and high performance.
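As a quick, hedged illustration (the file path below is only an example, not from the original post), you can check how HDFS split a file into blocks and where the replicas were placed, and adjust a file's replication factor, from the command line:

# List the blocks of a file and the DataNodes that hold each replica
# (the path is only an example).
bin/hadoop fsck /user/hadoop/input/big.log -files -blocks -locations

# Change the file's replication factor to 3 and wait until the
# extra replicas have been created.
bin/hadoop fs -setrep -w 3 /user/hadoop/input/big.log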

2. Rack awareness
To improve data reliability and availability and to make better use of network bandwidth, HDFS uses an optional policy known as rack awareness. Through the rack-awareness mechanism, the NameNode can determine the rack ID that each DataNode belongs to (which is why the NameNode uses the NetworkTopology data structure to store DataNodes).

A simple but unoptimized policy is to store the replicas on different racks. This prevents data loss when an entire rack fails and allows reads to make full use of the bandwidth of multiple racks. It also spreads replicas evenly across the cluster, which helps balance load after a component failure. However, a write then has to transfer the block to several racks, which increases the cost of writing.

In most cases the replication factor is 3. The HDFS placement policy is then: store one replica on a node in the local rack, another replica on a different node in the same rack, and the last replica on a node in a different rack. This policy reduces the amount of data transferred between racks and improves write efficiency. Rack failures are far less common than node failures, so the policy does not hurt data reliability or availability. At the same time, because each block is stored on only two distinct racks, the policy also reduces the total network bandwidth needed to read the data. Under this policy the replicas are not spread evenly across racks: one third of the replicas are on one node, two thirds of the replicas are on one rack, and the remaining replicas are spread evenly over the other racks. This improves write performance without sacrificing data reliability or read performance. Let's look at how HDFS actually implements this.
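To see this policy in action on a rack-aware cluster, fsck can also print the rack ID of every replica location; a small sketch (the path is again only an example):

# Show each block's replica locations prefixed with their rack IDs,
# e.g. /dc1/rack1/192.168.147.91:50010.
bin/hadoop fsck /user/hadoop/input/big.log -files -blocks -racks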

3. Configuration
If rack awareness is not configured, the NameNode prints log entries like the following:

2016-07-17 17:27:26,423 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.147.92:50010
The rack ID for every IP is /default-rack. To enable rack awareness, add the following property to core-site.xml:

<property>  
  <name>topology.script.file.name</name>  
  <value>/etc/hadoop/topology.sh</value>  
</property>

The value points to a shell script whose job is to take DataNode IPs as input and print the corresponding rack IDs. When the NameNode starts, it checks whether rack awareness is enabled; if so, it locates the script named in the configuration, and whenever it receives a heartbeat from a DataNode it passes the DataNode's IP to the script, obtains the rack ID, and stores it in an in-memory map. A simple script looks like this:

#!/bin/bash
# Directory that holds topology.data (kept next to this script).
HADOOP_CONF=/etc/hadoop

while [ $# -gt 0 ] ; do
  nodeArg=$1
  exec < ${HADOOP_CONF}/topology.data
  result=""
  while read line ; do
    ar=( $line )
    if [ "${ar[0]}" = "$nodeArg" ] || [ "${ar[1]}" = "$nodeArg" ] ; then
      result="${ar[2]}"
    fi
  done
  shift
  # Print one rack per input address, separated by spaces; addresses
  # not listed in topology.data fall back to /default-rack.
  if [ -z "$result" ] ; then
    echo -n "/default-rack "
  else
    echo -n "$result "
  fi
done
Each line of topology.data lists a node (its IP address and hostname) followed by its rack ID, in the form /switch/rack:

192.168.147.91 tbe192168147091 /dc1/rack1
192.168.147.92 tbe192168147092 /dc1/rack1
192.168.147.93 tbe192168147093 /dc1/rack2
192.168.147.94 tbe192168147094 /dc1/rack3
192.168.147.95 tbe192168147095 /dc1/rack3
192.168.147.96 tbe192168147096 /dc1/rack3
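Before relying on the script, you can sanity-check it by hand using the addresses above; a minimal sketch (the second address is a made-up one that does not appear in topology.data):

# Ask the script for the rack of a known node; it should print /dc1/rack1.
bash /etc/hadoop/topology.sh 192.168.147.92

# An address missing from topology.data falls back to /default-rack.
bash /etc/hadoop/topology.sh 192.168.147.200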
Restart the NameNode so that it picks up the new configuration; you can then run

./hadoop dfsadmin -printTopology

to view the rack topology information.
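Because the script is only read when the NameNode starts, a typical sequence is to restart the NameNode and then look for rack-qualified registrations in its log; a minimal sketch, assuming the standard daemon scripts and log layout of a Hadoop installation:

# Restart the NameNode so it re-reads core-site.xml and the topology script.
sbin/hadoop-daemon.sh stop namenode
sbin/hadoop-daemon.sh start namenode

# The registrations in the NameNode log should now carry real rack IDs,
# e.g. "Adding a new node: /dc1/rack1/192.168.147.92:50010".
grep "Adding a new node" logs/hadoop-*-namenode-*.log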

4. Dynamically adding nodes
How can you add a DataNode to the cluster dynamically, without restarting the NameNode? In a cluster with rack awareness enabled, it can be done as follows. Assume a Hadoop cluster in which the NameNode and a DataNode are deployed on 192.168.147.68 and rack awareness is enabled; bin/hadoop dfsadmin -printTopology shows:

Rack: /dc1/rack1
   192.168.147.68:50010 (dbj68)
Now we want to add a data node 192.168.147.69, physically located in rack2, to the cluster without restarting the NameNode. First, edit topology.data on the NameNode, add the line 192.168.147.69 dbj69 /dc1/rack2, and save:

192.168.147.68 dbj68 /dc1/rack1
192.168.147.69 dbj69 /dc1/rack2
Then start the DataNode on dbj69 with sbin/hadoop-daemons.sh start datanode. Running bin/hadoop dfsadmin -printTopology on any node now shows:

Rack: /dc1/rack1
   192.168.147.68:50010 (dbj68)

Rack: /dc1/rack2
   192.168.147.69:50010 (dbj69)
This shows that Hadoop has recognized the newly added node dbj69. Note that if dbj69 is not first added to topology.data and you run sbin/hadoop-daemons.sh start datanode anyway, the DataNode log on dbj69 will report an error and the DataNode will fail to start.
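Putting the steps together, a minimal sketch of the whole procedure (IP addresses, hostnames, and the /etc/hadoop paths are taken from the example above; adjust them to your installation):

# 1. On the NameNode, append the new node's mapping to topology.data.
echo "192.168.147.69 dbj69 /dc1/rack2" >> /etc/hadoop/topology.data

# 2. Start the DataNode process on dbj69 (from the Hadoop home directory).
sbin/hadoop-daemons.sh start datanode

# 3. Verify from any node that the NameNode placed dbj69 under /dc1/rack2.
bin/hadoop dfsadmin -printTopology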

Origin blog.csdn.net/hyy_blue/article/details/92427585