HDFS cluster

An HDFS cluster is built on top of a Hadoop cluster. Since the HDFS daemons are the core processes of Hadoop, configuring an HDFS cluster is representative of configuring a Hadoop cluster in general.

Docker makes building such a cluster environment more convenient and efficient.

Configuration on each machine

Two questions come up while learning: how is a Hadoop cluster configured, and which configuration belongs on which machine? This chapter provides a typical example, but Hadoop's configuration options are far more numerous and varied than what is shown here.

The HDFS NameNode controls the DataNodes remotely over SSH, so the key configuration items should be set on the NameNode, while node-specific configuration should be set on each DataNode. In other words, the NameNode and the DataNodes may have different configurations, and different DataNodes may also be configured differently from one another.

However, to keep the cluster setup in this chapter simple, the same configuration files will be distributed to all cluster nodes in the form of a shared Docker image, which is the approach explained below.

Specific steps

The general idea is as follows: first configure a single Hadoop image that every node in the cluster can share, then use it as a prototype to spawn several containers that together form the cluster.

Configure the prototype

First, start a container from the previously prepared hadoop_proto image:

docker run -d --name=hadoop_temp --privileged hadoop_proto /usr/sbin/init

Enter the container (for example with docker exec -it hadoop_temp su hadoop, as in the previous chapter) and change to Hadoop's configuration directory:

cd $HADOOP_HOME/etc/hadoop

Here is a brief description of the files in this directory:

File              Purpose
workers           Records the hostnames or IP addresses of all DataNodes
core-site.xml     Hadoop core configuration
hdfs-site.xml     HDFS configuration items
mapred-site.xml   MapReduce configuration items
yarn-site.xml     YARN configuration items

Note: YARN provides resource management services for MapReduce; it is not used here for the time being.
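
All of these files live in the configuration directory entered above; a quick listing confirms they are present (the exact output depends on the Hadoop version):

ls $HADOOP_HOME/etc/hadoop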

We now design a simple cluster:

  • 1 NameNode: nn
  • 2 DataNodes: dn1 and dn2

First, edit workers and change its content to:

dn1
dn2
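
If you prefer to write the file from the shell instead of opening an editor, a heredoc is one way to do it (a minimal sketch; any text editor gives the same result):

cat > $HADOOP_HOME/etc/hadoop/workers <<EOF
dn1
dn2
EOF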

Then edit core-site.xml and add the following items inside its <configuration> root element:

<!-- Configure the HDFS host address and port number -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://nn:9000</value>
</property>
<!-- Configure Hadoop's temporary file directory -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>file:///home/hadoop/tmp</value>
</property>

Next, configure hdfs-site.xml by adding the following items, again inside the <configuration> element:

<!-- Store 2 replicas of each data block -->
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>

<!-- Set the directory where the NameNode stores its metadata -->
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hdfs/name</value>
</property>
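
Once these files are saved, the effective values can be sanity-checked without starting any daemons, because hdfs getconf simply reads the configuration files (run inside the container as the hadoop user):

hdfs getconf -confKey fs.defaultFS
hdfs getconf -confKey dfs.replication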

Finally, configure passwordless SSH. Because the same key pair and authorized_keys file will be baked into the shared image, the NameNode will later be able to log in to every DataNode without a password:

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
ssh-copy-id -i ~/.ssh/id_rsa hadoop@localhost
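
A quick way to confirm that key-based login works (the same test can be repeated later from the nn container against dn1 and dn2; StrictHostKeyChecking is disabled here only to suppress the first-connection prompt):

ssh -o StrictHostKeyChecking=no hadoop@localhost hostname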

At this point the cluster prototype is configured. Exit the container, stop it, and commit it as a new image named cluster_proto:

docker stop hadoop_temp
docker commit hadoop_temp cluster_proto

If it is no longer needed, the temporary container hadoop_temp can now be deleted.
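
For reference, the removal and a check that the new image exists look like this (standard Docker commands):

docker rm hadoop_temp
docker image ls cluster_proto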

Deploy the cluster

Next, deploy the cluster.

First, create the private network hnet for the Hadoop cluster:

docker network create --subnet=172.20.0.0/16 hnet
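
To confirm the network was created with the intended subnet, it can be inspected (plain Docker CLI; the output should include the 172.20.0.0/16 range configured above):

docker network inspect hnet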

Next create the cluster container:

docker run -d --name=nn --hostname=nn --network=hnet --ip=172.20.1.0 --add-host=dn1:172.20.1.1 --add-host=dn2:172.20.1.2 --privileged cluster_proto /usr/sbin/init
docker run -d --name=dn1 --hostname=dn1 --network=hnet --ip=172.20.1.1 --add-host=nn:172.20.1.0 --add-host=dn2:172.20.1.2 --privileged cluster_proto /usr/sbin/init
docker run -d --name=dn2 --hostname=dn2 --network=hnet --ip=172.20.1.2 --add-host=nn:172.20.1.0 --add-host=dn1:172.20.1.1 --privileged cluster_proto /usr/sbin/init
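
Because name resolution between the nodes relies on the --add-host entries above, it is worth checking that each container's hosts file lists the other nodes (shown here for nn; repeat for dn1 and dn2 if desired):

docker exec nn cat /etc/hosts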

Enter the NameNode container:

docker exec -it nn su hadoop

Format HDFS:

hdfs namenode -format
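
Formatting writes the NameNode's initial metadata into the directory configured by dfs.namenode.name.dir, so a successful format can be confirmed by listing that directory (as the hadoop user):

ls /home/hadoop/hdfs/name/current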

If nothing goes wrong, then the next step is to start HDFS:

start-dfs.sh

After a successful start, the jps command should show the NameNode and SecondaryNameNode processes. There is no DataNode process on the NameNode, because that process runs on dn1 and dn2 instead.
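
To confirm that both DataNodes have registered with the NameNode, the standard cluster report can be requested (run as the hadoop user on nn; it should list two live DataNodes):

hdfs dfsadmin -report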

At this point, HDFS can be exercised in exactly the same way as in the pseudo-distributed mode described in the previous chapter; there is no difference in how HDFS is used (the NameNode represents the entire cluster).
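
For example, a few familiar commands confirm that the distributed file system behaves as before (the paths here are illustrative):

hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -put $HADOOP_HOME/etc/hadoop/core-site.xml /user/hadoop
hdfs dfs -ls /user/hadoop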
