An HDFS cluster is built on top of a Hadoop cluster. Since the HDFS daemons are the core processes of Hadoop, configuring an HDFS cluster is representative of the Hadoop cluster configuration process in general.
Docker makes it possible to build such a cluster environment more conveniently and efficiently.
Configuration on each machine
Questions naturally arise while learning: how should a Hadoop cluster be configured, and which settings belong on which machines? This chapter provides a typical example, but Hadoop's configuration options are far more numerous and varied than what is shown here.
The HDFS name node controls the data nodes remotely over SSH, so the key configuration items should be set on the name node, while the remaining settings are configured on each data node. In other words, the configuration of the name node and the data nodes can differ, and different data nodes can also be configured differently.
However, to make it easier to set up the cluster in this chapter, the same configuration files will be distributed to all cluster nodes in the form of a Docker image; this is worth pointing out explicitly.
Specific steps
The general idea is this. We first use a Hadoop image to configure it so that all nodes in the cluster can share it, and then use it as a prototype to generate several containers to form a cluster.
Configuration prototype
First, start a container from the previously prepared hadoop_proto image:
docker run -d --name=hadoop_temp --privileged hadoop_proto /usr/sbin/init
Enter the configuration file directory of Hadoop:
cd $HADOOP_HOME/etc/hadoop
Here is a brief description of the purpose of the files in this directory:
File | Purpose |
---|---|
workers | Record the hostname or IP address of all data nodes |
core-site.xml | Hadoop core configuration |
hdfs-site.xml | HDFS configuration items |
mapred-site.xml | MapReduce configuration items |
yarn-site.xml | YARN configuration items |
Note: YARN provides resource management services for MapReduce; it is not used here for the time being.
We now design such a simple cluster:
- 1 name node: nn
- 2 data nodes: dn1 and dn2
First edit workers and change the file content to:
dn1
dn2
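The workers file expects one hostname per line. Assuming the current directory is $HADOOP_HOME/etc/hadoop, the file can also be written non-interactively:

```shell
# Write one data node hostname per line into the workers file
printf 'dn1\ndn2\n' > workers
```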
Then edit core-site.xml and add the following configuration items in it:
<!-- HDFS host address and port number -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://nn:9000</value>
</property>
<!-- Hadoop's temporary file directory -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>file:///home/hadoop/tmp</value>
</property>
Configure hdfs-site.xml, add the following configuration items:
<!-- Keep 2 replicas of each data block -->
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>
<!-- Directory for storing the name node's metadata -->
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hdfs/name</value>
</property>
Finally, you need to configure SSH:
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
ssh-copy-id -i ~/.ssh/id_rsa hadoop@localhost
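As an optional check (assuming the SSH service is running in the container), BatchMode makes ssh fail instead of prompting for a password, so the command below only succeeds when key-based login works:

```shell
# Should print "ssh ok" without asking for a password
ssh -o BatchMode=yes hadoop@localhost 'echo ssh ok'
```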
At this point, the cluster prototype is configured. You can exit the container and commit it as a new image, cluster_proto:
docker stop hadoop_temp
docker commit hadoop_temp cluster_proto
If necessary, you can now delete the temporary container hadoop_temp.
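A possible way to confirm the commit succeeded and clean up:

```shell
docker images cluster_proto   # the new image should be listed
docker rm hadoop_temp         # remove the stopped temporary container
```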
Deploy cluster
Next deploy the cluster.
First, create the private network hnet for the Hadoop cluster:
docker network create --subnet=172.20.0.0/16 hnet
Next create the cluster container:
docker run -d --name=nn --hostname=nn --network=hnet --ip=172.20.1.0 \
    --add-host=dn1:172.20.1.1 --add-host=dn2:172.20.1.2 \
    --privileged cluster_proto /usr/sbin/init
docker run -d --name=dn1 --hostname=dn1 --network=hnet --ip=172.20.1.1 \
    --add-host=nn:172.20.1.0 --add-host=dn2:172.20.1.2 \
    --privileged cluster_proto /usr/sbin/init
docker run -d --name=dn2 --hostname=dn2 --network=hnet --ip=172.20.1.2 \
    --add-host=nn:172.20.1.0 --add-host=dn1:172.20.1.1 \
    --privileged cluster_proto /usr/sbin/init
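Before going further, it can be worth a quick sanity check that the three containers are up and can resolve each other by name (these commands assume the containers started cleanly):

```shell
docker ps --format '{{.Names}}: {{.Status}}'   # nn, dn1 and dn2 should be Up
docker exec nn ping -c 1 dn1                   # the --add-host entry should resolve
```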
Enter the name node:
docker exec -it nn su hadoop
Format HDFS:
hdfs namenode -format
If nothing goes wrong, then the next step is to start HDFS:
start-dfs.sh
After a successful start, the jps command should show the NameNode and SecondaryNameNode processes. There is no DataNode process on the name node, because that process runs on dn1 and dn2.
At this point, you can verify that HDFS is working in the same way as described in the previous chapter on pseudo-distributed mode, and there is no difference in how HDFS is used (the name node represents the entire cluster).
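For example, a minimal smoke test run as the hadoop user on nn might look like this (the /user/hadoop path is just an illustrative choice):

```shell
hdfs dfsadmin -report                 # should report 2 live data nodes
hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -put $HADOOP_HOME/etc/hadoop/core-site.xml /user/hadoop/
hdfs dfs -ls /user/hadoop             # the uploaded file should appear
```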