Building a distributed Hadoop and MapReduce environment based on Docker technology
1. Install Docker
1. Confirm the host environment
-
(If not already installed) install the lsb-release tool
apt install lsb-release
-
Check version
lsb_release -a
2. Prepare the installation environment
-
update system
sudo apt update
sudo apt upgrade
-
Install curl:
sudo apt install curl
3. Install docker
-
Install docker via curl tool
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
-
Confirm docker installation
sudo docker version
-
(Optional) Install docker-compose; the latest version at the time of writing is 1.29.2
You can visit https://github.com/docker/compose/releases/ first to confirm the version number
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
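As a quick optional check that docker-compose was installed correctly, print its version:
docker-compose --version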
-
Add Docker registry mirrors (CDNs located in China) to speed up image pulls
sudo vi /etc/docker/daemon.json
{ "registry-mirrors":["https://kfp63jaj.mirror.aliyuncs.com","https://docker.mirrors.ustc.edu.cn","https://registry.docker-cn.com","http://hub-mirror.c.163.com"] }
-
Reload Docker to make the mirror configuration take effect:
sudo systemctl daemon-reload
sudo systemctl restart docker
Restarting Docker ran into problems at this point (the error screenshot is not included here).
For convenience I first installed vim (I really can't get along with vi).
In the end I removed Docker, reinstalled it, and the problem was solved.
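If you want to confirm that the registry mirrors were actually picked up after the restart, they should be listed under "Registry Mirrors" in the output of docker info (a quick optional check):
sudo docker info | grep -A 4 "Registry Mirrors"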
-
Test whether Docker can pull an image and run it normally
-
Run the hello-world test case
sudo docker run hello-world
-
-
View the run record of the hello-world container
sudo docker ps -a
2. Building Hadoop and MapReduce based on Docker technology
1. Prepare the container environment
-
Pull the ubuntu 18.04 image as the base for building the Hadoop environment
sudo docker pull ubuntu:18.04
-
Check whether the image was pulled successfully
sudo docker images
-
Start a container with that ubuntu image
Map <host-share-path> to <container-share-path>
sudo docker run -it -v ~/hadoop/build:/home/hadoop/build ubuntu:18.04
The image apparently could not be found on the first few registry mirrors (each reported an error), but it was eventually found on one of the later mirrors
After the container starts, it will automatically enter the console of the container
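As a quick sanity check of the shared directory (the file name test.txt below is just an example of mine, not part of the original steps), a file created under the host path should show up under the container path:
# on the host
touch ~/hadoop/build/test.txt
# inside the container
ls /home/hadoop/build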
-
Install the required software on the console of the container
apt-get update
apt-get upgrade
-
We need to install net-tools (network utilities), vim (a command-line text editor) and openssh-server (an SSH server for remote login)
apt-get install net-tools vim openssh-server
2. Configure ssh server
-
Make the ssh server start automatically
vim ~/.bashrc
Go to the very end of the file, press o to open a new line in insert mode, and add:
/etc/init.d/ssh start
Press Esc to return to command mode, then type :wq to save and exit
-
Make changes take effect immediately
source ~/.bashrc
-
Configure passwordless access for ssh
ssh-keygen -t rsa
cd ~/.ssh
cat id_rsa.pub >> authorized_keys
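To verify that passwordless login works (assuming the ssh service was started by the ~/.bashrc change above), try logging in to the container itself and then exit:
ssh localhost
exit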
3. Install jdk8
(Note: Hadoop 3.x requires Java 8; Java 7 is not supported, and newer 3.3.x releases can also run on Java 11)
-
install jdk8
apt-get install openjdk-8-jdk
-
Add the JDK to the environment variables by editing the bash configuration file
vim ~/.bashrc
At the end of the file add:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export PATH=$PATH:$JAVA_HOME/bin
-
Make the jdk configuration take effect immediately
source ~/.bashrc
-
Test that the JDK works properly
java -version
4. Save the image
-
(Optional) To log in to Docker Hub you need to register an account on the Docker website in advance; the advantage is that you can push your own images to the Internet
sudo docker login
-
query container id
sudo docker ps -a
-
Save the current container as an image
sudo docker commit <container id> <image name>
-
When there are too many containers, you can delete a container with the following command
docker rm -f <container id>
5. Install hadoop
-
Download the hadoop binary tarball on the host console
The Hadoop version originally used in this article is 3.2.1; the latest version at the time of writing is 3.3.2 (which, as noted below, is the version actually installed)
Other versions can be downloaded from apache hadoop official website: https://hadoop.apache.org/releases.html
cd /<host-share-path>
<host-share-path> refers to the previous path when creating the container: ~/hadoop/build
wget https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
(Note: the closer.cgi link above only downloads a mirror-selection web page, not the tarball itself, so you still have to fetch the Hadoop tarball yourself.)
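If you prefer a direct link instead of the mirror-selection page, the Apache archive serves the tarballs directly; the URL below follows the standard archive layout (adjust the version number to the one you want):
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.2/hadoop-3.3.2.tar.gz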
-
Unzip hadoop on container console
If you don't like the name of the container or it's not easy to type, you can rename it
docker rename <old name> <new name>
Open the container
docker exec -it <container name or id> /bin/bash
If the container is not started, you need to start the container first
docker start <container name or id>
cd /<container-share-path>
<container-share-path> refers to the path specified when the container was created: /home/hadoop/build
tar -zxvf hadoop-3.3.2.tar.gz -C /usr/local
(Note: when I used hadoop-3.2.3.tar.gz I ran into problems during decompression, so I switched to hadoop-3.3.2.tar.gz)
-
The installation is complete, check the hadoop version
-
Configure environment variables
export HADOOP_HOME=/usr/local/hadoop-3.3.2
export HADOOP_YARN_HOME=$HADOOP_HOME
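These exports only last for the current shell. To make them persistent you could append the same lines to ~/.bashrc, as was done for the JDK; optionally also add the Hadoop bin and sbin directories to PATH so the commands can be run without the ./bin/ prefix (this PATH addition is my own convenience, not part of the original steps):
export HADOOP_HOME=/usr/local/hadoop-3.3.2
export HADOOP_YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin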
-
test
cd /usr/local/hadoop-3.3.2
./bin/hadoop version
-
-
Specify the JDK location for Hadoop
-
Modify the configuration file
Execute in the hadoop installation directory
vim etc/hadoop/hadoop-env.sh
Find the commented-out JAVA_HOME line and change it to the JDK location set earlier
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
-
-
Hadoop cluster configuration
-
Configure the core-site.xml file
Execute in the hadoop installation directory
vim etc/hadoop/core-site.xml
Add:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/usr/local/hadoop-3.3.2/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <!-- URI of the file system; code can access the file system through this address, called via hdfsoperator.hdfs_uri -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
-
Configure the hdfs-site.xml file
Execute in the hadoop installation directory
vim etc/hadoop/hdfs-site.xml
Add:
<configuration>
  <!-- Where the NameNode stores the fsimage -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop-3.3.2/namenode_dir</value>
  </property>
  <!-- Where the DataNodes store their data files -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop-3.3.2/datanode_dir</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
-
mapreduce configuration
The property definitions for this configuration file are documented at:
https://hadoop.apache.org/docs/r<hadoop version>/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
-
Configure mapred-site.xml
Execute in the hadoop installation directory
vim etc/hadoop/mapred-site.xml
Add:
<configuration>
  <!-- Name of the MapReduce framework -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <!-- Tell YARN and the MapReduce programs where Hadoop is installed -->
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
</configuration>
-
Configure yarn-site.xml file
Execute in the hadoop installation directory
vim etc/hadoop/yarn-site.xml
Add:
<configuration>
  <!-- Site specific YARN configuration properties -->
  <!-- Auxiliary service for the shuffle phase -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- Hostname of the ResourceManager; this name (master) is set in the next subsection -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
</configuration>
-
-
Service startup permission configuration
-
Configure start-dfs.sh and stop-dfs.sh files
Execute in the hadoop installation directory
vim sbin/start-dfs.sh
and
vim sbin/stop-dfs.sh
add at the beginning of the file
HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
-
Configure start-yarn.sh and stop-yarn.sh files
Execute in the hadoop installation directory
vim sbin/start-yarn.sh
and
vim sbin/stop-yarn.sh
add at the beginning of the file
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
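As an alternative (not tested here), these user variables can also be defined once in etc/hadoop/hadoop-env.sh instead of editing the four scripts, since that file is sourced by the start/stop scripts; for example:
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root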
-
Configuration is complete, save the image
-
Return to the host
exit
-
View containers
docker ps
-
Commit the container as an image
docker commit <container id> <image name>
-
-
-
Start hadoop and configure the network
-
Open three host consoles and start three containers: one master and two workers:
-
master
Open port mapping: 8088=>8080
sudo docker run -p 8088:8080 -it -h master --name master <image name>
-
worker01
sudo docker run -it -h worker01 --name worker01 <image name>
-
worker02
sudo docker run -it -h worker02 --name worker02 <image name>
-
Open /etc/hosts in each of the three containers and fill in the IP address to hostname mappings for all of them (all three containers need this configuration)
vim /etc/hosts
(When needed, you can query a container's IP with the ifconfig command; note that if and config are written as one word with no space in between, otherwise it is a different command)
Add the following entries (this file needs to be adjusted every time the containers are restarted, because the IPs may change)
<master's actual ip> master
<worker01's actual ip> worker01
<worker02's actual ip> worker02
The hosts file contents are the same on all three containers
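For example, with Docker's default bridge network the containers usually get addresses in the 172.17.0.0/16 range, so the file might look like this (the exact IPs are hypothetical; use the ones ifconfig reports):
172.17.0.2 master
172.17.0.3 worker01
172.17.0.4 worker02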
-
Check if the configuration is valid
ssh master
ssh worker01
ssh worker02
-
-
Configure the hostnames of the worker containers on the master container
cd /usr/local/hadoop-3.3.2
vim etc/hadoop/workers
Delete localhost and add:
worker01
worker02
network configuration complete
-
start hadoop
-
On the master host, start hadoop
cd /usr/local/hadoop-3.3.2
./bin/hdfs namenode -format
./sbin/start-all.sh
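To confirm the daemons came up, you can run jps in each container; on master you would typically expect to see NameNode, SecondaryNameNode and ResourceManager, and on the workers DataNode and NodeManager:
jps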
-
Create a directory to store files on hdfs
Suppose the directory to be created is: /home/hadoop/input
./bin/hdfs dfs -mkdir -p /home/hadoop/input
./bin/hdfs dfs -put ./etc/hadoop/*.xml /home/hadoop/input
-
Check whether the files were distributed and replicated properly
./bin/hdfs dfs -ls /home/hadoop/input
-
-
Run the sample program that comes with mapreduce
-
run the program
-
Create a new directory /home/hadoop/wordcount
./bin/hdfs dfs -mkdir /home/hadoop/wordcount
-
Create a new input file hello.txt and place it in the /home/hadoop/wordcount/ directory
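One minimal way to do this (the file contents are just an example):
echo "hello hadoop hello docker hello mapreduce" > hello.txt
./bin/hdfs dfs -put hello.txt /home/hadoop/wordcount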
-
execute program
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.2.jar wordcount /home/hadoop/wordcount /home/hadoop/wordcount/output
-
-
-
After running, check the output
(The original example task would not run on my cloud server, so I replaced it with the wordcount task)
./bin/hdfs dfs -ls /home/hadoop/wordcount/output
./bin/hdfs dfs -cat /home/hadoop/wordcount/output/*
3. Q&A
Q1: How do I get the NameNode out of safe mode?
A1: If you shut down the container without first running stop-all.sh, the NameNode may start up in safe mode. The command to leave safe mode is as follows
./bin/hdfs dfsadmin -safemode leave
Q2: How to delete exited containers in batches.
A2:
sudo docker container prune
Q3: The error "connect to host <docker node> port 22: Connection refused" appears when starting the Hadoop services.
A3: The ssh server is not running; start it with the following command
/etc/init.d/ssh start
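If you are not sure whether the ssh server is running in the first place, a quick check (either command should work on this Ubuntu image):
service ssh status
ps -e | grep sshd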