Deploying hadoop with Docker and building a spark runtime environment with Docker (the most detailed tutorial on the whole web)
First, check the version environment. (If you have not yet installed docker and docker-compose, you can read my previous blog:
Installing and configuring Docker and Docker Compose on Linux, and deploying MySQL and the Chinese version of the Portainer graphical management interface in Docker.)
View docker and docker-compose versions:
docker version
docker-compose version
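If either command is not found or errors out, first make sure the docker service itself is running (a quick sanity check, assuming a systemd-based system):
sudo systemctl status docker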
OK, the environment is fine. Now we can officially start deploying hadoop in Docker.
<Deploying Hadoop in Docker>
Update system
sudo apt update
sudo apt upgrade
To speed up image downloads in mainland China, modify the registry mirror sources
Create or modify the /etc/docker/daemon.json file
sudo vi /etc/docker/daemon.json
{
    "registry-mirrors": [
        "http://hub-mirror.c.163.com",
        "https://docker.mirrors.ustc.edu.cn",
        "https://registry.docker-cn.com",
        "https://kfp63jaj.mirror.aliyuncs.com"
    ]
}
Reload docker so that the mirror configuration takes effect
sudo systemctl daemon-reload
sudo systemctl restart docker
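To confirm the mirrors took effect, you can check docker's own report (just a sanity check, not required):
sudo docker info | grep -A 4 "Registry Mirrors"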
Pull the ubuntu 20.04 image as the base for building the hadoop environment
sudo docker pull ubuntu:20.04
Start a container from the ubuntu image, substituting your own paths for the placeholders:
sudo docker run -it -v <host-share-path>:<container-share-path> ubuntu
For example
sudo docker run -it -v ~/hadoop/build:/home/hadoop/build ubuntu
After the container starts, it automatically drops you into the container's console.
Update the package index in the container's console
apt-get update
apt-get upgrade
Install required software
apt-get install net-tools vim openssh-server
/etc/init.d/ssh start
Let the ssh server start automatically
vi ~/.bashrc
Go to the end of the file, press o to open a new line in insert mode, and add:
/etc/init.d/ssh start
Press ESC to return to command mode, then type :wq to save and exit.
Make changes effective immediately
source ~/.bashrc
Configure passwordless ssh access
ssh-keygen -t rsa
Press Enter at every prompt
cd ~/.ssh
cat id_rsa.pub >> authorized_keys
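A quick sanity check that passwordless login now works (assuming the ssh server started above is still running):
chmod 600 ~/.ssh/authorized_keys   # ssh may ignore keys with loose permissions
ssh -o StrictHostKeyChecking=no localhost echo ok   # should print "ok" without asking for a password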
Re-enter the ubuntu container in docker (substitute your own container ID and name):
docker start 11f9454b301f
docker exec -it clever_gauss bash
Install JDK 8
hadoop 3.x requires jdk 8 (jdk 7 is no longer supported by hadoop 3; hadoop 3.3 and later can also run on jdk 11)
apt-get install openjdk-8-jdk
Add the jdk to the environment variables by editing the bash configuration file
vi ~/.bashrc
At the end of the file add
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export PATH=$PATH:$JAVA_HOME/bin
Let jdk configuration take effect immediately
source ~/.bashrc
Check that the jdk works correctly
java -version
Save the current container as an image
sudo docker commit <CONTAINER ID> <IMAGE NAME> # your own image name
sudo docker commit 11f9454b301f ubuntu204 # mine is ubuntu204
You can see that the image has been created successfully. You can use this image directly next time you need to create a new container.
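If you want to confirm from the command line rather than the screenshot, list the images:
sudo docker images | grep ubuntu204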
Note!!! The two relevant paths in this process are as follows (don't get them confused):
<host-share-path> refers to ~/hadoop/build
<container-share-path> refers to /home/hadoop/build
Download hadoop; 3.2.3 is used as the example below
https://hadoop.apache.org/releases.html
cd ~/hadoop/build
wget https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz
(This command does download something, but it fetches the mirror-selection page rather than the tarball itself, so the downloaded file size will be wrong. We can use the second method instead.)
Method Two:
Enter https://dlcdn.apache.org/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz into your computer's browser
Download it to your computer and upload it to the virtual machine with WinSCP
Then open the terminal in the directory where the installation package is located and enter
sudo mv hadoop-3.2.3.tar.gz ~/hadoop/build
Move the files to the directory ~/hadoop/build
Unzip hadoop in the container's console (this is the console of the previously created container, not your host console!)
docker start 11f9454b301f
docker exec -it clever_gauss bash
cd /home/hadoop/build
tar -zxvf hadoop-3.2.3.tar.gz -C /usr/local
The installation is complete, check the hadoop version
cd /usr/local/hadoop-3.2.3
./bin/hadoop version
Specify jdk location for hadoop
vi etc/hadoop/hadoop-env.sh
Find the commented out JAVA_HOME configuration location and change it to the jdk location just set
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
hadoop cluster configuration
Configure core-site.xml file
vi etc/hadoop/core-site.xml
Add:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/usr/local/hadoop-3.2.3/tmp</value>
<description>Abase for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
</configuration>
Configure hdfs-site.xml file
vi etc/hadoop/hdfs-site.xml
Add:
<configuration>
<!-- where the NameNode stores the fsimage -->
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop-3.2.3/namenode_dir</value>
</property>
<!-- where the DataNode stores data blocks -->
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop-3.2.3/datanode_dir</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>
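Hadoop can usually create these directories itself, but pre-creating them (with the same paths as configured above) avoids permission surprises; this step is optional:
mkdir -p /usr/local/hadoop-3.2.3/tmp /usr/local/hadoop-3.2.3/namenode_dir /usr/local/hadoop-3.2.3/datanode_dir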
MapReduce configuration
The definitions in this configuration file are documented at:
https://hadoop.apache.org/docs/r<hadoop-version>/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
Configure mapred-site.xml file
vi etc/hadoop/mapred-site.xml
Add:
<configuration>
<!-- the name of the mapreduce framework -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- tell yarn and the mapreduce programs where hadoop lives -->
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
</configuration>
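Note that the ${HADOOP_HOME} references above only resolve if HADOOP_HOME is actually defined. If you have not set it anywhere yet, one option (my assumption, matching the install path used in this tutorial) is to add it to ~/.bashrc in the container and reload:
export HADOOP_HOME=/usr/local/hadoop-3.2.3
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
source ~/.bashrc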
Configure yarn-site.xml file
vi etc/hadoop/yarn-site.xml
Add:
<configuration>
<!-- Site specific YARN configuration properties -->
<!-- auxiliary service: shuffle for mapreduce -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
</configuration>
Service startup permission configuration
Configure start-dfs.sh and stop-dfs.sh files
Edit both sbin/start-dfs.sh and sbin/stop-dfs.sh, adding the following near the top of each file.
vi sbin/start-dfs.sh
HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
Continue to modify the configuration file
vi sbin/stop-dfs.sh
HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
Configure start-yarn.sh and stop-yarn.sh files
Edit both sbin/start-yarn.sh and sbin/stop-yarn.sh, adding the following near the top of each file.
vi sbin/start-yarn.sh
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
vi sbin/stop-yarn.sh
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
Take care not to mix up these core files, or many problems will appear later!
Configuration completed, save the image
docker ps
docker commit 11f9454b301f ubuntu-myx
The saved image is named ubuntu-myx
Start hadoop and configure the network
Open three host consoles and start three containers: one master and two workers.
master
Open port mapping: 8088 => 8088
sudo docker run -p 8088:8088 -it -h master --name master ubuntu-myx
Start node worker01
sudo docker run -it -h worker01 --name worker01 ubuntu-myx
Node worker02
sudo docker run -it -h worker02 --name worker02 ubuntu-myx
Open /etc/hosts in each of the three containers and fill in the mappings between every container's IP address and hostname (all three containers need this configuration)
vi /etc/hosts
Use the following command to query the ip
ifconfig
Add the entries below (this file has to be re-adjusted every time the containers restart, because the IPs may change)
172.17.0.3 master
172.17.0.4 worker01
172.17.0.5 worker02
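Instead of editing by hand, you can append all three mappings in one go (replace the IPs with whatever ifconfig reported on your machine):
cat >> /etc/hosts << EOF
172.17.0.3 master
172.17.0.4 worker01
172.17.0.5 worker02
EOF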
Check if the configuration is valid
ssh master
ssh worker01
ssh worker02
The master connects to the worker01 node successfully:
The worker01 node successfully connected to the master:
worker02 connects to worker01 node successfully:
Configure the worker hostnames on the master container
cd /usr/local/hadoop-3.2.3
vi etc/hadoop/workers
Delete localhost and add
worker01
worker02
Network configuration completed
Start hadoop
On the master container
cd /usr/local/hadoop-3.2.3
./bin/hdfs namenode -format
If the format completes normally, start the services
./sbin/start-all.sh
If the output looks like the following, everything is running normally
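Besides the console output, you can verify with jps (it ships with the JDK). On the master you should see NameNode, SecondaryNameNode and ResourceManager; on each worker, DataNode and NodeManager:
jps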
Create a directory on hdfs to store files
Assume the directory is: /home/hadoop/input
./bin/hdfs dfs -mkdir -p /home/hadoop/input
./bin/hdfs dfs -put ./etc/hadoop/*.xml /home/hadoop/input
Check whether distribution replication is normal
./bin/hdfs dfs -ls /home/hadoop/input
Run an example case:
Create a directory on hdfs to store files
For example
./bin/hdfs dfs -mkdir -p /home/hadoop/wordcount
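The next step assumes a local text file named hello; if you do not have one yet, any small text file will do, for example:
echo "hello world hello hadoop hello spark" > hello   # hypothetical sample input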
Put the text file into HDFS
./bin/hdfs dfs -put hello /home/hadoop/wordcount
View distribution status
./bin/hdfs dfs -ls /home/hadoop/wordcount
Run the wordcount program from MapReduce's built-in examples jar (some of the heavier built-in examples may fail to run, possibly due to virtual machine performance limits, which is why a simple wordcount is used here)
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.3.jar wordcount /home/hadoop/wordcount /home/hadoop/wordcount/output
Run successfully:
After running, view the output results
./bin/hdfs dfs -ls /home/hadoop/wordcount/output
./bin/hdfs dfs -cat /home/hadoop/wordcount/output/*
At this point, hadoop has been deployed successfully in Docker! If you follow the steps, there is usually no problem.
Next we use docker to build the spark running environment
<Use docker to build spark running environment>
Find the image we need on Docker Hub
Reference: Docker Hub
curl -LO https://raw.githubusercontent.com/bitnami/containers/main/bitnami/spark/docker-compose.yml
You may see: curl: (7) Failed to connect to raw.githubusercontent.com port 443: Connection refused
The cause is most likely that the foreign IP is blocked. If you hit this wall, just use the solution below.
solution:
1. Open https://www.ipaddress.com/ and look up the IP address currently bound to the domain raw.githubusercontent.com.
vi /etc/hosts
Add at the end:
185.199.108.133 raw.githubusercontent.com
curl now runs successfully
curl -LO https://raw.githubusercontent.com/bitnami/containers/main/bitnami/spark/docker-compose.yml
docker-compose up
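If you prefer the containers to keep running in the background, docker-compose also accepts the -d flag:
docker-compose up -d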
Install Spark's docker image
docker pull bitnami/spark:latest
docker pull bitnami/spark:[TAG]
Solution to git clone fatal: unable to access 'https://github.com/...'
After consulting some references, I found that a mapping needs to be added to the hosts file.
vi /etc/hosts
Add two lines to the hosts file
140.82.113.4 github.com
140.82.113.4 www.github.com
Clone the repository (the bitnami containers repo referenced above):
git clone https://github.com/bitnami/containers.git
cd bitnami/APP/VERSION/OPERATING-SYSTEM
Find the corresponding directory:
cd /home/rgzn/containers/bitnami/spark/3.2/debian-11
# . represents the current directory
docker build -t bitnami/spark:latest .
Parameter Description:
-t: Specify the target image name to be created
.: The directory where the Dockerfile file is located, you can specify the absolute path of the Dockerfile
Find the directory containing the Dockerfile and execute the command to build the image yourself
Deploy the spark environment using yml deployment files
The spark.yml file can be edited locally and then uploaded to the virtual machine or server. The contents of the spark.yml file are as follows:
version: '3.8'
services:
  spark-master:
    image: bde2020/spark-master
    container_name: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
    volumes:
      - ~/spark:/data
    environment:
      - INIT_DAEMON_STEP=setup_spark
  spark-worker-1:
    image: bde2020/spark-worker:latest
    container_name: spark-worker-1
    depends_on:
      - spark-master
    ports:
      - "8081:8081"
    volumes:
      - ~/spark:/data
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"
  spark-worker-2:
    image: bde2020/spark-worker:latest
    container_name: spark-worker-2
    depends_on:
      - spark-master
    ports:
      - "8082:8081"
    volumes:
      - ~/spark:/data
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"
Now deploy the spark environment with this yml file
cd /usr/local/bin
Create the file: sudo vim spark.yml (paste in the yml contents above)
sudo chmod 777 spark.yml
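Optionally, before starting anything, let docker-compose validate and echo the parsed file; this is a cheap way to catch YAML indentation mistakes:
sudo docker-compose -f spark.yml config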
In the directory where the spark.yml file is located, execute the command:
sudo docker-compose -f spark.yml up -d
View container creation and running status
sudo docker ps
Format the output
sudo docker ps --format '{{.ID}} {{.Names}}'
Use a browser to view the master's web UI interface:
127.0.0.1:8080
(from another machine, use the host's IP instead, e.g. http://192.168.95.171:8080)
Enter the spark-master container
sudo docker exec -it <master container id, just enter part of it> /bin/bash
sudo docker exec -it 98600cfa9ba7 /bin/bash
The spark installation inside the container lives under /spark; check it:
ls /spark/bin
Enter spark-shell
/spark/bin/spark-shell --master spark://spark-master:7077 --total-executor-cores 8 --executor-memory 2560m
or
/spark/bin/spark-shell
Open the browser again to view the spark-shell application's status
Test: Create RDD and filter processing
Create an RDD
val rdd=sc.parallelize(Array(1,2,3,4,5,6,7,8))
Print rdd content
rdd.collect()
Query the number of partitions
rdd.partitions.size
Select a value greater than 5
val rddFilter=rdd.filter(_ > 5)
Print rddFilter content
rddFilter.collect()
Exit spark-shell
:quit
The example case runs successfully!
The above is the whole process of deploying hadoop with Docker and building a spark runtime environment with docker. I spent a long time preparing this tutorial. As long as there are no problems with your environment, following these steps should get hadoop deployed in docker and spark running. I wish everyone all the best!
"Rather than standing at the abyss envying the fish, it is better to step back and weave a net."