Deploying Hadoop with Docker and Building a Spark Runtime Environment with Docker (the most detailed tutorial on the net)

First, check the environment versions (if you have not yet installed Docker and Docker Compose, you can read my previous blog post:
Linux installation and configuration of Docker and Docker Compose, deploying MySQL, and the Chinese version of the Portainer graphical management interface in Docker).

View docker and docker-compose versions:

 docker version
docker-compose version

OK, the environment is fine; now we can officially start deploying Hadoop in Docker.

<Deploying Hadoop in Docker>

Update system

sudo apt update

sudo apt upgrade

Configure domestic (China) registry mirrors to speed up image downloads

Create or modify the /etc/docker/daemon.json file

sudo vi /etc/docker/daemon.json
{
    "registry-mirrors": [ 
    "http://hub-mirror.c.163.com",
    "https://docker.mirrors.ustc.edu.cn",
    "https://registry.docker-cn.com",
    "https://kfp63jaj.mirror.aliyuncs.com"]
}

Reload Docker so the mirror configuration takes effect

sudo systemctl daemon-reload
sudo systemctl restart docker
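To confirm that the mirror settings were actually picked up, a quick sanity check (just the standard docker CLI) is to look at the daemon info:

sudo docker info | grep -A 4 "Registry Mirrors"

The configured mirror URLs should be listed under "Registry Mirrors".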

Pull the ubuntu 20.04 image as the base for building the Hadoop environment

sudo docker pull ubuntu:20.04

Start a container from the ubuntu image, replacing the placeholders below with your actual paths.

sudo docker run -it -v <host-share-path>:<container-share-path> ubuntu

For example

sudo docker run -it -v ~/hadoop/build:/home/hadoop/build ubuntu

 

After the container is started, it will automatically enter the container's console.

Install the required software in the container's console

apt-get update

apt-get upgrade

 Install required software

apt-get install net-tools vim openssh-server

 

Start the ssh service:

/etc/init.d/ssh start

Let the ssh server start automatically

vi ~/.bashrc

Press o at the end of the file to open a new line and enter insert mode, then add:

/etc/init.d/ssh start

 

Press ESC to return to command mode, then type :wq to save and exit.

Make changes effective immediately

source ~/.bashrc

Configure passwordless access to ssh

ssh-keygen -t rsa

Press Enter at every prompt to accept the defaults

cd ~/.ssh
cat id_rsa.pub >> authorized_keys
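As an optional check (not in the original steps, but harmless), you can confirm that passwordless login now works inside the container:

ssh localhost
# the first connection asks you to confirm the host key; after that no password should be required
exit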

Re-enter the Ubuntu container in Docker

docker start 11f9454b301f
docker exec -it clever_gauss  bash

Install JDK 8

Hadoop 3.x requires at least JDK 8 (JDK 7 is no longer supported), so install OpenJDK 8

apt-get install openjdk-8-jdk

Add the JDK to the environment variables by editing the bash configuration file

vi ~/.bashrc

At the end of the file add

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

export PATH=$PATH:$JAVA_HOME/bin

Let jdk configuration take effect immediately

source ~/.bashrc

Check that the JDK works

java -version

Save the current container as an image

sudo docker commit <CONTAINER ID> <IMAGE NAME>

For example (my image is named ubuntu204):

sudo docker commit 11f9454b301f ubuntu204

You can see that the image has been created successfully. Next time you need a new container, you can create it directly from this image.

Note!!! The two paths involved in this process are as follows (don't mix them up):
<host-share-path> refers to ~/hadoop/build
<container-share-path> refers to /home/hadoop/build

Download hadoop, taking 3.2.3 as an example below

https://hadoop.apache.org/releases.html

cd  ~/hadoop/build
wget https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz

(This command does download something, but the resulting file size will be wrong, most likely because closer.cgi returns a mirror-selection HTML page rather than the tarball itself. Use the second method instead.)

Method Two:

Open https://dlcdn.apache.org/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz in the browser on your own computer,

download the archive there, and then upload it to the virtual machine with WinSCP.

Then open the terminal in the directory where the installation package is located and enter

sudo mv hadoop-3.2.3.tar.gz ~/hadoop/build

Move the files to the directory ~/hadoop/build
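Alternatively, if the virtual machine itself can reach the Apache CDN, you could skip the browser/WinSCP step and fetch the same archive (the URL given above) directly into ~/hadoop/build; a sketch:

cd ~/hadoop/build
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz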

Unpack Hadoop in the container's console (the console of the previously created container, not the host's console!)

docker start 11f9454b301f
docker exec -it clever_gauss  bash
cd /home/hadoop/build
tar -zxvf hadoop-3.2.3.tar.gz -C /usr/local

 

The installation is complete; check the Hadoop version

cd /usr/local/hadoop-3.2.3
./bin/hadoop version

Specify jdk location for hadoop

vi etc/hadoop/hadoop-env.sh

Find the commented-out JAVA_HOME line and change it to the JDK location set earlier

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

Hadoop cluster configuration

Configure core-site.xml file

vi etc/hadoop/core-site.xml

Add:

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop-3.2.3/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
        </property>
</configuration>

Configure hdfs-site.xml file

vi etc/hadoop/hdfs-site.xml

Add:

<configuration>
    <!-- Where the fsimage (NameNode metadata) is stored -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop-3.2.3/namenode_dir</value>
    </property>
    <!-- Where the data blocks are stored -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop-3.2.3/datanode_dir</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
</configuration>

MapReduce configuration

The available properties for this configuration file are documented at:

https://hadoop.apache.org/docs/r<Hadoop version>/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

Configure mapred-site.xml file

vi etc/hadoop/mapred-site.xml

Add:

<configuration>
    <!-- Name of the MapReduce framework -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- Tell YARN and the MapReduce programs where Hadoop is located -->
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
</configuration>
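One hedge about the values above: they rely on ${HADOOP_HOME} being resolvable when jobs run. If MapReduce jobs later fail with errors about HADOOP_MAPRED_HOME or MRAppMaster, a common workaround (my suggestion, not part of the original walkthrough) is to make the install path used in this tutorial explicit, either by replacing ${HADOOP_HOME} with /usr/local/hadoop-3.2.3 in the three values, or by exporting it in hadoop-env.sh:

echo 'export HADOOP_HOME=/usr/local/hadoop-3.2.3' >> /usr/local/hadoop-3.2.3/etc/hadoop/hadoop-env.sh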

Configure yarn-site.xml file

vi etc/hadoop/yarn-site.xml

Add:

<configuration>
<!-- Site specific YARN configuration properties -->
        <!-- Auxiliary service: shuffle for MapReduce -->
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <property>
            <name>yarn.resourcemanager.hostname</name>
            <value>master</value>
        </property>
</configuration>

Service startup permission configuration

Configure start-dfs.sh and stop-dfs.sh files

vi sbin/start-dfs.sh and vi sbin/stop-dfs.sh

vi sbin/start-dfs.sh

Add:
HDFS_DATANODE_USER=root

HADOOP_SECURE_DN_USER=hdfs

HDFS_NAMENODE_USER=root

HDFS_SECONDARYNAMENODE_USER=root

Continue to modify the configuration file

vi sbin/stop-dfs.sh
HDFS_DATANODE_USER=root

HADOOP_SECURE_DN_USER=hdfs

HDFS_NAMENODE_USER=root

HDFS_SECONDARYNAMENODE_USER=root

Configure start-yarn.sh and stop-yarn.sh files

vi sbin/start-yarn.sh and vi sbin/stop-yarn.sh

vi sbin/start-yarn.sh

Add:
YARN_RESOURCEMANAGER_USER=root

HADOOP_SECURE_DN_USER=yarn

YARN_NODEMANAGER_USER=root

vi sbin/stop-yarn.sh
YARN_RESOURCEMANAGER_USER=root

HADOOP_SECURE_DN_USER=yarn

YARN_NODEMANAGER_USER=root

These core configuration files must not be mixed up, otherwise there will be many problems later!

Configuration completed, save the image

docker ps

docker commit 11f9454b301f ubuntu-myx

The saved image is named ubuntu-myx

 

Start hadoop and configure the network

Open three consoles on the host and start three containers: one master and two workers.

master

Open port mapping: 8088 => 8088

sudo docker run -p 8088:8088 -it -h master --name master ubuntu-myx

Start node worker01

sudo docker run -it -h worker01 --name worker01 ubuntu-myx

Start node worker02

sudo docker run -it -h worker02 --name worker02 ubuntu-myx

Open /etc/hosts in each of the three containers and fill in the IP-address-to-hostname mappings for all of them (all three containers need this configuration)

vi /etc/hosts

Use the following command to query the ip

ifconfig

Add the entries (this file has to be re-edited every time the containers are restarted, because container IPs can change)

172.17.0.3      master

172.17.0.4      worker01

172.17.0.5      worker02
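If you prefer to read the addresses from the host instead of running ifconfig inside every container, the standard docker inspect command works too (the container names are the ones created above):

sudo docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' master
sudo docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' worker01
sudo docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' worker02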

 

Check if the configuration is valid

ssh master
ssh worker01
ssh worker02

The master connects to worker01 successfully.

worker01 connects to the master successfully.

worker02 connects to worker01 successfully.

On the master container, configure the hostnames of the worker containers

cd /usr/local/hadoop-3.2.3
vi etc/hadoop/workers

Delete localhost and add

worker01

worker02

Network configuration completed

Start hadoop

On the master container:

cd /usr/local/hadoop-3.2.3
./bin/hdfs namenode -format

The format should complete normally.

Start service

./sbin/start-all.sh

If all daemons start without errors, everything is normal.
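A quick way to see which daemons are actually up is jps, which ships with the OpenJDK installed earlier (run it in each container):

jps
# on the master you would expect NameNode, SecondaryNameNode and ResourceManager;
# on worker01/worker02 you would expect DataNode and NodeManager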

Create a directory on hdfs to store files

Assume the directory is: /home/hadoop/input

./bin/hdfs dfs -mkdir -p /home/hadoop/input
./bin/hdfs dfs -put ./etc/hadoop/*.xml /home/hadoop/input

Check whether the files were distributed and replicated normally

./bin/hdfs dfs -ls /home/hadoop/input
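To additionally confirm that both DataNodes have registered and blocks are being replicated, an extra (optional) check is the HDFS admin report:

./bin/hdfs dfsadmin -report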

Run an example:

Create a directory on hdfs to store files

For example

./bin/hdfs dfs -mkdir -p /home/hadoop/wordcount

Put a text file into it

./bin/hdfs dfs -put hello /home/hadoop/wordcount
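Here hello is assumed to be a small local text file; if you don't have one yet, a minimal way to create it (before running the -put command above) is:

echo "hello hadoop hello spark hello docker" > hello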

View distribution status

./bin/hdfs dfs -ls /home/hadoop/wordcount

Run MapReduce's built-in wordcount example program (if the built-in example fails to run, possibly because of limited virtual machine resources, use a simpler wordcount input instead)

./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.3.jar wordcount /home/hadoop/wordcount /home/hadoop/wordcount/output

 Run successfully:

 

After running, view the output results

./bin/hdfs dfs -ls /home/hadoop/wordcount/output
./bin/hdfs dfs -cat /home/hadoop/wordcount/output/*
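If you want to rerun the job, note that MapReduce refuses to write into an output directory that already exists, so remove it first:

./bin/hdfs dfs -rm -r /home/hadoop/wordcount/output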

At this point, Hadoop has been deployed successfully in Docker! If you follow the steps above, there is usually no problem.

Next, we use Docker to build the Spark runtime environment.

<Use docker to build spark running environment>

Use Docker Hub to find the image we need.

Reference: Docker Hub

curl -LO https://raw.githubusercontent.com/bitnami/containers/main/bitnami/spark/docker-compose.yml

Error message: curl: (7) Failed to connect to raw.githubusercontent.com port 443: Connection refused

The cause is most likely that this foreign domain is blocked. If you run into this, use the workaround below.

Solution:

1. Open https://www.ipaddress.com/ and look up the IP address currently bound to raw.githubusercontent.com.

2. Add the mapping to the hosts file:

vi /etc/hosts

Add at the end:

185.199.108.133 raw.githubusercontent.com

curl runs successfully

curl -LO https://raw.githubusercontent.com/bitnami/containers/main/bitnami/spark/docker-compose.yml

 docker-compose up

 

 

Install Spark's Docker image

docker pull bitnami/spark:latest

docker pull bitnami/spark:[TAG]
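For example, to match the 3.2 Dockerfile used later in this post (assuming such a tag is published on Docker Hub):

docker pull bitnami/spark:3.2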

Fix for git clone failing with: fatal: unable to access 'https://github.com/...'

After consulting some information, I found that mapping needs to be added to the hosts file.

vi /etc/hosts

Add two lines to the hosts file

140.82.113.4 github.com

140.82.113.4 www.github.com

git clone
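The original post doesn't spell out the clone command; it is presumably the Bitnami containers repository that the raw URL above points to, i.e. something like:

git clone https://github.com/bitnami/containers.git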

cd bitnami/APP/VERSION/OPERATING-SYSTEM

Find the corresponding directory:

cd /home/rgzn/containers/bitnami/spark/3.2/debian-11

 

 # . represents the current directory

docker build -t bitnami/spark:latest .

Parameter Description:

-t: Specify the target image name to be created

. : the directory containing the Dockerfile (you can also give the absolute path to the Dockerfile's directory)

Find the directory containing the Dockerfile and execute the command to build the image yourself

Deploy the spark environment using yml deployment files

The spark.yml file can be edited locally and then uploaded to the virtual machine or server. The contents of the spark.yml file are as follows:

version: '3.8'

services:
  spark-master:
    image: bde2020/spark-master
    container_name: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
    volumes:
      - ~/spark:/data
    environment:
      - INIT_DAEMON_STEP=setup_spark
  spark-worker-1:
    image: bde2020/spark-worker:latest
    container_name: spark-worker-1
    depends_on:
      - spark-master
    ports:
      - "8081:8081"
    volumes:
      - ~/spark:/data
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"
  spark-worker-2:
    image: bde2020/spark-worker:latest
    container_name: spark-worker-2
    depends_on:
      - spark-master
    ports:
      - "8082:8081"
    volumes:
      - ~/spark:/data
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"

Deploy the spark environment using yml deployment files

cd /usr/local/bin

Create the file: sudo vim spark.yml

sudo chmod 777 spark.yml

In the directory where the spark.yml file is located, execute the command:

sudo docker-compose -f spark.yml up -d

View container creation and running status

sudo docker ps

Format the output

sudo docker ps --format '{{.ID}} {{.Names}}'

Use a browser to view the master's web ui interface

127.0.0.1:8080

http://192.168.95.171:50070

Enter the spark-master container

sudo docker exec -it <master container id, just enter part of it> /bin/bash

sudo docker exec -it 98600cfa9ba7 /bin/bash

In this image, Spark is installed under /spark; check the installation:

ls /spark/bin

Enter spark-shell

/spark/bin/spark-shell --master spark://spark-master:7077 --total-executor-cores 8 --executor-memory 2560m

or

/spark/bin/spark-shell

Enter the browser to view the status of spark-shell

Test: Create RDD and filter processing

Create an RDD

val rdd=sc.parallelize(Array(1,2,3,4,5,6,7,8))

Print rdd content

rdd.collect()

 Query the number of partitions

rdd.partitions.size

Select a value greater than 5

val rddFilter=rdd.filter(_ > 5)

Print rddFilter content

rddFilter.collect()

Exit spark-shell

:quit

The example runs successfully!

That is how to deploy Hadoop with Docker and build a Spark runtime environment with Docker. I spent a long time preparing this tutorial; if there are no problems with your environment, you should be able to complete the Docker deployment of Hadoop and run Spark by following it. I wish everyone all the best!

"Rather than standing by the water coveting the fish, it is better to step back and weave a net."
