Building a distributed environment of Hadoop and MapReduce based on Docker technology


1. Install Docker

1. Confirm the host environment

  1. (If not already installed) install the lsb-release tool

    apt install lsb-release

    insert image description here

  2. Check version

    lsb_release -a

    insert image description here

2. Prepare the installation environment

  1. update system

    sudo apt update

    insert image description here

    insert image description here

    sudo apt upgrade

    insert image description here

    insert image description here

  2. Install curl:

    sudo apt install curl

    insert image description here

3. Install Docker

  1. Install docker via curl tool

    curl -fsSL https://get.docker.com -o get-docker.sh

    insert image description here

    sudo sh get-docker.sh

    insert image description here

    insert image description here

  2. Confirm docker installation

    sudo docker version

    insert image description here

  3. (Optional) Install docker-compose; the latest version at the time of writing is 1.29.2

    You can visit https://github.com/docker/compose/releases/ first to confirm the version number

    sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose

    insert image description here

    sudo chmod +x /usr/local/bin/docker-compose

    insert image description here
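    To confirm the binary is usable, you can print its version (a quick sanity check; it should report the release you downloaded, here 1.29.2):

    docker-compose --version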

  4. Add domestic (China) Docker registry mirrors

    sudo vi /etc/docker/daemon.json

    {
      "registry-mirrors": [
        "https://kfp63jaj.mirror.aliyuncs.com",
        "https://docker.mirrors.ustc.edu.cn",
        "https://registry.docker-cn.com",
        "http://hub-mirror.c.163.com"
      ]
    }

    insert image description here

  5. Reload Docker to make the mirror configuration take effect:

    sudo systemctl daemon-reload

    insert image description here

    sudo systemctl restart docker

    insert image description here

    Restarting Docker ran into problems (as shown above)

    For convenience, I first installed vim (I really can't get along with vi)

    insert image description here

    So I removed Docker and reinstalled it, and the problem was solved

    insert image description here
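    Once Docker restarts cleanly, you can verify that the registry mirrors were actually picked up (a quick check; the exact output layout varies with the Docker version):

    sudo docker info | grep -A 4 "Registry Mirrors"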

  6. Test whether Docker can pull an image and run it normally

    1. Run the hello-world test case

      sudo docker run hello-world

      insert image description here

  7. View the run record of the hello-world container

    sudo docker ps -a

    insert image description here

2. Building Hadoop and MapReduce based on Docker technology

1. Prepare the container environment

  1. Pull the ubuntu 18.04 image as the basis for building the Hadoop environment

    sudo docker pull ubuntu:18.04

    insert image description here

  2. Check whether the image was pulled successfully

    sudo docker images

    insert image description here

  3. Start a container with that ubuntu image

    Bind-mount <host-share-path> to <container-share-path> inside the container

    sudo docker run -it -v ~/hadoop/build:/home/hadoop/build ubuntu:18.04

    insert image description here

    The image could not be found in the first few mirrors and an error was reported, but it was then found in one of the later mirrors

    After the container starts, it automatically drops you into the container's console

  4. Install the required software on the console of the container

    apt-get update

    insert image description here

    apt-get upgrade

    insert image description here

  5. Install net-tools (network management tools), vim (command-line text editor) and openssh-server (the SSH server)

    apt-get install net-tools vim openssh-server

    insert image description here

2. Configure ssh server

  1. Make the ssh server start automatically

    vim ~/.bashrc

    Press o at the very end of the file to enter insert mode and add:

    /etc/init.d/ssh start

    Press Esc to return to command mode, then type :wq to save and exit

    insert image description here
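    Equivalently, the line can be appended without opening vim (a small convenience, assuming the default ~/.bashrc location):

    echo '/etc/init.d/ssh start' >> ~/.bashrc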

  2. Make changes take effect immediately

    source ~/.bashrc

    insert image description here

  3. Configure passwordless access for ssh

    ssh-keygen -t rsa

    insert image description here

    cd ~/.ssh

    cat id_rsa.pub >> authorized_keys

    insert image description here
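    To confirm passwordless login works, a quick check like the following should connect without asking for a password (tightening the key file permissions first is optional but avoids sshd rejecting an over-permissive file):

    chmod 600 ~/.ssh/authorized_keys
    # the first connection will ask to confirm the host key, but it should not ask for a password
    ssh localhost exit && echo "passwordless ssh OK"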

3. Install JDK 8

(Note: Hadoop 3.x requires JDK 8)

  1. Install JDK 8

    apt-get install openjdk-8-jdk

    insert image description here

  2. Add the JDK to the environment variables by editing the bash configuration file

    vim ~/.bashrc

    At the end of the file add:

    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
    
    export PATH=$PATH:$JAVA_HOME/bin
    

    insert image description here

  3. Make the jdk configuration take effect immediately

    source ~/.bashrc

    insert image description here

  4. Test that the JDK works properly

    java -version

    insert image description here

4. Save the image

  1. (Optional) To log in to Docker, you need to register an account on Docker Hub in advance. The advantage is that you can push your own images to the registry

    sudo docker login

    insert image description here

  2. query container id

    sudo docker ps -a

    insert image description here

  3. Save the current container as an image

    sudo docker commit <container id> <image name>

    insert image description here
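    For example (a sketch; the image name ubuntu-jdk8 and the commit message are just illustrative choices):

    sudo docker commit -m "ubuntu 18.04 with ssh and jdk8" <container id> ubuntu-jdk8

    sudo docker images    # the new image should now appear in the list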

  4. When there are too many containers, you can delete the container with the following command

    docker rm -f <containerid>

    insert image description here

5. Install hadoop

  1. Download the hadoop binary tarball on the host console

    The Hadoop version originally targeted in this article is 3.2.1; the latest version at the time of writing is 3.3.2 (3.3.2 is what ends up being installed below)

    Other versions can be downloaded from apache hadoop official website: https://hadoop.apache.org/releases.html

    cd /<host-share-path>

    <host-share-path> refers to the previous path when creating the container: ~/hadoop/build

    insert image description here

    wget https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz

    (This URL only returns the mirror-selection web page, not the tarball itself, so you still have to download the Hadoop package from an actual mirror; see the example below)

    insert image description here
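    For example, the tarball can be fetched directly from the Apache archive instead (a sketch; any mirror linked from the releases page works equally well):

    wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.2/hadoop-3.3.2.tar.gz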

  2. Extract Hadoop in the container's console

    If you don't like the name of the container or it's not easy to type, you can rename it

    docker rename <old name> <new name>

    insert image description here

    insert image description here

    insert image description here

    Enter the container

    docker exec -it <container name or id> /bin/bash

    insert image description here

    If the container is not started, you need to start the container first

    insert image description here

    docker start <container name or id>

    insert image description here

    cd /<container-share-path>

    <container-share-path> refers to the path specified earlier when the container was created: /home/hadoop/build

    insert image description here

    tar -zxvf hadoop-3.3.2.tar.gz -C /usr/local

    (Note: when I used hadoop-3.2.3.tar.gz I ran into problems extracting it, so I switched to hadoop-3.3.2.tar.gz)

    insert image description here

  3. The installation is complete; check the Hadoop version

    1. Configure environment variables

      export HADOOP_HOME=/usr/local/hadoop-3.3.2
      export HADOOP_YARN_HOME=$HADOOP_HOME
      

      insert image description here
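      To keep these variables across shell sessions, they can also be appended to ~/.bashrc (a sketch; adding the bin and sbin directories to PATH is optional but convenient):

      echo 'export HADOOP_HOME=/usr/local/hadoop-3.3.2' >> ~/.bashrc
      echo 'export HADOOP_YARN_HOME=$HADOOP_HOME' >> ~/.bashrc
      echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc
      source ~/.bashrc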

    2. test

      cd /usr/local/hadoop-3.3.2

      ./bin/hadoop version

      insert image description here

  4. Specify the JDK location for Hadoop

    1. Modify the configuration file

      Execute in the hadoop installation directory

      vim etc/hadoop/hadoop-env.sh

      insert image description here

      Find the commented-out JAVA_HOME line and change it to the JDK location set earlier

      export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
      

      insert image description here

  5. Hadoop cluster configuration

    1. Configure the core-site.xml file

      Execute in the hadoop installation directory

      vim etc/hadoop/core-site.xml

      insert image description here

      Add:

      <configuration>
        <property>
          <name>hadoop.tmp.dir</name>
          <value>file:/usr/local/hadoop-3.3.2/tmp</value>
          <description>A base for other temporary directories.</description>
        </property>
        <!-- URI of the file system; code can access the file system through this address (used via hdfsoperator.hdfs_uri) -->
        <property>
          <name>fs.defaultFS</name>
          <value>hdfs://master:9000</value>
        </property>
      </configuration>
      

      insert image description here

    2. Configure the hdfs-site.xml file

      Execute in the hadoop installation directory

      vim etc/hadoop/hdfs-site.xml

      insert image description here

      Add:

      <configuration>
        <!-- where the NameNode stores the fsimage -->
        <property>
          <name>dfs.namenode.name.dir</name>
          <value>file:/usr/local/hadoop-3.3.2/namenode_dir</value>
        </property>
        <!-- where the DataNode stores data blocks -->
        <property>
          <name>dfs.datanode.data.dir</name>
          <value>file:/usr/local/hadoop-3.3.2/datanode_dir</value>
        </property>
        <property>
          <name>dfs.replication</name>
          <value>3</value>
        </property>
      </configuration>
      

      insert image description here

    3. MapReduce configuration

      The definitions of the properties in this configuration file are documented at:

      https://hadoop.apache.org/docs/r<hadoop version>/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

    4. Configure mapred-site.xml

      Execute in the hadoop installation directory

      vim etc/hadoop/mapred-site.xml

      insert image description here

      Add:

      <configuration>
      <!-- name of the MapReduce framework -->
        <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
        </property>
        <!-- tell YARN and the MapReduce programs where Hadoop is installed -->
        <property>
          <name>yarn.app.mapreduce.am.env</name>
          <value>HADOOP_MAPRED_HOME=/usr/local/hadoop-3.3.2</value>
        </property>
        <property>
          <name>mapreduce.map.env</name>
          <value>HADOOP_MAPRED_HOME=/usr/local/hadoop-3.3.2</value>
        </property>
        <property>
          <name>mapreduce.reduce.env</name>
          <value>HADOOP_MAPRED_HOME=/usr/local/hadoop-3.3.2</value>
        </property>
      </configuration>
      

      insert image description here

    5. Configure yarn-site.xml file

      Execute in the hadoop installation directory

      vim etc/hadoop/yarn-site.xml

      insert image description here

      Add:

      <configuration>
        <!-- site specific YARN configuration properties -->
        <!-- auxiliary service used for the shuffle phase -->
        <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
        </property>
        <!-- hostname of the resource manager; this name (master) is set in the next subsection -->
        <property>
          <name>yarn.resourcemanager.hostname</name>
          <value>master</value>
        </property>
      </configuration>
      

      insert image description here

  6. Service startup permission configuration

    1. Configure start-dfs.sh and stop-dfs.sh files

      Execute in the hadoop installation directory

      vim sbin/start-dfs.sh

      insert image description here

      and

      vim sbin/stop-dfs.sh

      insert image description here

      add at the beginning of the file

      HDFS_DATANODE_USER=root
      HADOOP_SECURE_DN_USER=hdfs
      HDFS_NAMENODE_USER=root
      HDFS_SECONDARYNAMENODE_USER=root
      

      insert image description here
      insert image description here

    2. Configure start-yarn.sh and stop-yarn.sh files

      Execute in the hadoop installation directory

      vim sbin/start-yarn.sh

      insert image description here

      and

      vim sbin/stop-yarn.sh

      insert image description here

      add at the beginning of the file

      YARN_RESOURCEMANAGER_USER=root
      HADOOP_SECURE_DN_USER=yarn
      YARN_NODEMANAGER_USER=root
      

      insert image description here
      insert image description here

    3. Configuration is complete, save the image

      1. Return to the host

        exit

        insert image description here

      2. View containers

        docker ps

        insert image description here

      3. Commit the container as an image

        docker commit <container id> <image name>

        insert image description here

  7. Start hadoop and configure the network

    1. Open three host consoles and start three containers: one master and two workers:

    2. master

      Set up a port mapping: host port 8088 => container port 8080

      sudo docker run -p 8088:8080 -it -h master --name master <image name>

      insert image description here

    3. worker01

      sudo docker run -it -h worker01 --name worker01 <image name>

      insert image description here

    4. worker02

      sudo docker run -it -h worker02 --name worker02 <image name>

      insert image description here

    5. Open /etc/hosts in each of the three containers and fill in the IP address to hostname mappings for all of them (all three containers need this configuration)

      vim /etc/hosts

      insert image description here

      (When needed, you can also query the IP with ifconfig; note there is no space between "if" and "config", otherwise it is a different command)

      insert image description here

      insert image description here

      insert image description here

      Add the following entries (this file needs to be adjusted every time the containers are restarted, since the IPs may change)

      <actual IP of master>     master
      <actual IP of worker01>   worker01
      <actual IP of worker02>   worker02
      

      insert image description here
      insert image description here
      insert image description here

      The hosts files of the three containers end up with identical content
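      Rather than running ifconfig inside every container, the IPs can also be collected from the host in one go (a sketch using docker inspect; it assumes the container names master, worker01 and worker02 used above):

      for c in master worker01 worker02; do
        echo "$(sudo docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' $c)  $c"
      done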

    6. Check if the configuration is valid

      ssh master

      ssh worker01

      ssh worker02

      insert image description here

      insert image description here

      insert image description here

  8. Configure the hostname of the worker container on the master container

    cd /usr/local/hadoop-3.3.2

    vim etc/hadoop/workers

    insert image description here

    Delete localhost and add:

    worker01
    worker02
    

    The network configuration is now complete

    insert image description here

  9. Start Hadoop

    1. On the master host, start hadoop

      cd /usr/local/hadoop-3.3.2

      ./bin/hdfs namenode -format

      insert image description here

      insert image description here

      ./sbin/start-all.sh

      insert image description here
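      To confirm the daemons actually came up, jps (shipped with the JDK) can be run in each container (a rough check; the exact process list depends on the node):

      jps
      # expected on master: NameNode, SecondaryNameNode and ResourceManager (plus Jps)
      # expected on the workers: DataNode and NodeManager (plus Jps)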

    2. Create a directory to store files on hdfs

      Suppose the directory to be created is: /home/hadoop/input

      ./bin/hdfs dfs -mkdir -p /home/hadoop/input

      ./bin/hdfs dfs -put ./etc/hadoop/*.xml /home/hadoop/input

      insert image description here

    3. Check whether the distributed copy succeeded

      ./bin/hdfs dfs -ls /home/hadoop/input

      insert image description here

  10. Run the sample program that comes with mapreduce

    1. run the program

      1. Create a new directory /home/hadoop/wordcount

        ./bin/hdfs dfs -mkdir /home/hadoop/wordcount

        insert image description here

      2. Create a new input file hello.txt and place it in the /home/hadoop/wordcount/ directory

        insert image description here

        insert image description here
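        A minimal way to do this from the Hadoop installation directory (a sketch; the file contents are arbitrary sample text):

        echo "hello hadoop hello docker hello mapreduce" > hello.txt
        ./bin/hdfs dfs -put hello.txt /home/hadoop/wordcount
        ./bin/hdfs dfs -ls /home/hadoop/wordcount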

      3. execute program

        ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.2.jar wordcount /home/hadoop/wordcount /home/hadoop/wordcount/output

        insert image description here

        insert image description here

  11. After running, check the output

    (Because the original task provided with my cloud server could not run, I replaced it with the wordcount task)

    ./bin/hdfs dfs -ls /home/hadoop/wordcount/output

    insert image description here

    ./bin/hdfs dfs -cat /home/hadoop/wordcount/output/*

    insert image description here

3. Q&A

Q1: How to get the NameNode out of safe mode?

A1: If the container is shut down without running stop-all.sh first, the NameNode may come back up in safe mode. The command to leave safe mode is as follows

./bin/hadoop dfsadmin -safemode leave

insert image description here
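You can also check the current safe mode state first (a standard dfsadmin operation):

./bin/hdfs dfsadmin -safemode get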

Q2: How to delete exited containers in batches?

A2: sudo docker container prune

Q3: The error "connect to host <docker node> port 22: Connection refused" appears when the Hadoop services are started.

A3: The SSH server is not running; use the following command to start the SSH server in each container

/etc/init.d/ssh start


Origin blog.csdn.net/weixin_45795947/article/details/124179617