Hadoop 3.1 installation and initial use

Check the operating system version

$ sudo lsb_release -a

I. Preparatory work (optional)

1.1 Add a mirror source (education network)
First, make a backup of the original /etc/apt/sources.list.

$ sudo cp /etc/apt/sources.list /etc/apt/sources.list.backup

Then run $ sudo vim /etc/apt/sources.list and replace the entire contents of /etc/apt/sources.list with one of the following:

# Tsinghua University mirror. Source-code mirrors are commented out by default to speed up apt update; uncomment them if needed.
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-updates main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-updates main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-backports main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-backports main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-security main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-security main restricted universe multiverse
 
# Pre-release sources; enabling them is not recommended
# deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-proposed main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-proposed main restricted universe multiverse

or

# Shanghai Jiao Tong University update servers
deb http://ftp.sjtu.edu.cn/ubuntu/ lucid main multiverse restricted universe
deb http://ftp.sjtu.edu.cn/ubuntu/ lucid-backports main multiverse restricted universe
deb http://ftp.sjtu.edu.cn/ubuntu/ lucid-proposed main multiverse restricted universe
deb http://ftp.sjtu.edu.cn/ubuntu/ lucid-security main multiverse restricted universe
deb http://ftp.sjtu.edu.cn/ubuntu/ lucid-updates main multiverse restricted universe
deb-src http://ftp.sjtu.edu.cn/ubuntu/ lucid main multiverse restricted universe
deb-src http://ftp.sjtu.edu.cn/ubuntu/ lucid-backports main multiverse restricted universe
deb-src http://ftp.sjtu.edu.cn/ubuntu/ lucid-proposed main multiverse restricted universe
deb-src http://ftp.sjtu.edu.cn/ubuntu/ lucid-security main multiverse restricted universe
deb-src http://ftp.sjtu.edu.cn/ubuntu/ lucid-updates main multiverse restricted universe

Finally, update the package lists.

$ sudo apt-get update           # update the package lists
$ sudo apt-get upgrade         # upgrade installed packages (optional)

1.2 Modify the hostname
On Ubuntu 18.04 you first need to modify the file /etc/cloud/cloud.cfg:

$ sudo vim /etc/cloud/cloud.cfg     # find preserve_hostname: false and change it to preserve_hostname: true

The three computers are named Master, Slave01, and Slave02. Use the command

$ sudo vim /etc/hostname       # permanently change the hostname (requires a reboot: $ sudo reboot)

Edit and save the file, then restart the computer for the new hostname to take effect. Alternatively, use the command

$ sudo hostname Master         # temporarily change the hostname

Note: this command changes the hostname only temporarily (open a new terminal to see that the name has changed); after a reboot the original hostname is restored.
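
On a systemd-based release such as Ubuntu 18.04, hostnamectl is another way to set the hostname permanently. A minimal sketch, assuming preserve_hostname has already been set to true as described above; run it on each machine with that machine's own name:

$ sudo hostnamectl set-hostname Master      # on the Master; use Slave01 / Slave02 on the other two machines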


1.3 Configure IP addresses
On Ubuntu 18.04, the network configuration is in /etc/netplan/50-cloud-init.yaml:

$ sudo vim /etc/netplan/50-cloud-init.yaml         # dual-NIC address configuration

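
The original post showed this file only as a screenshot. Below is a minimal single-NIC sketch for the Master node; the interface name, gateway, and DNS server are assumptions and must be adapted to your network (192.168.1.10 is the Master address used later in this guide).

network:
  version: 2
  ethernets:
    eth0:                            # interface name is an assumption; check with ifconfig or ip a
      dhcp4: no
      addresses: [192.168.1.10/24]   # Master's static IP
      gateway4: 192.168.1.1          # assumed gateway
      nameservers:
        addresses: [8.8.8.8]         # assumed DNS server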

$ sudo netplan apply                               # apply the configuration

All three computers need a static IP address (the default NIC here is called eth0; on some systems the default name is ens160 or ens32 — use the $ ifconfig command to check). On systems that use ifupdown rather than netplan, configure the address in /etc/network/interfaces:

$ sudo vim /etc/network/interfaces

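
The screenshot of this file is not reproduced here. A minimal static-address sketch for the Master, assuming the same interface name and network as above (netmask and gateway are assumptions):

auto eth0
iface eth0 inet static
    address 192.168.1.10       # Master's static IP
    netmask 255.255.255.0      # assumed netmask
    gateway 192.168.1.1        # assumed gateway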


1.4 Modify /etc/hosts
Use the $ ifconfig command to find each computer's IP address.
On each of the three computers, run $ sudo vim /etc/hosts and add entries like the following sketch (adjust the IP addresses to your actual situation).
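
A sketch of the /etc/hosts entries; 192.168.1.10 is the Master address used elsewhere in this guide, while the Slave addresses are assumptions:

192.168.1.10    Master
192.168.1.11    Slave01      # assumed address
192.168.1.12    Slave02      # assumed address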
After editing, use the ping command to verify that the three computers can reach each other:

$ ping Slave01
$ ping Slave02
$ ping Master

1.5 Create the hadoop user
On all three computers, create a Hadoop user account named miles (choose a different username if you prefer):

$ sudo useradd -m miles -s /bin/bash        # create the hadoop user miles, with /bin/bash as its shell
$ sudo passwd miles                         # set a password for the hadoop user; enter it twice (e.g. 0123456789)
$ sudo adduser miles sudo                   # grant the hadoop user administrator (sudo) privileges
$ su - miles                                # switch to the miles (hadoop) user
$ sudo apt-get update                       # update apt as the hadoop user, to simplify the installs that follow

1.6 Install SSH and set up passwordless SSH login
If the SSH server is already installed, skip the installation and go straight to setting up passwordless login.

$ sudo apt-get install openssh-server                      # install the SSH server

Set up passwordless SSH login:

$ ssh localhost               # log in via SSH; type yes on the first login
$ exit                        # log out of the ssh localhost session
$ cd ~/.ssh/                  # if this directory does not exist, run ssh localhost once first
$ ssh-keygen -t rsa 

After entering $ ssh-keygen -t rsa, press Enter three times at the prompts.
The first Enter accepts the default location for the key, which keeps the following commands simple (note where the key is stored). The second and third Enters leave the passphrase empty. If output reporting the generated key pair appears, the keys were created successfully.
The file id_rsa is the private key and id_rsa.pub is the public key; both are placed in ~/.ssh/ by default. Change into that directory with $ cd ~/.ssh/ and then enter:

$ cat ./id_rsa.pub >> ./authorized_keys      # append id_rsa.pub to the authorized keys, creating the authorized_keys file
$ chmod 600 authorized_keys       # restrict authorized_keys permissions to owner read/write
$ ssh localhost 

This time you can log in to localhost without a password. If it fails, repeat the steps above, or search for "passwordless SSH login" for troubleshooting.
Repeat the steps of 1.6 (install SSH, set up passwordless login) on each computer, so that Slave01 and Slave02 also generate their own id_rsa.pub and authorized_keys. Then execute the following commands on the two Slave nodes:

$ scp miles@Master:~/.ssh/id_rsa.pub ./master_rsa.pub        # copy the public key id_rsa.pub from Master to the Slave node, naming it master_rsa.pub

When prompted, answer "yes" and enter the miles user's password.

$ cat master_rsa.pub >> ./authorized_keys            # append master_rsa.pub to the Slave node's authorized keys

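As an alternative to the scp/cat steps above, the ssh-copy-id utility can append the Master's public key to a Slave's authorized_keys in one step. A sketch, run from the Master and assuming the same miles account exists on every node:

$ ssh-copy-id miles@Slave01        # copies ~/.ssh/id_rsa.pub into Slave01's authorized_keys
$ ssh-copy-id miles@Slave02
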
Back on the Master node, enter $ ssh Slave01 to test whether you can log in directly. At this point the Master can log in to both Slave nodes without a password.
Q: How would you set up passwordless login between the Slave nodes themselves?


II. Install JDK 8

For details about the download package, see https://www.cnblogs.com/gbyukg/p/3326825.html.
As of 2018-11-08, Hadoop 3.1.1 does not support the latest JDK 11.
The JDK must be installed on every computer in the Hadoop cluster. First download JDK 8 from the official Oracle site (http://www.oracle.com/technetwork/java/javase/downloads/index.html), choosing the version that matches your system; here jdk-8u191-linux-x64.tar.gz is used. Then install it and configure the environment variables:

$ sudo mkdir /usr/lib/jvm           # create the jvm folder
$ sudo tar -zxvf Downloads/jdk-8u191-linux-x64.tar.gz -C /usr/lib/jvm
                      # extract into /usr/lib/jvm; the JDK is downloaded to ~/Downloads by default
$ cd /usr/lib/jvm                           # enter that directory
$ sudo mv  jdk1.8.0_191 java                # rename the jdk1.8.0_191 directory to java
$ vim ~/.bashrc                             # configure environment variables for the JDK

Note:

  1. If you lack permission to create the jvm folder, run $ sudo -i to switch to the root account and create it from there.
  2. vim is recommended for editing the environment variables; if it is not installed, install it with $ sudo apt-get install vim.
    Add the following lines at the end of the .bashrc file:
export JAVA_HOME=/usr/lib/jvm/java
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH

After editing the file, run:

$ source ~/.bashrc                       # make the new environment variables take effect
$ java -version                             # verify the installation by checking the Java version
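
As an additional quick check, you can print the JAVA_HOME variable and confirm it points at the renamed JDK directory:

$ echo $JAVA_HOME                           # should print /usr/lib/jvm/java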

If the Java version information is printed, the installation succeeded.


III. Install Hadoop 3.1.1

Hadoop must be installed on every computer in the cluster.
Download hadoop-3.1.1.tar.gz from http://mirrors.hust.edu.cn/apache/hadoop/common/ and install it:

$ sudo tar -zxvf  hadoop-3.1.1.tar.gz -C /usr/local       # extract into /usr/local
$ cd /usr/local
$ sudo mv  hadoop-3.1.1   hadoop      # rename the hadoop-3.1.1 directory to hadoop
$ sudo chown -R miles ./hadoop        # change ownership of the files to user miles

To configure Hadoop's environment variables, run $ vim ~/.bashrc and add the following lines to the .bashrc file:

export HADOOP_HOME=/usr/local/hadoop
export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath):$CLASSPATH
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib:$HADOOP_COMMON_LIB_NATIVE_DIR"

Run source ~/.bashrc to make the settings take effect. Then use the $ hadoop version command to check that Hadoop is installed; on success it prints the Hadoop version information.


IV. Hadoop operating modes

A Hadoop cluster can run in three modes:

  1. Local/standalone mode: the default mode, in which Hadoop runs as a single Java process.
  2. Pseudo-distributed mode: a distributed setup simulated on one computer. Each Hadoop daemon (HDFS, YARN, MapReduce job history, etc.) runs as a separate Java process; this mode is used for development.
  3. Fully distributed mode: Hadoop runs on a cluster of two or more computers; suitable for real-world use. This is the mode used in this guide.
    In addition, Hadoop 2.x and Hadoop 3.x differ in several ways; see "What has changed between Hadoop 2.x and Hadoop 3.x" (https://blog.csdn.net/wshyb0314/article/details/82184680). Among other things, the default port numbers differ; for example, the NameNode web UI moved from port 50070 in Hadoop 2.x to port 9870 in Hadoop 3.x.

V. Fully distributed configuration

5.1 Create directories (on the Master only)
In the hadoop user's home directory (user miles), create the dfs and tmp directories to hold stored and exchanged data, and create name and data subdirectories under dfs:

$ mkdir -p ~/dfs/name       # create the dfs directory and its name subdirectory in the user's home directory
$ mkdir ~/dfs/data         # create the data subdirectory of dfs
$ mkdir ~/tmp               # create the tmp directory in the user's home directory

5.2 Modify the configuration files
The configuration files are in the /usr/local/hadoop/etc/hadoop/ directory (depending on HADOOP_HOME); change into that directory:

$ cd /usr/local/hadoop/etc/hadoop/
  1. Run $ sudo vim workers and add the hostnames of the three computers, as in the sketch below.
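
A sketch of the resulting workers file, listing all three hostnames (the jps output and DataNode report later in this guide indicate that the Master also runs a DataNode and NodeManager):

Master
Slave01
Slave02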

  2. Run $ sudo vim core-site.xml and replace the <configuration> </configuration> block at the end of the file with:

<configuration>
        <property>
             <name>fs.defaultFS</name>
             <value>hdfs://Master:9000</value>
        </property>
        <property>
              <name>hadoop.tmp.dir</name>
              <value>/home/miles/tmp</value>
         </property>
</configuration>
  3. Run $ sudo vim hadoop-env.sh and set JAVA_HOME to the JDK installation path:
    export JAVA_HOME=/usr/lib/jvm/java
    The Hadoop home directory (HADOOP_HOME=/usr/local/hadoop) is not set here.
  4. Run $ sudo vim hdfs-site.xml and replace the <configuration> </configuration> block at the end of the file with:
<configuration>
        <property>
            <name>dfs.namenode.http-address</name>
             <!-- Master is the current machine's hostname or IP address -->
             <value>hdfs://Master:9001</value>
        </property>
        <property>
              <name>dfs.namenode.name.dir</name>
              <!-- path where the NameNode metadata is stored -->
              <value>/home/miles/dfs/name</value>
         </property>
        <property>
              <name>dfs.datanode.data.dir</name>
              <!-- path where the DataNode data blocks are stored -->
              <value>/home/miles/dfs/data</value>
        </property>
        <property>
              <name>dfs.replication</name>
              <!-- number of replicas; there are 2 DataNodes -->
              <value>2</value>
         </property>
        <property>
              <name>dfs.webhdfs.enabled</name>
              <!-- enable WebHDFS -->
              <value>true</value>
         </property>
</configuration>
  5. Run $ sudo vim mapred-site.xml and replace the <configuration> </configuration> block at the end of the file with:
<configuration>
        <property>
            <name>mapreduce.framework.name</name>
             <!-- MapReduce Framework -->
             <value>yarn</value>
        </property>
        <property>
              <name>mapreduce.jobhistory.address</name>
              <!-- MapReduce JobHistory server address (hostname or IP of this machine) -->
              <value>Master:10020</value>
         </property>
        <property>
              <name>mapreduce.jobhistory.webapp.address</name>
              <!-- MapReduce JobHistory web UI address (hostname or IP of this machine) -->
              <value>Master:19888</value>
        </property>
        <property>
              <name>yarn.app.mapreduce.am.env</name>
              <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
        </property>
        <property>
              <name>mapreduce.map.env</name>
              <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
        </property>
        <property>
              <name>mapreduce.reduce.env</name>
              <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
        </property>
</configuration>
  6. Run $ sudo vim yarn-site.xml and replace the <configuration> </configuration> block at the end of the file with:
<configuration>
 <!-- Site specific YARN configuration properties -->
        <property>
            <name>yarn.resourcemanager.hostname</name>
             <!-- Master is the current machine's hostname or IP address -->
             <value>Master</value>
        </property>
        <property>
              <name>yarn.nodemanager.aux-services</name>
              <!-- NodeManager auxiliary service -->
              <value>mapreduce_shuffle</value>
         </property>
        <property>
              <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
              <!-- NodeManager auxiliary service class -->
              <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
              <name>yarn.nodemanager.resource.cpu-vcores</name>
              <!-- number of CPU vcores; set according to this machine's CPUs -->
              <value>2</value>
         </property>
        <property>
              <name>yarn.resourcemanager.admin.address</name>
              <!-- ResourceManager admin address -->
              <value>Master:8033</value>
         </property>
        <property>
              <name>yarn.resourcemanager.webapp.address</name>
              <!-- ResourceManager web UI address -->
              <value>Master:8088</value>
         </property>
</configuration>

VI. Copy the Hadoop configuration files

After configuring the Master, copy its /usr/local/hadoop (HADOOP_HOME) directory to all Slave nodes. First, execute the following commands on the Master:

$ cd /usr/local
$ sudo rm -r ./hadoop/tmp		# delete Hadoop temporary files
$ sudo rm -r ./hadoop/logs/*		# delete Hadoop log files
$ tar -zcf ~/hadoop.master.tar.gz ./hadoop		# archive the Hadoop directory into a single file in the home directory
$ cd ~
$ scp ./hadoop.master.tar.gz miles@Slave01:~/	# copy the archive to the miles user's home directory on Slave01 (repeat for Slave02)

Then run the following commands on each Slave node:

$ sudo rm -r /usr/local/hadoop	# delete any existing Hadoop files and directories
$ sudo tar -zxf ~/hadoop.master.tar.gz -C /usr/local	# extract into the /usr/local directory
$ sudo chown -R miles /usr/local/hadoop	    # change the owner of /usr/local/hadoop to the hadoop user miles

VII. Start Hadoop

1. Start HDFS
● Initialize the NameNode
The Master is the NameNode and the Slaves are DataNodes, so HDFS only needs to be formatted on the Master:

$ hdfs namenode -format                     # format (initialize) the NameNode

Formatting produces a lot of output. If a line such as
"Storage directory /home/miles/dfs/name has been successfully formatted" appears, HDFS was formatted successfully.
An exit message of "Exiting with status 0" indicates success; "Exiting with status 1" indicates failure.
● Start HDFS and test it
Execute the following commands on the Master:

$ start-dfs.sh                      # start HDFS
$ start-yarn.sh                    # start YARN
$ mapred --daemon start historyserver		# start the JobHistory server (optional)

Run the $ jps command on the Master; you should see the following processes (five Hadoop daemons plus the jps tool itself):

NameNode
SecondaryNameNode
NodeManager
ResourceManager
Jps
JobHistoryServer

Run the $ jps command on a Slave; you should see the following processes:

DataNode
NodeManager
Jps

Note: if any process is missing, something went wrong. You can open http://Master:9870 (Hadoop 3) or http://Master:50070 (Hadoop 2) in a browser to view the logs and the status of the name node and data nodes.
2. Check the Hadoop cluster status
Run the $ hdfs dfsadmin -report command on the Master to check whether the DataNodes started normally. In the report, a non-zero "Live datanodes" count indicates that the cluster started successfully; in this setup the cluster has three DataNodes.
You can also open http://Master:9870 (or the IP address, http://192.168.1.10:9870) in a browser to view the cluster state on the web UI.


VIII. Hadoop platform testing

  • WordCount
    The Hadoop installation directory (/usr/local/hadoop/) ships with a word-count example program, WordCount. It runs on the Hadoop platform using MapReduce and HDFS, counting how many times each word appears in the input files and reporting the results. Running WordCount is a simple way to check whether the Hadoop platform works properly.
    1. Create the local directory and text files. In the home directory (/home/miles/), create an input folder and change into it:
$ mkdir /home/miles/input
$ cd /home/miles/input

Create the files file01 and file02 and write the text to be counted into them:

$ echo "hello world bye world" > file01           #写内容到文件file01中
$ echo "hello hadoop goodbye hadoop" >file02      #写内容到文件file02中

2. Create an input directory on HDFS and upload file01 and file02 into it:

$ cd /home/miles/input
$ hadoop fs -mkdir -p /hadoopusers/miles/input         # create the /hadoopusers/miles/input directory on HDFS (-p also creates missing parent directories)
$ hadoop fs -put * /hadoopusers/miles/input             # upload the files in /home/miles/input to HDFS /hadoopusers/miles/input
$ hadoop fs -ls /hadoopusers/miles/input                  # list the files in HDFS /hadoopusers/miles/input

When the terminal displays something like:

-rw-r--r--   4 miles supergroup         22 2018-11-09 21:43 input/file01
-rw-r--r--   4 miles supergroup         28 2018-11-09 21:43 input/file02

it means that the files were uploaded to HDFS successfully.
3. Run the WordCount program
Change into the /usr/local/hadoop/share/hadoop/mapreduce directory (depending on HADOOP_HOME), then run the WordCount program:

# enter the directory
$ cd /usr/local/hadoop/share/hadoop/mapreduce
# run the WordCount program
$ hadoop jar hadoop-mapreduce-examples-3.1.1.jar wordcount /hadoopusers/miles/input  /hadoopusers/miles/output

Note: the output directory must be new (it must not already exist); see the sketch below for removing a leftover output directory.
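If a previous run left an output directory behind, it can be removed before re-running the job; a sketch using the same path as above:

$ hadoop fs -rm -r /hadoopusers/miles/output      # delete the old output directory
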
The job prints a lot of information to the terminal; if output like the following is displayed:

18/11/09 21:56:18 INFO mapreduce.Job:  map 0% reduce 0%
18/11/09 21:56:29 INFO mapreduce.Job:  map 100% reduce 0% 
18/11/09 21:56:38 INFO mapreduce.Job:  map 100% reduce 100%
18/11/09 21:56:38 INFO mapreduce.Job: Job job_1410242637907_0001 completed successfully
18/11/09 21:56:38 INFO mapreduce.Job: Counters: 43

it shows that the program ran as a distributed MapReduce job.
4. View the program's output
After the WordCount job completes, the results are written to the part-r-00000 file under the output directory on HDFS. Enter the following command to view the word counts:

$ hadoop fs -cat /hadoopusers/miles/output/part-r-00000

The terminal displays:

bye      1
goodbye  1
hadoop   2
hello     2
world    2

The results show that the Hadoop cluster successfully counted how many times each word appears in file01 and file02, which indicates that the Hadoop platform is running correctly.
Q: How do you find, set, and change the hadoop user's default HDFS directory?


  • View Hadoop cluster information

While jobs are running, open http://Master:8088 (or http://192.168.1.10:8088) in a browser to view the Hadoop cluster information through the web interface.
This confirms that the Hadoop cluster started properly. If you need to shut the cluster down, execute the following commands on the Master:

$ stop-yarn.sh
$ stop-dfs.sh
$ mapred --daemon stop historyserver		# stop the JobHistory server

Q: On your own, reproduce this experiment's Hadoop cluster configuration using Docker containers.


IX. Common errors

  1. "2018-10-30 13:30:40,855 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable"
    Add the following configuration to /etc/profile:
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib:$HADOOP_COMMON_LIB_NATIVE_DIR"

Make the configuration take effect: $ source /etc/profile

  2. If Hadoop started successfully before but now cannot start properly (in particular, if the DataNodes do not start), you can delete the NameNode and DataNode data stored on the Master with the following commands:
 $ rm -r /home/miles/dfs/name/*
 $ rm -r /home/miles/dfs/data/*

Also delete all temporary files on every node, Master and Slaves included:

$ rm -r /home/miles/tmp/*

Then re-run the $ hdfs namenode -format command to reformat the Hadoop cluster.

3. If you encounter the error "java.net.ConnectException: Call From Master/192.168.1.10 to Master:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused", the likely cause is that the ResourceManager did not start.


References

  1. Apache Hadoop 3.1.1 official documentation: https://hadoop.apache.org/docs/r3.1.1/
  2. Hadoop cluster setup (official English guide): https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html
  3. Analysis of new features in Hadoop 3.x: https://www.cnblogs.com/smartloli/p/9028267.html
  4. xfce4 and VNCServer configuration: https://www.howtoing.com/how-to-install-and-configure-vnc-on-ubuntu-18-04
  5. Installing and configuring Hadoop on Ubuntu 16.04 (pseudo-distributed): https://www.cnblogs.com/87hbteo/p/7606012.html
  6. Fully distributed Hadoop cluster setup: https://blog.csdn.net/u014636511/article/details/80171002
  7. Detailed record of a fully distributed Hadoop 3.1.0 cluster deployment: https://blog.csdn.net/weixin_42142630/article/details/81837131
  8. Detailed record of a fully distributed Hadoop 3.1.0 cluster deployment: https://blog.csdn.net/dream_an/article/details/80258283?utm_source=blogxgwz0
  9. Setting up a Hadoop environment (standalone): https://blog.csdn.net/qazwsxpcm/article/details/78637874?utm_source=blogxgwz0
  10. Installing and deploying the stable Hadoop 3.0 release: https://blog.csdn.net/rlnLo2pNEfx9c/article/details/78816075