Building a Hadoop cluster on Linux (fully distributed)

This article assumes that the JDK is already installed.

# Check the JDK version

Command: java -version

If the output shows a Java version, the JDK is configured correctly.
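For a JDK 8 installation such as the jdk1.8.0_202 used later in this article, the output will look roughly like the following (the exact build strings depend on your JDK):

java version "1.8.0_202"
Java(TM) SE Runtime Environment (build ...)
Java HotSpot(TM) 64-Bit Server VM (build ..., mixed mode)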

Step 1: Turn off the firewall

To allow the local machine to reach the cluster's web pages and to avoid nodes becoming unreachable while the cluster is running, turn off the firewall on all three nodes.

Instruction: systemctl stop firewalld

# Disable firewall on boot

Instruction: systemctl disable firewalld
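Optionally, verify that the service is stopped and will not start on boot:

# Should report inactive / disabled
systemctl status firewalld
systemctl is-enabled firewalld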

Step 2: Modify the host name and configure IP address mapping.

View the current hostname: hostname

Change the hostname: hostnamectl set-hostname <new-hostname>  (run this on each node with that node's own name, e.g. master, slave1, or slave2)

Add the following entries to the /etc/hosts file on all three nodes, replacing x.x.x.x, y.y.y.y and z.z.z.z with the nodes' actual IP addresses:

x.x.x.x     master
y.y.y.y     slave1
z.z.z.z     slave2
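After saving /etc/hosts on all three nodes, you can check that the names resolve, for example:

# Each name should answer from the IP address configured above
ping -c 1 master
ping -c 1 slave1
ping -c 1 slave2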

Step 3: Configure passwordless SSH login

# Run the following in the home directory (~) to generate a key pair

3.1 master node

## 1. Configure passwordless login on the master node
[root@docker ~]# /etc/init.d/sshd start
[root@docker ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
/root/.ssh/id_rsa already exists.
Overwrite (y/n)? y
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
fc:39:e8:f3:57:ee:40:c9:99:8c:95:16:a1:af:5f:69 root@master
The key's randomart image is:
+--[ RSA 2048]----+
|            o.   |
|           . o   |
|          . +    |
|       .   B +   |
|        S . O    |
|         o +  .. |
|        . = .oE  |
|       ..  o.+.  |
|        .o.....  |
+-----------------+
## 2. Add the id_rsa.pub under /root/.ssh to authorized_keys
[root@docker ~]# cd /root/.ssh
[root@docker ~]# ls
authorized_keys  config  id_rsa  id_rsa.pub
[root@docker ~]# cat id_rsa.pub > authorized_keys
[root@docker ~]# cat authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAwOpTpijrGlJTnlFEBjIRq9m24KIQ9oc50javrRpWBqQxhUrjIuVQTaOqiW6Dpgo8zDVNRratkB+HnlNQ8q3L0kE+IvlZINrCijZAGksZJpgbyuhqHKdf8Tmdco90FhAENQc54pcCvpDCD4dukSuICN3533rXIBxJU7omHnTQMORo+AMyGDTWW7pNgNDQoC7iZsjE+GcpF9Aq2+joQqYwOOrTDmQ2HI6TFomnM02PERlwZkbM/5ELZsb616JPu9QMNuv8BDHgRF87PtzZEI1vBEDeNfBAc3/J5vuirlqqxgS+zk5DFiWD0jcstJG1hTX5qRCXmvEyHMfE2kEtgkAXYQ== root@master

3.2 slave1 node

[root@slave1 ~]# /etc/init.d/sshd start
[root@slave1 ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): 
/root/.ssh/id_rsa already exists.
Overwrite (y/n)? y
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
a3:6c:f7:44:82:e8:a3:3f:49:49:01:ca:6e:81:5c:c6 root@slave1
The key's randomart image is:
+--[ RSA 2048]----+
|  o+             |
|+.oE.            |
|o+   .           |
|. . .. .         |
| o .... S .      |
|.  .o. . +       |
|   .o.+ . .      |
|   .oo . o       |
|  ....    .      |
+-----------------+
[root@slave1 ~]# cd /root/.ssh
[root@slave1 ~]# cat id_rsa.pub > authorized_keys
[root@slave1 ~]# cat authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAuT0X48XmXIzJ3toiSd0tmWI1ABBgbBiILUgfTqZVdDlBGMq/bywoyJcp80cUI/k2DL4aMWHTypc776/1bMZ3/dmYnFdVWRJ2PzxJ6H2h32QXXXqAuCzVks9BU7Q58guEGoiMeWHHyIffyrcIoiS7rD0tuqgoQnFvQPQBWLVrbxyi1BxBOLeF0jsPdNmc5e4Y8ZaOjJDBvBUmSaJ2zRE6/fxQk58Xe01e4JnnwovDdL+rZB5oue6rvUwXoEiZlGkKkmr2b7xw6UCi/mnYS+wYa2LolhuMTilalkRYEtRYXZWz3VrjdlDGmuCeccUoP+r44f7yk2BTv1OIyYPpDS0HSQ== root@slave1

3.3 slave2 node

[root@slave2 ~]# /etc/init.d/sshd start
[root@slave2 ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): 
/root/.ssh/id_rsa already exists.
Overwrite (y/n)? y
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
86:c1:25:7f:2e:64:43:cd:36:57:a2:07:6a:cc:17:6d root@slave2
The key's randomart image is:
+--[ RSA 2048]----+
|      . ooo....  |
|     . B .=+E.   |
|      o X.++.    |
|       * = .     |
|      . S .      |
|       . .       |
|                 |
|                 |
|                 |
+-----------------+
[root@slave2 ~]# cd /root/.ssh
[root@slave2 ~]# cat id_rsa.pub > authorized_keys
[root@slave2 ~]# cat authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAowCUkXu8rkzpSIi/E53Zv3qYeIItuIO72MS+m/fMsHO7edKC8XjzB4DB406iPF6q2SobaAr8x++txLSq0bnTS4MHhY0GqUICuv7R0F5js1G5jV4cSMpdHfsIb9skz3LMjTXplxXR8md7TCmeS7Kmuf9FNmCXQIuWAdnPuUVxk12b4hFeaNyCi/yktuBy5Tqyi+b2Xp9cGXxz/dl32aHxBf9pkoEUJcej1o/g+hgAZ7c+9mv9aiLhbSI6ZPUGT1F3WBEq9tCx1rkbc5nQfjTCC3/K5btLvnkaaSUShEB4tUleh06LeuuTmCxNZ7vNBoVAb1cjyGvqdJZ6hHqHHddnQQ== root@slave2

Merge the contents of the three authorized_keys files into one, then overwrite authorized_keys on all three virtual machines with the merged file, so that every node can log in to every other node without a password (a sketch follows below).
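One way to do the merge, sketched here from the master node (the temporary file names slave1.pub and slave2.pub are only for illustration; scp will still ask for the root password once per host at this point):

# Collect the slaves' public keys on the master
scp root@slave1:/root/.ssh/id_rsa.pub /root/.ssh/slave1.pub
scp root@slave2:/root/.ssh/id_rsa.pub /root/.ssh/slave2.pub

# Merge the three public keys into one authorized_keys file
cat /root/.ssh/id_rsa.pub /root/.ssh/slave1.pub /root/.ssh/slave2.pub > /root/.ssh/authorized_keys

# Overwrite authorized_keys on both slaves with the merged file
scp /root/.ssh/authorized_keys root@slave1:/root/.ssh/authorized_keys
scp /root/.ssh/authorized_keys root@slave2:/root/.ssh/authorized_keys

# Test: these should no longer ask for a password
ssh slave1 hostname
ssh slave2 hostname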

Step 4: Install Hadoop (the same steps on all three servers)

4.1: Download hadoop-3.1.3.

Download page: https://archive.apache.org/dist/hadoop/common/

Download hadoop-3.1.3.tar.gz from there, or fetch it on the command line:

Instruction: cd /usr/local/src/software

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.1.3/hadoop-3.1.3.tar.gz

4.2: Upload and extract the tar.gz archive.

Instruction: cd /usr/local/src/software

mkdir -p /home/hadoop

tar -zxvf hadoop-3.1.3.tar.gz -C /home/hadoop

(Extracting into /home/hadoop keeps the path consistent with the HADOOP_HOME of /home/hadoop/hadoop-3.1.3 used in the rest of this article.)
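If the extraction succeeded, the directory should contain the usual Hadoop layout, roughly:

ls /home/hadoop/hadoop-3.1.3
# bin  etc  include  lib  libexec  LICENSE.txt  NOTICE.txt  README.txt  sbin  share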

4.3: Configure environment variables.

Add the following code to the /etc/profile file:

export HADOOP_HOME=/home/hadoop/hadoop-3.1.3
 
export PATH=$PATH:$HADOOP_HOME/bin
 
export PATH=$PATH:$HADOOP_HOME/sbin

# Make the environment variables take effect

Instruction: source /etc/profile

# Test whether the configuration is successful

Instruction: hadoop version

If the output shows "Hadoop 3.1.3", Hadoop is configured correctly.

Step 5: Cluster configuration

The configuration files are in /home/hadoop/hadoop-3.1.3/etc/hadoop.

1. Configure the Hadoop configuration files on the master node.

1) Create the data directories.

Instruction: mkdir /home/hadoop/tmp

        mkdir /home/hadoop/dfs
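Since hdfs-site.xml below points the NameNode and DataNode at subdirectories of /home/hadoop/dfs, you can also create those up front (a small convenience; formatting and the daemons can create them too):

mkdir -p /home/hadoop/tmp /home/hadoop/dfs/name /home/hadoop/dfs/data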

2) Configure hadoop-env.sh.

Add the following code to the hadoop-env.sh file:

export JAVA_HOME=/home/jdk/jdk1.8.0_202
# Add the following lines at the end of the file to set the users that run each role's shell commands
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root

3) Configure core-site.xml.

Add the following code to the core-site.xml file:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <!-- Directory where Hadoop stores its data; the default is /tmp/hadoop-${user.name} -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/home/hadoop/tmp</value>
  </property>
  <!-- Buffer size; in practice, tune it according to server performance -->
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
  <!-- Enable the HDFS trash so deleted data can be recovered; the unit is minutes -->
  <property>
    <name>fs.trash.interval</name>
    <value>10080</value>
  </property>
</configuration>

4) Configure hdfs-site.xml.

Add the following code to the hdfs-site.xml file:

<configuration>
  <!-- Directory where the NameNode stores the HDFS namespace metadata -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/hadoop/dfs/name</value>
  </property>
  <!-- Directory where the DataNode physically stores its data blocks -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/hadoop/dfs/data</value>
  </property>
  <!-- Number of replicas HDFS keeps of each block -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- Block size: 128 MB -->
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
  <!-- Address of the NameNode web UI -->
  <property>
    <name>dfs.http.address</name>
    <value>master:50070</value>
  </property>
  <!-- Disable HDFS file permission checks -->
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <!-- File that lists the allowed DataNode hosts -->
  <property>
    <name>dfs.hosts</name>
    <value>/home/hadoop/hadoop-3.1.3/etc/hadoop/workers</value>
  </property>
</configuration>

5) Configure workers.

Add the following hostnames to the workers file:

master
slave1
slave2

6) Configure mapred-site.xml.

Add the following code to the mapred-site.xml file:

<configuration>
  <!-- Run MapReduce on YARN, i.e. MapReduce uses the YARN framework -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <!-- Enable uber mode for small MapReduce jobs -->
  <property>
    <name>mapreduce.job.ubertask.enable</name>
    <value>true</value>
  </property>
  <!-- Host and port of the job history service -->
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>master:10020</value>
  </property>
  <!-- Host and port of the job history web UI -->
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>master:19888</value>
  </property>
</configuration>

7) Configure yarn-site.xml.

Add the following code to the yarn-site.xml file:

<configuration>
  <!-- Site specific YARN configuration properties -->
  <!-- How the NodeManager obtains data: shuffle -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- YARN web access addresses -->
  <property>
    <description>
      The http address of the RM web application.
      If only a host is provided as the value, the webapp will be served on a random port.
    </description>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>${yarn.resourcemanager.hostname}:8088</value>
  </property>
  <property>
    <description>
      The https address of the RM web application.
      If only a host is provided as the value, the webapp will be served on a random port.
    </description>
    <name>yarn.resourcemanager.webapp.https.address</name>
    <value>${yarn.resourcemanager.hostname}:8090</value>
  </property>
  <!-- Enable log aggregation so that logs can be viewed after a job has finished -->
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <!-- Retention time of aggregated logs on HDFS, in seconds -->
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
  </property>
</configuration>

2. Configure the Hadoop configuration files on the slave1 node.

1) Create the data directories.

Instruction: mkdir /home/hadoop/tmp

        mkdir /home/hadoop/dfs

2) Configure hadoop-env.sh.

Add the following code to the hadoop-env.sh file:

export JAVA_HOME=/home/jdk/jdk1.8.0_202
# Add the following lines at the end of the file to set the users that run each role's shell commands
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root

3) Configure core-site.xml.

Add the following code to the core-site.xml file:

<configuration>
  <!-- fs.defaultFS must point to the NameNode (on master) on every node -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <!-- Directory where Hadoop stores its data; the default is /tmp/hadoop-${user.name} -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/home/hadoop/tmp</value>
  </property>
  <!-- Buffer size; in practice, tune it according to server performance -->
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
  <!-- Enable the HDFS trash so deleted data can be recovered; the unit is minutes -->
  <property>
    <name>fs.trash.interval</name>
    <value>10080</value>
  </property>
</configuration>

4) Configure hdfs-site.xml.

Add the following code to the hdfs-site.xml file:

<configuration>
  <!-- Directory where the NameNode stores the HDFS namespace metadata -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/hadoop/dfs/name</value>
  </property>
  <!-- Directory where the DataNode physically stores its data blocks -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/hadoop/dfs/data</value>
  </property>
  <!-- Number of replicas HDFS keeps of each block -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- Block size: 128 MB -->
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
  <!-- Disable HDFS file permission checks -->
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <!-- File that lists the allowed DataNode hosts -->
  <property>
    <name>dfs.hosts</name>
    <value>/home/hadoop/hadoop-3.1.3/etc/hadoop/workers</value>
  </property>
</configuration>

5) Configure workers.

Add the following hostnames to the workers file:

master
slave1
slave2

6) Configure mapred-site.xml.

Add the following code to the mapred-site.xml file:

<configuration>
  <!-- Run MapReduce on YARN, i.e. MapReduce uses the YARN framework -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <!-- Enable uber mode for small MapReduce jobs -->
  <property>
    <name>mapreduce.job.ubertask.enable</name>
    <value>true</value>
  </property>
  <!-- Host and port of the job history service -->
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>slave1:10020</value>
  </property>
  <!-- Host and port of the job history web UI -->
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>slave1:19888</value>
  </property>
</configuration>

7) Configure yarn-site.xml.

Add the following code to the yarn-site.xml file:

<configuration>
  <!-- Site specific YARN configuration properties -->
  <!-- How the NodeManager obtains data: shuffle -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- Hostname of the YARN ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>slave1</value>
  </property>
  <!-- YARN web access addresses -->
  <property>
    <description>
      The http address of the RM web application.
      If only a host is provided as the value, the webapp will be served on a random port.
    </description>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>${yarn.resourcemanager.hostname}:8088</value>
  </property>
  <property>
    <description>
      The https address of the RM web application.
      If only a host is provided as the value, the webapp will be served on a random port.
    </description>
    <name>yarn.resourcemanager.webapp.https.address</name>
    <value>${yarn.resourcemanager.hostname}:8090</value>
  </property>
  <!-- Enable log aggregation so that logs can be viewed after a job has finished -->
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <!-- Retention time of aggregated logs on HDFS, in seconds -->
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
  </property>
</configuration>

3. Configure the Hadoop configuration files on the slave2 node.

1) Create the data directories.

Instruction: mkdir /home/hadoop/tmp

        mkdir /home/hadoop/dfs

2) Configure hadoop-env.sh.

Add the following code to the hadoop-env.sh file:

export JAVA_HOME=/home/jdk/jdk1.8.0_202
# Add the following lines at the end of the file to set the users that run each role's shell commands
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root

3) Configure core-site.xml.

Add the following code to the core-site.xml file:

<configuration>
  <!-- fs.defaultFS must point to the NameNode (on master) on every node -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <!-- Directory where Hadoop stores its data; the default is /tmp/hadoop-${user.name} -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/home/hadoop/tmp</value>
  </property>
  <!-- Buffer size; in practice, tune it according to server performance -->
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
  <!-- Enable the HDFS trash so deleted data can be recovered; the unit is minutes -->
  <property>
    <name>fs.trash.interval</name>
    <value>10080</value>
  </property>
</configuration>

4) Configure hdfs-site.xml.

Add the following code to the hdfs-site.xml file:

<configuration>
  <!-- Directory where the NameNode stores the HDFS namespace metadata -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/hadoop/dfs/name</value>
  </property>
  <!-- Directory where the DataNode physically stores its data blocks -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/hadoop/dfs/data</value>
  </property>
  <!-- Number of replicas HDFS keeps of each block -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- Block size: 128 MB -->
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
  <!-- Address of the SecondaryNameNode web interface -->
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>slave2:50071</value>
  </property>
  <!-- Disable HDFS file permission checks -->
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <!-- File that lists the allowed DataNode hosts -->
  <property>
    <name>dfs.hosts</name>
    <value>/home/hadoop/hadoop-3.1.3/etc/hadoop/workers</value>
  </property>
</configuration>

5) Configure workers.

Add the following hostnames to the workers file:

master
slave1
slave2

6) Configure mapred-site.xml.

Add the following code to the mapred-site.xml file:

<configuration>
  <!-- Run MapReduce on YARN, i.e. MapReduce uses the YARN framework -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <!-- Enable uber mode for small MapReduce jobs -->
  <property>
    <name>mapreduce.job.ubertask.enable</name>
    <value>true</value>
  </property>
  <!-- Host and port of the job history service -->
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>slave2:10020</value>
  </property>
  <!-- Host and port of the job history web UI -->
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>slave2:19888</value>
  </property>
</configuration>

7) Configure yarn-site.xml.

Add the following code to the yarn-site.xml file:

<configuration>
  <!-- Site specific YARN configuration properties -->
  <!-- How the NodeManager obtains data: shuffle -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- YARN web access addresses -->
  <property>
    <description>
      The http address of the RM web application.
      If only a host is provided as the value, the webapp will be served on a random port.
    </description>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>${yarn.resourcemanager.hostname}:8088</value>
  </property>
  <property>
    <description>
      The https address of the RM web application.
      If only a host is provided as the value, the webapp will be served on a random port.
    </description>
    <name>yarn.resourcemanager.webapp.https.address</name>
    <value>${yarn.resourcemanager.hostname}:8090</value>
  </property>
  <!-- Enable log aggregation so that logs can be viewed after a job has finished -->
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <!-- Retention time of aggregated logs on HDFS, in seconds -->
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
  </property>
</configuration>

Step 6: Cluster Startup

# Note: the NameNode must be formatted on the NameNode host (master) before the cluster is started for the very first time. Do not format it again on later starts, otherwise the NameNode and the DataNodes end up with mismatched cluster IDs.

Command: hdfs namenode -format

(The older form hadoop namenode -format still works in Hadoop 3.1.3 but is deprecated.)

1. Start Hadoop

Execute the following in the /home/hadoop/hadoop-3.1.3/sbin directory:

./start-all.sh
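Note that start-all.sh only starts the HDFS and YARN daemons; the MapReduce JobHistory Server configured in mapred-site.xml is started separately if you need it (Hadoop 3 syntax):

mapred --daemon start historyserver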

Step 7: How to verify that the Hadoop cluster started successfully

1. After starting the cluster, run the jps command on each virtual machine; the Hadoop daemon processes should appear in the output.

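For example, on the node running the NameNode the output should contain entries similar to the following (the process IDs are illustrative, and the exact set of daemons depends on which roles this configuration places on each node):

[root@master ~]# jps
2305 NameNode
2486 DataNode
2650 ResourceManager
2790 NodeManager
2903 SecondaryNameNode
3120 Jps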

2. The NameNode web UI can be opened from the local machine at http://master:50070 (the port configured by dfs.http.address above).
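You can also confirm from the command line that all three DataNodes have registered with the NameNode:

# Should report three live datanodes
hdfs dfsadmin -report | grep -i "live datanodes"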

 


Origin blog.csdn.net/askuld/article/details/130426464