06-Hadoop cluster setup (root user)

Hadoop cluster build process

Environment preparation

1. Set up the basic environment (turn off the firewall on the internal network, set hostnames, plan static IPs, configure the hosts mapping, synchronize time with ntp, install the JDK, set up passwordless SSH, etc.)

2. Compile the Hadoop source code (to adapt to the local libraries and local environment of different operating systems)

3. Modify the Hadoop configuration files (the shell file, 4 XML files, and the workers file)

4. Synchronize the configuration files across the cluster (distribute them with scp)

The specific process is as follows:

Commonly used commands (apply these as needed for your environment)

vim /etc/sysconfig/network-scripts/ifcfg-ens33

# Make the following changes
BOOTPROTO=static  # change to static
# Append the following at the end of the file
IPADDR=192.168.188.128
GATEWAY=192.168.188.2
NETMASK=255.255.255.0
DNS1=114.114.114.114


# Restart the network service
systemctl restart network.service

# Change the hostname
vim /etc/hostname
# Configure the hosts mapping
vim /etc/hosts
192.168.188.128 kk01
192.168.188.129 kk02
192.168.188.130 kk03


# Modify the Windows hosts mapping file
# Go to C:\Windows\System32\drivers\etc
# Add the following entries
192.168.188.128 kk01
192.168.188.129 kk02
192.168.188.130 kk03
# Check the firewall status
firewall-cmd --state
# or
systemctl status firewalld.service

systemctl stop firewalld.service     # stop the firewall now
systemctl disable firewalld.service  # disable the firewall on boot
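
As an optional sanity check (a minimal sketch, assuming the kk01/kk02/kk03 hostnames and IPs planned above), you can confirm that name resolution and connectivity work from each node:

# Optional check: hostname and hosts mapping are in effect
hostname            # should print this node's name, e.g. kk01
ping -c 3 kk02      # should resolve via /etc/hosts and receive replies
ping -c 3 kk03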

Create a normal user and grant it root privileges (skip this step if you build as the root user)

# Create the user (skip if it was already created during the Linux installation)
useradd nhk
passwd nhk        # then enter the password, e.g. 123456

# Give the normal user (nhk) root privileges, so that sudo can be used later to run root commands
vim /etc/sudoers

# Add a line below the %wheel line (around line 100 of the file)

## Allow root to run any commands anywhere 
root    ALL=(ALL)       ALL 

## Allows members of the 'sys' group to run networking, software, 
## service management apps and more.
# %sys ALL = NETWORKING, SOFTWARE, SERVICES, STORAGE, DELEGATING, PROCESSES, LOCATE, DRIVERS

## Allows people in group wheel to run all commands
%wheel  ALL=(ALL)       ALL 

## Same thing without a password
# %wheel        ALL=(ALL)       NOPASSWD: ALL
nhk ALL=(ALL) NOPASSWD: ALL 
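
Once the nhk entry above is in place, a quick way to confirm it works (a minimal sketch, assuming the nhk user created above):

# Switch to the new user and confirm passwordless sudo
su - nhk
sudo -l          # should list (ALL) NOPASSWD: ALL for nhk
sudo whoami      # should print root without prompting for a password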

Notice:

The line nhk ALL=(ALL) NOPASSWD: ALL must be placed below the %wheel line, not directly below the root line. sudoers rules are applied top to bottom and the last matching rule wins, so a line placed above %wheel would be overridden by it if nhk belongs to the wheel group.

Create a unified working directory

(This directory layout will later need to be created on every machine in the cluster)

# Personal preference
mkdir -p /opt/software/  # software installation directory and installation package directory
mkdir -p /opt/data/      # data storage path

# Change the owner and group of the directories (skip if building the cluster as root)
chown nhk:nhk /opt/software
chown nhk:nhk /opt/data

# Layout recommended by Heima (itheima)
mkdir -p /export/server/    # software installation directory
mkdir -p /export/data/      # data storage path
mkdir -p /export/software/  # installation package directory

SSH password-free login

# Only kk01 --> kk01 kk02 kk03 needs to be configured
ssh-keygen  # generate the public/private key pair on kk01

# Configure passwordless login kk01 --> kk01 kk02 kk03
ssh-copy-id kk01
ssh-copy-id kk02
ssh-copy-id kk03 
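
To confirm the passwordless login works, a minimal check from kk01 (assuming the three hosts above):

# Each command should print the remote hostname without asking for a password
for host in kk01 kk02 kk03
do
    ssh $host hostname
done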

Cluster time synchronization

# On every machine in the cluster
yum -y install ntpdate

ntpdate ntp4.aliyun.com
# or
ntpdate ntp5.aliyun.com

# Check the time
date
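
If you want the clocks to stay synchronized, one common optional approach (not required by the steps above) is a cron job on every node; this is a sketch, and the ntpdate path may differ on your system:

# Optional: resync against the Aliyun NTP server every 30 minutes
crontab -e
# add a line such as the following (path assumption: /usr/sbin/ntpdate)
*/30 * * * * /usr/sbin/ntpdate ntp4.aliyun.com > /dev/null 2>&1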

Uninstall the JDK that ships with Linux

(If your Linux does not ship with OpenJDK, skip this step)

# Command to list the installed JDK packages
rpm -qa | grep -i java   # -i ignores case
# Remove a package
rpm -e --nodeps <jdk-package-to-remove>


# Alternatively, remove everything with the following command
rpm -qa | grep -i java | xargs -n1 rpm -e --nodeps
# xargs -n1 passes one argument at a time
# -e --nodeps forces removal of the package, ignoring dependencies

# After removal, check whether a JDK is still present
java -version

Configure JDK environment variables

Note: Before installing the JDK, make sure the JDK that shipped with Linux has been uninstalled.

# Upload the JDK archive (with the rz command or an FTP tool)

# Extract the JDK
tar -zxvf jdk-8u152-linux-x64.tar.gz -C /opt/software/

# Configure the JDK environment (prefix with sudo if you are a normal user)
sudo vim /etc/profile

export JAVA_HOME=/opt/software/jdk1.8.0_152
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

# Reload the environment variable file
source /etc/profile
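
After reloading the profile, you can verify the JDK environment (assuming the paths used above):

# Verify the JDK installation and environment variables
java -version          # should report version 1.8.0_152
echo $JAVA_HOME        # should print /opt/software/jdk1.8.0_152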

Hadoop installation

Extract the Hadoop archive

# Upload the Hadoop archive
cd /opt/software/
rz

# Extract it
tar -xvzf hadoop-3.2.2.tar.gz
# Create a data directory on every node for storing data (see the sketch below)
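
The data directory itself is not created by the commands above; a minimal sketch of creating it, matching the hadoop.tmp.dir value used later in core-site.xml (run on each node, or over ssh):

# Create the data directory on kk01, kk02, kk03
mkdir -p /opt/data/hadoop-3.2.2
# If building as a normal user, also adjust ownership, e.g.:
# chown -R nhk:nhk /opt/data/hadoop-3.2.2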

Configuration file description

Category 1 (1 file): hadoop-env.sh

Category 2 (4 files): xxxx-site.xml, where "site" indicates user-defined configuration that overrides the corresponding default configuration.

core-site.xml      core module configuration

hdfs-site.xml      HDFS file system module configuration

mapred-site.xml    MapReduce module configuration

yarn-site.xml      YARN module configuration

Category 3 (1 file): workers

All configuration files are located in: /opt/software/hadoop-3.2.2/etc/hadoop

Configure NameNode (core-site.xml)

cd /opt/software/hadoop-3.2.2/etc/hadoop
vim core-site.xml   

# Add the following content
<!-- Default file system used by the cluster; Hadoop supports file, HDFS, GFS, Ali/Amazon cloud, and other file systems -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://kk01:8020</value>
  <description>Specifies the NameNode URL</description>
</property>

<!-- Path where Hadoop stores its local data -->
<property>
    <name>hadoop.tmp.dir</name>
    <!-- /opt/data/hadoop-3.2.2 works; placing it under ${HADOOP_HOME}/data is also commonly recommended -->
    <value>/opt/data/hadoop-3.2.2</value>
</property>
<!-- Set the HDFS web UI static user to root; if you build as a normal user, use that user instead -->
<property>
    <name>hadoop.http.staticuser.user</name>
    <value>root</value>
</property>

<!-- Buffer size; tune it according to server performance in real deployments -->
<property>
    <name>io.file.buffer.size</name>
    <value>4096</value>
</property>

Configure HDFS (hdfs-site.xml)

vim hdfs-site.xml

# Add the following content
<!-- Location of the machine where the SecondaryNameNode (SNN) process runs -->
<!-- 2NN web UI address -->
<property>
	<name>dfs.namenode.secondary.http-address</name>
	<value>kk02:9868</value>
</property>

<!-- NN web UI address -->
<!-- Hadoop 3 uses port 9870 by default, so this setting is optional -->
<property>
    <name>dfs.namenode.http-address</name>
    <value>kk01:9870</value>
</property>

Configure YARN (yarn-site.xml)

vim yarn-site.xml

# Add the following content

<!-- Machine where the YARN cluster master role (ResourceManager) runs -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>kk01</value>
</property>
<!-- Auxiliary service run on the NodeManager; must be mapreduce_shuffle for MR jobs to run -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<!-- Whether to enforce physical memory limits on containers; this may be adjusted in production -->
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<!-- Whether to enforce virtual memory limits on containers; this may be adjusted in production -->
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>

<!-- Log aggregation makes it easy to inspect job execution details, which helps development and debugging -->
<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- YARN history server address -->
<property>
    <name>yarn.log.server.url</name>
    <value>http://kk01:19888/jobhistory/logs</value>
</property>
<!-- How long aggregated logs are retained: 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>

Configure MapReduce (mapred-site.xml)

vim mapred-site.xml

# Add the following content
<!-- Default execution framework for MR jobs: yarn for cluster mode, local for local mode -->
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>

<!-- Configure the history server so that past job runs can be inspected -->
<!-- MR job history server address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>kk01:10020</value>
</property>
<!-- MR job history server web UI address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>kk01:19888</value>
</property>

<!-- MR ApplicationMaster environment variables -->
<property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<!-- MR map task environment variables -->
<property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<!-- MR reduce task environment variables -->
<property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>

workers file (this file defines the cluster's worker nodes)

vim /opt/software/hadoop-3.2.2/etc/hadoop/workers

# Replace the file contents with the following
kk01
kk02
kk03

# Note: lines in this file must not have trailing spaces, and the file must not contain blank lines.
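
Before distributing anything, a quick check that the edited XML files are well-formed can save a failed startup later. This is optional and assumes xmllint (part of libxml2) is installed:

# Optional: verify the four site XML files parse cleanly
cd /opt/software/hadoop-3.2.2/etc/hadoop
for f in core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml
do
    xmllint --noout $f && echo "$f OK"
done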

Modify hadoop-env.sh environment variables

If you are building Hadoop as the root user you need this configuration; otherwise it can be skipped.

vim /opt/software/hadoop-3.2.2/etc/hadoop/hadoop-env.sh

# Add the following content
export JAVA_HOME=/opt/software/jdk1.8.0_152
# If running as the root user, the daemon users must be declared explicitly, as follows
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root

Configure environment variables

# Normal users need sudo; the root user does not. This will not be repeated again below.
sudo vi /etc/profile

# Hadoop environment variables
export HADOOP_HOME=/opt/software/hadoop-3.2.2
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

# Reload the configuration file
source /etc/profile
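
With the variables loaded, a minimal check that Hadoop is on the PATH:

# Verify the Hadoop environment variables took effect
hadoop version      # should report Hadoop 3.2.2
echo $HADOOP_HOME   # should print /opt/software/hadoop-3.2.2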

Distribute the configured Hadoop installation package

(The steps here are demonstrated for a cluster built as the root user; normal users only need slight changes.)

# When using $PWD, remember to cd into the corresponding directory (/opt/software/) first
scp -r /opt/software/hadoop-3.2.2 root@kk02:$PWD
scp -r /opt/software/hadoop-3.2.2 root@kk03:$PWD

# If the other nodes do not have the JDK yet, remember to distribute it as well
scp -r /opt/software/jdk1.8.0_152 root@kk02:$PWD
scp -r /opt/software/jdk1.8.0_152 root@kk03:$PWD

# Distribute the profile file
scp /etc/profile  root@kk02:/etc/profile
scp /etc/profile  root@kk03:/etc/profile

# Run this on each node that received the file
source /etc/profile

rsync is mainly used for backup and mirroring. It is fast, avoids copying identical content, and supports symbolic links.

The difference between rsync and scp: rsync is faster than scp because it only transfers the files that differ, whereas scp copies all files.

Basic syntax

rsync -av <path/name of files to copy> <user>@<destination host>:<destination path/name>

# Option description
# -a  archive mode
# -v  show the transfer progress

The following demonstrates distributing as an ordinary user to an ordinary user on the destination hosts.

rsync -av /opt/software/hadoop-3.2.2 nhk@kk02:/opt/software/
rsync -av /opt/software/hadoop-3.2.2 nhk@kk03:/opt/software/

rsync -av /opt/software/jdk1.8.0_152 nhk@kk02:/opt/software/
rsync -av /opt/software/jdk1.8.0_152 nhk@kk03:/opt/software/

# To make distribution easier, we can write a script (optional)
vim xsync
#!/bin/bash

#1. Check the number of arguments
if [ $# -lt 1 ]
then
    echo Not Enough Arguments!
    exit;
fi
#2. Iterate over all machines in the cluster
for host in kk01 kk02 kk03
do
    echo ====================  $host  ====================
    #3. Iterate over all given paths and send them one by one

    for file in $@
    do
        #4. Check that the file exists
        if [ -e $file ]
            then
                #5. Get the parent directory
                pdir=$(cd -P $(dirname $file); pwd)
                #6. Get the file name
                fname=$(basename $file)
                ssh $host "mkdir -p $pdir"
                rsync -av $pdir/$fname $host:$pdir
            else
                echo $file does not exist!
        fi
    done
done


# Make the script executable
chmod +x xsync
# Copy the script to /bin so that it can be called from anywhere
sudo cp xsync /bin/  # normal users need sudo
# Distribute the environment variable file
[nhk@kk01 ~]$ sudo ./bin/xsync /etc/profile  # as a normal user

# Note: when using sudo, the path to xsync must be given in full.

# Reload the environment variables
source /etc/profile

# Distribute with the custom script
xsync /opt/software/hadoop-3.2.2/

HDFS formatting

When HDFS is started for the first time, it must be formatted. This is essentially some cleanup and preparation work, because at this point HDFS does not yet physically exist.

cd /opt/software/hadoop-3.2.2
bin/hdfs namenode -format

# Because the environment variables are configured, the format operation can also be simplified to:
hdfs namenode -format

# If the output contains "successfully formatted", the format succeeded
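
Besides the "successfully formatted" message, you can also confirm the format by checking that the NameNode metadata directory was created. The path below is an assumption based on the hadoop.tmp.dir configured in core-site.xml:

# The format should have created the NameNode metadata directory
ls /opt/data/hadoop-3.2.2/dfs/name/current/
cat /opt/data/hadoop-3.2.2/dfs/name/current/VERSION   # contains the clusterID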

Cluster startup

# Start the HDFS cluster
start-dfs.sh

# Start the YARN cluster
start-yarn.sh

View the HDFS NameNode web UI

http://kk01:9870

View the YARN ResourceManager web UI

http://kk01:8088
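
If the pages do not open in a browser, a quick command-line check (a sketch, assuming curl is available; browser access also relies on the Windows hosts mapping configured earlier):

# Check that the two web UIs respond (run on any cluster node)
curl -s -o /dev/null -w "NameNode UI: %{http_code}\n" http://kk01:9870
curl -s -o /dev/null -w "ResourceManager UI: %{http_code}\n" http://kk01:8088
# 200 means the UI is up; connection refused usually means the daemon is not running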

Write common scripts for the Hadoop cluster

Hadoop cluster startup and shutdown script

vim hdp.sh

#!/bin/bash

if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit ;
fi

case $1 in
"start")
        echo " =================== Starting the Hadoop cluster ==================="

        echo " --------------- starting hdfs ---------------"
        ssh kk01 "/opt/software/hadoop-3.2.2/sbin/start-dfs.sh"
        echo " --------------- starting yarn ---------------"
        ssh kk01 "/opt/software/hadoop-3.2.2/sbin/start-yarn.sh"
        echo " --------------- starting historyserver ---------------"
        ssh kk01 "/opt/software/hadoop-3.2.2/bin/mapred --daemon start historyserver"
;;
"stop")
        echo " =================== Stopping the Hadoop cluster ==================="

        echo " --------------- stopping historyserver ---------------"
        ssh kk01 "/opt/software/hadoop-3.2.2/bin/mapred --daemon stop historyserver"
        echo " --------------- stopping yarn ---------------"
        ssh kk01 "/opt/software/hadoop-3.2.2/sbin/stop-yarn.sh"
        echo " --------------- stopping hdfs ---------------"
        ssh kk01 "/opt/software/hadoop-3.2.2/sbin/stop-dfs.sh"
;;
*)
    echo "Input Args Error..."
;;
esac

# Make the script executable
chmod +x hdp.sh
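
Typical usage of the script (assuming it is called from the directory where it was created, or placed on the PATH):

# Start the whole cluster (HDFS, YARN, history server)
./hdp.sh start
# Stop it again
./hdp.sh stop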

Script to view the Java processes on all three servers: jpsall

vim jpsall

#!/bin/bash
for host in kk01 kk02 kk03
do
        echo "=========$host========="
        ssh $host "source /etc/profile; jps"
done

# Make the script executable
chmod +x jpsall

Distribute the custom scripts so that all three servers can use them

xsync <directory where the scripts are placed>

Cluster setup FAQs and things to note

If you operate as the root user, you need to add the following configuration to the hadoop-env.sh file.

# Add the following to hadoop-env.sh
export HDFS_NAMENODE_USER="root"
export HDFS_DATANODE_USER="root"
export HDFS_SECONDARYNAMENODE_USER="root"
export YARN_RESOURCEMANAGER_USER="root"
export YARN_NODEMANAGER_USER="root"

To operate as root, you also need to modify the following configuration in the core-site.xml file.

<!-- User name used by the web UI to access HDFS -->
<!-- Configure root as the static user for HDFS logins -->
<property>
	<name>hadoop.http.staticuser.user</name>
	<value>root</value>
</property>

The firewall is not turned off, or YARN is not started

INFO client.RMProxy:Connecting to ResourceManager at kk01/192.168.188.128:8032

Solution:

Add 192.168.188.128 kk01 to the /etc/hosts file

Hostname configuration error

Solution:

Change the hostname. Do not use names such as hadoop or hadoop000.

Hostname not recognized

java.net.UnknownHostException: kk01: kk01
     at java.net.InetAddress.getLocalHost(InetAddress.java:1475)
       at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:146)
      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
      at java.security.AccessController.doPrivileged(Native Method)
	  at javax.security.auth.Subject.doAs(Subject.java:415)

Solution:

Add 192.168.188.128 kk01 to the /etc/hosts file

Do not use hostnames such as hadoop or hadoop000

Only one of the DataNode and NameNode processes works at a time

Solution:

Before reformatting, delete the DataNode's data (by default it is under the /tmp directory; if another directory was configured, delete the data in that configured directory)

jps shows that a process no longer exists, but restarting the cluster reports that the process is already running

Solution:

The /tmp directory under the Linux root directory still contains temporary files of previously started processes. Delete the cluster-related files there and then restart the cluster.

jps does not work

Solution:

Cause: the global Hadoop/Java environment variables have not taken effect. Fix: source the /etc/profile file.

Port 8088 cannot be reached

Solution:

cat /etc/hosts
# Comment out the following entries
#127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1         kk01

Hadoop cluster summary

  • The format operation is executed only once, when the Hadoop cluster is started for the first time.

  • Formatting repeatedly will cause data loss and may also leave the cluster's master and slave roles unable to recognize each other (solution: delete the hadoop.tmp.dir directory and the data and logs directories on all nodes, then reformat; see the sketch after this list)
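
A sketch of that recovery procedure, with paths that assume the layout used in this article (double-check them on your own cluster before deleting anything):

# 1. Stop the cluster first
stop-all.sh
# 2. On every node, remove the data and log directories (paths per this article's layout)
rm -rf /opt/data/hadoop-3.2.2
rm -rf /opt/software/hadoop-3.2.2/logs
# 3. Reformat on the NameNode node (kk01) only, then restart
hdfs namenode -format
start-all.sh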

Hadoop cluster startup and shutdown

Manually start/stop processes one by one on each machine

Advantage: allows precise control over starting and stopping each individual process

HDFS cluster

hdfs --daemon start namenode|datanode|secondarynamenode

hdfs --daemon stop namenode|datanode|secondarynamenode

YARN cluster

yarn --daemon start resourcemanager|nodemanager

yarn --daemon stop resourcemanager|nodemanager

One-click start and stop with shell scripts

Start everything with one command using the shell scripts that ship with Hadoop

Prerequisite: passwordless SSH between the machines and the workers file are configured

HDFS cluster

start-dfs.sh
stop-dfs.sh

YARN cluster

start-yarn.sh
stop-yarn.sh

Hadoop cluster

start-all.sh
stop-all.sh

Hadoop cluster startup log

After startup completes, you can use the jps command to check whether the processes have started.

If some processes fail to start or crash after starting, check the log files (.log) in the logs directory.

Hadoop startup logs are written to the logs directory under the Hadoop installation directory (see the example below).
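
For example, if the NameNode fails to start, the log to inspect would look roughly like this; the exact file name depends on the user and hostname:

# Log file names follow the pattern hadoop-<user>-<daemon>-<hostname>.log
cd /opt/software/hadoop-3.2.2/logs
ls
tail -n 100 hadoop-root-namenode-kk01.log   # example name for a root-built cluster on kk01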

HDFS benchmark

In a real production environment, after the Hadoop environment is set up, the first thing to do is run benchmarks: stress tests that measure the cluster's read and write speed, check whether the network bandwidth is sufficient, and so on.

Test write speed

Write data into the HDFS file system: 10 files of 10 MB each, stored under /benchmarks/TestDFSIO

# Start the Hadoop cluster
start-all.sh

# With the HDFS and YARN clusters running, execute the following command
hadoop jar /opt/software/hadoop-3.2.2/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.2.2-tests.jar TestDFSIO -write -nrFiles 10  -fileSize 10MB

# Explanation: writes 10 files of 10 MB each into HDFS, stored under /benchmarks/TestDFSIO
# Throughput: write throughput; Average IO rate: average IO rate; IO rate std deviation: standard deviation of the IO rate

2023-04-01 11:12:56,880 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
2023-04-01 11:12:56,880 INFO fs.TestDFSIO:             Date & time: Sat Apr 01 11:12:56 EDT 2023
2023-04-01 11:12:56,880 INFO fs.TestDFSIO:         Number of files: 10
2023-04-01 11:12:56,880 INFO fs.TestDFSIO:  Total MBytes processed: 100
2023-04-01 11:12:56,880 INFO fs.TestDFSIO:       Throughput mb/sec: 4.81
2023-04-01 11:12:56,880 INFO fs.TestDFSIO:  Average IO rate mb/sec: 7.58
2023-04-01 11:12:56,880 INFO fs.TestDFSIO:   IO rate std deviation: 7.26
2023-04-01 11:12:56,880 INFO fs.TestDFSIO:      Test exec time sec: 34.81
2023-04-01 11:12:56,880 INFO fs.TestDFSIO: 

As shown above, the current write IO throughput on the virtual machines is about 4.81 MB/s.

Test read speed

Test HDFS file read performance: read 10 files of 10 MB each from HDFS

hadoop jar /opt/software/hadoop-3.2.2/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.2.2-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 10MB


# Explanation: reads 10 files of 10 MB each from HDFS
# Throughput: read throughput; Average IO rate: average IO rate; IO rate std deviation: standard deviation of the IO rate

2023-04-01 11:16:44,479 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
2023-04-01 11:16:44,479 INFO fs.TestDFSIO:             Date & time: Sat Apr 01 11:16:44 EDT 2023
2023-04-01 11:16:44,479 INFO fs.TestDFSIO:         Number of files: 10
2023-04-01 11:16:44,479 INFO fs.TestDFSIO:  Total MBytes processed: 100
2023-04-01 11:16:44,479 INFO fs.TestDFSIO:       Throughput mb/sec: 33.8
2023-04-01 11:16:44,479 INFO fs.TestDFSIO:  Average IO rate mb/sec: 93.92
2023-04-01 11:16:44,479 INFO fs.TestDFSIO:   IO rate std deviation: 115.03
2023-04-01 11:16:44,479 INFO fs.TestDFSIO:      Test exec time sec: 25.72
2023-04-01 11:16:44,479 INFO fs.TestDFSIO: 

The read throughput is about 33.8 MB/s.

Clear the test data

During the test, a /benchmarks directory is created on the HDFS cluster. After the test completes, we can clean it up.

# Inspect
hdfs dfs -ls -R /benchmarks

# Run the cleanup
hadoop jar  /opt/software/hadoop-3.2.2/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.2.2-tests.jar TestDFSIO -clean

# Note: the clean command removes the contents of the /benchmarks directory
