Big Data Technology: Hadoop Cluster Configuration


  1. Cluster deployment planning

Notice:

  • NameNode and SecondaryNameNode should not be installed on the same server

  • ResourceManager also consumes a lot of memory; do not place it on the same machine as the NameNode or SecondaryNameNode.

       | hadoop102          | hadoop103                    | hadoop104
HDFS   | NameNode, DataNode | DataNode                     | SecondaryNameNode, DataNode
YARN   | NodeManager        | ResourceManager, NodeManager | NodeManager


  2. Configuration file description

Hadoop has two types of configuration files: default configuration files and custom configuration files. A user only needs to touch a custom configuration file when they want to override a default value; setting the corresponding property there replaces the default.

2.1 Default configuration files

Default file       | Location inside the Hadoop jar packages
core-default.xml   | hadoop-common-3.1.3.jar/core-default.xml
hdfs-default.xml   | hadoop-hdfs-3.1.3.jar/hdfs-default.xml
yarn-default.xml   | hadoop-yarn-common-3.1.3.jar/yarn-default.xml
mapred-default.xml | hadoop-mapreduce-client-core-3.1.3.jar/mapred-default.xml

2.2 Custom configuration files

The four custom configuration files core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml are stored under $HADOOP_HOME/etc/hadoop, and users can modify them to fit project requirements.
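As a quick, self-contained illustration of this precedence (not real Hadoop code; a naive grep over single-line property entries in stand-in temp files), the sketch below resolves a property from the site file first and falls back to the default file only when the site file does not define it:

```shell
#!/bin/bash
# Sketch: resolve a Hadoop property, preferring a custom *-site.xml
# over the bundled *-default.xml. The files are temporary stand-ins.
workdir=$(mktemp -d)

cat > "$workdir/core-default.xml" <<'EOF'
<configuration>
  <property><name>fs.defaultFS</name><value>file:///</value></property>
</configuration>
EOF

cat > "$workdir/core-site.xml" <<'EOF'
<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://hadoop102:8020</value></property>
</configuration>
EOF

# get_prop NAME FILE...: print NAME's value from the first file that defines it
get_prop() {
  local name=$1 f val
  shift
  for f in "$@"; do
    val=$(grep -o "<name>$name</name><value>[^<]*" "$f" | sed 's/.*<value>//')
    [ -n "$val" ] && { echo "$val"; return; }
  done
}

resolved=$(get_prop fs.defaultFS "$workdir/core-site.xml" "$workdir/core-default.xml")
echo "$resolved"    # the site file wins over the default
rm -rf "$workdir"
```

Real Hadoop merges the two files internally; the point here is only that a value present in the site file shadows the default.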


3. Configure the cluster

3.1 Cluster configuration files

3.1.1 Configure core-site.xml

[atguigu@hadoop102 ~]$ cd $HADOOP_HOME/etc/hadoop
[atguigu@hadoop102 hadoop]$ vim core-site.xml

The content of the file is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Specify the NameNode address -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop102:8020</value>
    </property>

    <!-- Specify the Hadoop data storage directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-3.1.3/data</value>
    </property>

    <!-- Set atguigu as the static user for HDFS web UI login -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>atguigu</value>
    </property>
</configuration>

3.1.2 Configure hdfs-site.xml [HDFS configuration file]

[atguigu@hadoop102 ~]$ cd $HADOOP_HOME/etc/hadoop
[atguigu@hadoop102 hadoop]$ vim hdfs-site.xml

The content of the file is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- NameNode web UI address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>hadoop102:9870</value>
    </property>
    <!-- SecondaryNameNode web UI address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop104:9868</value>
    </property>
</configuration>

3.1.3 Configure yarn-site.xml [YARN configuration file]

[atguigu@hadoop102 ~]$ cd $HADOOP_HOME/etc/hadoop
[atguigu@hadoop102 hadoop]$ vim yarn-site.xml

The content of the file is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Use mapreduce_shuffle as the MR shuffle service -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

    <!-- Specify the ResourceManager address -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop103</value>
    </property>

    <!-- Environment variable inheritance -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>

3.1.4 Configure mapred-site.xml [MapReduce configuration file]

[atguigu@hadoop102 ~]$ cd $HADOOP_HOME/etc/hadoop
[atguigu@hadoop102 hadoop]$ vim mapred-site.xml

The content of the file is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Run MapReduce programs on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

3.1.5 Distribute the configured Hadoop configuration files on the cluster

[atguigu@hadoop102 ~]$ xsync /opt/module/hadoop-3.1.3/etc/hadoop/
# check the distributed files on hadoop103 and hadoop104
[atguigu@hadoop103 ~]$ cat /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml
[atguigu@hadoop104 ~]$ cat /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml
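xsync here is the custom rsync-based distribution script written earlier in this course, not a standard tool. If you do not have it, a minimal dry-run sketch of the idea looks like this (hostnames are the ones from this guide; it only prints the rsync commands it would run):

```shell
#!/bin/bash
# Dry-run sketch of an xsync-style helper: distribute a path to the
# other cluster nodes with rsync. Echoes commands instead of running them.
HOSTS="hadoop103 hadoop104"
target=${1:-/opt/module/hadoop-3.1.3/etc/hadoop/}

cmds=""
for h in $HOSTS; do
  cmds="$cmds rsync -av $target $h:$target"$'\n'
done
printf '%s' "$cmds"
```

The real script also resolves relative paths and skips files that do not exist; this sketch only shows the distribution loop.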

3.2 Bringing up the whole cluster

3.2.1 Configuring workers

[atguigu@hadoop102 hadoop]$ vim /opt/module/hadoop-3.1.3/etc/hadoop/workers

Add the following to this file:

hadoop102
hadoop103
hadoop104

Note: lines added to this file must not have trailing spaces, and the file must not contain blank lines.
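Because trailing spaces and blank lines in workers silently break host resolution, a quick sanity check can help. The sketch below is self-contained (it writes a deliberately broken temporary copy rather than touching the real file):

```shell
#!/bin/bash
# Sketch: detect blank lines or trailing whitespace in a workers file.
workers=$(mktemp)
printf 'hadoop102\nhadoop103 \n\nhadoop104\n' > "$workers"   # deliberately broken

problems=0
lineno=0
while IFS= read -r line; do
  lineno=$((lineno + 1))
  if [ -z "$line" ]; then
    echo "line $lineno: blank line"; problems=$((problems + 1))
  elif [ "$line" != "${line%[[:space:]]}" ]; then
    echo "line $lineno: trailing whitespace"; problems=$((problems + 1))
  fi
done < "$workers"
echo "$problems problem(s) found"
rm -f "$workers"
```

Point the loop at /opt/module/hadoop-3.1.3/etc/hadoop/workers to check the real file.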

Synchronize all node configuration files:

[atguigu@hadoop102 hadoop]$ xsync /opt/module/hadoop-3.1.3/etc

3.2.2 Start the cluster

(1) Format the NameNode.

If the cluster is being started for the first time, you need to format the NameNode on the hadoop102 node. (Note: formatting the NameNode generates a new cluster ID; if old data exists, the cluster IDs on the NameNode and DataNodes become inconsistent and the cluster cannot find its past data. If the cluster has been running and you need to reformat the NameNode, first stop the namenode and datanode processes, then delete the data and logs directories on all machines before formatting.)

[atguigu@hadoop102 hadoop-3.1.3]$ hdfs namenode -format
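Because the re-format procedure in the note above is destructive and easy to get wrong, here is a hedged dry-run sketch of that sequence (hostnames and paths are the ones used in this guide). It only prints the commands so you can review them before running anything:

```shell
#!/bin/bash
# Dry-run sketch of the safe NameNode re-format sequence described above.
# It only PRINTS the plan; nothing is executed or deleted.
HADOOP_DIR=/opt/module/hadoop-3.1.3
HOSTS="hadoop102 hadoop103 hadoop104"

plan=""
add() { plan="$plan$1"$'\n'; }

# 1. Stop the HDFS daemons first
add "$HADOOP_DIR/sbin/stop-dfs.sh"
# 2. Delete the data and logs directories on every machine
for h in $HOSTS; do
  add "ssh $h rm -rf $HADOOP_DIR/data $HADOOP_DIR/logs"
done
# 3. Format the NameNode on hadoop102 only
add "ssh hadoop102 $HADOOP_DIR/bin/hdfs namenode -format"

printf '%s' "$plan"
```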

(2) Start HDFS

[atguigu@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh

(3) Start YARN

Start YARN on the node (hadoop103) configured with ResourceManager

[atguigu@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh

3.2.3 Summary of cluster start/stop methods

(1) Start/stop each module as a whole (requires SSH to be configured); this is the most commonly used approach

  • Overall start/stop HDFS

start-dfs.sh/stop-dfs.sh
  • Overall start/stop YARN

start-yarn.sh/stop-yarn.sh

(2) Each service component starts/stops one by one

  • Start/stop HDFS components individually

hdfs --daemon start/stop namenode/datanode/secondarynamenode
  • Start/stop YARN components individually

yarn --daemon start/stop resourcemanager/nodemanager

3.3 Writing common scripts for Hadoop clusters

3.3.1 Hadoop cluster start/stop script (HDFS, YARN, HistoryServer): myhadoop.sh

[atguigu@hadoop102 ~]$ cd /home/atguigu/bin
[atguigu@hadoop102 bin]$ vim myhadoop.sh
  • Enter the following:

#!/bin/bash

if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit ;
fi

case $1 in
"start")
        echo " ================ Starting the Hadoop cluster ================"
        echo " --------------- starting hdfs ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/start-dfs.sh"
        echo " --------------- starting yarn ---------------"
        ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/start-yarn.sh"
        echo " --------------- starting historyserver ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon start historyserver"
;;
"stop")
        echo " ================ Stopping the Hadoop cluster ================"
        echo " --------------- stopping historyserver ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon stop historyserver"
        echo " --------------- stopping yarn ---------------"
        ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/stop-yarn.sh"
        echo " --------------- stopping hdfs ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/stop-dfs.sh"
;;
*)
    echo "Input Args Error..."
;;
esac
  • Save and exit, then give the script execution permission

[atguigu@hadoop102 bin]$ chmod +x myhadoop.sh

3.3.2 Script to view the Java processes on all three servers: jpsall

[atguigu@hadoop102 ~]$ cd /home/atguigu/bin
[atguigu@hadoop102 bin]$ vim jpsall
  • Enter the following:

#!/bin/bash

for host in hadoop102 hadoop103 hadoop104
do
        echo =============== $host ===============
        ssh $host jps 
done
  • Save and exit, then give the script execution permission

[atguigu@hadoop102 bin]$ chmod +x jpsall

Distribute the /home/atguigu/bin directory to ensure that the custom script can be used on all three machines.

[atguigu@hadoop102 ~]$ xsync /home/atguigu/bin/

3.4 Common URLs

(1) View the HDFS NameNode on the web

① Enter in the browser: http://hadoop102:9870

② View the data stored on HDFS

(2) View the YARN ResourceManager on the web

① Enter in the browser: http://hadoop103:8088

② View information about the jobs running on YARN


3.5 Configure the history server

To be able to review jobs after they finish running, you need to configure the MapReduce history server. The steps are as follows:

1) Configure mapred-site.xml

[atguigu@hadoop102 ~]$ cd $HADOOP_HOME/etc/hadoop
[atguigu@hadoop102 hadoop]$ vim mapred-site.xml

Add the following configuration to this file:

<!-- History server address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop102:10020</value>
</property>

<!-- History server web UI address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop102:19888</value>
</property>

2) Distribute the configuration

[atguigu@hadoop102 hadoop]$ xsync $HADOOP_HOME/etc/hadoop/mapred-site.xml

3) Start the history server on hadoop102

[atguigu@hadoop102 hadoop]$ mapred --daemon start historyserver

4) Check whether the history server is started

[atguigu@hadoop102 hadoop]$ jps

5) View JobHistory

http://hadoop102:19888/jobhistory

3.6 Configure log aggregation

Log aggregation: after an application finishes running, its run logs are uploaded to HDFS.

Benefit of log aggregation: you can conveniently view the details of a program run, which helps development and debugging.

Note: enabling log aggregation requires restarting NodeManager, ResourceManager, and HistoryServer.

The steps to enable log aggregation are as follows:

1) Configure yarn-site.xml

[atguigu@hadoop102 ~]$ cd $HADOOP_HOME/etc/hadoop
[atguigu@hadoop102 hadoop]$ vim yarn-site.xml

Add the following configuration to this file:

<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Set the log aggregation server URL -->
<property>  
    <name>yarn.log.server.url</name>  
    <value>http://hadoop102:19888/jobhistory/logs</value>
</property>
<!-- Retain logs for 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>
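The retention value 604800 above is simply 7 days expressed in seconds; if you need a different window, it can be computed the same way (plain shell arithmetic, nothing Hadoop-specific):

```shell
#!/bin/bash
# Compute yarn.log-aggregation.retain-seconds for a given number of days.
days=7
retain_seconds=$((days * 24 * 60 * 60))
echo "$retain_seconds"   # 7 days -> 604800
```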

2) Distribute the configuration

[atguigu@hadoop102 hadoop]$ xsync $HADOOP_HOME/etc/hadoop/yarn-site.xml

3) Stop NodeManager, ResourceManager, and HistoryServer

[atguigu@hadoop103 hadoop-3.1.3]$ sbin/stop-yarn.sh
[atguigu@hadoop102 hadoop-3.1.3]$ mapred --daemon stop historyserver

4) Start NodeManager, ResourceManager, and HistoryServer

[atguigu@hadoop103 ~]$ start-yarn.sh
[atguigu@hadoop102 ~]$ mapred --daemon start historyserver

5) View logs

The list of historical jobs can be found at the history server address:

http://hadoop102:19888/jobhistory

3.7 Common port numbers

Port name                            | Hadoop 2.x  | Hadoop 3.x
NameNode internal communication port | 8020 / 9000 | 8020 / 9000 / 9820
NameNode HTTP UI                     | 50070       | 9870
MapReduce task view (YARN web UI)    | 8088        | 8088
History server communication port    | 19888       | 19888

3.8 Cluster time synchronization

If the servers can reach the public Internet, cluster time synchronization is optional, because each server periodically calibrates itself against public time servers.

If the servers are on an isolated internal network, cluster time synchronization must be configured; otherwise clock drift accumulates over time and the cluster's task timing gets out of sync.

1) Requirement

Pick one machine as the time server, and have all the other machines synchronize with it on a schedule. In production, choose the sync period based on how accurate task timing needs to be. In the test environment, sync once per minute so the effect can be seen quickly.

2) Time server configuration (must be done as root)

(1) Check the ntpd service status and boot auto-start status on hadoop102 (if the service is running, stop it)

[atguigu@hadoop102 ~]$ sudo systemctl status ntpd
[atguigu@hadoop102 ~]$ sudo systemctl is-enabled ntpd
# stop the service
[atguigu@hadoop102 ~]$ sudo systemctl stop ntpd

(2) Edit the ntp.conf file on hadoop102

[atguigu@hadoop102 ~]$ sudo vim /etc/ntp.conf

Make the following changes:

① Change 1 (authorize all machines in the 192.168.10.0-192.168.10.255 subnet to query and synchronize time from this server)
#restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap
====> uncomment the line above (and change 192.168.1.0 to 192.168.10.0)
restrict 192.168.10.0 mask 255.255.255.0 nomodify notrap

② Change 2 (the cluster is on a LAN, so do not use time servers on the public Internet)
server 0.centos.pool.ntp.org iburst
server 1.centos.pool.ntp.org iburst
server 2.centos.pool.ntp.org iburst
server 3.centos.pool.ntp.org iburst
====> comment out the lines above
#server 0.centos.pool.ntp.org iburst
#server 1.centos.pool.ntp.org iburst
#server 2.centos.pool.ntp.org iburst
#server 3.centos.pool.ntp.org iburst

③ Addition (when this node loses its network connection, it can still serve its local clock as the time source for the other nodes in the cluster)
server 127.127.1.0
fudge 127.127.1.0 stratum 10

(3) Edit the /etc/sysconfig/ntpd file on hadoop102

[atguigu@hadoop102 ~]$ sudo vim /etc/sysconfig/ntpd

Add the following line (keep the hardware clock in sync with the system clock):

SYNC_HWCLOCK=yes

(4) Restart the ntpd service

[atguigu@hadoop102 ~]$ sudo systemctl start ntpd

(5) Enable the ntpd service at boot

[atguigu@hadoop102 ~]$ sudo systemctl enable ntpd

3) Configuration on the other machines (must be done as root)

(1) Stop the ntp service and disable auto-start on all other nodes

[root@hadoop103 ~]$ systemctl stop ntpd
[root@hadoop103 ~]$ systemctl disable ntpd

[root@hadoop104 ~]$ systemctl stop ntpd
[root@hadoop104 ~]$ systemctl disable ntpd

(2) Configure the other machines to synchronize with the time server once per minute

[root@hadoop103 ~]$ crontab -e

Add the following scheduled task:

*/1 * * * * /usr/sbin/ntpdate hadoop102
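The `*/1` field means "every minute". If the production sync period should be longer, a crontab line for every N minutes can be generated as below (a small sketch; the ntpdate path and hostname are the ones used in this guide):

```shell
#!/bin/bash
# Sketch: build a crontab line that syncs with hadoop102 every N minutes.
make_cron_line() {
  echo "*/$1 * * * * /usr/sbin/ntpdate hadoop102"
}
line=$(make_cron_line 10)
echo "$line"    # an entry that runs every 10 minutes
```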

(3) Change the time on any machine (to test the sync)

[root@hadoop103 ~]$ date -s "2021-09-11 11:11:11"

(4) One minute later, check whether that machine has synchronized with the time server

[root@hadoop103 ~]$ date


Origin blog.csdn.net/m0_57126939/article/details/129170103