Linux CentOS 7 Hadoop study notes


Preparing a Hadoop node involves three changes: the host name, the IP address, and the MAC address (UUID).

1. Change the host name:

hostnamectl set-hostname hadoop1


Install the nano editor:

yum install nano


2. IP and gateway configuration:

Command:

nano /etc/sysconfig/network-scripts/ifcfg-ens33  # ens33 is the NIC name; modify BOOTPROTO and IPADDR, and add NETMASK and DNS1

Modified content:

BOOTPROTO=static  # static addressing
IPADDR=<IP address>  # this node's IP address
NETMASK=255.255.255.0  # subnet mask
DNS1=192.168.91.2  # DNS server; set it to the same address as the gateway
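
For reference, a minimal sketch of the relevant part of ifcfg-ens33 after editing, with illustrative values (the IP address is hadoop1's address from the hosts mapping below; GATEWAY is assumed to match the DNS1 value, and ONBOOT=yes is normally already present in the generated file):

BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.91.129   # illustrative: hadoop1
NETMASK=255.255.255.0
GATEWAY=192.168.91.2    # assumption: gateway matches DNS1 above
DNS1=192.168.91.2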


View the gateway configuration: cat /etc/sysconfig/network-scripts/ifcfg-ens33
Save: Ctrl+O-->Enter

Exit: Ctrl+X
3. Change the MAC address (regenerate the interface UUID): sed -i '/UUID=/c\UUID='`uuidgen`'' /etc/sysconfig/network-scripts/ifcfg-ens33
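After editing ifcfg-ens33, restart the network service so the new settings take effect (a standard CentOS 7 step, not spelled out in the original notes):

systemctl restart network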
Update the virtual machine packages and kernel: yum -y update
yum -y upgrade

Password-free (key-based) login:
Generate a key pair: ssh-keygen -t rsa
Enter the key directory: cd /root/.ssh  # (the leading dot marks a hidden directory)
View all files: ll -a
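
The notes do not show how the public key reaches the other hosts; a minimal sketch, assuming hadoop2 and hadoop3 are reachable and ssh-copy-id is installed:

ssh-copy-id root@hadoop2  # appends ~/.ssh/id_rsa.pub to hadoop2's authorized_keys (asks for the password once)
ssh-copy-id root@hadoop3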

Test key-free login: ssh root@hadoop2

Configure the IP-to-hostname mapping. After configuration, you can connect by hostname, for example ssh root@hadoop2
View the file: cat /etc/hosts
Edit the file using the nano tool: nano /etc/hosts
Add the following:
 

192.168.91.129 hadoop1
192.168.91.130 hadoop2
192.168.91.131 hadoop3

Install the time synchronization tool:

yum install chrony -y  # -y answers all prompts automatically
Verify the package: rpm -qa | grep chrony
Start the service: systemctl start chronyd
Check the service: systemctl status chronyd
Enable the service at boot: systemctl enable chronyd --now

Firewall configuration:
 

Check the status: systemctl status firewalld
Stop the firewall: systemctl stop firewalld
Disable the firewall at boot: systemctl disable firewalld

Configure time synchronization:
Edit time synchronization: nano /etc/chrony.conf
Comment out the default server 0/1/2/3 lines,
and add the cluster time server: server hadoop1 iburst  # the other two hosts synchronize from hadoop1
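
A minimal sketch of the relevant /etc/chrony.conf lines under this setup (hadoop1 serves time to the 192.168.91.0/24 subnet; the subnet value is an assumption based on the addresses above):

# On hadoop1: keep the default upstream servers and add
allow 192.168.91.0/24   # let the other nodes query this server
local stratum 10        # keep serving time even if upstream is unreachable

# On hadoop2 and hadoop3: comment out the default servers and add
server hadoop1 iburst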

Restart chronyd: systemctl restart chronyd
View time synchronization status: chronyc sources -v

Configure the JDK

Configure the environment variables in /etc/profile using the command nano /etc/profile

export JAVA_HOME=~/export/servers/jdk1.8.0_202
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

Execute the "source /etc/profile" command to reload the system environment variables so the configuration takes effect.

# Copy the environment variable file to the other hosts

scp /etc/profile root@hadoop2:/etc

# Copy the JDK to the other hosts

scp -r export/servers/ root@hadoop2:export/  # -r copies recursively (nested folders); this copies servers into export on the remote host


# Make the environment variables take effect (run on each host)

source /etc/profile
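
A quick check (not in the original notes) that the JDK is on the PATH after sourcing the profile:

java -version  # should report 1.8.0_202 if the JAVA_HOME above is correct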

Configure ZooKeeper

Copy zookeeper to /export/servers/

Enter the conf directory under the ZooKeeper installation directory.
Copy the template file zoo_sample.cfg to the zoo.cfg configuration file
cp zoo_sample.cfg zoo.cfg

Edit the content of zoo.cfg:

vi zoo.cfg

dataDir=/export/data/zookeeper/zkdata  # remember to create this folder: /export/data/zookeeper/zkdata
server.1=spark1:2888:3888
server.2=spark2:2888:3888
server.3=spark3:2888:3888

pwd shows the current path.

In the zkdata directory of hadoop1, execute: echo 1 > myid
In the zkdata directory of hadoop2, execute: echo 2 > myid
In the zkdata directory of hadoop3, execute: echo 3 > myid

Configure the zookeeper environment variable
export ZK_HOME=/export/servers/zookeeper-3.4.10  # note the zookeeper version
export PATH=$PATH:$ZK_HOME/bin

Then copy the zookeeper directory and /etc/profile to hosts 2 and 3, as sketched below.
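
A minimal sketch of the copy step, assuming the paths and host names used above:

scp -r /export/servers/zookeeper-3.4.10/ root@hadoop2:/export/servers/
scp -r /export/servers/zookeeper-3.4.10/ root@hadoop3:/export/servers/
scp /etc/profile root@hadoop2:/etc/
scp /etc/profile root@hadoop3:/etc/
# run "source /etc/profile" on each remote host afterwards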


zkServer.sh status  # view the status of zk
zkServer.sh start   # start zk
ps   # view processes
jps  # view Java-related processes

Install Hadoop:

Copy the hadoop package to /export/software/

Use the command to extract to /export/servers/

tar -zxvf /export/software/hadoop-2.7.4.tar.gz -C /export/servers/

Configure the environment scripts (hadoop-env.sh, yarn-env.sh)

Enter the etc/hadoop/ directory under the Hadoop installation directory; all of the following commands are run in this directory. Edit the hadoop-env.sh file:

vi hadoop-env.sh  # the nano command also works

Change the default JAVA_HOME parameter in the file to the path where the JDK is installed locally.

export JAVA_HOME=/export/servers/jdk1.8.0

Enter the /etc/hadoop/ directory of the Hadoop installation package and edit the yarn-env.sh file

vi yarn-env.sh

Change the default JAVA_HOME parameter in the file to the path where the JDK is installed locally, same as the previous step

Command to edit Hadoop's core configuration file core-site.xml

vi core-site.xml

Modify the following content:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://master</value><!-- master: the HDFS nameservice (cluster) name -->
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/export/servers/hadoop-2.7.4/tmp</value><!-- tmp: temporary file directory; create it manually if it does not exist -->
</property>
<property>
    <name>ha.zookeeper.quorum</name>
    <value>spark01:2181,spark02:2181,spark03:2181</value>
</property>

Command to edit the core configuration file hdfs-site.xml of HDFS

vi hdfs-site.xml

The modifications are as follows:

<property>
    <name>dfs.replication</name>
    <value>3</value><!-- number of replicas -->
</property>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>/export/data/hadoop/namenode</value><!-- create this path manually if it does not exist -->
</property>
<property>    
    <name>dfs.datanode.data.dir</name>    
    <value>/export/data/hadoop/datanode</value><!-- create this path manually if it does not exist -->
</property>
<!---->
<property>
    <name>dfs.nameservices</name>
    <value>master</value>
</property>
<property>
    <name>dfs.ha.namenodes.master</name>
    <value>nn1,nn2</value>
</property>
<property>
    <name>dfs.namenode.rpc-address.master.nn1</name>
    <value>spark01:9000</value>
</property>
<!---->
<property>
    <name>dfs.namenode.rpc-address.master.nn2</name>
    <value>spark02:9000</value>
</property>
<property>
    <name>dfs.namenode.http-address.master.nn1</name>
    <value>spark01:50070</value>
</property>
<property>
    <name>dfs.namenode.http-address.master.nn2</name>
    <value>spark02:50070</value>
</property>
<!---->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://spark01:8485;spark02:8485;spark03:8485/master</value>
</property>
<property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/export/data/hadoop/journaldata</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.master</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!---->
<property>
    <name>dfs.ha.fencing.methods</name>
    <value>
        sshfence
        shell(/bin/true)
    </value>
</property>
<property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/root/.ssh/id_rsa</value>
</property>
<property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
</property>
<!---->
<property>
	<name>dfs.ha.fencing.ssh.connect-timeout</name>
	<value>30000</value><!-- timeout in milliseconds -->
</property>
<property> 
	<name>dfs.webhdfs.enabled</name> 
	<value>true</value> 
</property>
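
The comments above note that several data directories must be created by hand; a sketch of those commands, assuming the paths configured above and that they are created on every node:

mkdir -p /export/servers/hadoop-2.7.4/tmp
mkdir -p /export/data/hadoop/namenode
mkdir -p /export/data/hadoop/datanode
mkdir -p /export/data/hadoop/journaldata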

Enter the etc/hadoop/ directory under the Hadoop installation directory and create MapReduce's core configuration file mapred-site.xml by copying the template file:

cp mapred-site.xml.template mapred-site.xml

Run the command to edit the configuration file mapred-site.xml and specify the MapReduce runtime framework.

vi mapred-site.xml

Amend as follows:

<property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value><!-- resource scheduler: use YARN -->
</property>

Execute the command to edit the core configuration file yarn-site.xml of YARN

vi yarn-site.xml

Amend as follows:

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
</property>
<property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>yarncluster</value>
</property>
<!---->
<property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>spark01</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>spark02</value>
</property>
<!---->
<property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>spark01:2181,spark02:2181,spark03:2181</value>
</property>
<property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
</property>
<property>
      <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
      <value>true</value>
</property>
<!---->
<property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
<!---->

Modify the slaves file

Execute the command to edit the file slaves that records the host names of all DataNode nodes and NodeManager nodes in the Hadoop cluster

vi slaves

The content is as follows:

spark01
spark02
spark03

Configure Hadoop environment variables

Run the command to edit the system environment variable file profile and configure the Hadoop system environment variables

vi /etc/profile

Add the following content:

export HADOOP_HOME=/export/servers/hadoop-2.7.4
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

Execute the command to initialize the system environment variables to make the configuration take effect

source /etc/profile

Distribute the Hadoop installation directory and the system environment variable file to the other 2 virtual machines:

# Distribute the Hadoop installation directory to the virtual machines Spark02 and Spark03
$ scp -r /export/servers/hadoop-2.7.4/ root@spark02:/export/servers/
$ scp -r /export/servers/hadoop-2.7.4/ root@spark03:/export/servers/
# Distribute the system environment variable file to the virtual machines Spark02 and Spark03
$ scp /etc/profile root@spark02:/etc/
$ scp /etc/profile root@spark03:/etc/

Execute the command to view the Hadoop version of the current system environment 

hadoop version

Start the Hadoop services (5 steps in total)

1. Start ZooKeeper
Run the following command on the virtual machines Spark01, Spark02, and Spark03 to start the ZooKeeper service on each of them:

zkServer.sh start

2. Start JournalNode
Run the following command on the virtual machines Spark01, Spark02, and Spark03 to start the JournalNode service on each of them:

hadoop-daemon.sh start journalnode

Note: the following commands are executed only on the first startup.

Initialize NameNode (only for initial startup)
Run the "hdfs namenode -format" command on the virtual machine Spark01, the master node of the Hadoop cluster, to initialize the NameNode.

Initialize ZooKeeper (only for initial startup)
On the NameNode master node virtual machine Spark01, execute the "hdfs zkfc -formatZK" command to initialize the HA state in ZooKeeper.

NameNode synchronization (only for initial startup)
After the initialization command has run on the master NameNode (virtual machine Spark01), copy the contents of the metadata directory to the unformatted standby NameNode (virtual machine Spark02) so that the NameNode data on the master and standby nodes is consistent:

scp -r /export/data/hadoop/namenode/ root@spark02:/export/data/hadoop/

3. Start HDFS
On the virtual machine Spark01, execute the one-click startup script to start HDFS for the Hadoop cluster. This starts the NameNode and ZKFC processes on the virtual machines Spark01 and Spark02, and the DataNode processes on the virtual machines Spark01, Spark02, and Spark03. Run the following command:

start-dfs.sh

4. Start YARN
On the virtual machine Spark01, execute the one-click startup script to start YARN for the Hadoop cluster, as follows:

start-yarn.sh

At this point, the ResourceManager on the virtual machine Spark01 and the NodeManagers on the virtual machines Spark01, Spark02, and Spark03 will be started.

5. Start ResourceManager

The standby ResourceManager on the virtual machine Spark02 must be started separately. Run the following command on Spark02:

yarn-daemon.sh start resourcemanager

Run jps to check whether the processes of the Hadoop high-availability cluster have started successfully.
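
For reference (not from the original notes), these are the Java process names typically expected on the master node Spark01 after a full start; the PIDs will differ:

jps
# typical processes on Spark01 (PIDs omitted):
#   NameNode
#   DataNode
#   JournalNode
#   DFSZKFailoverController
#   QuorumPeerMain   (ZooKeeper)
#   ResourceManager
#   NodeManager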

On all nodes, start zk:

zkServer.sh start

On the master node, start dfs:

start-dfs.sh

On the master node, start yarn:

start-yarn.sh

On the standby node (Spark02), start the ResourceManager:

yarn-daemon.sh start resourcemanager

Stop all services:

stop-all.sh

Start all services:

start-all.sh

View the web UIs

# View in a browser
ip:50070  # HDFS NameNode web UI
ip:8088   # YARN ResourceManager web UI

To access the UIs by hostname, open the following file on the Windows host

C:\Windows\System32\drivers\etc\hosts

Add the following configuration

192.168.8.134 hadoop1
192.168.8.135 hadoop2
192.168.8.136 hadoop3

Operations in HDFS:

Delete a folder:

hadoop fs -rm -r /folder_name

Create a folder:

hadoop fs -mkdir /folder_name

List the root directory:

hadoop fs -ls /

Modify HDFS permissions:

hadoop fs -chmod 777 /input
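
As an illustrative check (not in the original notes), upload a local file into the /input directory used above and read it back, assuming /input exists (see the mkdir command above):

hadoop fs -put /etc/hosts /input/  # upload a local file to HDFS
hadoop fs -cat /input/hosts        # print its contents from HDFS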

To leave safe mode manually:

# If the environment variables are configured, use the following command
hadoop dfsadmin -safemode leave

Exception handling:

If the following error appears in the web UI:

Failed to retrieve data from /webhdfs/v1/data/clickLog/2022_04_24?op=LISTSTATUS:

Switch to the Google Chrome browser.
