Big data platform: Hadoop environment configuration

1. Virtual machine configuration

Versions: CentOS 7 installed in VMware Workstation 15; image: CentOS-7-x86_64-Minimal-1908; Hadoop 2.10.0; JDK 1.8.0_241

1.1 Install CentOS system

It is not recommended to install the latest version of CentOS.

Note: This article uses VMware. If you configure with VirtualBox instead, the virtual machine network configuration differs slightly; everything else is the same.

1.2 Introduction to network connection

1. Bridged mode: The virtual machine and the physical machine are connected to the same network as peers with equal status. Whether a system is virtual or real, as long as both are on the same network segment, they can ping each other.

2. NAT mode: The physical machine acts as a "router". A virtual machine that wants to reach the Internet must go through the physical machine; if the physical machine has no Internet access, neither does the virtual machine. As long as the physical machine can access the Internet, no other configuration is required. The virtual machine can communicate bidirectionally only with the host it runs on, and cannot reach other real hosts on the same network segment.

3. Host-only mode: Equivalent to connecting the physical machine and the virtual machine directly with a network cable. All virtual systems can communicate with each other, but they are isolated from the real network.

1.3 Virtual machine network configuration

1. Open the Virtual Network Editor in VMware, select VMnet8 (NAT) mode, click NAT Settings, and configure the gateway address (the default value is fine).
2. Open Control Panel, go to the Network and Sharing Center, click Change adapter settings, open the properties of VMnet8, and configure the TCP/IPv4 properties.
3. Keep the gateway address consistent with VMware, and do not assign an IP address that is already in use.
4. Log in to the CentOS system, enter /etc/sysconfig/network-scripts, and edit the NAT NIC configuration file.

vi ifcfg-ens33 

5. Configure the following variables

BOOTPROTO=static         # disable DHCP (optional; does not affect the result)
ONBOOT=yes               # bring the interface up at boot
IPADDR=192.168.223.100   # IP address of this virtual machine
GATEWAY=192.168.223.2    # gateway address
DNS1=192.168.223.2       # DNS; adjust for your own network
NETMASK=255.255.255.0    # subnet mask

1.4 Restart the network service so the configuration takes effect

service network restart  

or

systemctl restart network

1.5 Network verification

View the IP address under CentOS:

ip a

View the IP address under Windows (run cmd):

ipconfig

If all of the following checks succeed, the network is configured correctly:

  • Virtual machine pings the gateway
  • Virtual machine pings the physical host
  • Physical host pings the virtual machine
  • Virtual machine pings Baidu (if external addresses cannot be reached, try a different DNS1 value in the network configuration file from the previous step)
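
As a concrete sketch of these checks from the virtual machine side (the host address 192.168.223.1 is an assumption; substitute whatever ipconfig reports for the VMnet8 adapter):

ping -c 4 192.168.223.2    # gateway
ping -c 4 192.168.223.1    # physical host (assumed VMnet8 address on the Windows side)
ping -c 4 www.baidu.com    # external network; also exercises DNS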

1.6 Turn off the CentOS firewall

In a distributed cluster, communication between nodes would be hindered by the firewall, so we turn it off (security is handled by dedicated software at the network perimeter, under unified management):

systemctl stop firewalld.service 

Prevent the firewall from starting at boot:

systemctl disable firewalld.service 
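
To confirm the firewall is really off, you can query its state (standard systemd/firewalld commands):

systemctl status firewalld.service
firewall-cmd --state    # prints "not running" once firewalld is stopped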

If you run into network problems, check the network configuration file for mistakes:

vi /etc/sysconfig/network-scripts/ifcfg-ens33

1.7 Install XShell, XFtp

XShell and XFtp are paid software; a free educational version can be downloaded from the English official website (the Chinese official website does not offer it).

After installation, create a new connection, enter the IP address of the virtual machine master to connect to, then enter the user name and password.

After logging in successfully, use XShell instead of the virtual machine console for shell operations.

2. Basic environment configuration

2.1 Switch to the root user

su root

2.2 Configure clock synchronization

Install ntpdate online, synchronize the time against the Alibaba Cloud NTP server, and check the current time with the date command:

yum install ntpdate 
ntpdate ntp.aliyun.com
date 

If you cannot reach the external network, install ntpdate offline. ntpdate can be downloaded from: https://pkgs.org/

rpm -ivh <path-to-the-ntpdate-rpm>
ntpdate ntp.aliyun.com
date

2.3 Configure the host name

A hostname uniquely identifies a host on the network. Like an IP address, it can be used to reach the host; its purpose is simply to be easier to remember and use.

Modify the host name:

hostnamectl set-hostname node1

View the modified host name:

hostname

2.4 Configure the hosts list

The hosts list lets every server in the cluster know the hostnames and IP addresses of the others:

vi /etc/hosts

Add each host's IP address and hostname:
192.168.223.100 node1
192.168.223.101 node2
192.168.223.102 node3

To verify, ping both the IP address and the hostname; the results should be identical:

ping 192.168.223.100
ping node1

2.5 Install Java environment

Create a personal data directory, a Java directory, and a Hadoop directory:

mkdir /usr/wallasunrui
mkdir /usr/java
mkdir /usr/hadoop

Use XFtp to copy the Java installation package into the wallasunrui directory, extract it, and move it into the Java directory:

tar -zxvf jdk-8u241-linux-x64.tar.gz
mv jdk1.8.0_241 /usr/java/jdk1.8.0_241

Edit the system configuration file:

vi /etc/profile

Add the following two lines at the end of the file

export JAVA_HOME=/usr/java/jdk1.8.0_241
export PATH=$JAVA_HOME/bin:$PATH

Make the configuration effective:

source /etc/profile
java -version

2.6 Install Hadoop environment

Use XFtp to upload the Hadoop installation package to the wallasunrui folder, extract it, and move it to the new folder:

tar -zxvf hadoop-2.10.0.tar.gz 
mv hadoop-2.10.0  /usr/hadoop/hadoop-2.10.0

Configure Hadoop environment variables

vi /etc/profile

Add the following two lines at the end of the configuration file

export HADOOP_HOME=/usr/hadoop/hadoop-2.10.0
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Make the hadoop configuration take effect and verify it:

source /etc/profile
hadoop version
whereis hdfs

If no errors are reported, hadoop has been successfully added to the CentOS system environment.

2.7 Bind hadoop to the java environment (a step many students forget)

Edit the hadoop configuration file:

cd /usr/hadoop/hadoop-2.10.0/etc/hadoop
vi hadoop-env.sh

Find the following line of code:

export JAVA_HOME=${JAVA_HOME}

Modify this line of code to

export JAVA_HOME=/usr/java/jdk1.8.0_241

3. Configure Hadoop

3.1 Hadoop core file configuration

Enter hadoop's etc/hadoop folder, configure the core-site.xml file, and add the content shown below.
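
For example, opening the file with the paths used in section 2.7 (adjust if your paths differ):

cd /usr/hadoop/hadoop-2.10.0/etc/hadoop
vi core-site.xml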

<configuration>
    <!-- Entry address of the file system; may be a hostname or an IP -->
    <!-- The default port is 8020 -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://node1:8020</value>
    </property>
    <!-- Hadoop's temporary working directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/hadoop/hadoop-2.10.0/hadoop-data</value>
    </property>
</configuration>

Configure the yarn-env.sh file and find this line:

# export JAVA_HOME=/home/y/libexec/jdk1.6.0/

Amend it to the following, removing the leading # so the line is no longer commented out:

export JAVA_HOME=/usr/java/jdk1.8.0_241

Configure the hdfs-site.xml file and add the following content:

<configuration>
    <!-- Number of HDFS replicas; must be less than or equal to the number of slave nodes -->
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>

    <!-- Optional: custom NameNode storage location -->
    <!-- <property> -->
    <!--     <name>dfs.namenode.name.dir</name> -->
    <!--     <value>file:/usr/hadoop/dfs/name</value> -->
    <!-- </property> -->
    <!-- Optional: custom DataNode storage location -->
    <!-- <property> -->
    <!--     <name>dfs.datanode.data.dir</name> -->
    <!--     <value>file:/usr/hadoop/dfs/data</value> -->
    <!-- </property> -->
</configuration>

Generate mapred-site.xml from the template file (which carries a .template suffix) using the cp command:

cp mapred-site.xml.template mapred-site.xml

Edit the mapred-site.xml file and add the following content:

<configuration>
    <!-- Run hadoop's MapReduce programs on YARN -->
    <!-- The default value is local -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Configure the yarn-site.xml file

<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>node1</value>
    </property>
    <!-- The NodeManager fetches data via shuffle -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

Modify the slaves file: delete the original content and add the hostnames of the DataNode nodes. The number is up to you; here I use three.
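
The slaves file lives in the same etc/hadoop directory:

vi slaves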

node1
node2
node3

3.2 Clone multiple slave machines

Clone node1 to create the slave virtual machines (choose "create a full clone"), here named node2 and node3. The number of slaves can be chosen freely according to your computer's resources. The following steps repeat the process described above.

  • Modify the hostname of each slave, using the same method as 2.3:
    hostnamectl set-hostname node2

  • Enter /etc/sysconfig/network-scripts and modify each slave's IP address, using the same method as 1.3 (see the sketch below)
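
For instance, on node2 only the IPADDR line in ifcfg-ens33 needs to change, matching the hosts list from 2.4 (a sketch):

vi /etc/sysconfig/network-scripts/ifcfg-ens33
IPADDR=192.168.223.101   # node2; use 192.168.223.102 for node3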

3.3 Optional: synchronize the hadoop configuration (only needed when the master and slave configurations differ)

Send the hadoop folder on the master machine to each slave machine:

scp -r /usr/hadoop node2:/usr/
scp -r /usr/hadoop node3:/usr/

4. Configure SSH login

4.1 Generate public key and private key

On the master and on each slave, generate a public/private key pair with the RSA algorithm (press Enter at each prompt to accept the defaults):

ssh-keygen -t rsa

View the generated private key id_rsa and public key id_rsa.pub

cd /root/.ssh/
ls

4.2 Send public key

On the master, create a shared public key file authorized_keys, adjust its permissions, and send it to each slave:

cat id_rsa.pub >> authorized_keys
chmod 644 authorized_keys
systemctl restart sshd.service
scp /root/.ssh/authorized_keys node2:/root/.ssh
scp /root/.ssh/authorized_keys node3:/root/.ssh
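
Note that scp overwrites any existing authorized_keys on the slaves. An equivalent standard tool is ssh-copy-id, which appends the key instead of overwriting (an optional alternative, not part of the original steps):

ssh-copy-id root@node2
ssh-copy-id root@node3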

4.3 Introduction to Linux file permissions

This is background knowledge, not a hadoop configuration step

  • Digits 1-3 represent the file owner's permissions

  • Digits 4-6 represent the permissions of users in the same group

  • Digits 7-9 represent other users' permissions

  • Read permission equals 4, represented by r

  • Write permission equals 2, represented by w

  • Execute permission equals 1, represented by x

    444 r--r--r--
    600 rw-------
    644 rw-r--r--
    666 rw-rw-rw-
    700 rwx------
    744 rwxr--r--
    755 rwxr-xr-x
    777 rwxrwxrwx
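
As a worked example, the 644 used for authorized_keys above decomposes digit by digit into 6 = 4+2 = rw- for the owner, 4 = r-- for the group, and 4 = r-- for others:

chmod 644 authorized_keys   # owner rw- (4+2), group r-- (4), others r-- (4)
ls -l authorized_keys       # shows -rw-r--r--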

4.4 Verify SSH

SSH login verification: after logging in, the path in the prompt changes (for example from '~/.ssh' to '~') without a password being requested; log out with exit.

ssh node1
ssh node2
exit
ssh node3
exit

Note: this only achieves one-way passwordless login from the master to the slaves!

5. Ready to run hadoop

5.1 Format HDFS

On the master machine, enter the bin folder under hadoop and run the following code:

hdfs namenode -format

Note: format only once! Formatting multiple times will cause the cluster IDs held by the NameNode and DataNodes to mismatch; in that case, delete the NameNode, DataNode, and log folders on every node before formatting again.
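
With the hadoop.tmp.dir configured in 3.1, cleaning up before a re-format might look like this (a sketch; run it on every node, and only when a re-format is genuinely needed):

rm -rf /usr/hadoop/hadoop-2.10.0/hadoop-data   # NameNode/DataNode data (hadoop.tmp.dir)
rm -rf /usr/hadoop/hadoop-2.10.0/logs          # log folder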

5.2 Start hadoop

Enter the sbin folder in hadoop and run:

start-dfs.sh
start-yarn.sh
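
To shut the cluster down later, the matching stop scripts in the same sbin folder can be used:

stop-yarn.sh
stop-dfs.sh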

5.3 View hadoop processes

jps
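
If startup succeeded, jps on node1 (which here acts as master and also as a DataNode, since node1 appears in the slaves file) typically lists processes like the following, each preceded by a process ID:

NameNode
SecondaryNameNode
DataNode
ResourceManager
NodeManager
Jps

On node2 and node3, only DataNode, NodeManager, and Jps should appear.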

5.4 Accessing hadoop through the web

View NameNode, DataNode:
http://192.168.223.100:50070
View SecondaryNameNode information:
http://192.168.223.100:50090
View YARN interface:
http://192.168.223.100:8088
