[Step-by-step tutorial] Hadoop cluster installation, configuration, and demo test

Table of contents

I. Introduction

II. Virtual machine installation and configuration

1. VMware installation

2. Installing CentOS in the virtual machine

3. Static network configuration

4. Virtual machine cloning

5. Configuring the hosts file and SSH passwordless login

6. Hadoop cluster configuration

7. Hadoop cluster test


I. Introduction

Hadoop is an open-source distributed computing platform that runs on Linux clusters. This tutorial uses VMware virtual machines on Windows 11 to build a small three-node distributed Hadoop environment so that students can learn and experiment with Hadoop clusters.

II. Virtual machine installation and configuration

1. VMware installation

Since VMware 15.x is not compatible with Windows 11, make sure your VMware version is 16.0 or higher; otherwise a Device/Credential Guard error will appear when running the virtual machine. VMware Workstation v16.2.2: https://pan.baidu.com/s/1Odk-CjAiPSryYa6IIi5VaQ?pwd=XALA

2. Installing CentOS in the virtual machine

Download the CentOS image

CentOS-7-x86_64-DVD-2207-02.iso https://mirrors.tuna.tsinghua.edu.cn/centos/7.9.2009/isos/x86_64/CentOS-7-x86_64-DVD-2207-02.iso Make sure your disk has more than 10 GB of free space!

Install CentOS in a virtual machine:

The configuration shown above has been tested and is sufficient to run the Hadoop demo. If your workload needs more resources, allocate more CPU and memory (if available) to the virtual machine.

CentOS installation

After creating the virtual machine, click Finish and the virtual machine will start automatically.

During startup, ignore this prompt:

Enter the virtual machine, use the arrow keys to select Install CentOS 7, and press Enter to install:

Wait a moment.

Choose Chinese and continue.

Wait a moment.

Choose the installation destination.

Click Done.

Choose Network & Host Name.

Turn on the network card and change the hostname.

Keep the above information in mind:

Click Done.

Start the installation.

Set the root password and create a user account.

Set the root password; be sure to remember it.

Create a user account named hadoop01.

This will take a while.

While waiting, you can download Hadoop, the JDK, Xshell, and Xftp in advance.

Tools: https://pan.baidu.com/s/11sqkOS4pidiR6Me1ioQj5g?pwd=b4oe#list/path=%2Fhadoop Install Xshell and Xftp on Windows yourself; if prompted to update, click OK.

By now CentOS is almost installed.

After the installation is complete, click Restart.

Use Xshell to log in to the root account

We use Xshell because it makes it easy to paste commands into the terminal.

Remember the network card connection information from earlier?

Open Xshell and create a new session.

Configure it as follows; for the host, fill in the IP address shown in the earlier picture.

The user name is root and the password is the root password you just set.

After confirming, double-click to connect.

Accept and save.

3. Static network configuration

Xshell can paste text directly into the terminal (right-click or Shift+Insert).

Test the network status

ping baidu.com -c 4

Install net-tools

yum upgrade
yum install net-tools

Enter y when prompted and wait a moment.

View the MAC address

ifconfig

If it reports that the command is not found, net-tools is probably not installed; install it again.

The value after ether is the MAC address; copy and save it.

Install nano

yum install nano

We use the nano editor because it is easier to use than vi; to save in nano, press Ctrl+X, then Y, then Enter.

View the available IP range

Back in VMware, open the Virtual Network Editor.

Select Change Settings.

To view the status of the NAT-mode network card, click DHCP Settings.

Note the starting and ending IP addresses.

Modify the network configuration file

nano /etc/sysconfig/network-scripts/ifcfg-ens33
  • Set BOOTPROTO to static
  • Add HWADDR with the ether value (MAC address) you just copied
  • Add IPADDR with an IP address of your choice; it must lie within the DHCP start/end range and must not be duplicated
  • Add GATEWAY with the gateway value, usually the previously recorded IP address with the last octet changed to 2 (e.g. 192.168.186.2)
  • Add NETMASK as 255.255.255.0
  • Add DNS1 as 8.8.8.8
  • After the changes, the file should look like this:
BOOTPROTO="static"
HWADDR="00:0c:29:7f:c8:38"
IPADDR="192.168.186.129"
GATEWAY="192.168.186.2"
NETMASK="255.255.255.0"
DNS1="8.8.8.8"

Before modification:

After modification:

HWADDR is the MAC address of the network card and IPADDR is the IP you chose; remember to modify only these fields!
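If you prefer not to edit the file by hand, here is a minimal sketch that makes the same changes from the shell (the MAC and IP values are the ones used in this tutorial; substitute your own):

cfg=/etc/sysconfig/network-scripts/ifcfg-ens33

# Switch the interface from DHCP to a static address
sed -i 's/^BOOTPROTO=.*/BOOTPROTO="static"/' "$cfg"

# Append the static settings (example values from this tutorial)
cat >> "$cfg" <<'EOF'
HWADDR="00:0c:29:7f:c8:38"
IPADDR="192.168.186.129"
GATEWAY="192.168.186.2"
NETMASK="255.255.255.0"
DNS1="8.8.8.8"
EOF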

Restart the network service

systemctl restart network
ping www.baidu.com

If Xshell disconnects at this point, check whether there is a configuration error, or whether the assigned IP differs from before.

If the configuration is wrong, fix it directly in the virtual machine window, and go back to Xshell once the network works again.

If the assigned IP is different from before, update the IP address in the Xshell connection!

After restarting the virtual machine, check that the network still works (the IP address has not changed and the network is reachable):

reboot

Reconnect the terminal

ifconfig
ping www.baidu.com -c 4

4. Virtual machine cloning

Shut down the virtual machine first

shutdown now

Right-click the virtual machine and select Clone.

Create a full clone 

 Modify the name to hadoop02

Do the following directly in the virtual machine window

Start hadoop02 and log in to the root account (do not use the numeric keypad).

Modify the hostname

hostnamectl set-hostname hadoop02
reboot  


View the IP and MAC address

ifconfig

With the IP and MAC address known, configure a static network in the virtual machine window following the same method as before.

Be careful: record the IP address assigned to each host; addresses must not be duplicated!

The IPs used in this tutorial are:

  • hadoop01 192.168.186.129
  • hadoop02 192.168.186.130
  • hadoop03 192.168.186.131

Restart the network service and test

systemctl restart network
ping www.baidu.com -c 4 

Clone the host hadoop03 yourself, following the steps above.

You can of course clone more hosts if you need them, but three are enough!

The final result should look like the figure: make sure Xshell can connect to each host and that each host can ping baidu.com.

5. Configuring the hosts file and SSH passwordless login

Power on all virtual machines and use Xshell to connect

Modify the hosts configuration file (all virtual machines)

nano /etc/hosts

Append the following lines:

192.168.186.129 hadoop01
192.168.186.130 hadoop02
192.168.186.131 hadoop03

You can paste them directly through Xshell, which is very convenient.
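If you prefer a single command over editing with nano, a minimal sketch that appends the same entries (run as root on each virtual machine):

cat >> /etc/hosts <<'EOF'
192.168.186.129 hadoop01
192.168.186.130 hadoop02
192.168.186.131 hadoop03
EOF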

Generate the key file (all virtual machines)

ssh-keygen -t rsa

Just keep pressing Enter.
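If you want to skip the prompts entirely, a non-interactive equivalent (a sketch; it assumes ~/.ssh/id_rsa does not exist yet):

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa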

Copy the local public key file to the other virtual machines (all virtual machines)

Run the following commands in turn; after each one, enter yes and then the password of the corresponding virtual machine.

ssh-copy-id hadoop01
ssh-copy-id hadoop02
ssh-copy-id hadoop03

Check whether passwordless login works (all virtual machines)

ssh hadoop02
exit
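A quick way to check all three hosts at once (a sketch; if passwordless login is set up correctly, each command prints the remote hostname without asking for a password):

for h in hadoop01 hadoop02 hadoop03; do
    ssh "$h" hostname
done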

6. Hadoop cluster configuration

Create a new folder export under the root directory of all virtual machines, and create data, servers, and software folders inside the export folder

mkdir -p /export/data
mkdir -p /export/servers
mkdir -p /export/software
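Since passwordless SSH is already configured, you can also create the same directories on hadoop02 and hadoop03 from hadoop01 instead of logging in to each machine (a sketch):

for h in hadoop02 hadoop03; do
    ssh "$h" "mkdir -p /export/data /export/servers /export/software"
done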

Prepare the installation packages (the download links were given earlier)

  • hadoop-2.7.4.tar.gz
  • jdk-8u161-linux-x64.tar.gz

Use Xftp to copy the installation packages into the /export/software directory (all virtual machines)

Select a window, press Ctrl+Alt+F to open Xftp, and establish an FTP connection to the corresponding host.

Find the /export/software directory and drag in the software to be installed (all virtual machines).

Wait for the upload to finish.
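Alternatively, upload the packages once to hadoop01 and push them to the other nodes with scp (a sketch; it assumes the two archives are already in /export/software on hadoop01):

cd /export/software
for h in hadoop02 hadoop03; do
    scp hadoop-2.7.4.tar.gz jdk-8u161-linux-x64.tar.gz "$h":/export/software/
done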

Install the JDK (all virtual machines)

Unzip the archive

cd /export/software
tar -zxvf jdk-8u161-linux-x64.tar.gz -C /export/servers/

Rename the directory

cd /export/servers
mv jdk1.8.0_161 jdk

Configure environment variables

nano /etc/profile

Append at the end of the file:

export JAVA_HOME=/export/servers/jdk
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

Make the configuration file take effect

source /etc/profile

Check if it takes effect

java -version

Install Hadoop 2.7.4 (all virtual machines)

Unzip the installation package

cd /export/software
tar -zxvf hadoop-2.7.4.tar.gz -C /export/servers/

Open the configuration file

nano /etc/profile

Append at the end of the file:

export HADOOP_HOME=/export/servers/hadoop-2.7.4
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Make the configuration file take effect

source /etc/profile

Check whether the configuration is successful

hadoop version

Hadoop cluster configuration (modify the files locally first, then copy them back to the three virtual machines)

We configure hadoop01 as the master node, and hadoop02 and hadoop03 as slave nodes.

Use Xftp to enter the /export/servers/hadoop-2.7.4/etc/hadoop/ directory on hadoop01.

Drag the hadoop folder from the remote host to the local machine.

Open the folder and modify the files directly on Windows (Notepad is recommended).

Modify the hadoop-env.sh file

Change the JAVA_HOME line to:

export JAVA_HOME=/export/servers/jdk

Modify the core-site.xml file

Replace the entire contents with:

<configuration>
    <!-- Set the Hadoop file system, specified by a URI -->
    <property>
        <name>fs.defaultFS</name>
        <!-- Specify that the NameNode runs on hadoop01 -->
        <value>hdfs://hadoop01:9000</value>
    </property>
    <!-- Hadoop temporary directory; the default is /tmp/hadoop-${user.name} -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/export/servers/hadoop-2.7.4/tmp</value>
    </property>
</configuration>

Modify the hdfs-site.xml file

Replace the entire contents with:

<configuration>
    <!-- Set the HDFS replication factor -->
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <!-- Host and port of the secondary NameNode -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop02:50090</value>
    </property>
</configuration>

Modify the mapred-site.xml file

Make a copy of mapred-site.xml.template and rename it to mapred-site.xml.

Replace the contents of mapred-site.xml with:

<configuration>
    <!-- Specify the framework MapReduce runs on; here YARN (the default is local) -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Modify yarn-site.xml

Replace the contents with:

<configuration>
    <!-- Specify the address of the YARN ResourceManager -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop01</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

Modify the slaves file

Replace the contents with:

hadoop01
hadoop02
hadoop03

Copy the modified files back to the three virtual machines, overwriting the originals.
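If you prefer the command line to Xftp, a sketch for pushing the edited configuration from hadoop01 to the other nodes (it assumes the modified hadoop configuration folder has already been uploaded back to /export/servers/hadoop-2.7.4/etc/hadoop/ on hadoop01):

for h in hadoop02 hadoop03; do
    scp /export/servers/hadoop-2.7.4/etc/hadoop/* "$h":/export/servers/hadoop-2.7.4/etc/hadoop/
done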

Execute on hadoop01

scp /etc/profile hadoop02:/etc/profile
scp /etc/profile hadoop03:/etc/profile

Execute on hadoop02 and hadoop03

source /etc/profile

Format HDFS on hadoop01

hdfs namenode -format
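To sanity-check the format (a sketch; with hadoop.tmp.dir set as above, the NameNode metadata should now exist under the Hadoop tmp directory):

ls /export/servers/hadoop-2.7.4/tmp/dfs/name/current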

7. Hadoop cluster test

Start HDFS on the master node hadoop01

start-dfs.sh

Start yarn on the master node hadoop01

start-yarn.sh

Use the jps command to view the processes

jps
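To check all three nodes at once (a sketch; with the configuration in this tutorial, hadoop01 should show NameNode, DataNode, ResourceManager and NodeManager, hadoop02 should show DataNode, SecondaryNameNode and NodeManager, and hadoop03 should show DataNode and NodeManager):

for h in hadoop01 hadoop02 hadoop03; do
    echo "== $h =="
    ssh "$h" /export/servers/jdk/bin/jps
done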

You're done! Now you can use the Hadoop platform for big data analysis.

Run the demo

Run the word-count demo on the master node

mkdir ~/tmp
echo 'In the physical sciences, progress in understanding large complex systems has often come by approximating their constituents with random variables; for example, statistical physics and thermodynamics are based in this paradigm. Since modern neural networks are undeniably large complex systems, it is natural to consider what insights can be gained by approximating their parameters with random variables. Moreover, such random configurations play at least two privileged roles in neural networks: they define the initial loss surface for optimization, and they are closely related to random feature and kernel methods. Therefore it is not surprising that random neural networks have attracted significant attention in the literature over the years' > ~/tmp/word1.txt
echo 'Throughout this work we will be relying on a number of basic concepts from random matrix theory. Here we provide a lightning overview of the essentials, but refer the reader to the more pedagogical literature for background' > ~/tmp/word2.txt
hdfs dfs -mkdir /input
hdfs dfs -put  ~/tmp/word*.txt  /input
hadoop jar /export/servers/hadoop-2.7.4/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar wordcount /input output
hdfs dfs -cat /user/root/output/part-r-00000
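Note that the relative path output in the wordcount command resolves to /user/root/output on HDFS (the job runs as root), which is why the final cat reads from there. A quick sketch to inspect the result:

# List the job output; a _SUCCESS file indicates the job completed
hdfs dfs -ls /user/root/output

# Show the count for one of the words in the sample text
hdfs dfs -cat /user/root/output/part-r-00000 | grep -w random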

Congratulations, you have successfully built the Hadoop environment! Time to start exploring.

Origin blog.csdn.net/weixin_41673825/article/details/131009390