Deploy a Hadoop 3.0 high-performance cluster: pseudo-distributed deployment

Table of Contents

Introduction:

1. Environment setup

(1) Prepare for the experiment: install a CentOS 7 virtual machine on VMware. The virtual machine's IP address and hostname are as follows (this step can be skipped once the machine is installed).

(2) Configure the hosts file on the machine to map the hostname to the IP address (in a real production system this is configured on the DNS server).

(3) Create the hadoop user account and the Hadoop directories.

(4) Install the Java environment (JDK)

(5) Turn off the firewall and prevent it from starting at boot; re-enable it later with systemctl enable if needed.

(6) Configure passwordless SSH access (required even on a single machine)

2. Hadoop installation and configuration

(1) Upload hadoop-3.0.0.tar.gz to the /home/hadoop/ directory on the server

(2) Create the Hadoop working directories

(3) Configure Hadoop: 7 configuration files need to be modified.

1) Configuration file hadoop-env.sh: specify Hadoop's Java runtime environment

2) Configuration file yarn-env.sh: the runtime environment of the YARN framework (view only, no modification required)

3) Configuration file core-site.xml: specify the default HDFS filesystem address

4) Configuration file hdfs-site.xml

5) Configuration file mapred-site.xml

6) Configuration file yarn-site.xml

7) Edit the DataNode hosts: modify the workers file

3. Start Hadoop on hadoop163

1) Format the Hadoop NameNode (only needed the first time; it is unnecessary afterwards)

2) Start HDFS with ./sbin/start-dfs.sh, i.e. start the HDFS distributed storage

3) Start YARN with ./sbin/start-yarn.sh, i.e. start the distributed computing framework

4) Start the DataNode storage service and the NodeManager resource-management service on the node individually

5) Start the jobhistory service to view the status of completed MapReduce jobs

4. View on the web


 

 Introduction:

Hadoop 3.0 supports distributed cluster deployment, but experimental conditions sometimes limit us to a single server, on which a pseudo-distributed deployment can be used instead. In this experiment, a pseudo-distributed deployment is performed on one machine; a true distributed cluster deployment can later be built on top of it.

 

1. Environment setup

(1) Prepare for the experiment: install a CentOS 7 virtual machine on VMware. The virtual machine's IP address and hostname are as follows (this step can be skipped once the machine is installed).

Host name         IP address         Role

hadoop163.cn      192.168.150.163    NameNode / DataNode

 

(2) Configure the hosts file on the machine to map the hostname to the IP address, as follows (in a real production system this is configured on the DNS server):

Check your IP address: ifconfig

# vim  /etc/hosts
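Assuming the IP address and hostname listed above, the entry added to /etc/hosts would look like:

192.168.150.163    hadoop163.cn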

 

Test that the hostname resolves; you should be able to ping it:
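For example (the -c flag just limits ping to three packets):

[root@hadoop163 ~]# ping -c 3 hadoop163.cn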

 

(3) Create the hadoop user account and the Hadoop directories.

Create the hadoop account:

# To keep the hadoop user ID consistent with the ones created on other servers, use a relatively large UID when creating the user
[root@hadoop163 ~]# useradd -u 8000  hadoop

# Set the user's password
[root@hadoop163 ~]# echo 123456 | passwd --stdin hadoop

Note: When creating the hadoop user, do not use the option -s /sbin/nologin, because later we need to switch to this user with su - hadoop.
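To confirm the account was created with the intended UID, check it with id; it should report uid=8000(hadoop):

[root@hadoop163 ~]# id hadoop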

 

 

(4) Install the Java environment (JDK)

Use FileZilla to upload the JDK package to the root user's home directory and the Hadoop package to the hadoop user's home directory, then verify the uploads.

(Separate articles cover downloading FileZilla, using FileZilla, troubleshooting connections to the virtual machine, and downloading the two packages.)

After the upload succeeds, go to the root home directory and check the JDK file:

Then go to the hadoop home directory and check the archive that will be used later:

Install the JDK:

 rpm -ivh jdk-8u161-linux-x64.rpm

 

Check the installation location: you can find the JDK installation directory (/usr/java) by listing the package contents. Note this location; it is used later.

rpm -qpl  /root/jdk-8u161-linux-x64.rpm

(The full file list is longer than what the screenshot shows.)

Check the JDK version: if the reported version is not the one just installed, configure the Java environment variables so that the newly installed version is used:

java -version

 

 

Configure the Java environment variables:

vim /etc/profile

# Add the following at the end of the file:
#----------------------------------------------------------------------------------------

export JAVA_HOME=/usr/java/jdk1.8.0_161

export JAVA_BIN=/usr/java/jdk1.8.0_161/bin

export PATH=${JAVA_HOME}/bin:$PATH

export CLASSPATH=.:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar

#----------------------------------------------------------------------------------------

# Make the configuration take effect
source /etc/profile

 

 

Check the Java version again to verify that the installation succeeded:

If the newly installed version is reported, the Java runtime environment has been installed successfully.

Note: Only the JDK version is upgraded here, because a JDK was already installed on this system.
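Assuming the environment variables above took effect, java -version should now report the installed release, roughly:

java version "1.8.0_161"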

 

(5) Turn off the firewall and prevent it from starting at boot. If you need it later, re-enable it with systemctl enable firewalld.service:

[root@hadoop163 ~]# systemctl stop firewalld.service

[root@hadoop163 ~]# systemctl disable firewalld.service

(6) Configure passwordless SSH access (required even on a single machine)

   First, switch to the hadoop account:

# Generate the key pair
[hadoop@hadoop163 ~]$ ssh-keygen

# Distribute the public key
[hadoop@hadoop163 ~]$ ssh-copy-id  192.168.150.163

Generate the key pair (keep pressing Enter until it finishes):

Distribute the public key:

ssh-copy-id is followed by this machine's own IP address (check it with ifconfig). Answer yes when asked whether to continue connecting, and when prompted for a password, enter the one set when the hadoop user was created.
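To confirm that passwordless login works, an SSH connection to the same address should no longer prompt for a password, for example:

[hadoop@hadoop163 ~]$ ssh 192.168.150.163

(Type exit to leave the session.)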

 

2. Hadoop installation and configuration

 

(1) Upload hadoop-3.0.0.tar.gz to the /home/hadoop/ directory on the server

Hadoop installation directory: /home/hadoop/hadoop-3.0.0

Note: The following steps are performed as the hadoop user.

[root@hadoop163 ~]# su - hadoop

# Just extract the archive; no compilation or installation is needed
[hadoop@hadoop163 ~]$ tar zxvf hadoop-3.0.0.tar.gz

After extraction, the hadoop-3.0.0 directory appears:

(2) Create the Hadoop working directories

Create three working directories under the hadoop user's home directory: dfs/name, dfs/data, and tmp.

[hadoop@hadoop163 ~]$ mkdir -p  /home/hadoop/dfs/name  

[hadoop@hadoop163 ~]$ mkdir -p  /home/hadoop/dfs/data

[hadoop@hadoop163 ~]$ mkdir -p  /home/hadoop/tmp

[hadoop@hadoop163 ~]$ ls
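After these steps, the hadoop home directory should contain roughly the following entries (based on the files created above):

dfs  hadoop-3.0.0  hadoop-3.0.0.tar.gz  tmp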

 

 

(3) Configure Hadoop: 7 configuration files need to be modified.

File location: /home/hadoop/hadoop-3.0.0/etc/hadoop/

File names: hadoop-env.sh, yarn-env.sh, workers, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml

 

1) Configuration file hadoop-env.sh: specify Hadoop's Java runtime environment

This file configures the basic environment in which Hadoop runs; the location of the Java virtual machine needs to be set here.

[hadoop@hadoop163 hadoop]$ vim /home/hadoop/hadoop-3.0.0/etc/hadoop/hadoop-env.sh

Turn on line numbering in vim for easier navigation. The line to modify is line 54, which specifies the Java environment variable. See the example below.
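Using the JDK path installed earlier, the line would be set to something like this (the exact line number can vary between releases):

export JAVA_HOME=/usr/java/jdk1.8.0_161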

2) Configuration file yarn-env.sh: the runtime environment of the YARN framework (view only, no modification required)

This file configures the runtime environment of the YARN framework. It can also point to the Java virtual machine, but no change is needed here.

YARN is Hadoop's newer MapReduce resource-management framework, introduced in Hadoop 0.23.0.

[hadoop@hadoop163 hadoop-3.0.0]$ vim /home/hadoop/hadoop-3.0.0/etc/hadoop/yarn-env.sh

The comments in the file describe the priority rules used when picking up these settings:

3) Configuration file core-site.xml: specify the default HDFS filesystem address

This is the core configuration file of Hadoop. Two main properties are configured here: fs.defaultFS names the HDFS filesystem and points it at port 9000 of this host;

hadoop.tmp.dir sets the root of Hadoop's tmp directory. This location does not exist by default, so it was created earlier with mkdir.

[hadoop@hadoop163 hadoop]$ vim /home/hadoop/hadoop-3.0.0/etc/hadoop/core-site.xml

Open the file and add the following between <configuration> and </configuration>:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop163.cn:9000</value>
</property>

<property>
    <name>io.file.buffer.size</name>
    <value>4096</value>
</property>

<property>
    <name>hadoop.tmp.dir</name>
    <value>file:/home/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
</property>


 

Description: io.file.buffer.size defaults to 4096. It is the buffer size used when reading and writing sequence files; a larger buffer reduces the number of I/O operations. For a large Hadoop cluster, 65536 is recommended.

 

4) Configuration file hdfs-site.xml

This is the HDFS configuration file. dfs.http.address sets the HTTP address of the HDFS web interface;

dfs.replication sets the number of replicas per file block, which generally should not exceed the number of DataNodes.

[hadoop@hadoop163 hadoop]$ vim /home/hadoop/hadoop-3.0.0/etc/hadoop/hdfs-site.xml

Insert the following between <configuration> and </configuration>:

<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hadoop163.cn:9001</value>
</property>

<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/hadoop/dfs/name</value>
</property>

<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/hadoop/dfs/data</value>
</property>

<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>

<property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>

<property>
    <name>dfs.http.address</name>
    <value>hadoop163.cn:50070</value>
</property>

 

5) Configuration file mapred-site.xml

This file configures MapReduce jobs. Since Hadoop 2.x uses the YARN framework, mapreduce.framework.name must be set to yarn for distributed execution. The file also specifies the addresses of the Hadoop history server (historyserver).

Hadoop ships with a history server that can be used to view the records of completed MapReduce jobs, such as how many map and reduce tasks were run, the job submission time, start time, and completion time. By default, the Hadoop history server is not started.

[hadoop@hadoop163 hadoop-3.0.0]$ vim /home/hadoop/hadoop-3.0.0/etc/hadoop/mapred-site.xml

Insert the following between <configuration> and </configuration>:

<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>

<property>
    <name>mapreduce.jobhistory.address</name>
    <value>0.0.0.0:10020</value>
</property>

<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>0.0.0.0:19888</value>
</property>

6) Configuration file yarn-site.xml

This file configures the YARN framework, mainly the addresses on which its services listen.

[hadoop@hadoop163 hadoop]$ vim /home/hadoop/hadoop-3.0.0/etc/hadoop/yarn-site.xml

Insert the following between <configuration> and </configuration>:

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>

<property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<property>
    <name>yarn.resourcemanager.address</name>
    <value>hadoop163.cn:8032</value>
</property>

<property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>hadoop163.cn:8030</value>
</property>

<property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>hadoop163.cn:8031</value>
</property>

<property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>hadoop163.cn:8033</value>
</property>

<property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>hadoop163.cn:8088</value>
</property>

<property>
    <name>yarn.application.classpath</name>
    <value>/home/hadoop/hadoop-3.0.0/etc/hadoop:/home/hadoop/hadoop-3.0.0/share/hadoop/common/lib/*:/home/hadoop/hadoop-3.0.0/share/hadoop/common/*:/home/hadoop/hadoop-3.0.0/share/hadoop/hdfs:/home/hadoop/hadoop-3.0.0/share/hadoop/hdfs/lib/*:/home/hadoop/hadoop-3.0.0/share/hadoop/hdfs/*:/home/hadoop/hadoop-3.0.0/share/hadoop/mapreduce/*:/home/hadoop/hadoop-3.0.0/share/hadoop/yarn:/home/hadoop/hadoop-3.0.0/share/hadoop/yarn/lib/*:/home/hadoop/hadoop-3.0.0/share/hadoop/yarn/*</value>
</property>

7) Edit the DataNode hosts: modify the workers file

[hadoop@hadoop163 hadoop]$ vim /home/hadoop/hadoop-3.0.0/etc/hadoop/workers

Because this is a pseudo-distributed deployment, add the machine itself, as shown below.
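With the hostname used throughout this guide, the workers file would contain a single line:

hadoop163.cn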

 

3. Start Hadoop on hadoop163

Operate as the hadoop user:

1) Format the Hadoop NameNode (only needed the first time; it is unnecessary afterwards)

[hadoop@hadoop163 hadoop-3.0.0]$ /home/hadoop/hadoop-3.0.0/bin/hdfs namenode -format

 

 

Check whether files were generated automatically under the name directory (/home/hadoop/dfs/name); if so, the format succeeded:
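For example, a listing along these lines would be expected (exact file names can differ slightly between versions):

[hadoop@hadoop163 ~]$ ls /home/hadoop/dfs/name/current
fsimage_0000000000000000000  fsimage_0000000000000000000.md5  seen_txid  VERSION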

 

2) Start HDFS with ./sbin/start-dfs.sh, i.e. start the HDFS distributed storage

The startup scripts are located in: /home/hadoop/hadoop-3.0.0/sbin/

[hadoop@hadoop163 ~]$ /home/hadoop/hadoop-3.0.0/sbin/start-dfs.sh

3) Start YARN with ./sbin/start-yarn.sh, i.e. start the distributed computing framework

[hadoop@hadoop163 hadoop-3.0.0]$ /home/hadoop/hadoop-3.0.0/sbin/start-yarn.sh

Note: start-dfs.sh and start-yarn.sh can be replaced by start-all.sh.

Start:
[hadoop@hadoop163 ~]$ /home/hadoop/hadoop-3.0.0/sbin/start-all.sh

Stop:

[hadoop@hadoop163 ~]$ /home/hadoop/hadoop-3.0.0/sbin/stop-all.sh

 

If five daemons have started, the startup succeeded.

Check their status with jps:
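On this pseudo-distributed node, the output should look roughly like the following (the process IDs will differ):

[hadoop@hadoop163 ~]$ jps
8976 NameNode
9113 DataNode
9312 SecondaryNameNode
9558 ResourceManager
9687 NodeManager
10021 Jps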

4) Start the DataNode storage service and the NodeManager resource-management service on the node individually.

Use the following commands (after starting with start-all.sh these two services are already running, so there is no need to run them again):

# Start the DataNode storage service
[hadoop@hadoop163 ~]$ /home/hadoop/hadoop-3.0.0/sbin/hadoop-daemon.sh start datanode

# Start the NodeManager resource-management service
[hadoop@hadoop163 ~]$ /home/hadoop/hadoop-3.0.0/sbin/yarn-daemon.sh start nodemanager

5) Start the jobhistory service to view the status of completed MapReduce jobs

[hadoop@hadoop163 hadoop-3.0.0]$ /home/hadoop/hadoop-3.0.0/sbin/mr-jobhistory-daemon.sh start historyserver
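Once the history server is running, its web interface should be reachable at the address set in mapreduce.jobhistory.webapp.address above, i.e. port 19888:

http://hadoop163.cn:19888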

 

4. View on the web

Enter the following in the browser address bar:

http://hadoop163.cn:8088

If the page loads, the deployment has succeeded.
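Similarly, the HDFS web interface configured in dfs.http.address above should be reachable at:

http://hadoop163.cn:50070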

 


Origin blog.csdn.net/qq_41567921/article/details/105437960