Building Hadoop in local, pseudo-distributed, and fully distributed modes

JDK installation

Set hostname
[root@bigdata111 ~]# vi /etc/hostname
Set machine hosts
[root@bigdata111 ~]# vi /etc/hosts
192.168.1.111 bigdata111
192.168.1.112 bigdata112
192.168.1.113 bigdata113
Create working directories
[root@bigdata111 /]# cd /opt
[root@bigdata111 opt]# ll
total 0
drwxr-xr-x. 2 root root 6 Mar 26 2015 rh
[root@bigdata111 opt]# mkdir module
[root@bigdata111 opt]# mkdir soft
[root@bigdata111 opt]# ls
module  rh  soft
Upload the JDK package

Open WinSCP and upload the JDK archive from Windows to the /opt/soft folder on the Linux machine.

[root@bigdata111 opt]# cd soft
[root@bigdata111 soft]# ls
jdk-8u144-linux-x64.tar.gz 
Extract the JDK

Extract the archive to the /opt/module folder with the following command:

[root@bigdata111 opt]# cd soft
[root@bigdata111 soft]# tar -zxvf jdk-8u144-linux-x64.tar.gz -C /opt/module/
[root@bigdata111 soft]# cd /opt/module
[root@bigdata111 module]# ls
jdk1.8.0_144
Set the JDK environment variables
[root@bigdata111 module]# vi /etc/profile

Add the JDK environment variables at the end of the file, then save and exit:

export JAVA_HOME=/opt/module/jdk1.8.0_144
export PATH=$PATH:$JAVA_HOME/bin
Refresh the environment variables
[root@bigdata111 module]# source /etc/profile
Verify the JDK installation
[root@bigdata111 module]# java -version
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
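
As a further sanity check (a minimal sketch assuming the paths above and no other JDK earlier on the PATH), confirm the shell resolves the new variables:

[root@bigdata111 module]# echo $JAVA_HOME
/opt/module/jdk1.8.0_144
[root@bigdata111 module]# which java
/opt/module/jdk1.8.0_144/bin/java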

Building Hadoop in local mode

Local mode is a standalone, single-machine installation of Hadoop.

Install Hadoop

Upload the Hadoop package

Upload the Hadoop archive to the /opt/soft/ folder with WinSCP:

[root@bigdata111 soft]# ls
hadoop-2.8.4.tar.gz  jdk-8u144-linux-x64.tar.gz
Extract Hadoop

Extract the Hadoop archive to /opt/module/:

[root@bigdata111 soft]# tar -zvxf hadoop-2.8.4.tar.gz -C /opt/module/
[root@bigdata111 soft]# cd /opt/module/
[root@bigdata111 module]# ls
hadoop-2.8.4  jdk1.8.0_144
Set the Hadoop environment variables
[root@bigdata111 module]# vi /etc/profile

Add the following configuration at the end, save and exit:

export HADOOP_HOME=/opt/module/hadoop-2.8.4
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Refresh /etc/profile
[root@bigdata111 module]# source /etc/profile
Verify the Hadoop installation
[root@bigdata111 module]# hadoop
Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
  CLASSNAME            run the class named CLASSNAME
 or
  where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
                       note: please use "yarn jar" to launch
                             YARN applications, not this command.
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  credential           interact with credential providers
  daemonlog            get/set the log level for each daemon
  trace                view and modify Hadoop tracing settings

Most commands print help when invoked w/o parameters.
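
Beyond the usage listing, `hadoop version` prints the exact build (a quick check; the lines after the first vary by build):

[root@bigdata111 module]# hadoop version
Hadoop 2.8.4
...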

Test Hadoop with an example

Create a test file

Create a new file named testdoc in the /opt/module directory and enter the following text:

[root@bigdata111 module]# cd /opt/module
[root@bigdata111 module]# touch testdoc
[root@bigdata111 module]# vi testdoc
[root@bigdata111 module]# cat testdoc
this is a test page!
chinese is the best country
this is a ceshi page!
i love china
listen to the music
and son on
Change to the example jar directory

Change to the directory containing the Hadoop MapReduce example jars:

[root@bigdata111 module]# cd /opt/module/hadoop-2.8.4/share/hadoop/mapreduce/
[root@bigdata111 mapreduce]# ls
hadoop-mapreduce-client-app-2.8.4.jar     hadoop-mapreduce-client-core-2.8.4.jar  hadoop-mapreduce-client-hs-plugins-2.8.4.jar  hadoop-mapreduce-client-jobclient-2.8.4-tests.jar  hadoop-mapreduce-examples-2.8.4.jar  lib           sources
hadoop-mapreduce-client-common-2.8.4.jar  hadoop-mapreduce-client-hs-2.8.4.jar    hadoop-mapreduce-client-jobclient-2.8.4.jar   hadoop-mapreduce-client-shuffle-2.8.4.jar          jdiff                                lib-examples
Run the wordcount example
[root@bigdata111 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.8.4.jar wordcount /opt/module/testdoc /opt/module/out
[root@bigdata111 mapreduce]# ls /opt/module/out
part-r-00000  _SUCCESS
[root@bigdata111 mapreduce]# cat /opt/module/out/part-r-00000
a   2
and 1
best    1
ceshi   1
china   1
chinese 1
country 1
i   1
is  3
listen  1
love    1
music   1
on  1
page!   2
son 1
test    1
the 2
this    2
to  1
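
Note that MapReduce refuses to start if the output directory already exists, so to rerun the example remove /opt/module/out first (same paths as above):

[root@bigdata111 mapreduce]# rm -rf /opt/module/out
[root@bigdata111 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.8.4.jar wordcount /opt/module/testdoc /opt/module/out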

Build pseudo-distributed Hadoop

Pseudo-distributed mode runs the distributed daemons on a single machine.

View the Hadoop executables
[root@bigdata111 mapreduce]# cd /opt/module/hadoop-2.8.4/
[root@bigdata111 hadoop-2.8.4]# ls
bin  etc  include  lib  libexec  LICENSE.txt  NOTICE.txt  README.txt  sbin  share
[root@bigdata111 hadoop-2.8.4]# cd bin
[root@bigdata111 bin]# ls
container-executor  hadoop  hadoop.cmd  hdfs  hdfs.cmd  mapred  mapred.cmd  rcc  test-container-executor  yarn  yarn.cmd
[root@bigdata111 bin]# cd ..
[root@bigdata111 hadoop-2.8.4]# cd sbin
[root@bigdata111 sbin]# ls
distribute-exclude.sh  hadoop-daemons.sh  hdfs-config.sh  kms.sh                   refresh-namenodes.sh  start-all.cmd  start-balancer.sh  start-dfs.sh         start-yarn.cmd  stop-all.cmd  stop-balancer.sh  stop-dfs.sh         stop-yarn.cmd  yarn-daemon.sh
hadoop-daemon.sh       hdfs-config.cmd    httpfs.sh       mr-jobhistory-daemon.sh  slaves.sh             start-all.sh   start-dfs.cmd      start-secure-dns.sh  start-yarn.sh   stop-all.sh   stop-dfs.cmd      stop-secure-dns.sh  stop-yarn.sh   yarn-daemons.sh
Change to the configuration directory

Change into the Hadoop configuration directory /opt/module/hadoop-2.8.4/etc/hadoop/:

[root@bigdata111 hadoop]# cd /opt/module/hadoop-2.8.4/etc/hadoop/
[root@bigdata111 hadoop]# ls
capacity-scheduler.xml  core-site.xml   hadoop-metrics2.properties  hdfs-site.xml            httpfs-signature.secret  kms-env.sh            log4j.properties  mapred-queues.xml.template  ssl-client.xml.example  yarn-env.sh
configuration.xsl       hadoop-env.cmd  hadoop-metrics.properties   httpfs-env.sh            httpfs-site.xml          kms-log4j.properties  mapred-env.cmd    mapred-site.xml.template    ssl-server.xml.example  yarn-site.xml
container-executor.cfg  hadoop-env.sh   hadoop-policy.xml           httpfs-log4j.properties  kms-acls.xml             kms-site.xml          mapred-env.sh     slaves                      yarn-env.cmd
Configure core-site.xml
[root@bigdata111 hadoop]# vi core-site.xml
<configuration>

<!-- Address of the NameNode in HDFS -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://bigdata111:9000</value>
</property>

<!-- Storage directory for files Hadoop generates at runtime -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/module/hadoop-2.8.4/data/tmp</value>
</property>

</configuration>
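
To confirm Hadoop actually picks the value up, `hdfs getconf` can read a single key back (a quick sketch; run after saving the file):

[root@bigdata111 hadoop]# hdfs getconf -confKey fs.defaultFS
hdfs://bigdata111:9000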
Configure hdfs-site.xml
[root@bigdata111 hadoop]# vi hdfs-site.xml
<configuration>

<!-- Replication factor -->
<property>
<name>dfs.replication</name>
<value>1</value>
</property>

</configuration>
Configure yarn-site.xml
[root@bigdata111 hadoop]# vi yarn-site.xml 
<configuration>

<!-- Site specific YARN configuration properties -->

<!-- How reducers fetch data -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

<!-- Hostname of the YARN ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>bigdata111</value>
</property>

<!-- Enable log aggregation -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>

<!-- Log retention time: 7 days, in seconds -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>


</configuration>
Configure mapred-site.xml

Rename mapred-site.xml.template to mapred-site.xml, then edit the configuration:

[root@bigdata111 hadoop]# mv mapred-site.xml.template mapred-site.xml
[root@bigdata111 hadoop]# ls
capacity-scheduler.xml  core-site.xml   hadoop-metrics2.properties  hdfs-site.xml            httpfs-signature.secret  kms-env.sh            log4j.properties  mapred-queues.xml.template  ssl-client.xml.example  yarn-env.sh
configuration.xsl       hadoop-env.cmd  hadoop-metrics.properties   httpfs-env.sh            httpfs-site.xml          kms-log4j.properties  mapred-env.cmd    mapred-site.xml             ssl-server.xml.example  yarn-site.xml
container-executor.cfg  hadoop-env.sh   hadoop-policy.xml           httpfs-log4j.properties  kms-acls.xml             kms-site.xml          mapred-env.sh     slaves                      yarn-env.cmd
[root@bigdata111 hadoop]# vi mapred-site.xml
<configuration>

<!-- Run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

<!-- JobHistory server address -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>bigdata111:10020</value>
</property>

<!-- JobHistory server web UI address -->
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>bigdata111:19888</value>
</property>


</configuration>
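
Note that start-all.sh does not launch the JobHistory server, so the two jobhistory addresses above only take effect once you start that daemon yourself. A sketch using the mr-jobhistory-daemon.sh script shipped in sbin (run it after HDFS and YARN are up, per the steps below):

[root@bigdata111 hadoop]# mr-jobhistory-daemon.sh start historyserver
[root@bigdata111 hadoop]# jps | grep JobHistoryServer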
Configure hadoop-env.sh

Change JAVA_HOME to the absolute path of the JDK, then save and exit:

[root@bigdata111 hadoop]# vi hadoop-env.sh
export JAVA_HOME=/opt/module/jdk1.8.0_144
Format the NameNode

With the configuration done, format the NameNode (needed only before the first start):

[root@bigdata111 hadoop]# hadoop namenode -format

Why format?

The NameNode manages the metadata for the entire distributed file system namespace (in practice, its directories and files), and to ensure data reliability it also maintains an operation log; the NameNode persists both to the local file system. Before HDFS is used for the first time, the -format command must be run before the NameNode service can start normally.

What does formatting do?

The NameNode has two critical paths, one storing the metadata and one storing the operation log. Both come from the configuration file, through the properties dfs.name.dir and dfs.name.edits.dir, and both default to /tmp/hadoop/dfs/name. When formatting, the NameNode clears all files in these two directories and then creates fresh files under dfs.name.dir. Setting hadoop.tmp.dir as in core-site.xml above places the directories generated for dfs.name.dir and dfs.name.edits.dir under that path instead.
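
After a successful format you should find an initial metadata image under the name directory inside hadoop.tmp.dir (a sketch; the transaction-ID suffixes in the file names will differ on your machine):

[root@bigdata111 hadoop]# ls /opt/module/hadoop-2.8.4/data/tmp/dfs/name/current
fsimage_0000000000000000000  fsimage_0000000000000000000.md5  seen_txid  VERSION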

Start the HDFS and YARN services

When the NameNode and ResourceManager are on the same machine, use the following command:

[root@bigdata111 hadoop]# start-all.sh

When they are on different machines, start the two services separately:

[root@bigdata111 hadoop]# start-dfs.sh
[root@bigdata111 hadoop]# start-yarn.sh
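
Either way, jps should then show all five daemons (an illustrative sketch; PIDs will differ):

[root@bigdata111 hadoop]# jps
2377 NameNode
2512 DataNode
2650 SecondaryNameNode
2758 NodeManager
2894 ResourceManager
3001 Jps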
Access the HDFS web UI

Default Port: 50070

http://192.168.1.111:50070
Access the YARN web UI

Default Port: 8088

http://192.168.1.111:8088
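
If a page does not load in the browser, a quick check from the server itself helps rule out firewall problems (a sketch; a 200, or a redirect code for the YARN UI, means the daemon is serving):

[root@bigdata111 hadoop]# curl -s -o /dev/null -w "%{http_code}\n" http://bigdata111:50070
[root@bigdata111 hadoop]# curl -s -o /dev/null -w "%{http_code}\n" http://bigdata111:8088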

Set up a Hadoop cluster

Using VMware's clone feature, clone two additional machines from machine 111 as the template.

Modify the hostname and IP

Modify the hostname and IP address of the two cloned machines so that Xshell can connect:

[root@bigdata112 ~]# vi /etc/hostname
[root@bigdata112 ~]# vi /etc/sysconfig/network-scripts/ifcfg-eno16777736
TYPE=Ethernet
BOOTPROTO=static
DEFROUTE=yes
PEERDNS=yes
PEERROUTES=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_PEERDNS=yes
IPV6_PEERROUTES=yes
IPV6_FAILURE_FATAL=no
NAME=eno16777736
UUID=24bbe130-f59a-4b25-9df6-cf5857c89699
DEVICE=eno16777736
ONBOOT=yes
IPADDR=192.168.1.112
GATEWAY=192.168.1.2
DNS1=8.8.8.8
[root@bigdata112 ~]# service network restart
[root@bigdata112 ~]# ip addr
Delete the data directory

Delete the data directory under /opt/module/hadoop-2.8.4 left over from the pseudo-distributed run, to prepare for the distributed cluster configuration:

[root@bigdata111 hadoop-2.8.4]# cd /opt/module/hadoop-2.8.4/
[root@bigdata111 hadoop-2.8.4]# rm -rf data/
Configure hosts

Configure the hostname-to-IP mapping in /etc/hosts:

[root@bigdata111 hadoop-2.8.4]# vi /etc/hosts
192.168.1.111 bigdata111
192.168.1.112 bigdata112
192.168.1.113 bigdata113
Send the hosts file to the other machines with scp

Send the hosts file configured on the first machine to the other machines:

[root@bigdata111 hadoop-2.8.4]# scp /etc/hosts root@bigdata112:/etc/
[root@bigdata111 hadoop-2.8.4]# scp /etc/hosts root@bigdata113:/etc/
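
A quick way to confirm the mapping works (hostnames as configured above):

[root@bigdata111 hadoop-2.8.4]# ping -c 1 bigdata112
[root@bigdata111 hadoop-2.8.4]# ping -c 1 bigdata113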
Configure passwordless SSH login
  1. Using Xshell's "send keystrokes to all sessions" feature, generate a key pair on each of the three machines:
[root@bigdata111 hadoop-2.8.4]# ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
cc:47:37:5a:93:0f:77:38:53:af:a3:57:47:55:27:59 root@bigdata111
The key's randomart image is:
+--[ RSA 2048]----+
|              .oE|
|             ..++|
|          . B = +|
|       o . + * * |
|        S o   + o|
|         .   . o.|
|            . .  |
|             .   |
|                 |
+-----------------+
  2. Again using "send keystrokes to all sessions", add each machine's public key to the authorized key store on every machine in the cluster:
[root@bigdata111 hadoop-2.8.4]# ssh-copy-id bigdata111
[root@bigdata111 hadoop-2.8.4]# ssh-copy-id bigdata112
[root@bigdata111 hadoop-2.8.4]# ssh-copy-id bigdata113
  3. Check that the key store now exists:
[root@bigdata111 .ssh]# cd /root/.ssh
[root@bigdata111 .ssh]# ls
authorized_keys  id_rsa  id_rsa.pub  known_hosts
[root@bigdata111 .ssh]# cat authorized_keys 
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC7cSXZDdNJ0Cg+1wyVoCn4pWEAxy/13/ekg//YVkGwEsR6HO4XaYxxstVBij5JoTEEjSDNmz2HifTZDB098py3x882ZLVHJllJWzXYX4gVof/tmdmk5AJbhIlX3SoauTrrrzFiMtuXKdu6slvzhs9IbDp68xCUNiVI06OnWFSuhQc8Td+tekwlFPfm+v3W/PqUUgQAd+OAqOUC2vEjjnACQNw/wgGvF/lqrXDv5ZIFmYCBlB7YxwP9RykOvAzEe7w2W7TOt0K8V8oKKTui4aZuahWDbsGwlD7TAQRkilXkG59XG48AWOQoU/XFxph+XECqJzjmdxYedzY8inYW/Lfx root@bigdata111
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDYyMVfLaL9w9sGz5hQG96ksUN5ih2RHdwsiXBpL/ZRG7LasKS+OQcszmc61TJfV0Vjad7kuL9wlg2YqlVIQJvaIUQCw4+5BrO0vCy4JBrz/FiDjzxKx0Ba+ILziuMxl35RxDCVGph17i2jpYfy6jGLejYK9kpJH4ueIj8mm+4LTKabRZTcjdNNI0kYM+Tr08wEIuQ45adqVU9MpZc/j6i1FIr4R/RabyuO1FhEh0+Oc5Xbm3jSAYH0MgEvK1cuG9wmX7SaB/opO00Ts+nW/P4umeZQUy51IQSRdUF6BlMrshnCSlKHnuLv2eSCx9yv3QuQMWHnL/SOXUgTnIuzbrv9 root@bigdata112
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDoBOAT/n1QCnaVJtRS1Q9GeoP665gIayWxpSWbjEFus4DL4as5S9jAIhBQWrTnvZzm+Skb4dxGPgdPYLaMFX9tdDYPPsnnRR92sLpRw9gwvG5ROL5XPpV2X+Yxl6yACmlMT0JP1uk+Ekm623n6wtBSBP1BDtJ/fhXkRX6bo2kuXs4BvmP76cikdGBDygKNIEMPTcs6p2lfOnuVdQLSCGm+Q9NswKSBVElNyywNl5J9L/5kIzGXnoGtwhQtdrOjZ+c1tyiwhCz42I3c4z0Sb/zH3OFtHCvRG7cF72uDFxe1QwVJ4h1hJ1dmtwVCckNMbmmgK72PsN8Zg4Y8XtBXgX8n root@bigdata113

  4. Verify that passwordless SSH login is configured correctly:
[root@bigdata111 .ssh]# ssh bigdata112
Last login: Mon Aug  5 09:23:11 2019 from bigdata112
[root@bigdata112 ~]# ssh bigdata111
Last login: Mon Aug  5 09:09:23 2019 from 192.168.1.1
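
With the keys exchanged, every host should reach every other host without a password prompt; a compact loop to verify (run from any of the three machines):

[root@bigdata111 ~]# for h in bigdata111 bigdata112 bigdata113; do ssh $h hostname; done
bigdata111
bigdata112
bigdata113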
Deploy the JDK and Hadoop
  1. Uncheck "send keystrokes to all sessions", then send the module folder from bigdata111 to the /opt/ folder on the other two machines:
[root@bigdata111 module]# scp -r /opt/module/ root@bigdata112:/opt/
[root@bigdata111 module]# scp -r /opt/module/ root@bigdata113:/opt/
  2. Send the environment variable file /etc/profile to the other two machines:
[root@bigdata111 module]# scp -r /etc/profile root@bigdata112:/etc/
[root@bigdata111 module]# scp -r /etc/profile root@bigdata113:/etc/
  3. Switch to the other two machines and refresh the environment variables:
[root@bigdata112 module]# source /etc/profile
[root@bigdata112 module]# jps
2775 Jps
[root@bigdata113 module]# source /etc/profile
[root@bigdata113 module]# jps
2820 Jps
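
To confirm the copied JDK resolves on both machines without logging in to each one (a small sketch using the passwordless SSH configured earlier; the explicit source is needed because non-interactive SSH does not read /etc/profile, and java -version writes to stderr):

[root@bigdata111 module]# for h in bigdata112 bigdata113; do ssh $h 'source /etc/profile; java -version 2>&1 | head -1'; done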
Configure the cluster XML files

With "send keystrokes to all sessions" checked, configure the hdfs-site.xml, yarn-site.xml, and mapred-site.xml files:

  1. hdfs-site.xml is configured as follows (the SecondaryNameNode is placed on bigdata113):
<configuration>

<!-- Replication factor -->
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<!-- SecondaryNameNode address -->
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>bigdata113:50090</value>
</property>
<!-- Disable permission checking -->
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>

</configuration>
  2. yarn-site.xml is configured as follows (the ResourceManager is placed on bigdata112):
<configuration>

<!-- Site specific YARN configuration properties -->
<!-- How reducers fetch data -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

<!-- Hostname of the YARN ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>bigdata112</value>
</property>

<!-- Enable log aggregation -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>

<!-- Log retention time: 7 days, in seconds -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>

</configuration>
  3. mapred-site.xml is configured as follows:
<configuration>

<!-- Run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

<!-- JobHistory server address -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>bigdata112:10020</value>
</property>

<!-- JobHistory server web UI address -->
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>bigdata112:19888</value>
</property>

</configuration>
Configure the DataNodes in the slaves file
[root@bigdata111 ~]# cd /opt/module/hadoop-2.8.4/etc/hadoop/
[root@bigdata111 hadoop]# ls
capacity-scheduler.xml  core-site.xml   hadoop-metrics2.properties  hdfs-site.xml            httpfs-signature.secret  kms-env.sh            log4j.properties  mapred-queues.xml.template  ssl-client.xml.example  yarn-env.sh
configuration.xsl       hadoop-env.cmd  hadoop-metrics.properties   httpfs-env.sh            httpfs-site.xml          kms-log4j.properties  mapred-env.cmd    mapred-site.xml             ssl-server.xml.example  yarn-site.xml
container-executor.cfg  hadoop-env.sh   hadoop-policy.xml           httpfs-log4j.properties  kms-acls.xml             kms-site.xml          mapred-env.sh     slaves                      yarn-env.cmd
[root@bigdata111 hadoop]# vi slaves
bigdata111
bigdata112
bigdata113
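
If you did not edit the files on all three machines simultaneously with Xshell, the whole configuration directory can instead be pushed from bigdata111 (a sketch; this overwrites the remote copies):

[root@bigdata111 hadoop]# scp -r /opt/module/hadoop-2.8.4/etc/hadoop/* root@bigdata112:/opt/module/hadoop-2.8.4/etc/hadoop/
[root@bigdata111 hadoop]# scp -r /opt/module/hadoop-2.8.4/etc/hadoop/* root@bigdata113:/opt/module/hadoop-2.8.4/etc/hadoop/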
Format the NameNode

Using Xshell's "send keystrokes to all sessions" feature, run the format command on all three machines:

[root@bigdata111 hadoop]# hadoop namenode -format
[root@bigdata112 hadoop]# hadoop namenode -format
[root@bigdata113 hadoop]# hadoop namenode -format
Start HDFS on bigdata111
[root@bigdata111 hadoop]# start-dfs.sh
Start YARN on bigdata112
[root@bigdata112 hadoop]# start-yarn.sh
Check the processes on all three machines with jps
[root@bigdata111 hadoop]# jps
2512 DataNode
2758 NodeManager
2377 NameNode
2894 Jps
[root@bigdata112 ~]# jps
2528 NodeManager
2850 Jps
2294 DataNode
2413 ResourceManager
[root@bigdata113 ~]# jps
2465 NodeManager
2598 Jps
2296 DataNode
2398 SecondaryNameNode
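
As a final smoke test, the wordcount example from the local-mode section can be run against HDFS this time (a sketch reusing /opt/module/testdoc; the /input and /output paths are arbitrary choices):

[root@bigdata111 ~]# hdfs dfs -mkdir -p /input
[root@bigdata111 ~]# hdfs dfs -put /opt/module/testdoc /input
[root@bigdata111 ~]# cd /opt/module/hadoop-2.8.4/share/hadoop/mapreduce/
[root@bigdata111 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.8.4.jar wordcount /input /output
[root@bigdata111 mapreduce]# hdfs dfs -cat /output/part-r-00000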

Source: https://www.cnblogs.com/ShadowFiend/p/11332382.html