1. Prepare the virtual machines
Prepare three machines (install JDK 8, turn off the firewall, configure a static IP and a hostname).
Add the following to the /etc/hosts file:
192.168.138.102 hadoop102
192.168.138.103 hadoop103
192.168.138.104 hadoop104
2. Write a cluster distribution script xsync
2.1 scp (secure copy)
2.1.1 Definition
scp copies data between servers.
2.1.2 Basic syntax
scp -r $pdir/$fname $user@$host:$pdir/$fname
# command  recursive  source path/name  destination user@host:destination path/name
2.1.3 Copy the /opt/module/hadoop directory on the 192.168.138.55 host to 192.168.138.66:
scp -r 192.168.138.55:/opt/module/hadoop 192.168.138.66:/opt/module
2.1.4 On the 192.168.138.77 host, copy the /opt/module/hadoop directory from the 192.168.138.55 server to 192.168.138.77:
scp -r 192.168.138.55:/opt/module/hadoop 192.168.138.77:/opt/module
2.2 rsync remote synchronization tool
rsync is a remote synchronization tool, mainly used for backup and mirroring. Its advantages are speed, avoiding re-copying identical content, and support for symbolic links.
Difference between rsync and scp: copying files with rsync is faster than with scp, because rsync only updates the files that differ, while scp copies every file over again.
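This difference can be seen locally without two servers. The sketch below (the temporary directories are throwaway examples, not cluster paths) runs rsync twice over the same tree: the first run transfers both files, the second transfers nothing because nothing changed, which is exactly what scp cannot do.

```shell
#!/bin/bash
# Local sketch of the rsync-vs-scp difference described above.
src=$(mktemp -d); dst=$(mktemp -d)
echo hello > "$src/a.txt"
echo world > "$src/b.txt"

# First run: both files are transferred.
first=$(rsync -rv "$src/" "$dst/")
echo "$first"

# Second run: nothing has changed, so rsync transfers no files.
second=$(rsync -rv "$src/" "$dst/")
echo "$second" | grep -q 'a.txt' && echo "a.txt re-copied" || echo "a.txt skipped (already in sync)"
```

With scp, the commented-out equivalent `scp -r "$src"/* host:dst/` would re-copy every file on each run.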
2.2.1 View the rsync manual
man rsync | more
2.2.2 Basic syntax
rsync -rvl $pdir/$fname $user@$host:$pdir
# command  options  source path/name  destination user@host:destination path
2.2.3 Option description: -r copy directories recursively, -v verbose output, -l copy symbolic links as symlinks
2.2.4 Synchronize the /opt/software directory on machine 55 to the /opt/software directory on server 66:
rsync -rvl /opt/software/* 192.168.138.66:/opt/software/
2.2.5 Copy the local /opt/module directory to /opt on 192.168.138.77:
rsync -rvl /opt/module 192.168.138.77:/opt/
2.2.6 Script implementation
#!/bin/bash
# 1. Get the number of input parameters; exit if there are none
pcount=$#
if ((pcount==0)); then
  echo no args;
  exit;
fi
# 2. Get the file name
p1=$1
fname=`basename $p1`
echo fname=$fname
# 3. Get the absolute path of the parent directory
pdir=`cd -P $(dirname $p1); pwd`
echo pdir=$pdir
# 4. Get the current user name
user=`whoami`
# 5. Loop over the hosts and distribute
for ((host=102; host<=104; host++)); do
  echo -------------------- hadoop$host --------------------
  rsync -rvl $pdir/$fname $user@hadoop$host:$pdir
done
2.2.7 Give the xsync script execute permission
chmod 777 xsync
2.2.8 Execute the script (here distributing xsync itself)
./xsync xsync
3. Cluster configuration
1. Cluster deployment plan
        hadoop102              hadoop103                       hadoop104
HDFS    NameNode, DataNode     DataNode                        SecondaryNameNode, DataNode
YARN    NodeManager            ResourceManager, NodeManager    NodeManager
2. Configure the cluster (all three machines)
2.1 core-site.xml configuration file [in the hadoop/etc/hadoop directory]
<configuration>
<!-- NameNode address (hostname and port) -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop102:9000</value>
</property>
<!-- Directory for files generated while Hadoop is running -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/module/hadoop/data/tmp</value>
</property>
</configuration>
2.2 hdfs-site.xml configuration file
<configuration>
<!-- Specify the number of HDFS replicas -->
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop104:50090</value>
</property>
</configuration>
2.3 configuration file hadoop-env.sh
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.242.b08-0.el7_7.x86_64
2.4 configuration file yarn-env.sh
# some Java parameters
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.242.b08-0.el7_7.x86_64
2.5 Configuring yarn-site.xml file
<configuration>
<!-- Site specific YARN configuration properties -->
<!-- How the Reducer obtains data -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Specify the address of the YARN ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop103</value>
</property>
</configuration>
2.6 mapred-env.sh configuration file
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.242.b08-0.el7_7.x86_64
2.7 mapred-site.xml configuration file
<configuration>
<!-- Specify that MR runs on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
2.8 Distribute the configured Hadoop configuration files to the cluster
./xsync /opt/module/hadoop/
4. Start the cluster components one by one
4.1 If the cluster is being started for the first time, format the NameNode
bin/hdfs namenode -format
4.2 Start the NameNode on hadoop102
hadoop-daemon.sh start namenode
4.3 Start a DataNode on each of hadoop102, hadoop103, and hadoop104
hadoop-daemon.sh start datanode
4.4 Start the SecondaryNameNode on hadoop104
hadoop-daemon.sh start secondarynamenode
4.5 Access the HDFS web UI at http://192.168.138.102:50070
4.6 In a fully distributed Hadoop environment, the DataNodes start normally but do not appear on the web page.
Solutions:
1. Check that /etc/hosts contains the hostname-to-IP mappings of all slave nodes;
2. Modify hdfs-site.xml on the NameNode machine, add the following configuration, and restart the DataNodes:
<property>
<name>dfs.namenode.datanode.registration.ip-hostname-check</name>
<value>false</value>
</property>
5. SSH passwordless login configuration
5.1 Configure ssh
5.1.1 Basic principles of ssh
SSH can guarantee security because it uses public-key encryption, as follows:
1. The remote host receives the user's login request and sends its own public key to the user;
2. The user encrypts the login password with this public key and sends it back;
3. The remote host decrypts the password with its own private key; if the password is correct, the login is allowed.
5.1.2 Basic syntax
If the user name is java and the remote host name is linux, the command is:
$ssh java@linux
SSH's default port is 22, i.e. the login request is sent to port 22 of the remote host. The -p parameter changes the port, e.g. to port 88:
$ssh -p 88 java@linux
Note: If the error "ssh: Could not resolve hostname linux: Name or service not known" occurs, it is because the host named linux cannot be resolved; add the host and its IP to /etc/hosts:
192.168.138.102 linux
5.2 Passwordless key configuration
5.2.1 Principle of passwordless login
When the master acts as a client and wants to connect to the slave using public-key authentication instead of a password, the master first generates a key pair consisting of a public key and a private key, and copies the public key to the slave. When the master then connects to the slave via ssh, the slave generates a random number, encrypts it with the master's public key, and sends it to the master. The master decrypts it with its private key and sends the number back; if the number is correct, the slave allows the master to connect. This is the public-key authentication process, and it requires no manually entered password.
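The random-number challenge described above can be simulated with openssl. This is a sketch of the idea only; the file names are illustrative and this is not what ssh literally does on disk.

```shell
#!/bin/bash
# Simulate the public-key challenge-response described above with openssl.
# The key files here are throwaway examples, not your real ~/.ssh keys.
work=$(mktemp -d); cd "$work"

# Master side: generate a key pair (stand-ins for id_rsa / id_rsa.pub)
openssl genrsa -out id_rsa 2048 2>/dev/null
openssl rsa -in id_rsa -pubout -out id_rsa.pub 2>/dev/null

# Slave side: create a random number and encrypt it with the master's public key
openssl rand -hex 16 > challenge.txt
openssl pkeyutl -encrypt -pubin -inkey id_rsa.pub -in challenge.txt -out challenge.enc

# Master side: decrypt the challenge with the private key and send it back
openssl pkeyutl -decrypt -inkey id_rsa -in challenge.enc -out answer.txt

# Slave side: compare; a match proves the master holds the private key
cmp -s challenge.txt answer.txt && echo "challenge passed: passwordless login allowed"
```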
5.2.2 Generate a passwordless key pair on the master host (hadoop102)
ssh-keygen -t rsa
After running the command, press Enter directly when asked for a save path to accept the default.
The generated key pair, id_rsa (private key) and id_rsa.pub (public key), is stored in the ~/.ssh directory by default.
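For scripted setups, the same step can be run non-interactively. This is a sketch: the -f path and the empty passphrase (-N "") are choices of this example, not requirements, and the demo path avoids overwriting an existing ~/.ssh/id_rsa.

```shell
#!/bin/bash
# Non-interactive version of the ssh-keygen step above.
# Writes to a throwaway path so it cannot clobber an existing key.
keyfile=$(mktemp -d)/id_rsa_demo
ssh-keygen -t rsa -N "" -f "$keyfile" -q

# Both halves of the pair should now exist.
ls -l "$keyfile" "$keyfile.pub"
```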
5.2.3 View the key pair
cd .ssh
5.2.4 Send the public key on the master (hadoop102) node to the remote host
ssh-copy-id hadoop103
Check on hadoop103 whether the transfer succeeded (the key should appear in ~/.ssh/authorized_keys).
5.2.5 Test passwordless login
ssh hadoop103
5.3 Files in the .ssh folder explained
(1) known_hosts: records the public keys of the computers ssh has visited
(2) id_rsa: the generated private key
(3) id_rsa.pub: the generated public key
(4) authorized_keys: stores the public keys authorized for passwordless login to this server
6. Start the cluster as a whole
6.1 Configure slaves
cd /opt/module/hadoop/etc/hadoop/
vim slaves
Add the hostname of every DataNode machine, one per line (no trailing spaces or blank lines):
hadoop102
hadoop103
hadoop104
Once configured, distribute the file to the other nodes:
./xsync /opt/module/hadoop/etc/hadoop/
6.2 Start the cluster
6.2.1 If this is the first start, format the NameNode
bin/hdfs namenode -format
6.2.2 Start HDFS on the hadoop102 machine
start-dfs.sh
6.2.3 Start YARN on the hadoop103 machine
start-yarn.sh
6.2.4 Access
HDFS:http://192.168.138.102:50070
YARN:http://192.168.138.103:8088
7. Run a MapReduce test program
7.1 Create a folder named wcinput in the hadoop directory
mkdir wcinput
7.2 Create a file wc.input in the wcinput folder and edit it
cd wcinput
touch wc.input
vim wc.input
7.3 Return to the /opt/module/hadoop directory
7.4 Upload the input folder to HDFS
hadoop fs -put wcinput /
7.5 Run the MapReduce program
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar wordcount /wcinput /wcoutput
7.6 View the result
hadoop fs -cat /wcoutput/*
8. Summary of cluster start/stop methods
8.1 Start/stop each component one by one
8.1.1 Start/stop the HDFS components individually
hadoop-daemon.sh start|stop namenode|datanode|secondarynamenode
8.1.2 Start/stop the YARN components individually
yarn-daemon.sh start|stop resourcemanager|nodemanager
8.2 Start/stop each module separately (requires SSH configuration; the common way)
8.2.1 Start/stop HDFS as a whole
start-dfs.sh
stop-dfs.sh
8.2.2 Start/stop YARN as a whole
start-yarn.sh
stop-yarn.sh
9. Cluster time synchronization
Approach: pick one machine as the time server, and have all the other machines synchronize with it on a schedule, e.g. once every ten minutes.
9.1 Check whether ntp is installed
rpm -qa|grep ntp
9.2 Check whether the ntpd service is running
service ntpd status
If it is running, stop it first, then perform the steps below;
9.3 Modify the ntp configuration file
vim /etc/ntp.conf
9.3.1 Change 1 (authorize all machines on the 192.168.1.0 segment to query and synchronize time from this machine): change
#restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap
to
restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap
9.3.2 Change 2 (the cluster is on a LAN; comment out the default internet `server ... iburst` lines so external network time is not used)
9.3.3 Addition 3 (when this node loses its network connection, it can still serve time to the other nodes in the cluster)
server 127.127.1.0
fudge 127.127.1.0 stratum 10
9.4 Modify the /etc/sysconfig/ntpd file
vim /etc/sysconfig/ntpd
Add the following (synchronize the hardware clock together with the system time):
SYNC_HWCLOCK=yes
9.5 Restart the ntpd service
service ntpd start
9.6 Set the ntpd service to start at boot (e.g. chkconfig ntpd on)
9.6.1 On the other machines, configure a crontab entry to synchronize with the time server once per minute
crontab -e
Write the following entry:
*/1 * * * * /usr/sbin/ntpdate hadoop102
9.6.2 Change the time on any other machine (to test)
date -s "2017-9-11 11:11:11"
9.6.3 After one minute, check whether the machine has resynchronized with the time server
date