Machine selection
In practical applications, there are generally two types of machines:
The first: physical (hardware) servers
The second: cloud hosts
Preparation
Build three servers in the VMware 12 environment
Configure the IP, hostname, and local mapping (/etc/hosts)
The other two machines are cloned from the original one.
After cloning a machine, modify its MAC address.
As the root user:
vim /etc/udev/rules.d/70-persistent-net.rules
(1) Delete the entry for eth0
(2) Rename the eth1 entry to eth0
(3) Copy the MAC address
(4) Edit the network card configuration; modify the IP address and MAC address (an example is given after the IP list below)
vim /etc/sysconfig/network-scripts/ifcfg-eth0
vim /etc/sysconfig/network
(5) Modify the hosts file and configure the hostname mapping
(6) Turn off the firewall
$ sudo chkconfig iptables off
(7) Turn off SELinux (its security restrictions are stricter than we need here and cause unnecessary trouble). The machine must be restarted for this to take effect.
$ sudo vim /etc/sysconfig/selinux   # set SELINUX=disabled
(8) Restart the network
service network restart
(9) Check whether the configuration succeeded with the ifconfig command
(10) Restart the virtual machine
Note: Here I have three virtual machines; their IPs and hostnames, in order, are:
192.168.59.223 bigdata-hpsk02.huadian.com
192.168.59.224 bigdata-hpsk03.huadian.com
192.168.59.225 bigdata-hpsk04.huadian.com
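For step (4), a minimal sketch of the network configuration on the first machine is shown below; the HWADDR, GATEWAY, and DNS1 values are placeholders and assumptions that must be adjusted to your own MAC address and VMware network.
# /etc/sysconfig/network-scripts/ifcfg-eth0 (first machine, example values)
DEVICE=eth0
TYPE=Ethernet
ONBOOT=yes
BOOTPROTO=static
HWADDR=00:0C:29:XX:XX:XX      # placeholder: the MAC copied from 70-persistent-net.rules
IPADDR=192.168.59.223
NETMASK=255.255.255.0
GATEWAY=192.168.59.2          # assumption: the VMware NAT gateway of this subnet
DNS1=192.168.59.2             # assumption
# /etc/sysconfig/network (sets the hostname)
NETWORKING=yes
HOSTNAME=bigdata-hpsk02.huadian.com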
Installation method
1. Do not use batch installation tools
Manual distribution: copy the configured Hadoop from the first machine to each of the other machines. This is the method used here.
2. Use a batch tool: CM (Cloudera Manager), a big data cluster monitoring, operation, and maintenance management tool.
CDH: Cloudera's open-source distribution of Hadoop and related infrastructure.
The stable version can be downloaded at http://archive.cloudera.com/cdh5/cdh/5/
Deployment
Create a unified user and unified directories on each Linux machine
create user
useradd username
Set a password for the user
passwd username
Create the directories
Create these four folders (here under /opt, matching the paths used later): datas for test data, softwares for software installation packages, modules as the software installation directory, and tools for development IDEs and tools.
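A minimal sketch of creating them, assuming /opt as the parent directory and <user> as a placeholder for the account created above:
sudo mkdir -p /opt/datas /opt/softwares /opt/modules /opt/tools
sudo chown -R <user>:<user> /opt/datas /opt/softwares /opt/modules /opt/tools   # let the working user write to them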
Modify the number of handles
In Linux everything is represented as a file, so the open file handle limit usually needs to be raised when tuning for concurrency. The limit applies to the programs run by the current user.
If the number of file handles opened by a single process exceeds the system-defined limit, you may run into the error Socket/File: Can't open so many files.
ulimit -a #View linux related parameters
One approach is to put the ulimit command into /etc/profile, but this is not a good method.
The correct way should be to modify /etc/security/limits.conf
sudo vim /etc/security/limits.conf
Add at the end of the file:
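The exact lines are not reproduced here; a typical addition (the limit value 65535 is only an example) looks like this:
* soft nofile 65535
* hard nofile 65535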
Restart the machine and check the number of handles
ulimit -n
Hadoop startup methods
Single-process start (start one daemon at a time):
sbin/hadoop-daemon.sh start namenode
sbin/hadoop-daemon.sh start datanode
sbin/yarn-daemon.sh start resourcemanager
sbin/yarn-daemon.sh start nodemanager
Start HDFS and YARN separately (the matching stop scripts are used for shutdown):
sbin/start-dfs.sh   -> namenode -> datanodes -> secondarynamenode
sbin/start-yarn.sh  -> resourcemanager -> all nodemanagers
Start all processes at once:
sbin/start-all.sh
Passwordless SSH login between machines, so the start scripts can log in to the other machines and start the corresponding services
(1) Each machine creates a public and private key for itself
ssh-keygen -t rsa
You can see the generated id_rsa and id_rsa.pub in the .ssh folder of the user's home directory
(2) Each machine sends its own public key to each machine including itself
ssh-copy-id <own hostname>
ssh-copy-id <first other hostname>
ssh-copy-id <second other hostname>
Verify that the three machines can log in to each other successfully (screenshot omitted):
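For example, from the first machine (hostnames taken from the list above):
ssh bigdata-hpsk03.huadian.com   # should log in without asking for a password
exit
ssh bigdata-hpsk04.huadian.com
exit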
NTP time synchronization: keep the time of all machines consistent through the ntp service
Use the ntp service to synchronize with an external time server: select one machine as the intermediate synchronization server A; A synchronizes with the external network, and B and C synchronize with A.
Configure A:
sudo vim /etc/ntp.conf
Remove the default configuration:
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery
restrict 127.0.0.1
restrict -6 ::1
server 0.centos.pool.ntp.org
server 1.centos.pool.ntp.org
server 2.centos.pool.ntp.org
Add the following:
# which machines A allows to synchronize with it
restrict 192.168.59.0 mask 255.255.255.0 nomodify notrap
# which server A synchronizes with
server 202.112.10.36
# local synchronization fallback
server 127.127.1.0   # local clock
fudge 127.127.1.0 stratum 10
Note: 127.127.1.0 is the reserved IP address of the NTP local clock, used so this machine can serve time to its clients.
Start the ntp service on A:
sudo service ntpd start
Configure B and C to synchronize with A:
sudo vim /etc/ntp.conf
server 192.168.59.223
Synchronize manually once:
sudo ntpdate 192.168.59.223
Start the ntp service on B and C:
sudo service ntpd start
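Not part of the original steps, but a quick way to confirm on B and C that they are syncing from A is to list the NTP peers:
ntpq -p   # A (192.168.59.223) should appear in the peer list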
Lazy way:
Set the same time on all three machines at once:
sudo date -s "2018-04-27 15:56:00"
Install JDK
(1) Upload the compressed package to the softwares directory
(2) Unzip to the specified directory
tar -zxvf /opt/softwares/jdk-8u91-linux-x64.tar.gz -C /opt/modules/
(3) Distribute to the second and third machines
scp -r jdk1.8.0_91 <user>@bigdata-hpsk03.huadian.com:/opt/modules/
scp -r jdk1.8.0_91 <user>@bigdata-hpsk04.huadian.com:/opt/modules/
(<user> is a placeholder for the account created earlier)
(4) Configure environment variables (per machine)
vi /etc/profile
Open the file and add at the end:
##JAVA_HOME
export JAVA_HOME=/opt/modules/jdk1.8.0_91
export PATH=$PATH:$JAVA_HOME/bin
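Then reload the profile and verify the JDK is picked up (a quick check, not shown in the original):
source /etc/profile
java -version   # should report version 1.8.0_91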
Install Hadoop
(1) Upload the compressed package to the softwares directory
(2) Unzip to the specified directory
tar -zxvf /opt/softwares/hadoop-2.7.3.tar.gz -C /opt/modules/
# -z: gzip compression  -x: extract  -v: verbose (show progress)  -C: target directory for extraction
(3) Node distribution
machine 1: datanode, nodemanager, namenode (working)
machine 2: datanode, nodemanager, namenode/resourcemanager (backup)
machine 3: datanode, nodemanager, resourcemanager (working)
The datanode stores data; the nodemanager processes data.
A nodemanager prefers its local datanode, avoiding transfers across the network (one of Hadoop's built-in optimizations).
The namenode and resourcemanager are both master nodes and both receive user requests. If both were placed on node 1, its load would be relatively high, so they are distributed to different machines.
(4) Modify the configuration file
*-env.sh files: configure environment variables
By default, JAVA_HOME is first looked up from the global variable in /etc/profile, but to avoid problems JAVA_HOME is also configured explicitly in the following three files under hadoop-2.7.3/etc/hadoop/: hadoop-env.sh, mapred-env.sh, yarn-env.sh
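For example, set the same line in each of the three files (the path matches the JDK installed above):
export JAVA_HOME=/opt/modules/jdk1.8.0_91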
*-site.xml files: configure user-defined settings
core-site.xml: configure Hadoop's global properties
First create a temporary storage directory under the Hadoop installation directory to store the metadata
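For example (the path matches the hadoop.tmp.dir value in the configuration below):
mkdir /opt/modules/hadoop-2.7.3/tmpData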
<configuration>
    <!-- fs.defaultFS: the HDFS entry point; points to the first machine -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://bigdata-hpsk02.huadian.com:8020</value>
    </property>
    <!-- hadoop.tmp.dir: Hadoop's temporary storage directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/modules/hadoop-2.7.3/tmpData</value>
    </property>
</configuration>
hdfs-site.xml: configure the properties of hdfs
<configuration>
    <!-- dfs.replication: number of file replicas -->
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <!-- dfs.permissions.enabled: HDFS permission checks; users other than the file owner cannot access files.
         Disabling it (false) lets everyone access HDFS, which must not be done in production. -->
    <property>
        <name>dfs.permissions.enabled</name>
        <value>true</value>
    </property>
</configuration>
mapred-site.xml: configure MapReduce
<configuration>
    <!-- run MapReduce on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- jobhistory -->
</configuration>
yarn-site.xml
<configuration>
    <!-- specify which machine the resourcemanager runs on -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>bigdata-training03.hpsk.com</value>
    </property>
    <!-- specify what type of programs run on YARN -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
slaves: configure the addresses of all slave nodes (one per line)
bigdata-hpsk02.huadian.com
bigdata-hpsk03.huadian.com
bigdata-hpsk04.huadian.com
Distribution
Distribute the hadoop-2.7.3 directory from the first machine to the other machines
scp -r hadoop-2.7.3 <user>@bigdata-hpsk03.huadian.com:/opt/modules/   # repeat for the third machine
Or each of the other machines pulls it from the first one:
scp -r <user>@bigdata-hpsk02.huadian.com:/opt/modules/hadoop-2.7.3 /opt/modules/
Note: make sure the user has ownership of the /opt/modules directory on each machine.
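For example (run on each target machine; <user> is a placeholder for the account created earlier):
sudo chown -R <user>:<user> /opt/modules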
Start and test
Format the file system. Formatting generates new metadata, and the format must be run on the same machine where the namenode will be started.
bin/hdfs namenode -format
Start the corresponding process with a single process command
Start namenode on the first machine
sbin/hadoop-daemon.sh start namenode
The namenode saves its metadata in the temporary storage directory we configured earlier in core-site.xml. It contains the fsimage (a snapshot of the file system) and the edit logs (the sequence of changes made to the file system).
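You can see these files under the configured directory; the dfs/name/current subpath is the default HDFS layout under hadoop.tmp.dir:
ls /opt/modules/hadoop-2.7.3/tmpData/dfs/name/current
# fsimage_*  edits_*  seen_txid  VERSION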
Note: do not execute start-yarn.sh on the first machine; by default it starts the resourcemanager master on the machine it is run from, and we will start the resourcemanager on the third machine.
Start the datanode on each of the three machines
sbin/hadoop-daemon.sh start datanode
Start yarn on the third machine
sbin/start-yarn.sh
An error was reported. Reason: the hostname of the resourcemanager could not be found; the configured hostname is incorrect.
In the yarn-site.xml file, change yarn.resourcemanager.hostname to the hostname of the third machine:
bigdata-hpsk04.huadian.com
Finally, check the running processes on each machine (jps screenshots omitted):
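Based on the node layout above, jps should report roughly the following:
jps
# first machine:  NameNode, DataNode, NodeManager
# second machine: DataNode, NodeManager
# third machine:  DataNode, NodeManager, ResourceManager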
At this point, the distributed environment is built~
Create an Input directory on HDFS
Create a new test file
content:
Upload to the input directory
Test the wordcount program; you can see that the job connects to the third machine, which is running the resourcemanager.
View Results
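The steps above are shown as screenshots in the original; a typical command sequence, assuming a local test file /opt/datas/wc.input and HDFS paths /input and /output (the names are examples, not from the original), would be:
bin/hdfs dfs -mkdir /input
bin/hdfs dfs -put /opt/datas/wc.input /input
bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /input /output
bin/hdfs dfs -cat /output/part-r-00000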
To stop all processes, you can use the unified stop scripts.
Execute on the first machine:
sbin/stop-dfs.sh
stops the namenode and all datanodes
Execute on the third machine:
sbin/stop-yarn.sh
stops the resourcemanager and all nodemanagers
Addendum: the secondarynamenode can be started on an idle machine.