Hadoop distributed environment deployment

Machine selection

In practical applications, there are generally two types of machines:

The first: hardware server

The second: cloud hosting

Preparation

Build three servers in the VMware 12 environment

Configure the IP address, hostname, and local mapping (/etc/hosts) for each machine

The other two machines are cloned from the first one.

After cloning a machine, modify its MAC address.

Use root:

vim /etc/udev/rules.d/70-persistent-net.rules

(1) Delete the eth0 entry

(2) Rename eth1 to eth0

(3) Copy the MAC address

(4) Edit the network card configuration, updating the IP address and MAC address

vim /etc/sysconfig/network-scripts/ifcfg-eth0

(5) Modify the hostname

vim /etc/sysconfig/network

(6) Modify the hosts file (/etc/hosts) and configure the hostname-to-IP mapping

(7) Turn off the firewall

$ sudo chkconfig iptables off
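chkconfig only disables the firewall at boot; to also stop the service that is currently running:

$ sudo service iptables stop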

(8) Turn off SELinux (its strict security policies cause unnecessary trouble here). The machine must be restarted for this to take effect.

$ sudo vim /etc/sysconfig/selinux
Settings:
SELINUX=disabled

(9) Restart the network

service network restart

(10) Check whether the configuration succeeded with the ifconfig command

(11) Restart the virtual machine

Note: my three virtual machines' IP addresses and hostnames are, in order:

192.168.59.223  bigdata-hpsk02.huadian.com
192.168.59.224  bigdata-hpsk03.huadian.com
192.168.59.225  bigdata-hpsk04.huadian.com

Installation method

1. Without batch installation tools

Manual distribution: configure Hadoop on the first machine and then distribute it to each of the other machines. This is the method used here.

2. With batch tools: Cloudera Manager (CM), a big data cluster monitoring, operations, and maintenance management tool

CDH: Cloudera's open-source software distribution for Hadoop.

The stable version can be downloaded at http://archive.cloudera.com/cdh5/cdh/5/

Deployment

Create a unified user and unified directories on each Linux machine

Create a user

useradd username

Set a password for the user

passwd username

Create directories

Create these four folders (the later steps use them under /opt): datas holds test data, softwares holds software installation packages, modules is the software installation directory, and tools holds development IDEs and other tools.
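A minimal sketch of the commands, assuming the four directories live under /opt (the paths used later in this guide):

sudo mkdir -p /opt/datas /opt/softwares /opt/modules /opt/tools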

 

Modify the number of handles

In Linux everything is represented as a file, so the number of open file handles needs to be raised when tuning for concurrency. The value being set is the limit for programs run by the current user.

If the number of file handles opened by a single process exceeds the system-defined limit, you will sometimes hit the error "Socket/File: Can't open so many files".

ulimit -a    # view the current Linux resource limits

One method is to put the ulimit command into /etc/profile, but this approach is not recommended.

The correct way is to modify /etc/security/limits.conf:

sudo vim /etc/security/limits.conf

Add at the end of the file:
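For example, a typical setting (the exact numbers are not specified here; 65535 is just a commonly used value):

*               soft    nofile          65535
*               hard    nofile          65535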

Restart the machine, then check the new handle limit:

ulimit -n

Hadoop startup methods

Single-process start: start each daemon individually
    sbin/hadoop-daemon.sh start namenode
    sbin/hadoop-daemon.sh start datanode
    sbin/yarn-daemon.sh start resourcemanager
    sbin/yarn-daemon.sh start nodemanager
Start HDFS and YARN separately, one group of daemons per script:
    sbin/start-dfs.sh
        -> namenode
        -> datanodes
        -> secondarynamenode
    sbin/start-yarn.sh
        -> resourcemanager
        -> all nodemanagers
Start all processes at once:
    sbin/start-all.sh

SSH passwordless login, so that one machine can log in to the others and start the corresponding services

(1) Each machine creates a public and private key for itself

ssh-keygen -t rsa

The generated id_rsa and id_rsa.pub can be seen in the .ssh folder under the user's home directory.

(2) Each machine sends its own public key to each machine including itself

ssh-copy-id <own hostname>
ssh-copy-id <first other hostname>
ssh-copy-id <second other hostname>
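For the three machines here, the commands run on every machine would be:

ssh-copy-id bigdata-hpsk02.huadian.com
ssh-copy-id bigdata-hpsk03.huadian.com
ssh-copy-id bigdata-hpsk04.huadian.com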

Verify that the three machines can log in to each other without a password.
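A quick check from any of the machines:

ssh bigdata-hpsk03.huadian.com hostname    # should print the remote hostname without asking for a password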

NTP time synchronization: keep the time on all machines consistent via the ntp service

Use the ntp service to synchronize with an external time server:
    -> Select one machine as the intermediate synchronization server A; A synchronizes with the external network, and B and C synchronize with A.
        -> Configure A: sudo vim /etc/ntp.conf
            Remove the default configuration:
                restrict default kod nomodify notrap nopeer noquery
                restrict -6 default kod nomodify notrap nopeer noquery
                restrict 127.0.0.1
                restrict -6 ::1

                server 0.centos.pool.ntp.org
                server 1.centos.pool.ntp.org
                server 2.centos.pool.ntp.org
            Add:
                Configure which machines A allows to synchronize with it:
                restrict 192.168.59.0 mask 255.255.255.0 nomodify notrap
                Configure which upstream server A synchronizes with:
                server 202.112.10.36
                Configure local clock synchronization (127.127.1.0 is the reserved NTP address for the local clock, so that this machine can serve time to its clients):
                server 127.127.1.0
                fudge   127.127.1.0 stratum 10
            Start the ntp service:
                sudo service ntpd start
        -> Configure B and C to synchronize with A:
            sudo vim /etc/ntp.conf
            server 192.168.59.223

            Manual synchronization:
                sudo ntpdate 192.168.59.223
            Start the ntp service:
                sudo service ntpd start
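To confirm that B and C are actually pulling time from A, the peer status can be checked (ntpq ships with the ntp package):

ntpq -p    # 192.168.59.223 should show up in the peer list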

The lazy way:

Set the same time on all three machines at once:

sudo date -s "2018-04-27 15:56:00"

Install JDK

(1) Upload the compressed package to the softwares directory

(2) Unzip to the specified directory

tar -zxvf /opt/softwares/jdk-8u91-linux-x64.tar.gz -C /opt/modules/

(3) Distribute it to the second and third machines

scp -r jdk1.8.0_91 <user>@192.168.59.224:/opt/modules/
scp -r jdk1.8.0_91 <user>@192.168.59.225:/opt/modules/

(<user> is the unified user created earlier)

(4) Configure environment variables (per machine)

vi /etc/profile
Add at the end of the file:
##JAVA_HOME
export JAVA_HOME=/opt/modules/jdk1.8.0_91
export PATH=$PATH:$JAVA_HOME/bin
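Then reload the profile and verify that the JDK is picked up:

source /etc/profile
java -version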

Install Hadoop

(1) Upload the compressed package to the softwares directory

(2) Unzip to the specified directory

tar -zxvf /opt/softwares/hadoop-2.7.3.tar.gz -C /opt/modules/
-z: gzip compression; -x: extract; -v: verbose output during extraction; -C: specifies the destination directory

(3) Distribution of roles across nodes

Machine 1: datanode, nodemanager, namenode (active)

Machine 2: datanode, nodemanager (namenode/resourcemanager backup)

Machine 3: datanode, nodemanager, resourcemanager (active)

The datanode stores data; the nodemanager processes data.

A nodemanager prioritizes reading from its local datanode to avoid transfers across the network (one of Hadoop's built-in optimizations).

The namenode and the resourcemanager are both master nodes, and both need to receive user requests. If they were placed together on node 1, its load would be relatively high, so they are distributed to different machines.

(4) Modify the configuration file

*-env.sh: configure environment variables

By default, Hadoop looks up JAVA_HOME from the global /etc/profile, but to avoid problems, configure the JAVA_HOME environment variable explicitly in the following three files under hadoop-2.7.3/etc/hadoop/: hadoop-env.sh, mapred-env.sh, and yarn-env.sh.
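In each of the three files the line to set is the same (the path matches the JDK installed earlier):

export JAVA_HOME=/opt/modules/jdk1.8.0_91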

*-site.xml: configure user-defined settings

core-site.xml: configure Hadoop's global properties

First create a temporary storage directory under the Hadoop directory to store metadata.
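For example, matching the hadoop.tmp.dir value configured below:

mkdir -p /opt/modules/hadoop-2.7.3/tmpData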

<configuration>
    <!-- fs.defaultFS: the HDFS entry point; set it to the first machine -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://bigdata-hpsk02.huadian.com:8020</value>
    </property>

    <!-- hadoop.tmp.dir: Hadoop's temporary storage directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/modules/hadoop-2.7.3/tmpData</value>
    </property>
</configuration>

hdfs-site.xml: configure the properties of hdfs

<configuration>
     //dfs.replication: number of file copies
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    //Access rights, users who are not the same as hdfs do not have permission to access. Close, let everyone access hdfs. Can't do it at work
    <property>
        <name>dfs.permissions.enabled</name>
        <value>true</value>
    </property>
</configuration>

mapred-site.xml: configure MapReduce

Let MapReduce run on YARN:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- jobhistory server settings can be added here -->
</configuration>

yarn-site.xml

<configuration>
    <!-- specify which machine the resourcemanager runs on -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>bigdata-training03.hpsk.com</value>
    </property>
    <!-- specify the auxiliary service so that MapReduce programs can shuffle on YARN -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

slaves: configure the addresses of all slave nodes (one per line)

bigdata-hpsk02.huadian.com
bigdata-hpsk03.huadian.com
bigdata-hpsk04.huadian.com

Distribution

Distribute the hadoop-2.7.3 directory from the first machine to the other machines:

scp -r hadoop-2.7.3 <user>@192.168.59.224:/opt/modules/
scp -r hadoop-2.7.3 <user>@192.168.59.225:/opt/modules/

Or have the other machines download it from the first one:

scp -r <user>@192.168.59.223:/opt/modules/hadoop-2.7.3 /opt/modules/

Note: make sure the unified user owns the /opt/modules directory on each machine.
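For example (with <username> standing in for the unified user created earlier):

sudo chown -R <username>:<username> /opt/modules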

Start and test

Format the file system. Formatting generates new metadata. Run the format on whichever machine the namenode will be started on.

bin/hdfs namenode -format

Start the corresponding process with a single process command

Start namenode on the first machine

sbin/hadoop-daemon.sh start namenode

The namenode saves metadata in the temporary storage directory configured earlier in core-site.xml. It contains the fsimage (a snapshot of the file system) and the edit logs (the sequence of changes to the file system).

Note: do not execute start-yarn.sh on the first machine; that script starts the resourcemanager master on whichever machine runs it, and we want the resourcemanager on the third machine.

Start the datanode of the three machines

sbin/hadoop-daemon.sh start datanode

Start yarn on the third machine

sbin/start-yarn.sh

An error was reported. Reason: the resourcemanager's hostname could not be resolved; the hostname configured for it is incorrect.

In yarn-site.xml, change yarn.resourcemanager.hostname to the hostname of the third machine:

bigdata-hpsk04.huadian.com

Finally, check the processes on each machine with jps:

First machine: NameNode, DataNode, NodeManager

Second machine: DataNode, NodeManager

Third machine: ResourceManager, DataNode, NodeManager

At this point, the distributed environment is built~

Create an input directory on HDFS.

Create a new test file with some sample content.

Upload it to the input directory.

Run the wordcount example program; you can see that the job connects to the third machine, which is where the resourcemanager runs.

View the results.
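A sketch of these steps as commands, run from the Hadoop install directory on the first machine (the file name and the /input and /output paths are illustrative; the examples jar is the one shipped with Hadoop 2.7.3):

bin/hdfs dfs -mkdir /input
echo "hadoop yarn hdfs hadoop mapreduce" > /opt/datas/wc.input
bin/hdfs dfs -put /opt/datas/wc.input /input
bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /input /output
bin/hdfs dfs -cat /output/part-r-00000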

To stop all processes, use the unified stop scripts.

Execute on the first machine:

sbin/stop-dfs.sh

This stops all namenode and datanode processes.

Execute on the third machine:

sbin/stop-yarn.sh

This stops the resourcemanager and all nodemanager processes.

Addendum: the secondarynamenode can be started on a relatively idle machine.
