Building a Hadoop pseudo-distributed cluster environment on Linux

Proceed as follows:

 For the basic configuration of the Linux environment, please refer to my previous blog post: http://www.cnblogs.com/whcwkw1314/p/8921352.html

1. Preparation

1. Directory planning and software installation

Create directories: create four folders under /opt in the root filesystem:

- datas: test data
- softwares: software installation packages (tarballs)
- modules: software installation directory
- tools: development IDEs and tools

Check that the directories were created successfully.
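A minimal sketch of these commands, assuming the four folders live directly under /opt and the current user has sudo rights:

$ sudo mkdir -p /opt/datas /opt/softwares /opt/modules /opt/tools
$ ls /opt    # should list: datas  modules  softwares  tools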

 

Upload software:

Here I use lrzsz, a small upload/download tool for the Linux terminal:

$ sudo yum install -y lrzsz
 where
   rz: upload a file
   sz: download a file

Upload the Hadoop 2.x tarball into the softwares folder (the JDK was already installed in the previous blog post).

Add execute permission to the archive

Extract it to the target directory, modules.

After extraction, check the properties (owner and permissions) of the hadoop-2.7.3 folder.

Change the owner and group to the current user so that subsequent operations are safer; mistakes made as root can have far-reaching consequences.
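A sketch of these steps, assuming the tarball is named hadoop-2.7.3.tar.gz and the current user is huadian (adjust the file name and user to your own environment):

$ cd /opt/softwares
$ chmod u+x hadoop-2.7.3.tar.gz                             # add execute permission to the archive
$ tar -zxf hadoop-2.7.3.tar.gz -C /opt/modules/             # extract to the modules directory
$ sudo chown -R huadian:huadian /opt/modules/hadoop-2.7.3   # change owner and group to the current user
$ ls -l /opt/modules                                        # check the folder's properties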

 

Configure the *-env.sh environment variable files: hadoop-env.sh, yarn-env.sh, and mapred-env.sh.

Enter etc/hadoop inside the extracted hadoop-2.7.3 directory; there you can see these three files, along with the other configuration files we will edit later.

Here I use Notepad++ to connect to the virtual machine and edit the configuration files; it is intuitive and convenient.

First change hadoop-env.sh

Change the JAVA_HOME path to the path where the JDK was previously installed

export JAVA_HOME=/opt/modules/jdk1.8.0_91

Then make the same change in the other two files, yarn-env.sh and mapred-env.sh.

Give these three files execute permission
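For example, a minimal sketch run from the etc/hadoop directory:

$ chmod u+x hadoop-env.sh yarn-env.sh mapred-env.sh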

2. HDFS installation

 First configure the HDFS environment

1. Create a temporary data storage directory: a data folder containing a tmpData folder under the hadoop-2.7.3 directory (./ refers to the current directory).
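For example, run from the hadoop-2.7.3 directory:

$ mkdir -p ./data/tmpData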

2. Configure the core-site.xml file
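A minimal sketch of core-site.xml for a pseudo-distributed setup. The hostname matches the one used later in this post; port 8020 and the tmp directory path are assumptions to adapt to your own machine:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://bigdata-hpsk01.huadian.com:8020</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/modules/hadoop-2.7.3/data/tmpData</value>
</property>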

 3. Configure hdfs-site.xml
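A minimal sketch of hdfs-site.xml; for a pseudo-distributed cluster with a single DataNode, the replication factor is usually set to 1:

<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>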

4. Configure the slaves file: specify which machines the DataNode runs on

Each line in this file is the hostname of a machine on which a DataNode will run.
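For a pseudo-distributed cluster the file contains a single line with this machine's hostname, for example:

bigdata-hpsk01.huadian.com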

 

Next start the service

Format the HDFS file system (this only needs to be done the first time).

The namenode -format command of the bin/hdfs script is used to format the HDFS file system.
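The command, run from the hadoop-2.7.3 directory:

$ bin/hdfs namenode -format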

 The initialization succeeds if no error is reported.

 

Start the HDFS services: the master node (NameNode) and the slave node (DataNode).
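A sketch of the start-up commands, run from the hadoop-2.7.3 directory, starting one daemon at a time:

$ sbin/hadoop-daemon.sh start namenode    # master node
$ sbin/hadoop-daemon.sh start datanode    # slave node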

Then verify if it was successful

The first way: check the processes. Enter jps on the command line; if both NameNode and DataNode appear, the start-up was successful.

The second way: open http://bigdata-hpsk01.huadian.com:50070 in the web UI. To access it by hostname, make sure the mapping is configured in the local hosts file (I already configured the mapping in the previous blog post).

 If the following interface appears, the start-up was successful.

 

Test HDFS

1. Check the help document
$ bin/hdfs dfs

You can see the commands for HDFS operations

 

2. Create directory
$ bin/hdfs dfs -mkdir -p /datas/tmp

3. Upload file
$ bin/hdfs dfs -put etc/hadoop/core-site.xml /datas/tmp

4. List directory files
$ bin/hdfs dfs -ls /datas/tmp

5. View the contents of the file
$ bin/hdfs dfs -text /datas/tmp/core-site.xml

 Finally, go to the web UI to see whether the HDFS file system now has content.

 There is a tmp folder in the datas folder

and inside it is the core-site.xml file we uploaded.

The test is successful~

 

3. YARN installation

Configuring the YARN cluster
YARN is a distributed resource management and task scheduling framework. Many application frameworks run on YARN, for example:
- MapReduce: parallel data processing framework
- Spark: memory-based distributed computing framework
- Storm/Flink: real-time stream computing frameworks
- Tez: data analysis, faster than MapReduce
Master node: ResourceManager
Slave nodes: NodeManagers

Configure yarn-site.xml file
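A minimal sketch of yarn-site.xml; the hostname is the one used elsewhere in this post and should match your own machine:

<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>bigdata-hpsk01.huadian.com</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>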

Configure the slaves file
Specify the hostnames where NodeManager runs. Since NodeManager and DataNode run on the same machine here, this was already covered by the earlier slaves configuration.

 

Start the services

RM master node:
$ sbin/yarn-daemon.sh start resourcemanager
NM slave node:
$ sbin/yarn-daemon.sh start nodemanager

Enter jps to confirm that the processes started successfully; you can also view all Java-related processes with ps -ef | grep java.

Finally, verify through the web UI; if the following interface appears, everything is OK.

4. Running MapReduce

MapReduce: the parallel computing framework in Hadoop 2.x
Idea: divide and conquer
Core:
- Map (divide): process the data in parallel, splitting it into parts and handling them part by part
- Reduce (combine): merge the results produced by Map, which includes some of the business logic


A classic example in big data computing frameworks (we will run it below):
word frequency statistics (WordCount),
which counts the number of occurrences of each word in a file.


1. Configure MapReduce related properties:
go to etc/hadoop under hadoop-2.7.3 and copy the MapReduce template configuration file under this directory
$ cd etc/hadoop/
$ cp mapred-site.xml.template mapred-site.xml

 

Then modify the mapred-site.xml file to specify that the MapReduce program runs on YARN
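The key property to add, a minimal sketch:

<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>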

 

2. Submit the MapReduce program to run on YARN
- Prepare test data: cd into the previously created /opt/datas folder (used to store test data) and create a test data file wc.input.

Edit the data file and save:
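For example (the actual words are arbitrary; any whitespace-separated text works):

$ cd /opt/datas
$ vi wc.input

hadoop mapreduce hadoop
yarn hdfs
hadoop yarn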

Recursively create the /user/huadian/mapreduce/wordcount/input directory in the HDFS file system. The bin/hdfs dfs commands should be run from the hadoop-2.7.3 directory.

$ bin/hdfs dfs -mkdir -p /user/huadian/mapreduce/wordcount/input

Put our test data file in the input directory

$ bin/hdfs dfs -put /opt/datas/wc.input /user/huadian/mapreduce/wordcount/input

 

3. Submit and run
Here we use the official JAR package of the MapReduce program. The path is as follows:
${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar

$ bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount

Usage: wordcount <in> [<in>...] <out>

Parameter description:
<in>  -> the location of the data the MapReduce program will process
<out> -> the location where the MapReduce results are stored; this path must not already exist
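For example, using the input directory created above and an output directory that does not exist yet (a sketch; adjust the paths to your own setup):

$ bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /user/huadian/mapreduce/wordcount/input /user/huadian/mapreduce/wordcount/output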

View Results
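For example, assuming the output path used above (part-r-00000 is the default name of the reducer output file):

$ bin/hdfs dfs -text /user/huadian/mapreduce/wordcount/output/part-r-00000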

 

 Finally, start the log (JobHistory) service.

Configure mapred-site.xml:

<!-- configure history server -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>bigdata-hpsk01.huadian.com:10020</value>
</property>
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>bigdata-hpsk01.huadian.com:19888</value>
</property>

 

Start the service:

$ sbin/mr-jobhistory-daemon.sh start historyserver

 

Log aggregation function:
After a MapReduce program runs on YARN, the generated log files are uploaded to an HDFS directory for later monitoring and viewing.
Configure yarn-site.xml:

<!-- Configure YARN log aggregation function -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>

Restart the YARN and JobHistoryServer services so that they re-read the configuration properties.
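A sketch of the restart, stopping and starting each daemon with the same scripts used earlier:

$ sbin/yarn-daemon.sh stop resourcemanager
$ sbin/yarn-daemon.sh stop nodemanager
$ sbin/mr-jobhistory-daemon.sh stop historyserver
$ sbin/yarn-daemon.sh start resourcemanager
$ sbin/yarn-daemon.sh start nodemanager
$ sbin/mr-jobhistory-daemon.sh start historyserver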

OK~
