Building a Hadoop platform (pseudo-distributed)

Much like Linux, Hadoop comes in many distributions, which fall broadly into two categories: the open-source community edition and commercial paid editions. The community edition is the version maintained by the Apache Software Foundation, i.e. the officially maintained release line; commercial editions come from third-party vendors that modify, integrate, and compatibility-test the various service components on top of the community version of Hadoop. The better-known stable offerings include Cloudera's CDP and CDH, Hortonworks' Hortonworks Data Platform (HDP), and MapR.

To attract users, the vendors behind these commercial distributions also give away some of their products as open source, such as Cloudera's CDH and Hortonworks' HDP. As a result, there are currently three main Hadoop versions that can be used free of charge: Apache Hadoop, Cloudera's CDH, and Hortonworks' HDP.

Apache Hadoop distribution

Apache Hadoop is the original Hadoop release. There have been three major release lines so far: Hadoop 1.x, Hadoop 2.x, and Hadoop 3.x. The features of each version are shown in the following table:

The Apache Hadoop distribution is provided both as a source package and as a binary package; the binary package is more convenient to use. It can be downloaded here: https://archive.apache.org/dist/hadoop/common/

Hortonworks Hadoop distribution

Hortonworks' main product is HDP, which is 100% open source and is the distribution closest to Apache Hadoop. HDP also ships Ambari, an open-source Hadoop management system that provides unified deployment, automatic configuration, automatic scaling, real-time status monitoring, and more; it is a fully featured big data operations and maintenance platform.

When using the HDP distribution, Ambari's management features allow Hadoop to be installed and deployed quickly, and they are also very helpful for day-to-day operation and maintenance of the big data platform. Ambari can be said to integrate seamlessly with HDP.

HDP has also gone through three major releases so far, HDP 1.x, HDP 2.x, and HDP 3.x, corresponding to the major versions released by Apache Hadoop. The HDP distribution is installed through Ambari, using the rpm packages provided by HDP; automatic installation and scaling can then be carried out on the Ambari platform.

Cloudera Hadoop distribution

Cloudera was the first company to commercialize Hadoop. Its main products currently include CDH, Cloudera Manager, and Cloudera Data Platform (CDP). The following table briefly introduces the characteristics of these products.

How to choose a release

As a user, how should you choose? The following criteria matter:

Whether it is an open-source product (i.e. whether it is free) — this is very important;

Whether there is a stable release; development versions cannot be used in production;

Whether it has been proven in practice, i.e. whether large companies already use it (you don't want to be the guinea pig);

Whether there is active community support and sufficient documentation, because when problems come up they can usually be solved through community resources and web searches.

Among large domestic Internet companies, CDH and HDP are the most widely used distributions. I personally recommend HDP because it is simple to deploy and stable in operation.

Pseudo-distributed installation of Hadoop cluster

To let you quickly get a feel for what Hadoop does and how it is used, we first install a Hadoop cluster in pseudo-distributed mode, using the binary package of the Apache Hadoop distribution for rapid deployment. Fully distributed Hadoop clusters will be covered in more depth later.

Installation planning

A pseudo-distributed Hadoop installation requires only one machine, with a minimum hardware configuration of a 4-core CPU and 8 GB of memory. We use Hadoop 3.2.1, which requires at least JDK 8; here JDK 1.8.0_211 on CentOS 7.6 is used as an example. Based on operations experience, and with later upgrades and automation in mind, the Hadoop program is installed under /opt/hadoop and the Hadoop configuration files are placed under /etc/hadoop.
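Before starting, it may be worth confirming that the machine meets these minimums. A small sketch of such a check (standard CentOS commands; the expected values in the comments reflect the planning above):

nproc                       # expect 4 or more CPU cores
free -g                     # expect 8 GB or more of memory
cat /etc/redhat-release     # expect CentOS Linux release 7.6
java -version               # expect 1.8.0_xxx once the JDK below has been installed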

Installation process

Download the hadoop-3.2.1.tar.gz binary package of the Apache Hadoop distribution from https://mirror.bit.edu.cn/apache/hadoop/core/hadoop-3.2.1/. Installation is very simple: just unpack the file into place (here the tarball is assumed to have been downloaded into /opt/hadoop). The process is as follows:

[root@namenodemaster ~]#useradd hadoop
[root@namenodemaster ~]#mkdir /opt/hadoop
[root@namenodemaster ~]#cd /opt/hadoop
[root@namenodemaster hadoop]#tar zxvf hadoop-3.2.1.tar.gz
[root@namenodemaster hadoop]#ln -s hadoop-3.2.1 current
[root@namenodemaster hadoop]#chown -R hadoop:hadoop /opt/hadoop

Soft-linking the unpacked hadoop-3.2.1 directory to current is for the convenience of later operation and maintenance, since Hadoop version upgrades, automation, and similar tasks may be involved; with this layout the operations workload is greatly reduced.
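For illustration, a hypothetical later upgrade to a newer release would simply unpack the new version alongside the old one and repoint the symlink (the version number here is made up; only the pattern matters):

cd /opt/hadoop
tar zxvf hadoop-3.3.0.tar.gz          # unpack the assumed new release next to the old one
ln -sfn hadoop-3.3.0 current          # switch "current" to the new version in one step
chown -R hadoop:hadoop /opt/hadoop    # keep ownership consistent with the original install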

After the Hadoop program is installed, the configuration files also need to be copied to the /etc/hadoop directory. Perform the following operations:

[root@namenodemaster ~]# mkdir /etc/hadoop
[root@namenodemaster ~]# cp -r /opt/hadoop/current/etc/hadoop /etc/hadoop/conf
[root@namenodemaster ~]#  chown -R hadoop:hadoop  /etc/hadoop
[root@namenodemaster ~]# 

In this way, the configuration files end up in the /etc/hadoop/conf directory.

Next, a JDK also needs to be installed. Here JDK 1.8.0_211 is installed into the /usr/java directory, as follows:

[root@namenodemaster ~]# mkdir /usr/java
[root@namenodemaster ~]# cd /usr/java/
[root@namenodemaster java]# ls
jdk-8u211-linux-x64.tar.gz
[root@namenodemaster java]# tar zxvf jdk-8u211-linux-x64.tar.gz 
[root@namenodemaster java]# ln -s jdk1.8.0_211 default
[root@namenodemaster java]# ls
default  jdk1.8.0_211  jdk-8u211-linux-x64.tar.gz
[root@namenodemaster java]# 

The last step of this process, creating the symbolic link, is again for the convenience of later automated configuration and upgrades.

Finally, the environment variables of the hadoop user need to be set. The configuration is as follows:

[root@namenodemaster ~]# more /home/hadoop/.bashrc 
# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
	. /etc/bashrc
fi

# Uncomment the following line if you don't like systemctl's auto-paging feature:
# export SYSTEMD_PAGER=

# User specific aliases and functions
export JAVA_HOME=/usr/java/default
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/opt/hadoop/current
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export HADOOP_YARN_HOME=${HADOOP_HOME}
export CATALINA_BASE=${HTTPFS_CATALINA_HOME}
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HTTPFS_CONFIG=/etc/hadoop/conf
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
[root@namenodemaster ~]# 

The hadoop user created here is the administrator account that will manage the Hadoop platform from now on; all administrative operations on Hadoop must be performed as this user. Keep this in mind.

In addition, among the configured environment variables, pay special attention to the following two; if either is missing or misconfigured, some services will fail to start:

  • HADOOP_HOME specifies the directory where Hadoop is installed;

  • HADOOP_CONF_DIR specifies the Hadoop configuration file directory.

So far, Hadoop has basically been installed.
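A minimal sanity check, run as the hadoop user, can confirm that the environment variables took effect (this assumes the .bashrc above is loaded by the login shell):

su - hadoop
echo $HADOOP_HOME $HADOOP_CONF_DIR    # should print /opt/hadoop/current /etc/hadoop/conf
hadoop version                        # should report Hadoop 3.2.1
which hdfs yarn mapred                # should all resolve under /opt/hadoop/current/bin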

Configure Hadoop parameters

After the Hadoop installation is complete, let's first look at several important directories and files under the installation directory. Hadoop is installed in /opt/hadoop/current; open this directory, and the directories worth knowing are shown in the following table.

After understanding what each directory is for, start the configuration. Hadoop configuration is fairly involved, but that is a topic for later. In pseudo-distributed mode only one configuration file needs to be modified: core-site.xml, which is currently located in the /etc/hadoop/conf directory. Add the following content inside its <configuration> tag:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenodemaster</value>
</property>

Here, the fs.defaultFS property specifies the URI used to access the HDFS file system, optionally followed by an RPC port; if no port is given, the default is 8020. The host part, namenodemaster, can be the server's hostname or any other name, but it must be resolvable on the server, so add the following to /etc/hosts:

[root@namenodemaster ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.31 namenodemaster 
[root@namenodemaster ~]# 

192.168.1.31 here is the IP address of the server where Hadoop is installed. The setting can be verified as sketched below.
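An optional quick check that the hostname resolves and that the new setting is being read (run as the hadoop user, so HADOOP_CONF_DIR points at /etc/hadoop/conf):

ping -c 1 namenodemaster              # should answer from 192.168.1.31
hdfs getconf -confKey fs.defaultFS    # should print hdfs://namenodemaster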

Start the Hadoop service

After the configuration is done, the Hadoop services can be started. Even though this is pseudo-distributed mode, all of the Hadoop services must be started; they are, in order: NameNode, SecondaryNameNode, DataNode, ResourceManager, NodeManager, and JobHistoryServer.

That is all that needs to be said about what these services do for now; they will be explained in more depth later. The Hadoop cluster services must be started as the hadoop user, and each service must be started in order. Start them one by one as follows.

(1) Start the NameNode service

First, the NameNode needs to be formatted. The command is as follows:

[root@namenodemaster java]# su - hadoop
[hadoop@namenodemaster ~]$ cd /opt/hadoop/current/bin
[hadoop@namenodemaster bin]$ hdfs  namenode -format

 

Then the NameNode service can be started. The process is as follows:

[hadoop@namenodemaster conf]$ pwd
/etc/hadoop/conf
[hadoop@namenodemaster conf]$ hdfs --daemon start namenode
[hadoop@namenodemaster conf]$ jps|grep NameNode
3956 NameNode
[hadoop@namenodemaster conf]$ 

Use the jps command to check whether the NameNode process started normally. If it did not, check the NameNode startup log for exceptions; the log file here is /opt/hadoop/current/logs/hadoop-hadoop-namenode-namenodemaster.log.

After the NameNode has started, its status can be viewed through a web page. By default, HTTP port 9870 is opened, so the NameNode service status can be viewed at http://192.168.1.31:9870, as shown in the figure below.

In the figure above, the key information marked with red boxes deserves attention. The first item is the NameNode access address, hdfs://namenodemaster:8020, which is the value specified in the configuration file; the page also shows the Hadoop version, the operating mode, the capacity, and the "Live node" and "Dead node" counts. These are explained one by one below.

The operating mode shows "Safe mode is on", which means the NameNode is currently running in safe mode. The page itself explains why: when the NameNode starts, it checks the state of the DataNodes, and it only leaves safe mode once the number of blocks reported by the DataNodes reaches 0.999 of the number of blocks recorded in its metadata; until then it keeps running in safe mode. Safe mode is also called read-only mode, because no data can be written to HDFS while it is active. Since the DataNode service has not been started yet, the cluster is necessarily still in safe mode.
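For reference, safe mode can also be inspected, or left manually, from the command line as the hadoop user; leaving it by hand is normally unnecessary, since the NameNode exits safe mode on its own once the DataNodes report in:

hdfs dfsadmin -safemode get      # prints whether safe mode is ON or OFF
hdfs dfsadmin -safemode leave    # force the NameNode out of safe mode (rarely needed)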

HDFS capacity: Configured Capacity currently shows 0, again because the DataNode service has not been started yet; once it starts, the real capacity will be displayed.

" Live node " and " Dead node " respectively display the active DataNode nodes and failed (dead) DataNode nodes in the current cluster. Operation and maintenance often monitors the value of "Dead node" on this page to determine whether the cluster is abnormal.

(2) Start the SecondaryNameNode service

[hadoop@namenodemaster conf]$  hdfs --daemon start secondarynamenode
[hadoop@namenodemaster conf]$  jps|grep SecondaryNameNode
4652 SecondaryNameNode
[hadoop@namenodemaster conf]$ 

As with the NameNode, if the SecondaryNameNode process fails to start, check the startup log /opt/hadoop/current/logs/hadoop-hadoop-secondarynamenode-namenodemaster.log for exceptions.

(3) Start the DataNode service

[hadoop@namenodemaster conf]$ hdfs --daemon start datanode
[hadoop@namenodemaster conf]$  jps|grep DataNode
4514 DataNode
[hadoop@namenodemaster conf]$ 

If it fails to start, check the /opt/hadoop/current/logs/hadoop-hadoop-datanode-namenodemaster.log file for exceptions during DataNode startup.

At this point, the distributed file system HDFS is up, and the file system can be read and written. Check the NameNode status page at http://192.168.1.31:9870 again, as shown in the figure.

As the figure shows, safe mode has now been turned off, and the cluster capacity and active-node count now show real values, because the DataNode service has started normally.

(4) Start the ResourceManager service

Next, the distributed computing services also need to be started. The first to start is the ResourceManager, as follows:

[hadoop@namenodemaster ~]$ yarn --daemon start resourcemanager
[hadoop@namenodemaster ~]$ jps|grep ResourceManager
4838 ResourceManager
[hadoop@namenodemaster ~]$ 

Note that the command to start the ResourceManager service is yarn rather than hdfs; remember this detail.

Similarly, if the ResourceManager process cannot be started, you can check the /opt/hadoop/current/logs/hadoop-hadoop-resourcemanager-namenodemaster.log log file to troubleshoot the ResourceManager startup problem.

After the ResourceManager service starts, it opens HTTP port 8088 by default. The ResourceManager web status page can be viewed at http://192.168.1.31:8088, as shown in the figure below.

In this figure, what deserves attention is the memory, the CPU resources, and the number of active nodes available to the ResourceManager. These values are all 0 at the moment because the NodeManager service has not been started yet.

(5) Start the NodeManager service

After starting the ResourceManager service, you can start the NodeManager service. The operation process is as follows:

[hadoop@namenodemaster ~]$  yarn --daemon start nodemanager
[hadoop@namenodemaster ~]$ jps|grep NodeManager
5160 NodeManager
[hadoop@namenodemaster ~]$ 

If anything goes wrong, check the /opt/hadoop/current/logs/hadoop-hadoop-nodemanager-namenodemaster.log file to troubleshoot the NodeManager.

(6) Start the JobHistoryServer service

Once the ResourceManager and NodeManager services are up, one final service, the JobHistoryServer, needs to be started. The process is as follows:

[hadoop@namenodemaster ~]$ mapred  --daemon start historyserver
[hadoop@namenodemaster ~]$ jps|grep JobHistoryServer
5336 JobHistoryServer
[hadoop@namenodemaster ~]$ 

Note that the command to start the JobHistoryServer service is mapred rather than yarn, because the JobHistoryServer is part of MapReduce. Once started, the JobHistoryServer opens an HTTP port, 19888 by default; the history of completed jobs can be viewed through this port, as shown in the following figure.
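As a rough availability check, the three web UIs started so far can be probed from the shell (IP and ports as used in this article; an HTTP 200 response means the page is up):

curl -s -o /dev/null -w "%{http_code}\n" http://192.168.1.31:9870     # NameNode UI
curl -s -o /dev/null -w "%{http_code}\n" http://192.168.1.31:8088     # ResourceManager UI
curl -s -o /dev/null -w "%{http_code}\n" http://192.168.1.31:19888    # JobHistoryServer UI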

At this point, pseudo-distributed Hadoop is up and running, and the status of every process can be checked with the jps command:

[hadoop@namenodemaster ~]$ jps
4514 DataNode
3956 NameNode
5493 Jps
4838 ResourceManager
5160 NodeManager
5336 JobHistoryServer
4652 SecondaryNameNode
[hadoop@namenodemaster ~]$ 

If nothing unexpected happened, the process name of each service is listed, which indicates that all Hadoop services have started normally.
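For reference, each service has a matching stop command; a sketch of shutting everything down again, roughly in the reverse order of startup:

mapred --daemon stop historyserver
yarn --daemon stop nodemanager
yarn --daemon stop resourcemanager
hdfs --daemon stop datanode
hdfs --daemon stop secondarynamenode
hdfs --daemon stop namenode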

Distributed storage using Hadoop HDFS commands

Hadoop's HDFS is a distributed file system. It is operated through the HDFS shell, whose commands are very similar to Linux commands, so anyone familiar with Linux can pick up the HDFS shell quickly.

The few examples below will quickly show how the HDFS shell is used. Note that the HDFS shell should be run as the hadoop user or another ordinary user.
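Before the concrete examples, a few general-purpose subcommands are worth knowing; a small sketch, run as the hadoop user:

hadoop fs -help ls      # built-in usage help for any subcommand
hadoop fs -df -h /      # overall HDFS capacity and usage, human readable
hadoop fs -du -h /      # space used per directory, human readable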

(1) To view the hdfs root directory data, you can use the following command:

[hadoop@namenodemaster ~]$  hadoop fs -ls /
Found 1 items
drwxrwx---   - hadoop supergroup          0 2021-01-31 11:50 /tmp

From the output of this command you can see that the newly created HDFS file system contains almost nothing yet, but you can create files and directories yourself.

(2) Create a logs directory in the hdfs root directory and execute the following commands

[hadoop@namenodemaster ~]$ hadoop fs -mkdir /logs

(3) Upload a file from the local to the /logs directory of hdfs, execute the following command:

[hadoop@namenodemaster ~]$ hadoop fs -put /data/test.txt /logs
[hadoop@namenodemaster ~]$ hadoop fs -put /data/db.gz  /logs
2021-01-31 12:01:23,014 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
[hadoop@namenodemaster ~]$ hadoop fs -ls /logs
Found 2 items
-rw-r--r--   3 hadoop supergroup      10240 2021-01-31 12:01 /logs/db.gz
-rw-r--r--   3 hadoop supergroup          4 2021-01-31 11:59 /logs/test.txt
[hadoop@namenodemaster ~]$ 

Note that /data/test.txt and db.gz here are local files under the operating system. By executing the put command, you can see that the files have been transferred from the local disk to HDFS.

(4) To view the content of a text file in hdfs, execute the following command:

[hadoop@namenodemaster ~]$ hadoop fs -cat /logs/test.txt
2021-01-31 12:02:55,297 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
123
[hadoop@namenodemaster ~]$ hadoop fs -text /logs/db.gz

As you can see, compressed files on HDFS can also be viewed directly with the "-text" option, because by default Hadoop automatically recognizes common compression formats.

(5) To delete a file on HDFS, execute the following command:

[hadoop@namenodemaster ~]$ hadoop fs  -rm  -r /logs/test.txt
Deleted /logs/test.txt
[hadoop@namenodemaster ~]$ hadoop fs -cat /logs/test.txt
cat: `/logs/test.txt': No such file or directory
[hadoop@namenodemaster ~]$ 

Note that files on HDFS can only be created and deleted; an existing file cannot be modified in place. To update a file on HDFS, delete it first and then upload the new version, as sketched below.
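For example, a hypothetical "update" of the db.gz file uploaded earlier would look like this:

hadoop fs -rm /logs/db.gz           # remove the old copy from HDFS
hadoop fs -put /data/db.gz /logs    # upload the new version from the local disk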

Run a MapReduce program in Hadoop

To experience Hadoop's distributed computing capability, we will use a MapReduce demo program bundled with the Hadoop installation package to run a simple MR job.

The demo program is located under $HADOOP_HOME/share/hadoop/mapreduce, which in this environment is /opt/hadoop/current/share/hadoop/mapreduce. In that directory there is a jar file named hadoop-mapreduce-examples-3.2.1.jar; with this file, the following operations become much simpler.

Word count is one of the simplest examples and best embodies the idea of MapReduce; it can be called the "Hello World" of MapReduce. The hadoop-mapreduce-examples-3.2.1.jar file includes a wordcount program whose job is to count how many times each word appears in a set of text files. Let's run this analysis now.

(1) Create a new file

[hadoop@namenodemaster /opt]$ vim demo.txt
[hadoop@namenodemaster /opt]$ cat !$
cat demo.txt
Linux Unix windows
hadoop Linux spark
hive hadoop Unix
MapReduce hadoop  Linux hive
windows hadoop spark

[hadoop@namenodemaster ~]$ 

(2) Save the created file into HDFS

[hadoop@namenodemaster ~]$ hadoop fs -mkdir /demo
[hadoop@namenodemaster ~]$ hadoop fs -put /opt/demo.txt /demo
2021-01-31 12:06:54,615 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
[hadoop@namenodemaster ~]$ hadoop fs -ls /demo
Found 1 items
-rw-r--r--   3 hadoop supergroup        106 2021-01-31 12:06 /demo/demo.txt
[hadoop@namenodemaster ~]$ 

Here a /demo directory was created on HDFS, and then the local file created above was uploaded into it. This example uses a single file; to analyze the contents of multiple files, simply upload them all into the /demo directory on HDFS.

(3) Perform analysis and calculation tasks

Now start the analysis task:

[hadoop@namenodemaster ~]$ hadoop jar /opt/hadoop/current/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar  wordcount /demo  /output
2021-01-31 12:08:32,922 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-01-31 12:08:32,965 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-01-31 12:08:32,965 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-01-31 12:08:33,255 INFO input.FileInputFormat: Total input files to process : 1
2021-01-31 12:08:33,272 INFO mapreduce.JobSubmitter: number of splits:1
2021-01-31 12:08:33,346 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1276790053_0001
2021-01-31 12:08:33,346 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-01-31 12:08:33,416 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
2021-01-31 12:08:33,416 INFO mapreduce.Job: Running job: job_local1276790053_0001
2021-01-31 12:08:33,417 INFO mapred.LocalJobRunner: OutputCommitter set in config null
2021-01-31 12:08:33,422 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2021-01-31 12:08:33,422 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2021-01-31 12:08:33,423 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2021-01-31 12:08:33,453 INFO mapred.LocalJobRunner: Waiting for map tasks
2021-01-31 12:08:33,453 INFO mapred.LocalJobRunner: Starting task: attempt_local1276790053_0001_m_000000_0
2021-01-31 12:08:33,470 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2021-01-31 12:08:33,470 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2021-01-31 12:08:33,575 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
2021-01-31 12:08:33,600 INFO mapred.MapTask: Processing split: hdfs://namenodemaster/demo/demo.txt:0+106
2021-01-31 12:08:34,799 INFO mapreduce.Job: Job job_local1276790053_0001 running in uber mode : false
2021-01-31 12:08:34,799 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
2021-01-31 12:08:34,799 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
2021-01-31 12:08:34,799 INFO mapred.MapTask: soft limit at 83886080
2021-01-31 12:08:34,799 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
2021-01-31 12:08:34,799 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
2021-01-31 12:08:34,800 INFO mapreduce.Job:  map 0% reduce 0%
2021-01-31 12:08:34,802 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2021-01-31 12:08:34,866 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2021-01-31 12:08:35,091 INFO mapred.LocalJobRunner: 
2021-01-31 12:08:35,093 INFO mapred.MapTask: Starting flush of map output
2021-01-31 12:08:35,093 INFO mapred.MapTask: Spilling map output
2021-01-31 12:08:35,093 INFO mapred.MapTask: bufstart = 0; bufend = 168; bufvoid = 104857600
2021-01-31 12:08:35,093 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214336(104857344); length = 61/6553600
2021-01-31 12:08:35,159 INFO mapred.MapTask: Finished spill 0
2021-01-31 12:08:35,202 INFO mapred.Task: Task:attempt_local1276790053_0001_m_000000_0 is done. And is in the process of committing
2021-01-31 12:08:35,207 INFO mapred.LocalJobRunner: map
2021-01-31 12:08:35,207 INFO mapred.Task: Task 'attempt_local1276790053_0001_m_000000_0' done.
2021-01-31 12:08:35,213 INFO mapred.Task: Final Counters for attempt_local1276790053_0001_m_000000_0: Counters: 24
	File System Counters
		FILE: Number of bytes read=316693
		FILE: Number of bytes written=841987
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=106
		HDFS: Number of bytes written=0
		HDFS: Number of read operations=5
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=1
		HDFS: Number of bytes read erasure-coded=0
	Map-Reduce Framework
		Map input records=6
		Map output records=16
		Map output bytes=168
		Map output materialized bytes=95
		Input split bytes=100
		Combine input records=16
		Combine output records=7
		Spilled Records=7
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=0
		Total committed heap usage (bytes)=303562752
	File Input Format Counters 
		Bytes Read=106
2021-01-31 12:08:35,213 INFO mapred.LocalJobRunner: Finishing task: attempt_local1276790053_0001_m_000000_0
2021-01-31 12:08:35,215 INFO mapred.LocalJobRunner: map task executor complete.
2021-01-31 12:08:35,218 INFO mapred.LocalJobRunner: Waiting for reduce tasks
2021-01-31 12:08:35,219 INFO mapred.LocalJobRunner: Starting task: attempt_local1276790053_0001_r_000000_0
2021-01-31 12:08:35,253 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2021-01-31 12:08:35,253 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2021-01-31 12:08:35,253 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
2021-01-31 12:08:35,277 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@5a9ca056
2021-01-31 12:08:35,278 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2021-01-31 12:08:35,364 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=1299919616, maxSingleShuffleLimit=324979904, mergeThreshold=857947008, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2021-01-31 12:08:35,366 INFO reduce.EventFetcher: attempt_local1276790053_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2021-01-31 12:08:35,441 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1276790053_0001_m_000000_0 decomp: 91 len: 95 to MEMORY
2021-01-31 12:08:35,444 INFO reduce.InMemoryMapOutput: Read 91 bytes from map-output for attempt_local1276790053_0001_m_000000_0
2021-01-31 12:08:35,446 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 91, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->91
2021-01-31 12:08:35,448 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
2021-01-31 12:08:35,464 INFO mapred.LocalJobRunner: 1 / 1 copied.
2021-01-31 12:08:35,464 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
2021-01-31 12:08:35,464 WARN io.ReadaheadPool: Failed readahead on ifile
EBADF: Bad file descriptor
	at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posix_fadvise(Native Method)
	at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posixFadviseIfPossible(NativeIO.java:271)
	at org.apache.hadoop.io.nativeio.NativeIO$POSIX$CacheManipulator.posixFadviseIfPossible(NativeIO.java:148)
	at org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:209)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2021-01-31 12:08:35,535 INFO mapred.Merger: Merging 1 sorted segments
2021-01-31 12:08:35,535 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 83 bytes
2021-01-31 12:08:35,536 INFO reduce.MergeManagerImpl: Merged 1 segments, 91 bytes to disk to satisfy reduce memory limit
2021-01-31 12:08:35,536 INFO reduce.MergeManagerImpl: Merging 1 files, 95 bytes from disk
2021-01-31 12:08:35,536 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
2021-01-31 12:08:35,536 INFO mapred.Merger: Merging 1 sorted segments
2021-01-31 12:08:35,537 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 83 bytes
2021-01-31 12:08:35,537 INFO mapred.LocalJobRunner: 1 / 1 copied.
2021-01-31 12:08:35,587 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
2021-01-31 12:08:35,886 INFO mapreduce.Job:  map 100% reduce 0%
2021-01-31 12:08:35,894 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2021-01-31 12:08:36,319 INFO mapred.Task: Task:attempt_local1276790053_0001_r_000000_0 is done. And is in the process of committing
2021-01-31 12:08:36,321 INFO mapred.LocalJobRunner: 1 / 1 copied.
2021-01-31 12:08:36,321 INFO mapred.Task: Task attempt_local1276790053_0001_r_000000_0 is allowed to commit now
2021-01-31 12:08:36,337 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1276790053_0001_r_000000_0' to hdfs://namenodemaster/output
2021-01-31 12:08:36,338 INFO mapred.LocalJobRunner: reduce > reduce
2021-01-31 12:08:36,338 INFO mapred.Task: Task 'attempt_local1276790053_0001_r_000000_0' done.
2021-01-31 12:08:36,338 INFO mapred.Task: Final Counters for attempt_local1276790053_0001_r_000000_0: Counters: 30
	File System Counters
		FILE: Number of bytes read=316915
		FILE: Number of bytes written=842082
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=106
		HDFS: Number of bytes written=61
		HDFS: Number of read operations=10
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=3
		HDFS: Number of bytes read erasure-coded=0
	Map-Reduce Framework
		Combine input records=0
		Combine output records=0
		Reduce input groups=7
		Reduce shuffle bytes=95
		Reduce input records=7
		Reduce output records=7
		Spilled Records=7
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=291
		Total committed heap usage (bytes)=303562752
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Output Format Counters 
		Bytes Written=61
2021-01-31 12:08:36,339 INFO mapred.LocalJobRunner: Finishing task: attempt_local1276790053_0001_r_000000_0
2021-01-31 12:08:36,339 INFO mapred.LocalJobRunner: reduce task executor complete.
2021-01-31 12:08:36,887 INFO mapreduce.Job:  map 100% reduce 100%
2021-01-31 12:08:36,887 INFO mapreduce.Job: Job job_local1276790053_0001 completed successfully
2021-01-31 12:08:36,897 INFO mapreduce.Job: Counters: 36
	File System Counters
		FILE: Number of bytes read=633608
		FILE: Number of bytes written=1684069
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=212
		HDFS: Number of bytes written=61
		HDFS: Number of read operations=15
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=4
		HDFS: Number of bytes read erasure-coded=0
	Map-Reduce Framework
		Map input records=6
		Map output records=16
		Map output bytes=168
		Map output materialized bytes=95
		Input split bytes=100
		Combine input records=16
		Combine output records=7
		Reduce input groups=7
		Reduce shuffle bytes=95
		Reduce input records=7
		Reduce output records=7
		Spilled Records=14
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=291
		Total committed heap usage (bytes)=607125504
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=106
	File Output Format Counters 
		Bytes Written=61
[hadoop@namenodemaster ~]$ hadoop fs -ls /output
Found 2 items
-rw-r--r--   3 hadoop supergroup          0 2021-01-31 12:08 /output/_SUCCESS
-rw-r--r--   3 hadoop supergroup         61 2021-01-31 12:08 /output/part-r-00000
[hadoop@namenodemaster ~]$ hadoop fs -text /output/part-r-00000
2021-01-31 12:09:31,329 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
Linux	3
MapReduce	1
Unix	2
hadoop	4
hive	2
spark	2
windows	2
[hadoop@namenodemaster ~]$ 

In the operation above, the job is run with "hadoop jar" followed by the example jar file and the name of the program to execute, wordcount. Note that the last two arguments are paths on HDFS: the first is the directory the job reads its input from, and it must exist; the second is the directory the job writes its output to, and it must not exist, because the job creates it automatically. If the job is rerun, the old output directory must be deleted first, as sketched below.
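A minimal sketch of a rerun (paths as used above):

hadoop fs -rm -r /output    # the job will not start if /output already exists
hadoop jar /opt/hadoop/current/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount /demo /output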

After the job finishes, two files can be seen in the /output directory:

  • _SUCCESS, the job completion marker, indicating that the job succeeded;

  • part-r-00000, the output file. Typical names are part-m-00000, part-r-00000, and so on: files marked m are mapper output, files marked r are reducer output, and the trailing 00000 is the task (partition) number. Here part-r-00000 is the final result file.

Viewing the content of part-r-00000 shows the wordcount result: the left column is each counted word, and the right column is the number of times the word appears in the file.
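If the result is needed on the local disk for further processing, it can be copied out of HDFS; a small sketch (the local file names are only examples):

hadoop fs -get /output/part-r-00000 ./wordcount_result.txt    # copy the single result file locally
hadoop fs -getmerge /output ./wordcount_all.txt               # or merge all part files into one local file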

(4) Display the job on the ResourceManager web page

You may have noticed that although the wordcount job run on the command line above reported success and produced the correct statistics, it does not appear on the ResourceManager web page.

The reason is simple: the MapReduce job was never actually submitted to YARN, because MapReduce's default execution framework is local. To make MapReduce run on YARN, a few configuration parameters are needed.

Two configuration files need to be modified, mapred-site.xml and yarn-site.xml. Find them in the configuration file directory (/etc/hadoop/conf).

Open the mapred-site.xml file and add the following content inside the <configuration> tag:

[hadoop@namenodemaster conf]$ pwd
/etc/hadoop/conf
[hadoop@namenodemaster conf]$ vim mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
</property>
<property>
   <name>yarn.app.mapreduce.am.env</name>
   <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
   <name>mapreduce.map.env</name>
   <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
   <name>mapreduce.reduce.env</name>
   <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
</configuration>
[hadoop@namenodemaster conf]$ 

Here, the mapreduce.framework.name option specifies MapReduce's execution framework and is set to yarn. The other three options provide environment information that MapReduce needs at runtime.

Finally, modify the other file, yarn-site.xml, and add the following content inside its <configuration> tag:

[hadoop@namenodemaster conf]$ vim yarn-site.xml 
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
</configuration>
[hadoop@namenodemaster conf]$ 

Here, the yarn.nodemanager.aux-services option lists the auxiliary services that run on the NodeManager; it must be set to mapreduce_shuffle for MapReduce programs to run on YARN.

After modifying the configuration, the ResourceManager and NodeManager services must be restarted for the changes to take effect.
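The restart itself is just a stop and start of the two YARN daemons, run as the hadoop user:

yarn --daemon stop nodemanager
yarn --daemon stop resourcemanager
yarn --daemon start resourcemanager
yarn --daemon start nodemanager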

Now rerun the wordcount MapReduce job from before; this time the job shows up on the ResourceManager web page, as shown in the following figure:

The figure clearly shows the job ID, the user who ran the job, the application name, the job type, the queue, the priority, the start and finish times, the final state, and so on. From an operations point of view there is a lot worth watching on this page, for example whether a job's final state is failed; if so, click the History link in the second-to-last column, "Tracking UI", to view the logs and troubleshoot.

The NameNode and ResourceManager web pages are used constantly in big data operations work, mainly for status monitoring and troubleshooting; more details and techniques will be covered later.
