Ubuntu 16.04——Hadoop cluster installation and configuration

Hadoop cluster installation and configuration

Hadoop cluster installation and configuration is divided into two parts: the Master node (Master) and the Slave nodes (Slave). The configuration required on each is somewhat different; in general, the Master needs to do more than the Slaves. The following demonstrates what needs to be done on each side. Since every host's situation is different, the errors you encounter may also differ, so this article is for reference only.

environment

Host OS version: Windows 11

Virtual machine version: ubuntukylin-16.04-desktop-amd64

VMware version: VMware® Workstation 17 Pro

NIC: bridge mode

jdk version: jdk-8u162-linux-x64

hadoop version: hadoop-3.1.3

Note: The hardware version used is compatible with VMware 12.X.

Before reading further, it is assumed that the IP addresses have already been configured and the machines can access the Internet. Because readers may use different network adapters, some steps may not work well for NAT network users; again, this article is only for reference.

node configuration

We need to configure both the Master node and the Slave nodes. Since much of the work can be done once on the Master, we first let the Master do everything it can, and then make things as easy as possible for the Slaves. The central idea is that after the Master is configured, it packages the configured software and sends it to the Slaves with a transfer command. The Master therefore does more work, but this also saves effort overall.

Configuration prerequisites

Considering that many readers may have previously set up a pseudo-distributed installation, which can cause various problems, here are some simple prerequisites.

modify hostname

Before modifying the host name, we need to check the content of another file, because this will have a certain impact on our modification.

First, open /etc/hosts and check its content. Generally there should be only one 127.0.0.1 entry in the hosts file, and its corresponding host name should be localhost. If there are redundant 127.0.0.1 mappings they should be deleted, especially the "127.0.1.1 [hostname]" mapping record. The Linux system needs to be restarted after the modification.

sudo vim /etc/hosts

Remove the local mapping

After opening the file, you may see a line like the one highlighted in the red frame below. If it is not there, you can ignore this step. If it is, please comment it out or delete it; here I choose to comment it out. The effect after commenting is shown in the figure below.
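If the figures are not visible, here is a rough sketch of what the top of a freshly installed Ubuntu hosts file typically looks like after the change (the host name hadoop01 is just the example used in this article; the IPv6 entries further down can stay as they are):

127.0.0.1       localhost
#127.0.1.1      hadoop01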

Remove the local mapping 2

The reason for commenting it out is to avoid errors and warnings when running commands later. For example, as shown in the figure below, commands may complain that the host cannot be resolved: after the host name is modified, the mapping relationship changes, but if the old mapping has not been overwritten or deleted, the system cannot resolve the new host name, and this error appears.

Keeping host mappings causing errors

sudo vim /etc/hostname

After executing the command above, the file "/etc/hostname" is opened; this file records the host name. For example, if the host name set when installing the Ubuntu system was "hadoop01", the file contains only the line "hadoop01". Delete it and change it to "Master" (note that it is case-sensitive), then save and exit the vim editor. This completes the modification of the host name; the Linux system needs to be restarted to see the change.

Master
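As an optional alternative on Ubuntu 16.04 (which uses systemd), the same change can be made with hostnamectl; this is just a sketch, and editing /etc/hostname as above works equally well:

sudo hostnamectl set-hostname Master   # writes the new name into /etc/hostname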

modify host name

Pay attention to the change before and after the host name is modified. Before the modification, if you log in to the Linux system as user hadoop and open a terminal, the shell prompt shows the following:

hadoop@hadoop01:~$

After modifying the host name and restarting the system, log in as hadoop again and open a terminal; the shell prompt now shows the following:

hadoop@Master:~$

It is now easy to recognize that the current operation is being performed on the Master node, and there is no confusion with the Slave nodes.

changed host name

Then, execute the following command on the Master node to open and modify the "/etc/hosts" file on the Master node:

sudo vim /etc/hosts

After opening the file, add the following content at the end. Suppose we are using three hosts, so there are two Slave nodes; the format is as follows:

[Master's IP] Master
[Slave1's IP] Slave1
[Slave2's IP] Slave2

After the configuration is complete, the effect is similar to the figure below, except that I have four hosts and therefore three Slave nodes. The line with Master commented out is there because I use a bridged network, and the virtual machine gets a different address on the computer-room machine than on my laptop, so there are two candidate mappings for Master; since one is commented out, only the first one takes effect.
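For concreteness, with made-up bridged-LAN addresses (yours will differ; check with ifconfig or ip addr on each node), the added lines might look like this:

192.168.1.101   Master
192.168.1.102   Slave1
192.168.1.103   Slave2
192.168.1.104   Slave3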

Modify IP mapping

After the modification, please restart the Linux system on each node. This completes the hosts configuration on the Master and Slave nodes. Then execute the following commands on each node to test whether they can ping each other; if the ping fails, the subsequent configuration will not succeed:

ping Master -c 3   # stops after 3 pings; otherwise press Ctrl+C to interrupt the ping command
ping Slave1 -c 3

Only two commands are shown here; adjust them according to your actual situation (how many hosts you have). All hosts need to be able to ping each other for the cluster to connect. Password-free SSH login between the hosts is not covered in detail here; the rest of this article assumes it has already been configured.
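If you have several nodes, a small loop like the sketch below saves typing (adjust the node list to match your own hosts):

for node in Master Slave1 Slave2 Slave3; do
    echo "=== pinging $node ==="
    ping -c 3 "$node"
done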

Master communicates with Slave1

Master communicates with Slave2

Master communicates with Slave3

Master Configuration

On the Master, you need to configure environment variables and then configure Hadoop. There are five configuration files in total: workers, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml. Each is explained below. These notes reflect my own understanding; if anything is wrong, please point it out in the comment section.

Configure the PATH variable

If the PATH variable has not been configured yet, it needs to be configured on the Master node. First execute the command "vim ~/.bashrc", that is, open the "~/.bashrc" file with the vim editor:

sudo vim ~/.bashrc

Then, add the following line at the top of the file:

export PATH=$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin

After configuring the environment variables for the JDK and Hadoop, the ~/.bashrc file will contain a few lines like the following; the paths vary with the installation locations.
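The figure below shows my actual file. As a textual sketch, the lines typically look roughly like this; the JDK path is an assumption based on the jdk-8u162 package used in this article, so adjust both paths to your own install locations:

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_162
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin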

Modify environment variables

After saving, execute the command "source ~/.bashrc" to make the configuration take effect.

source ~/.bashrc

Configure cluster/distributed environment

When configuring the cluster/distributed mode, you need to modify the configuration files under the "/usr/local/hadoop/etc/hadoop" directory. Only the settings required for a normal startup are covered here, involving five files: workers, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml. More configuration items can be found in the official documentation. From here on, the configuration files are assumed to be under the Hadoop installation directory.

cd /usr/local/hadoop/etc/hadoop # change into the directory containing the configuration files
Modify the configuration file
(1) Modify the workers file

The host names of all data nodes need to be written into this file, one per line; the default is localhost (i.e. the local machine acts as a data node). The pseudo-distributed configuration uses this default, so the single node serves as both name node and data node. For a distributed configuration, you can keep localhost so that the Master acts as both name node and data node, or delete the localhost line so that the Master serves only as the name node.

vim workers 

The Master node is only used as a name node, so delete the original localhost in the workers file and add the following content:

Slave1
Slave2
Slave3

Because I have four hosts connected together and I chose the Master as the NameNode, the remaining Slave nodes act as DataNodes, so the host names of the three Slave nodes need to be written into the file. The result looks like this:

Modify the Workers file

(2) Modify the configuration file core-site.xml

Hadoop's configuration files are in XML format, and each configuration item is declared with <property>, <name>, and <value> elements. It is more convenient to edit them with gedit:

gedit ./etc/hadoop/core-site.xml

Please modify the core-site.xml file to the following content:

  1. fs.defaultFS , value format: file:/// ; the name of the default file system. It is a URI whose scheme and authority determine the file system implementation: the scheme determines the configuration property (fs.SCHEME.impl) naming the FileSystem implementation class, and the authority is used to determine the host, port, etc. of the file system.
  2. hadoop.tmp.dir , value format: /tmp/hadoop-${user.name} ; the base directory for other temporary directories.
<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://Master:9000</value>
        </property>
        <property>
                <name>hadoop.tmp.dir</name>
                <value>file:///usr/local/hadoop/tmp</value>
                <description>A base for other temporary directories.</description>
        </property>
</configuration>

Before modification:

Before modification
After modification:

after modification

The purpose of showing the file before and after the modification is to let you compare clearly the places that need to be changed, and to avoid changing things that should not be changed.

(3) Modify the configuration file hdfs-site.xml

Hadoop's distributed file system HDFS generally uses redundant storage, with a replication factor of 3, i.e. each piece of data is stored in three copies. If only one Slave node is used as a data node, the cluster has a single data node and can only keep one copy of the data, so dfs.replication would have to be set to 1. The value therefore needs to be chosen according to the number of data nodes.

vim hdfs-site.xml

The specific content of the modified hdfs-site.xml is as follows:

  1. dfs.namenode.secondary.http-address , value format: 0.0.0.0:9868 ; the secondary namenode HTTP server address and port.
  2. dfs.replication , the default data block replication factor. Because I have three DataNodes, I set the value to 3. The actual number of replicas can be specified when a file is created; if it is not specified at creation time, the default is used.
  3. dfs.namenode.name.dir , value format: file://${hadoop.tmp.dir}/dfs/name ; determines where on the local file system the DFS name node stores the name table (fsimage). If this is a comma-separated list of directories, the name table is replicated to all of them, for redundancy.
  4. dfs.datanode.data.dir , value format: file://${hadoop.tmp.dir}/dfs/data ; determines where on the local file system a DFS data node stores its blocks. If this is a comma-separated list of directories, data is stored in all of the named directories, typically on different devices. The directories should be tagged with the appropriate storage type ([SSD]/[DISK]/[ARCHIVE]/[RAM_DISK]) for HDFS storage policies; if no storage type is explicitly tagged, the default is DISK. Directories that do not exist are created if the local file system permissions allow it.
<configuration>
        <property>
                <name>dfs.namenode.secondary.http-address</name>
                <value>Master:50090</value>
        </property>
        <property>
                <name>dfs.replication</name>
                <value>3</value>
        </property>
        <property>
                <name>dfs.namenode.name.dir</name>
                <value>file:///usr/local/hadoop/tmp/dfs/name</value>
        </property>
        <property>
                <name>dfs.datanode.data.dir</name>
                <value>file:///usr/local/hadoop/tmp/dfs/data</value>
        </property>
</configuration>

Before modification:

Before modification

After modification:

after modification

(4) Modify the file mapred-site.xml

In some Hadoop versions there is a mapred-site.xml.template file in the /usr/local/hadoop/etc/hadoop directory, which needs to be renamed to mapred-site.xml. It may not be there at all; in my installation it is not.
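If you want to check whether your release still ships the template rather than relying on my setup, a quick sketch:

ls /usr/local/hadoop/etc/hadoop/ | grep mapred   # lists mapred-site.xml and/or mapred-site.xml.template
# only needed when the .template file exists (typical of older releases):
cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml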

vim mapred-site.xml

Configure the mapred-site.xml file as follows:

  1. mapreduce.framework.name , whose value is chosen according to the situation, is the runtime framework used to execute MapReduce jobs. The value can be one of local, classic, or yarn.
  2. mapreduce.jobhistory.address , value format: 0.0.0.0:10020 ; the host and port of the MapReduce JobHistory Server IPC (inter-process communication) endpoint. The IPC host is usually the IP address or host name of the machine running the JobHistory Server, and the port number directs network traffic to the correct process. Clients such as the Hadoop web UI need these values to reach the JobHistory Server and view details of completed MapReduce jobs. By default the JobHistory Server uses port 10020 for IPC. If the JobHistory Server runs on a remote machine, you need to give its host and port when configuring clients to talk to it, for example by setting mapreduce.jobhistory.address to <hostname>:<port>; these values can also be passed to command-line Hadoop utilities. The setting lives in mapred-site.xml in the Hadoop configuration directory; if you change it, update the file and restart the JobHistory Server for the change to take effect.
  3. mapreduce.jobhistory.webapp.address , value format: <hostname>:<port>. hostname is the host name or IP address of the machine running the JobHistoryServer, and port is the port on which its web application listens for incoming requests. For example, if the JobHistoryServer runs on "historyserver.example.com" and listens on port 19888, setting this property to "historyserver.example.com:19888" lets you open the web application from a browser on another machine. Note that the JobHistoryServer web application relies on several other Hadoop services, including the ResourceManager, the HDFS NameNode, and the HDFS DataNodes; if you have trouble reaching it, check that those services are running and correctly configured.
  4. yarn.app.mapreduce.am.env , a list of environment variables for the MapReduce Application Master running under YARN. It can carry settings such as the Hadoop classpath and JAVA_HOME, making these values available in the MapReduce runtime environment. This parameter is very useful when configuring the YARN MapReduce environment.
  5. mapreduce.map.env , which specifies additional environment variables to set for the Map tasks of a MapReduce job. The value is a comma-separated list of key-value pairs, each representing one environment variable and its value. For example, JAVA_HOME=/usr/local/java,PATH=$PATH:/usr/local/bin sets JAVA_HOME to /usr/local/java and appends /usr/local/bin to PATH. When a Map task runs, these variables are set to the specified values for the application to use.
  6. mapreduce.reduce.env , a Hadoop MapReduce setting that specifies a list of environment variables to pass to Reduce tasks. Reduce tasks can use these variables to customize their behaviour or to access specific resources, for example a database or file system that is not available on every node in the cluster. The syntax is mapreduce.reduce.env=<var1>=<value1>,<var2>=<value2>,... , where <var1>,<var2>,... are variable names and <value1>,<value2>,... their corresponding values; multiple variables are separated by commas. Environment variables can also be set per job with the -D option of the hadoop command, e.g. hadoop jar myjob.jar -D mapreduce.reduce.env="VAR1=value1,VAR2=value2" , which sets VAR1 and VAR2 for all Reduce tasks in the job.
<configuration>
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
        <property>
                <name>mapreduce.jobhistory.address</name>
                <value>Master:10020</value>
        </property>
        <property>
                <name>mapreduce.jobhistory.webapp.address</name>
                <value>Master:19888</value>
        </property>
        <property>
                <name>yarn.app.mapreduce.am.env</name>
                <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
        </property>
        <property>
                <name>mapreduce.map.env</name>
                <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
        </property>
        <property>
                <name>mapreduce.reduce.env</name>
                <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
        </property> 
</configuration>

Before modification:

Before modification

After modification:

after modification

(5) Modify the file yarn-site.xml

yarn-site.xml is the configuration file for Apache Hadoop YARN (Yet Another Resource Negotiator), a distributed framework for parallel processing of large amounts of data across clusters of computers. The file consists of a set of properties that define the behaviour of YARN applications.

vim yarn-site.xml

Please configure the yarn-site.xml file as follows:

  1. yarn.resourcemanager.hostname , value format: 0.0.0.0 ; a configuration property used on Apache Hadoop YARN clusters to define the host name or IP address of the YARN ResourceManager. It is set in the yarn-site.xml file. The ResourceManager is the YARN service responsible for managing cluster resources, including memory, CPU, and disk, and for allocating them to running applications. When applications run in a YARN cluster, they contact the ResourceManager through the host name or IP address specified here to request and receive resource allocations. If the cluster has only one ResourceManager, setting this property on the YARN client is usually unnecessary, since the client finds the ResourceManager automatically; if there are multiple ResourceManagers, the property must be set explicitly on the client to determine which one it connects to.
  2. yarn.nodemanager.aux-services , an important NodeManager configuration parameter in YARN that specifies the auxiliary services started by the NodeManager. Auxiliary services are additional services run by the NodeManager to support applications; for MapReduce jobs the shuffle service (mapreduce_shuffle) must be enabled here. The setting normally lives in the yarn-site.xml file under the $HADOOP_HOME/etc/hadoop/ directory.
<configuration>
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>Master</value>
        </property>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
</configuration>

Before modification:

Before modification

After modification: one line is a comment, so you can actually ignore it.

after modification

Transfer files to the Slave node

After all five files above are configured, the "/usr/local/hadoop" folder on the Master node needs to be copied to each Slave node. If you have run the pseudo-distributed mode before, it is recommended to delete the temporary files generated by it before switching to cluster mode.

Because we want to package the entire Hadoop directory and send it to the other nodes, we first need to leave the Hadoop installation directory: use cd /usr/local to reach the directory one level above the installation, delete the temporary files, then pack, compress, and send the archive to the Slave nodes. Note: the scp command should not ask for a password at this point; if it does, the passwordless SSH configuration was unsuccessful and needs to be redone.
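Before running scp, a one-line check (Slave1 here stands for any of your nodes) confirms that passwordless SSH is in place:

ssh Slave1 hostname    # should print "Slave1" immediately, without asking for a password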

cd /usr/local
sudo rm -r ./hadoop/tmp     # delete the Hadoop temporary files
sudo rm -r ./hadoop/logs/*   # delete the log files
tar -zcf ~/hadoop.master.tar.gz ./hadoop   # compress first, then copy

Because I have not formatted the NameNode or started the DFS service yet, the tmp and logs directories do not exist in my case.

Delete the tmp directory and the logs directory

cd ~
scp ./hadoop.master.tar.gz Slave1:/home/hadoop # send the archive to the home directory on the Slave1 node
scp ./hadoop.master.tar.gz Slave2:/home/hadoop # send the archive to the home directory on the Slave2 node
scp ./hadoop.master.tar.gz Slave3:/home/hadoop # send the archive to the home directory on the Slave3 node

Slave configuration

modify hostname

Before modifying the host name, we need to check the content of another file, because this will have a certain impact on our modification.

Modify the /etc/hosts file

First, open this file and check its content. Generally there should be only one 127.0.0.1 entry in the hosts file, and its corresponding host name should be localhost. If there are redundant 127.0.0.1 mappings they should be deleted, especially the "127.0.1.1 [hostname]" mapping record. The Linux system needs to be restarted after the modification.

sudo vim /etc/hosts

After opening the file, you may see a line like the one highlighted in the red frame below. If it is not there, you can ignore this step. If it is, please comment it out or delete it; here I choose to comment it out. The effect after commenting is shown in the figure below.

Modify the mapping relationship of the Slave node

Then modify the /etc/hostname file:

sudo vim /etc/hostname

After executing the command above, the file "/etc/hostname" is opened; this file records the host name. For example, if the host name set when installing the Ubuntu system was "hadoop02", the file contains only the line "hadoop02". Delete it and change it to "Slave1" (note that it is case-sensitive), then save and exit the vim editor. This completes the modification of the host name; the Linux system needs to be restarted to see the change.

Slave1

modify host name

Pay attention to the change before and after the host name is modified. Before the modification, if you log in to the Linux system as user hadoop and open a terminal, the shell prompt shows the following:

hadoop@hadoop02:~$

After modifying the host name and restarting the system, log in as hadoop again and open a terminal; the shell prompt now shows the following:

hadoop@Slave1:~$ 

Host name after modification

After the modification is complete, please restart the Linux system on this node. This completes the configuration of the Slave node. Then execute the following commands on each node to test whether they can ping each other; if the ping fails, the subsequent configuration will not succeed:

ping Master -c 3   # stops after 3 pings; otherwise press Ctrl+C to interrupt the ping command
ping Slave1 -c 3

Only two commands are shown here; adjust them according to your actual situation (how many hosts you have). All hosts need to be able to ping each other for the cluster to connect. Password-free SSH login between the hosts is not covered in detail here; the rest of this article assumes it has already been configured.

Slave1 communicates with Master
Slave1 communicates with Slave2
Slave1 communicates with Slave3

Configure Hadoop

Assuming the Master node has been configured and the Hadoop archive has been sent to the Slave, perform the following steps. First, delete any old Hadoop installation; this is only needed if you previously installed a single-machine or pseudo-distributed setup, otherwise skip it. Then decompress the received archive into the /usr/local directory, and finally change the ownership of the directory.

sudo rm -r /usr/local/hadoop    # delete the old installation (if it exists)
sudo tar -zxvf ~/hadoop.master.tar.gz -C /usr/local # the v in zxvf shows the extraction progress
sudo chown -R hadoop /usr/local/hadoop

Similarly, if there are other Slave nodes, transfer hadoop.master.tar.gz to them and decompress the file on each of those nodes in the same way.
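As a quick sanity check after unpacking (just a suggestion, not a required step), you can confirm that each Slave sees the expected release:

/usr/local/hadoop/bin/hadoop version   # should report Hadoop 3.1.3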

Start the Hadoop cluster

When starting the Hadoop cluster for the first time, you need to format the name node on the Master node (this only needs to be done once; do not format the name node again when starting Hadoop later). The command is as follows:

hdfs namenode -format

Format HDFS

Now you can start Hadoop. The startup needs to be done on the Master node. Execute the following command:

start-dfs.sh
start-yarn.sh
mapred --daemon start historyserver

You can view the processes started on each node with the jps command. If everything started correctly, you can see the NameNode, ResourceManager, SecondaryNameNode, and JobHistoryServer processes on the Master node, as shown in the figure below.

Check the status of the master service

On the Slave nodes you can see the DataNode and NodeManager processes, as shown in the figures below:

Check the status of Slave1 service activation

Check the status of Slave2 service activation
Check the status of Slave3 service activation
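Instead of logging in to every machine, you could run jps from the Master over SSH; a rough sketch assuming passwordless SSH and that jps is on the remote PATH (if it is not, simply run jps locally on each node):

for node in Master Slave1 Slave2 Slave3; do
    echo "=== $node ==="
    ssh "$node" jps
done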

If any of these processes is missing, something went wrong. In addition, check on the Master node whether the data nodes started normally with the command hdfs dfsadmin -report. If the output reports Live datanodes as 3 (adjust according to your actual situation), the cluster has started successfully.

hdfs dfsadmin -report

Since I have 3 Slave nodes acting as data nodes, after the data nodes are successfully started, the information shown in the following figure will be displayed:

Data Node Activity

You can also enter http://[IP of the Master node]:9870/ in a browser on the Windows host, or http://master:9870/ in a browser inside the Linux system, to view the status of the name node and data nodes through the web page. If this fails, you can troubleshoot the cause through the startup logs (the logs directory).
Once again, the following points need attention when switching between pseudo-distributed and distributed mode:
(a) When switching from distributed back to pseudo-distributed, do not forget to modify the workers configuration file;
(b) When switching between the two, if the cluster cannot start normally, you can delete the temporary folders on the affected nodes. Although this deletes the previous data, it ensures that the cluster starts correctly. So if the cluster used to start but now fails, especially if the data nodes fail to start, try deleting the /usr/local/hadoop/tmp folder on all nodes (including the Slave nodes), run hdfs namenode -format again, and restart. A sketch of this recovery procedure follows below.
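The recovery sketch (destructive: it wipes all HDFS data, so use it only when the cluster refuses to start):

# on every node, Master and Slaves alike:
sudo rm -r /usr/local/hadoop/tmp
sudo rm -r /usr/local/hadoop/logs   # optional, clears old logs
# then, on the Master only:
hdfs namenode -format
start-dfs.sh
start-yarn.sh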

Schematic diagram of active nodes

Execute distributed instance

The process of executing the distributed instance is the same as the pseudo-distributed mode. First, create a user directory on HDFS. The command is as follows:

hdfs dfs -mkdir -p /user/hadoop

Then create an input directory in HDFS, and copy the configuration files in the /usr/local/hadoop/etc/hadoop directory to it as input files. The commands are as follows:

hdfs dfs -mkdir input
hdfs dfs -put /usr/local/hadoop/etc/hadoop/*.xml input

Copy the configuration file to the input directory

Then you can run the MapReduce job, the command is as follows:

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar grep input output 'dfs[a-z.]+'

The output information at runtime is similar to pseudo-distributed, and it will display the progress of the MapReduce job, as shown in the following figure:

MapReduce job progress

The execution process may be a bit slow, but if there is no progress for a long time, for example, if you don't see a progress change in 5 minutes, you may wish to restart Hadoop and test again. If restarting still fails, it is likely to be caused by insufficient memory. It is recommended to increase the memory of the virtual machine or change the memory configuration of YARN to solve it.

Working diagram

During execution, you can enter http://[Master's IP address]:8088/cluster in the address bar of a browser on the Windows host, or open a browser inside the Linux system and enter http://master:8088/cluster, to check the task progress through the web interface. Clicking the History link in the Tracking UI column of the web interface shows the running information of the task, as shown in the figure below:

Job completion diagram

Then we can view the output with the following command:

./bin/hdfs dfs -cat output/*

The output result is shown in the figure below:

Schematic diagram of the output result

Finally, to shut down the Hadoop cluster, you need to execute the following command on the Master node:

stop-yarn.sh
stop-dfs.sh
mapred --daemon stop historyserver

So far, the Hadoop cluster has been built.
