Hadoop的伪分布式模式

1 启动HDFS并运行MapReduce程序

a)配置:hadoop-env.sh

Linux系统中获取JDK的安装路径:

[[email protected] hadoop-2.7.2]# echo $JAVA_HOME
/opt/module/jdk1.8.0_144
[root@hadoop001 hadoop]# vim hadoop-env.sh

修改JAVA_HOME 路径:

(b)配置:core-site.xml

[[email protected] hadoop]# cat core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<!-- 指定HDFS中NameNode的地址 -->
<property>
<name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
</property>

<!-- 指定Hadoop运行时产生文件的存储目录 -->
<property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-2.7.2/data/tmp</value>
</property>


</configuration>
[root@hadoop001 hadoop]#

(c)配置:hdfs-site.xml

[[email protected] hadoop]# cat hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
        <name>dfs.replication</name>
        <value>1</value>
</property>
</configuration>
[root@hadoop001 hadoop]#

(2)启动集群

          (a)格式化NameNode(第一次启动时格式化,以后就不要总格式化)

[[email protected] hadoop-2.7.2]# bin/hdfs namenode -format
20/01/14 16:29:01 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = iZbp1efx14jd8471u20gpaZ/172.16.233.215
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.7.2
STARTUP_MSG:   classpath = /opt/module/hadoop-2.7.2/etc/hadoop:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/curator-client-2.7.1.jar:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/curator-recipes-2.7.1.jar:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/htrace-core-3.1.0-incubating.jar:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/zookeeper-3.4.6.jar:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/commons-compress-1.4.1.jar:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/api-util-1.0.0-M20.jar:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/commons-math3-3.1.1.jar:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/netty-3.6.2.Final.jar:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/commons-httpclient-3.1.jar:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/snappy-java-1.0.4.1.jar:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/guava-11.0.2.jar:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/log4j-1.2.17.jar:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/stax-api-1.0-2.jar:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/hadoop-auth-2.7.2.jar:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/jersey-json-1.9.jar:/opt/mo

              (b)启动NameNode

[[email protected] hadoop-2.7.2]# sbin/hadoop-daemon.sh start namenode
starting namenode, logging to /opt/module/hadoop-2.7.2/logs/hadoop-root-namenode-hadoop001.out

              (c)启动DataNode

[[email protected] hadoop-2.7.2]# sbin/hadoop-daemon.sh start datanode
starting datanode, logging to /opt/module/hadoop-2.7.2/logs/hadoop-root-datanode-hadoop001.out

(3)查看集群

              (a)查看是否启动成功

[[email protected] hadoop-2.7.2]# jps
23608 NameNode
23769 Jps
23691 DataNode
[root@hadoop001 hadoo

 

注意:jps是JDK中的命令,不是Linux命令。不安装JDK不能使用jps

             (b)web端查看HDFS文件系统

(c)查看产生的Log日志

                说明:在企业中遇到Bug时,经常根据日志提示信息去分析问题、解决Bug。

当前目录:/opt/module/hadoop-2.7.2/logs

[[email protected] hadoop-2.7.2]# cd /opt/module/hadoop-2.7.2/logs
[[email protected] logs]# ll
total 60
-rw-r--r-- 1 root root 23889 Jan 14 16:29 hadoop-root-datanode-hadoop001.log
-rw-r--r-- 1 root root   717 Jan 14 16:29 hadoop-root-datanode-hadoop001.out
-rw-r--r-- 1 root root 27523 Jan 14 16:29 hadoop-root-namenode-hadoop001.log
-rw-r--r-- 1 root root   717 Jan 14 16:29 hadoop-root-namenode-hadoop001.out
-rw-r--r-- 1 root root     0 Jan 14 16:29 SecurityAuth-root.audit

(d)思考:为什么不能一直格式化NameNode,格式化NameNode,要注意什么?

[[email protected] hadoop-2.7.2]# cd data/tmp/dfs/name/current/
[[email protected] current]# ll
total 1040
-rw-r--r-- 1 root root 1048576 Jan 14 16:29 edits_inprogress_0000000000000000001
-rw-r--r-- 1 root root     351 Jan 14 16:29 fsimage_0000000000000000000
-rw-r--r-- 1 root root      62 Jan 14 16:29 fsimage_0000000000000000000.md5
-rw-r--r-- 1 root root       2 Jan 14 16:29 seen_txid
-rw-r--r-- 1 root root     206 Jan 14 16:29 VERSION
[root@hadoop001 current]# cat VERSION
#Tue Jan 14 16:29:02 CST 2020
namespaceID=1486374433
clusterID=CID-4fa9a123-10f6-4bdd-95a6-c6faaa926c85
cTime=0
storageType=NAME_NODE
blockpoolID=BP-770615407-172.16.233.215-1578990542229
layoutVersion=-63
[root@hadoop001 current]# cd ../../data/current/
[[email protected] current]# ll
total 8
drwx------ 4 root root 4096 Jan 14 16:29 BP-770615407-172.16.233.215-1578990542229
-rw-r--r-- 1 root root  229 Jan 14 16:29 VERSION
[root@hadoop001 current]# cat VERSION
#Tue Jan 14 16:29:47 CST 2020
storageID=DS-b07b5055-82c9-4010-a583-dd75fe1e1097
clusterID=CID-4fa9a123-10f6-4bdd-95a6-c6faaa926c85
cTime=0
datanodeUuid=9002013a-cc17-4ae7-886f-3f0ef4d44595
storageType=DATA_NODE
layoutVersion=-56

注意:格式化NameNode,会产生新的集群id,导致NameNode和DataNode的集群id不一致,集群找不到已往数据。所以,格式NameNode时,一定要先删除data数据和log日志,然后再格式化NameNode。

(4)操作集群

              (a)在HDFS文件系统上创建一个input文件夹

[[email protected] hadoop-2.7.2]# bin/hdfs dfs -mkdir -p /user/topcheer/input

              (b)将测试文件内容上传到文件系统上

[[email protected] hadoop-2.7.2]# bin/hdfs dfs -put wcinput/wc.input    /user/topcheer/input/

 

              (c)查看上传的文件是否正确

[[email protected] hadoop-2.7.2]# bin/hdfs dfs -lsr /
lsr: DEPRECATED: Please use 'ls -R' instead.
drwxr-xr-x   - root supergroup          0 2020-01-14 16:45 /user
drwxr-xr-x   - root supergroup          0 2020-01-14 16:45 /user/topcheer
drwxr-xr-x   - root supergroup          0 2020-01-14 16:53 /user/topcheer/input
-rw-r--r--   1 root supergroup         48 2020-01-14 16:53 /user/topcheer/input/wc.input

              (d)运行MapReduce程序

[[email protected] hadoop-2.7.2]# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount  /user/topcheer/input/ /user/topcheer/output/
20/01/14 17:08:35 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
20/01/14 17:08:35 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
20/01/14 17:08:35 INFO input.FileInputFormat: Total input paths to process : 1
20/01/14 17:08:35 INFO mapreduce.JobSubmitter: number of splits:1
20/01/14 17:08:35 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local330020084_0001
20/01/14 17:08:36 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
20/01/14 17:08:36 INFO mapreduce.Job: Running job: job_local330020084_0001
20/01/14 17:08:36 INFO mapred.LocalJobRunner: OutputCommitter set in config null

              (e)查看输出结果

命令行查看:

[[email protected] hadoop-2.7.2]# bin/hdfs dfs -cat /user/topcheer/output/*
hadoop  2
mapreduce       1
topcheer        2
yarn    1
[root@hadoop001 hadoop-2.7.2]#

浏览器查看,如图所示

(f)将测试文件内容下载到本地

[[email protected] hadoop-2.7.2]$ hdfs dfs -get /user/topcheer/output/part-r-00000 ./wcoutput/ 

 2 启动YARN并运行MapReduce程序

(1)配置集群

 (a)配置yarn-env.sh

         配置一下JAVA_HOME

        

(b)配置yarn-site.xml

<configuration>

<!-- Reducer获取数据的方式 -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>

<!-- 指定YARN的ResourceManager的地址 -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop001</value>
</property>

<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>

<!-- 日志保留时间设置7天 -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>

</configuration>

(c)配置:mapred-env.sh

配置一下JAVA_HOME

[[email protected] hadoop]# cat mapred-env.sh
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# export JAVA_HOME=/home/y/libexec/jdk1.6.0/
export JAVA_HOME=/opt/module/jdk1.8.0_144

export HADOOP_JOB_HISTORYSERVER_HEAPSIZE=1000

export HADOOP_MAPRED_ROOT_LOGGER=INFO,RFA

#export HADOOP_JOB_HISTORYSERVER_OPTS=
#export HADOOP_MAPRED_LOG_DIR="" # Where log files are stored.  $HADOOP_MAPRED_HOME/logs by default.
#export HADOOP_JHS_LOGGER=INFO,RFA # Hadoop JobSummary logger.
#export HADOOP_MAPRED_PID_DIR= # The pid files are stored. /tmp by default.
#export HADOOP_MAPRED_IDENT_STRING= #A string representing this instance of hadoop. $USER by default
#export HADOOP_MAPRED_NICENESS= #The scheduling priority for daemons. Defaults to 0.
[[email protected] hadoop]#

 

(d)配置: (对mapred-site.xml.template重新命名为) mapred-site.xml

[[email protected] hadoop]# mv mapred-site.xml.template mapred-site.xml
[[email protected] hadoop]# vim mapred-site.xml 
[[email protected] hadoop]# cat mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<!-- 指定MR运行在YARN上 -->
      <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
       </property>

<!-- 历史服务器端地址 -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop001:10020</value>
</property>
<!-- 历史服务器web端地址 -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop001:19888</value>
</property>

</configuration>
[root@hadoop001 hadoop]#

 

(2)启动集群

(a)启动前必须保证NameNode和DataNode已经启动

(b)启动ResourceManager

[[email protected] hadoop-2.7.2]# sbin/yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /opt/module/hadoop-2.7.2/logs/yarn-root-resourcemanager-hadoop001.out

 

(c)启动NodeManager

[[email protected] hadoop-2.7.2]# sbin/yarn-daemon.sh start nodemanager
starting nodemanager, logging to /opt/module/hadoop-2.7.2/logs/yarn-root-nodemanager-hadoop001.out

 

(3)集群操作

(a)YARN的浏览器页面查看,如图2-35所示

              (b)删除文件系统上的output文件

[[email protected] hadoop-2.7.2]# bin/hdfs dfs -rm -R /user/topcheer/output
20/01/14 20:23:56 INFO fs.TrashPolicyDefault: Namenode trash configuration: Dele                                                                                     tion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/topcheer/output

 

              (c)执行MapReduce程序

[[email protected] hadoop-2.7.2]# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapr                                                                                     educe-examples-2.7.2.jar wordcount /user/topcheer/input  /user/topcheer/output
20/01/14 20:25:53 INFO client.RMProxy: Connecting to ResourceManager at hadoop00                                                                                     1/172.16.233.215:8032
20/01/14 20:25:54 INFO input.FileInputFormat: Total input paths to process : 1
20/01/14 20:25:54 INFO mapreduce.JobSubmitter: number of splits:1
20/01/14 20:25:54 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_15                                                                                     79004405060_0001
20/01/14 20:25:54 INFO impl.YarnClientImpl: Submitted application application_15

 

              (d)查看运行结果,如图所示

 

 

3 配置历史服务器

为了查看程序的历史运行情况,需要配置一下历史服务器。具体配置步骤如下:

1.    配置mapred-site.xml

[[email protected] hadoop]# vim mapred-site.xml
[[email protected] hadoop]#

 

在该文件里面增加如下配置。

2.    启动历史服务器

[[email protected] hadoop-2.7.2]# sbin/mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /opt/module/hadoop-2.7.2/logs/mapred-root-historyserver-hadoop001.out
[root@hadoop001 hadoop-2.7.2]# jps
24836 ResourceManager
25749 JobHistoryServer
25093 NodeManager
23608 NameNode
25787 Jps
23691 DataNode

3.    查看JobHistory

http://hadoop001:19888/jobhistory

 

 

44 配置日志的聚集

日志聚集概念:应用运行完成以后,将程序运行日志信息上传到HDFS系统上。

日志聚集功能好处:可以方便的查看到程序运行详情,方便开发调试。

注意:开启日志聚集功能,需要重新启动NodeManager 、ResourceManager和HistoryManager。

开启日志聚集功能具体步骤如下:

  1. 配置yarn-site.xml
[[email protected] hadoop]# vi yarn-site.xml
[[email protected] hadoop]# cd ../../

 

在该文件里面增加如下配置。

  1. 关闭NodeManager 、ResourceManager和HistoryManager
[email protected] hadoop-2.7.2]# sbin/yarn-daemon.sh stop resourcemanager
stopping resourcemanager
[root@hadoop001 hadoop-2.7.2]# jps
25905 Jps
25749 JobHistoryServer
25093 NodeManager
23608 NameNode
23691 DataNode
[root@hadoop001 hadoop-2.7.2]# sbin/yarn-daemon.sh stop nodemanager
stopping nodemanager
nodemanager did not stop gracefully after 5 seconds: killing with kill -9
[root@hadoop001 hadoop-2.7.2]# sbin/mr-jobhistory-daemon.sh stop historyserver
stopping historyserver

 

启动NodeManager 、ResourceManager和HistoryManager

[[email protected] hadoop-2.7.2]# sbin/yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /opt/module/hadoop-2.7.2/logs/yarn-root-resourcemanager-hadoop001.out
[root@hadoop001 hadoop-2.7.2]# sbin/yarn-daemon.sh start nodemanager
starting nodemanager, logging to /opt/module/hadoop-2.7.2/logs/yarn-root-nodemanager-hadoop001.out
[root@hadoop001 hadoop-2.7.2]# sbin/mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /opt/module/hadoop-2.7.2/logs/mapred-root-historyserver-hadoop001.out
[root@hadoop001 hadoop-2.7.2]# jps
26390 JobHistoryServer
25991 ResourceManager
23608 NameNode
23691 DataNode
26428 Jps
26237 NodeManager

 

删除HDFS上已经存在的输出文件

[[email protected] hadoop-2.7.2]# bin/hdfs dfs -rm -R /user/topcheer/output
20/01/14 20:54:21 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/topcheer/output

 

执行WordCount程序

[[email protected] hadoop-2.7.2]# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /user/topcheer/input  /user/topcheer/output
20/01/14 20:54:43 INFO client.RMProxy: Connecting to ResourceManager at hadoop001/172.16.233.215:8032
20/01/14 20:54:43 INFO input.FileInputFormat: Total input paths to process : 1
20/01/14 20:54:43 INFO mapreduce.JobSubmitter: number of splits:1
20/01/14 20:54:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1579006407905_0001
20/01/14 20:54:44 INFO impl.YarnClientImpl: Submitted application application_1579006407905_0001

 

  1. 查看日志,如图所示

 

 

 

 

 

 

5 配置文件说明

Hadoop配置文件分两类:默认配置文件和自定义配置文件,只有用户想修改某一默认配置值时,才需要修改自定义配置文件,更改相应属性值。

(1)默认配置文件:

表2-1

要获取的默认文件

文件存放在Hadoop的jar包中的位置

[core-default.xml]

hadoop-common-2.7.2.jar/ core-default.xml

[hdfs-default.xml]

hadoop-hdfs-2.7.2.jar/ hdfs-default.xml

[yarn-default.xml]

hadoop-yarn-common-2.7.2.jar/ yarn-default.xml

[mapred-default.xml]

hadoop-mapreduce-client-core-2.7.2.jar/ mapred-default.xml

       (2)自定义配置文件:

       core-site.xml、hdfs-site.xml、yarn-site.xml、mapred-site.xml四个配置文件存放在$HADOOP_HOME/etc/hadoop这个路径上,用户可以根据项目需求重新进行修改配置。

猜你喜欢

转载自www.cnblogs.com/dalianpai/p/12193455.html