1 Install Java
# tar -xzvf jdk-8u181-linux-x64.tar.gz -C /usr/local/
# vi /root/.bashrc
export JAVA_HOME=/usr/local/jdk1.8.0_181
export PATH=$JAVA_HOME/bin:$PATH
# source /root/.bashrc
# java -version
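Before moving on, it can help to confirm that JAVA_HOME really points at a JDK. The following is a minimal sketch; check_java_home is a hypothetical helper name, not part of the JDK.

```shell
# Sketch: verify that a JAVA_HOME candidate contains an executable bin/java.
# check_java_home is a hypothetical helper, not a JDK tool.
check_java_home() {
    local home="$1"
    if [ -z "$home" ]; then
        echo "JAVA_HOME is empty"
        return 1
    fi
    if [ ! -x "$home/bin/java" ]; then
        echo "no executable java under $home/bin"
        return 1
    fi
    echo "ok: $home"
}

# Usage:
#   check_java_home "$JAVA_HOME" && "$JAVA_HOME/bin/java" -version
```

If the check fails right after editing .bashrc, the file has probably not been sourced in the current shell yet.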
2 Install and configure SSH password-free login
# rpm -qa | grep ssh   [Check whether SSH is installed]
# yum install openssh-clients
# sudo yum install openssh-server
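# ssh localhost   [Test whether SSH works; a password is still required at this point]
# cd /root/.ssh   [If this directory does not exist, run ssh localhost once first]
# ssh-keygen -t rsa   [Press Enter at every prompt]
# cat id_rsa.pub >> authorized_keys   [Authorize the new key]
# chmod 600 authorized_keys   [Restrict the file permissions]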
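If ssh localhost still asks for a password after these steps, overly loose permissions on the .ssh directory or on authorized_keys are a common cause. The sketch below checks for the conventional modes 700 and 600. check_ssh_perms is a hypothetical helper name, and stat -c assumes GNU coreutils (as on CentOS).

```shell
# Sketch: report whether an .ssh directory uses the conventional modes.
# check_ssh_perms is a hypothetical helper; stat -c requires GNU coreutils.
check_ssh_perms() {
    local dir="$1" status=0 mode
    mode=$(stat -c '%a' "$dir") || return 2
    if [ "$mode" != "700" ]; then
        echo "fix: chmod 700 $dir"
        status=1
    fi
    mode=$(stat -c '%a' "$dir/authorized_keys") || return 2
    if [ "$mode" != "600" ]; then
        echo "fix: chmod 600 $dir/authorized_keys"
        status=1
    fi
    if [ "$status" -eq 0 ]; then
        echo "ok"
    fi
    return "$status"
}

# Usage:
#   check_ssh_perms /root/.ssh
```

The function only reports; it leaves the actual chmod to the user so nothing is changed silently.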
3 Hadoop pseudo-distributed mode
3.1 Unpack Hadoop
$ sudo tar -xzvf /home/hadoop/Desktop/hadoop-2.8.0.tar.gz -C /usr/local
$ cd /usr/local
$ sudo mv ./hadoop-2.8.0/ ./hadoop
$ sudo chown -R hadoop:hadoop ./hadoop/
# /usr/local/hadoop/bin/hadoop version   [Check that Hadoop works; on success the Hadoop version information is shown]
$ sudo gedit /home/hadoop/.bashrc   [Configure the environment variables]
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_CLASSPATH=/usr/local/hadoop/share/hadoop/common/lib
$ source /home/hadoop/.bashrc
Consider whether to add the following:
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
3.2 Configure pseudo-distributed Hadoop
(1) Hadoop can run in pseudo-distributed mode on a single node, where each Hadoop daemon runs as a separate Java process. The
node acts as both NameNode and DataNode, and files are read from HDFS.
$ cd /usr/local/hadoop/etc/hadoop/
(2) The following files in this directory are relevant:
hadoop-env.sh
log4j.properties
slaves
core-site.xml
hdfs-site.xml
yarn-site.xml
mapred-site.xml.template
(3) Pseudo-distributed mode requires editing three configuration files: core-site.xml, hdfs-site.xml, and slaves
3.2.1 core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
<description>The name of the default file system.</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/usr/local/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
</configuration>
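After editing a site file it is easy to leave a typo in a property name. As a rough sketch, the value of a property can be read back with awk. get_hadoop_prop is a hypothetical helper name, and the parsing assumes the simple layout used above, with each <name> and <value> element on a single line.

```shell
# Sketch: print the value of one property from a Hadoop *-site.xml file.
# get_hadoop_prop is a hypothetical helper. It is not a real XML parser:
# it assumes each <name> and <value> element sits on a single line.
get_hadoop_prop() {
    local file="$1" name="$2"
    awk -v want="$name" '
        /<name>/ {
            line = $0
            sub(/.*<name>/, "", line)
            sub(/<\/name>.*/, "", line)
            found = (line == want)
        }
        found && /<value>/ {
            line = $0
            sub(/.*<value>/, "", line)
            sub(/<\/value>.*/, "", line)
            print line
            exit
        }
    ' "$file"
}

# Usage:
#   get_hadoop_prop /usr/local/hadoop/etc/hadoop/core-site.xml fs.defaultFS
```

Once Hadoop is on the PATH, hdfs getconf -confKey fs.defaultFS reports the value Hadoop itself resolves, which is the more authoritative check.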
3.2.2 hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop/tmp/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop/tmp/dfs/data</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>localhost:50090</value>
</property>
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>file:/usr/local/hadoop/tmp/dfs/namesecondary</value>
</property>
</configuration>
3.2.3 mapred-site.xml [Required if yarn is activated]
$ cp mapred-site.xml.template mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>localhost:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>localhost:19888</value>
</property>
</configuration>
3.2.4 yarn-site.xml [Required if yarn is started]
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
<!-- Enable log aggregation on HDFS -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- Maximum time, in seconds, to keep aggregated logs on HDFS (3 days) -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>259200</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>whether virtual memory limits will be enforced for containers</description>
</property>
</configuration>
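The value 259200 for yarn.log-aggregation.retain-seconds is simply 3 days expressed in seconds. A small sketch for computing other retention periods follows; days_to_seconds is a hypothetical helper name.

```shell
# Sketch: convert a retention period in days to seconds for
# yarn.log-aggregation.retain-seconds. days_to_seconds is a hypothetical helper.
days_to_seconds() {
    echo $(( $1 * 24 * 60 * 60 ))
}

# Usage:
#   days_to_seconds 3    # the value used in yarn-site.xml above
```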
3.2.5 slaves
localhost
3.2.6 hadoop-env.sh
In a cluster environment, Hadoop may still report a JAVA_HOME error even if JAVA_HOME is set correctly on every node.
Solution: declare JAVA_HOME explicitly in hadoop-env.sh, using the JDK path from step 1
export JAVA_HOME=/usr/local/jdk1.8.0_181
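Instead of editing hadoop-env.sh by hand, the line can be set from the shell. This is a sketch under assumptions: set_java_home is a hypothetical helper name, sed -i assumes GNU sed, and the JDK path must not contain the characters | or &.

```shell
# Sketch: set or replace the JAVA_HOME export in a hadoop-env.sh file.
# set_java_home is a hypothetical helper; sed -i assumes GNU sed, and the
# JDK path must not contain '|' or '&' (both are special in the sed command).
set_java_home() {
    local file="$1" jdk="$2"
    if grep -q '^export JAVA_HOME=' "$file"; then
        # Replace the existing declaration in place.
        sed -i "s|^export JAVA_HOME=.*|export JAVA_HOME=$jdk|" "$file"
    else
        # No declaration yet: append one.
        echo "export JAVA_HOME=$jdk" >> "$file"
    fi
}

# Usage (JDK path from step 1):
#   set_java_home /usr/local/hadoop/etc/hadoop/hadoop-env.sh /usr/local/jdk1.8.0_181
```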
3.3 Format and start
Format the NameNode
$ hdfs namenode -format
[On success, the output contains "successfully formatted" and "Exiting with status 0", for example:]
Storage directory /usr/local/hadoop/tmp/dfs/name has been successfully formatted.
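Formatting is meant to be done once. Formatting again discards the existing HDFS metadata and can leave a DataNode with old data unable to start because its cluster ID no longer matches. A minimal guard, assuming the name directory configured in hdfs-site.xml above, is to test for the current/VERSION file that formatting creates. namenode_is_formatted is a hypothetical helper name.

```shell
# Sketch: succeed if the given NameNode directory has already been formatted.
# namenode_is_formatted is a hypothetical helper; it relies on the
# current/VERSION file that a NameNode format writes into the name directory.
namenode_is_formatted() {
    [ -f "$1/current/VERSION" ]
}

# Usage (directory from dfs.namenode.name.dir):
#   namenode_is_formatted /usr/local/hadoop/tmp/dfs/name || hdfs namenode -format
```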
(1) # start-dfs.sh   [Start HDFS]
start-dfs.sh (in ./sbin) starts only the HDFS daemons, not YARN. After it finishes, jps should list:
NameNode
SecondaryNameNode
DataNode
http://localhost:50070   [View the HDFS cluster status]
http://localhost:8088   [View running jobs; available only after YARN is started, see section 4]
(2) # stop-dfs.sh   [Stop HDFS]
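Whether the three HDFS daemons listed above actually came up can be checked against the output of jps, the process lister shipped with the JDK. The sketch below reads jps output on standard input and prints every expected daemon that is absent. missing_daemons is a hypothetical helper name.

```shell
# Sketch: print each expected daemon name that is absent from jps output.
# missing_daemons is a hypothetical helper. It reads "PID Name" lines on
# stdin and compares the Name column exactly, so NameNode does not match
# SecondaryNameNode.
missing_daemons() {
    local jps_output d
    jps_output=$(cat)
    for d in "$@"; do
        if ! printf '%s\n' "$jps_output" | awk '{print $2}' | grep -qx "$d"; then
            echo "$d"
        fi
    done
}

# Usage:
#   jps | missing_daemons NameNode SecondaryNameNode DataNode
```

Empty output means all expected daemons are running. A missing daemon usually has an explanation in its log file under /usr/local/hadoop/logs.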
4 Start YARN
[Pseudo-distributed mode works without YARN; leaving YARN off generally does not affect program execution]
Starting YARN hands resource management and task scheduling over to it.
[This requires editing two more configuration files: mapred-site.xml and yarn-site.xml]
(1) Modify the configuration file
# cd /usr/local/hadoop/etc/hadoop/
# mv mapred-site.xml.template mapred-site.xml
# vi mapred-site.xml   [Configuration file]
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
# vi yarn-site.xml   [Configuration file]
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
[If you do not want to start YARN, be sure to rename mapred-site.xml back to mapred-site.xml.template;]
[otherwise MapReduce jobs are submitted to YARN and fail because no ResourceManager is running. Rename it again when you need YARN.]
(2) Start YARN
# start-yarn.sh   [Start YARN]
# mr-jobhistory-daemon.sh start historyserver   [Start the history server so that task status can be viewed on the web]
[YARN mainly provides better resource management and task scheduling for a cluster, which brings little benefit on a single machine]
[and can make programs run slightly slower. Whether to enable YARN on a single machine therefore depends on the situation.]
(3) Stop YARN
# stop-yarn.sh
# mr-jobhistory-daemon.sh stop historyserver
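The renaming of mapred-site.xml described in step (1) can be wrapped in a small switch so that it is not forgotten when YARN is turned off. This is only a sketch; set_yarn_config is a hypothetical helper name, and it assumes the configuration lives in a single directory such as /usr/local/hadoop/etc/hadoop.

```shell
# Sketch: switch the MapReduce configuration between "YARN on" and "YARN off"
# by renaming mapred-site.xml, as described in step (1).
# set_yarn_config is a hypothetical helper. Note that "off" overwrites an
# existing mapred-site.xml.template if both files are present.
set_yarn_config() {
    local conf_dir="$1" mode="$2"
    local active="$conf_dir/mapred-site.xml"
    local parked="$conf_dir/mapred-site.xml.template"
    case "$mode" in
        on)
            if [ ! -f "$active" ] && [ -f "$parked" ]; then
                mv "$parked" "$active"
            fi
            ;;
        off)
            if [ -f "$active" ]; then
                mv "$active" "$parked"
            fi
            ;;
        *)
            echo "usage: set_yarn_config <conf_dir> on|off" >&2
            return 2
            ;;
    esac
}

# Usage:
#   set_yarn_config /usr/local/hadoop/etc/hadoop off   # after stop-yarn.sh
#   set_yarn_config /usr/local/hadoop/etc/hadoop on    # before start-yarn.sh
```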