Hadoop Cluster Configuration: A Minimalist Quick-Start Tutorial

Cluster Planning

  • Plan the cluster before building it: several configuration files depend on the cluster plan. You need to decide which hosts each Hadoop component will be installed on, so as to balance load and avoid irreversible loss when a single host goes down. Cluster planning is one of the most important steps before building a distributed environment. The plan for this experiment is as follows:
         hadoop101      hadoop102         hadoop103
HDFS     NameNode       DataNode          DataNode
         DataNode                         SecondaryNameNode
YARN     NodeManager    ResourceManager   NodeManager
                        NodeManager
  • Operating System: CentOS-7-x86_64
  • Hadoop version: hadoop-2.7.7

1. The core configuration file

  • The configuration files are located under the /hadoop-2.7.7/etc/hadoop path
  • Note: before executing the commands below, the hostname-to-IP mappings must be set up in the /etc/hosts file
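For example, the mapping in /etc/hosts might look like the following on every node (the IP addresses below are placeholders for illustration only — substitute the real addresses of your three hosts):

```
192.168.1.101   hadoop101
192.168.1.102   hadoop102
192.168.1.103   hadoop103
```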

(1) Configure core-site.xml

  • Edit the core-site.xml file and insert the configuration information between the configuration tags. The specific configuration is as follows:
<configuration>
    <!-- Specify the address of the NameNode in HDFS -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop101:9000</value>
    </property>
    
    <!-- Specify the storage directory for files generated at Hadoop runtime -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-2.7.7/data/tmp</value>
    </property>
</configuration>

2. HDFS configuration files

  • The configuration file path is the same as above

(1) Configure hadoop-env.sh

  • Edit the hadoop-env.sh file and change the JAVA_HOME environment variable to the actual Java installation path on the machine, for example:
# Before: export JAVA_HOME=${JAVA_HOME}; after:
export JAVA_HOME=/opt/module/jdk1.8.0_221/
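If you prefer a non-interactive edit, a sed one-liner can make the same change. The sketch below demonstrates it on a throwaway copy for safety; in practice run the sed command against etc/hadoop/hadoop-env.sh itself (the JDK path is the one used in this tutorial and is an assumption about your machine):

```shell
# Demonstrate the JAVA_HOME edit on a temporary copy of hadoop-env.sh.
env_file=$(mktemp)
echo 'export JAVA_HOME=${JAVA_HOME}' > "$env_file"
# Rewrite the JAVA_HOME line to point at the actual JDK install path:
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/opt/module/jdk1.8.0_221/|' "$env_file"
cat "$env_file"
```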

(2) Configure hdfs-site.xml

  • Edit the hdfs-site.xml file and insert the configuration information between the configuration tags. The specific configuration is as follows:
<configuration>
    <!-- Specify the HDFS replication factor -->
    <!-- Set to 1 in this experiment because the hosts are short on disk space; normally it should be 3 -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>

    <!-- Specify the host for the SecondaryNameNode auxiliary node -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop103:50090</value>
    </property>
</configuration>

3. YARN configuration files

  • The configuration file path is the same as above

(1) Configure yarn-env.sh

  • Edit the yarn-env.sh file and change the JAVA_HOME environment variable to the actual Java installation path on the machine, for example:
# Before: # export JAVA_HOME=/home/y/libexec/jdk1.6.0/; after:
export JAVA_HOME=/opt/module/jdk1.8.0_221/

(2) Configure yarn-site.xml

  • Edit the yarn-site.xml file and insert the configuration information between the configuration tags. The specific configuration is as follows:
<configuration>

    <!-- Site specific YARN configuration properties -->
    
    <!-- Set how Reducers fetch data -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

    <!-- Specify the host of the YARN ResourceManager -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop102</value>
    </property>
    
    <!-- This parameter is the memory available to the NodeManager, in MB; set it to the host's memory size -->
    <!-- The experiment hosts have 2 GB of memory; set this according to each machine's physical memory -->
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>2048</value>
    </property>
</configuration>
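To pick a value for yarn.nodemanager.resource.memory-mb that matches the host, you can read the total physical memory in MB with standard Linux tools (no Hadoop involved; this assumes a Linux host with `free` available, as on the CentOS 7 machines used here):

```shell
# Total physical memory in MB — use this to size
# yarn.nodemanager.resource.memory-mb on each NodeManager host.
total_mb=$(free -m | awk '/^Mem:/{print $2}')
echo "$total_mb"
```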

4. MapReduce configuration files

  • The configuration file path is the same as above

(1) Configure mapred-env.sh

  • Edit the mapred-env.sh file and change the JAVA_HOME environment variable to the actual Java installation path on the machine, for example:

# Before: # export JAVA_HOME=/home/y/libexec/jdk1.6.0/; after:
export JAVA_HOME=/opt/module/jdk1.8.0_221/
(2) Configure mapred-site.xml

  • Copy the mapred-site.xml.template file to a new file named mapred-site.xml:
cp mapred-site.xml.template mapred-site.xml
  • Edit the mapred-site.xml file and insert the configuration information between the configuration tags. The specific configuration is as follows:
<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- Specify that MapReduce runs on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

5. The slaves configuration file

  • The configuration file path is the same as above

(1) Configure the slaves file

  • This file declares the hostnames that make up the cluster; it is mainly read as a parameter by the scripts used later to start the cluster
  • Note: each line of the file must contain exactly one hostname, with no extra whitespace allowed. Before using hostnames in the configuration files, the hostname-to-IP mappings should already be set in the /etc/hosts file. The specific configuration is as follows:
hadoop101
hadoop102
hadoop103
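Because stray whitespace or blank lines in slaves silently break worker startup, a quick sanity check helps. The sketch below runs against a temporary file so it is safe to try; point slaves_file at etc/hadoop/slaves in practice:

```shell
# Check a slaves file for the pitfalls noted above: any whitespace
# character or blank line will break worker startup.
slaves_file=$(mktemp)
printf 'hadoop101\nhadoop102\nhadoop103\n' > "$slaves_file"
if grep -qE '[[:space:]]|^$' "$slaves_file"; then
    echo "WARNING: whitespace or blank lines found in $slaves_file"
else
    echo "slaves file looks clean"
fi
```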

6. Synchronize the cluster configuration

  • Synchronize the configuration files to every host in the cluster
  • So far eight files have been configured: the core, HDFS, YARN, and MapReduce configuration files plus slaves. Further settings can be added later on this basis
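One way to synchronize is an rsync loop run from hadoop101, assuming passwordless ssh and an identical install path on every node. For safety the loop below only echoes the commands; remove the `echo` to actually push the files:

```shell
# Push the whole Hadoop config directory from this host to the others.
# Dry run: each rsync command is echoed instead of executed.
conf_dir=/opt/module/hadoop-2.7.7/etc/hadoop
for host in hadoop102 hadoop103; do
    echo rsync -av "$conf_dir/" "$host:$conf_dir/"
done
```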

7. Start the cluster and test

  • Before starting the cluster, it is best to set up passwordless ssh login between the hosts first, to avoid being prompted for a password repeatedly during startup
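A common way to set this up is sketched below. The key generation targets a temporary directory so it is safe to try as-is, and the ssh-copy-id loop is echoed; for real use, generate into ~/.ssh and drop the `echo` (hostnames follow the plan above):

```shell
# Generate an RSA key pair without a passphrase, then distribute the
# public key to every node (including this one).
keydir=$(mktemp -d)
ssh-keygen -t rsa -N '' -q -f "$keydir/id_rsa"
for host in hadoop101 hadoop102 hadoop103; do
    echo ssh-copy-id -i "$keydir/id_rsa.pub" "$host"
done
```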
(1) Format the NameNode

  • If this is the first start, remember to format the NameNode with the hdfs command. If it is not the first time, remember to delete the data and logs folders first (both are under the hadoop installation directory). The format command is as follows:
[tomandersen@hadoop101 hadoop]$ hdfs namenode -format
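When re-formatting, the old data and logs directories must be removed on every node first. A sketch follows; the paths come from this tutorial's install layout, and the destructive commands are echoed for safety (drop the `echo` to actually run them over ssh):

```shell
# Clear stale metadata and logs before a re-format. Dry run: each
# ssh/rm command is echoed instead of executed.
hadoop_home=/opt/module/hadoop-2.7.7
for host in hadoop101 hadoop102 hadoop103; do
    echo ssh "$host" rm -rf "$hadoop_home/data" "$hadoop_home/logs"
done
```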
(2) Start the HDFS cluster

  • Note: start-dfs.sh can only be run on the host configured as the NameNode (that host acts as the HDFS client); otherwise the NameNode will not start normally, and HDFS on the other nodes will start only as DataNodes, leaving the cluster without a NameNode (the same applies when shutting down). This is one important reason why scripts such as "start-all.sh" are deprecated
[tomandersen@hadoop101 hadoop]$ start-dfs.sh
(3) Start the YARN cluster

  • Note: likewise, start-yarn.sh can only be run on the host configured as the ResourceManager (that host acts as the YARN client); otherwise the ResourceManager will not start normally, and YARN on the other nodes will start only as NodeManagers, leaving the cluster without a ResourceManager (the same applies when shutting down)
[tomandersen@hadoop102 hadoop]$ start-yarn.sh
(4) View the Java processes on each node

  • Use the jps command to view the Java processes and check whether the running roles match the earlier cluster plan. In this experiment the processes on each node are as follows:
  • hadoop101:
[tomandersen@hadoop101 hadoop]$ jps
23505 NodeManager
23915 NameNode
24270 Jps
  • hadoop102:
[tomandersen@hadoop102 hadoop]$ jps
26327 Jps
25784 NodeManager
25631 ResourceManager
  • hadoop103:
[tomandersen@hadoop103 hadoop]$ jps
18177 Jps
17699 NodeManager
18093 SecondaryNameNode
  • Finally, you can also run the examples bundled with Hadoop to test the cluster; the following program computes pi:
 hadoop jar /opt/module/hadoop-2.7.7/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar pi 10 10

End~



Origin blog.csdn.net/TomAndersen/article/details/104220449