Big data platform setup | Hadoop cluster setup

1. Environment description

2. Hadoop architecture

  • In the narrow sense, Hadoop consists of three main components: distributed storage (HDFS), distributed computing (MapReduce), and distributed resource management and scheduling (YARN)

2.1 HDFS architecture

  • Mainly responsible for data storage

  • NameNode: Manage namespaces, store data block mapping information (metadata), and handle client access to HDFS.

  • SecondaryNameNode: periodically merges the namespace image (fsimage) with the edit log (edits) so that the NameNode can restart quickly. Despite the name, it is not a hot standby and cannot take over automatically when the NameNode fails

  • DataNode: responsible for storing the actual file data; each file is split into blocks that are stored, in multiple replicas, across different DataNodes
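Once the cluster is up (section 6), the standard hdfs fsck tool shows exactly this block/replica layout; the path below is only an illustrative example:

# show how a file is split into blocks and which DataNodes hold each replica
hdfs fsck /user/hadoop/example.txt -files -blocks -locations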

2.2 Yarn architecture

  • Mainly responsible for job scheduling and resource management
  • ResourceManager(RM):
    • Process submitted job requests and resource application requests.
    • Monitor the status of NodeManager
    • Start and monitor ApplicationMaster
  • NodeManager(NM):
    • Manage the resources on its node
    • Regularly report the node's resource usage and the running status of each Container to the RM
    • Handle requests from the AM to start or stop Containers
  • Container:
    • The container in which a task runs; it is also YARN's abstraction of resources, encapsulating multi-dimensional resources on a node such as memory, CPU, disk, and network. The resources the RM grants to an AM are expressed as Containers; YARN assigns each task a Container, and the task can only use the resources described by that Container
  • ApplicationMaster(AM):
    • Each job starts one AM inside an NM; the AM then asks the NM to launch the MapTask and ReduceTask processes and requests the resources those tasks need from the RM
    • Interacts with the RM to apply for resource Containers (e.g. resources for job execution and for task execution)
    • Responsible for starting and stopping tasks and monitoring their status; when a task fails, it re-applies for resources and restarts the task
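Once YARN is running (section 6.3), the stock yarn CLI makes these components visible; nothing below is specific to this cluster:

yarn node -list                             # NodeManagers registered with the RM
yarn application -list                      # running applications, each driven by its own AM
yarn logs -applicationId <application id>   # aggregated container logs (see log aggregation in 5.4)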

3. Cluster planning

Unless otherwise specified, each server must maintain the same configuration

                    Hadoop300   Hadoop301   Hadoop302
NameNode            ✓
DataNode            ✓           ✓           ✓
SecondaryNameNode                           ✓
ResourceManager                 ✓
NodeManager         ✓           ✓           ✓

4. Download and unzip

4.1 Installation package placement

  • Unzip the downloaded hadoop-3.1.3 package and create a symbolic link to it under the ~/app directory; do the same on hadoop301 and hadoop302
[hadoop@hadoop300 app]$ pwd
/home/hadoop/app
[hadoop@hadoop300 app]$ ll
lrwxrwxrwx   1 hadoop hadoop  47 2月  21 12:33 hadoop -> /home/hadoop/app/manager/hadoop_mg/hadoop-3.1.3
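A minimal sketch of how this layout can be produced, assuming the hadoop-3.1.3.tar.gz tarball was downloaded to /home/hadoop/app/manager/hadoop_mg (matching the link target above):

cd /home/hadoop/app/manager/hadoop_mg
tar -zxvf hadoop-3.1.3.tar.gz
ln -s /home/hadoop/app/manager/hadoop_mg/hadoop-3.1.3 /home/hadoop/app/hadoop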

4.2 Configure Hadoop environment variables

  • vim ~/.bash_profile
# ============ java =========================
export JAVA_HOME=/home/hadoop/app/jdk
export PATH=$PATH:$JAVA_HOME/bin

# ======================= Hadoop ============================
export HADOOP_HOME=/home/hadoop/app/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
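After saving, reload the profile and check that both tools resolve:

source ~/.bash_profile
java -version
hadoop version    # should report Hadoop 3.1.3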

5. Hadoop configuration

5.1 env file

  • In ${HADOOP_HOME}/etc/hadoop, edit hadoop-env.sh, mapred-env.sh, and yarn-env.sh and add the JDK environment variable to each:
export JAVA_HOME=/home/hadoop/app/jdk
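One way to append that line to all three files at once (a convenience sketch, not required):

cd $HADOOP_HOME/etc/hadoop
for f in hadoop-env.sh mapred-env.sh yarn-env.sh; do
    echo 'export JAVA_HOME=/home/hadoop/app/jdk' >> "$f"
done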

5.2 core-site.xml

  • Modify the ${HADOOP_HOME}/etc/hadoop/core-site.xml file
  • For proxy-user configuration, see the Proxy User page on the official website
<!-- Address of the NameNode in HDFS -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop300:8020</value>
</property>

<!-- Storage path for HDFS NameNode/DataNode data; defaults to /tmp/hadoop-${user.name} -->
<!--
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/tmp/hadoop-${user.name}</value>
    </property>
-->

<!-- Static user for logging in to the HDFS web UI: hadoop -->
<property>
    <name>hadoop.http.staticuser.user</name>
    <value>hadoop</value>
</property>

<!-- Hosts from which the hadoop user may act as a proxy -->
<property>
    <name>hadoop.proxyuser.hadoop.hosts</name>
    <value>*</value>
</property>
<!-- Groups whose users the hadoop user may proxy -->
<property>
    <name>hadoop.proxyuser.hadoop.groups</name>
    <value>*</value>
</property>
<!-- Users the hadoop user may proxy; * means all -->
<property>
    <name>hadoop.proxyuser.hadoop.users</name>
    <value>*</value>
</property>
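The effective values can be read back with hdfs getconf, for example:

hdfs getconf -confKey fs.defaultFS                 # expect hdfs://hadoop300:8020
hdfs getconf -confKey hadoop.http.staticuser.user  # expect hadoop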

5.3 hdfs-site.xml (HDFS configuration)

  • Configure HDFS related attributes
<!-- Number of HDFS replicas -->
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>

<!-- SecondaryNameNode HTTP address -->
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hadoop302:9868</value>
</property>
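With dfs.replication set to 2, files written to HDFS should end up with 2 replicas; once HDFS is running (section 6.2) this can be spot-checked (the file name is only an example):

hadoop fs -put ./test.txt /tmp/
hadoop fs -stat %r /tmp/test.txt    # expect 2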

5.4 yarn-site.xml (YARN configuration)

  • Configure yarn related properties
<!-- How reducers obtain data: shuffle -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>

<!-- Hostname of the YARN ResourceManager -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop301</value>
</property>

<!-- Minimum memory (MB) a container may request from the RM; allocations are at least this size -->
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>200</value>
</property>
<!-- Maximum memory (MB) a container may request from the RM; larger requests throw InvalidResourceRequestException -->
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2048</value>
</property>

<!-- Memory YARN may use, i.e. physical memory (in MB) that can be allocated to containers.
     If set to -1 and yarn.nodemanager.resource.detect-hardware-capabilities is true, it is
     calculated automatically (on Windows and Linux); otherwise the default is 8192 MB. -->
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
</property>

<!-- Disable YARN's physical and virtual memory limit checks.
     Because memory is accounted differently, jobs may otherwise be killed
     for (incorrectly) appearing to run out of memory. -->
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>

<!-- Job history log server address -->
<property>
    <name>yarn.log.server.url</name>
    <value>http://hadoop300:19888/jobhistory/logs/</value>
</property>

<!-- Enable log aggregation.
     After an application finishes, the logs written locally by its containers are collected
     and uploaded to HDFS, where they are easy to inspect. -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>

<!-- Keep aggregated logs for 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>
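Once YARN is started (section 6.3), the memory settings can be sanity-checked through the ResourceManager's standard REST API:

curl -s http://hadoop301:8088/ws/v1/cluster/metrics
# with 3 NodeManagers at 4096 MB each, totalMB should come to about 12288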

5.5 mapred-site.xml (MapReduce configuration)

  • Configure MapReduce related settings
<!-- Run MapReduce on YARN (the default is local) -->
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>

<!-- JobHistory server addresses -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop300:10020</value>
</property>
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop300:19888</value>
</property>

<!-- Hadoop environment variables for MapReduce -->
<property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>

5.6 workers file configuration

  • Modify the ${HADOOP_HOME}/etc/hadoop/workers file to list the cluster's worker nodes
    • tip: make sure there are no blank lines or trailing spaces
hadoop300
hadoop301
hadoop302
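Since section 3 requires every server to keep the same configuration, one way to push the edited files to the other nodes is rsync over the password-free SSH that the start scripts already rely on (a sketch, assuming the same directory layout on all hosts):

for host in hadoop301 hadoop302; do
    rsync -av $HADOOP_HOME/etc/hadoop/ ${host}:$HADOOP_HOME/etc/hadoop/
done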

6. Start and test

6.1 Format NameNode

  • Execute on hadoop300
[hadoop@hadoop300 app]$ hdfs namenode -format

6.2 Start HDFS

  • Start on hadoop300
[hadoop@hadoop300 ~]$ start-dfs.sh
Starting namenodes on [hadoop300]
Starting datanodes
Starting secondary namenodes [hadoop302]

6.3 Start Yarn

  • Start on hadoop301
[hadoop@hadoop301 ~]$ start-yarn.sh
Starting resourcemanager
Starting nodemanagers

6.4 Start JobHistory

[hadoop@hadoop300 hadoop]$ mapred --daemon start historyserver

6.5 Effect

  • After a successful startup, check the processes with jps
  • At this point HDFS's NameNode (nn), DataNode (dn), and SecondaryNameNode (sn) are all running
  • YARN's RM and NM are also running
  • and so is the MapReduce JobHistory server
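The jps output below is gathered with xcall, a helper script that runs the same command on every node (it is not part of Hadoop). A minimal sketch of what it does, assuming the three hostnames used here:

#!/bin/bash
# xcall: run the given command on every cluster node
for host in hadoop300 hadoop301 hadoop302; do
    echo "--------- $host ----------"
    ssh "$host" "source ~/.bash_profile; $*"
done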
[hadoop@hadoop300 hadoop]$ xcall jps
--------- hadoop300 ----------
16276 JobHistoryServer
30597 DataNode
19641 Jps
30378 NameNode
3242 NodeManager
--------- hadoop301 ----------
24596 DataNode
19976 Jps
27133 ResourceManager
27343 NodeManager
--------- hadoop302 ----------
24786 SecondaryNameNode
27160 NodeManager
24554 DataNode
19676 Jps

The HDFS NameNode web UI is at hadoop300:9870

The HDFS SecondaryNameNode web UI is at hadoop302:9868

The Yarn management UI is at hadoop301:8088

The JobHistory web UI is at hadoop300:19888
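As a final smoke test, the example jar that ships with Hadoop can run a small MapReduce job on YARN; it should appear in both the Yarn UI and JobHistory:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar pi 2 10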

7. Hadoop cluster unified startup script

  • vim hadoop.sh
#!/bin/bash

case $1 in
"start"){
        echo "---------- Starting the Hadoop cluster ------------"
        echo "Starting HDFS"
        ssh hadoop300 "source ~/.bash_profile; start-dfs.sh"
        echo "Starting JobHistory"
        ssh hadoop300 "source ~/.bash_profile; mapred --daemon start historyserver"
        echo "Starting Yarn"
        ssh hadoop301 "source ~/.bash_profile; start-yarn.sh"
};;
"stop"){
        echo "---------- Stopping the Hadoop cluster ------------"
        echo "Stopping HDFS"
        ssh hadoop300 "source ~/.bash_profile; stop-dfs.sh"
        echo "Stopping JobHistory"
        ssh hadoop300 "source ~/.bash_profile; mapred --daemon stop historyserver"
        echo "Stopping Yarn"
        ssh hadoop301 "source ~/.bash_profile; stop-yarn.sh"
};;
esac
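Make the script executable and use it like this:

chmod +x hadoop.sh
./hadoop.sh start    # start HDFS, JobHistory and Yarn
./hadoop.sh stop     # stop them again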

8. Reward

If you find the article useful, you can encourage the author (Alipay)



Origin blog.csdn.net/weixin_41347419/article/details/113916436