Big Data Ecosystem Basics: Hadoop (Part 2): Installing and Verifying a Hadoop 3.0.0 Cluster

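I. Purpose

This article describes how to install and configure Hadoop clusters, ranging from a few nodes to very large clusters with thousands of nodes. To get started with Hadoop, first install it on a single machine (see the single-node setup in Hadoop (Part 1)).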

II. Prerequisites

Install Java and a stable release of Hadoop. On Mac OS X, Hadoop is built and installed from source.

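III. Installation

Installing a Hadoop cluster typically means unpacking the software on every machine in the cluster, or installing it through a packaging system appropriate for your operating system. It is important to divide the hardware up by function.
Typically, one machine in the cluster is designated as the NameNode and another machine is dedicated to the ResourceManager. Each of these roles is unique in the cluster, and these machines are the masters. Other services (such as the Web App Proxy Server and the MapReduce Job History Server) usually run either on dedicated hardware or on shared infrastructure, depending on the load.
The remaining machines in the cluster act as both DataNode and NodeManager. These are the workers, also called slaves.

Hadoop must be installed under the same directory on every server, for example $HOME/hadoop, and it is best to keep the user name and password the same as well. The Hadoop installation configured earlier can then be copied to the other servers with scp. Note that passwordless SSH login must be set up between the cluster servers.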
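
The copy step described above can be sketched roughly as follows. This is only an illustration: the worker hostnames, the WORKERS, HADOOP_DIR and DRY_RUN variables, and the distribute_hadoop function are hypothetical names introduced here, and the script assumes the same home directory path exists on every server. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: copy a configured Hadoop directory from the master to every worker.
# Assumptions: the hostnames and the $HOME/hadoop layout are examples, and
# passwordless SSH (for instance via ssh-keygen and ssh-copy-id) already works.

WORKERS="${WORKERS:-slave1 slave2 slave3}"   # hypothetical worker hostnames
HADOOP_DIR="${HADOOP_DIR:-$HOME/hadoop}"     # same path on every server
DRY_RUN="${DRY_RUN:-1}"                      # 1 = only print the commands

# Print (or run) one recursive scp per worker, targeting the parent directory
# so that the copy lands at the same path as on the master.
distribute_hadoop() {
  for host in $WORKERS; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "scp -r $HADOOP_DIR $host:$(dirname "$HADOOP_DIR")/"
    else
      scp -r "$HADOOP_DIR" "$host:$(dirname "$HADOOP_DIR")/"
    fi
  done
}

distribute_hadoop
```

Setting DRY_RUN=0 would perform the actual copy; keeping the dry run as the default makes it easy to check the target paths first.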

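IV. Configuring Hadoop in Non-Secure Mode

Hadoop's Java configuration is driven by two types of important configuration files:

Read-only default configuration: core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml.
Site-specific configuration: etc/hadoop/core-site.xml, etc/hadoop/hdfs-site.xml, etc/hadoop/yarn-site.xml and etc/hadoop/mapred-site.xml.

In addition, you can control the Hadoop scripts found in the bin/ directory of the distribution by setting site-specific values in etc/hadoop/hadoop-env.sh and etc/hadoop/yarn-env.sh.
To configure a Hadoop cluster, you need to configure both the environment in which the Hadoop daemons execute and the configuration parameters for the daemons themselves.

The HDFS daemons are the NameNode, the Secondary NameNode, and the DataNode. The YARN daemons are the ResourceManager, the NodeManager, and the WebAppProxy. If MapReduce is to be used, the MapReduce Job History Server will also be running. For large installations, these generally run on separate hosts.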

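1. Configuring the Environment of the Hadoop Daemons

Administrators should use the etc/hadoop/hadoop-env.sh script, and optionally the etc/hadoop/mapred-env.sh and etc/hadoop/yarn-env.sh scripts, to make site-specific customizations to the process environment of the Hadoop daemons.
At the very least, you must specify JAVA_HOME so that it is correctly defined on each remote node.
Administrators can configure individual daemons using the options shown in the table below:

Daemon	Environment Variable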
NameNode	HDFS_NAMENODE_OPTS
DataNode	HDFS_DATANODE_OPTS
Secondary NameNode	HDFS_SECONDARYNAMENODE_OPTS
ResourceManager	YARN_RESOURCEMANAGER_OPTS
NodeManager	YARN_NODEMANAGER_OPTS
WebAppProxy	YARN_PROXYSERVER_OPTS
Map Reduce Job History Server	MAPRED_HISTORYSERVER_OPTS

For example, to configure the NameNode to use ParallelGC and a 4 GB Java heap, add the following line to hadoop-env.sh:

export HDFS_NAMENODE_OPTS="-XX:+UseParallelGC -Xmx4g"

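Other useful configuration parameters include:

HADOOP_PID_DIR - The directory where the daemons' process ID files are stored.
HADOOP_LOG_DIR - The directory where the daemons' log files are stored. Log files are created automatically if they do not exist.
HADOOP_HEAPSIZE_MAX - The maximum amount of memory to use for the Java heap. Units supported by the JVM are also supported here. If no unit is given, the number is assumed to be in megabytes. By default, Hadoop lets the JVM decide how much to use. This value can be overridden per daemon using the appropriate _OPTS variable listed above. For example, setting HDFS_NAMENODE_OPTS to "-Xmx5g" configures the NameNode with a 5 GB heap.

In most cases, you should set HADOOP_PID_DIR and HADOOP_LOG_DIR to directories that are writable only by the users who will run the Hadoop daemons. Otherwise there is the potential for a symlink attack.

Also make sure that .bash_profile or /etc/profile is configured as explained in the previous article.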
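
As a rough illustration of the settings discussed above, the lines below show what a site-specific part of etc/hadoop/hadoop-env.sh might look like. Every path here is an assumption made for the example (including the JDK location), so substitute the values that match your own servers.

```shell
# Sketch of site-specific settings for etc/hadoop/hadoop-env.sh.
# All paths are assumed example values, not defaults shipped with Hadoop.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk    # assumed JDK location
export HADOOP_PID_DIR=/var/run/hadoop           # assumed; writable only by the daemon user
export HADOOP_LOG_DIR=/var/log/hadoop           # assumed; writable only by the daemon user

# Cluster-wide default heap limit; a number without a unit would mean megabytes.
export HADOOP_HEAPSIZE_MAX=1g

# Per-daemon override: the NameNode uses ParallelGC and a 5 GB heap.
export HDFS_NAMENODE_OPTS="-XX:+UseParallelGC -Xmx5g"
```

Keeping the cluster-wide limit modest and raising the heap only for specific daemons keeps the override visible in one place.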

2. Configuring the Hadoop Daemons

etc/hadoop/core-site.xml

Parameter	Value	Notes
`fs.defaultFS`	NameNode URI	hdfs://host:port/
`io.file.buffer.size`	131072	Size of read/write buffer used in SequenceFiles.
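
For illustration, the two parameters above could be written to a core-site.xml file as sketched below. The host name "master" and port 9000 are placeholders for your own NameNode URI, and CONF_DIR is a hypothetical variable that defaults to a scratch directory so that no real configuration is overwritten.

```shell
# Sketch: write a minimal core-site.xml.
# Assumptions: "master" and 9000 are placeholder values for the NameNode URI;
# CONF_DIR defaults to a temporary directory rather than etc/hadoop.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000/</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/core-site.xml"
```

Pointing CONF_DIR at $HADOOP_HOME/etc/hadoop would place the file where the daemons read it.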

etc/hadoop/hdfs-site.xml

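NameNode configuration parameters:

Parameter	Value	Notes
`dfs.namenode.name.dir`	Path on the local filesystem where the NameNode persistently stores the namespace and transaction logs.	If this is a comma-delimited list of directories, the name table is replicated in all of the directories for redundancy.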
`dfs.hosts` / `dfs.hosts.exclude`	List of permitted/excluded DataNodes.	If necessary, use these files to control the list of allowable datanodes.
`dfs.blocksize`	268435456	HDFS blocksize of 256MB for large file-systems.
`dfs.namenode.handler.count`	100	More NameNode server threads to handle RPCs from large number of DataNodes.

DataNode configuration parameters:

Parameter	Value	Notes
`dfs.datanode.data.dir`	Comma separated list of paths on the local filesystem of a `DataNode` where it should store its blocks.	If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices.
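
The NameNode and DataNode parameters above can be combined into an hdfs-site.xml along the lines of this sketch. It prints the file to standard output and also shows where the block size value comes from (256 MB expressed in bytes). The prop helper and all directory paths are hypothetical and introduced only for this example.

```shell
# Sketch: generate hdfs-site.xml content on standard output.
# Assumption: the /data... directories are example paths on separate devices.

# Print one <property> element for the given name ($1) and value ($2).
prop() {
  printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# 256 MB in bytes: 256 * 1024 * 1024 = 268435456
BLOCKSIZE=$((256 * 1024 * 1024))

{
  echo '<?xml version="1.0"?>'
  echo '<configuration>'
  prop dfs.namenode.name.dir /data/hadoop/name
  prop dfs.datanode.data.dir /data1/hadoop/data,/data2/hadoop/data
  prop dfs.blocksize "$BLOCKSIZE"
  prop dfs.namenode.handler.count 100
  echo '</configuration>'
}
```

Redirecting the output to etc/hadoop/hdfs-site.xml would install it. In practice the NameNode and DataNode entries only matter on the hosts that run those daemons, although a single shared file is a common simplification.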

etc/hadoop/yarn-site.xml

Configuration parameters shared by the ResourceManager and the NodeManager:

Parameter	Value	Notes
`yarn.acl.enable`	`true` / `false`	Enable ACLs? Defaults to false.
`yarn.admin.acl`	Admin ACL	ACL to set admins on the cluster. ACLs take the form of comma-separated users, a space, then comma-separated groups. Defaults to the special value * which means anyone. The special value of just a space means no one has access.
`yarn.log-aggregation-enable`	false	Configuration to enable or disable log aggregation

ResourceManager configuration parameters:

Parameter	Value	Notes
`yarn.resourcemanager.address`	`ResourceManager` host:port for clients to submit jobs.	host:port If set, overrides the hostname set in `yarn.resourcemanager.hostname`.
`yarn.resourcemanager.scheduler.address`	`ResourceManager` host:port for ApplicationMasters to talk to Scheduler to obtain resources.	host:port If set, overrides the hostname set in `yarn.resourcemanager.hostname`.
`yarn.resourcemanager.resource-tracker.address`	`ResourceManager` host:port for NodeManagers.	host:port If set, overrides the hostname set in `yarn.resourcemanager.hostname`.
`yarn.resourcemanager.admin.address`	`ResourceManager` host:port for administrative commands.	host:port If set, overrides the hostname set in `yarn.resourcemanager.hostname`.
`yarn.resourcemanager.webapp.address`	`ResourceManager` web-ui host:port.	host:port If set, overrides the hostname set in `yarn.resourcemanager.hostname`.
`yarn.resourcemanager.hostname`	`ResourceManager` host.	host Single hostname that can be set in place of setting all `yarn.resourcemanager*address` resources. Results in default ports for ResourceManager components.
`yarn.resourcemanager.scheduler.class`	`ResourceManager` Scheduler class.	`CapacityScheduler` (recommended), `FairScheduler` (also recommended), or `FifoScheduler`. Use a fully qualified class name, e.g., `org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler`.
`yarn.scheduler.minimum-allocation-mb`	Minimum limit of memory to allocate to each container request at the `Resource Manager`.	In MBs
`yarn.scheduler.maximum-allocation-mb`	Maximum limit of memory to allocate to each container request at the `Resource Manager`.	In MBs
`yarn.resourcemanager.nodes.include-path` / `yarn.resourcemanager.nodes.exclude-path`	List of permitted/excluded NodeManagers.	If necessary, use these files to control the list of allowable NodeManagers.

NodeManager configuration parameters:

Parameter	Value	Notes
`yarn.nodemanager.resource.memory-mb`	Resource i.e. available physical memory, in MB, for given `NodeManager`	Defines total available resources on the `NodeManager` to be made available to running containers
`yarn.nodemanager.vmem-pmem-ratio`	Maximum ratio by which virtual memory usage of tasks may exceed physical memory	The virtual memory usage of each task may exceed its physical memory limit by this ratio. The total amount of virtual memory used by tasks on the NodeManager may exceed its physical memory usage by this ratio.
`yarn.nodemanager.local-dirs`	Comma-separated list of paths on the local filesystem where intermediate data is written.	Multiple paths help spread disk i/o.
`yarn.nodemanager.log-dirs`	Comma-separated list of paths on the local filesystem where logs are written.	Multiple paths help spread disk i/o.
`yarn.nodemanager.log.retain-seconds`	10800	Default time (in seconds) to retain log files on the NodeManager. Only applicable if log-aggregation is disabled.
`yarn.nodemanager.remote-app-log-dir`	/logs	HDFS directory where the application logs are moved on application completion. Need to set appropriate permissions. Only applicable if log-aggregation is enabled.
`yarn.nodemanager.remote-app-log-dir-suffix`	logs	Suffix appended to the remote log dir. Logs will be aggregated to ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam}. Only applicable if log-aggregation is enabled.
`yarn.nodemanager.aux-services`	mapreduce_shuffle	Shuffle service that needs to be set for Map Reduce applications.
`yarn.nodemanager.env-whitelist`	Environment properties to be inherited by containers from NodeManagers	For MapReduce applications, HADOOP_MAPRED_HOME should be added in addition to the default values. The property value should be JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME

History server configuration parameters:

Parameter	Value	Notes
`yarn.log-aggregation.retain-seconds`	-1	How long to keep aggregation logs before deleting them. -1 disables. Be careful, set this too small and you will spam the name node.
`yarn.log-aggregation.retain-check-interval-seconds`	-1	Time between checks for aggregated log retention. If set to 0 or a negative value then the value is computed as one-tenth of the aggregated log retention time. Be careful, set this too small and you will spam the name node.

etc/hadoop/mapred-site.xml

MapReduce application configuration parameters:

Parameter	Value	Notes
`mapreduce.framework.name`	yarn	Execution framework set to Hadoop YARN.
`mapreduce.map.memory.mb`	1536	Larger resource limit for maps.
`mapreduce.map.java.opts`	-Xmx1024M	Larger heap-size for child jvms of maps.
`mapreduce.reduce.memory.mb`	3072	Larger resource limit for reduces.
`mapreduce.reduce.java.opts`	-Xmx2560M	Larger heap-size for child jvms of reduces.
`mapreduce.task.io.sort.mb`	512	Higher memory-limit while sorting data for efficiency.
`mapreduce.task.io.sort.factor`	100	More streams merged at once while sorting files.
`mapreduce.reduce.shuffle.parallelcopies`	50	Higher number of parallel copies run by reduces to fetch outputs from very large number of maps.
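
In the table above, each Java heap size is smaller than the matching container memory limit (1024 MB inside 1536 MB for maps, 2560 MB inside 3072 MB for reduces), which leaves headroom for the JVM's non-heap memory. A small sanity check for that relationship could look like the sketch below. The heap_fits function is a hypothetical helper, and it only understands -Xmx values given in megabytes with an M suffix.

```shell
# Sketch: check that a task JVM heap fits inside its YARN container.
# Assumption: -Xmx is given in megabytes with an "M" or "m" suffix.

# heap_fits CONTAINER_MB JAVA_OPTS
# Succeeds when the -Xmx value found in JAVA_OPTS is below CONTAINER_MB.
heap_fits() {
  container_mb=$1
  heap_mb=$(printf '%s\n' "$2" | sed -n 's/.*-Xmx\([0-9][0-9]*\)[Mm].*/\1/p')
  [ -n "$heap_mb" ] && [ "$heap_mb" -lt "$container_mb" ]
}

# Values taken from the table above.
heap_fits 1536 "-Xmx1024M" && echo "map: heap fits in container"
heap_fits 3072 "-Xmx2560M" && echo "reduce: heap fits in container"
```

If the heap were equal to or larger than the container limit, the check would fail, which is a hint that the task could be killed for exceeding its memory allocation.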

MapReduce JobHistory Server configuration parameters:

Parameter	Value	Notes
`mapreduce.jobhistory.address`	MapReduce JobHistory Server host:port	Default port is 10020.
`mapreduce.jobhistory.webapp.address`	MapReduce JobHistory Server Web UI host:port	Default port is 19888.
`mapreduce.jobhistory.intermediate-done-dir`	/mr-history/tmp	Directory where history files are written by MapReduce jobs.
`mapreduce.jobhistory.done-dir`	/mr-history/done	Directory where history files are managed by the MR JobHistory Server.

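3. Monitoring NodeManager Health

Hadoop lets administrators configure the NodeManager to periodically run an administrator-supplied script that determines whether a node is healthy. The script can perform any checks the administrator chooses. If the script detects that the node is unhealthy, it must print a line to standard output beginning with the string ERROR. The NodeManager spawns the script periodically and checks its output. If the output contains the string ERROR as described above, the node's status is reported as unhealthy and the node is blacklisted by the ResourceManager. No further tasks are assigned to this node. However, the NodeManager continues to run the script, so that if the node becomes healthy again it is automatically removed from the blacklist on the ResourceManager. The node's health, along with the script's output if the node is unhealthy, is available to the administrator in the ResourceManager web interface. The time since the node was last healthy is also shown in the web interface.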

etc/hadoop/yarn-site.xml

Parameter	Value	Notes
`yarn.nodemanager.health-checker.script.path`	Node health script	Script to check for node’s health status.
`yarn.nodemanager.health-checker.script.opts`	Node health script options	Options for script to check for node’s health status.
`yarn.nodemanager.health-checker.interval-ms`	Node health script interval	Time interval for running health script.
`yarn.nodemanager.health-checker.script.timeout-ms`	Node health script timeout interval	Timeout for health script execution.
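
A health script of the kind described above might look like the following sketch. The disk-usage check, the 90% threshold, and the THRESHOLD, CHECK_PATH and report_usage names are all assumptions chosen for illustration. The only requirement taken from the text is that an unhealthy node prints a line beginning with ERROR.

```shell
#!/usr/bin/env bash
# Sketch of a node health script for yarn.nodemanager.health-checker.script.path.
# Assumptions: the check (disk usage), the threshold and the path are examples.

THRESHOLD="${THRESHOLD:-90}"      # percent used at which the node is unhealthy
CHECK_PATH="${CHECK_PATH:-/}"     # filesystem to inspect

# Print a line starting with ERROR when usage ($1, in percent) reaches the
# threshold; otherwise print an informational line that does not contain ERROR.
report_usage() {
  if [ "$1" -ge "$THRESHOLD" ]; then
    echo "ERROR disk usage on $CHECK_PATH is $1%"
  else
    echo "OK disk usage on $CHECK_PATH is $1%"
  fi
}

# df -P prints POSIX-format output; field 5 of the second line is the use percentage.
usage=$(df -P "$CHECK_PATH" | awk 'NR==2 { gsub("%", "", $5); print $5 }')
report_usage "${usage:-0}"
```

The script would be made executable and referenced from yarn-site.xml through the parameters in the table above.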

4. The Workers File

List the hostnames of all your worker servers in the etc/hadoop/workers file, one per line.

In older versions this file was called slaves.

For example:

slave1

slave2

slave3

5、Hadoop 集群操作

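To start a Hadoop cluster, start both the HDFS and the YARN cluster. The first time you bring up HDFS, it must be formatted. Format the new distributed filesystem as the hdfs user: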

[hdfs]$ $HADOOP_HOME/bin/hdfs namenode -format <cluster_name>

Start the HDFS NameNode:

[hdfs]$ $HADOOP_HOME/bin/hdfs --daemon start namenode

Then start the DataNode:

[hdfs]$ $HADOOP_HOME/bin/hdfs --daemon start datanode

If passwordless SSH is set up for the servers listed in etc/hadoop/workers, everything can be started directly from the master server:

[hdfs]$ $HADOOP_HOME/sbin/start-dfs.sh

[yarn]$ $HADOOP_HOME/sbin/start-yarn.sh

Alternatively, start the YARN daemons individually:

[yarn]$ $HADOOP_HOME/bin/yarn --daemon start resourcemanager

[yarn]$ $HADOOP_HOME/bin/yarn --daemon start nodemanager

[yarn]$ $HADOOP_HOME/bin/yarn --daemon start proxyserver

If MapReduce is used, also start the JobHistory Server:

[mapred]$ $HADOOP_HOME/bin/mapred --daemon start historyserver

To stop the cluster, use the corresponding stop commands:

[hdfs]$ $HADOOP_HOME/sbin/stop-dfs.sh

[hdfs]$ $HADOOP_HOME/bin/hdfs --daemon stop namenode

[hdfs]$ $HADOOP_HOME/bin/hdfs --daemon stop datanode

[yarn]$ $HADOOP_HOME/sbin/stop-yarn.sh

[yarn]$ $HADOOP_HOME/bin/yarn --daemon stop resourcemanager

[yarn]$ $HADOOP_HOME/bin/yarn --daemon stop nodemanager

[yarn]$ $HADOOP_HOME/bin/yarn --daemon stop proxyserver

[mapred]$ $HADOOP_HOME/bin/mapred --daemon stop historyserver

Once the cluster is up and running, verify it through the web interfaces of the daemons:

Daemon	Web Interface	Notes
NameNode	http://nn_host:port/	Default HTTP port is 9870.
ResourceManager	http://rm_host:port/	Default HTTP port is 8088.
MapReduce JobHistory Server	http://jhs_host:port/	Default HTTP port is 19888.
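
As a small aid for the verification step, the sketch below prints the web UI address of each daemon using the default ports from the table above. The host names and the ui_url function are hypothetical placeholders, not part of Hadoop.

```shell
# Sketch: build the default web UI URLs for a quick verification pass.
# Assumption: the host names are placeholders; ports are the defaults listed above.
NN_HOST="${NN_HOST:-master}"
RM_HOST="${RM_HOST:-master}"
JHS_HOST="${JHS_HOST:-master}"

# Print the web UI URL for the named daemon, or fail for an unknown name.
ui_url() {
  case "$1" in
    namenode)        echo "http://$NN_HOST:9870/" ;;
    resourcemanager) echo "http://$RM_HOST:8088/" ;;
    jobhistory)      echo "http://$JHS_HOST:19888/" ;;
    *)               echo "unknown daemon: $1" >&2; return 1 ;;
  esac
}

for daemon in namenode resourcemanager jobhistory; do
  ui_url "$daemon"
done
```

Each printed URL can then be opened in a browser, or probed from the command line with something like curl -sf -o /dev/null followed by the URL. Running jps on each server is another quick way to see which Hadoop daemons are alive.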

caridle