Hadoop Cluster Tuning Practice Summary

Tuning Overview #

In most scenarios, MapReduce (or any distributed architecture) workloads are either IO-bound, where the disk or the network is the bottleneck, or CPU-bound, where computation is the bottleneck. Reading and writing large amounts of data from disk is the common case when analyzing massive datasets.

Examples of IO-bound workloads:

Indexing
Grouping
Data import/export
Data movement and transformation
 

Examples of CPU-bound workloads:

Clustering / classification
Complex text mining
Feature extraction
User profiling
Natural language processing
 

Improving performance and efficiency requires combining many aspects, from hardware planning to software planning and tuning.

Hardware Planning #

Assessment of cluster size #

How many nodes do we need in the Hadoop cluster? Many factors go into answering this question: budget, data volume, and computing resources.

Computing requirements are hard to assess precisely up front, so plan a starting scale and expand as the business and applications grow. A practical starting point is to size the cluster by data volume: how much new data arrives each day, how long data must be retained, and whether there is a plan for moving cold data elsewhere.

For example, assume data grows by 600 GB per day, is stored with 3-way replication, and is planned for one year: 600 GB × 3 × 360 days ≈ 633 TB. Then add roughly 20% headroom, factor in future data growth, and account for the scratch space that computation needs. To save space, consider compressing stored data (which can save around 70% of the space).
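
As a quick sanity check, this estimate can be scripted; a minimal sketch using the figures from the example above (600 GB/day, 3 replicas, 360 days, 20% headroom, and ~70% compression savings are all assumptions):

daily_gb=600; replicas=3; days=360
raw_gb=$(( daily_gb * replicas * days ))          # 648000 GB, roughly 633 TB
with_headroom_gb=$(( raw_gb * 120 / 100 ))        # add a 20% reserve
compressed_gb=$(( with_headroom_gb * 30 / 100 ))  # if compression saves about 70%
echo "raw: ${raw_gb} GB  with headroom: ${with_headroom_gb} GB  compressed: ${compressed_gb} GB"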

Also allow a certain amount of redundancy, so the cluster keeps working normally even when some nodes are unavailable (scale the redundancy proportionally to the cluster size).

Then combine per-node hardware planning with the budget to determine the cluster size. If we need to store roughly 650 TB, we could use about 30 nodes with a 12 × 2 TB disk configuration, or about 60 nodes with a 6 × 2 TB configuration; both meet the storage requirement with different node counts. Note that the larger node count also increases compute capacity, but it requires more power, cooling, rack space, and network port density. It is a trade-off that should be driven by actual demand.

Node hardware planning #

Should we choose higher-end or lower-end CPUs? What ratio of CPU to memory to storage?

General principles:

  1. Choose mid-range CPU frequencies, and usually no more than two sockets per node. This balances price and power consumption while still making full use of CPU performance.

  2. Consider the cost ratio of CPU to memory, and aim to keep CPU utilization high. 48 GB of RAM can be a good choice: it allows more parallel processes and more cache, which improves performance.

  3. Use high-capacity SATA drives (typically 7200 RPM). Hadoop is generally storage-intensive, so top-end disk performance is not required. Multiple disks spread the IO load and improve fault tolerance; in a large cluster disk failures are very common (and if per-node capacity is too high, a node going down triggers a large burst of replication traffic inside the cluster). A common configuration is 12 × 4 TB disks, adjusted to the situation.

  4. Each node should have around 2 Gbps of network throughput (balancing cost and demand), and the network topology should not be too deep; Hadoop places relatively high demands on both north-south and especially east-west (node-to-node) bandwidth.

Different Hadoop components have different hardware requirements. The following describes the hardware characteristics each component needs:

  1. NameNode
    The NameNode coordinates data storage across the whole cluster and keeps the block metadata for all data in the cluster in RAM. A fairly reliable rule of thumb is that about 1 GB of NameNode RAM supports 1 million blocks, and 64 GB of RAM can support on the order of 100 million blocks.

<Secondary NameNode memory> = <NameNode memory> = <HDFS cluster management memory> + <2GB for the NameNode process> + <4GB for the OS> + memory for other co-located processes (ZooKeeper, JournalNode, ResourceManager: typically 2GB each)
 

So the main hardware requirement here is memory, estimated from the storage capacity (number of blocks / number of files / block size). Also make sure the HDFS metadata is stored durably and safely (typically on a dedicated disk, a RAID volume, or combined with attached NFS storage). CPU requirements are not high.
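
To illustrate the block-count rule of thumb, a rough sketch of estimating NameNode heap from raw capacity (the 650 TB capacity, 128 MB block size, and 3-way replication are assumptions; real clusters usually have far more blocks because most files are smaller than one block):

raw_tb=650; block_mb=128; replicas=3
blocks=$(( raw_tb * 1024 * 1024 / block_mb / replicas ))   # unique blocks if every block were full
heap_gb=$(( (blocks + 999999) / 1000000 ))                 # ~1 GB of heap per million blocks, rounded up
echo "~${blocks} blocks -> at least ${heap_gb} GB of NameNode heap"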

  2. ResourceManager
    Handles resource scheduling for the whole cluster; it is not very resource-intensive and is often co-located with the NameNode.

  3. Standby NameNode
    Ideally configured identically to the NameNode.

  4. JournalNode
    Minimal resource requirements.

  5. ZooKeeper
    Minimal resource requirements, but avoid placing it on heavily loaded machines; 3 to 5 instances are typically used.

  6. DataNode / NodeManager
    The main storage and compute nodes; plan the per-node hardware according to the principles above, combined with the budget.

Estimating DataNode/NodeManager memory:
for IO-bound jobs, allocate 2-4 GB of RAM per core
for CPU-bound jobs, allocate 6-8 GB of RAM per core

Besides the memory consumed by jobs, each node also needs:
2 GB of RAM for the DataNode process, which manages HDFS blocks
2 GB of RAM for the NodeManager process, which manages the tasks running on the node
4 GB of RAM for the OS

A formula is given below:

<DataNode memory for I/O bound profile> = 4GB * <number of physical cores> + <2GB for the DataNode process> + <2GB for the NodeManager process> + <4GB for the OS>
<DataNode memory for CPU bound profile> = 8GB * <number of physical cores> + <2GB for the DataNode process> + <2GB for the NodeManager process> + <4GB for the OS>
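
A quick worked example of these formulas for a hypothetical node with two 6-core CPUs (the 12 physical cores are only an assumption):

cores=12
io_bound_gb=$(( 4 * cores + 2 + 2 + 4 ))    # 56 GB
cpu_bound_gb=$(( 8 * cores + 2 + 2 + 4 ))   # 104 GB
echo "IO-bound profile: ${io_bound_gb} GB   CPU-bound profile: ${cpu_bound_gb} GB"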
 

Recommended per-node configurations:
NameNode / ResourceManager / Standby NameNode machines:

4-6 × 1 TB hard disks in a JBOD configuration (1 for the OS, 2 for the FS image [RAID 1], 1 for Apache ZooKeeper, and 1 for the JournalNode)
2 quad-/hex-/octo-core CPUs, running at least 2-2.5 GHz
64-128 GB of RAM
Bonded Gigabit Ethernet or 10 Gigabit Ethernet
 

DataNode / TaskTracker machines:

12-24 × 1-4 TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
2 quad-/hex-/octo-core CPUs, running at least 2-2.5 GHz
64-512 GB of RAM
Bonded Gigabit Ethernet or 10 Gigabit Ethernet (the higher the storage density, the higher the network throughput needed)
 

Here are some hardware suggestions for different types of workload:

  • Light processing (1U per machine): two six-core CPUs, 24-64 GB of RAM, 8 disks (2 TB or 4 TB each)

  • Balanced configuration (1U per machine): two six-core CPUs, 48-128 GB of RAM, 12-16 disks (4 TB or 6 TB each), with the disk controller connected directly to the motherboard. If such a node fails, re-replicating its data generates a lot of cluster traffic.

  • Storage-heavy (2U per machine): two six-core CPUs, 48-96 GB of RAM, 16-24 disks (6 TB or 8 TB each). If such a node fails, re-replicating its data generates a large amount of cluster traffic.

  • Compute-intensive (2U per machine): two six-core CPUs, 64-512 GB of RAM, 4-8 disks (2 TB or 4 TB each)

NOTE: the CPU counts above are minimums; 2 × 8-core, 2 × 10-core, or 2 × 12-core configurations are recommended (not counting Hyper-Threading).

If you cannot yet predict the final workload of a new cluster, the balanced hardware configuration is still the recommendation.

Heterogeneous Cluster #

Hadoop has grown into an all-encompassing data platform: it is no longer just MapReduce, and multiple computation models can be plugged into Hadoop and integrated seamlessly. The YARN resource manager in Hadoop 2.x is compatible with a variety of computation models, for example in-memory computing (Spark, Impala, Tez, Drill, Presto) and disk-based computing (Hive, MapReduce, Pig). A heterogeneous cluster runs several of these computation models at the same time, so on top of the hardware configuration above it needs both more memory and more disk: Impala recommends a minimum of 128 GB of memory to be used effectively; Spark is typically CPU-intensive and needs more CPU and memory; Hive and MapReduce are disk-based and read and write disks heavily. When selecting cluster hardware, consider which software components you will run, including Apache HBase, Cloudera Impala, Presto or Drill, Apache Phoenix, and Apache Spark.

With YARN, consider introducing resource pools and label-based scheduling. Label-based scheduling is a newer YARN feature that lets YARN run better on a heterogeneous cluster, and therefore manage and schedule mixed types of applications more effectively.

Network topology #

Hadoop is IO-hungry: both disk-IO-hungry and network-IO-hungry. Although Hadoop tries to schedule map tasks close to their data, the shuffle/sort phase and the reducer output still generate a lot of IO.

Hadoop does not strictly require a 10 Gb network, but higher bandwidth certainly yields better performance. Once you find yourself bonding more than two 1 Gb NICs for bandwidth, it is time to consider deploying 10 Gb.

Network topology has a real impact on Hadoop. The shuffle/sort phase generates a lot of east-west (node-to-node) traffic, so bandwidth requirements between any pair of nodes are high. This is very different from traditional web serving, where the high-bandwidth traffic is north-south. A topology with a lot of vertical depth (many levels) will reduce network performance.

Hadoop demands a lot of east-west bandwidth. For this reason the traditional tree topology is not a great fit for Hadoop; a spine-leaf fabric topology is more suitable.

Typical Deployment Case #

2 machines: two 8-core CPUs / 64 GB RAM / 6 × 1 TB disks (OS: 1 disk, fsimage: 2 disks in RAID 1, RM: 1 disk, ZooKeeper: 1 disk, JN: 1 disk)
1 machine: two 6-core CPUs / 24 GB RAM / 4 × 1 TB disks (OS: 1 disk, ZooKeeper: 1 disk, JN: 1 disk)
30 machines: two 10-core CPUs / 64 GB RAM / 12 × 4 TB disks


Software planning #

Operating system #

Hadoop itself is written mostly in Java, with some C/C++ code. It was also essentially designed for Linux, and much of its code assumes a Linux-style architecture, so it is generally deployed on Linux.

RedHat Enterprise Linux, CentOS, Ubuntu Server Edition, SuSE Enterprise Linux, and Debian can all run Hadoop well in production. The choice therefore depends more on which systems your management tools support, hardware support, what the commercial software you use supports, and, very importantly, which system your administrators are most familiar with.

Configuring systems by hand is time-consuming and error-prone, so use a software configuration management system rather than manual configuration; Puppet and Chef are currently popular choices.

Hadoop version #

The Apache release of Hadoop is not the only one; many companies maintain their own distributions. The most popular non-Apache distribution is Cloudera's, known as CDH.

  • Cloudera Hadoop distribution
    Cloudera is a company that provides commercial Hadoop support, advanced tools, and professional services. Its distribution, CDH, is free and open source under the Apache 2.0 license. For users, CDH avoids a confusing tangle of branches: version numbers are continuous and compatible. The current version, CDH5, combines characteristics of both the Apache 1.0 and 2.0 lines, including NameNode HA and Federation, and supports both MRv1 and MRv2, which the Apache release at the time did not offer in one package. Another characteristic of CDH is that it integrates many Hadoop ecosystem projects: HDFS and MapReduce are the core components, and a growing number of components sit on top of them. These components make Hadoop friendlier to use and shorten development cycles, making it easier to write MapReduce jobs.

The Impala project in CDH deserves a special mention: it bypasses the MapReduce layer entirely and reads data directly from HDFS to run real-time queries against Hadoop. CDH also resolves many dependencies in the Hadoop ecosystem: it ships most ecosystem components and handles compatibility between them. This is a big advantage for users choosing CDH and a large part of why it has become so popular. Alongside CDH, Cloudera released Cloudera Manager, a web-based tool to plan, configure, and monitor Hadoop clusters; it has a free edition and a paid enterprise edition.

  • Hortonworks Hadoop distribution
    Another popular distribution is the Hortonworks Data Platform (HDP) from Hortonworks. Like Cloudera, Hortonworks provides an integrated, installable distribution plus commercial support and services, with HDP 1.2 and 2.0 releases. HDP 1.2 offers some features other distributions lack: Hortonworks implemented NameNode HA on top of Hadoop 1.0 (note: using Linux HA technology rather than JournalNodes), and HDP includes HCatalog to provide integrated services to projects such as Pig and Hive. To attract users, Hortonworks has taken care to integrate with traditional BI: HDP provides an ODBC driver for Hive, which lets most BI tools plug in. Another feature is that HDP can run on Windows, though its stability there is still being proven. HDP uses Ambari to manage and monitor the cluster; Ambari is a web-based tool similar to Cloudera Manager, with the difference that it is 100% free and open source.

  • MapR distribution
    Besides Cloudera and Hortonworks, MapR is another company providing a Hadoop-based platform. It has several product editions: M3 is a free version with limited functionality, while M5 and M7 are enterprise editions. Unlike Cloudera and Hortonworks, the software MapR offers is not free, but it provides some enterprise-class features. The main difference is that MapR does not use Apache HDFS; it uses its own MapR-FS file system instead. MapR-FS is implemented in C++ and offers lower latency and higher concurrency than the Java-based HDFS; the API is compatible, but the implementation is completely different. MapR-FS also provides NFS volumes, cluster snapshots, and cluster monitoring, all built on top of MapR-FS itself.

Selection advice:
Decide which distribution to use based on your own situation and your operations team. If the team has no Hadoop customization capability and there is no strong business need to customize, consider CDH: it is relatively stable, its component integration is convenient, and the free edition of Cloudera Manager can manage and monitor the cluster. If you need custom development, choose the community (Apache) version so you can work with the development community easily; of course, you can also merge CDH patches into your own branch, or base your development on CDH.

Java version #

At the most basic level, Hadoop needs a JDK to run, and the JDK version matters: if you are using the older Java 6, install the Oracle (Sun) JDK; with Java 7 you can use the system's default OpenJDK 7. Official tests and user-reported compatibility results are listed here:
http://wiki.apache.org/hadoop/HadoopJavaVersions

A 64-bit system is generally chosen, since nodes are typically configured with far more than 4 GB of memory.

Parameter optimization #

OS parameter optimization #

The following parameters can be adjusted and optimized according to the actual situation.

  • Disable SELinux and flush iptables; after the servers are configured and the services run normally, SELinux can be re-enabled if required.

    [root@localhost ~]# sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config   # disable permanently
    [root@localhost ~]# grep SELINUX=disabled /etc/selinux/config                             # verify
    [root@localhost ~]# setenforce 0    # disable for the current boot
    [root@localhost ~]# getenforce      # check the status (Permissive)
    [root@localhost ~]# service iptables stop      # stop the firewall
    [root@localhost ~]# chkconfig iptables off     # do not start it at boot
     
  • Trim the services started at boot
    Keeping only crond, network, syslog, and sshd is enough; enable others later as needed.

    # turn off all services
    [root@localhost ~]# for s in `chkconfig --list|grep 3:on|awk '{print $1}'`;do chkconfig --level 3 $s off;done
    # turn on the required services
    [root@localhost ~]# for s in crond rsyslog sshd network;do chkconfig --level 3 $s on;done
    # check the result
    [root@localhost ~]# chkconfig --list|grep 3:on
     
  • Adjust the file descriptor limits

# permanent change
vi /etc/security/limits.conf
* soft nofile 65535
* hard nofile 65535
* soft nproc 65535
* hard nproc 65535
# temporary change for the current shell
[root@localhost ~]# ulimit -SHn 65535
 
  • Remove unnecessary system users and groups

    # remove unnecessary users
    userdel adm
    userdel lp
    userdel sync
    userdel shutdown
    userdel halt
    userdel news
    userdel uucp
    userdel operator
    userdel games
    userdel gopher
    userdel ftp
    # remove unnecessary groups
    groupdel adm
    groupdel lp
    groupdel news
    groupdel uucp
    groupdel games
    groupdel dip
    groupdel pppusers
     
  • Synchronize the system time

    [root@localhost ~]# ntpdate cn.pool.ntp.org ; hwclock -w   # sync the time and write it to the BIOS hardware clock
    [root@localhost ~]# crontab -e                             # schedule a periodic sync, e.g. every hour on the hour:
    0 * * * * /usr/sbin/ntpdate cn.pool.ntp.org ; hwclock -w
     
  • Kernel parameter optimization

[root@localhost ~]# vi /etc/sysctl.conf    # append the following parameters at the end
net.ipv4.tcp_tw_reuse = 1              # 1 allows TIME_WAIT sockets to be reused for new TCP connections; default 0 (off)
net.ipv4.tcp_tw_recycle = 1            # enable fast recycling of TIME_WAIT sockets; default 0 (off)
net.ipv4.ip_local_port_range = 4096 65000   # port range available to applications
net.ipv4.tcp_max_tw_buckets = 5000     # max TIME_WAIT sockets kept at once; beyond this they are cleared and a warning is printed; default 180000
net.ipv4.tcp_max_syn_backlog = 4096    # max queue of incoming SYN requests; default 1024
net.core.netdev_max_backlog = 10240    # max packets queued on the device input queue; default 300
net.core.somaxconn = 2048              # max number of pending listen() connections; default 128
net.core.wmem_default = 8388608        # default send buffer size
net.core.rmem_default = 8388608        # default receive socket buffer size (bytes)
net.core.rmem_max = 16777216           # maximum receive buffer size
net.core.wmem_max = 16777216           # maximum send buffer size
net.ipv4.tcp_synack_retries = 2        # SYN-ACK retries; default 5
net.ipv4.tcp_syn_retries = 2           # outbound SYN retries; default 4
net.ipv4.tcp_max_orphans = 3276800     # max TCP sockets not attached to any user file handle; beyond this, orphan connections are reset and a warning is printed
net.ipv4.tcp_mem = 94500000 915000000 927000000
vm.swappiness = 0                      # avoid swapping
# iptables / connection tracking
net.nf_conntrack_max = 25000000
net.netfilter.nf_conntrack_max = 25000000
net.netfilter.nf_conntrack_tcp_timeout_established = 180
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
[root@localhost ~]# /sbin/sysctl -p
 
  • File system
    ext3 is the recommended file system for Hadoop data disks; it has been tested extensively (Yahoo's clusters use ext3), while ext4 and xfs have carried a risk of data loss.

Disable Linux logical volume management (LVM)

Disable file and directory access times (noatime, nodiratime) when mounting the data partitions

LABEL=/data1    /data1  ext3    noatime,nodiratime  1   2
 

The -m formatting option can be used to reduce the reserved space

[root@localhost ~]# mkfs -t ext3 -j -m 1 -O sparse_super,dir_index /dev/sdXN
# -m 1: by default the system reserves 5% of the space for root, which is a lot on a very large file system
# and completely unnecessary for a Hadoop data partition; this option lowers the reservation to 1%.

# or change it afterwards with tune2fs
[root@localhost ~]# tune2fs -m 1 /dev/sdXN
 
  • Disable Transparent Huge Pages (THP)

    [root@localhost ~]# echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
    [root@localhost ~]# echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
     

    These commands can also be added to a boot script so that the setting persists across reboots, for example as shown below.
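
    One common way (assuming a RHEL/CentOS 6 style system where /etc/rc.local runs at boot) is simply to append the same commands there:

    [root@localhost ~]# echo 'echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled' >> /etc/rc.local
    [root@localhost ~]# echo 'echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag' >> /etc/rc.local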

Hadoop parameter optimization #

These settings were verified on Hadoop 2.6.4; for other versions, adapt as appropriate.
core-site.xml

| Parameter | Default | Notes |
| --- | --- | --- |
| hadoop.tmp.dir | /tmp/hadoop-${user.name} | Hadoop's intermediate/temporary files (e.g. MR split and staging information) are best pointed at a dedicated directory |
| io.file.buffer.size | 4096 | Buffer size for IO operations. A larger buffer gives higher throughput at the cost of more memory and latency. Set it to a multiple of the system page size, in bytes; the default is 4 KB, and 64 KB (65536) is a common choice |
| fs.trash.interval | 0 | Enabling the trash feature is recommended; this parameter defines how long files stay in .Trash before being permanently deleted |
| topology.script.file.name | -- | When the cluster has a relatively large number of nodes, configuring rack awareness is recommended; see the script sketch below |
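
The topology script receives one or more host names or IP addresses as arguments and must print one rack path per host. A minimal sketch (the mapping-file path /etc/hadoop/conf/topology.data and the rack names are made up for illustration):

#!/bin/bash
# look each host up in a "host rack" mapping file; fall back to /default-rack
MAP=/etc/hadoop/conf/topology.data
while [ $# -gt 0 ]; do
  host=$1; shift
  rack=$(awk -v h="$host" '$1 == h {print $2}' "$MAP")
  echo "${rack:-/default-rack}"
done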

hdfs-site.xml

| Parameter | Default | Notes |
| --- | --- | --- |
| dfs.namenode.handler.count | 10 | Number of NameNode service threads; increase it. A common rule of thumb is 20 × ln(N), where N is the cluster size |
| dfs.datanode.handler.count | 10 | Number of DataNode service threads; determine by testing on the actual machines, usually a few more than the number of CPU cores |
| dfs.datanode.max.transfer.threads | 4096 | Maximum number of block send/receive tasks a DataNode handles concurrently, similar to the open-file-handle limit on Linux |
| dfs.namenode.name.dir | file://${hadoop.tmp.dir}/dfs/name | Use multiple redundant locations, e.g. one local disk plus one NFS mount |
| dfs.datanode.data.dir | file://${hadoop.tmp.dir}/dfs/data | Spread storage over multiple locations, ideally one directory per disk/partition |
| dfs.datanode.failed.volumes.tolerated | 0 | Number of disks allowed to fail before the whole DataNode is declared failed. By default any local disk failure marks the entire DataNode as failed; this can be set to roughly 30% of the disks |
| dfs.client.read.shortcircuit | false | Recommended to set to true to enable short-circuit (local) reads |
| dfs.domain.socket.path |  | Socket path for short-circuit reads, e.g. /var/run/hadoop-hdfs/dn._PORT; make sure /var/run/hadoop-hdfs/ is group-writable and the group is root |
| dfs.blocksize | 134217728 | Block size for new files, 128 MB by default; adjust to the cluster size. The number of mappers is basically determined by the number of input blocks, and small blocks lead to many small tasks |
| dfs.hosts | -- | File listing, one per line, the host names or IP addresses allowed to join the cluster; hosts not in the list cannot join |
| dfs.hosts.exclude | -- | Like dfs.hosts; HDFS excludes the nodes listed in this file, which allows nodes to be decommissioned safely |
| dfs.datanode.balance.bandwidthPerSec | 1048576 | Maximum bandwidth the balancer may use when moving blocks between DataNodes. Without a limit the balancer can quickly saturate the network and hurt MapReduce and other services; too low and balancing is too slow. The value is bytes per second, while network bandwidth is usually quoted in bits, so convert first (see the example below) |
| dfs.datanode.du.reserved | 0 | Per-volume space reserved for non-HDFS use; the DataNode reports the configured directories' capacity minus this value. Reserving some space for other services also avoids some unnecessary disk-usage alarms |
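
For example, converting a balancer bandwidth budget quoted in bits per second into the bytes-per-second value this parameter expects, and applying the handler-count rule of thumb (the 100 Mbit/s budget and the 60-node cluster size are assumed figures):

mbit=100
echo "dfs.datanode.balance.bandwidthPerSec = $(( mbit * 1000 * 1000 / 8 ))"          # 12500000 bytes/s
awk 'BEGIN { n = 60; printf "dfs.namenode.handler.count ~= %d\n", 20 * log(n) }'     # 20 * ln(60) ~= 81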

mapred-site.xml

| Parameter | Default | Notes |
| --- | --- | --- |
| mapreduce.cluster.local.dir | ${hadoop.tmp.dir}/mapred/local | Where MR intermediate data is stored; best spread over multiple directories, comma-separated |
| mapreduce.shuffle.readahead.bytes | 4194304 | Default 4 MB. The ShuffleHandler uses posix_fadvise to manage the OS cache when sending files; increasing the readahead length can improve shuffle efficiency |
| mapreduce.ifile.readahead.bytes | 4194304 | Default 4 MB; IFile readahead length |
| mapreduce.tasktracker.outofband.heartbeat | false | Recommended true, so the TaskTracker sends an out-of-band heartbeat when a task completes, reducing latency |
| mapreduce.jobtracker.heartbeat.interval.min | 300 | Increasing the TaskTracker-to-JobTracker heartbeat interval can improve MR performance on small clusters; it can be set to 1000 |
| mapred.reduce.slowstart.completed.maps | 0.05 | Fraction of map tasks that must finish before reduce tasks start; many small jobs can use 0, large jobs 0.5 |
| mapreduce.map.speculative | true | Speculative execution of map tasks; consider setting it to false when compute resources are tight and the tasks themselves are resource-heavy. Enable per job via job parameters when needed |
| mapreduce.reduce.speculative | true | Speculative execution of reduce tasks; recommended off, enable per job via job parameters when needed |
| mapreduce.task.io.sort.mb | 100 | In MB, default 100. Adjust the sort buffer to the volume of map output; adjust moderately, bigger is not always better |
| mapreduce.map.sort.spill.percent | 0.8 | Spill to disk when the buffer reaches 80% |
| mapreduce.map.output.compress | false | Whether map output is compressed; enabling it is recommended to reduce IO and network usage |
| mapreduce.map.output.compress.codec | org.apache.hadoop.io.compress.DefaultCodec | Snappy is recommended: org.apache.hadoop.io.compress.SnappyCodec |
| mapreduce.output.fileoutputformat.compress.type | RECORD | Compression type when writing SequenceFiles; BLOCK is recommended |
| mapreduce.map.java.opts | -- | JVM options that can be set for tuning |
| mapreduce.jobtracker.handler.count | 10 | Number of JobTracker RPC threads; generally recommended to be roughly 4% of the number of TaskTracker nodes |
| mapreduce.tasktracker.http.threads | 40 | Number of worker threads serving map output; adjust to the cluster size and hardware |
| mapreduce.tasktracker.map.tasks.maximum | 2 | Number of map tasks a TaskTracker runs concurrently; usually the number of CPU cores, or 1.5× that |
| mapreduce.tasktracker.reduce.tasks.maximum | 2 | Number of reduce tasks a TaskTracker runs concurrently; usually the number of CPU cores, or 1.5× that |
| mapreduce.reduce.shuffle.input.buffer.percent | 0.7 | Fraction of the reduce heap used to buffer map output during shuffle, the reduce-side counterpart of mapreduce.task.io.sort.mb, i.e. the maximum memory the shuffle uses. If map output is large and local disk IO is heavy during the reduce sort phase, try increasing this value |
| mapreduce.reduce.shuffle.parallel.copies | 5 | Number of copy threads in the shuffle phase, default 5; commonly set to 4 × ln(N), where N is the cluster size |
| mapreduce.job.jvm.numtasks | 1 | Default 1; set it to -1 to reuse JVMs |
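
The compression settings in particular can be tried per job before changing the cluster-wide defaults. A sketch using generic -D options (my-job.jar, MyJob, and the input/output paths are placeholders, and this assumes the job's driver uses ToolRunner so that -D options are picked up):

[root@localhost ~]# hadoop jar my-job.jar MyJob \
    -D mapreduce.map.output.compress=true \
    -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
    input output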

yarn-site.xml

| Parameter | Default | Notes |
| --- | --- | --- |
| yarn.scheduler.minimum-allocation-mb | 1024 | Minimum amount of memory that can be allocated per container request |
| yarn.scheduler.maximum-allocation-mb | 8192 | Maximum amount of memory that can be allocated per container request |
| yarn.nodemanager.resource.memory-mb | 8192 | Default 8192 MB; total physical memory on the node that is made available to containers |
| yarn.nodemanager.resource.cpu-vcores | 8 | Total number of virtual CPUs available to the NodeManager; set according to the hardware, a simple choice being the number of hyper-threads |
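
yarn.nodemanager.resource.memory-mb should leave room for the OS and the DataNode/NodeManager daemons, following the same per-node breakdown used earlier; a simple sizing sketch (the 64 GB node is an assumption):

total_gb=64
containers_gb=$(( total_gb - 4 - 2 - 2 ))   # minus OS, DataNode daemon, NodeManager daemon
echo "yarn.nodemanager.resource.memory-mb = $(( containers_gb * 1024 ))"   # 57344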

How to tune #

The basic steps of general system tuning:

Measure the current state of the system: understand the existing hardware and software environment and the current key system metrics.
Set tuning goals: decide which problems to solve first, and evaluate and design the tuning targets.
Find the performance bottleneck, using the existing monitoring data to locate it.
Tune: pick and implement the strategies with the highest benefit-to-cost ratio.
Measure whether the goal has been reached (if not, go back to looking for bottlenecks).
Tuning is complete.
 

When installing and deploying the cluster, collect the business requirements and workload characteristics (data volume, read/write patterns, compute volume, etc.), and do the hardware planning and initial testing at the same time (basic IO/network/CPU tests on each server, to make sure the hardware of servers joining the cluster is sound). The discussion below covers Hadoop cluster tuning mainly from the angles of hardware planning and software tuning.

Designing benchmark test cases #

How do you know whether your adjustment helped? How do you know whether you found the real bottleneck? You need a baseline to compare against, so you can measure how much performance your changes gain. Hadoop ships with ready-made benchmark applications, for example TestDFSIO and dfsthroughput for HDFS I/O testing (in hadoop-*-test.jar) and Sort for overall hardware testing (in hadoop-*-examples.jar). Pick whichever benchmark fits your testing needs.

Of all these benchmarks, Sort with a large input reflects both MapReduce runtime performance (while performing the sort) and HDFS I/O performance (while writing the sorted output to HDFS). Sort is also the hardware benchmark recommended by Apache (http://wiki.apache.org/hadoop/Sort).

Test HDFS write and read performance first, then use the Sort baseline to test compute and overall performance.

IO test; it prints the relevant statistics at the end.

# write 10 files of 5 GB each
[root@localhost ~]# hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.6.4-tests.jar TestDFSIO -write -nrFiles 10 -size 5GB
# read 10 files of 5 GB each
[root@localhost ~]# hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.6.4-tests.jar TestDFSIO -read -nrFiles 10 -size 5GB
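
TestDFSIO leaves its test files in HDFS; they can be removed afterwards with its -clean option (same jar as above):

[root@localhost ~]# hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.6.4-tests.jar TestDFSIO -clean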
 

Sort baseline test

# generate the test data
[root@localhost ~]# hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.4.jar randomwriter -Dtest.randomwriter.maps_per_host=10 -Dtest.randomwrite.bytes_per_map=50G random-data
# run the sort
[root@localhost ~]# hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.4.jar sort random-data sorted-data
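
The sorted output can be validated afterwards; to the best of my knowledge the jobclient tests jar ships a testmapredsort checker for exactly this, e.g.:

[root@localhost ~]# hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.6.4-tests.jar testmapredsort -sortInput random-data -sortOutput sorted-data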
 

Run the baseline with all default configuration parameters first, then change parameters round by round, comparing the results and the server monitoring data against the baseline.

Analyzing monitoring data #

Before tuning, the cluster should already have basic resource monitoring of CPU, file IO, network IO, and memory. If not, install monitoring tools or scripts to collect host performance data; common tools include nmon/nmon analyser, dstat, and so on.
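
If nothing is in place yet, even a simple dstat capture on each node during a benchmark run provides usable CPU/disk/network/memory data (the output path here is arbitrary):

[root@localhost ~]# dstat -cdnm --output /tmp/dstat_$(hostname).csv 5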

Typical symptoms of a performance bottleneck: some resource is over-consumed; an external system the job depends on is too slow; or resource consumption is low yet the program's response time still fails to meet requirements.

The main data sources for performance analysis are host monitoring data and the counters produced during MR jobs. Common things to look at:

  1. Whether worker-node CPU is used effectively; it should normally sit around 70%. Below 40% indicates the CPU is not fully utilized.

  2. Whether disk IO is under pressure, e.g. whether the CPU iowait (wa) percentage is large.

  3. Whether network traffic looks normal.

  4. For MR performance, look mainly at the shuffle-phase counters, e.g. whether spill IO is the bottleneck.

  5. For memory, adjust in combination with JVM tuning.

Combine these observations with the tuning parameters discussed earlier and iterate: test, analyze, adjust. Once the high-level bottlenecks are resolved, consider application-level optimization, such as file formats and serialization performance; for that direction you can use the JVM/HotSpot built-in profiling capabilities.

 


Origin www.cnblogs.com/zzjhn/p/11525140.html