Tuning Overview #
In most MapReduce / distributed-processing scenarios, jobs are either IO-bound (disk or network is the bottleneck) or CPU-bound (computation is the bottleneck). Reading and writing large amounts of data from disk is the common case when analyzing massive data sets.
IO-bound examples:
CPU-bound examples:
We need to combine planning on many fronts, from hardware to software, to improve performance and efficiency.
Hardware Planning #
Assessment of cluster size #
How many nodes does the hadoop cluster we build need? Several factors go into the answer: budget, data volume, and computing resources.
Computing resources can be hard to assess precisely; it is recommended to size conservatively and expand as the business and applications grow. A good starting point is the data volume: how much new data arrives every day? How long is the retention period? Is there a plan for cold data?
Assume, for example, that data grows by 600 GB per day with 3-way replicated storage; planning for one year, that is roughly 600 GB × 3 × 360 days ≈ 633 TB. Then set aside about 20% headroom, consider future data-growth trends, and account for the space applications need for computation. To save space, consider compressed storage (which may save about 70% of the space).
Also build in a degree of redundancy, so the cluster keeps working normally even when some nodes are unavailable (scale the redundancy ratio with cluster size).
Then combine per-node hardware planning with the budget to determine the cluster size. If we need to store 650 TB, we could use about 30 nodes with 12 × 2TB disks each, or about 60 nodes with 6 × 2TB disks each; both meet the storage requirement. Note that the larger node count also increases computing power, but it requires more power, cooling, rack space, and network port density, so it is a trade-off to weigh against actual demand.
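The arithmetic above can be sketched as a small capacity calculator. This is a hedged sketch: the function names and the 20% headroom default are our own, and the node counts it computes are pure ceiling division, before the redundancy margin that pushes the figures in the text up to around 30 and 60 nodes.

```python
# Capacity-planning sketch for the example above: 600 GB/day growth,
# 3 replicas, ~360-day year, 20% headroom. All helper names are ours.

def plan_capacity_tb(daily_gb, replicas=3, days=360,
                     headroom=0.20, compression_saving=0.0):
    """Required raw cluster capacity in TB (decimal: 1 TB = 1000 GB)."""
    raw_gb = daily_gb * replicas * days
    raw_gb *= (1.0 + headroom)            # reserve for growth / compute needs
    raw_gb *= (1.0 - compression_saving)  # optional storage compression
    return raw_gb / 1000.0

def nodes_needed(capacity_tb, disks_per_node, disk_tb):
    """Smallest node count whose total disk capacity covers capacity_tb."""
    per_node = disks_per_node * disk_tb
    return -(-capacity_tb // per_node)    # ceiling division

if __name__ == "__main__":
    print(round(plan_capacity_tb(600), 1))  # 648 TB raw plus 20% headroom
    print(nodes_needed(650, 12, 2))         # 650 TB on 12 x 2TB nodes
    print(nodes_needed(650, 6, 2))          # 650 TB on 6 x 2TB nodes
```

Compression and redundancy then adjust these numbers in opposite directions, which is why the text's final counts differ from the bare minimum.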
Node hardware planning #
Should we pick CPUs with higher or lower clock speeds? What ratio of CPU to memory to storage?
General principles:

- CPU: choose a medium clock speed, usually no more than two sockets. This generally balances price against power consumption while making full use of CPU performance.
- Memory: consider the cost ratio of CPU to memory and keep CPU utilization high while running. 48 GB can be a good choice: it allows more parallel processes and extra cache to improve performance.
- Disk: consider high-capacity SATA drives (typically 7200 RPM); hadoop is generally storage-intensive and does not demand especially fast disks. Multiple disks spread and balance the IO pressure and help fault tolerance; in large clusters, disk failures are very common (and if single-node capacity is too high, a node going down causes heavy internal re-replication churn). A typical configuration is 12 × 4TB disks (not mandatory; adjust to the situation).
- Network: about 2 Gbps of network throughput per node is recommended (balancing cost against demand), and the topology should not be too deep; hadoop generally has high demands on both north-south and east-west bandwidth.
Hardware requirements differ across the hadoop components; their characteristics are described below:
- NameNode
The NameNode coordinates data storage across the entire cluster and holds the cluster's block metadata in RAM. A reliable rule of thumb is that 1 GB of NameNode RAM supports about 1 million blocks, so 64 GB of RAM supports about 100 million blocks.
Memory is therefore the primary hardware requirement (estimate it from the storage capacity via the number of blocks / number of files / block size). Also plan for durable, safe HDFS metadata (typically local disks in RAID, combined with attached storage such as NFS). CPU demands are modest.
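The rule of thumb above (roughly 1 GB of NameNode heap per million blocks) can be sketched as a quick estimator. This is our own illustration, not a Hadoop API; it assumes the default 128 MB block size and treats replicas as sharing one block entry on the NameNode.

```python
# Rough NameNode heap sizing from the ~1 GB per million blocks rule of thumb.
# Helper names are ours; real heap needs also depend on file/directory counts.

def estimated_blocks(total_bytes, block_size=128 * 1024 * 1024, replicas=3):
    """Unique blocks the NameNode must track for a given raw capacity."""
    logical_bytes = total_bytes // replicas  # raw bytes / replication factor
    return -(-logical_bytes // block_size)   # ceiling division

def namenode_heap_gb(blocks, gb_per_million=1.0):
    return blocks / 1_000_000 * gb_per_million

if __name__ == "__main__":
    blocks = estimated_blocks(650 * 10**12)  # the 650 TB example cluster
    print(blocks)                            # about 1.6 million blocks
    print(round(namenode_heap_gb(blocks), 1))
```

For this size of cluster the rule suggests only a couple of GB of heap; small files, not raw capacity, are what usually inflate the real figure.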
- ResourceManager
Handles resource scheduling for the entire cluster; it is not resource-intensive and is often co-located with the NameNode.
- Standby
Best configured identically to the NameNode.
- JournalNode
Undemanding on resources.
- ZooKeeper
Undemanding on resources, but do not place it on heavily loaded machines; typically 3 to 5 instances.
- DataNode / NodeManager
The main storage and compute nodes; plan the per-node hardware against the budget.
Estimating DataNode / NodeManager memory:
- for IO-bound jobs, assign each core 2-4 GB RAM
- for CPU-bound jobs, assign each core 6-8 GB RAM
Apart from the memory the jobs consume, the node also needs extra for:
- the DataNode process managing HDFS blocks: 2 GB RAM
- the NodeManager process managing the tasks running on the node: 2 GB RAM
- the OS: 4 GB RAM
Putting it together: total RAM ≈ cores × GB-per-core + 2 GB + 2 GB + 4 GB.
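As a hedged sketch, the sizing rule can be written out as a tiny helper (the function and constant names are ours, not part of Hadoop):

```python
# total RAM ≈ cores × GB-per-core + DataNode (2 GB) + NodeManager (2 GB) + OS (4 GB)

DATANODE_GB, NODEMANAGER_GB, OS_GB = 2, 2, 4

def worker_ram_gb(cores, gb_per_core):
    """gb_per_core: 2-4 for IO-bound workloads, 6-8 for CPU-bound ones."""
    return cores * gb_per_core + DATANODE_GB + NODEMANAGER_GB + OS_GB

if __name__ == "__main__":
    print(worker_ram_gb(12, 2))  # IO-bound lower bound on 2 x six-core: 32 GB
    print(worker_ram_gb(12, 8))  # CPU-bound upper bound: 104 GB
```

The spread between those two results is why the balanced configurations below quote such wide memory ranges.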
Recommended single-node configurations:
NameNode / ResourceManager / Standby machines:
DataNode / TaskTracker machines:
Here are some hardware suggestions for different workload types:

- Light processing (1U/machine): two six-core processors, 24-64 GB memory, 8 disks (2TB or 4TB each)
- Balanced configuration (1U/machine): two six-core processors, 48-128 GB memory, 12-16 disks (4TB or 6TB each) with the controller connected directly to the motherboard. If such a node fails, re-replicating its data generates a lot of cluster traffic
- Storage-heavy (2U/machine): two six-core processors, 48-96 GB memory, 16-24 disks (6TB or 8TB each). If such a node fails, re-replicating its data generates a lot of cluster traffic
- Compute-intensive (2U/machine): two six-core processors, 64-512 GB memory, 4-8 disks (2TB or 4TB each)

NOTE: the CPU counts above are minimums; 2×8, 2×10, or 2×12-core processor configurations (not counting Hyper-Threading) are recommended.
If you cannot yet predict the final workload of a new cluster, we still recommend the balanced hardware configuration.
Heterogeneous Cluster #
Hadoop has grown into an all-encompassing data platform: it is no longer just MapReduce, but many pluggable computation models that integrate seamlessly with Hadoop. The YARN resource manager in Hadoop 2.x is compatible with a variety of computation models: in-memory computing represented by spark, impala, tez, drill, and presto, and disk-based computing represented by hive, mapreduce, and pig. A heterogeneous cluster runs several of these models at once, which pushes the hardware configuration toward more memory and larger disks. Impala's recommended minimum is 128 GB of memory to be used to advantage; a typical spark workload is CPU-intensive, requiring more CPU and memory; hive and MR compute on disk, reading and writing many disks frequently. When selecting cluster hardware for the software stack, consider components including Apache HBase, Cloudera Impala, Presto or Drill, Apache Phoenix, and Apache spark.
On YARN, consider introducing resource pools and Label based scheduling. Tag-based scheduling is a newer strategy in hadoop yarn that lets YARN run better on a heterogeneous cluster, and therefore better manage and schedule mixed types of applications.
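As a hedged illustration (property names from the node-label support added around Hadoop 2.6; check your version, and the HDFS path is an example), enabling node labels in yarn-site.xml might look like:

```xml
<!-- yarn-site.xml fragment: enable YARN node labels (example values) -->
<property>
  <name>yarn.node-labels.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.node-labels.fs-store.root-dir</name>
  <value>hdfs://namenode:8020/yarn/node-labels</value>
</property>
```

Labels are then defined and assigned with `yarn rmadmin -addToClusterNodeLabels "highmem"` and `yarn rmadmin -replaceLabelsOnNode "host1=highmem"`, so that, for example, Impala or spark queues can be pinned to the high-memory machines of a heterogeneous cluster.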
Network topology #
Hadoop is IO-hungry: both disk-IO-hungry and network-IO-hungry. Although Hadoop tries to localize tasks when scheduling the Map stage, the shuffle/sort phase and the Reducer output still generate a lot of IO.
Hadoop does not strictly require a 10 Gb network, but higher bandwidth will certainly bring better performance. Once you find yourself wanting to team more than two 1 Gb NICs to increase bandwidth, it is time to consider deploying 10 Gb.
Network topology has a real influence on Hadoop. Because the shuffle/sort stage moves a lot of data east-west (horizontally) across the cluster, the bandwidth requirements between any pair of nodes are high. This is very different from traditional web services with their north-south (vertical) high-bandwidth needs. A network topology with great vertical depth (many levels) will reduce network performance.
For Hadoop, the demand for lateral bandwidth is high. For this reason the traditional tree topology is not a great fit for Hadoop's characteristics; a spine-leaf fabric topology is more suitable.
Typical Deployment Case #
2 machines: two 8-core processors / 64 GB memory / 6 × 1TB disks (os: 1 disk; fsimage: 2 disks in RAID1; RM: 1 disk; zookeeper: 1 disk; JN: 1 disk)
1 machine: two 6-core processors / 24 GB memory / 4 × 1TB disks (os: 1 disk; zookeeper: 1 disk; JN: 1 disk)
30 machines: two 10-core processors / 64 GB memory / 12 × 4TB disks
Software planning #
Operating system #
Hadoop itself is written mostly in Java, with some C/C++ code. Moreover, as of this writing it is essentially designed for Linux, and much of the code assumes a Linux architecture, so in general it is deployed on Linux.
Currently RedHat Enterprise Linux, CentOS, Ubuntu Server Edition, SuSE Enterprise Linux, and Debian can all run Hadoop well in a production environment. Therefore the choice of system depends more on which systems your system-management tools support, on hardware support, and on support from the commercial software you use; one very important consideration is which system your administrators are most familiar with.
Configuring systems is very time-consuming and error-prone, so a software configuration management system is recommended for maintenance rather than manual configuration. Puppet and Chef are the popular choices.
Hadoop version #
The Apache release of Hadoop is not the only version; many companies focus on their own distributions. The most popular non-Apache Hadoop distribution is Cloudera's, known as CDH.
- Cloudera Hadoop distribution
Cloudera is a company providing commercial support, advanced tools, and professional services for Hadoop. Its distribution, CDH, is free and open source under the Apache 2.0 license. CDH does not confront users with many branch versions: version numbering is continuous with good compatibility. The current version, CDH5, combines Hadoop 2.0 features with 1.0 characteristics, including NameNode HA and Federation, while supporting both MRv1 and MRv2 — something the Apache version of the time did not offer. Another feature of CDH is that it integrates the different Hadoop-ecosystem projects: HDFS and MapReduce are the core components, and a growing number of components build on top of them. These components make Hadoop friendlier to use, shorten the development cycle, and make writing MapReduce jobs easier.
Impala, a CDH project, deserves special mention: it bypasses the MapReduce layer entirely, reads data directly from HDFS, and performs real-time queries against Hadoop. CDH also resolves the many inter-component dependencies of the Hadoop ecosystem: it provides most ecosystem components and addresses compatibility between them. This is a great advantage for users choosing CDH and part of why it became so popular. To complement CDH, Cloudera released a web-based management tool, Cloudera Manager, for planning, configuring, and monitoring Hadoop clusters; Cloudera Manager has a free version and a paid enterprise edition.
- Hortonworks Hadoop distribution
Another popular Hadoop distribution is Hortonworks Data Platform (HDP), produced by Hortonworks. Like Cloudera, Hortonworks provides an integrated installable distribution along with commercial support and services, offering HDP 1.2 and 2.0. HDP 1.2 provides some characteristics other distributions lack: Hortonworks implemented NameNode HA on the basis of Hadoop 1.0 (note: this uses Linux HA technology rather than JournalNodes). HDP includes HCatalog, which provides integration services for projects such as Pig and Hive. To attract users, Hortonworks takes great care to connect with traditional BI: HDP provides ODBC drivers for Hive, making it possible for most BI tools to integrate with it. Another distinctive feature is that HDP can run on the Windows platform, though the stability of that port is still being proven. HDP uses Ambari to manage and monitor the cluster; Ambari is a web-based tool similar to Cloudera Manager, the difference being that it is 100% free and open source.
- MapR distribution
Besides Cloudera and Hortonworks, MapR is another company providing a Hadoop-based platform. It has several product editions: M3 is a free version with limited functionality, while M5 and M7 are enterprise editions. Unlike Cloudera and Hortonworks, the software MapR offers is not free, but it provides some enterprise-class features. The main difference is that MapR does not use Apache Hadoop HDFS but its own MapR-FS file system. MapR-FS is implemented in C++ and provides lower latency and higher concurrency than the Java-based HDFS; although the API is compatible, the implementations are completely different. In addition, MapR provides NFS volume access, cluster snapshots, and cluster monitoring, capabilities built on the MapR-FS implementation.
Selection suggestions:
Decide which version to use based on your own situation and operations team. If the team has no hadoop customization capability and no strong custom business needs, consider the CDH version: it is comparatively stable, the related components are integrated and convenient, and the free version of Cloudera Manager can manage and monitor the cluster. If you need custom development, you can choose the community version, so you can easily exchange work with the development community; of course, you can also apply CDH's patch updates to your own branch, or base your development on a CDH version.
Java version #
Most basically, Hadoop requires a JDK to run. The JDK version matters: if you are using the older Java 6, you need to install the Oracle (Sun) JDK; with Java 7 you can use the system's default OpenJDK 7. Specific compatibility results, from official tests and user reports, are posted at:
http://wiki.apache.org/hadoop/HadoopJavaVersions
A 64-bit system is generally chosen, because the configured memory is usually much larger than 4 GB.
Parameter optimization #
OS parameter optimization #
Adjust the following parameters, optimizing according to the actual situation:

- Disable SELinux and clear iptables while configuring the servers; after the services run normally, re-enable selinux if required
- Trim the services started at boot: keeping crond, network, syslog, and sshd is enough, then add back what you need
- Raise the file-descriptor limit
- Remove unnecessary system users and groups
- Synchronize system time
- Tune kernel parameters
- File system:
ext3 is recommended for formatting hadoop data disks; ext3 is extensively tested (Yahoo clusters use ext3), while ext4 and xfs have carried data-loss risks.
Disable Linux logical volume management (LVM).
Disable file and directory atime (noatime, nodiratime) when mounting the data partitions.
The -m formatting parameter can be lowered to reduce the reserved space.
- Disable THP (transparent huge pages)
The change can be added to a boot script.
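As a hedged sketch, the THP and file-descriptor items above might look like the following (the sysfs path differs by distro; RHEL/CentOS once used `redhat_transparent_hugepage`, and the limit values are examples):

```shell
# Boot-script fragment (e.g. /etc/rc.local): disable transparent huge pages
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# /etc/security/limits.conf entries raising the file-descriptor limit
# for the hadoop service users (example values):
#   hdfs  -  nofile  65536
#   yarn  -  nofile  65536
```

Verify after reboot with `cat /sys/kernel/mm/transparent_hugepage/enabled` (expect `[never]`) and `ulimit -n` as the service user.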
hadoop parameter optimization #
Verified against hadoop version 2.6.4; adapt as appropriate for other versions.
core-site.xml
parameter name | Defaults | Explanation |
---|---|---|
hadoop.tmp.dir | /tmp/hadoop-${user.name} | Hadoop's intermediate temporary file directory is best pointed at a dedicated location; it holds e.g. MR split information and staging information |
io.file.buffer.size | 4096 | Buffer size for IO operations; a larger buffer gives higher data throughput but also means more memory consumption and latency. The value should be a multiple of the system page size, in bytes; the default is 4 KB, and under normal circumstances it can be set to 64 KB (65536 bytes) |
fs.trash.interval | 0 | Recommended on: this enables the recycle-bin feature and defines how long files stay in the .Trash directory before they are permanently deleted |
topology.script.file.name | -- | Once the cluster has a fair number of nodes, it is recommended to configure rack awareness. Script example: rack_awareness |
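Put together, the suggestions in the table above might look like the core-site.xml fragment below (a hedged example: the paths and the one-day trash interval are illustrative values, not mandates):

```xml
<!-- core-site.xml fragment using the suggestions above -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop/tmp</value>
</property>
<property>
  <name>io.file.buffer.size</name>
  <value>65536</value>
</property>
<property>
  <name>fs.trash.interval</name>
  <value>1440</value> <!-- minutes; one day in .Trash before deletion -->
</property>
```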
hdfs-site.xml
parameter name | Defaults | Explanation |
---|---|---|
dfs.namenode.handler.count | 10 | Number of NameNode service threads; a general rule for raising it is 20 × ln(N), i.e. 20logN, where N is the cluster size |
dfs.datanode.handler.count | 10 | Number of DataNode service threads; determine it by actual testing on the box, usually somewhat more than the CPU core count |
dfs.datanode.max.transfer.threads | 4096 | Number of simultaneous send/receive transfer tasks a datanode allows, similar to the open-file-handle limit on linux |
dfs.namenode.name.dir | file://${hadoop.tmp.dir}/dfs/name | Multiple redundant locations: one local, another on NFS |
dfs.datanode.data.dir | file://${hadoop.tmp.dir}/dfs/data | Multiple locations to spread storage; list directories on as many separate partitions as practical |
dfs.datanode.failed.volumes.tolerated | 0 | Number of disk failures allowed before the whole DataNode is declared failed. By default any local disk failure marks the entire DataNode as failed. Can be set to about 30% of the disk count |
dfs.client.read.shortcircuit | false | Recommended true, to enable short-circuit reads |
dfs.domain.socket.path | -- | Socket path for short-circuit reads, e.g. /var/run/hadoop-hdfs/dn._PORT; ensure /var/run/hadoop-hdfs is group-writable with group root |
dfs.blocksize | 134217728 | Default block size for new files, 128 MB by default; it can be adjusted to the cluster's scale. The mapper count is basically decided by the number of blocks in the input files; small blocks give rise to many small tasks |
dfs.hosts | -- | The dfs.hosts file holds newline-separated host names or IP addresses; hosts not in the list are not allowed to join the cluster |
dfs.hosts.exclude | -- | Like dfs.hosts; nodes listed in the specified file are excluded from HDFS, which allows safely decommissioning nodes |
dfs.datanode.balance.bandwidthPerSec | 1048576 | The balancer moves blocks between DataNodes to keep load balanced. Without a bandwidth limit it would quickly occupy all network resources and impact Mapreduce jobs and other services; too small, and balancing is too slow. This parameter sets the maximum bandwidth per second in bytes, while network bandwidth is usually described in bits, so convert first when setting it |
dfs.datanode.du.reserved | 0 | Reserved space per volume. By default the datanode reports all space in the configured directories for dfs storage; reserving part of the space for other services can also avoid some unnecessary monitoring alarms |
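Two of the rows above imply small calculations, sketched here (helper names are ours): the 20 × ln(N) rule for dfs.namenode.handler.count, and converting a bandwidth cap quoted in bits/s into the bytes/s that dfs.datanode.balance.bandwidthPerSec expects.

```python
import math

def namenode_handler_count(cluster_size):
    """20 * ln(N) rule of thumb, never below the default of 10."""
    return max(10, int(20 * math.log(cluster_size)))

def balancer_bandwidth_bytes(megabits_per_sec):
    """Convert a limit quoted in Mb/s to the bytes/s the property expects."""
    return megabits_per_sec * 1_000_000 // 8

if __name__ == "__main__":
    print(namenode_handler_count(30))    # 30-node cluster -> 68 threads
    print(balancer_bandwidth_bytes(100)) # 100 Mb/s -> 12500000 bytes/s
```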
mapred-site.xml
parameter name | Defaults | Explanation |
---|---|---|
mapreduce.cluster.local.dir | ${hadoop.tmp.dir}/mapred/local | Local storage for MR intermediate data; preferably multiple directories, comma-separated |
mapreduce.shuffle.readahead.bytes | 4194304 | Default 4 MB; the ShuffleHandler uses posix_fadvise to manage the OS cache when sending files, and increasing the readahead length can improve shuffle efficiency |
mapreduce.ifile.readahead.bytes | 4194304 | Default 4 MB; IFile readahead length |
mapreduce.tasktracker.outofband.heartbeat | false | Recommended true: the TaskTracker sends an out-of-band heartbeat on task completion to reduce latency |
mapreduce.jobtracker.heartbeat.interval.min | 300 | Increase the TaskTracker-to-JobTracker heartbeat interval; on small clusters this can improve MR performance; can be changed to 1000 |
mapred.reduce.slowstart.completed.maps | 0.05 | Sets what fraction of map tasks must complete before reduce tasks launch; many small jobs can set this to 0, large jobs to 0.5 |
mapreduce.map.speculative | true | Speculative execution of map tasks; if compute resources are tight and the tasks themselves are resource-hungry, consider setting this to false and enabling it per job when needed |
mapreduce.reduce.speculative | true | Speculative execution of reduce tasks; recommended off, enabled per job when needed |
mapreduce.task.io.sort.mb | 100 | In MB, default 100 MB; adjust the buffer size according to the volume of map output — adjust moderately, as bigger is not always better |
mapreduce.map.sort.spill.percent | 0.8 | Spill when the buffer reaches 80% |
mapreduce.map.output.compress | false | Whether to compress map output; recommended on, to reduce IO and network consumption |
mapreduce.map.output.compress.codec | org.apache.hadoop.io.compress.DefaultCodec | snappy compression is recommended: org.apache.hadoop.io.compress.SnappyCodec |
mapreduce.output.fileoutputformat.compress.type | RECORD | Compression type when the output is SequenceFiles; recommended BLOCK |
mapreduce.map.java.opts | -- | JVM options can be specified here for tuning |
mapreduce.jobtracker.handler.count | 10 | Number of jobtracker RPC threads; generally recommended at 40% of the number of tasktracker nodes |
mapreduce.tasktracker.http.threads | 40 | Number of worker threads serving map output; adjust to cluster size and hardware configuration |
mapreduce.tasktracker.map.tasks.maximum | 2 | Number of map tasks a tasktracker runs concurrently; generally the CPU core count or 1.5× the core count |
mapreduce.tasktracker.reduce.tasks.maximum | 2 | Number of reduce tasks a tasktracker runs concurrently; generally the CPU core count or 1.5× the core count |
mapreduce.reduce.shuffle.input.buffer.percent | 0.7 | Fraction of the reduce heap used to buffer map output — the shuffle's maximum memory use, analogous to mapreduce.task.io.sort.mb on the map side. If map output is large and local disk I/O between reduce and the sort phase is heavy, try increasing this value |
mapreduce.reduce.shuffle.parallel.copies | 5 | Number of copier threads in the shuffle phase, default 5; generally can be set to 4 × logN, where N is the cluster size |
mapreduce.job.jvm.num.tasks | 1 | Default 1; set to -1 to reuse JVMs |
yarn-site.xml
parameter name | Defaults | Explanation |
---|---|---|
yarn.scheduler.minimum-allocation-mb | 1024 | Minimum amount of memory allocated per request |
yarn.scheduler.maximum-allocation-mb | 8192 | Maximum amount of memory allocated per request |
yarn.nodemanager.resource.memory-mb | 8192 | Default 8192 MB; total physical memory on the node available to containers |
yarn.nodemanager.resource.cpu-vcores | 8 | Total virtual CPU cores available to the NodeManager; set according to the hardware — a simple choice is the number of CPU hyper-threads |
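A hedged sketch tying this table back to the earlier node-memory rule: derive yarn.nodemanager.resource.memory-mb from the node's physical RAM minus the fixed overheads discussed earlier (DataNode 2 GB, NodeManager 2 GB, OS 4 GB). The helper is ours, not a Hadoop API.

```python
# Container memory = physical RAM minus DataNode + NodeManager + OS overheads.
RESERVED_GB = 2 + 2 + 4

def nodemanager_memory_mb(physical_ram_gb):
    """Suggested yarn.nodemanager.resource.memory-mb for a worker node."""
    return max(0, physical_ram_gb - RESERVED_GB) * 1024

if __name__ == "__main__":
    print(nodemanager_memory_mb(64))  # 64 GB node -> 57344 MB for containers
```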
How to tune #
General steps for system tuning
When installing and deploying the cluster, collect the business requirements and workload characteristics (data volume, read/write patterns, compute load, and so on), and do hardware planning and initial testing (basic IO/net/cpu tests on each server, to make sure the hardware joining the cluster is sound). The following discusses tuning a hadoop cluster mainly from the hardware-planning and software-tuning perspectives.
Designing benchmark test cases #
How do you see whether your adjustments have an effect? How do you know you have found the bottleneck point? You need a baseline to compare against before you can measure how much performance your changes improve. Hadoop ships with baseline benchmark applications: for example TestDFSIO and dfsthroughput (included in hadoop-*-test.jar) for HDFS I/O testing, and Sort (included in hadoop-*-examples.jar) for overall hardware testing. Choose whichever benchmark fits your testing needs.
Among these benchmarks, when the input data is large, Sort reflects both MapReduce runtime performance (during the sort itself) and HDFS I/O performance (while writing the sorted results to HDFS). Sort is also the hardware benchmark recommended by Apache. (http://wiki.apache.org/hadoop/Sort)
You can first test HDFS write and read performance, then use the Sort baseline to test compute and overall performance.
IO test — the relevant statistics are printed at the end
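As a hedged example of how these benchmarks are invoked (the jar names and locations vary by Hadoop version and distribution, and the file counts/sizes are illustrative):

```shell
# HDFS I/O baseline; statistics are printed when each run finishes
hadoop jar hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar hadoop-*-test.jar TestDFSIO -read  -nrFiles 10 -fileSize 1000
hadoop jar hadoop-*-test.jar TestDFSIO -clean   # remove benchmark output

# Overall baseline: generate random input, then sort it
hadoop jar hadoop-*-examples.jar randomwriter /bench/input
hadoop jar hadoop-*-examples.jar sort /bench/input /bench/output
```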
Sort baseline test
Start from the default configuration parameters as the baseline, then modify parameters round by round, comparing the results and the server monitoring data against the baseline.
Monitoring data analysis #
Before tuning, the cluster should already have basic resource monitoring in place: CPU, file IO, network IO, memory. If not, install monitoring tools or scripts to collect host performance metrics (common tools: nmon/nmonanalyser, dstat, and so on).
Symptoms of a performance bottleneck: excessive resource consumption; insufficient performance in an external system the job depends on; or response times that still miss requirements even though resource consumption is modest.
The main inputs for performance analysis are the host monitoring data and the counter data from the MR computation. Common points to examine:

- Whether the worker nodes' CPU is used effectively; it should generally hold around 70%, and below 40% indicates the CPU is underutilized
- Whether disk IO is under pressure: is the CPU wait (wa) percentage high?
- Whether network traffic is normal
- MR performance shows mainly in the shuffle-phase counters, e.g. whether spill IO is the bottleneck
- Adjust memory together with JVM optimization

Combine the optimization parameters discussed earlier and iterate: test, analyze the results, adjust. Once the high-level performance bottlenecks are resolved, consider application-level optimizations such as file formats and serialization performance; tuning in that direction can use the JVM/HotSpot native profiling capability.