[Translation] Choosing the Right Hardware for a New Hadoop Cluster (Part 1)

One of the first questions Cloudera customers raise when getting started with Apache Hadoop is how to select appropriate hardware for their new Hadoop clusters.

Although Hadoop is designed to run on industry-standard hardware, recommending an ideal cluster configuration is not as easy as delivering a list of hardware specifications. Selecting hardware that provides the best balance of performance and economy for a given workload requires testing and validation. (For example, users with IO-intensive workloads will invest in more spindles per core.)

In this blog post, you’ll learn some of the principles of workload evaluation and the critical role it plays in hardware selection. You’ll also learn the various factors that Hadoop administrators should take into account during this process.

Combining Storage and Compute

Over the past decade, IT organizations have standardized on blades and SANs (Storage Area Networks) to satisfy their grid and processing-intensive workloads. While this model makes a lot of sense for a number of standard applications such as web servers, app servers, smaller structured databases, and data movement, the requirements for infrastructure have changed as the amount of data and the number of users have grown. Web servers now have caching tiers, databases have gone massively parallel with local disk, and data movement jobs are pushing more data than they can handle locally.

Hardware vendors have created innovative systems to address these requirements, including storage blades, SAS (Serial Attached SCSI) switches, external SATA arrays, and larger capacity rack units. However, Hadoop is based on a new approach to storing and processing complex data, with data movement minimized. Instead of relying on a SAN for massive storage and reliability and then moving the data to a collection of blades for processing, Hadoop handles large data volumes and reliability in the software tier.

Hadoop distributes data across a cluster of balanced machines and uses replication to ensure data reliability and fault tolerance. Because data is distributed on machines with compute power, processing can be sent directly to the machines storing the data. Since each machine in a Hadoop cluster stores as well as processes data, those machines need to be configured to satisfy both data storage and processing requirements.
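
Because replication happens in software, every block consumes raw disk space several times over, which matters when sizing storage per node. The following is a minimal sketch of that arithmetic, not a sizing method from the original post. The function name and the example cluster (10 nodes, 12 disks of 2 TB each) are hypothetical; the replication factor of 3 is the HDFS default (`dfs.replication`), and the estimate ignores space needed for intermediate data and the operating system.

```python
def usable_capacity_tb(nodes: int, disks_per_node: int, disk_tb: float,
                       replication: int = 3) -> float:
    """Approximate usable capacity in TB: raw capacity divided by replication."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    # Raw capacity is the sum of every data disk in the cluster.
    raw_tb = nodes * disks_per_node * disk_tb
    # Each block is stored `replication` times, so usable space shrinks accordingly.
    return raw_tb / replication


if __name__ == "__main__":
    # Hypothetical cluster: 10 nodes, 12 disks each, 2 TB per disk.
    print(usable_capacity_tb(10, 12, 2.0))
```

With these made-up numbers, 240 TB of raw disk yields roughly 80 TB of usable space, which is why storage and compute have to be budgeted together on each machine.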

Why Workloads Matter

In nearly all cases, a MapReduce job will encounter a bottleneck either in reading data from disk or from the network (an IO-bound job) or in processing data (a CPU-bound job). An example of an IO-bound job is sorting, which requires very little processing (simple comparisons) and a lot of reading and writing to disk. An example of a CPU-bound job is classification, where some input data is processed in very complex ways to determine ontology.

Here are several more examples of IO-bound workloads:

Indexing
Grouping
Data importing and exporting
Data movement and transformation

Here are several more examples of CPU-bound workloads:

Clustering/Classification
Complex text mining
Natural-language processing
Feature extraction
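
One rough way to tell which of the two categories a given job falls into is to compare the CPU time it consumed with the wall-clock time it took. The sketch below is a heuristic under stated assumptions, not a method from the original post: the function name, the 0.8 threshold, and the sample timings are all hypothetical, and low CPU utilization can have causes other than disk or network waits.

```python
def classify_job(cpu_seconds: float, wall_seconds: float, cores: int = 1,
                 cpu_bound_threshold: float = 0.8) -> str:
    """Label a job by its average CPU utilization across the cores it used."""
    if wall_seconds <= 0 or cores <= 0:
        raise ValueError("wall_seconds and cores must be positive")
    # Fraction of the available CPU time that was actually spent computing.
    utilization = cpu_seconds / (wall_seconds * cores)
    # Assumed cutoff: mostly-busy CPUs suggest a CPU-bound job; otherwise
    # the job is presumed to be waiting on disk or network.
    return "CPU-bound" if utilization >= cpu_bound_threshold else "IO-bound"


if __name__ == "__main__":
    # Hypothetical measurements for a sort and a classification job.
    print(classify_job(cpu_seconds=120, wall_seconds=600))  # sort
    print(classify_job(cpu_seconds=540, wall_seconds=600))  # classification
```

Measurements like these, taken from representative jobs, are the kind of input that workload testing provides before committing to more disks per core or more cores per disk.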

To be continued. Read the next part at: https://my.oschina.net/u/234661/blog/855913