Translation: How to Benchmark a Hadoop Cluster


Translator's note: this is a fairly old article, and some of it may not read smoothly in translation; the word "benchmark" in particular is hard to render, so read it loosely as "performance" here. Original article:
http://answers.oreilly.com/topic/460-how-to-benchmark-a-hadoop-cluster/

Is the cluster set up correctly? The best way to answer this question is empirically: run some jobs and confirm that you get the expected results. Benchmarks make good tests, as you also get numbers that you can compare with other clusters as a sanity check on whether your new cluster is performing roughly as expected. And you can tune a cluster using benchmark results to squeeze the best performance out of it. This is often done with monitoring systems in place, so you can see how resources are being used across the cluster.


To get the best results, you should run benchmarks on a cluster that is not being used by others. In practice, this is just before it is put into service, and users start relying on it. Once users have periodically scheduled jobs on a cluster it is generally impossible to find a time when the cluster is not being used (unless you arrange downtime with users), so you should run benchmarks to your satisfaction before this happens.


Experience has shown that most hardware failures for new systems are hard drive failures. By running I/O intensive benchmarks—such as the ones described next—you can “burn in” the cluster before it goes live.



Hadoop Benchmarks

Hadoop comes with several benchmarks that you can run very easily with minimal setup cost. Benchmarks are packaged in the test JAR file, and you can get a list of them, with descriptions, by invoking the JAR file with no arguments:



% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar


Most of the benchmarks show usage instructions when invoked with no arguments. For example:



% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO
TestFDSIO.0.0.4
Usage: TestFDSIO -read | -write | -clean [-nrFiles N] [-fileSize MB] [-resFile resultFileName] [-bufferSize Bytes]


Benchmarking HDFS with TestDFSIO

TestDFSIO tests the I/O performance of HDFS. It does this by using a MapReduce job as a convenient way to read or write files in parallel. Each file is read or written in a separate map task, and the output of the map is used for collecting statistics relating to the file just processed. The statistics are accumulated in the reduce, to produce a summary.


The following command writes 10 files of 1,000 MB each:



% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000


At the end of the run, the results are written to the console and also recorded in a local file (which is appended to, so you can rerun the benchmark and not lose old results):



% cat TestDFSIO_results.log
----- TestDFSIO ----- : write
           Date & time: Sun Apr 12 07:14:09 EDT 2009
       Number of files: 10
Total MBytes processed: 10000
     Throughput mb/sec: 7.796340865378244
Average IO rate mb/sec: 7.8862199783325195
 IO rate std deviation: 0.9101254683525547
    Test exec time sec: 163.387


The files are written under the /benchmarks/TestDFSIO directory by default (this can be changed by setting the test.build.data system property), in a directory called io_data.


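To confirm where the files landed, you can list the output directory with the regular HDFS shell (a quick check, assuming the default test.build.data location):

% hadoop fs -ls /benchmarks/TestDFSIO/io_data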

To run a read benchmark, use the -read argument. Note that these files must already exist (having been written by TestDFSIO -write):


% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000



Here are the results for a real run:


----- TestDFSIO ----- : read
           Date & time: Sun Apr 12 07:24:28 EDT 2009
       Number of files: 10
Total MBytes processed: 10000
     Throughput mb/sec: 80.25553361904304
Average IO rate mb/sec: 98.6801528930664
 IO rate std deviation: 36.63507598174921
    Test exec time sec: 47.624


When you’ve finished benchmarking, you can delete all the generated files from HDFS using the -clean argument:



% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -clean
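
As a quick sanity check afterward (again assuming the default location), you can list the parent directory to confirm the TestDFSIO data is gone:

% hadoop fs -ls /benchmarks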


Benchmarking MapReduce with Sort

Hadoop comes with a MapReduce program that does a partial sort of its input. It is very useful for benchmarking the whole MapReduce system, as the full input dataset is transferred through the shuffle. The three steps are: generate some random data, perform the sort, then validate the results.


First we generate some random data using RandomWriter. It runs a MapReduce job with 10 maps per node, and each map generates (approximately) 10 GB of random binary data, with key and values of various sizes. You can change these values if you like by setting the properties test.randomwriter.maps_per_host and test.randomwrite.bytes_per_map. There are also settings for the size ranges of the keys and values; see RandomWriter for details.



Here’s how to invoke RandomWriter (found in the example JAR file, not the test one) to write its output to a directory called random-data:



% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar randomwriter random-data
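
For a quicker smoke test, you can shrink the generated data by overriding those properties on the command line. This is only a sketch: it assumes RandomWriter is run via ToolRunner so the generic -D options apply, and the property names vary between Hadoop versions:

% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar randomwriter \
    -D test.randomwriter.maps_per_host=2 \
    -D test.randomwrite.bytes_per_map=1073741824 \
    random-data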


Next, we can run the sort program:

% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar sort random-data sorted-data


The overall execution time of the sort is the metric we are interested in, but it’s instructive to watch the job’s progress via the web UI (http://jobtracker-host:50030/), where you can get a feel for how long each phase of the job takes.


As a final sanity check, we validate the data in sorted-data is, in fact, correctly sorted:



% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar testmapredsort -sortInput random-data \
  -sortOutput sorted-data


This command runs the SortValidator program, which performs a series of checks on the unsorted and sorted data to check whether the sort is accurate. It reports the outcome to the console at the end of its run:



SUCCESS! Validated the MapReduce framework's 'sort' successfully.


Other benchmarks

There are many more Hadoop benchmarks, but the following are widely used:

MRBench (invoked with mrbench) runs a small job a number of times. It acts as a good counterpoint to sort, as it checks whether small job runs are responsive.

NNBench (invoked with nnbench) is useful for load testing namenode hardware.

Gridmix is a suite of benchmarks designed to model a realistic cluster workload, by mimicking a variety of data-access patterns seen in practice. See src/benchmarks/gridmix2 in the distribution for further details.


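As an illustration, MRBench and NNBench live in the same test JAR as TestDFSIO and are invoked the same way. The flags below are a sketch and differ between Hadoop versions; run either tool with no arguments to see its actual usage:

% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar mrbench -numRuns 5
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar nnbench -operation create_write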

User Jobs

For tuning, it is best to include a few jobs that are representative of the jobs that your users run, so your cluster is tuned for these and not just for the standard benchmarks. If this is your first Hadoop cluster and you don’t have any user jobs yet, then Gridmix is a good substitute.


When running your own jobs as benchmarks you should select a dataset for your user jobs that you use each time you run the benchmarks to allow comparisons between runs. When you set up a new cluster, or upgrade a cluster, you will be able to use the same dataset to compare the performance with previous runs.


In a similar vein, PigMix is a set of benchmarks for Pig available from http://wiki.apache.org/pig/PigMix.




Reposted from wwwcomy.iteye.com/blog/1964563