1.1 Execute and test the Hadoop sample program
In the Hadoop installation directory you will find JAR files containing Hadoop sample programs that you can use to try out Hadoop. Before you execute these sample programs, make sure that your installation is complete and your runtime environment is set up correctly. As mentioned in the previous section, the check_basic_env.sh script can help you verify the installation, and it will prompt you to correct any errors it finds.
1.1.1 Hadoop sample code
The hadoop-0.19.0-examples.jar file contains several sample programs that can be run directly. Figure 1-4 lists the sample programs contained in this JAR file.
Figure 1‑4 Sample programs in the hadoop-0.19.0-examples.jar file
program | description
aggregatewordcount | An Aggregate-based MapReduce program that counts the words in the input files.
aggregatewordhist | An Aggregate-based MapReduce program that computes a histogram of the words in the input files.
grep | A MapReduce program that counts the matches of a regular expression in the input files.
join | A job that performs a join over sorted, equally partitioned datasets.
multifilewc | A job that counts the words in several files.
pentomino | A tile-laying MapReduce program that finds solutions to pentomino problems.
pi | A MapReduce program that estimates the value of PI using the Monte Carlo method.
randomtextwriter | A MapReduce program that writes 10 GB of random text data per node.
randomwriter | A MapReduce program that writes 10 GB of random data per node.
sleep | A job that sleeps at each map and reduce task.
sort | A MapReduce program that sorts the data written by the random writer.
sudoku | A sudoku solver.
wordcount | A MapReduce program that counts the words in the input files.
1.1.1.1 Execute the PI calculator
The PI calculator sample program estimates the value of PI by the Monte Carlo method. A technical discussion of this algorithm is available at http://www.chem.unl.edu/zeng/joy/mclab/mcintro.html . The sample size is the number of points drawn at random within the square; the larger the number of samples, the more accurate the estimate of PI. To keep the program simple, we compute only a rough value of PI from a small number of samples.
The PI program takes two integer parameters: the number of map tasks, and the number of samples per map task. The total number of samples in the calculation is the number of map tasks multiplied by the number of samples per map task.
Each map task generates random points within a 1-by-1 square. For each sample (x, y), if x² + y² ≤ 1, the point lies inside the quarter circle of radius 1; otherwise it lies outside. The map task emits a key of 1 for a point inside and 0 for a point outside; the emitted value is always 1. The reduce task counts the number of points inside the circle and the number outside. The fraction of points that fall inside approaches the ratio of the quarter circle's area to the square's area, PI/4, so multiplying that fraction by 4 yields the estimate of PI.
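The arithmetic above can be modeled outside Hadoop. The following Python sketch (illustrative only, not the PiEstimator source) imitates the map/reduce split: each simulated map task emits (key, 1) pairs, and the reduce step counts the points inside and applies 4 × inside / total.

```python
import random

def map_task(num_samples, rng):
    """Emit one (key, 1) pair per sample: key is 1 if the point
    falls inside the quarter circle x*x + y*y <= 1, else 0."""
    pairs = []
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        pairs.append((1 if x * x + y * y <= 1 else 0, 1))
    return pairs

def reduce_task(all_pairs):
    """Count the points inside vs. the total, then apply 4 * inside / total."""
    inside = sum(value for key, value in all_pairs if key == 1)
    total = len(all_pairs)
    return 4.0 * inside / total

# 2 simulated map tasks with 10 samples each, as in the sample run.
rng = random.Random(42)
pairs = []
for _ in range(2):
    pairs.extend(map_task(10, rng))
print(reduce_task(pairs))
```

With only 20 samples the result is coarse; raising the sample count tightens the estimate toward PI.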
In this sample program, to make the run fast and the output short, we choose 2 map tasks with 10 samples each, for a total of 20 samples.
To execute this program, change your working directory to HADOOP_HOME (via cd ${HADOOP_HOME}) and type the following command:
jason@cloud9:~/src/hadoop-0.19.0$ hadoop jar hadoop-0.19.0-examples.jar pi 2 10
The bin/hadoop jar command submits jobs to the cluster. The command line is processed in three steps, each of which consumes a portion of the arguments; we'll look at the details of argument handling in Chapter 5. For now, we only need to know that the hadoop-0.19.0-examples.jar file supplies the main class of this application, and that this class accepts the three remaining parameters: pi, 2, and 10.
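In other words, the JAR's main class behaves like a dispatcher: the first remaining argument names a program, and the rest are forwarded to it. A minimal Python sketch of that dispatch pattern follows; the names pi_main, programs, and run are invented for illustration and are not Hadoop's driver code.

```python
def pi_main(args):
    # Hypothetical stand-in for the PI estimator's entry point;
    # here it just reports the total number of samples.
    num_maps, samples_per_map = int(args[0]), int(args[1])
    return num_maps * samples_per_map

# The driver maps a program name to an entry point.
programs = {"pi": pi_main}

def run(argv):
    # First argument selects the program; the rest are passed through.
    name, rest = argv[0], argv[1:]
    return programs[name](rest)

print(run(["pi", "2", "10"]))  # prints 20 (2 maps x 10 samples each)
```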
1.1.1.2 Viewing the output: input splitting, shuffle, spill, and sort
Listing 1-3 shows the output of this application.
Listing 1-3 Output of the sample PI program
Number of Maps = 2 Samples per Map = 10
Wrote input for Map #0
Wrote input for Map #1
Starting Job
jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
mapred.FileInputFormat: Total input paths to process : 2
mapred.FileInputFormat: Total input paths to process : 2
mapred.JobClient: Running job: job_local_0001
mapred.FileInputFormat: Total input paths to process : 2
mapred.FileInputFormat: Total input paths to process : 2
mapred.MapTask: numReduceTasks: 1
mapred.MapTask: io.sort.mb = 100
mapred.MapTask: data buffer = 79691776/99614720
mapred.MapTask: record buffer = 262144/327680
mapred.JobClient: map 0% reduce 0%
mapred.MapTask: Starting flush of map output
mapred.MapTask: bufstart = 0; bufend = 32; bufvoid = 99614720
mapred.MapTask: kvstart = 0; kvend = 2; length = 327680
mapred.LocalJobRunner: Generated 1 samples
mapred.MapTask: Index: (0, 38, 38)
mapred.MapTask: Finished spill 0
mapred.LocalJobRunner: Generated 1 samples.
mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
mapred.TaskRunner: Saved output of task 'attempt_local_0001_m_000000_0' ➥
to file:/home/jason/src/hadoop-0.19.0/test-mini-mr/outmapred.
MapTask: numReduceTasks: 1
mapred.MapTask: io.sort.mb = 100
mapred.JobClient: map 0% reduce 0%
mapred.LocalJobRunner: Generated 1 samples
mapred.MapTask: data buffer = 79691776/99614720
mapred.MapTask: record buffer = 262144/327680
mapred.MapTask: Starting flush of map output
mapred.MapTask: bufstart = 0; bufend = 32; bufvoid = 99614720
mapred.MapTask: kvstart = 0; kvend = 2; length = 327680
mapred.JobClient: map 100% reduce 0%
mapred.MapTask: Index: (0, 38, 38)
mapred.MapTask: Finished spill 0
mapred.LocalJobRunner: Generated 1 samples.
mapred.TaskRunner: Task 'attempt_local_0001_m_000001_0' done.
mapred.TaskRunner: Saved output of task 'attempt_local_0001_m_000001_0' ➥
to file:/home/jason/src/hadoop-0.19.0/test-mini-mr/out
mapred.ReduceTask: Initiating final on-disk merge with 2 files
mapred.Merger: Merging 2 sorted segments
mapred.Merger: Down to the last merge-pass, with 2 segments left of ➥
total size: 76 bytes
mapred.LocalJobRunner: reduce > reduce
mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
mapred.TaskRunner: Saved output of task 'attempt_local_0001_r_000000_0' ➥
to file:/home/jason/src/hadoop-0.19.0/test-mini-mr/out
mapred.JobClient: Job complete: job_local_0001
mapred.JobClient: Counters: 11
mapred.JobClient: File Systems
mapred.JobClient: Local bytes read=314895
mapred.JobClient: Local bytes written=359635
mapred.JobClient: Map-Reduce Framework
mapred.JobClient: Reduce input groups=2
mapred.JobClient: Combine output records=0
mapred.JobClient: Map input records=2
mapred.JobClient: Reduce output records=0
mapred.JobClient: Map output bytes=64
mapred.JobClient: Map input bytes=48
mapred.JobClient: Combine input records=0
mapred.JobClient: Map output records=4
mapred.JobClient: Reduce input records=4
Job Finished in 2.322 seconds
Estimated value of PI is 3.8
Note that Hadoop uses the Apache Foundation's log4j package for its logging. By default, each framework log line begins with a timestamp, followed by the log level and the name of the class that produced the message, and only messages at level INFO and above are printed. To keep the listings readable, I have removed the timestamps and log levels from the output.
In the log, the line you are most interested in is the last one: "Estimated value of PI is 3.8". It tells you that your Hadoop installation is working and able to execute the application correctly. (With only 20 samples the estimate is rough; 3.8 is consistent with 19 of the 20 points landing inside the circle, since 4 × 19/20 = 3.8.)
Below we'll step through the output of Listing 1-3. This will help you understand how the sample program works and spot errors if it misbehaves.
The first lines are output while Hadoop initializes the PI calculator program; they confirm the input of 2 map tasks with 10 samples each.
Number of Maps = 2 Samples per Map = 10
Wrote input for Map #0
Wrote input for Map #1
Then the framework starts and takes over the control flow. It performs input splitting (dividing the input into independent pieces called input splits).
You can read the job ID from the line below; you can use this ID to locate the job in the job control tools.
Running job: job_local_0001
From the following lines, we can see that there are two input files and thus two input splits.
jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
mapred.FileInputFormat: Total input paths to process : 2
mapred.FileInputFormat: Total input paths to process : 2
mapred.JobClient: Running job: job_local_0001
mapred.FileInputFormat: Total input paths to process : 2
mapred.FileInputFormat: Total input paths to process : 2
The output of each map task is partitioned into pieces, one piece per reduce task, and each piece is sorted; this process is known as the shuffle. For each reduce task, the framework fetches its piece of every map task's output and merges these sorted pieces into the reduce task's sorted input; this process is known as the sort.
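The partition-sort-merge flow just described can be sketched in a few lines of Python. This is an illustrative model, not Hadoop code: the functions shuffle and sort_phase are invented names, and partitioning by hash(key) % num_reduces mirrors Hadoop's default hash partitioner.

```python
import heapq

def shuffle(map_outputs, num_reduces):
    """Partition each map task's (key, value) records, one partition
    per reduce task, and sort each partition: the 'shuffle'."""
    partitions = [[[] for _ in range(num_reduces)] for _ in map_outputs]
    for m, records in enumerate(map_outputs):
        for key, value in records:
            partitions[m][hash(key) % num_reduces].append((key, value))
        for piece in partitions[m]:
            piece.sort()
    return partitions

def sort_phase(partitions, reduce_index):
    """Merge the sorted pieces destined for one reduce task: the 'sort'."""
    pieces = [partitions[m][reduce_index] for m in range(len(partitions))]
    return list(heapq.merge(*pieces))

# Two map outputs of (key, 1) records, as in the PI job, one reduce task.
maps = [[(1, 1), (0, 1)], [(1, 1), (1, 1)]]
merged = sort_phase(shuffle(maps, 1), 0)
print(merged)  # the single reduce task's sorted input
```

Integer keys are used deliberately: in Python, hash() of a small integer is stable, which keeps the partitioning deterministic in this sketch.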
In the next part of Listing 1-3, we can see the details of the map tasks' shuffle. The framework collects the output records of all map tasks and feeds them to the single reduce task (numReduceTasks: 1). If you had specified more than one reduce task, you would see a Finished spill N line for each of them. The rest of the log concerns output buffering, which we need not worry about here.
Below, you will see:
mapred.MapTask: numReduceTasks: 1
...
mapred.MapTask: Finished spill 0
mapred.LocalJobRunner: Generated 1 samples.
mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' ➥
done.mapred.TaskRunner: Saved output of task ➥
'attempt_local_0001_m_000000_0' ➥
to file:/home/jason/src/hadoop-0.19.0/test-mini-mr/out
Generated 1 samples is the final status output of the map task. The Hadoop framework tells you that the first map task, 'attempt_local_0001_m_000000_0', has completed, and that its output has been saved to the default filesystem at file:/home/jason/src/hadoop-0.19.0/test-mini-mr/out .
The following lines show the output of the sort phase:
mapred.ReduceTask: Initiating final on-disk merge with 2 files
mapred.Merger: Merging 2 sorted segments
mapred.Merger: Down to the last merge-pass, with 2 segments left of ➥
total size: 76 bytes
According to the command-line parameters, the run in Listing 1-3 uses 2 map tasks and one reduce task. Since there is only one reduce task, the output of all map tasks is merged into a single sorted segment; the two map tasks produce the two sorted pieces that enter the sort phase. Each reduce task produces an output file named part-NNNNN in the job's output directory, where N is the ordinal number of the reduce task, counting from 0. The numeric part of the name is five digits long, padded with leading zeros when needed.
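The zero-padded naming is plain fixed-width formatting; as a one-function Python sketch (part_name is an invented helper, not a Hadoop API):

```python
def part_name(n):
    # Reduce output file name: ordinal padded to 5 digits with zeros.
    return "part-%05d" % n

print(part_name(0))   # part-00000
print(part_name(12))  # part-00012
```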
The next part of the log output illustrates the execution of the only Reduce task:
mapred.LocalJobRunner: reduce > reduce
mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
mapred.TaskRunner: Saved output of task 'attempt_local_0001_r_000000_0' to ➥
file:/home/jason/src/hadoop-0.19.0/test-mini-mr/out
We can see that the sample program writes the output of the reduce task to the file attempt_local_0001_r_000000_0, then renames it to part-00000 and saves it in the output directory of the job.
The output of the log below provides detailed job completion information.
mapred.JobClient: Job complete: job_local_0001
mapred.JobClient: Counters: 11
mapred.JobClient: File Systems
mapred.JobClient: Local bytes read=314895
mapred.JobClient: Local bytes written=359635
mapred.JobClient: Map-Reduce Framework
mapred.JobClient: Reduce input groups=2
mapred.JobClient: Combine output records=0
mapred.JobClient: Map input records=2
mapred.JobClient: Reduce output records=0
mapred.JobClient: Map output bytes=64
mapred.JobClient: Map input bytes=48
mapred.JobClient: Combine input records=0
mapred.JobClient: Map output records=4
mapred.JobClient: Reduce input records=4
The last two lines of the log are printed not by the framework but by the PiEstimator program code.
Job Finished in 2.322 seconds
Estimated value of PI is 3.8
1.1.2 Testing Hadoop
The Hadoop framework provides sample programs for testing the distributed filesystem and the MapReduce jobs that run on it. These test programs are packaged in the hadoop-0.19.0-test.jar file. Figure 1-5 lists these test programs and the functions they provide:
Figure 1‑5 Test programs in the hadoop-0.19.0-test.jar file
test | description
DFSCIOTest | A benchmark of distributed I/O through libhdfs, the shared library that provides HDFS file services to C/C++ applications.
DistributedFSCheck | A distributed check of filesystem consistency.
TestDFSIO | A distributed I/O benchmark.
clustertestdfs | A pseudo-distributed test of the distributed filesystem.
dfsthroughput | Measures the throughput of HDFS.
filebench | Benchmarks of SequenceFileInputFormat and SequenceFileOutputFormat (uncompressed, RECORD-compressed, and BLOCK-compressed) and of TextInputFormat and TextOutputFormat (compressed and uncompressed).
loadgen | A generic MapReduce load generator.
mapredtest | Tests and checks of MapReduce jobs.
mrbench | A MapReduce benchmark that creates many small jobs.
nnbench | Performance benchmarks for the NameNode.
testarrayfile | A test of flat files containing key-value pairs.
testbigmapoutput | A MapReduce job that runs an identity MapReduce over a very large, non-splittable file.
testfilesystem | Filesystem read and write tests.
testipc | A test of interprocess communication in the Hadoop core.
testmapredsort | A program that validates the sorting done by the MapReduce framework.
testrpc | A test of remote procedure calls.
testsequencefile | A test of flat files containing binary key-value pairs.
testsequencefileinputformat | A test of the sequence file input format.
testsetfile | A test of flat files containing binary key-value pairs.
testtextinputformat | A test of the text input format.
threadedmapbench | Compares the performance of map jobs that produce a single spill with map jobs that produce multiple spills.