Proficient in Hadoop (5) - A First Look at Hadoop - Running and Testing the Hadoop Sample Programs

 

1.1 Running and testing the Hadoop sample programs

In the Hadoop installation directory you will find JAR files containing Hadoop sample programs that you can use to try out Hadoop. Before you execute these sample programs, you should ensure that your installation is complete and your runtime environment is set up correctly. As we mentioned in the previous section, the check_basic_env.sh script can help you to verify the installation, and if there are any errors in the installation, it will prompt you to correct them.

 

1.1.1 Hadoop sample code

The hadoop-0.19.0-examples.jar file contains several sample programs that can be run directly. The sample programs contained in this JAR file are listed in Figure 1-4.

Figure 1-4 Sample programs in the hadoop-0.19.0-examples.jar file

Program             Description

aggregatewordcount  An Aggregate-based MapReduce program that counts the words in the input files.
aggregatewordhist   An Aggregate-based MapReduce program that computes a histogram of the words in the input files.
grep                A MapReduce program that counts the matches of a regular expression in the input files.
join                A job that performs a join over sorted, equally partitioned datasets.
multifilewc         A job that counts the words in several files.
pentomino           A tile-laying MapReduce program that finds solutions to pentomino puzzles.
pi                  A MapReduce program that estimates the value of PI using the Monte Carlo method.
randomtextwriter    A MapReduce program that writes 10 GB of random textual data per node.
randomwriter        A MapReduce program that writes 10 GB of random data per node.
sleep               A job that sleeps at each map and reduce task.
sort                A MapReduce program that sorts the data written by the random writer.
sudoku              A Sudoku solver.
wordcount           A MapReduce program that counts the words in the input files.

 

1.1.1.1 Running the PI calculator

The PI calculator sample program estimates the value of PI using the Monte Carlo method. A technical discussion of this algorithm is available at http://www.chem.unl.edu/zeng/joy/mclab/mcintro.html . The samples are points picked at random inside a square; the larger the number of samples, the more accurate the estimate of PI. To keep the run short and simple, we compute only a rough value of PI from a small number of samples.

The PI program takes two integer parameters: the number of map tasks, and the number of samples per map task. The total number of samples in the calculation is the number of map tasks multiplied by the number of samples per map task.

Each map task generates random points within a 1-by-1 square. For each sample point (x, y), if x² + y² ≤ 1, the point falls inside the quarter circle of radius 1; otherwise it falls outside. The map task emits a key of 1 for a point inside the quarter circle and 0 for a point outside, and the emitted value is always 1. The reduce task counts the number of points inside and the number of points outside. Because the quarter circle covers PI/4 of the square's area, PI is estimated as 4 times the fraction of points that fall inside.

In this sample run, to make the program execute quickly and produce little output, we choose 2 map tasks with 10 samples each, for a total of 20 samples.
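The following is a minimal, standalone Java sketch of the sampling and tallying logic just described. It is not the PiEstimator source (which runs the sampling inside map tasks and may pick its points differently); it only mirrors the estimate the job computes:

import java.util.Random;

// A standalone sketch of the Monte Carlo logic described above -- NOT the
// actual PiEstimator source. Each "map task" draws points in the unit
// square; the "reduce" step tallies how many fall inside the quarter circle.
public class PiSketch {
    public static void main(String[] args) {
        int numMaps = 2, samplesPerMap = 10;   // the two command-line parameters
        Random random = new Random();
        long inside = 0, outside = 0;

        for (int map = 0; map < numMaps; map++) {
            for (int i = 0; i < samplesPerMap; i++) {
                double x = random.nextDouble();   // point in the 1-by-1 square
                double y = random.nextDouble();
                if (x * x + y * y <= 1.0) {       // inside the quarter circle
                    inside++;
                } else {
                    outside++;
                }
            }
        }

        // The quarter circle covers PI/4 of the square, so PI ~= 4 * inside/total.
        double estimate = 4.0 * inside / (inside + outside);
        System.out.println("Estimated value of PI is " + estimate);
    }
}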

If you want to execute this program, change your working directory to HADOOP_HOME (via cd ${HADOOP_HOME}) and then type the following command:

jason@cloud9:~/src/hadoop-0.19.0$ hadoop jar hadoop-0.19.0-examples.jar pi 2 10

The bin/hadoop jar command submits jobs to the cluster. The command line is processed in three steps, each of which consumes a portion of the command-line arguments. We'll see the details of argument handling in Chapter 5. For now, we only need to know that the hadoop-0.19.0-examples.jar file declares the main class of this application, and that this class accepts three arguments.
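To make the dispatch idea concrete, here is a hedged sketch of how such a main class might route the first argument (pi) to the right example and pass along the remaining arguments. The class and structure below are hypothetical illustrations, not the actual Hadoop examples driver:

// A simplified, hypothetical sketch of a "driver" main class that
// dispatches the first command-line argument to one of several examples.
public class ExamplesDriverSketch {
    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("Usage: hadoop jar examples.jar <program> <args...>");
            System.exit(1);
        }
        // The first argument selects the example; the rest are passed through.
        String program = args[0];
        String[] rest = new String[args.length - 1];
        System.arraycopy(args, 1, rest, 0, rest.length);

        if ("pi".equals(program)) {
            // e.g. rest = {"2", "10"}: 2 map tasks, 10 samples each
            int nMaps = Integer.parseInt(rest[0]);
            int nSamples = Integer.parseInt(rest[1]);
            System.out.println("Number of Maps = " + nMaps
                + " Samples per Map = " + nSamples);
            // ... configure and submit the MapReduce job here ...
        } else {
            System.err.println("Unknown program: " + program);
        }
    }
}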

 

1.1.1.2 Viewing the output: input splits, shuffle, spill, and sort

Listing 1-3 is the output of this application.

Listing 1-3 Output of the sample PI program

Number of Maps = 2 Samples per Map = 10

Wrote input for Map #0

Wrote input for Map #1

Starting Job

jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=

mapred.FileInputFormat: Total input paths to process : 2

mapred.FileInputFormat: Total input paths to process : 2

mapred.JobClient: Running job: job_local_0001

mapred.FileInputFormat: Total input paths to process : 2

mapred.FileInputFormat: Total input paths to process : 2

mapred.MapTask: numReduceTasks: 1

mapred.MapTask: io.sort.mb = 100

mapred.MapTask: data buffer = 79691776/99614720

mapred.MapTask: record buffer = 262144/327680

mapred.JobClient: map 0% reduce 0%

mapred.MapTask: Starting flush of map output

mapred.MapTask: bufstart = 0; bufend = 32; bufvoid = 99614720

mapred.MapTask: kvstart = 0; kvend = 2; length = 327680

mapred.LocalJobRunner: Generated 1 samples

mapred.MapTask: Index: (0, 38, 38)

mapred.MapTask: Finished spill 0

mapred.LocalJobRunner: Generated 1 samples.

mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.

mapred.TaskRunner: Saved output of task 'attempt_local_0001_m_000000_0' ➥

to file:/home/jason/src/hadoop-0.19.0/test-mini-mr/outmapred.

MapTask: numReduceTasks: 1

mapred.MapTask: io.sort.mb = 100

mapred.JobClient: map 0% reduce 0%

mapred.LocalJobRunner: Generated 1 samples

mapred.MapTask: data buffer = 79691776/99614720

mapred.MapTask: record buffer = 262144/327680

mapred.MapTask: Starting flush of map output

mapred.MapTask: bufstart = 0; bufend = 32; bufvoid = 99614720

mapred.MapTask: kvstart = 0; kvend = 2; length = 327680

mapred.JobClient: map 100% reduce 0%

mapred.MapTask: Index: (0, 38, 38)

mapred.MapTask: Finished spill 0

mapred.LocalJobRunner: Generated 1 samples.

mapred.TaskRunner: Task 'attempt_local_0001_m_000001_0' done.

mapred.TaskRunner: Saved output of task 'attempt_local_0001_m_000001_0' ➥

to file:/home/jason/src/hadoop-0.19.0/test-mini-mr/out

mapred.ReduceTask: Initiating final on-disk merge with 2 files

mapred.Merger: Merging 2 sorted segments

mapred.Merger: Down to the last merge-pass, with 2 segments left of ➥

total size: 76 bytes

mapred.LocalJobRunner: reduce > reduce

mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.

mapred.TaskRunner: Saved output of task 'attempt_local_0001_r_000000_0' ➥

to file:/home/jason/src/hadoop-0.19.0/test-mini-mr/out

mapred.JobClient: Job complete: job_local_0001

mapred.JobClient: Counters: 11

mapred.JobClient: File Systems

mapred.JobClient: Local bytes read=314895

mapred.JobClient: Local bytes written=359635

mapred.JobClient: Map-Reduce Framework

mapred.JobClient: Reduce input groups=2

mapred.JobClient: Combine output records=0

mapred.JobClient: Map input records=2

mapred.JobClient: Reduce output records=0

mapred.JobClient: Map output bytes=64

mapred.JobClient: Map input bytes=48

mapred.JobClient: Combine input records=0

mapred.JobClient: Map output records=4

mapred.JobClient: Reduce input records=4

Job Finished in 2.322 seconds

Estimated value of PI is 3.8

Note that the Hadoop project uses the Apache Foundation's log4j package for logging. By default, each line of the framework's log output starts with a timestamp, followed by the log level and the name of the class that produced the message, and only messages at the INFO level or higher are printed. For readability, I have removed the timestamps and log levels from the output shown here.

The line you are most interested in is the last line of the output, which reads "Estimated value of PI is...". It means that your Hadoop installation is working and can execute your application correctly.

Below we'll take a step-by-step look at the output of Listing 1-3, which will help you understand how this sample program works, and even spot errors if a program has them.

The first lines are output while Hadoop initializes the PI calculator program; we can see that the input is 2 map tasks, each with 10 samples.

Number of Maps = 2 Samples per Map = 10

Wrote input for Map #0

Wrote input for Map #1

Then the framework starts and takes over the control flow. It performs input splitting (dividing the input into independent pieces is called input splitting).
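As a rough sketch of how split sizing can work (this mirrors the 0.19-era FileInputFormat rule as I understand it; the class below is only an illustration): the goal size is the total input divided by the requested number of map tasks, clamped between a minimum size and the filesystem block size.

public class SplitSizeSketch {
    // Assumed 0.19-era rule: clamp goalSize between minSize and blockSize.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long totalBytes = 48;                 // the PI job's tiny input ("Map input bytes=48")
        long requestedMaps = 2;
        long blockSize = 64L * 1024 * 1024;   // a typical HDFS block size
        long goalSize = totalBytes / requestedMaps;
        // Tiny inputs like these end up as one split per file.
        System.out.println("split size = " + computeSplitSize(goalSize, 1, blockSize));
    }
}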

You can get the job ID from the line below; with the job ID, you can find the job in the job-control tools.

Running job: job_local_0001

The following lines tell us that there are two input files and two input splits.

jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=

mapred.FileInputFormat: Total input paths to process : 2

mapred.FileInputFormat: Total input paths to process : 2

mapred.JobClient: Running job: job_local_0001

mapred.FileInputFormat: Total input paths to process : 2

mapred.FileInputFormat: Total input paths to process : 2

The output of each map task is partitioned into pieces, one per reduce task, and each piece is sorted; this stage is known as the shuffle. For each reduce task, the framework fetches its piece of the output of every map task and then merge-sorts these pieces into a single sorted run; this stage is known as the sort.
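Which reduce task a given map output record goes to is decided by a partitioner. As a sketch, default hash partitioning boils down to taking the key's hash code modulo the number of reduce tasks (the class below is an illustration, not Hadoop source):

public class HashPartitionSketch {
    // A sketch of default hash partitioning: mask off the sign bit so the
    // result is non-negative, then take it modulo the reduce-task count.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With a single reduce task, as in the PI job, every record lands
        // in partition 0 -- one sorted piece per map task.
        System.out.println(getPartition("anyKey", 1));   // prints 0
    }
}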

In the next part of Listing 1-3, we can see the details of the shuffle for the map tasks. The framework collects the output records of all the map tasks and feeds them to a single reduce task (numReduceTasks: 1). If you had specified multiple reduce tasks, you would see a Finished spill N line for each of them. The rest of the log concerns output buffering, which we need not care about here.

Below, you will see:

mapred.MapTask: numReduceTasks: 1

...

mapred.MapTask: Finished spill 0

mapred.LocalJobRunner: Generated 1 samples.

mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0'

done.mapred.TaskRunner: Saved output of task

'attempt_local_0001_m_000000_0'

to file:/home/jason/src/hadoop-0.19.0/test-mini-mr/out

"Generated 1 samples" is the final status output of the map task. The Hadoop framework tells you that the first map task completed as attempt 'attempt_local_0001_m_000000_0' and that its output has been saved to the default filesystem at file:/home/jason/src/hadoop-0.19.0/test-mini-mr/out .

The following part is the output of the sorting process:

mapred.ReduceTask: Initiating final on-disk merge with 2 files

mapred.Merger: Merging 2 sorted segments

mapred.Merger: Down to the last merge-pass, with 2 segments left of

total size: 76 bytes

According to the command-line parameters, the run in Listing 1-3 has 2 map tasks and one reduce task. Since there is only one reduce task, the output of all the map tasks is merged into a single sorted run; the two map tasks produce the two pieces that enter the sort phase. Each reduce task writes an output file named part-NNNNN in the output directory, where NNNNN is the ordinal number of the reduce task, starting from 0. The numerical part of the name is made up of five digits, left-padded with 0s when the number has fewer than five digits.
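Such zero-padded names can be produced with java.text.NumberFormat set to a minimum of five integer digits; a small sketch (a hypothetical helper, shown only to illustrate the naming):

import java.text.NumberFormat;

public class PartNameSketch {
    public static void main(String[] args) {
        NumberFormat fmt = NumberFormat.getInstance();
        fmt.setMinimumIntegerDigits(5);   // pad the ordinal to five digits
        fmt.setGroupingUsed(false);       // no "00,000"-style separators
        for (int n = 0; n < 3; n++) {
            System.out.println("part-" + fmt.format(n)); // part-00000, part-00001, part-00002
        }
    }
}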

The next part of the log output illustrates the execution of the only Reduce task:

mapred.LocalJobRunner: reduce > reduce

mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.

mapred.TaskRunner: Saved output of task 'attempt_local_0001_r_000000_0' to

file:/home/jason/src/hadoop-0.19.0/test-mini-mr/out

We can see that the sample program writes the output of the reduce task to the file attempt_local_0001_r_000000_0, then renames it to part-00000 and saves it in the output directory of the job.

The output of the log below provides detailed job completion information.

mapred.JobClient: Job complete: job_local_0001

mapred.JobClient: Counters: 11

mapred.JobClient: File Systems

mapred.JobClient: Local bytes read=314895

mapred.JobClient: Local bytes written=359635

mapred.JobClient: Map-Reduce Framework

mapred.JobClient: Reduce input groups=2

mapred.JobClient: Combine output records=0

mapred.JobClient: Map input records=2

mapred.JobClient: Reduce output records=0

mapred.JobClient: Map output bytes=64

mapred.JobClient: Map input bytes=48

mapred.JobClient: Combine input records=0

mapred.JobClient: Map output records=4

mapred.JobClient: Reduce input records=4

The last two lines of the log are printed not by the framework but by the PiEstimator program itself.

Job Finished in 2.322 seconds

Estimated value of PI is 3.8

 

1.1.2 Testing Hadoop

The Hadoop framework provides sample programs for testing the distributed filesystem and the MapReduce jobs that run on it. These test programs are included in the hadoop-0.19.0-test.jar file. Figure 1-5 lists these test programs and the functions they provide:

Figure 1-5 Test programs in the hadoop-0.19.0-test.jar file

Test                         Description

DFSCIOTest                   Distributed I/O benchmark of libhdfs. libhdfs is a shared library that provides HDFS file services to C/C++ applications.
DistributedFSCheck           Distributed check of filesystem consistency.
TestDFSIO                    Distributed I/O benchmark.
clustertestdfs               Pseudo-distributed test of the distributed filesystem.
dfsthroughput                Measures HDFS throughput.
filebench                    Benchmark of SequenceFileInputFormat and SequenceFileOutputFormat (BLOCK-compressed, RECORD-compressed, and uncompressed) and of TextInputFormat and TextOutputFormat (compressed and uncompressed).
loadgen                      Generic MapReduce load generator.
mapredtest                   Testing and checking of MapReduce jobs.
mrbench                      A MapReduce benchmark that creates many small jobs.
nnbench                      Performance benchmark for the NameNode.
testarrayfile                A test of flat files of key-value pairs.
testbigmapoutput             A MapReduce job that runs an identity MapReduce over a very large, non-splittable file.
testfilesystem               Filesystem read and write test.
testipc                      A test of the interprocess communication of the Hadoop core.
testmapredsort               A program that validates the sorting done by the MapReduce framework.
testrpc                      A test of remote procedure calls.
testsequencefile             A test of flat files containing binary key-value pairs.
testsequencefileinputformat  A test of the sequence file input format.
testsetfile                  A test of flat files containing binary key-value pairs.
testtextinputformat          A test of the text input format.
threadedmapbench             A benchmark comparing the performance of map jobs that produce a single spill with map jobs that produce multiple spills.
