Experiment 6: A MapReduce Example -- WordCount

Purpose

1. Accurately understand the design principles of MapReduce.

2. Master the writing of the WordCount code.

3. Learn to write your own WordCount-style word-frequency statistics program.

Principle

MapReduce applies a "divide and conquer" strategy: operations on a large data set are distributed to the nodes managed by a master node, and the intermediate results produced by each node are then combined into the final result. In short, MapReduce means "decompose the task and merge the results."

1. How MapReduce works

In distributed computing, the MapReduce framework takes care of the complex issues of parallel programming such as distributed storage, job scheduling, load balancing, fault tolerance, and network communication, and abstracts the processing into two phases, Map and Reduce. The Map phase is responsible for decomposing the task into a number of sub-tasks, and the Reduce phase is responsible for merging the results of those sub-tasks. The specific design ideas are as follows.

(1) The Map phase requires a class that inherits from the Mapper class in the org.apache.hadoop.mapreduce package and overrides its map method. By printing the key and value arguments of the map method to the console, you can see that value holds one line of the input text file (lines are terminated by a carriage return) and key holds the offset of the first character of that line relative to the start of the file. A StringTokenizer then splits each line into fields; the field we need (in this experiment, the buyer id field) is set as the key, which the map method outputs together with the count 1.

(2) The Reduce phase requires a class that inherits from the Reducer class in the org.apache.hadoop.mapreduce package and overrides its reduce method. The <key, value> pairs output by the Map phase first go through the shuffle, which groups all values with the same key together into values; in this case values is the list of counts associated with one key field. The <key, values> pair is then passed to the reduce method, which only needs to iterate over values and sum them to obtain the total number of occurrences of the word.

(3) In the main function, a new Job object is created; the Job object manages and runs a MapReduce computing task, and the parameters of the task are set through its methods. In this experiment, the Mapper subclass doMapper completes the Map processing and the Reducer subclass doReducer completes the Reduce processing. The output types of the Map and Reduce phases are also set: the key type is Text and the value type is IntWritable. The input and output paths of the task are specified as strings and set with FileInputFormat and FileOutputFormat respectively. Once the parameters are set, job.waitForCompletion() is called to run the task, and the rest of the work is handed over to the MapReduce framework. A skeleton of this structure is sketched below.
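
Putting (1)-(3) together, the overall structure looks roughly like the following skeleton. This is only a sketch: the complete, runnable code with the actual field extraction and summation is listed in the Experimental Procedure below, where the input and output paths are hard-coded instead of being taken from args.

  package mapreduce;
  import java.io.IOException;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {
      // (1) Map phase: inherit Mapper and override map().
      public static class doMapper extends Mapper<Object, Text, Text, IntWritable> {
          @Override
          protected void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException {
              // key: offset of the line; value: one line of the input file.
              // Split the line, take the wanted field, and emit <field, 1>.
          }
      }
      // (2) Reduce phase: inherit Reducer and override reduce().
      public static class doReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          @Override
          protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              // values: all counts for this key after the shuffle; sum them and emit <key, sum>.
          }
      }
      // (3) Driver: the Job object manages and runs the task.
      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance();
          job.setJobName("WordCount");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(doMapper.class);            // Map phase class
          job.setReducerClass(doReducer.class);          // Reduce phase class
          job.setOutputKeyClass(Text.class);             // output key type
          job.setOutputValueClass(IntWritable.class);    // output value type
          FileInputFormat.addInputPath(job, new Path(args[0]));    // input path
          FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }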

2. How a MapReduce job runs in the framework


(1) ResourceManager: the central control module of the YARN resource framework, responsible for the unified management and allocation of all resources in the cluster. It receives reports from the NodeManagers (NM), sets up the ApplicationMaster (AM), and hands resources to the AM (ApplicationMaster).

(2) NodeManager: NM for short, the agent of the ResourceManager on each machine. It is responsible for managing containers, monitoring their resource usage (CPU, memory, disk, network, etc.), and reporting that usage to the ResourceManager.

(3) ApplicationMaster: AM for short. YARN starts one AM for each application; the AM is responsible for applying to the RM for resources, asking the NM to start Containers, and telling the Containers what work to do.

(4) Container: the resource container. All applications on YARN run inside Containers, including the AM itself, although the AM's Container is applied for by the RM. A Container is YARN's abstraction of resources: it encapsulates a certain amount of resources on a node (currently CPU and memory). Containers are requested from the ResourceManager by the ApplicationMaster and are allocated asynchronously to the ApplicationMaster by the resource scheduler inside the ResourceManager. The ApplicationMaster then asks the NodeManager of the node that provides the resources to launch the Container. When a Container runs, it must be given the command to execute inside it (which can be any command, for example one that starts a Java, Python, or C++ process), as well as the environment variables and external resources (such as dictionary files, executables, and jar packages) that the command requires.

In addition, the Containers an application needs fall into two categories, as follows:

① The Container that runs the ApplicationMaster: it is applied for and started by the ResourceManager (its internal resource scheduler); when submitting the application, the user only needs to specify the resources the ApplicationMaster requires.

② The Containers that run the application's various tasks: they are applied for from the ResourceManager by the ApplicationMaster, which then communicates with the NodeManagers to start them.

Containers of both categories may be located on any node; their placement is generally random, i.e. a task Container may end up on the same node as the ApplicationMaster that manages it.
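
As a side note (a hedged illustration, not required for this experiment since the job is later launched from Eclipse), a running YARN cluster can be inspected with the standard yarn command-line tools:

  yarn node -list           # NodeManagers currently registered with the ResourceManager
  yarn application -list    # running applications; each one has its own ApplicationMaster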

Lab environment

Ubuntu Linux 14.0

jdk-7u75-linux-x64

hadoop-2.6.0-cdh5.4.5

hadoop-2.6.0-eclipse-cdh5.4.5.jar

eclipse-java-juno-SR2-linux-gtk-x86_64

Experiment content

An e-commerce website has a data set of users' product favorites, which records the buyer id, the product id, and the date the product was favorited. The data set is called buyer_favorite1.

buyer_favorite1 contains three fields: buyer id, product id, and favorite date, separated by "\t" (tab). The sample data format is as follows:

 

  buyer id   product id   favorite date
  10181   1000481   2010-04-04 16:54:31
  20001   1001597   2010-04-07 15:07:52
  20001   1001560   2010-04-07 15:08:27
  20042   1001368   2010-04-08 08:20:30
  20067   1002061   2010-04-08 16:45:33
  20056   1003289   2010-04-12 10:50:55
  20056   1003290   2010-04-12 11:57:35
  20056   1003292   2010-04-12 12:05:29
  20054   1002420   2010-04-14 15:24:12
  20055   1001679   2010-04-14 19:46:04
  20054   1010675   2010-04-14 15:23:53
  20054   1002429   2010-04-14 17:52:45
  20076   1002427   2010-04-14 19:35:39
  20054   1003326   2010-04-20 12:54:44
  20056   1002420   2010-04-15 11:24:49
  20064   1002422   2010-04-15 11:35:54
  20056   1003066   2010-04-15 11:43:01
  20056   1003055   2010-04-15 11:43:06
  20056   1010183   2010-04-15 11:45:24
  20056   1002422   2010-04-15 11:45:49
  20056   1003100   2010-04-15 11:45:54
  20056   1003094   2010-04-15 11:45:57
  20056   1003064   2010-04-15 11:46:04
  20056   1010178   2010-04-15 16:15:20
  20076   1003101   2010-04-15 16:37:27
  20076   1003103   2010-04-15 16:37:05
  20076   1003100   2010-04-15 16:37:18
  20076   1003066   2010-04-15 16:37:31
  20054   1003103   2010-04-15 16:40:14
  20054   1003100   2010-04-15 16:40:16

Requirement: write a MapReduce program to count the number of products favorited by each buyer.

The statistics results are as follows:

  buyer id   quantity
  10181   1
  20001   2
  20042   1
  20054   6
  20055   1
  20056   12
  20064   1
  20067   1
  20076   5

Experimental Procedure

1. Change directory to /apps/hadoop/sbin and start Hadoop (an optional check is shown after the commands below).

  cd /apps/hadoop/sbin
  ./start-dfs.sh
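
As an optional check (not part of the original procedure), the JDK's jps command lists the running Java processes; if HDFS started correctly, NameNode, DataNode, and SecondaryNameNode should appear in its output:

  jps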

2. On Linux, create the directory /data/mapreduce1.

  mkdir -p /data/mapreduce1

3. Switch to the /data/mapreduce1 directory and create the text file buyer_favorite1 yourself.

Still in the /data/mapreduce1 directory, use the wget command to download hadoop2lib.tar.gz from the network; it contains the jar packages the project depends on.

Extract hadoop2lib.tar.gz into the current directory. (A combined sketch of this step is shown after the extraction command below.)

  tar -xzvf hadoop2lib.tar.gz
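
A sketch of this whole step, assuming the records are typed into the file with vim; the download URL below is only a placeholder, so substitute the actual address used in your environment:

  cd /data/mapreduce1
  vim buyer_favorite1                           # paste the sample records shown above, one per line, tab-separated
  wget http://example.com/hadoop2lib.tar.gz     # placeholder URL; use the real download address
  tar -xzvf hadoop2lib.tar.gz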

4. Upload the local Linux file /data/mapreduce1/buyer_favorite1 to the /mymapreduce1/in directory on HDFS. If the HDFS directory does not exist, create it first. (An optional verification is shown after the commands below.)

  hadoop fs -mkdir -p /mymapreduce1/in
  hadoop fs -put /data/mapreduce1/buyer_favorite1 /mymapreduce1/in
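
Optionally, the upload can be verified with the HDFS shell:

  hadoop fs -ls /mymapreduce1/in
  hadoop fs -cat /mymapreduce1/in/buyer_favorite1 | head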


5. Open Eclipse, create a new Java Project, and name the project mapreduce1.

6. Under the project mapreduce1, create a new package named mapreduce.

7. In the newly created package mapreduce, create a new class named WordCount.


8. Add the jar packages the project depends on: right-click the project name and create a new folder named hadoop2lib for storing the jar packages the project needs.


Copy all the jar packages from the hadoop2lib directory under /data/mapreduce1 on Linux into the hadoop2lib folder of the mapreduce1 project in Eclipse.


Select all the jar packages under the hadoop2lib directory, right-click them, and choose Build Path => Add to Build Path.


9. Write the Java code and describe the design ideas.

The following diagram depicts the execution flow of MapReduce:

(Figure: MapReduce execution flow)

The general idea is to use a text file on HDFS as the input. MapReduce slices the text through an InputFormat and uses the offset of the first character of each line relative to the start of the file as the key, and the line of text itself as the value, of the input key-value pair. After processing in the map function, intermediate results of the form <word, 1> are output, and the frequency of each word is then computed in the reduce function. The whole program consists of two parts: the Mapper part and the Reducer part.
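
As a worked example, take the first three sample records as input (the offsets assume Unix line endings and a data file that contains only the records, no header row); the data flows roughly as follows:

  input <offset, line>:
    <0,  "10181\t1000481\t2010-04-04 16:54:31">
    <34, "20001\t1001597\t2010-04-07 15:07:52">
    <68, "20001\t1001560\t2010-04-07 15:08:27">
  map output <word, 1>:
    <10181, 1>  <20001, 1>  <20001, 1>
  after shuffle, <key, values>:
    <10181, [1]>  <20001, [1, 1]>
  reduce output <key, sum>:
    <10181, 1>  <20001, 2>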

Mapper Code

  public static class doMapper extends Mapper<Object, Text, Text, IntWritable> {
      // The four type parameters are, in order: the input key type (Object), the input value
      // type (Text), the output key type (Text), and the output value type (IntWritable).
      public static final IntWritable one = new IntWritable(1);
      public static Text word = new Text();
      @Override
      protected void map(Object key, Text value, Context context)
              throws IOException, InterruptedException {   // declared exceptions
          // StringTokenizer is a Java utility class for splitting strings.
          StringTokenizer tokenizer = new StringTokenizer(value.toString(), "\t");
          // nextToken() returns the substring up to the next delimiter, i.e. the buyer id field.
          word.set(tokenizer.nextToken());
          // Emit <word, 1>: record one occurrence of this word.
          context.write(word, one);
      }
  }

The map function has three parameters: the first two, Object key and Text value, are the input key and value, and the third, Context context, records the input key-value pair, as in context.write(word, one); in addition, the context also records the state of the map operation. The map phase uses Hadoop's default job input format; the input value is split with StringTokenizer(), the buyer id field that is taken out is set as the key, the value is set to 1, and <key, value> is then output directly.
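
As a small standalone sketch (separate from the job itself), the following shows how StringTokenizer pulls the buyer id out of one record:

  import java.util.StringTokenizer;

  public class TokenizerDemo {
      public static void main(String[] args) {
          String line = "10181\t1000481\t2010-04-04 16:54:31";      // one sample record
          StringTokenizer tokenizer = new StringTokenizer(line, "\t");
          System.out.println(tokenizer.nextToken());                // prints "10181", the buyer id
      }
  }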

Reducer Code

  public static class doReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      // As in the Mapper, the four type parameters are the input key type, the input value
      // type, the output key type, and the output value type, in that order.
      private IntWritable result = new IntWritable();
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
          int sum = 0;
          // Iterate over values with a for loop and accumulate the counts.
          for (IntWritable value : values) {
              sum += value.get();
          }
          result.set(sum);
          context.write(key, result);
      }
  }

The <key, value> pairs output by map must first go through the shuffle, which aggregates all values with the same key into <key, values> pairs and passes them to the reduce side. After reduce receives <key, values>, it copies the input key directly to the output key and sums the values with a for loop; the resulting sum, representing the total number of times the word (key) appears, is set as the output value, and <key, value> is output directly.

The complete code

  package mapreduce;
  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  public class WordCount {
      public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
          Job job = Job.getInstance();
          job.setJobName("WordCount");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(doMapper.class);
          job.setReducerClass(doReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          Path in = new Path("hdfs://localhost:9000/mymapreduce1/in/buyer_favorite1");
          Path out = new Path("hdfs://localhost:9000/mymapreduce1/out");
          FileInputFormat.addInputPath(job, in);
          FileOutputFormat.setOutputPath(job, out);
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
      public static class doMapper extends Mapper<Object, Text, Text, IntWritable> {
          public static final IntWritable one = new IntWritable(1);
          public static Text word = new Text();
          @Override
          protected void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException {
              StringTokenizer tokenizer = new StringTokenizer(value.toString(), "\t");
              word.set(tokenizer.nextToken());
              context.write(word, one);
          }
      }
      public static class doReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          private IntWritable result = new IntWritable();
          @Override
          protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable value : values) {
                  sum += value.get();
              }
              result.set(sum);
              context.write(key, result);
          }
      }
  }

10. Right-click the WordCount class file and choose Run As => Run on Hadoop to submit the MapReduce task to Hadoop.
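
As an alternative (not part of the original procedure), if the project were exported from Eclipse as a jar file, say wordcount.jar (a hypothetical name), the job could also be submitted from a terminal:

  hadoop jar wordcount.jar mapreduce.WordCount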


11. After the program finishes, view its output on HDFS, either with the Hadoop Eclipse plug-in or from a terminal, as shown below.
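
From a terminal, the output directory set in the code can be listed and read with the HDFS shell:

  hadoop fs -ls /mymapreduce1/out
  hadoop fs -cat /mymapreduce1/out/part-r-00000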


In Eclipse, view the part-r-00000 file under DFS Locations.

