Purpose
1. Gain an accurate understanding of the design principles of MapReduce
2. Master the WordCount example code
3. Learn to write your own WordCount program for word-frequency statistics
Principle
MapReduce applies a "divide and conquer" approach: an operation on a large data set is distributed by a master node to the worker nodes under its management, and the intermediate results produced by each node are then combined into the final result. In short, MapReduce means "decompose the task and summarize the results."
1. How MapReduce works
In distributed computing, the MapReduce framework handles the complex issues of parallel programming: distributed storage, job scheduling, load balancing, fault tolerance, and network communication. It abstracts the processing into two phases, Map and Reduce: the Map phase is responsible for splitting a task into multiple subtasks, and the Reduce phase is responsible for combining the results of those subtasks. The specific design ideas are as follows.
(1) The Map stage requires a class that inherits the Mapper class from the org.apache.hadoop.mapreduce package and overrides its map method. By printing the key and value arguments of the map method to the console, you can see that value holds one line of the input text file (lines are terminated by a carriage return) and key holds the offset of that line's first character relative to the start of the file. A StringTokenizer then splits each line into fields, the required field (in this experiment, the buyer id) is set as the key, and the result is emitted as the map method's output.
(2) The Reduce stage requires a class that inherits the Reducer class from the org.apache.hadoop.mapreduce package and overrides its reduce method. The <key, value> pairs output by the Map stage first go through the shuffle, which gathers all values with the same key into a values list; in this experiment, values is the list of counts belonging to one key (the buyer id field). Each <key, values> pair is then passed to the reduce method, which only needs to traverse values and sum them to obtain the total count for that key.
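To make the shuffle step described above concrete, here is a minimal plain-Java sketch (not Hadoop API code; the class and method names are illustrative) that groups map-output pairs by key the way the shuffle does before handing them to reduce:

```java
import java.util.*;

public class ShuffleSketch {
    // Group map-output (key, value) pairs by key, as the shuffle does,
    // so the reduce side sees one (key, list-of-values) per distinct key.
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> mapOutput) {
        // TreeMap keeps the keys sorted, mirroring Hadoop's sort phase.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("20001", 1), Map.entry("10181", 1), Map.entry("20001", 1));
        System.out.println(shuffle(mapOutput)); // {10181=[1], 20001=[1, 1]}
    }
}
```

Running this on three sample pairs shows buyer 20001's two counts gathered into one list, which is exactly the <key, values> shape the reduce method receives.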
(3) The main function creates a new Job object, which is responsible for managing and running the MapReduce computing task, and sets the task's parameters through the Job's methods. This experiment uses doMapper, which inherits from Mapper, to handle the Map stage, and doReducer, which inherits from Reducer, to handle the Reduce stage. It also sets the output types of the Map and Reduce stages: the key type is Text and the value type is IntWritable. The task's input and output paths are specified as strings and set via FileInputFormat and FileOutputFormat. Once the parameters are set, calling the job.waitForCompletion() method runs the task; the rest of the work is handled by the MapReduce framework.
2. How a MapReduce job runs in the framework
(1) ResourceManager: the central resource-control module of the YARN framework, responsible for the unified management and allocation of all resources in the cluster. It receives reports from the NodeManagers (NM), establishes the ApplicationMaster (AM), and sends resources to the AM.
(2) NodeManager: abbreviated NM, the ResourceManager's agent on each machine. It is responsible for managing containers, monitoring their resource usage (CPU, memory, disk, network, etc.), and reporting that usage to the ResourceManager.
(3) ApplicationMaster: hereinafter referred to as AM. Each application in YARN starts one AM, which is responsible for requesting resources from the RM, asking the NMs to start Containers, and telling the Containers what to do.
(4) Container: the resource container. All applications in YARN run inside Containers, including the AM itself, although the AM's Container is requested by the RM. A Container is YARN's abstraction of resources: it encapsulates a certain amount of resources on a node (currently CPU and memory). The ApplicationMaster requests Containers from the ResourceManager, and the ResourceManager's resource scheduler assigns them to the ApplicationMaster asynchronously. A Container is launched by the ApplicationMaster through the NodeManager on the node that provides the resources. When starting a Container, the task command to be executed inside it must be provided (it can be any command, for example one that starts a Java, Python, or C++ process), along with the environment variables and external resources (such as dictionary files, executables, jar packages, etc.) that the command needs.
In addition, the Containers an application needs fall into two categories, as follows:
① The Container that runs the ApplicationMaster: this is requested and started by the ResourceManager (its internal resource scheduler). When submitting the application, the user can specify only the resources needed by the ApplicationMaster.
② The Containers that run the application's various tasks: these are requested from the ResourceManager by the ApplicationMaster, which then communicates with the NodeManagers to start them.
Containers of both categories may run on any node, and their placement is generally random; that is, a task Container may end up on the same node as the ApplicationMaster that manages it.
Lab Environment
Ubuntu Linux 14.0
jdk-7u75-linux-x64
hadoop-2.6.0-cdh5.4.5
hadoop-2.6.0-eclipse-cdh5.4.5.jar
eclipse-java-juno-SR2-linux-gtk-x86_64
Experiment Content
An e-commerce website has a data set of users' favorited products, recording the user id, the favorited product, and the date it was favorited, in a file called buyer_favorite1.
buyer_favorite1 contains three fields: buyer id, product id, and favorite date, separated by "\t". The sample data format is as follows:
- buyer id   product id   favorite date
- 10181 1000481 2010-04-04 16:54:31
- 20001 1001597 2010-04-07 15:07:52
- 20001 1001560 2010-04-07 15:08:27
- 20042 1001368 2010-04-08 08:20:30
- 20067 1002061 2010-04-08 16:45:33
- 20056 1003289 2010-04-12 10:50:55
- 20056 1003290 2010-04-12 11:57:35
- 20056 1003292 2010-04-12 12:05:29
- 20054 1002420 2010-04-14 15:24:12
- 20055 1001679 2010-04-14 19:46:04
- 20054 1010675 2010-04-14 15:23:53
- 20054 1002429 2010-04-14 17:52:45
- 20076 1002427 2010-04-14 19:35:39
- 20054 1003326 2010-04-20 12:54:44
- 20056 1002420 2010-04-15 11:24:49
- 20064 1002422 2010-04-15 11:35:54
- 20056 1003066 2010-04-15 11:43:01
- 20056 1003055 2010-04-15 11:43:06
- 20056 1010183 2010-04-15 11:45:24
- 20056 1002422 2010-04-15 11:45:49
- 20056 1003100 2010-04-15 11:45:54
- 20056 1003094 2010-04-15 11:45:57
- 20056 1003064 2010-04-15 11:46:04
- 20056 1010178 2010-04-15 16:15:20
- 20076 1003101 2010-04-15 16:37:27
- 20076 1003103 2010-04-15 16:37:05
- 20076 1003100 2010-04-15 16:37:18
- 20076 1003066 2010-04-15 16:37:31
- 20054 1003103 2010-04-15 16:40:14
- 20054 1003100 2010-04-15 16:40:16
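As a quick sanity check on the record layout above, the following small plain-Java snippet (illustrative, not part of the experiment's MapReduce code) splits one sample line into its three tab-separated fields:

```java
public class RecordLayout {
    // Each record has three tab-separated fields; the first is the buyer id
    // that the MapReduce job will count.
    public static String[] fields(String line) {
        return line.split("\t");
    }

    public static void main(String[] args) {
        String[] f = fields("10181\t1000481\t2010-04-04 16:54:31");
        System.out.println(f[0] + " | " + f[1] + " | " + f[2]);
        // 10181 | 1000481 | 2010-04-04 16:54:31
    }
}
```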
Requirement: write a MapReduce program that counts the number of products favorited by each buyer.
The resulting statistics are as follows:
- buyer id   count
- 10181 1
- 20001 2
- 20042 1
- 20054 6
- 20055 1
- 20056 12
- 20064 1
- 20067 1
- 20076 5
Experimental Procedure
1. Change directory to /apps/hadoop/sbin and start Hadoop.
- cd /apps/hadoop/sbin
- ./start-dfs.sh
2. On Linux, create the directory /data/mapreduce1.
- mkdir -p /data/mapreduce1
3. Switch to the /data/mapreduce1 directory and create the text file buyer_favorite1 by hand.
Still in the /data/mapreduce1 directory, use the wget command to download hadoop2lib.tar.gz from the network; it contains the jar packages the project depends on.
Extract hadoop2lib.tar.gz into the current directory.
- tar -xzvf hadoop2lib.tar.gz
4. Upload the local Linux file /data/mapreduce1/buyer_favorite1 to the /mymapreduce1/in directory on HDFS. If the HDFS directory does not exist, create it first.
- hadoop fs -mkdir -p /mymapreduce1/in
- hadoop fs -put /data/mapreduce1/buyer_favorite1 /mymapreduce1/in
5. Open Eclipse and create a new Java Project named mapreduce1.
6. Under the project mapreduce1, create a new package named mapreduce.
7. In the mapreduce package, create a new class named WordCount.
8. Add the jar packages the project depends on: right-click the project name and create a new directory named hadoop2lib for storing the project's jar packages.
Copy all the jar packages from the /data/mapreduce1/hadoop2lib directory on Linux into the hadoop2lib directory of the mapreduce1 project in Eclipse.
Select all the jar packages under the hadoop2lib directory, right-click, and choose Build Path => Add to Build Path.
9. Write the Java code and describe its design ideas.
The following diagram depicts the execution of a MapReduce job.
The general idea is to take the text file on HDFS as input. MapReduce slices the text through InputFormat: the offset of each line's first character relative to the start of the file becomes the key of a key-value pair, and the line's text becomes the value. The map function processes each pair and outputs intermediate results of the form <word, 1>, and the reduce function then completes the frequency count for each word. The program code consists of two parts: the Mapper part and the Reducer part.
Mapper Code
- public static class doMapper extends Mapper<Object, Text, Text, IntWritable>{
- // The four type parameters are, in order: input key type (Object), input value type (Text), output key type (Text), output value type (IntWritable)
- public static final IntWritable one = new IntWritable(1);
- public static Text word = new Text();
- @Override
- protected void map(Object key, Text value, Context context)
- throws IOException, InterruptedException {
- StringTokenizer tokenizer = new StringTokenizer(value.toString(), "\t");
- // StringTokenizer is a Java utility class for splitting strings
- word.set(tokenizer.nextToken());
- // nextToken() returns the substring up to the next delimiter
- context.write(word, one);
- // emit the word with a count of one
- }
- }
The map function has three parameters: the first two, Object key and Text value, are the input key and value, and the third, Context context, records the output key-value pairs, e.g. context.write(word, one); the context also records the state of the map operation. The map phase uses Hadoop's default job input format, splits the input value with a StringTokenizer, takes the extracted buyer id field as the key, sets the value to 1, and then outputs <key, value> directly.
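The field extraction described above can be tried in isolation. This small sketch (illustrative names, outside the Hadoop API) shows what nextToken() returns for one sample record:

```java
import java.util.StringTokenizer;

public class TokenizerDemo {
    // nextToken() returns the text up to the next "\t" delimiter,
    // which for buyer_favorite1 is the buyer id field.
    public static String firstField(String line) {
        StringTokenizer tokenizer = new StringTokenizer(line, "\t");
        return tokenizer.nextToken();
    }

    public static void main(String[] args) {
        System.out.println(firstField("20001\t1001597\t2010-04-07 15:07:52")); // 20001
    }
}
```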
Reducer Code
- public static class doReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
- // The same pattern as the Mapper's type parameters: input key type, input value type, output key type, output value type
- private IntWritable result = new IntWritable();
- @Override
- protected void reduce(Text key, Iterable<IntWritable> values, Context context)
- throws IOException, InterruptedException {
- int sum = 0;
- for (IntWritable value : values) {
- sum += value.get();
- }
- // the for loop traverses values and accumulates their sum
- result.set(sum);
- context.write(key, result);
- }
- }
The <key, value> pairs output by map must first go through the shuffle, which gathers all values with the same key together into <key, values> pairs delivered to the reduce side. After reduce receives a <key, values> pair, it copies the input key directly to the output key and traverses values with a for loop, summing them; the sum is the total number of times the key (the word) appears, so it is set as the output value, and <key, value> is output directly.
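The summation performed by reduce can likewise be checked in isolation. The following plain-Java sketch (illustrative names, no Hadoop dependencies) mirrors the loop in doReducer:

```java
import java.util.List;

public class ReduceSketch {
    // The reduce step for one key: sum the list of 1s emitted by the map side.
    public static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Six 1s, as the shuffle would gather for a buyer with six records.
        System.out.println(sum(List.of(1, 1, 1, 1, 1, 1))); // 6
    }
}
```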
The complete code
- package mapreduce;
- import java.io.IOException;
- import java.util.StringTokenizer;
- import org.apache.hadoop.fs.Path;
- import org.apache.hadoop.io.IntWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.mapreduce.Job;
- import org.apache.hadoop.mapreduce.Mapper;
- import org.apache.hadoop.mapreduce.Reducer;
- import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
- import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
- public class WordCount {
- public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
- Job job = Job.getInstance();
- job.setJobName("WordCount");
- job.setJarByClass(WordCount.class);
- job.setMapperClass(doMapper.class);
- job.setReducerClass(doReducer.class);
- job.setOutputKeyClass(Text.class);
- job.setOutputValueClass(IntWritable.class);
- Path in = new Path("hdfs://localhost:9000/mymapreduce1/in/buyer_favorite1");
- Path out = new Path("hdfs://localhost:9000/mymapreduce1/out");
- FileInputFormat.addInputPath(job, in);
- FileOutputFormat.setOutputPath(job, out);
- System.exit(job.waitForCompletion(true) ? 0 : 1);
- }
- public static class doMapper extends Mapper<Object, Text, Text, IntWritable>{
- public static final IntWritable one = new IntWritable(1);
- public static Text word = new Text();
- @Override
- protected void map(Object key, Text value, Context context)
- throws IOException, InterruptedException {
- StringTokenizer tokenizer = new StringTokenizer(value.toString(), "\t");
- word.set(tokenizer.nextToken());
- context.write(word, one);
- }
- }
- public static class doReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
- private IntWritable result = new IntWritable();
- @Override
- protected void reduce(Text key, Iterable<IntWritable> values, Context context)
- throws IOException, InterruptedException {
- int sum = 0;
- for (IntWritable value : values) {
- sum += value.get();
- }
- result.set(sum);
- context.write(key, result);
- }
- }
- }
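For reference, the logic of the three phases (map extracts the buyer id and emits 1; shuffle groups by key; reduce sums) can be simulated in a single local process. The following sketch (illustrative names, no Hadoop dependencies) is not how the job runs on the cluster, but it produces the same per-buyer counts:

```java
import java.util.*;

public class WordCountLocal {
    // Local, single-process simulation of the WordCount pipeline:
    // map (extract buyer id, emit 1), shuffle (group by key), reduce (sum).
    public static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String buyerId = new StringTokenizer(line, "\t").nextToken(); // map step
            counts.merge(buyerId, 1, Integer::sum);                       // shuffle + reduce step
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "10181\t1000481\t2010-04-04 16:54:31",
                "20001\t1001597\t2010-04-07 15:07:52",
                "20001\t1001560\t2010-04-07 15:08:27");
        System.out.println(count(sample)); // {10181=1, 20001=2}
    }
}
```

Applied to the full buyer_favorite1 sample, this yields the same table as the expected statistics above (e.g. 20054 => 6, 20056 => 12).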
10. In the WordCount class file, right-click => Run As => Run on Hadoop to submit the MapReduce task to Hadoop.
11. After execution finishes, view the output of the program on HDFS, either through the Hadoop Eclipse plugin or by opening a terminal.
View the part-r-00000 file under DFS Locations.