[MapReduce] MapReduce Example: WordCount

Foreword

MapReduce takes a "divide and conquer" approach to operating on large data sets: the work is distributed to the nodes managed by a master node, and the intermediate results from each node are then merged to produce the final result. In short, MapReduce means "decompose the task and merge the results."

1. How MapReduce works

In distributed computing, the MapReduce framework takes care of the complex issues of parallel programming, such as distributed storage, job scheduling, load balancing, fault tolerance and network communication, and abstracts the processing into two phases, Map and Reduce. The Map phase is responsible for splitting a task into several subtasks, and the Reduce phase is responsible for combining the results of those subtasks. The specific design ideas are as follows.

(1) The Map phase requires a class that extends the Mapper class in the org.apache.hadoop.mapreduce package and overrides its map method. If you print the key and value arguments of the map method to the console, you can see that value holds one line of the input text file (lines are terminated by a carriage return) and key holds the offset of the first character of that line relative to the beginning of the file. Each line is then split into fields with a StringTokenizer, the field to be extracted (in this experiment, the buyer id) is set as the key, and the <key, value> pair is written out as the result of the map method.

(2) The Reduce phase requires a class that extends the Reducer class in the org.apache.hadoop.mapreduce package and overrides its reduce method. The <key, value> pairs produced by the Map phase first pass through the shuffle step, which groups all values that share the same key into a values list; in this case the list is made up of the counts emitted for one buyer id. The resulting <key, values> pair is then passed to the reduce method, which only needs to iterate over values and sum them to obtain the total count for that key.
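For example, buyer id 20001 appears in two records of the sample data used later in this experiment, so the data flow for that key is roughly:

map output:     <20001, 1>, <20001, 1>     (one pair per input record)
after shuffle:  <20001, [1, 1]>            (values with the same key are grouped together)
reduce output:  <20001, 2>                 (the values are summed)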

(3) The main function creates a new Job object; a MapReduce computing task is managed and run by this Job object, and the task's parameters are set through the Job's methods. In this experiment the class doMapper, which extends Mapper, is used to complete the Map phase, and the class doReducer, which extends Reducer, is used to complete the Reduce phase. The output types of the Map and Reduce phases are also set here: the key type is Text and the value type is IntWritable. The input and output paths of the task are specified as strings and set through FileInputFormat and FileOutputFormat respectively. Once the parameters are set, job.waitForCompletion() is called to run the task, and the rest of the work is handed over to the MapReduce framework.

2. How a job runs in the MapReduce framework (YARN)

(1) ResourceManager: the central module of YARN's resource control framework, responsible for the unified management and allocation of all resources in the cluster. It receives reports from the NodeManagers (NM), starts the ApplicationMaster (AM), and hands resources to the AM.

(2) NodeManager: NM for short. The NodeManager is the ResourceManager's agent on each machine; it manages containers, monitors their resource usage (CPU, memory, disk, network, etc.) and reports that usage back to the ResourceManager.

(3) ApplicationMaster: AM for short. Each application in YARN starts an AM, which is responsible for requesting resources from the RM, asking the NMs to start Containers, and telling those Containers what work to do.

(4) Container: the resource container. Every application in YARN runs inside Containers, including the AM itself, except that the AM's Container is requested by the RM. A Container is YARN's abstraction of resources: it encapsulates a certain amount of resources on a node (currently CPU and memory). Containers are requested from the ResourceManager by the ApplicationMaster and are allocated to the ApplicationMaster asynchronously by the resource scheduler inside the ResourceManager. A Container is launched by the ApplicationMaster contacting the NodeManager on the node that owns the resources; when a Container is launched it must be given the command for the task it will execute internally (this can be any command, for example a command that starts a java, Python or C++ process) together with the environment variables and external resources the command needs (such as dictionary files, executables, jar packages, etc.).

In addition, the Containers an application needs fall into two categories, as follows:

① The Container that runs the ApplicationMaster: it is requested and started by the ResourceManager (through its internal resource scheduler); when the user submits the application, only the resources required by the ApplicationMaster can be specified.

② The Containers that run the application's various tasks: these are requested from the ResourceManager by the ApplicationMaster, which then communicates with the NodeManagers to start them.

Containers of both categories may be located on any node, and their placement is generally random, which means the ApplicationMaster may end up running on the same node as the tasks it manages.

3. Experimental environment

Ubuntu Linux 14.0

jdk-7u75-linux-x64

hadoop-2.6.0-cdh5.4.5

hadoop-2.6.0-eclipse-cdh5.4.5.jar

eclipse-java-juno-SR2-linux-gtk-x86_64

4. Experiment content

An e-commerce website has a data set of items favorited by its users, which records the buyer id, the item id and the date the item was favorited; the data set is named buyer_favorite1.

buyer_favorite1 contains three fields: buyer id, item id and favorite date, separated by "\t". The sample data format is as follows:
buyer id    item id    favorite date
10181    1000481    2010-04-04 16:54:31
20001    1001597    2010-04-07 15:07:52
20001    1001560    2010-04-07 15:08:27
20042    1001368    2010-04-08 08:20:30
20067    1002061    2010-04-08 16:45:33
20056    1003289    2010-04-12 10:50:55
20056    1003290    2010-04-12 11:57:35
20056    1003292    2010-04-12 12:05:29
20054    1002420    2010-04-14 15:24:12
20055    1001679    2010-04-14 19:46:04
20054    1010675    2010-04-14 15:23:53
20054    1002429    2010-04-14 17:52:45
20076    1002427    2010-04-14 19:35:39
20054    1003326    2010-04-20 12:54:44
20056    1002420    2010-04-15 11:24:49
20064    1002422    2010-04-15 11:35:54
20056    1003066    2010-04-15 11:43:01
20056    1003055    2010-04-15 11:43:06
20056    1010183    2010-04-15 11:45:24
20056    1002422    2010-04-15 11:45:49
20056    1003100    2010-04-15 11:45:54
20056    1003094    2010-04-15 11:45:57
20056    1003064    2010-04-15 11:46:04
20056    1010178    2010-04-15 16:15:20
20076    1003101    2010-04-15 16:37:27
20076    1003103    2010-04-15 16:37:05
20076    1003100    2010-04-15 16:37:18
20076    1003066    2010-04-15 16:37:31
20054    1003103    2010-04-15 16:40:14
20054    1003100    2010-04-15 16:40:16
The task is to write a MapReduce program that counts the number of favorited items for each buyer.

The resulting statistics are as follows:

buyer id    number of items
10181    1
20001    2
20042    1
20054    6
20055    1
20056    12
20064    1
20067    1
20076    5

5. Experimental procedure

1. Change directory to /apps/hadoop/sbin and start Hadoop.

cd /apps/hadoop/sbin
./start-all.sh
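Optionally, you can check that the Hadoop daemons came up using the JDK's jps command; on a typical pseudo-distributed setup you would expect to see processes such as NameNode, DataNode, ResourceManager and NodeManager listed.

jps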
2. On Linux, create the directory /data/mapreduce1.

mkdir -p /data/mapreduce1
3. Switch to the /data/mapreduce1 directory and use the wget command to download the text file buyer_favorite1 from http://59.74.172.143:60000/allfiles/mapreduce1/buyer_favorite1.

cd /data/mapreduce1
wget http://59.74.172.143:60000/allfiles/mapreduce1/buyer_favorite1
Still in the /data/mapreduce1 directory, use the wget command to download the project's dependency jars from http://59.74.172.143:60000/allfiles/mapreduce1/hadoop2lib.tar.gz.

wget http://59.74.172.143:60000/allfiles/mapreduce1/hadoop2lib.tar.gz
Extract hadoop2lib.tar.gz into the current directory.

tar -xzvf hadoop2lib.tar.gz
4. Upload the local Linux file /data/mapreduce1/buyer_favorite1 to the /mymapreduce1/in directory on HDFS. If the HDFS directory does not exist, create it first.

hadoop fs -mkdir -p /mymapreduce1/in
hadoop fs -put /data/mapreduce1/buyer_favorite1 /mymapreduce1/in
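Optionally, verify that the upload succeeded by listing the HDFS directory:

hadoop fs -ls /mymapreduce1/in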
5. Open Eclipse and create a new Java Project.

Name the project mapreduce1.

6. Under the mapreduce1 project, create a new package.

Name the package mapreduce.

7. In the newly created mapreduce package, create a new class.

Name the class WordCount.

8. Add the jar packages the project depends on: right-click the project name and create a new folder named hadoop2lib, which will hold the jar packages the project needs.

Copy all the jar packages from the hadoop2lib directory under /data/mapreduce1 on Linux into the hadoop2lib folder of the mapreduce1 project in Eclipse.

Select all the jar packages under the hadoop2lib folder, right-click and choose Build Path => Add to Build Path.

9. Write the Java code and describe the design ideas.

(Figure: execution flow of a MapReduce job.)

The general idea is to take the text stored on HDFS as input. MapReduce splits the text through an InputFormat, using the offset of the first character of each line relative to the beginning of the file as the key of the input key-value pair and the content of the line as its value. The map function processes each pair and outputs intermediate results of the form <word, 1>, and the reduce function completes the frequency count for each word. The program consists of two main parts: the Mapper part and the Reducer part.
Mapper code:

public static class doMapper extends Mapper<Object, Text, Text, IntWritable> {
    // The four type parameters are: input key type (Object), input value type (Text),
    // output key type (Text) and output value type (IntWritable).
    public static final IntWritable one = new IntWritable(1);
    public static Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // StringTokenizer is a class in the Java utility package used to split a string into tokens
        StringTokenizer tokenizer = new StringTokenizer(value.toString(), "\t");
        // nextToken() returns the substring from the current position up to the next delimiter,
        // i.e. the first tab-separated field of the line: the buyer id
        word.set(tokenizer.nextToken());
        // write the buyer id with a count of 1
        context.write(word, one);
    }
}

The map function has three parameters. The first two, Object key and Text value, are the input key and value; the third, Context context, records the output key and value, as in context.write(word, one); context also records the state of the map computation. The map stage uses Hadoop's default job input format; the buyer id field cut out of the input value with StringTokenizer is set as the key, the value is set to 1, and the <key, value> pair is output directly.
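As a small standalone illustration (this class is not part of the experiment's code, and the name TokenizeDemo is made up for this sketch), the following shows what the tokenizing step extracts from the first record of buyer_favorite1:

import java.util.StringTokenizer;

public class TokenizeDemo {
    public static void main(String[] args) {
        // one line of buyer_favorite1: buyer id, item id, favorite date, separated by tabs
        String line = "10181\t1000481\t2010-04-04 16:54:31";
        StringTokenizer tokenizer = new StringTokenizer(line, "\t");
        // nextToken() returns the first tab-separated field, i.e. the buyer id
        System.out.println(tokenizer.nextToken());   // prints 10181
    }
}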

Reducer code:

public static class doReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // As in the Mapper, the type parameters are: input key type, input value type,
    // output key type and output value type.
    private IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // iterate over the grouped values with a for loop and accumulate the counts
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

The <key, value> pairs output by map first go through the shuffle process, which gathers all values with the same key into <key, values> and hands them to the reduce side. After receiving <key, values>, the reduce side copies the input key directly to the output key, iterates over values with a for loop and sums them; the sum is the total number of occurrences of the word that the key represents. This sum is set as the value, and <key, value> is output directly.

Complete code:

package mapreduce;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance();
        job.setJobName("WordCount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(doMapper.class);
        job.setReducerClass(doReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        Path in = new Path("hdfs://localhost:9000/mymapreduce1/in/buyer_favorite1");
        Path out = new Path("hdfs://localhost:9000/mymapreduce1/out");
        FileInputFormat.addInputPath(job, in);
        FileOutputFormat.setOutputPath(job, out);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    public static class doMapper extends Mapper<Object, Text, Text, IntWritable> {
        public static final IntWritable one = new IntWritable(1);
        public static Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString(), "\t");
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }

    public static class doReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
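Note that FileOutputFormat requires the output directory to not exist before the job starts, so if you rerun the job you would first delete the previous output, for example:

hadoop fs -rm -r /mymapreduce1/out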

10. In the WordCount class file, right-click => Run As => Run on Hadoop to submit the MapReduce job to Hadoop.

11. After the job finishes, open a terminal (or use the hadoop-eclipse plugin) and view the program's output on HDFS.

hadoop fs -ls /mymapreduce1/out
hadoop fs -cat /mymapreduce1/out/part-r-00000
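The contents of part-r-00000 should match the statistics listed in the experiment content above, i.e.:

10181	1
20001	2
20042	1
20054	6
20055	1
20056	12
20064	1
20067	1
20076	5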


Source: blog.csdn.net/weixin_44039347/article/details/91465584