A working walkthrough: MapReduce WordCount-style frequency counting, with the results arranged in descending order of the number of occurrences

Project overall introduction

Count word frequencies, as in the classic WordCount example, and arrange the results in descending order of the number of occurrences.

There are many posts online that take a similar approach: override some method and then... they often throw errors when you actually run them. Here is a solution shared for your reference: write two MapReduce programs. The first performs the word-frequency count, and the second sorts those results in descending order. Because the sort is descending, you also need to define a custom key object that implements the descending comparison itself.

1. Project background and data set description

There is an existing dataset of products collected (favorited) by users of an e-commerce website, named buyer_favorite1, which records the product id and the date of each user's collection. buyer_favorite1 contains three fields: buyer id, product id, and collection date, separated by "\t". (The original post shows a screenshot of sample rows here; a hypothetical illustration of the format follows.)
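For illustration only, a few hypothetical rows in the format described above (tab-separated buyer id, product id, collection date); the actual values in the dataset will differ:

10181	1000481	2010-04-04 16:54:31
20001	1001597	2010-04-07 15:07:52
20001	1001560	2010-04-07 15:08:27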

2. Write a MapReduce program to count the number of items collected by each buyer (that is, count the number of occurrences of each buyer id)

Preliminary notes

1. Configure the Hadoop cluster environment and start the corresponding services.
2. First upload the input file to the corresponding HDFS path; the path is up to you, and here it is "hdfs://localhost:9000/mymapreduce1/in/buyer_favorite1". Define an output path at the same time (the commands after this list show one way to do the upload).
3. This class is the entry point of the whole program (word-frequency count followed by the descending sort). If you only want the word-frequency count, comment out the call to WordCountSortDESC.mainJob2().
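A minimal sketch of the upload step, assuming the local file is named buyer_favorite1 in the current directory and HDFS is reachable at hdfs://localhost:9000 (adjust the paths to your own setup). Note that the output directory should not be created in advance: FileOutputFormat creates it and fails if it already exists.

hadoop fs -mkdir -p /mymapreduce1/in
hadoop fs -put buyer_favorite1 /mymapreduce1/in/buyer_favorite1
hadoop fs -ls /mymapreduce1/in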

package mapreduce;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

	public static void main(String[] args) {
		Configuration conf = new Configuration();
		// set the ResourceManager host name (adjust to your own environment)
		conf.set("yarn.resourcemanager.hostname", "bym@d2e674ec1e78");
		try {
			Job job = Job.getInstance(conf, "111");
			job.setJobName("WordCount");
			job.setJarByClass(WordCount.class);
			job.setMapperClass(doMapper.class); // use the doMapper class defined below
			job.setReducerClass(doReducer.class); // likewise, set the Reducer class

			job.setMapOutputKeyClass(Text.class); // if the map output types differ from the reduce output types, declare both sets here
			job.setMapOutputValueClass(IntWritable.class);
			job.setOutputKeyClass(Text.class);
			job.setOutputValueClass(Text.class);

			Path in = new Path(
					"hdfs://localhost:9000/mymapreduce1/in/buyer_favorite1");
			Path out = new Path("hdfs://localhost:9000/mymapreduce1/out");
			FileInputFormat.addInputPath(job, in);
			FileOutputFormat.setOutputPath(job, out);
			if (job.waitForCompletion(true)) {
				System.out.println("WordCount completed");
				// chain the second job: sort the counts in descending order
				WordCountSortDESC.mainJob2();
				System.out.println("WordCountSortDESC invoked");
			}
		} catch (Exception e) {
			e.printStackTrace();
		}

		// System.exit(job.waitForCompletion(true) ? 0 : 1);
	}

	// The first type parameter (LongWritable) is the input key type: the offset of the
	// line's first character from the start of the file;
	// the second (Text) is the input value type: one line of the file (newline-terminated);
	// the third (Text) is the output key type; the fourth (IntWritable) is the output value type.
	public static class doMapper extends
			Mapper<LongWritable, Text, Text, IntWritable> {

		public static final IntWritable one = new IntWritable(1);
		public static Text word = new Text();

		@Override
		// the first two parameters are the input key and value; the third, Context,
		// is used to write out the resulting key/value pairs
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {

			// StringTokenizer is a JDK utility class that splits a string on the given delimiter
			StringTokenizer tokenizer = new StringTokenizer(value.toString(),
					"\t");
			// take the substring up to the next delimiter (the buyer id) and store it as Text
			word.set(tokenizer.nextToken());
			context.write(word, one);
		}
	}

	// the type parameters are, in order: input key type, input value type, output key type, output value type
	public static class doReducer extends
			Reducer<Text, IntWritable, Text, Text> {

		@Override
		// the input is a key plus the group of values merged for that key; results are written through the Context
		protected void reduce(Text key, Iterable<IntWritable> values,
				Context context) throws IOException, InterruptedException {
			int sum = 0;
			for (IntWritable value : values) {
				sum += value.get();
			}
			context.write(key, new Text(Integer.toString(sum)));
		}
	}
}
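A minimal sketch of packaging and running the first job from the command line; the jar name and the bin/ directory of compiled classes are assumptions for illustration, not taken from the original post. Because main() chains WordCountSortDESC.mainJob2(), this single command runs both jobs; the last line lets you inspect the intermediate counts produced by the first job.

jar -cvf mapreduce.jar -C bin/ .
hadoop jar mapreduce.jar mapreduce.WordCount
hadoop fs -cat /mymapreduce1/out/part-r-00000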

3. The core problem: write a second MapReduce program that sorts the results of the previous step in descending order

Preliminary notes

1. This second MapReduce job takes the statistical results of the previous step as its input, so make sure its input path matches the output path of the previous job.
2. Because the sort is descending, we define a custom FlowBean object and implement the sorting logic inside it. Ascending order could simply rely on the default sorting done by the shuffle mechanism and would not need a custom object, so that case is not described here. (A different way to get a descending order, using a sort comparator instead of a custom bean, is sketched after this list.)
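As an aside (not the approach used in this post): if the map output key were a plain LongWritable count, descending order could also be obtained with Hadoop's built-in decreasing comparator instead of a custom bean. A minimal sketch of that alternative job wiring, with an illustrative class name; a mapper emitting (count, buyer id) and a reducer swapping them back would still be needed:

package mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class DescSortAlt {
	public static Job configure(Configuration conf) throws Exception {
		// the map output key is the count itself; the built-in DecreasingComparator
		// reverses the shuffle's default ascending sort on LongWritable keys
		Job job = Job.getInstance(conf, "WordCountSortDESC-alt");
		job.setJarByClass(DescSortAlt.class);
		job.setMapOutputKeyClass(LongWritable.class);
		job.setMapOutputValueClass(Text.class);
		job.setSortComparatorClass(LongWritable.DecreasingComparator.class);
		return job;
	}
}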

package mapreduce;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSortDESC {

	public static void mainJob2() {
		Configuration conf = new Configuration();
		// set the ResourceManager host name (adjust to your own environment)
		conf.set("yarn.resourcemanager.hostname", "bym@d2e674ec1e78");
		try {
			Job job = Job.getInstance(conf, "1111");
			job.setJobName("WordCountSortDESC");
			job.setJarByClass(WordCountSortDESC.class);
			job.setMapperClass(TwoMapper.class); // use the TwoMapper class defined below
			job.setReducerClass(TwoReducer.class); // likewise, set the Reducer class

			// FlowBean is the map output key, so the shuffle sorts records by count
			job.setMapOutputKeyClass(FlowBean.class);
			job.setMapOutputValueClass(Text.class);

			job.setOutputKeyClass(Text.class);
			job.setOutputValueClass(FlowBean.class);

			Path in = new Path("hdfs://localhost:9000/mymapreduce1/out");
			Path out = new Path("hdfs://localhost:9000/mymapreduce1/out555");
			FileInputFormat.addInputPath(job, in);
			FileOutputFormat.setOutputPath(job, out);
			if (job.waitForCompletion(true)) {
				System.out.println("DESC Really Done");
			}
		} catch (Exception e) {
			System.out.println("errormainJob2-----------");
			e.printStackTrace();
		}
	}

	public static class TwoMapper extends Mapper<Object, Text, FlowBean, Text> {

		private FlowBean outK = new FlowBean();
		private Text outV = new Text();

		@Override
		protected void map(Object key, Text value, Context context)
				throws IOException, InterruptedException {
			// The real data is stored in file blocks; because this data set is small,
			// it is safe to assume it all sits in the first job's single output file.
			FileSplit fs = (FileSplit) context.getInputSplit();
			if (fs.getPath().getName().contains("part-r-00000")) {

				// 1. read one line
				String line = value.toString();

				// 2. split the line on "\t"
				String[] split = line.split("\t");

				// 3. populate outK (the count) and outV (the buyer id)
				outK.setNumber(Long.parseLong(split[1]));
				outV.set(split[0]);

				// 4. emit outK, outV
				context.write(outK, outV);
			} else {
				System.out.println("error-part-r-------------------");
			}
		}
	}

	public static class TwoReducer extends
			Reducer<FlowBean, Text, Text, FlowBean> {

		@Override
		protected void reduce(FlowBean key, Iterable<Text> values,
				Context context) throws IOException, InterruptedException {
			// iterate over the whole group so that buyers sharing the same count are all written out
			for (Text value : values) {
				// swap key and value so each output line is "buyer id <tab> count"
				context.write(value, key);
			}
		}
	}


	public static class FlowBean implements WritableComparable<FlowBean> {

		private long number;

		// no-argument constructor, required by Hadoop for deserialization
		public FlowBean() {
		}

		public long getNumber() {
			return number;
		}

		public void setNumber(long number) {
			this.number = number;
		}

		// serialization and deserialization; the field order must be identical in both methods
		@Override
		public void write(DataOutput out) throws IOException {
			out.writeLong(this.number);
		}

		@Override
		public void readFields(DataInput in) throws IOException {
			this.number = in.readLong();
		}

		@Override
		public String toString() {
			return number + "\t";
		}

		@Override
		public int compareTo(FlowBean o) {
			// compare by count, larger values first, so keys are sorted in descending order
			if (this.number > o.number) {
				return -1;
			} else if (this.number < o.number) {
				return 1;
			} else {
				return 0;
			}
		}
	}

}
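One practical note, not from the original post: FileOutputFormat fails the job if the output directory already exists, so re-running either job requires removing its previous output first (assuming you no longer need it):

hadoop fs -rm -r /mymapreduce1/out /mymapreduce1/out555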

4. Results display:

Run the following command to view the output file:

hadoop fs -cat /mymapreduce1/out555/part-r-00000
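Each output line has the form "buyer id <tab> count", sorted by count from largest to smallest. A hypothetical illustration of the shape of the output (the ids and counts are made up, not from the actual dataset):

10861	8
20001	6
10181	5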

(The original post shows a screenshot of the output here.)
You can see that the results are indeed in descending order; other data sets should behave similarly.

Origin blog.csdn.net/weixin_52323239/article/details/132008331