1 Data deduplication
1.1 Data deduplication
The task is to deduplicate the data in a data file, where each line of the file is one data record.
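For illustration, the same line-level deduplication can be checked locally with standard shell tools before writing any MapReduce code (the file name and sample records here are made up for the example):

```shell
# create a small sample data file (each line is one record; duplicates included)
printf '2012-3-1 a\n2012-3-2 b\n2012-3-1 a\n2012-3-3 c\n' > sample_in.txt

# sort -u keeps exactly one copy of every distinct line,
# which is the result the MapReduce job should produce
sort -u sample_in.txt
```

This prints the three distinct lines, each exactly once.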
1.2 Application scenarios
Seemingly complex tasks, such as counting the number of distinct values in a large data set or extracting the set of access locations from website logs, all involve data deduplication.
1.3 Design idea
The ultimate goal of data deduplication is that data appearing more than once in the input appears only once in the output. A natural approach is to route all records with the same value to the same reduce task: no matter how many times a value occurs, it is output exactly once in the final result. Concretely, the reduce input should use the data itself as the key, with no requirement on the value-list. When reduce receives a <key, value-list> pair, it copies the key directly to the output key and sets the output value to empty.
In the MapReduce process, the <key, value> pairs emitted by map are grouped into <key, value-list> pairs by the shuffle phase before being handed to reduce. Working backwards from the designed reduce input, the map output key must be the data itself, and the value can be arbitrary. Continuing backwards: with Hadoop's default job input format, each map call receives one line of the input file as its value, so the task of the map stage is simply to set the output key to that value and emit it directly (the output value is arbitrary). The map results are handed to reduce after the shuffle phase. The reduce stage does not care how many values each key has; it copies the input key to the output key and emits it (with the output value set to empty).
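Outside of Hadoop, the core of this design can be sketched in plain Java: grouping identical keys (what the shuffle does) and emitting each key once (what the reduce does) amounts to collecting the lines into a set. This is only an illustrative sketch of the logic, not part of the Hadoop job; the class and method names are made up for the example:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DedupSketch {
    // Simulates map (output key = line) + shuffle (group identical keys)
    // + reduce (emit each key once) for an in-memory list of lines.
    public static Set<String> dedup(List<String> lines) {
        // A LinkedHashSet keeps one copy of each line, mirroring the
        // <key, value-list> grouping where only the key is emitted
        return new LinkedHashSet<>(lines);
    }

    public static void main(String[] args) {
        List<String> input = List.of("a", "b", "a", "c", "b");
        System.out.println(dedup(input)); // prints [a, b, c]
    }
}
```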
1.4 Program code
The program code is as follows:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class Dedup {

    // map copies the input value to the output key and emits it directly
    public static class Map extends Mapper<Object, Text, Text, Text> {
        private static Text line = new Text(); // one line of data

        // implement the map function
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            line = value;
            context.write(line, new Text(""));
        }
    }

    // reduce copies the input key to the output key and emits it directly
    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        // implement the reduce function
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, new Text(""));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapred.job.tracker", "192.168.1.2:9001");

        String[] ioArgs = new String[] { "dedup_in", "dedup_out" };
        String[] otherArgs = new GenericOptionsParser(conf, ioArgs).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: Data Deduplication <in> <out>");
            System.exit(2);
        }

        Job job = new Job(conf, "Data Deduplication");
        job.setJarByClass(Dedup.class);

        // set the Map, Combine, and Reduce classes
        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);

        // set the output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // set the input and output directories
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}