Data deduplication with MapReduce

1  Data deduplication

1.1  Data deduplication

   The task is to deduplicate the data in a data file, where each line of the file is one piece of data.

1.2  Application scenarios

Seemingly complex tasks, such as counting the number of distinct data types in a large data set or extracting the set of access locations from website logs, all involve data deduplication.

1.3  Design ideas

The ultimate goal of data deduplication is that data appearing more than once in the input appears only once in the output file. This naturally suggests handing all records with the same data over to a single reduce task: no matter how many times the data appears, it is output only once in the final result. Concretely, the input of reduce should use the data itself as the key, with no requirement on the value-list. When reduce receives a <key, value-list>, it simply copies the key to the output key and sets the output value to an empty value.

  In MapReduce, the <key, value> pairs output by map are aggregated into <key, value-list> pairs by the shuffle phase and then handed to reduce. From the reduce input designed above, we can work backwards: the output key of map should be the data itself, and the value can be arbitrary. In this example, each piece of data is one line of the input file, and with Hadoop's default job input format, map receives each line as its input value. The map stage therefore only needs to copy that value to its output key and emit it directly (the output value can be anything). The map results reach reduce after the shuffle phase. The reduce stage does not care how many values each key has; it simply copies the input key to the output key and emits it (with the output value set to empty).
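The reasoning above can be illustrated outside Hadoop: making every line a map-output key, letting the shuffle group equal keys, and having reduce emit each key once is equivalent to inserting every line into a set. A minimal sketch in plain Java (the class name `DedupSketch` and the sample lines are hypothetical, not part of the Hadoop program below):

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DedupSketch {
    // Mirrors map -> shuffle -> reduce: every line becomes a key,
    // the shuffle groups equal keys, and reduce emits each key once.
    static Set<String> dedup(List<String> lines) {
        return new LinkedHashSet<>(lines);
    }

    public static void main(String[] args) {
        List<String> input = List.of("a 1", "b 2", "a 1", "c 3", "b 2");
        System.out.println(dedup(input)); // [a 1, b 2, c 3]
    }
}
```

The MapReduce program in the next section implements exactly this collapse, except that the grouping is done by the framework's shuffle across many machines rather than by an in-memory set.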

1.4  Program code

     The program code is as follows:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class Dedup {

    // map copies the input value to the output key and emits it directly
    public static class Map extends Mapper<Object, Text, Text, Text> {

        private Text line = new Text(); // one line of data

        // implement the map function
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            line = value;
            context.write(line, new Text(""));
        }
    }

    // reduce copies the input key to the output key and emits it directly
    public static class Reduce extends Reducer<Text, Text, Text, Text> {

        // implement the reduce function
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, new Text(""));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapred.job.tracker", "192.168.1.2:9001");

        String[] ioArgs = new String[] { "dedup_in", "dedup_out" };
        String[] otherArgs = new GenericOptionsParser(conf, ioArgs).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: Data Deduplication <in> <out>");
            System.exit(2);
        }

        Job job = new Job(conf, "Data Deduplication");
        job.setJarByClass(Dedup.class);

        // set the Mapper, Combiner, and Reducer classes
        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);

        // set the output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // set the input and output directories
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
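Note that the job reuses the Reduce class as the combiner (job.setCombinerClass(Reduce.class)). This is safe because deduplication is idempotent: collapsing duplicate keys on the map side before the shuffle, and then collapsing again in the reducer, yields the same result as collapsing only once. A small sketch of this property in plain Java (the class name `CombinerIdempotence` and the sample keys are hypothetical):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class CombinerIdempotence {
    // The same key-collapsing step, used both as combiner and as reducer.
    static List<String> collapse(List<String> keys) {
        return new ArrayList<>(new LinkedHashSet<>(keys));
    }

    public static void main(String[] args) {
        List<String> mapOutput = List.of("a", "b", "a", "c", "b");
        // With combiner: collapse locally, then collapse again after shuffle.
        List<String> withCombiner = collapse(collapse(mapOutput));
        // Without combiner: collapse only once, in the reducer.
        List<String> withoutCombiner = collapse(mapOutput);
        System.out.println(withCombiner.equals(withoutCombiner)); // true
    }
}
```

The combiner does not change correctness, but it reduces the amount of duplicate data shuffled across the network, which is exactly what combiners are for.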

 

 
