MapReduce for data deduplication

Original link:

https://www.toutiao.com/i6764933201203823107/

Concept: "data deduplication" is mainly an exercise in using parallelism to perform meaningful filtering of data. Seemingly complex tasks over large data sets, such as counting the number of distinct kinds of data or computing access locations from website logs, all involve data deduplication.

The ultimate goal of data deduplication is that any record appearing more than once in the raw data appears only once in the output file. In the MapReduce process, the <key, value> pairs emitted by map are gathered into <key, value-list> pairs by the shuffle phase and handed to reduce. This naturally suggests sending all records with the same data to a single reduce machine: no matter how many times a record appears, it only has to be output once in the final result. Concretely, the reduce input should use the record itself as the key, while the value-list does not matter (it can be set to empty). When reduce receives a <key, value-list>, it copies the key directly to the output key, sets the value to empty, and emits <key, value>.
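For example (with a hypothetical record): if the line "10001 user1 apple 2019-12-01" occurs three times in the input, map emits <"10001 user1 apple 2019-12-01", null> three times, shuffle gathers the three pairs into a single <key, value-list>, and reduce writes the key exactly once.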

Suppose our data source is the following:
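As a hypothetical illustration (the original sample is an image; these values are assumptions in the format trade id, user, product, date):

10001  user1  apple   2019-12-01
10001  user1  apple   2019-12-01
10002  user2  banana  2019-12-02
10003  user1  orange  2019-12-03
10003  user1  orange  2019-12-03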

 

Objective: write a MapReduce program that deduplicates the records by trade id, in order to work out which products each user bought.

First, let's write some Java code to generate mock data.

Create a project; the package and class structure is as follows:
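A plausible layout for the generator project (the original screenshot is not recoverable, so these class names are assumptions; the sketches below use them):

RandomNum.java     - random number helper
RandomDate.java    - random date helper
FileIO.java        - file-writing utility
GenerateData.java  - entry point that writes data.txt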

 

Generate a random number
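A minimal sketch of such a helper (class and method names are assumptions, since the original code was a screenshot):

import java.util.Random;

public class RandomNum {
    private static final Random RANDOM = new Random();

    // Returns a random integer in [min, max], both ends inclusive.
    public static int randomInt(int min, int max) {
        return RANDOM.nextInt(max - min + 1) + min;
    }
}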

 

Generate a random date
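A sketch along the same lines, returning an ISO yyyy-MM-dd string (again, the names are assumptions):

import java.time.LocalDate;
import java.util.Random;

public class RandomDate {
    private static final Random RANDOM = new Random();

    // Returns a random date in [start, end) as an ISO yyyy-MM-dd string.
    public static String randomDate(LocalDate start, LocalDate end) {
        long days = end.toEpochDay() - start.toEpochDay();
        return start.plusDays(RANDOM.nextInt((int) days)).toString();
    }
}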

 

Write the IO utility
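A minimal file-writing utility in the same spirit (the original is a screenshot; this sketch is an assumption):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;

public class FileIO {
    // Writes the given records to a text file, one per line.
    public static void writeLines(String path, List<String> lines) throws IOException {
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(path))) {
            for (String line : lines) {
                writer.write(line);
                writer.newLine();
            }
        }
    }
}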

 

Write the data-generation code
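A sketch of the generator that ties the helpers together (the user and product lists, the record count, and the deliberately narrow trade-id range, chosen so that duplicates actually occur, are all assumptions):

import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class GenerateData {
    public static void main(String[] args) throws Exception {
        String[] users = {"user1", "user2", "user3"};
        String[] products = {"apple", "banana", "orange"};
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            // Narrow id range on purpose so that duplicate records occur.
            int tradeId = RandomNum.randomInt(10000, 10099);
            String user = users[RandomNum.randomInt(0, users.length - 1)];
            String product = products[RandomNum.randomInt(0, products.length - 1)];
            String date = RandomDate.randomDate(LocalDate.of(2019, 12, 1),
                                                LocalDate.of(2020, 1, 1));
            lines.add(tradeId + "\t" + user + "\t" + product + "\t" + date);
        }
        FileIO.writeLines("data.txt", lines);
    }
}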

 

Data generation
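Running GenerateData writes data.txt into the working directory; with the hypothetical generator above, its lines look like 10042  user2  banana  2019-12-17, with values differing from run to run.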

 

Maven project

 

Pom configuration file
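A minimal pom.xml in this spirit (the coordinates and the Hadoop version are assumptions; finalName is set so the build produces the removal-client.jar used below, and the Hadoop version should match your cluster):

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.xlglvc.xxx</groupId>
    <artifactId>removal-client</artifactId>
    <version>1.0</version>

    <dependencies>
        <!-- Hadoop client libraries; 2.7.3 is only an example version -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.3</version>
        </dependency>
    </dependencies>

    <build>
        <!-- Name the jar removal-client.jar, matching the yarn command below -->
        <finalName>removal-client</finalName>
    </build>
</project>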

 

 

 

Create the data deduplication class
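A sketch of the job driver. The package and class name follow the yarn command below; RemovalMapper and RemovalReducer are assumed names, filled in in the next step:

package com.xlglvc.xxx.mapredece.removal;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Removal {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "data deduplication");
        job.setJarByClass(Removal.class);
        job.setMapperClass(RemovalMapper.class);
        job.setReducerClass(RemovalReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. /removalinput/data.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // e.g. /removaloutput
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}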

 

Write Map and Reduce
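Sketches of the two classes (same package as the driver). The mapper keys on the whole line, so exact duplicate records collapse in the shuffle; to deduplicate by trade id only, it could instead split the line and emit just the id field.

RemovalMapper.java:

package com.xlglvc.xxx.mapredece.removal;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RemovalMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The whole record is the key; the value carries no information.
        context.write(value, NullWritable.get());
    }
}

RemovalReducer.java:

package com.xlglvc.xxx.mapredece.removal;

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class RemovalReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // Emit each distinct record exactly once, ignoring the value list.
        context.write(key, NullWritable.get());
    }
}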

 

 

Package the project
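With the pom sketched above, packaging and copying the jar into place might look like this (the /data/removal directory matches the yarn command below):

mvn clean package
cp target/removal-client.jar /data/removal/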

 

Start Hadoop
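On a typical installation, HDFS and YARN are started with the scripts shipped in Hadoop's sbin directory:

start-dfs.sh
start-yarn.sh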

 

Upload data

 

Put the data into HDFS
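A typical sequence (the local location of data.txt is an assumption; the HDFS paths match the yarn command below):

hdfs dfs -mkdir -p /removalinput
hdfs dfs -put data.txt /removalinput/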

 

Run the jar package (the arguments are the main class, the HDFS input file, and the output directory):

yarn jar /data/removal/removal-client.jar com.xlglvc.xxx.mapredece.removal.Removal /removalinput/data.txt /removaloutput

 

Finally, we view the results.
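Assuming a single reducer, the deduplicated records land in one part file, which can be printed with:

hdfs dfs -cat /removaloutput/part-r-00000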

 

 

Exercise: can we write a MapReduce job that counts how many records the data contains?
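One possible sketch (all names are illustrative): a word-count-style job in which the mapper emits a constant key with a count of 1 for every line and the reducer sums the counts.

CountMapper.java:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final Text KEY = new Text("records");
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(KEY, ONE);   // one count per input line
    }
}

CountReducer.java:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();        // add up all the ones
        }
        context.write(key, new LongWritable(sum));
    }
}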

 


Source: www.cnblogs.com/bqwzy/p/12528462.html