Original link:
https://www.toutiao.com/i6764933201203823107/
Concept: "Data deduplication" primarily to learn and to perform meaningful data using parallel screening of thought. The number of types of statistical data on large data sets, such as computing access to these seemingly complex tasks are involved from the site log data deduplication.
The ultimate goal of data deduplication is that any record appearing more than once in the raw data appears only once in the output file. In the MapReduce flow, the <key, value> pairs emitted by map are gathered by the shuffle phase into <key, value-list> pairs and handed to reduce. The natural idea is to send all records with the same data to the same reduce machine: no matter how many times the data appears, it is written only once to the final result. Concretely, use the data to be deduplicated as the key of the reduce input, while the value-list is not needed (it can be set to null). When reduce receives a <key, value-list>, it copies the key directly to the output key, sets the output value to null, and emits <key, value>. A sketch of the corresponding code appears in the "Write Map and Reduce" step below.
Suppose our data source is as follows:
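The original article shows the sample data only as a screenshot. As a hypothetical stand-in that matches the generator written below (each line a tab-separated date and product id), it might look like this, where the repeated line should survive only once after deduplication:

2019-11-18	1005
2019-11-18	1005
2019-11-19	1003
2019-11-19	1005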
Objective: write a MapReduce program that deduplicates the records by product id, i.e., reports which distinct products appear in the user data.
Let's prepare by writing Java code to generate simulated data.
Create a project; the package and class structure is as follows.
Generate a random number
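The original shows this step as a screenshot. A minimal sketch of what the helper might look like; the class name DataUtils, the method name, and the id range are assumptions:

import java.util.Random;

public class DataUtils {

    private static final Random RANDOM = new Random();

    // Random product id in [1000, 1999]; the small range guarantees duplicates.
    public static int getRandomNumber() {
        return 1000 + RANDOM.nextInt(1000);
    }
}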
Generate a random date
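A matching helper, added to the same DataUtils class; the date range and format are assumptions:

// requires: import java.text.ParseException; import java.text.SimpleDateFormat;
//           import java.util.Date;
public static String getRandomDate() {
    try {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        long start = fmt.parse("2019-01-01").getTime();
        long end = fmt.parse("2019-12-31").getTime();
        // pick a uniformly random instant between the two bounds
        long ts = start + (long) (RANDOM.nextDouble() * (end - start));
        return fmt.format(new Date(ts));
    } catch (ParseException e) {
        throw new RuntimeException(e);
    }
}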
Write the IO code
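A simple line writer, also in DataUtils:

// requires: import java.io.BufferedWriter; import java.io.FileWriter;
//           import java.io.IOException; import java.util.List;
public static void writeFile(String path, List<String> lines) throws IOException {
    try (BufferedWriter out = new BufferedWriter(new FileWriter(path))) {
        for (String line : lines) {
            out.write(line);
            out.newLine();
        }
    }
}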
Write the generation code
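A main method, also in DataUtils, that ties the helpers together; the record count of 1000, the tab-separated layout, and the local output path /data/removal/data.txt are assumptions (the directory matches where the jar is later placed):

// requires: import java.util.ArrayList;
public static void main(String[] args) throws IOException {
    List<String> lines = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
        // tab-separated date and product id; repeated values ensure duplicates exist
        lines.add(getRandomDate() + "\t" + getRandomNumber());
    }
    writeFile("/data/removal/data.txt", lines);
}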
Run the program to generate the data
Maven project
Pom configuration file
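The original shows the pom.xml as a screenshot. At a minimum it needs the Hadoop client dependency; the version here is an assumption and should match your cluster:

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <!-- the version is an assumption; match it to your cluster -->
        <version>2.7.3</version>
    </dependency>
</dependencies>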
Create the data-deduplication class
Write Map and Reduce
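The original presents the code only as screenshots. Below is a minimal sketch; the package and class name (com.xlglvc.xxx.mapredece.removal.Removal) are taken from the yarn command further down, while the inner class names are assumptions. Following the idea described above, the mapper emits the whole input line as the key with a NullWritable value, and the reducer writes each distinct key exactly once (to deduplicate by product id only, the mapper would instead split the line and emit just that field):

package com.xlglvc.xxx.mapredece.removal;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Removal {

    // Map: emit the whole line as the key; the value carries no information.
    public static class RemovalMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, NullWritable.get());
        }
    }

    // Reduce: identical lines arrive grouped under one key; write the key once.
    public static class RemovalReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "removal");
        job.setJarByClass(Removal.class);
        job.setMapperClass(RemovalMapper.class);
        job.setReducerClass(RemovalReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}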
Package the project
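Packaging is a standard Maven step; how the resulting jar ends up named removal-client.jar and copied to /data/removal/ on the server depends on your build configuration and environment:

mvn clean package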
Start Hadoop
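Assuming the stock scripts in $HADOOP_HOME/sbin, HDFS and YARN are started with:

start-dfs.sh
start-yarn.sh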
Upload the data to HDFS
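With the standard HDFS shell; the input directory /removalinput matches the yarn command below, and the local file path is an assumption:

hdfs dfs -mkdir -p /removalinput
hdfs dfs -put /data/removal/data.txt /removalinput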
Run the jar package:
yarn jar /data/removal/removal-client.jar com.xlglvc.xxx.mapredece.removal.Removal /removalinput/data.txt /removaloutput
We view the results:
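The output directory can be inspected with the HDFS shell; part-r-00000 is the default name of the first reducer's output file:

hdfs dfs -ls /removaloutput
hdfs dfs -cat /removaloutput/part-r-00000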
Exercise: can we write a MapReduce job that counts how many records the data contains?