Original link:
https://www.toutiao.com/i6764296608705151496/
Word count means counting how many times each word appears in a document, for example in a data source like the following.
For that data, the final counts should be displayed as follows.
So how do we write MapReduce code that produces this result?
First we upload the file to HDFS (hdfs dfs -put ...).
The data file is named data.txt and is 2 GB in size.
In the figure, the red and yellow blocks represent the three blocks in which the data is stored.
The data in data.txt then enters the map phase in the form of <K, V> pairs (key-value pairs). K is the byte offset of the start of each line relative to the beginning of the file, and V is the text of that line.
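To make the <K, V> input concrete, here is a minimal plain-Java sketch (no Hadoop dependency) that computes the byte offset and text of each line. The class name LineOffsets and the two sample lines are hypothetical, and the sketch assumes UTF-8 text with '\n' line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class LineOffsets {
    /** Returns byte offset -> line text, assuming '\n' line endings. */
    static Map<Long, String> toRecords(String fileContent) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            records.put(offset, line);
            // Advance past the line's bytes plus one byte for the '\n' terminator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, not the article's data.txt.
        toRecords("hello world\nhello hadoop\n")
                .forEach((k, v) -> System.out.println("<" + k + ", " + v + ">"));
        // prints <0, hello world> then <12, hello hadoop>
    }
}
```

The second key is 12 because the first line occupies 11 bytes plus its newline.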
This can be drawn as in the figure: each blue oval represents a map task, and when the red and yellow blocks enter the map phase, the data arrives in <K, V> form (key-value pairs), shown in red on the left.
After map processing, for example splitting each line with String.split(" "), the data in each of the red and yellow blocks becomes key-value pairs like the following.
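The map step can be sketched in plain Java as follows. This is only an illustration of the idea, not Hadoop code: the class name MapStep and the sample line are hypothetical, and it assumes words are separated by single spaces.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class MapStep {
    /** Turns one line of text into a list of (word, 1) pairs. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.split(" ")) {
            // Consecutive spaces produce empty tokens; skip them.
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map("hello world hello"));
        // prints [hello=1, world=1, hello=1]
    }
}
```

Note that the map step does not add anything up yet: a word that occurs twice is emitted twice, each time with the value 1.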
When configuring the Hadoop job we set the number of reduce tasks; suppose there are two.
When the map tasks finish, their output is sent to the corresponding reduce task, as in the following figure.
There is a simple principle behind this:
job.setNumReduceTasks(2) sets the number of reduce tasks.
The HashPartitioner class then takes key.hashCode() modulo the number of reduce tasks, and map output with different results is sent to different reduce tasks. To picture it, imagine keys beginning with a-e going to one place and keys beginning with e-z going to another.
The data then ends up like this.
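The partition-then-count idea can be tried out with a small plain-Java sketch. The arithmetic in partition() mirrors what Hadoop's HashPartitioner does, (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, but this sketch uses String.hashCode() rather than the hash of a Hadoop Text key, so the actual reducer a word lands on in a real job will differ. The class name PartitionAndReduce and the sample words are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PartitionAndReduce {
    /** Picks a reducer index in [0, numReduceTasks) from the key's hash. */
    static int partition(String key, int numReduceTasks) {
        // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Routes each word to its reducer and sums the 1s there. */
    static List<Map<String, Integer>> run(List<String> words, int numReduceTasks) {
        List<Map<String, Integer>> reducers = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            reducers.add(new TreeMap<>());
        }
        for (String word : words) {
            // Equal keys always hash to the same reducer, so each count is complete.
            reducers.get(partition(word, numReduceTasks)).merge(word, 1, Integer::sum);
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> result =
                run(Arrays.asList("hello", "world", "hello", "hadoop"), 2);
        for (int i = 0; i < result.size(); i++) {
            System.out.println("reducer " + i + ": " + result.get(i));
        }
    }
}
```

Because every occurrence of a word goes to the same reducer, each reducer can produce final counts for its own keys without talking to the others.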
With that in place we can do the counting, so let's start writing the code.
First we create a wordCount project, as a Maven project.
The relevant part of the pom configuration is as follows.
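The original pom is not reproduced here. As a rough sketch, a MapReduce project of this kind usually needs at least the hadoop-client dependency. The version below is an assumption and should be replaced with the version of your own cluster.

```xml
<dependencies>
  <!-- Client-side Hadoop libraries (HDFS and MapReduce APIs) -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <!-- Assumed version; match it to the Hadoop version you run -->
    <version>2.7.3</version>
  </dependency>
</dependencies>
```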
Create a class
Make it extend Mapper (note the comments)
Write the code
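The author's mapper code is not reproduced here, so the following is only a sketch of what such a class might look like. The class name WordCountMapper is a hypothetical choice, the package is taken from the run command used in this article, and splitting on a single space is an assumption about the data.

```java
package com.xlglvc.xx.mapredece.wordcount_client;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input:  <LongWritable byte offset, Text line>
// Output: <Text word, IntWritable 1>
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The key (byte offset) is not needed for counting; only the line text is used.
        for (String token : value.toString().split(" ")) {
            if (token.isEmpty()) {
                continue; // skip empty tokens caused by repeated spaces
            }
            word.set(token);
            context.write(word, ONE);
        }
    }
}
```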
Also create WordCountReducer and write its code, applying the reduce idea described earlier
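A reducer along these lines might look like the sketch below. It is not the author's original code; it assumes the map output is <Text word, IntWritable 1> and uses the package from the run command in this article.

```java
package com.xlglvc.xx.mapredece.wordcount_client;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input:  <Text word, list of IntWritable counts for that word>
// Output: <Text word, IntWritable total>
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // All values for one key arrive together, so summing them gives the word's count.
        for (IntWritable value : values) {
            sum += value.get();
        }
        total.set(sum);
        context.write(key, total);
    }
}
```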
Create the WordCountDriver class and write its code
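A driver for this job might be sketched as follows. It is not the author's original code: it assumes the mapper class from the earlier step is named WordCountMapper (a hypothetical name), that the reducer is the WordCountReducer class, and that the input and output paths are passed as the two command-line arguments, as in the run command used in this article.

```java
package com.xlglvc.xx.mapredece.wordcount_client;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: WordCountDriver <input path> <output path>");
            System.exit(2);
        }

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // WordCountMapper is an assumed class name; WordCountReducer is named in the article.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Left at the default of one reduce task, which yields a single part-r-00000 file.
        // job.setNumReduceTasks(2) would yield part-r-00000 and part-r-00001.

        FileInputFormat.addInputPath(job, new Path(args[0]));
        // The output directory must not exist yet, or the job fails at submission.
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```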
Export the project as a jar
Start Hadoop
Upload the data and the jar package
Upload the data to HDFS
Run the following command
bin/yarn jar /data/wordCount/wordCount.jar com.xlglvc.xx.mapredece.wordcount_client.WordCountDriver /data.txt /outputwordcount
This ran into a problem: the system time was not synchronized
Install the ntpdate tool
yum -y install ntp ntpdate
Synchronize the system time with network time
ntpdate cn.pool.ntp.org
Then run the job again, this time choosing a new output directory
bin/yarn jar /data/wordCount/wordCount.jar com.xlglvc.xx.mapredece.wordcount_client.WordCountDriver /data.txt /outputwordcount1
Check the job in the browser
View the final result
bin/hdfs dfs -text /outputwordcount1/part-r-00000
The results we wanted appear, and the word count is complete