Implementing word count in Java

Original link:

https://www.toutiao.com/i6764296608705151496/

Word count computes the number of times each word appears in a document, for example in a data source like the following:

 

The final result should be displayed as follows:

 

So how do we write MapReduce code that produces this result?

First, we upload the file to HDFS (hdfs dfs -put ...)

File name: data.txt, size: 2 GB

In the figure, the red and yellow boxes represent the three blocks in which the data is stored

 

The data in data.txt then enters the map phase in the form of <K, V> pairs (key-value pairs): K is the byte offset of the start of each line relative to the beginning of the file, and V is the text of that line.
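The <K, V> input described above can be sketched in plain Java, without Hadoop. This is only an illustration: the class name LineOffsetSketch, the helper toKeyValuePairs, and the sample text are made up for this example, and it assumes lines end with a single '\n' byte. In Hadoop's default TextInputFormat the key is a LongWritable and the value is a Text.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration (no Hadoop needed) of the <K, V> pairs fed to map:
// K = byte offset of the start of the line, V = the text of the line.
class LineOffsetSketch {

    // Hypothetical helper: returns one "offset<TAB>line" string per input line.
    // Assumes every line is terminated by a single '\n' byte.
    static List<String> toKeyValuePairs(String fileContent) {
        List<String> pairs = new ArrayList<>();
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            pairs.add(offset + "\t" + line);
            // Advance by the line's byte length plus one byte for the '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return pairs;
    }

    public static void main(String[] args) {
        String sample = "hello world\nhello hadoop\nbye\n";
        // Prints the offsets 0, 12 and 25 next to their lines.
        toKeyValuePairs(sample).forEach(System.out::println);
    }
}
```

Note that the key is a byte offset, not a line number, so the second line of the sample gets key 12, not 1.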

 

This can be pictured as in the figure: each blue oval represents a map. When the red and yellow blocks enter the map stage, the data takes the <K, V> (key-value pair) form shown in red on the left

 

After map processing, for example a String.split(" ") on each line, the data in the red and yellow blocks becomes key-value pairs of the following form
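As a rough sketch of what this map step does to one line, the plain-Java code below splits the line on spaces and emits a (word, 1) pair for each word. The class name MapStepSketch and its map method are illustrative only; a real Hadoop mapper would write the pairs to its Context instead of returning a list.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of the map step: one input line in, (word, 1) pairs out.
class MapStepSketch {

    // offset is the K of the input pair (unused for counting), line is the V.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split(" ")) {
            if (!word.isEmpty()) { // skip empty tokens caused by repeated spaces
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A repeated word is emitted once per occurrence; summing happens later in reduce.
        System.out.println(map(0L, "hello world hello"));
    }
}
```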

 

When configuring the Hadoop job we set the number of reduces. Suppose there are two reduces

The map output is then sent to the corresponding reduce, as in the following figure.

 

A simple principle applies here:

job.setNumReduceTasks(2) sets the number of reduces

The HashPartitioner class then computes key.hashCode() % (number of reduces), and map output with different results is sent to different reduces. For example, words beginning with a to e go to one place and words beginning with e to z go to another:
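The partitioning rule can be tried out in plain Java. Hadoop's HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, where the mask keeps the result non-negative. The sketch below applies the same arithmetic to String.hashCode() as a stand-in for the hash of Hadoop's Text key, so the reducer numbers on a real cluster may differ. The class name PartitionSketch and the sample words are illustrative.

```java
// Plain-Java illustration of hash partitioning: which reduce receives which key.
class PartitionSketch {

    // Clear the sign bit so the value is never negative, then take the
    // remainder by the number of reduce tasks.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 2; // matches the two reduces assumed in the text
        for (String word : new String[] {"apple", "egg", "hadoop", "zebra"}) {
            System.out.println(word + " -> reduce " + partitionFor(word, reducers));
        }
    }
}
```

Because the result depends only on the key, every occurrence of the same word ends up in the same reduce, which is what makes the later counting correct.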

 

 

The data then becomes

 

 

Now the counting can be done, so let's start writing the code
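Before turning to the Hadoop project, here is a minimal plain-Java sketch of the counting that each reduce performs: it receives one word together with all the 1s emitted for it and sums them. The class name ReduceStepSketch and the sample data are assumptions made for illustration; they are not the author's WordCountReducer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the reduce step: (word, [1, 1, ...]) in, (word, total) out.
class ReduceStepSketch {

    // Sums the values collected for a single word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Grouped map output: each key arrives with all of its values.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("hadoop", Arrays.asList(1));
        grouped.put("hello", Arrays.asList(1, 1));
        grouped.forEach((word, ones) ->
                System.out.println(word + "\t" + reduce(word, ones)));
    }
}
```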

First, we create a wordCount project as a Maven project

 

The relevant part of the pom configuration

 

 

 

We create a class

 

It extends Mapper (note the comments)

 

Write code

 

We also create WordCountReducer and write its code, following the reduce logic explained above

 

Create a WordCountDriver class and write its code

 

Export the project as a jar

 

 

 

 

We start Hadoop

 

We upload the data and the jar package

 

Upload the data to HDFS

 

Execute the following command

bin/yarn jar /data/wordCount/wordCount.jar com.xlglvc.xx.mapredece.wordcount_client.WordCountDriver /data.txt /outputwordcount

 

A problem appears: the system time is not synchronized

 

Install the ntpdate tool

yum -y install ntp ntpdate

Synchronize the system time with network time

ntpdate cn.pool.ntp.org

 

Then we run the job again, this time with a new output directory

bin/yarn jar /data/wordCount/wordCount.jar com.xlglvc.xx.mapredece.wordcount_client.WordCountDriver /data.txt /outputwordcount1

 

We check the job in the browser

 

We view the final result

bin/hdfs dfs -text /outputwordcount1/part-r-00000

 

The results we want appear, and the word count is complete

 

Origin www.cnblogs.com/bqwzy/p/12528446.html