TF-IDF based on MapReduce


The TF-IDF MapReduce Phases by Ricky Ho

 

Job1:

Map:

input: (document, each line of the document) # TextInputFormat

output: (word@document, 1)

Reducer:

output: ((word@document), n)

n = sum of the values for each key (word@document)

The implicit step here is the shuffle phase:

all pairs with the same key (word@document) are sent to the same reducer
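Job 1 can be sketched in plain Python as follows. This is only an in-memory illustration, not Hadoop code: the names `job1_map`, `job1_reduce`, `run_job1` and the sample `corpus` are hypothetical, tokenization by lowercasing and splitting on whitespace is an assumption, and a dictionary of lists stands in for the shuffle phase.

```python
from collections import defaultdict


def job1_map(document, line):
    # Emit (word@document, 1) for every token in the line.
    # Assumption: tokens are lowercased and split on whitespace.
    for word in line.lower().split():
        yield f"{word}@{document}", 1


def job1_reduce(key, values):
    # n = sum of the values for the key word@document.
    return key, sum(values)


def run_job1(corpus):
    # corpus: document name -> list of lines (hypothetical stand-in for the job input).
    groups = defaultdict(list)  # simulates the shuffle: same key, same reducer
    for document, lines in corpus.items():
        for line in lines:
            for key, value in job1_map(document, line):
                groups[key].append(value)
    return dict(job1_reduce(key, values) for key, values in groups.items())


corpus = {"d1": ["a b a"], "d2": ["b c"]}
word_counts = run_job1(corpus)
print(word_counts)
```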

 

Job2:

Map:

1. input: ((word@document), n)

2. Re-key each record by document, so that all word counts of one document reach the same reducer

3. output: (document, word=n)

Reducer:

     output: ((word@document), n/N)

     N = total number of words in the document = sum of n over all (word=n) values received for that document
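A minimal in-memory sketch of Job 2, under a few assumptions: the function names and the sample `job1_output` dictionary are hypothetical, the key is split on its last `@` (so document names must not contain `@`), and n/N is carried as the pair (n, N) rather than a formatted string.

```python
from collections import defaultdict


def job2_map(key, n):
    # Re-key by document: (word@document, n) -> (document, (word, n)).
    word, document = key.rsplit("@", 1)
    yield document, (word, n)


def job2_reduce(document, values):
    values = list(values)
    total = sum(n for _, n in values)  # N = total number of words in the document
    for word, n in values:
        # Emit n and N as a pair; the term frequency is n / N.
        yield f"{word}@{document}", (n, total)


def run_job2(word_counts):
    groups = defaultdict(list)  # simulates the shuffle, grouped by document
    for key, n in word_counts.items():
        for document, value in job2_map(key, n):
            groups[document].append(value)
    result = {}
    for document, values in groups.items():
        for key, value in job2_reduce(document, values):
            result[key] = value
    return result


# Hypothetical Job 1 output: word@document -> n
job1_output = {"a@d1": 2, "b@d1": 1, "b@d2": 1, "c@d2": 1}
term_freqs = run_job2(job1_output)
print(term_freqs)
```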

Job3:

Map:

1. input: ((word@document), n/N)

2. Re-key each record by word, since we need to count the number of documents in which the word occurs

3. output: (word, document=n/N)

Reducer:

     output: ((word@document), d/D, n/N, tfidf)

     D = total number of documents in corpus, which can be set in the configuration

     d = number of documents in corpus where the word appears

             TFIDF = n/N * log(D/d)
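Job 3 and the final formula can be sketched the same way. The names and the sample `job2_output` are hypothetical, D is passed in as a plain argument in place of a job configuration value, and the natural logarithm is an assumption because the notes do not state the base of the log.

```python
import math
from collections import defaultdict


def job3_map(key, n_and_total):
    # Re-key by word: (word@document, (n, N)) -> (word, (document, (n, N))).
    word, document = key.rsplit("@", 1)
    yield word, (document, n_and_total)


def job3_reduce(word, values, total_docs):
    values = list(values)
    d = len(values)  # number of documents in which the word appears
    for document, (n, total) in values:
        # TFIDF = n/N * log(D/d), natural log assumed.
        tfidf = (n / total) * math.log(total_docs / d)
        yield f"{word}@{document}", (d, total_docs, n, total, tfidf)


def run_job3(term_freqs, total_docs):
    groups = defaultdict(list)  # simulates the shuffle, grouped by word
    for key, value in term_freqs.items():
        for word, record in job3_map(key, value):
            groups[word].append(record)
    result = {}
    for word, values in groups.items():
        for key, value in job3_reduce(word, values, total_docs):
            result[key] = value
    return result


# Hypothetical Job 2 output: word@document -> (n, N)
job2_output = {"a@d1": (2, 3), "b@d1": (1, 3), "b@d2": (1, 2), "c@d2": (1, 2)}
scores = run_job3(job2_output, total_docs=2)
for key, (d, D, n, N, tfidf) in sorted(scores.items()):
    print(key, f"{d}/{D}", f"{n}/{N}", round(tfidf, 4))
```

In this toy corpus the word b occurs in both documents, so log(D/d) = log(1) = 0 and its score is zero in each of them.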


Reposted from irwenqiang.iteye.com/blog/1538484