Implementing MapReduce WordCount in Python

Introduction

  • As an Apache Foundation project, Hadoop addresses the long processing times involved in working with big data. Within Hadoop, the MapReduce parallel processing framework is a key component. Because the Hadoop architecture is implemented in Java, Java programs are the most common choice for processing big data. For deep learning and data mining, however, Python is the easier language to process data with. With this in mind, this article shows how to use Python to implement the WordCount experiment in MapReduce. The code in this article comes from a blogger's CSDN post; the reference link is at the end.

Hadoop Streaming

The approach mainly uses the Hadoop Streaming utility that ships with Hadoop, so we first introduce Hadoop Streaming.

The role of Streaming

  • The biggest advantage of the Hadoop Streaming framework is that map and reduce programs written in any language can run on the Hadoop cluster; the map/reduce programs only need to read from standard input (stdin) and write to standard output (stdout).
  • Second, stand-alone debugging is easy: the Streaming workflow can be simulated with pipes, so a map/reduce program can be debugged locally:
    cat inputfile | ./mapper | sort | ./reducer > output
  • Finally, the Streaming framework offers rich parameter control at job submission, directly through streaming options and without modifying any Java code; many advanced MapReduce features can be used simply by adjusting the streaming parameters.

Limitations of Streaming

By default, Streaming can only process text data (TextFile). For binary data, a better approach is to base64-encode the binary key and value to convert them into text, as sketched below.
In addition, conversion to and from standard input and standard output is required before and after the mapper and the reducer; the data copying and parsing involved introduces a certain amount of overhead.
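
To illustrate the base64 workaround (a minimal sketch that is not from the original post; the tab-framed record layout is an assumption), a small filter could encode binary keys and values before they enter Streaming's text-only pipe:

#!/usr/bin/env python
# Hypothetical sketch: base64-encode binary fields so that
# Streaming's text-only pipe can carry them safely.
import base64
import sys

for line in sys.stdin.buffer:              # read raw bytes, not decoded text
    raw = line.rstrip(b'\n')
    key, _, value = raw.partition(b'\t')   # assumed tab-framed records
    # Encode both sides so tabs or newlines inside the data cannot
    # break the key<TAB>value framing that Streaming expects.
    print('%s\t%s' % (base64.b64encode(key).decode('ascii'),
                      base64.b64encode(value).decode('ascii')))

The downstream consumer would then base64-decode each field after splitting on the tab.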

Related parameters of the Streaming command

# hadoop jar hadoop-streaming-2.6.5.jar [generic options] [streaming options]

For the generic options and streaming options, please refer to the following URL:
https://www.cnblogs.com/shay-zhangjin/p/7714868.html

Implementing WordCount in Python

  1. First, write the mapper.py script:
#!/usr/bin/env python  
  
import sys  
  
# input comes from STDIN (standard input)  
for line in sys.stdin:  
    # remove leading and trailing whitespace  
    line = line.strip()  
    # split the line into words  
    words = line.split()  
    # increase counters  
    for word in words:  
        # write the results to STDOUT (standard output);  
        # what we output here will be the input for the  
        # Reduce step, i.e. the input for reducer.py  
        #  
        # tab-delimited; the trivial word count is 1  
        print('%s\t%s' % (word, 1))

This script does not total up the occurrences of each word; it simply emits "1" immediately for every word it reads, even though a word may appear multiple times in the input. The aggregation is left to the subsequent Reduce step (an optional in-mapper optimization is sketched below). Remember to grant executable permission to mapper.py: chmod 777 mapper.py
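
As an optional optimization that is not part of the original tutorial, the mapper can pre-aggregate counts in a dictionary before emitting, so fewer pairs cross the shuffle phase (a technique often called in-mapper combining). A minimal sketch:

#!/usr/bin/env python
# Sketch of an optional in-mapper combiner: pre-aggregate counts
# locally, so fewer (word, count) pairs go through the shuffle.
import sys
from collections import defaultdict

counts = defaultdict(int)
for line in sys.stdin:
    for word in line.split():
        counts[word] += 1

# Emit one partial count per distinct word seen by this mapper.
for word, count in counts.items():
    print('%s\t%s' % (word, count))

Because the reducer in the next step sums arbitrary integer counts, it works unchanged with this variant.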

  2. Next, write the reducer.py script:
#!/usr/bin/env python  
  
import sys  
  
current_word = None  
current_count = 0  
word = None  
  
# input comes from STDIN  
for line in sys.stdin:  
    # remove leading and trailing whitespace  
    line = line.strip()  
  
    # parse the input we got from mapper.py  
    word, count = line.split('\t', 1)  
  
    # convert count (currently a string) to int  
    try:  
        count = int(count)  
    except ValueError:  
        # count was not a number, so silently  
        # ignore/discard this line  
        continue  
  
    # this IF-switch only works because Hadoop sorts map output  
    # by key (here: word) before it is passed to the reducer  
    if current_word == word:  
        current_count += count  
    else:  
        if current_word:  
            # write result to STDOUT  
            print('%s\t%s' % (current_word, current_count))  
        current_count = count  
        current_word = word  
  
# do not forget to output the last word if needed!  
if current_word == word:  
    print('%s\t%s' % (current_word, current_count))

Store the code in /usr/local/hadoop/reducer.py. This script reads the mapper's output from STDIN, sums the occurrences of each word, and writes the result to STDOUT (a more compact alternative is sketched below).
Again, pay attention to the script permission: chmod 777 reducer.py
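
As an aside beyond the original post: because Hadoop sorts the map output by key before the reduce phase, the same reducer logic can be written more compactly with itertools.groupby. A minimal sketch, functionally equivalent to reducer.py above:

#!/usr/bin/env python
# Alternative reducer sketch using itertools.groupby; it relies on the
# same guarantee (input sorted by word) as the explicit loop above.
import sys
from itertools import groupby

def parse(stdin):
    # Yield (word, count) pairs, silently skipping malformed lines.
    for line in stdin:
        word, _, count = line.strip().partition('\t')
        if count.isdigit():
            yield word, int(count)

for word, group in groupby(parse(sys.stdin), key=lambda pair: pair[0]):
    total = sum(count for _, count in group)
    print('%s\t%s' % (word, total))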

  3. It is recommended to test that the scripts run correctly before launching the MapReduce job:
root@localhost:/root/pythonHadoop$ echo "foo foo quux labs foo bar quux" | ./mapper.py  
foo      1  
foo      1  
quux     1  
labs     1  
foo      1  
bar      1  
quux     1  
root@localhost:/root/pythonHadoop$ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort | ./reducer.py  
bar     1  
foo     3  
labs    1  
quux    2  

If you see output like the above, the scripts work and you can run the MapReduce job.

  4. Run the Python scripts on the Hadoop platform:
[root@node01 pythonHadoop]
hadoop jar contrib/hadoop-streaming-2.6.5.jar \
    -mapper mapper.py \
    -file mapper.py \
    -reducer reducer.py \
    -file reducer.py \
    -input /ooxx/* \
    -output /ooxx/output/
  5. Finally, execute hdfs dfs -cat /ooxx/output/part-00000 to view the output results.
    The results are not shown here. You can create the input file hello.txt yourself with echo, or download a test file from the Internet; different data sets will naturally produce different results.

Reference article: https://blog.csdn.net/crazyhacking/article/details/43304499

Origin: blog.csdn.net/qq_41782149/article/details/97924099