1.
A. Write map function, reduce function
#!/usr/bin/env python
# mapper.py -- emit a "<word><TAB>1" pair for every word read from stdin
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t%s' % (word, 1))
#!/usr/bin/env python
# reducer.py -- sum the counts for each word; Hadoop Streaming delivers the
# mapper output sorted by key, so equal words arrive on consecutive lines
# (the original "from operator import itemgetter" was unused and is dropped)
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # count was not a number: skip the malformed line
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# flush the count for the last word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
B. Make both scripts executable
chmod a+x /home/hadoop/wc/mapper.py
chmod a+x /home/hadoop/wc/reducer.py
C. Test the code locally on this machine
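Before involving Hadoop, the pipeline can be exercised with a plain shell pipe, using `sort` to stand in for the shuffle phase. The sketch below re-creates the two scripts in a temp directory so it is self-contained; in practice you would pipe through the copies in /home/hadoop/wc directly. It assumes a `python3` interpreter is on the PATH.

```shell
# Self-contained local smoke test: map | sort (shuffle stand-in) | reduce
dir=$(mktemp -d)

cat > "$dir/mapper.py" <<'EOF'
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))
EOF

cat > "$dir/reducer.py" <<'EOF'
import sys
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_word, current_count = word, count
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
EOF

result=$(echo "foo foo quux labs foo bar quux" \
  | python3 "$dir/mapper.py" \
  | sort -k1,1 \
  | python3 "$dir/reducer.py")
echo "$result"
```

If the logic is right, each word appears once with its total count (bar 1, foo 3, labs 1, quux 2).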
D. Run it on the Hadoop cluster
a. Upload the previously crawled text file to HDFS
b. Submit the job with the Hadoop Streaming command
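The two steps above can be sketched as the commands below. They assume `hdfs` and `hadoop` are on the PATH, that `$HADOOP_HOME` is set, and that the input file is named input.txt; the exact streaming-jar filename varies by Hadoop version, and the HDFS paths here are illustrative. This fragment needs a live cluster, so it is shown for reference rather than as a runnable test.

```shell
# a. Upload the crawled text file to HDFS
hdfs dfs -mkdir -p /user/hadoop/wc/input
hdfs dfs -put /home/hadoop/wc/input.txt /user/hadoop/wc/input

# b. Submit the Streaming job; -file ships the scripts to every node
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input  /user/hadoop/wc/input \
    -output /user/hadoop/wc/output \
    -mapper  mapper.py \
    -reducer reducer.py \
    -file /home/hadoop/wc/mapper.py \
    -file /home/hadoop/wc/reducer.py

# Inspect the result once the job finishes
hdfs dfs -cat /user/hadoop/wc/output/part-00000
```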