Data format:
4234 4565 89579 0989 ····
3455 879 123 9090 ····
2342 9897 765 5746 ····
987 8098 8008 80099 ····
····
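Requirement:
Find the numbers that occur most often in this data set, sort them by occurrence count in descending order, and output the first n numbers together with their counts (top n).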
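Before moving to MapReduce, it can help to pin down the expected result on a tiny input. The following is a minimal single-machine sketch using `collections.Counter`; the function name `top_n` and the sample string are made up for illustration and are not part of the original job.

```python
from collections import Counter

def top_n(text, n):
    """Return the n most frequent whitespace-separated tokens with their counts."""
    counts = Counter(text.split())
    return counts.most_common(n)

# Made-up sample input, not taken from the data set above.
sample = "879 4234 123\n879 4234 879\n765 879 4234\n"
print(top_n(sample, 2))  # [('879', 4), ('4234', 3)]
```

The MapReduce version below should produce the same numbers and counts when the data is too large for one machine.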
Python code:
- mapper:
Map every number read from the input to a simple (num, 1) pair.
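#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys

def map():
    # Read the input split line by line from standard input.
    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        # Emit one tab-separated (num, 1) record per number.
        for word in words:
            print('%s\t%s' % (word, 1))

if __name__ == '__main__':
    map()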
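To make the mapper's output concrete, here is a small sketch that applies the same split-and-pair logic to a single line. The helper `map_line` is a hypothetical name introduced only for this illustration; the real mapper reads from `sys.stdin` instead.

```python
def map_line(line):
    """Split one input line on whitespace and pair every token with the count 1."""
    return [(word, 1) for word in line.strip().split()]

# One line from the data format shown at the top of this post.
for word, one in map_line("3455 879 123 9090"):
    print('%s\t%s' % (word, one))
```

Each number ends up on its own line followed by a tab and `1`, which is the record format the reducer expects.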
- reducer:
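Use groupby to group the records that share the same key (num). After grouping, the key is num and each value is a (num, 1) tuple; summing the second element of every value gives the total count for that num. Note that groupby only merges adjacent records, which works here because Streaming delivers the mapper output to the reducer sorted by key. Finally, the counts are compared to select the top n, where n is the argument passed to the Python script when it is run through Streaming.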
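#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
from itertools import groupby

def from_stdin():
    # Parse each tab-separated "num<TAB>count" record produced by the mapper.
    for line in sys.stdin:
        word, count = line.strip().split('\t')
        yield (word, count)

def reduce():
    # n is passed as the first command-line argument of the reducer script.
    n = int(sys.argv[1])
    # a holds the current top-n candidates as {num: count}.
    a = {}
    for word, group in groupby(from_stdin(), key=lambda x: x[0]):
        count = sum([int(tup[1]) for tup in group])
        if len(a) < n:
            a.setdefault(word, count)
        else:
            # Replace the smallest candidate if the new count is larger.
            y = min(a, key=a.get)
            if count > a[y]:
                a.pop(y)
                a.setdefault(word, count)
    # Sort the surviving candidates by count, largest first.
    a = [(key, value) for key, value in a.items()]
    a.sort(reverse=True, key=lambda x: x[1])
    # Output format: count<TAB>num.
    for b in a:
        print('%s\t%s' % (b[1], b[0]))

if __name__ == '__main__':
    reduce()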
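The two scripts can be checked without a cluster by imitating the map, sort, and reduce steps in one process. The sketch below is an assumption-laden stand-in, not the job itself: `simulate` is a hypothetical helper, the in-memory `sort` replaces the sort that Streaming performs between mapper and reducer, and `heapq.nlargest` is swapped in for the dictionary-based selection used in the reducer above.

```python
import heapq
from itertools import groupby

def simulate(lines, n):
    """Run map, sort and reduce in one process; return the top n (count, num) pairs."""
    # Map step: one (num, 1) pair per token.
    pairs = [(word, 1) for line in lines for word in line.split()]
    # Stand-in for the sort between mapper and reducer: groupby only merges
    # adjacent records, so equal keys must sit next to each other.
    pairs.sort(key=lambda p: p[0])
    # Reduce step: sum the ones for each key, producing (count, num) tuples.
    totals = ((sum(c for _, c in group), word)
              for word, group in groupby(pairs, key=lambda p: p[0]))
    # nlargest returns the n biggest tuples, already sorted in descending order.
    return heapq.nlargest(n, totals)

# Made-up sample input, not taken from the data set above.
sample = ["879 4234 123", "879 4234 879", "765 879 4234"]
for count, num in simulate(sample, 2):
    print('%s\t%s' % (count, num))
```

The printed lines use the same count-then-number layout as the reducer's output. Ties on count are broken here by comparing the number strings, which may differ from the order the original reducer would produce.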