MapReduce Python Top N

Data format:

4234 4565 89579 0989 ····
3455 879 123 9090 ····
2342 9897 765 5746 ····
987 8098 8008 80099 ····
····

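Requirement:

Count how often each number occurs in this data set, sort the numbers by occurrence count in descending order, and output the first n numbers together with their counts (top n).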
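Before moving to MapReduce, it can help to have a single-machine reference for the expected result. The following is only a minimal sketch using `collections.Counter`; the function name `top_n` and the sample lines are made up for illustration and are not part of the original job.

```python
from collections import Counter


def top_n(lines, n):
    """Count whitespace-separated tokens and return the n most frequent."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # most_common returns (token, count) pairs, highest count first.
    return counts.most_common(n)


if __name__ == '__main__':
    sample = ['7 8 7 9', '8 7 5']  # made-up data, not from the input file
    for number, count in top_n(sample, 2):
        print('%s\t%s' % (count, number))
```

This only works when the whole input fits on one machine, which is the case the mapper and reducer below are meant to go beyond.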

Python code:

  • mapper:

Map each number read from the input to a simple (num, 1) pair.

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys


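def main():
    # Named main rather than map to avoid shadowing the built-in map().
    for line in sys.stdin:
        # split() with no argument also discards leading/trailing whitespace.
        words = line.split()
        for word in words:
            # Emit one 'num<TAB>1' record per number.
            print('%s\t%s' % (word, 1))


if __name__ == '__main__':
    main()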
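To check the mapping step without Hadoop, the same logic can be written as a function over any iterable of lines. This is a hedged sketch; `map_lines` is a hypothetical helper name and the sample input is invented.

```python
def map_lines(lines):
    """Yield one 'number<TAB>1' record per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield '%s\t%s' % (word, 1)


if __name__ == '__main__':
    # Made-up input lines standing in for the real data file.
    for record in map_lines(['3455 879 123', '879']):
        print(record)
```

Blank lines produce no records, since `split()` returns an empty list for them.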
  • reducer:

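Use groupby to group the records that share the same key (num). After grouping, the key is num and each value in the group is a (num, 1) tuple; summing the second item of those tuples gives count, the total number of occurrences of that num. This works because Hadoop Streaming delivers the reducer's input sorted by key, and groupby only groups consecutive records. Finally, the counts are compared to keep only the top n, where n is the argument passed to the Python script when it is run through Streaming.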

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys
from itertools import groupby


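def from_stdin():
    # Parse each 'num<TAB>count' record emitted by the mapper.
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing on the unpack
        word, count = line.split('\t')
        yield (word, count)


def reduce():
    n = int(sys.argv[1])  # how many entries to keep
    a = {}  # current top-n candidates: num -> count
    # groupby relies on the input being sorted by key, so all records
    # for the same num arrive consecutively.
    for word, group in groupby(from_stdin(), key=lambda x: x[0]):
        count = sum(int(tup[1]) for tup in group)
        if len(a) < n:
            a[word] = count
        elif a:
            # Replace the smallest candidate if this num occurs more often.
            y = min(a, key=a.get)
            if count > a[y]:
                a.pop(y)
                a[word] = count
    # Print as 'count<TAB>num', highest count first.
    for word, count in sorted(a.items(), key=lambda x: x[1], reverse=True):
        print('%s\t%s' % (count, word))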


if __name__ == '__main__':
    reduce()
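The reduce step can also be tried locally. The sketch below is not the author's reducer: it swaps the dictionary-and-min bookkeeping for `heapq.nlargest` and uses `sorted()` as a stand-in for the sort that Hadoop Streaming performs between map and reduce. The function name `top_n_from_pairs` and the sample data are assumptions made for illustration.

```python
import heapq
from itertools import groupby


def top_n_from_pairs(pairs, n):
    """Return the n (key, total) pairs with the largest totals."""
    # Stand-in for the shuffle: the reducer normally receives its
    # input already sorted by key.
    ordered = sorted(pairs, key=lambda kv: kv[0])
    totals = (
        (key, sum(count for _, count in group))
        for key, group in groupby(ordered, key=lambda kv: kv[0])
    )
    # nlargest returns the n pairs with the highest totals, largest first.
    return heapq.nlargest(n, totals, key=lambda kv: kv[1])


if __name__ == '__main__':
    sample_lines = ['1 2 2 3', '3 3 3 2 4']  # made-up input
    pairs = [(word, 1) for line in sample_lines for word in line.split()]
    for key, total in top_n_from_pairs(pairs, 2):
        print('%s\t%s' % (total, key))
```

Either version yields a global top n only when the job runs with a single reducer. With more than one reducer, each would emit its own top n, and those partial results would still need to be merged.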

Reposted from blog.csdn.net/weixin_44129672/article/details/88720849