Big Data with Python: Developing a MapReduce wordcount

1 Introduction

MapReduce is a distributed programming model for processing large-scale data sets. The user specifies a map function that processes an input set of key/value pairs and emits an intermediate set of key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The "trick" to writing MapReduce jobs in Python is to use the Hadoop Streaming API, which passes data between the map and reduce phases through STDIN (standard input) and STDOUT (standard output).
The only thing we have to do is read the input with Python's sys.stdin and write our output to sys.stdout; Hadoop Streaming takes care of everything else.
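For example, given the single input line hello world hello (sample data chosen purely for illustration), the mapper writes one tab-delimited word/1 pair per word to STDOUT:

hello	1
world	1
hello	1

Hadoop Streaming sorts these pairs by key between the two phases, so the reducer reads them grouped on STDIN and emits the merged counts:

hello	2
world	1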

2 The MapReduce program

(1) The user-written Python program is split into two parts, a Mapper and a Reducer, and is then submitted to the cluster to run.
(2) The Mapper's input data comes as KV pairs (the K and V types can be customized).
(3) The Mapper's output data is also KV pairs (the K and V types can be customized).
(4) The Mapper's business logic goes in the map() method.
(5) The map() method (the maptask process) is called once for each <K,V> pair.
(6) The Reducer's input type matches the Mapper's output type: KV pairs again.
(7) The Reducer's business logic goes in the reduce() method.
(8) The reducetask process calls the reduce() method once for each group of <k,v> pairs that share the same k; the sketch after this list walks through the whole flow.
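To make points (1)-(8) concrete, here is a minimal pure-Python sketch of the map → shuffle/sort → reduce flow. map_fn, reduce_fn, and the sample lines are illustrative stand-ins, not Hadoop APIs:

from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # emit one (word, 1) pair per word, mirroring map.py below
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # called once per key group, mirroring reduce.py below
    return (word, sum(counts))

lines = ["hello world", "hello mapreduce"]
pairs = [kv for line in lines for kv in map_fn(line)]
pairs.sort(key=itemgetter(0))  # the framework's "shuffle/sort" step
for word, group in groupby(pairs, key=itemgetter(0)):
    print(reduce_fn(word, (count for _, count in group)))
# -> ('hello', 2), ('mapreduce', 1), ('world', 1)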

2.1 map.py

#!/usr/bin/python
# -*- coding: utf-8 -*-
# The shebang on the first line must name the Python interpreter; without it
# you have to prefix the script with the python command when running it.
# @Time    : 2018/10/25 11:42 PM
# @Author  : Einstein Yang!!
# @Nickname : 穿着开裆裤上大学
# @FileName: map.py
# @Software: PyCharm
# @PythonVersion: python3.5
# @Blog    :https://blog.csdn.net/weixin_41734687

import sys


# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words (split() with no arguments already
    # discards empty strings)
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
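map.py can be sanity-checked on its own; with a hypothetical test line, e.g.:

echo "foo foo quux labs foo bar quux" | python map.py

the expected output is:

foo	1
foo	1
quux	1
labs	1
foo	1
bar	1
quux	1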

2.2 reduce.py

#!/usr/bin/python
# -*- coding: utf-8 -*-
# @Time    : 2018/10/25 11:54 PM
# @Author  : Einstein Yang!!
# @Nickname : 穿着开裆裤上大学
# @FileName: reduce.py
# @Software: PyCharm
# @PythonVersion: python3.5
# @Blog    :https://blog.csdn.net/weixin_41734687

from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from map.py (tab-delimited)
    try:
        word, count = line.split('\t', 1)
        # convert count (currently a string) to int
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # the line was not a tab-delimited "word<TAB>count" pair with a
        # numeric count, so silently ignore/discard it
        pass

# sort the words lexicographically so the output is ordered by key;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print('%s\t%s' % (word, count))

Local testing

cat wordcount.csv | python map.py | sort -k1,1 | python reduce.py
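For instance, if wordcount.csv held just the line hello world hello (hypothetical sample data), the pipeline would print:

hello	2
world	1

The sort step stands in for the shuffle/sort that Hadoop performs between the map and reduce phases. Our dict-based reduce.py would tolerate unsorted input, but keeping sort makes the local test faithful to the cluster run.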

Running the code on the cluster

# Complete cluster-submission example; the shell script below launches it.
# /root/apps/hadoop-2.6.4/bin/hadoop jar /root/apps/hadoop-2.6.4/share/hadoop/tools/lib/hadoop-streaming-2.6.4.jar -mapper map.py -reducer reduce.py -input /data/data_coe/data_asset/bigdata/*.csv -output /data/data_coe/data_asset/bigdata/output -file /root/Desktop/map.py -file /root/Desktop/reduce.py
HADOOP_CMD="/root/apps/hadoop-2.6.4/bin/hadoop"
STREAM_JAR_PATH="/root/apps/hadoop-2.6.4/share/hadoop/tools/lib/hadoop-streaming-2.6.4.jar"

INPUT_FILE_PATH="/data/data_coe/data_asset/bigdata/*.csv"
OUTPUT_PATH="/data/data_coe/data_asset/bigdata/output"

hdfs dfs -rmr $OUTPUT_PATH

$HADOOP_CMD jar $STREAM_JAR_PATH \
    -input $INPUT_FILE_PATH \
    -output $OUTPUT_PATH \
    -mapper "python map.py" \
    -reducer "python reduce.py" \
    -file /root/Desktop/map.py \
    -file /root/Desktop/reduce.py
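Once the job finishes, the result files can be inspected with the standard HDFS commands (output path as defined above):

hdfs dfs -ls /data/data_coe/data_asset/bigdata/output
hdfs dfs -cat /data/data_coe/data_asset/bigdata/output/part-*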

Script explanation

HADOOP_CMD: path to the hadoop binary
STREAM_JAR_PATH: path to the Hadoop Streaming jar
INPUT_FILE_PATH: input path on the Hadoop cluster
OUTPUT_PATH: output path on the Hadoop cluster. (Note: this directory must not exist when the job starts, which is why the script deletes it first. Caution: on the very first run the directory does not exist yet, so the delete reports an error; you can create an output directory by hand beforehand.)

hdfs dfs -rmr $OUTPUT_PATH

# If the first line of map.py names the interpreter (i.e. it starts with
# #!/usr/bin/python), you can instead write: -mapper map.py -reducer reduce.py
$HADOOP_CMD jar $STREAM_JAR_PATH \
    -input $INPUT_FILE_PATH \
    -output $OUTPUT_PATH \
    -mapper "python map.py" \
    -reducer "python reduce.py" \
    -file /root/Desktop/map.py \
    -file /root/Desktop/reduce.py
# This is the fixed format: specify the input and output paths and the
# mapper/reducer commands, and ship our mapper/reducer source files with
# -file, because the other nodes in the cluster do not yet have these
# executables.
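If you prefer the shebang form mentioned in the comment above, a sketch of the alternative submission (assuming the scripts live at the same paths and their first line is #!/usr/bin/python) is:

chmod +x /root/Desktop/map.py /root/Desktop/reduce.py

$HADOOP_CMD jar $STREAM_JAR_PATH \
    -input $INPUT_FILE_PATH \
    -output $OUTPUT_PATH \
    -mapper map.py \
    -reducer reduce.py \
    -file /root/Desktop/map.py \
    -file /root/Desktop/reduce.py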

Reposted from blog.csdn.net/weixin_41734687/article/details/83444298