Implementing MapReduce operations in Python (an example of merging data that needs rearranging)

Disclaimer: this article is original content, all rights reserved. https://blog.csdn.net/weixin_41864878/article/details/91473360

There are already plenty of blog posts demoing the wordcount example, and we all know it's a very simple task; but whenever I ran into anything slightly more advanced my mind went blank. A related need came up today, so I took the chance to learn a bit.
http://www.zhangdongshengtech.com/article-detials/236
The link above is a word-frequency counting demo. It's very well written, and I'm sure that after reading it you will understand the core of how mapreduce code is written.

Intro: wordcount

Said up front: when debugging a mapreduce program you can run the mapper and the reducer separately; feed each one input in the format it expects, directly on the command line, and it will print its output.
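For example, a pipeline like the following (file names assumed) simulates the map, shuffle-sort and reduce stages locally:

cat input_data | python mapper.py | sort | python reducer.py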

mapper.py

The input file looks like this:

word1
word2
word1
word3

# coding=utf-8
import json
import sys

for line in sys.stdin:
    words = line.strip().split('|')
    try:
        # assumption: the last '|'-separated field is a JSON dict holding the 'uP_cat' history
        data = json.loads(words[-1])
        his = data['uP_cat']
        for vid, fre in his.items():
            if vid[0] != 'V':
                continue
            print(vid)
    except Exception:
        # skip malformed lines
        continue
reducer.py

What's implemented here is a simple per-key count of occurrences (in my case, the write frequency in the file).
If you only need the counting operation, you only have to change what mapper.py prints (see the minimal sketch after the reducer code below).

# coding=utf-8
import sys

count = 0
key = ""
current_key = ""

for line in sys.stdin:
    line = line.rstrip()
    if not line:
        sys.stderr.write("data is wrong\n")
        sys.exit(1)
    items = line.split("\t")
    current_key = items[3]    # the field used as the grouping key (position 3 in my data)
    cur_timestamp = items[2]  # timestamp field, kept for reference; not used in the plain count
    if current_key != key:
        # a new key begins: Hadoop has already sorted the reducer input by key,
        # so flush the count of the previous key and reset
        if key:
            print("%s\t%d" % (key, count))
        count = 0
        key = current_key
    count += 1

if key:
    print("%s\t%d" % (key, count))
run.sh

I can't really explain the runtime configuration in detail, because the script was written by someone else; I only modified the two pieces of code above.
Just change the input and output paths below to your own.

#!/bin/bash
HADOOP_bin='/your/path/hadoop-2.7.3/bin/hadoop'
INPUT_PATH="input_data"
OUTPUT_PATH="test"

$HADOOP_bin fs -rmr $OUTPUT_PATH


$HADOOP_bin jar /your/path/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
    -D mapred.job.priority="VERY_HIGH" \
    -D mapred.reduce.tasks=200 \
    -D mapreduce.job.queuename=root.online.default \
    -D mapred.job.map.capacity=400 \
    -D mapred.job.reduce.capacity=100 \
    -D mapred.job.name="test" \
    -D mapred.textoutputformat.ignoreseparator="true" \
    -input ${INPUT_PATH} \
    -output ${OUTPUT_PATH} \
    -file ./mapper.py \
    -file ./reducer.py \
    -partitioner "org.apache.hadoop.mapred.lib.HashPartitioner" \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -inputformat "org.apache.hadoop.mapred.TextInputFormat" \
    -outputformat "org.apache.hadoop.mapred.TextOutputFormat"

Advanced: merging values by key

The following merges the input lines on one of their fields (the key).

Input form:
key1 value1 value2
key2 value1 value2
key1 value3 value4

Output form:
key1 value1 value2 value3 value4 ............
key2 value1 value2 ............

mapper.py
#coding=utf8
import json
import sys
#f = open('part-07198', 'r') # for debugging: my mapreduce job runs under python2, and during
#                            # debugging sys.stdin receives no input, so I read the file directly
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    data = line.split('\t', 2)  # only split off the key; the values after it are not split further
    if len(data) <= 1:
        continue
    print('\t'.join(data))      # keep the output tab-separated so the reducer can split on '\t' again

If you're worried the data might have problems, you can wrap this in the try/except; any per-line processing you still need should also be done inside the mapper. When the data is to be merged on a particular field, arrange the output before the print so that that field comes first, and then hand it off to the reducer.
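As a sketch of that last point, suppose (hypothetically) that the field to merge on sits in the third column of the raw input; the mapper just moves it to the front before printing:

# sketch: move the (assumed) merge key from the 3rd column to the front,
# so the streaming shuffle groups records on it
import sys

for line in sys.stdin:
    try:
        fields = line.rstrip('\n').split('\t')
        key = fields[2]                  # hypothetical position of the merge key
        rest = fields[:2] + fields[3:]
        print('\t'.join([key] + rest))   # key first, then the remaining fields
    except Exception:
        continue                         # drop malformed lines instead of failing the task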

reducer.py
import json
import sys
from operator import itemgetter
from itertools import groupby

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 2)

def main(separator='\t'):
    #f = open('part-07198', 'r') # for debugging
    # input comes from STDIN (standard input), already sorted by key
    data = read_mapper_output(sys.stdin, separator=separator)
    for name, group in groupby(data, itemgetter(0)):
        val = []
        for values in group:
            # collect every value field of this record (everything after the key)
            val.extend(values[1:])
        print("%s\t%s" % (name, json.dumps(val)))

if __name__ == "__main__":
    main()

There are two very important functions here; once you understand them you'll know how to write the reducer, and from there you can implement fancier, more complex features.
groupby
https://blog.csdn.net/Together_CZ/article/details/73042997

for key, group in groupby(iterator, key=func):

groupby applies the key function to every element of the original iterator. Elements for which the key function returns the same result are gathered into a new sub-iterator, and each sub-iterator is tagged with that return value.
It's like having an iterator of people's heights. We could use a key function like this: if the height is above 180, return "tall"; if it is below 160, return "short"; otherwise return "middle". In the end everyone is divided into three sub-iterators tagged "tall", "short" and "middle".
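A tiny runnable illustration of that height example (my own sketch, not taken from the linked post):

from itertools import groupby

def label(h):
    if h > 180:
        return 'tall'
    if h < 160:
        return 'short'
    return 'middle'

heights = [175, 182, 158, 190, 165]
# groupby only merges *adjacent* items with the same key, so sort by the key first
# (in a streaming reducer this comes for free: Hadoop already sorts the input by key)
for key, group in groupby(sorted(heights, key=label), key=label):
    print(key, list(group))
# middle [175, 165]
# short [158]
# tall [182, 190]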

In other words, this function lets you aggregate the original iterator on any key you choose.
Now review the mapreduce principle again:
the Hadoop framework automatically routes records with the same key to the same reducer, and by default the key is the first part of each mapper output line, split on \t (or \001).
By this point you should get the idea: as long as the mapper puts the field we want to merge on first in its output, the reducer can aggregate directly with groupby.
So what does groupby output?
It outputs two parts: one is the key, the other is an iterator over all the records sharing that key (each record still contains the key field).
Have a look at the example in https://blog.csdn.net/LY_ysys629/article/details/72553273 and you should get an intuitive feel for groupby's output.
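As a concrete sketch of that output structure, using reducer-style rows (my own example):

from itertools import groupby
from operator import itemgetter

rows = [['key1', 'value1'], ['key1', 'value2'], ['key2', 'value3']]
for key, group in groupby(rows, key=itemgetter(0)):
    # 'group' is an iterator over the full rows that share this key
    print(key, list(group))
# key1 [['key1', 'value1'], ['key1', 'value2']]
# key2 [['key2', 'value3']]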
itemgetter
I think this blog explains it very well: https://blog.csdn.net/qq_22022063/article/details/79019294

What it does: itemgetter fetches the element at a given position in an object; the parameter you pass is the index of that position.

That is, with the parameter i, itemgetter(i) takes data[i] from data. And since it is itself a callable, it can be passed directly to groupby as the key argument.
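A minimal illustration:

from operator import itemgetter

row = ['key1', 'value1', 'value2']
get_key = itemgetter(0)
print(get_key(row))   # -> 'key1', i.e. row[0]

# because itemgetter(0) is a callable, it can be handed straight to groupby:
# groupby(rows, key=itemgetter(0))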

Handling Chinese characters

I've fallen into this pit twice, so let me record it:
(1) remember to put #coding=utf8 at the beginning of the file;
(2) if the strings in the saved file start with '\u', json.loads can parse them and recover the Chinese directly;
(3) if the strings in the saved file start with '\x'... you may need to decode('utf-8') them when parsing...
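A tiny sketch of cases (2) and (3), assuming the job runs under Python 2 as mine did:

# coding=utf-8
import json

# case (2): the file stores \u escapes -- json.loads recovers the Chinese characters directly
print(json.loads('"\\u4e2d\\u6587"'))   # -> 中文

# case (3): the file stores raw utf-8 bytes -- decode them first (Python 2 str)
raw = '\xe4\xb8\xad\xe6\x96\x87'
print(raw.decode('utf-8'))              # -> 中文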
