Hadoop program development with Python

Here is a word-count example.

1 First create mapper.py

mkdir /usr/local/hadoop-python
cd /usr/local/hadoop-python
vim mapper.py

mapper.py

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py;
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

After saving the file, remember to make it executable:

chmod a+x /usr/local/hadoop-python/mapper.py
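As an extra sanity check, the mapper's core logic can also be exercised as a plain Python function before involving Hadoop at all. A minimal sketch (the `map_line` helper is hypothetical, introduced only for this test, not part of mapper.py):

```python
def map_line(line):
    """Mimic mapper.py for a single input line: emit (word, 1) pairs."""
    return [(word, 1) for word in line.strip().split()]

# Each word is emitted with a count of 1, duplicates included;
# aggregation happens later, in the reduce step.
pairs = map_line("foo foo quux labs foo bar quux")
```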

2 Build reducer.py

vim reducer.py
#!/usr/bin/env python

import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))

After saving the file, remember to make it executable:

chmod a+x /usr/local/hadoop-python/reducer.py

Before submitting anything to Hadoop, test the scripts locally so that any problems surface early:

# echo "foo foo quux labs foo bar quux" | /usr/local/hadoop-python/mapper.py
Output:
foo	1
foo	1
quux	1
labs	1
foo	1
bar	1
quux	1

Then run the full pipeline, including reducer.py:

echo "foo foo quux labs foo bar quux" | /usr/local/hadoop-python/mapper.py | sort -k1,1 | /usr/local/hadoop-python/reducer.py
Output:
bar		1
foo		3
labs	1
quux	2
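Incidentally, the reducer's manual `current_word` bookkeeping can also be expressed with `itertools.groupby`, which groups consecutive equal keys and therefore relies on the same sorted-input guarantee as reducer.py. A sketch (`reduce_pairs` is a hypothetical helper, not part of the original scripts):

```python
import sys
from itertools import groupby

def reduce_pairs(lines):
    """Sum counts for consecutive runs of the same word.

    Expects tab-separated "word<TAB>count" lines already sorted by word,
    exactly as `sort -k1,1` delivers them to reducer.py.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)

if __name__ == "__main__":
    for word, total in reduce_pairs(sys.stdin):
        print("%s\t%s" % (word, total))
```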

3 Run Python code on Hadoop

Preparation:
Download two public-domain e-books to use as test input:

yum install wget -y
mkdir /usr/local/hadoop-python/input
cd /usr/local/hadoop-python/input
wget http://www.gutenberg.org/files/5000/5000-8.txt
wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt

Then upload the downloaded books to the HDFS file system:

# create a folder on HDFS for the input files
hdfs dfs -mkdir /input

# upload the document to the input folder on HDFS
hdfs dfs -put /usr/local/hadoop-python/input/pg20417.txt /input

Next, find the location of your streaming jar file. In Hadoop 2.x it is placed under the share directory; you can search the Hadoop installation directory for it:

cd $HADOOP_HOME
find ./ -name "*streaming*.jar"

Then you will find the hadoop-streaming*.jar files under the share folder:

./share/hadoop/tools/lib/hadoop-streaming-2.8.4.jar
./share/hadoop/tools/sources/hadoop-streaming-2.8.4-test-sources.jar
./share/hadoop/tools/sources/hadoop-streaming-2.8.4-sources.jar

Since this path is fairly long, we can store it in an environment variable:

vim /etc/profile
export STREAM=/usr/local/hadoop-2.8.4/share/hadoop/tools/lib/hadoop-streaming-2.8.4.jar

Then reload the profile so the variable takes effect in the current shell:

source /etc/profile

Because the command line for the streaming interface is long, put it in a shell script named run.sh:

vim run.sh
hadoop jar /usr/local/hadoop-2.8.4/share/hadoop/tools/lib/hadoop-streaming-2.8.4.jar \
-files /usr/local/hadoop-python/mapper.py,/usr/local/hadoop-python/reducer.py \
-mapper /usr/local/hadoop-python/mapper.py \
-reducer /usr/local/hadoop-python/reducer.py \
-input /input/pg20417.txt \
-output /output1
Or equivalently, using the $STREAM variable defined earlier:

hadoop jar $STREAM \
-files /usr/local/hadoop-python/mapper.py,/usr/local/hadoop-python/reducer.py \
-mapper /usr/local/hadoop-python/mapper.py \
-reducer /usr/local/hadoop-python/reducer.py \
-input /input/pg20417.txt \
-output /output1
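Once the job completes (run it with `sh run.sh`), the results land in the HDFS output directory. A sketch of how to inspect them, assuming the /output1 path from run.sh and a running HDFS (with a single reducer, the counts are typically in a part-00000 file, though the exact name can vary):

```shell
# list the job's output files; an empty _SUCCESS file marks a completed job
hdfs dfs -ls /output1

# print the word counts produced by reducer.py
hdfs dfs -cat /output1/part-00000
```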


Origin blog.csdn.net/zx77588023/article/details/110144295