Hadoop Python MapReduce

Environment: Linux + Hadoop + Python 3

Note the syntax differences between Python versions (e.g., print is a function in Python 3).

Problem to solve: word-frequency statistics over a text file.

Hadoop MapReduce processing flow:

input data -> HDFS -> data split -> map -> (shuffle & sort) -> reduce -> output (HDFS)

The input can be any text file.

Scripts to prepare:

map.py, reduce.py, and run.sh (note: run.sh and the job log below refer to the same mapper and reducer under the names maptest.py and reducetest.py)

(base) [root@pyspark mapreduce]# cat map.py
import sys

# Mapper: read lines from stdin and emit "word<TAB>1" for every word.
for line in sys.stdin:
    wordlist = line.strip().split(' ')
    for word in wordlist:
        print('\t'.join([word.strip(), '1']))

--------------------------------------------------------------------------

(base) [root@pyspark mapreduce]# cat reduce.py
import sys

# Reducer: Hadoop Streaming delivers the mapper output sorted by key,
# so all occurrences of the same word arrive consecutively.
cur_word = None
sum = 0

for line in sys.stdin:
    wordlist = line.strip().split('\t')
    if len(wordlist) != 2:
        continue
    word, cnt = wordlist

    if cur_word is None:
        cur_word = word
    if cur_word != word:
        # The key changed: emit the finished word and reset the counter.
        print('\t'.join([cur_word, str(sum)]))
        cur_word = word
        sum = 0
    sum += int(cnt)

# Flush the last word after the input is exhausted.
print('\t'.join([cur_word, str(sum)]))

----------------------------------------------------------------
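
Before submitting the job to the cluster, the two scripts can be sanity-checked locally with a Unix pipe that imitates the map -> sort -> reduce flow (a sketch, assuming a local copy of the input file named 1.data):

cat 1.data | python map.py | sort -k1 | python reduce.py | head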

(base) [root@pyspark wordcount]# cat run.sh

#!/bin/bash


HADOOP_CMD="/root/hadoop/hadoop-2.9.2/bin/hadoop"
STREAM_JAR_PATH="/root/hadoop/hadoop-2.9.2/hadoop-streaming-2.9.2.jar"

INPUT_FILE_PATH_1="/1.data"

OUTPUT_PATH="/output"

$HADOOP_CMD fs -rmr -skipTrash $OUTPUT_PATH

$HADOOP_CMD jar $STREAM_JAR_PATH \
-input $INPUT_FILE_PATH_1 \
-output $OUTPUT_PATH \
-mapper "python maptest.py" \
-reducer "python reducetest.py" \
-file ./maptest.py \
-file ./reducetest.py
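
Before run.sh is executed, the input must already exist on HDFS at the path given by INPUT_FILE_PATH_1. Assuming the raw text sits locally as ./1.data, it can be uploaded with something like:

hdfs dfs -put ./1.data /1.data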

-----------------------------------------------------------------

(base) [root@pyspark wordcount]# sh run.sh
rmr: DEPRECATED: Please use '-rm -r' instead.
Deleted /output
19/11/10 23:49:42 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./maptest.py, ./reducetest.py, /tmp/hadoop-unjar5889192628432952141/] [] /tmp/streamjob7228039301823824466.jar tmpDir=null
19/11/10 23:49:45 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/11/10 23:49:45 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/11/10 23:49:47 INFO mapred.FileInputFormat: Total input files to process : 1
19/11/10 23:49:48 INFO mapreduce.JobSubmitter: number of splits:2
19/11/10 23:49:48 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
19/11/10 23:49:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1573374330013_0006
19/11/10 23:49:49 INFO impl.YarnClientImpl: Submitted application application_1573374330013_0006
19/11/10 23:49:49 INFO mapreduce.Job: The url to track the job: http://pyspark:8088/proxy/application_1573374330013_0006/
19/11/10 23:49:49 INFO mapreduce.Job: Running job: job_1573374330013_0006
19/11/10 23:50:04 INFO mapreduce.Job: Job job_1573374330013_0006 running in uber mode : false
19/11/10 23:50:04 INFO mapreduce.Job: map 0% reduce 0%
19/11/10 23:50:18 INFO mapreduce.Job: map 50% reduce 0%
19/11/10 23:50:19 INFO mapreduce.Job: map 100% reduce 0%
19/11/10 23:50:30 INFO mapreduce.Job: map 100% reduce 100%
19/11/10 23:50:32 INFO mapreduce.Job: Job job_1573374330013_0006 completed successfully
19/11/10 23:50:33 INFO mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=3229845
FILE: Number of bytes written=7065383
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1900885
HDFS: Number of bytes written=183609
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Killed map tasks=1
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=24785
Total time spent by all reduces in occupied slots (ms)=9853
Total time spent by all map tasks (ms)=24785
Total time spent by all reduce tasks (ms)=9853
Total vcore-milliseconds taken by all map tasks=24785
Total vcore-milliseconds taken by all reduce tasks=9853
Total megabyte-milliseconds taken by all map tasks=25379840
Total megabyte-milliseconds taken by all reduce tasks=10089472
Map-Reduce Framework
Map input records=8598
Map output records=335454
Map output bytes=2558931
Map output materialized bytes=3229851
Input split bytes=168
Combine input records=0
Combine output records=0
Reduce input groups=16985
Reduce shuffle bytes=3229851
Reduce input records=335454
Reduce output records=16984
Spilled Records=670908
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=687
CPU time spent (ms)=10610
Physical memory (bytes) snapshot=731275264
Virtual memory (bytes) snapshot=6418624512
Total committed heap usage (bytes)=504365056
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1900717
File Output Format Counters
Bytes Written=183609
19/11/10 23:50:33 INFO streaming.StreamJob: Output directory: /output

---------------------------------------------------------------------------------------------------

(base) [root@pyspark wordcount]# hdfs dfs -ls /output
Found 2 items
-rw-r--r-- 1 root supergroup 0 2019-11-10 23:50 /output/_SUCCESS
-rw-r--r-- 1 root supergroup 183609 2019-11-10 23:50 /output/part-00000
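
Once the job finishes, the word counts can be read straight from HDFS; for example, something like the following lists the ten most frequent words:

hdfs dfs -cat /output/part-00000 | sort -k2,2nr | head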

============================================================

Second example: daily stock price-change data (each line is a trading date and that day's percentage change), plus mapper.py, reducer.py, and run.sh.

(base) [root@pyspark stock]# head 300340.csv
2016-12-30,-1
2016-12-29,0
2016-12-28,-1
2016-12-27,-2
2016-12-26,-4
2016-12-23,-4
2016-12-22,-3
2016-12-21,1
2016-12-20,3
2016-12-19,-2

--------------------------------------------------------------------------------------------------------

mapper.py

(base) [root@pyspark stock]# cat mapper.py
#!/usr/bin/python

import sys

# Mapper: pass each "date,change" record through unchanged, while
# validating that the change field is an integer.
for line in sys.stdin:
    line = line.strip()
    date, change = line.split(',')
    change = int(change)
    print('%s,%d' % (date, change))

-----------------------------------------------------------------------------------------------------

(base) [root@pyspark stock]# cat reducer.py
#!/usr/bin/python

import sys

# Reducer: count how many trading days fall into each 1% bucket of daily change.
stat = {}
for line in sys.stdin:
    line = line.strip()
    date, change = line.split(',')
    change = int(change)
    if change not in stat:
        stat[change] = 0
    stat[change] += 1

# One line per bucket, e.g. k = -3, v = 15 prints "--- 3%-4% for 15 days".
for k, v in sorted(stat.items()):
    if k >= 0:
        print('+++ %d%%-%d%% for %d days' % (k, k + 1, v))
    else:
        print('--- %d%%-%d%% for %d days' % (-k, -(k - 1), v))

--------------------------------------------------------------------------------------------------

(base) [root@pyspark stock]# cat run.sh

#!/bin/bash
HADOOP_CMD="/root/hadoop/hadoop-2.9.2/bin/hadoop"
STREAM_JAR_PATH="/root/hadoop/hadoop-2.9.2/hadoop-streaming-2.9.2.jar"

INPUT_FILE_PATH_1="/stock_input/300340.csv"

OUTPUT_PATH="/stock_output"

$HADOOP_CMD fs -rmr -skipTrash $OUTPUT_PATH

$HADOOP_CMD jar $STREAM_JAR_PATH \
-input $INPUT_FILE_PATH_1 \
-output $OUTPUT_PATH \
-mapper "python mapper.py" \
-reducer "python reducer.py" \
-file ./mapper.py \
-file ./reducer.py
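
As with the first example, the input must be on HDFS before run.sh is executed; assuming the CSV is in the current directory, something like:

hdfs dfs -mkdir -p /stock_input
hdfs dfs -put ./300340.csv /stock_input/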

------------------------------------------------------------------------------------------------

(base) [root@pyspark stock]# sh run.sh
rmr: DEPRECATED: Please use '-rm -r' instead.
Deleted /stock_output
19/11/11 00:02:06 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./mapper.py, ./reducer.py, /tmp/hadoop-unjar8953104380979231231/] [] /tmp/streamjob1591230460326040415.jar tmpDir=null
19/11/11 00:02:09 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/11/11 00:02:10 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/11/11 00:02:12 INFO mapred.FileInputFormat: Total input files to process : 1
19/11/11 00:02:12 INFO mapreduce.JobSubmitter: number of splits:2
19/11/11 00:02:13 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
19/11/11 00:02:14 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1573374330013_0007
19/11/11 00:02:15 INFO impl.YarnClientImpl: Submitted application application_1573374330013_0007
19/11/11 00:02:15 INFO mapreduce.Job: The url to track the job: http://pyspark:8088/proxy/application_1573374330013_0007/
19/11/11 00:02:15 INFO mapreduce.Job: Running job: job_1573374330013_0007
19/11/11 00:02:30 INFO mapreduce.Job: Job job_1573374330013_0007 running in uber mode : false
19/11/11 00:02:30 INFO mapreduce.Job: map 0% reduce 0%
19/11/11 00:02:43 INFO mapreduce.Job: map 100% reduce 0%
19/11/11 00:02:55 INFO mapreduce.Job: map 100% reduce 100%
19/11/11 00:02:57 INFO mapreduce.Job: Job job_1573374330013_0007 completed successfully
19/11/11 00:02:57 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=2703
FILE: Number of bytes written=611093
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=3517
HDFS: Number of bytes written=475
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=21325
Total time spent by all reduces in occupied slots (ms)=10182
Total time spent by all map tasks (ms)=21325
Total time spent by all reduce tasks (ms)=10182
Total vcore-milliseconds taken by all map tasks=21325
Total vcore-milliseconds taken by all reduce tasks=10182
Total megabyte-milliseconds taken by all map tasks=21836800
Total megabyte-milliseconds taken by all reduce tasks=10426368
Map-Reduce Framework
Map input records=162
Map output records=162
Map output bytes=2373
Map output materialized bytes=2709
Input split bytes=200
Combine input records=0
Combine output records=0
Reduce input groups=162
Reduce shuffle bytes=2709
Reduce input records=162
Reduce output records=21
Spilled Records=324
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=984
CPU time spent (ms)=7240
Physical memory (bytes) snapshot=740073472
Virtual memory (bytes) snapshot=6417063936
Total committed heap usage (bytes)=502267904
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=3317
File Output Format Counters
Bytes Written=475
19/11/11 00:02:57 INFO streaming.StreamJob: Output directory: /stock_output

-----------------------------------------------------------------------------------------------

(base) [root@pyspark stock]# hdfs dfs -ls /stock_input/
Found 1 items
-rw-r--r-- 1 root supergroup 2211 2019-11-10 17:24 /stock_input/300340.csv

---------------------------------------------------------------------------------------------

(base) [root@pyspark stock]# hdfs dfs -ls /stock_output/
Found 2 items
-rw-r--r-- 1 root supergroup 0 2019-11-11 00:02 /stock_output/_SUCCESS
-rw-r--r-- 1 root supergroup 475 2019-11-11 00:02 /stock_output/part-00000

-----------------------------------------------------------------------------------------------------

Local debugging on a single machine (testing the map and reduce steps separately):

(base) [root@pyspark stock]# cat 300340.csv |python mapper.py |sort -k1|head
2016-01-04,-10
2016-01-05,-9
2016-01-06,1
2016-01-07,-10
2016-01-08,-1
2016-01-11,-10
2016-01-12,-2
2016-01-13,-6
2016-01-14,3
2016-01-15,-4

(base) [root@pyspark stock]# cat 300340.csv |python mapper.py |sort -k1|python reducer.py
--- 10%-11% for 5 days
--- 9%-10% for 1 days
--- 8%-9% for 3 days
--- 7%-8% for 3 days
--- 6%-7% for 6 days
--- 5%-6% for 7 days
--- 4%-5% for 7 days
--- 3%-4% for 15 days
--- 2%-3% for 15 days
--- 1%-2% for 23 days
+++ 0%-1% for 16 days
+++ 1%-2% for 13 days
+++ 2%-3% for 10 days
+++ 3%-4% for 9 days
+++ 4%-5% for 6 days
+++ 5%-6% for 2 days
+++ 6%-7% for 2 days
+++ 7%-8% for 1 days
+++ 8%-9% for 1 days
+++ 9%-10% for 2 days
+++ 10%-11% for 15 days
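
The local result can be cross-checked against what the cluster job wrote to HDFS, for example:

hdfs dfs -cat /stock_output/part-00000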

Reposted from www.cnblogs.com/songyuejie/p/11828592.html