Hadoop+python入门集成测试代码

版权声明:转载请说明来源,谢谢 https://blog.csdn.net/wsp_1138886114/article/details/81173000

一 词频统计WordCount(类似TF)

属于大数据框架中 最经典的案例:
统计文件中每个单词出现的个数

1.1、准备数据
# 创建目录
$ hdfs dfs -mkdir -p /user/cloudera/wordcount/input 

#安装上传模块
$ sudo yum install -y lrzsz 

#创建文件,并编辑
$ touch wordcount_mapper.py  
$ touch wordcount_reducer.py
$ vim wordcount_mapper.py
$ vim wordcount_reducer.py

#上传测试模块(文件)
rz  wordcount_mapper.py
rz  wordcount_reducer.py 

# 上传数据文件
$ hdfs dfs -put /home/cloudera/bigdata/wc.data /user/cloudera/wordcount/input 

1.2、使用Python编码,实现wordcount统计
    - input 
        要读取处理分析数据的路径 
        默认情况下,一行一行的读取文件中的数据 
    - mapper
        函数 ,进行分析处理
    - reducer
        函数,合并map函数输出的结果 
    
    当使用Python开发完成mapper.py和reduce.py脚本以后,在linux系统上要执行的话,需要给予执行权限,命令如何:
    $ chmod u+x wordcount_mapper.py
    $ chmod u+x wordcount_reducer.py

    本地测试:
    - wordcount_mapper.py
        $ echo "hadoop mapreduce mapreduce python" | python wordcount_mapper.py
    - wordcount_reducer.py
        $ echo "hadoop mapreduce mapreduce python" | python  wordcount_mapper.py | python wordcount_reducer.py
1.3、运行Python编写的WordCountt在YARN
hadoop jar \
/usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.12.0.jar \
-files /home/cloudera/word_count/wordcount_mapper.py,\
/home/cloudera/word_count/wordcount_reducer.py \
-mapper "python wordcount_mapper.py" \
-reducer "python wordcount_reducer.py" \
-input /user/cloudera/wordcount/input.wc \
-output /user/cloudera/wordcount/output
----------------------------------------------------------------
    -a. 第一点:
        提交运行Hadoop 中MapReduce运行在YARN上
        hadoop jar / yarn jar 
    -b. 第二点:
        -files 参数 将Python编写脚本文件上传到集群上,以便集群中各个集群下载使用
        要求集群中各个机器上必须按照同一版本、同一目录的Python
    -c. 第三点:
        指定 input、output、mapper和reducer各个参数的值

这里写图片描述

二、IBM股票价格数据:

历史股票价格数据:https://finance.yahoo.com/quote/IBM/history?p=IBM

	1962-01-02,7.713333,7.713333,7.626667,7.626667,0.689273,387200

    日期,开盘价,最高价,最低价,收盘价,调整的收盘价和交易量
    Date,Open,High,Low,Close,Adj Close,Volume

	统计每日变化百分比总数
	    每日变化百分比 = (开盘价-收盘价)/ 开盘价
	
	1. 统计每日变化百分比  
	    0.23%   0.24%  -0.13% 
2.1开发程序:
-a. 创建目录stock,拷贝wordcount_mapper.py和wordcount_reducer.py文件至stock目录
    重命名文件:
    $ mv wordcount_mapper.py stock_mapper.py
    $ mv wordcount_reducer.py stock_reducer.py
-b. 开发代码
-c. 本地测试:
    $ more stock-ibm.csv | python stock_mapper.py | sort -k1| python stock_reducer.py > stock-ibm.output
-d. 集群测试
    读取HDFS上的数据,将程序提交运行在YARN上。 

------------------------------------------------------------------------------
#/user/bin/python
#encoding:utf-8

import sys

for line in  sys.stdin:
	row = line.split(","):
	open_price = float(row[1])
	close_price = float(row[-3])
	change = (open_price-close_price)/open_price*100
	change_text = str(round(change,1))+"%"
		print "%s\t%d" % (change_text,1) 
#!/user/bin/python
# encoding:utf-8

import sys
current_word = None
current_count = 1
	for line in sys.stdin:
		word,count = line.strip().split('\t')
		#判断当前是否存在单词
		if current_word:
			if word == current_word:
				current_count += int(count)
			else:
				print "%s\t%d"%(current_word,current_count)
				current_count =1
				
		current_word = word    #赋值当前单词
	
	if current_count >=1:    #处理读出最后一行数据
		print "%s\t%d"%(current_word,current_count)

猜你喜欢

转载自blog.csdn.net/wsp_1138886114/article/details/81173000