PySpark exercise: extract IPs from a log file and print the top 5 IPs by access count

Obtain the test log file and inspect it:

27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/faq.gif HTTP/1.1" 200 1127
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /data/cache/style_1_widthauto.css?y7a HTTP/1.1" 200 1292
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/hot_1.gif HTTP/1.1" 200 680
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/hot_2.gif HTTP/1.1" 200 682
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/filetype/common.gif HTTP/1.1" 200 90
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /source/plugin/wsh_wx/img/wsh_zk.css HTTP/1.1" 200 1482
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /data/cache/style_1_forum_index.css?y7a HTTP/1.1" 200 2331

Each log line begins with the client IP, so splitting the line on a space (" ") is enough to extract it.
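As a quick sanity check in plain Python (no Spark needed; the sample line is copied from the log above), splitting on a space leaves the IP in the first field:

line = '27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/faq.gif HTTP/1.1" 200 1127'
fields = line.split(' ')
print(fields[0])  # 27.19.74.143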

Develop a wordcount-style job in a local IDE.
The code is as follows:
import os
import sys
from pyspark import SparkConf, SparkContext
from operator import add

# Point PySpark at the Python interpreter installed on the host
os.environ['PYSPARK_PYTHON'] = '/home/hadoop/app/python3/bin/python3'

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print('Usage: TopN <input file>', file=sys.stderr)
        sys.exit(-1)
    # Initialize Spark
    conf = SparkConf()
    sc = SparkContext(conf=conf)

    # Split each log line on spaces
    data = sc.textFile(sys.argv[1]).map(lambda x: x.split(' '))
    # Map each IP (the first field) to (ip, 1)
    ip = data.map(lambda x: (x[0], 1))
    # Sum the counts for identical IPs
    count_ip = ip.reduceByKey(add)
    # Sort by count in descending order, then restore (ip, count) ordering
    sort = count_ip.map(lambda x: (x[1], x[0])).sortByKey(False).map(lambda x: (x[1], x[0]))
    # Print the top 5 to the console
    print(sort.take(5))

    sc.stop()
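As a side note, PySpark also provides takeOrdered, which can pull the five largest counts directly without the flip-sort-flip sequence; this is just an equivalent alternative to the sorting step above, not part of the original script:

# Equivalent to sort.take(5): the 5 (ip, count) pairs with the largest counts
top5 = count_ip.takeOrdered(5, key=lambda x: -x[1])
print(top5)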

Run the spark-submit command:

./spark-submit --master local[2] --name loganglice /home/hadoop/data/5/log.py hdfs:///test/access_2013_05_30.log 

Adjust the input path to match your own file location; here it points to a file on my personal HDFS.
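If the log file sits on the local filesystem rather than HDFS, the same script can be pointed at a file:// URI instead (the local path below is only an illustrative placeholder):

./spark-submit --master local[2] --name loganglice /home/hadoop/data/5/log.py file:///home/hadoop/data/access_2013_05_30.log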
Wait for the job to finish and produce its output.
The results are as follows:
We can see that IP 222.133.189.179 has the most accesses, 29948 in total;
followed by 61.50.141.7 with 22836;
third is 123.147.245.79 with 9999;
fourth is 49.72.74.77 with 8879;
and fifth is 60.10.5.65 with 6341.


Reposted from blog.csdn.net/weixin_43267534/article/details/82833238