Spark WordCount

1.1 Create the test file

$ cd ~/ipynotebook/
$ mkdir data
$ cd data/
$ vim word.txt
$ tail word.txt 
hadoop spark hive
hive java python
spark perl hadoop
python RDD spark
RDD 

1.2 Write the Spark WordCount program

  • Write the WordCount program
$ vim wordcount.py 

#!/usr/bin/env python

from pyspark import SparkContext, SparkConf

# Run locally with a single worker thread.
conf = SparkConf().setMaster("local").setAppName("pyspark WordCount")
sc = SparkContext(conf=conf)

# Read the input file and split each line into words.
textFile = sc.textFile("data/word.txt")
stringRDD = textFile.flatMap(lambda line: line.split(" "))

# Map each word to (word, 1), then sum the counts per word.
countsRDD = stringRDD.map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)

# Write one result file per partition under data/output.
countsRDD.saveAsTextFile("data/output")
sc.stop()
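
Note that saveAsTextFile raises an error if data/output already exists, so the directory has to be removed before re-running. For a quick interactive check, the same pipeline can print to the console with collect() instead of writing to disk (a minimal sketch, not part of the original script; the app name here is made up):

#!/usr/bin/env python

from pyspark import SparkContext, SparkConf

conf = SparkConf().setMaster("local").setAppName("pyspark WordCount check")
sc = SparkContext(conf=conf)

# Same pipeline as wordcount.py, but collected to the driver and printed.
counts = (sc.textFile("data/word.txt")
            .flatMap(lambda line: line.split(" "))
            .map(lambda word: (word, 1))
            .reduceByKey(lambda x, y: x + y))

for word, n in counts.collect():
    print(word, n)

sc.stop()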
  • Run the program with spark-submit
$ spark-submit wordcount.py
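
Because the script hard-codes setMaster("local"), a master passed on the command line is ignored: properties set directly on SparkConf take precedence over spark-submit flags. Dropping the setMaster call from the script lets the master be chosen at submit time, for example (local[2] here is just an illustration):

$ spark-submit --master local[2] wordcount.py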
  • Check the results
$ cd ~/ipynotebook/data/
$ tree
.
├── output
│   ├── part-00000
│   └── _SUCCESS
└── word.txt

1 directory, 3 files
$ tail output/part-00000 
('hadoop', 2)
('spark', 3)
('hive', 2)
('java', 1)
('python', 2)
('perl', 1)
('RDD', 2)
('', 1)

The ('', 1) entry is the empty token created by the trailing space after "RDD" in word.txt: split(" ") turns that trailing space into an empty string, which is then counted like any other word.
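If the empty-string count is unwanted, a filter step that drops empty tokens could be inserted before the map in wordcount.py (a one-line sketch, not in the original program):

stringRDD = textFile.flatMap(lambda line: line.split(" ")).filter(lambda word: word)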

Reprinted from blog.51cto.com/balich/2132267