Creating pair RDDs
- map
>>> lines = sc.textFile("/input/README.md")
>>> lines.count()
104
>>> pairs = lines.map(lambda x: (x.split(" ")[0], x))
Transformations on pair RDDs
Besides the transformations available on ordinary RDDs (map, filter, and so on),
>>> pairs = pairs.filter(lambda kv: len(kv[1]) < 20 and len(kv[1])>0)
>>> pairs.take(10)
[(u'#', u'# Apache Spark'), (u'##', u'## Building Spark'), (u'', u' ./bin/pyspark'), (u'##', u'## Example Programs'), (u'##', u'## Running Tests'), (u'can', u'can be run using:'), (u'', u' ./dev/run-tests'), (u'##', u'## Configuration'), (u'##', u'## Contributing')]
pair RDDs mainly add transformations that operate on the keys and values:
Transformations on a single RDD
- reduceByKey(func)
- groupByKey()
- combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner)
- mapValues(func), equivalent to map(lambda kv: (kv[0], func(kv[1])))
- flatMapValues(func)
- keys()
- values()
- sortByKey()
- …
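The single-RDD transformations above can be illustrated with a small pure-Python sketch (plain lists of key-value tuples, not Spark itself); the function names `reduce_by_key`, `group_by_key`, and `map_values` are hypothetical local analogues of the RDD operations:

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

def reduce_by_key(kvs, func):
    # Fold all values that share a key with func, like rdd.reduceByKey(func).
    acc = {}
    for k, v in kvs:
        acc[k] = func(acc[k], v) if k in acc else v
    return sorted(acc.items())

def group_by_key(kvs):
    # Collect every value for a key into a list, like rdd.groupByKey().
    groups = defaultdict(list)
    for k, v in kvs:
        groups[k].append(v)
    return sorted(groups.items())

def map_values(kvs, func):
    # Apply func to the value only; the key passes through, like rdd.mapValues(func).
    return [(k, func(v)) for k, v in kvs]

print(reduce_by_key(pairs, lambda x, y: x + y))  # [('a', 4), ('b', 6)]
print(group_by_key(pairs))                       # [('a', [1, 3]), ('b', [2, 4])]
print(map_values(pairs, lambda v: v * 10))       # [('a', 10), ('b', 20), ('a', 30), ('b', 40)]
```

Note the design difference this illustrates: reduceByKey merges values as it goes (cheap to shuffle), while groupByKey materializes every value for a key, which is why reduceByKey is usually preferred for aggregations like word count.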
Transformations between two RDDs
- subtractByKey
- join
- rightOuterJoin
- leftOuterJoin
- cogroup
- …
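The join semantics can likewise be sketched in pure Python (again a local analogue, not Spark): `join` keeps only keys present on both sides and emits one pair per value combination, while `left_outer_join` keeps every left key, pairing unmatched ones with None. The helper names here are hypothetical:

```python
from collections import defaultdict

def _index(kvs):
    # Index the right side by key so lookups are O(1).
    idx = defaultdict(list)
    for k, v in kvs:
        idx[k].append(v)
    return idx

def join(left, right):
    # Inner join, like rdd1.join(rdd2): only keys present in both RDDs survive.
    rv = _index(right)
    return sorted((k, (v, w)) for k, v in left for w in rv.get(k, []))

def left_outer_join(left, right):
    # Like rdd1.leftOuterJoin(rdd2): every left key appears at least once.
    rv = _index(right)
    out = []
    for k, v in left:
        if k in rv:
            out.extend((k, (v, w)) for w in rv[k])
        else:
            out.append((k, (v, None)))  # unmatched left key pairs with None
    return sorted(out)

a = [("x", 1), ("y", 2)]
b = [("x", 10), ("x", 20)]
print(join(a, b))             # [('x', (1, 10)), ('x', (1, 20))]
print(left_outer_join(a, b))  # [('x', (1, 10)), ('x', (1, 20)), ('y', (2, None))]
```

rightOuterJoin mirrors leftOuterJoin with the sides swapped, and cogroup returns, for each key, the full list of values from each RDD rather than one pair per combination.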
wordcount
>>> rdd = sc.textFile('/input/README.md')
>>> rdd
/input/README.md MapPartitionsRDD[12] at textFile at NativeMethodAccessorImpl.java:0
>>> rdd.count()
104
>>> words = rdd.flatMap(lambda x: x.split(' '))
>>> words.take(10)
[u'#', u'Apache', u'Spark', u'', u'Spark', u'is', u'a', u'fast', u'and', u'general']
>>> result = words.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)
>>> result.take(10)
[(u'', 72), (u'when', 1), (u'R,', 1), (u'including', 4), (u'computation', 1), (u'Kubernetes', 1), (u'using:', 1), (u'guidance', 2), (u'Scala,', 1), (u'environment', 1)]
or
>>> wordcount = rdd.flatMap(lambda x: x.split(' ')).countByValue()
>>> zip(wordcount.keys(), wordcount.values())[:10]
[(u'', 72), (u'project.', 1), (u'help', 1), (u'storage', 1), (u'Once', 1), (u'Hadoop', 3), (u'not', 1), (u'./dev/run-tests', 1), (u'including', 4), (u'same', 1)]
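The transcript above is Python 2 (where dict.keys() returns a sliceable list); in Python 3 the last line would need something like list(wordcount.items())[:10]. Locally, countByValue is equivalent to collections.Counter over the flattened words, which this small sketch shows with made-up input lines:

```python
from collections import Counter

# Hypothetical sample lines standing in for the README contents.
lines = ["# Apache Spark", "", "Spark is a fast and general"]

# flatMap(lambda x: x.split(' ')) followed by countByValue(), done locally:
words = [w for line in lines for w in line.split(" ")]
wordcount = Counter(words)

print(wordcount["Spark"])  # 2
print(wordcount[""])       # 1 (the empty line splits into one empty string)
```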