Use Jupyter Notebook as an interactive tool; the code below is written in Python.
Code
sc.textFile() loads file data and returns an RDD of lines.
words = sc.textFile('/data/word.txt')
result:
/data/word.txt MapPartitionsRDD[3] at textFile at NativeMethodAccessorImpl.java:0
Note that the file has not actually been read yet: textFile() is a transformation, and Spark evaluates transformations lazily. Trigger the computation with an action:
words.first()
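Lazy evaluation can be illustrated with plain Python generators (a local analogy only, not Spark itself; the file contents here are made up for illustration). Building the pipeline does no work; consuming it does:

```python
# Local analogy for Spark's lazy evaluation using Python generators.
# This is NOT Spark code; it only mimics the lazy behavior.

log = []

def read_lines():
    # Pretend to read lines from a file; record when a read actually happens.
    for line in ["paper is a sheet material", "paper is produced"]:
        log.append("read")
        yield line

# Building the pipeline performs no reads, like sc.textFile() + transformations.
pipeline = (word for line in read_lines() for word in line.split(" "))
assert log == []  # nothing has been read yet

first_word = next(pipeline)  # an "action": pulls one element through
assert first_word == "paper"
assert log == ["read"]  # exactly one line was read to produce it
```

Similarly, words.first() only reads as much of the file as needed to return the first line.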
Word frequency statistics:
wordCount = words.flatMap(lambda line:line.split(" ")).map(lambda word:(word,1)).\
reduceByKey(lambda a, b: a+b)
wordCount.collect()
result:
[('is', 42),
('sheet', 3),
('material', 6),
('produced', 7),
('mechanically', 2),
('and/or', 1),
('cellulose', 3),
('derived', 3),
('rags,', 1)]
The flatMap() operation splits each line into words and flattens the per-line word lists into one large collection of words. map() then turns each word into a (word, 1) pair, so the resulting RDD's elements are in (key, value) form. Finally, reduceByKey() groups the pairs by key and adds up the values of the same key.
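The three steps can be sketched in plain Python (a local illustration of the semantics, not Spark code; the sample lines are hypothetical):

```python
from collections import defaultdict

lines = ["paper is a sheet material", "paper is a material"]

# flatMap: split each line into words, then flatten into one word list
words = [w for line in lines for w in line.split(" ")]

# map: pair each word with the count 1
pairs = [(w, 1) for w in words]

# reduceByKey: group pairs by key and sum the values of the same key
counts = defaultdict(int)
for key, value in pairs:
    counts[key] += value

assert counts["paper"] == 2
assert counts["sheet"] == 1
```

Spark does the same work, but distributed across partitions, with reduceByKey() combining partial sums per key.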