Word frequency counting with Spark 2.4.5 (Python)

A Jupyter notebook is used as the interactive environment; the code is written in Python.

Code

sc.textFile() loads the file's contents into an RDD:

words = sc.textFile('/data/word.txt')

result:

/data/word.txt MapPartitionsRDD[3] at textFile at NativeMethodAccessorImpl.java:0

Only a reference to the RDD is returned, not the file's contents; this is due to Spark's lazy evaluation. Transformations are computed only when an action is triggered, for example:

words.first()
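Spark's laziness can be illustrated with a plain-Python analogy (this is not Spark code): a generator expression also defers all work until its results are consumed, just as transformations defer work until an action runs.

```python
# Pure-Python analogy for lazy evaluation (illustration only, not Spark code).
lines = ["paper is a sheet material", "paper is produced mechanically"]

# Like a transformation: building the generator computes nothing yet.
lazy_words = (word for line in lines for word in line.split(" "))

# Like an action: pulling the first element forces evaluation.
first_word = next(lazy_words)
print(first_word)  # -> 'paper'
```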

Word frequency statistics:

wordCount = words.flatMap(lambda line: line.split(" ")) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)
wordCount.collect()

result:

[('is', 42),
 ('sheet', 3),
 ('material', 6),
 ('produced', 7),
 ('mechanically', 2),
 ('and/or', 1),
 ('cellulose', 3),
 ('derived', 3),
 ('rags,', 1)]
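Once collect() has returned an ordinary Python list, standard Python tools apply. For example, the pairs can be sorted by count to find the most frequent words (a sketch using a subset of the sample output above):

```python
# A subset of the (word, count) pairs returned by collect() above.
pairs = [('is', 42), ('sheet', 3), ('material', 6),
         ('produced', 7), ('mechanically', 2)]

# Sort by the count (second element of each pair), descending.
top = sorted(pairs, key=lambda kv: kv[1], reverse=True)[:3]
print(top)  # -> [('is', 42), ('produced', 7), ('material', 6)]
```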

The flatMap() operation flattens the per-line lists of words into one large collection of words; map() then turns each word into an element of the form (key, value), here (word, 1). Finally, reduceByKey() groups the pairs by key and adds up the values of the same key.
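The flatMap → map → reduceByKey pipeline can be mimicked in plain Python, which is handy for checking the logic on small inputs without a Spark cluster (a sketch, not the Spark API):

```python
from collections import defaultdict

def word_count(lines):
    # flatMap: split every line into words and flatten into one list
    words = [word for line in lines for word in line.split(" ")]
    # map + reduceByKey: pair each word with 1, then sum per key
    counts = defaultdict(int)
    for word in words:
        counts[word] += 1
    return dict(counts)

sample = ["paper is a sheet material", "paper is produced mechanically"]
print(word_count(sample))
# -> {'paper': 2, 'is': 2, 'a': 1, 'sheet': 1,
#     'material': 1, 'produced': 1, 'mechanically': 1}
```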




Origin blog.csdn.net/rosefun96/article/details/105491660