03 Word frequency statistics with Spark [Python]

This section shows how to perform word frequency statistics with Python on Spark.

1 System, software, and prerequisites

  • A CentOS 7 64-bit workstation with IP 192.168.100.200 and hostname danji; readers should adjust these to their own environment
  • The Scala version of word frequency statistics has already been completed:
    https://www.jianshu.com/p/92257e814e59
  • The file whose words are to be counted has been uploaded to HDFS under the name /word (see the upload example after this list)
  • To rule out permission issues, all operations are performed as root
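
For example, assuming the words to count are in a local file named word.txt (a hypothetical name), the file can be uploaded and verified with Hadoop's command-line client:

hdfs dfs -put word.txt /word
hdfs dfs -cat /word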

2 Procedure

  • 1. Log in to 192.168.100.200 as root using Xshell
  • 2. Go to Spark's bin directory and create a new file wordcount.py with the following contents:
from operator import add
from pyspark import SparkContext

def word_count():
    sc = SparkContext(appName="wordcount")
    # Read the input file previously uploaded to HDFS
    textFile = sc.textFile("/word")
    # Split each line into words, map each word to a count of 1,
    # sum the counts per word, then take the 3 most frequent words
    result = textFile.flatMap(lambda x: x.split(" ")) \
        .map(lambda x: (x, 1)) \
        .reduceByKey(add) \
        .sortBy(lambda x: x[1], False).take(3)
    for k, v in result:
        print(k, v)  # print() works under Python 3; "print k, v" is Python 2 only
    sc.stop()

if __name__ == '__main__':
    word_count()

Save and exit.
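
As a variant, if you want to keep the complete result instead of only the top three words, a minimal sketch (assuming a hypothetical output path /word_counts that does not yet exist on HDFS) could save the sorted counts with saveAsTextFile:

from operator import add
from pyspark import SparkContext

def word_count_to_hdfs():
    sc = SparkContext(appName="wordcount-save")
    # Same counting pipeline as above, but without take(3)
    counts = sc.textFile("/word") \
        .flatMap(lambda x: x.split(" ")) \
        .map(lambda x: (x, 1)) \
        .reduceByKey(add) \
        .sortBy(lambda x: x[1], False)
    # /word_counts is a hypothetical output path; saveAsTextFile
    # fails if the directory already exists
    counts.saveAsTextFile("/word_counts")
    sc.stop()

if __name__ == '__main__':
    word_count_to_hdfs()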

  • 3. Run:
./spark-submit --master local wordcount.py
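
If Spark is also running as a standalone cluster on this host, the job can be submitted to the cluster master instead of local mode; the URL below assumes the default standalone master port 7077 on danji:

./spark-submit --master spark://danji:7077 wordcount.py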

Wait for the job to finish and check the results.
The above is the full process of word frequency statistics with Python on Spark. Readers should pay special attention to Python syntax constraints in the script, in particular the differences between Python 2 and Python 3.
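
As a concrete illustration of that caveat, the Python 2 print statement is a syntax error under Python 3, which is why the script above uses the print() function:

# Python 2 print statement (invalid syntax under Python 3):
#   print k, v
# Python 3 print function:
k, v = "spark", 3  # illustrative values only
print(k, v)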
