02 Interactive word frequency statistics with Spark and Scala

Spark is already installed on CentOS 7. This section shows how to compute word frequency statistics interactively in Scala through the Spark shell.

1 System, software, and prerequisites

  • A 64-bit CentOS 7 workstation. In this example its IP is 192.168.100.200 and its host name is danji; readers should substitute the values for their own environment
  • Hadoop has been installed and started
    https://www.jianshu.com/p/b7ae3b51e559
  • Spark has been installed and started
    https://www.jianshu.com/p/8384ab76e8d4
  • To rule out permission problems, all operations are performed as root

2 Operation

  • 1. Log in to 192.168.100.200 as root using Xshell
  • 2. Create a new file, enter some strings, and upload it to HDFS
# Go to Hadoop's bin directory
cd /root/hadoop-2.5.2/bin
# Edit a file named word, add the following content, then save and exit
I am zhangli
I am xiaoli
who are you
I am ali
hello jiangsu wanhe
wanhe
# Upload word to HDFS
./hdfs dfs -put word /word
# View the uploaded file
./hdfs dfs -cat /word
  • 3. Enter the Spark command line
# Go to Spark's command directory
cd /root/spark-2.2.1-bin-hadoop2.7/bin
# Start the Spark shell
./spark-shell
  • 4. Run the following commands interactively in the Spark shell
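# Create an RDD that uses /word on HDFS as input (sc is the SparkContext provided by spark-shell)
scala> val textFile = sc.textFile("/word")
# Count how many lines /word contains
scala> textFile.count()
# Print the contents of /word
scala> textFile.collect().foreach(println)
# Keep only the lines that contain "I" (the match is case-sensitive)
scala> val linesWithSpark = textFile.filter(line => line.contains("I"))
# Count how many lines contain "I"
scala> linesWithSpark.count()
# Count word frequencies: split each line on spaces, pair every word with 1, then sum the 1s per word
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
# Print the results
scala> wordCounts.collect()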
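
To check what the commands in step 4 should return without a running cluster, the same pipeline can be sketched with ordinary Scala collections. This is only a local stand-in, not Spark code: the object name LocalWordCount and the helper reduceByKey below are hypothetical, and the sketch assumes /word contains exactly the six lines entered in step 2, with no trailing spaces or blank lines.

```scala
object LocalWordCount {
  // The six lines written to the word file in step 2 (assumed to be the exact file content).
  val lines: Seq[String] = Seq(
    "I am zhangli",
    "I am xiaoli",
    "who are you",
    "I am ali",
    "hello jiangsu wanhe",
    "wanhe"
  )

  // Hypothetical local stand-in for reduceByKey:
  // group the pairs by key, then combine the values of each group with f.
  def reduceByKey(pairs: Seq[(String, Int)])(f: (Int, Int) => Int): Map[String, Int] =
    pairs.groupBy(_._1).map { case (key, group) => (key, group.map(_._2).reduce(f)) }

  // Same shape as the spark-shell pipeline: flatMap, map, reduceByKey.
  def wordCounts(input: Seq[String]): Map[String, Int] =
    reduceByKey(
      input.flatMap(line => line.split(" ").toSeq).map(word => (word, 1))
    )((a, b) => a + b)

  def main(args: Array[String]): Unit = {
    println(lines.size)                               // mirrors textFile.count()
    println(lines.count(line => line.contains("I")))  // mirrors linesWithSpark.count()
    wordCounts(lines).toSeq.sortBy(_._1).foreach(println)
  }
}
```

Under those assumptions the file has 6 lines, 3 of them contain "I", and the counts include 3 for "I", 3 for "am", and 2 for "wanhe". The order in which Spark returns the pairs from collect() may differ from the sorted order printed here.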

The steps above compute word frequency statistics interactively with Scala in the Spark shell.
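
One caveat about splitting on a single space: if a line contains repeated or leading spaces, the split produces empty strings that are then counted as words. The following plain Scala sketch (again not Spark code) shows one possible way to tokenize on runs of whitespace and to rank the words by frequency. The object name RankedWordCount, its helpers, and the sample input are all hypothetical. In spark-shell, a similar effect could probably be achieved by splitting on "\\s+", filtering out empty tokens, and sorting the resulting pairs with the RDD sortBy method.

```scala
object RankedWordCount {
  // Split on runs of whitespace and drop empty tokens, so repeated or
  // leading spaces do not produce an empty-string "word".
  def tokenize(line: String): Seq[String] =
    line.split("\\s+").toSeq.filter(_.nonEmpty)

  // Count the words, then order by descending count; ties are broken alphabetically.
  def ranked(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(tokenize)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
      .toSeq
      .sortBy { case (word, count) => (-count, word) }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample with a doubled space and a leading space.
    val sample = Seq("I am  zhangli", " I am xiaoli", "wanhe wanhe")
    ranked(sample).foreach(println)
  }
}
```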

Reproduced from: https://www.jianshu.com/p/92257e814e59

Origin blog.csdn.net/weixin_33696106/article/details/91051978