In the previous article we completed word frequency statistics with Spark in the Scala interactive mode on the command line. In this section we will complete the same word count in IDEA, using an sbt-managed Scala project.
1 System, software, and prerequisite constraints
- A CentOS 7 64-bit workstation; the machine IP is 192.168.100.200 and the host name is danji. Readers should adjust these to their actual environment
- Word frequency statistics in the Scala interactive mode on Linux has been completed:
https://www.jianshu.com/p/92257e814e59
- The file whose words are to be counted has been uploaded to HDFS as /word
- The first Scala test program in IDEA has been completed:
https://www.jianshu.com/p/ec64c70e6bb6
- IDEA 2018.2
- To remove the influence of permissions, all operations are performed as root
2 Operation
- 1 Create an sbt project in IDEA
Select File -> New -> Project -> Scala -> sbt -> Next, fill in the project name, and click Finish. The first sbt import takes some time.
- 2 Configure dependencies
Add the following to build.sbt:
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
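For reference, a minimal complete build.sbt might look like the sketch below. The project name and version are placeholders, and the Scala version is an assumption based on Spark 2.1.0 being built against Scala 2.11; adapt them to your setup.

```scala
name := "ScalaWordCount"   // hypothetical project name
version := "1.0"
scalaVersion := "2.11.8"   // Spark 2.1.0 targets Scala 2.11
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
```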
- 3 Create WordCount.scala under src/main/scala with the following contents:
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
object ScalaWordCount {
def main(args: Array[String]): Unit = {
// When running on Windows, the local Hadoop installation path must be set;
// if the program is packaged as a jar and uploaded to Linux, this is not needed
System.setProperty("hadoop.home.dir", "C:\\hadoop2.7.2")
val conf: SparkConf = new SparkConf().setAppName("WordCount").setMaster("local[2]")
// Create the SparkContext
val sc: SparkContext = new SparkContext(conf)
sc.textFile("hdfs://192.168.100.200:9000/word") // read the input file from HDFS
  .flatMap(_.split(" "))                        // split each line into words
  .map((_, 1))                                  // pair each word with a count of 1
  .reduceByKey(_ + _)                           // sum the counts per word
  .saveAsTextFile("hdfs://192.168.100.200:9000/outputscala") // write results to HDFS
// Release resources
sc.stop()
}
}
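As the comment in the code notes, the program can also be packaged as a jar and run on the Linux machine instead of locally. A rough sketch of the commands, assuming the default sbt layout; the jar name, target path, and master URL are assumptions that will differ in your environment:

```shell
# Package the project into a jar (run from the project root)
sbt package
# Copy the jar to the server (paths are examples)
scp target/scala-2.11/scalawordcount_2.11-1.0.jar root@192.168.100.200:/root/
# Submit to Spark; note that setMaster("local[2]") hard-coded in the code
# would need to be removed or overridden via --master for cluster execution
spark-submit --class ScalaWordCount /root/scalawordcount_2.11-1.0.jar
```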
- 4 Run the program, then view the /outputscala directory on HDFS to see the results.
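The results can also be inspected from the command line instead of the HDFS web UI; a sketch, assuming the hdfs client is on the PATH of the server:

```shell
# List the output directory produced by saveAsTextFile
hdfs dfs -ls /outputscala
# Print the word counts from all part files
hdfs dfs -cat /outputscala/part-*
```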
The above is the complete process of word frequency statistics with Spark using Scala in IDEA.