Spark wordCount Example

1. Build an RDD ## path specifies where the file lives. By default the path is resolved against HDFS, and the hdfs://hostname:8020/ prefix can be omitted; for a local Linux file path, you must write file:// plus the absolute path of the file.
val textFile = sc.textFile("README.md")
org.apache.hadoop.mapred.InvalidInputException: 
Input path does not exist: hdfs://spark.ibeifeng.com:8020/user/ibeifeng/README.md

val textFile = sc.textFile("file:///README.md")
org.apache.hadoop.mapred.InvalidInputException: 
Input path does not exist: file:/README.md

val textFile = sc.textFile("file:///opt/modules/cdh-5.3.6/spark-1.6.1-bin-2.5.0-cdh5.3.6/README.md")
res2: Long = 95

val path = "/input/dept.txt"
val rdd = sc.textFile(path)
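Since /input/dept.txt carries no scheme, it is resolved against HDFS. Writing out the full URI (using the same NameNode address that appears in the error messages above) should be equivalent:
val rdd = sc.textFile("hdfs://spark.ibeifeng.com:8020/input/dept.txt")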

2. Operate on the RDD
### Treat the RDD like an ordinary collection (wordCount based on groupBy)
val result = rdd.flatMap(line => line.split("\t")).filter(word => word.nonEmpty).groupBy(word => word).map(t => (t._1,t._2.size))

val result = rdd.flatMap(line => line.split("\t")).filter(word => word.nonEmpty).map(word => (word,1)).groupByKey().map(t => (t._1,t._2.size))
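Both variants above shuffle every word (or every (word, 1) pair) across the cluster before counting. For a quick check on small data, countByValue (a standard RDD action, not used in the original post) returns the counts directly as a Map on the driver:
rdd.flatMap(line => line.split("\t")).filter(word => word.nonEmpty).countByValue()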

### wordCount implemented with reduceByKey
val result = rdd.flatMap(line => line.split(" ")).filter(word => word.nonEmpty).map(word => (word,1)).reduceByKey((a:Int,b:Int)=> a + b)
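Unlike groupByKey, reduceByKey combines values within each partition before the shuffle, so it is the preferred aggregation here. To inspect the most frequent words, a sketch using the standard sortBy method:
result.sortBy(t => t._2, ascending = false).take(3) // top 3 (word, count) pairs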

3. Save the results
## 1. Print directly to the console
result.first //fetch the first record
result.take(n) //fetch the first n records
result.collect //fetch all of the data
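To print a readable sample instead of the raw array, one option is:
result.take(10).foreach(println) // fetch at most 10 (word, count) pairs to the driver and print them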
## 2. Save to HDFS
result.saveAsTextFile("/output/spark/wordCount")
If the output directory already exists, the job fails with:
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://spark.ibeifeng.com:8020/output/spark/wordCount already exists
The output directory must not already exist!!!
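If the job has to be rerun, one common workaround (a sketch, not part of the original post) is to remove the output directory through the Hadoop FileSystem API before saving:
import org.apache.hadoop.fs.{FileSystem, Path}
val outputPath = new Path("/output/spark/wordCount")
val fs = FileSystem.get(sc.hadoopConfiguration)
if (fs.exists(outputPath)) fs.delete(outputPath, true) // true = delete recursively
result.saveAsTextFile(outputPath.toString)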

Reposted from blog.csdn.net/qq_36567024/article/details/80559695