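File statistics with RDD operators

Requirements

Given a TXT file that records the number of visits for each URI, find the URI with the most visits under each domain.

Program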

package www.ruozedata.bigdata.homework

import org.apache.spark.{SparkConf, SparkContext}


object URIApp {
  def main(args: Array[String]): Unit = {
    val sparkConf=new SparkConf().setMaster("local[2]").setAppName("URIApp")
    val sc=new SparkContext(sparkConf)
    val lines=sc.textFile("file:///C:\\Users\\HJ\\Desktop/secondhomework.txt")
    val uri=lines.map(x=>{
      val uritemp=x.split("\t")
      val uritemp1=uritemp(0)
      var number:Long=0
      try{
        number=uritemp(2).toLong
      }catch {
        case e: Exception=>println("error number")
      }
      val net=uritemp1.split("//")(1).split("/")(0)
      (net,(uritemp1,number))
    }).groupByKey()
    val result=uri.map(x=>{
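      // sortBy cannot take false as a second argument here: the ascending flag belongs to
      // the RDD operator sortBy, but at this point the values are already a local collection (a List).
      // So we sort ascending and use the collection method reverse to get descending order.
      val maxtemp=x._2.toList.sortBy(_._2).reverse
      (x._1,maxtemp(0))// take the first element, i.e. the one with the most visits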
    }).foreach(println)

    sc.stop()
  }

}
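
The program sorts each group in ascending order and then reverses it. As a side note, here is a minimal sketch of two other ways to get the same top element from a local collection, using only the standard Scala collections API. The object name DescendingSortSketch and its method names are made up for this illustration, and the sample data is copied from the input file.

```scala
object DescendingSortSketch {
  // (uri, visits) pairs for one domain, taken from the sample input
  val visits: List[(String, Long)] = List(
    ("https://www.cnblogs.com/MOBIN/p/5384543.html", 40L),
    ("https://www.cnblogs.com/huxiuqian/p/10152166.html", 4L),
    ("https://www.cnblogs.com/littleorange7/p/10152286.html", 7L)
  )

  // Same approach as the program: ascending sort, then reverse
  def viaReverse(xs: List[(String, Long)]): (String, Long) =
    xs.sortBy(_._2).reverse.head

  // Descending sort by passing a reversed Ordering explicitly
  def viaOrdering(xs: List[(String, Long)]): (String, Long) =
    xs.sortBy(_._2)(Ordering[Long].reverse).head

  // No sort at all: a single pass that keeps the largest element
  def viaMaxBy(xs: List[(String, Long)]): (String, Long) =
    xs.maxBy(_._2)

  def main(args: Array[String]): Unit = {
    println(viaReverse(visits))
    println(viaOrdering(visits))
    println(viaMaxBy(visits))
  }
}
```

All three helpers throw on an empty list. That should not matter after groupByKey, since every key has at least one value, but it is worth keeping in mind if the code is reused elsewhere.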

Result

(segmentfault.com,(https://segmentfault.com/q/1010000000318379,50))
(blog.csdn.net,(https://blog.csdn.net/bitcarmanlee/article/details/75949268 ,40))
(www.baidu.com,(https://www.baidu.com/baidu?tn=monline_3_dg&ie=utf-8&wd=%E6%9C%89%E9%81%93%E7%BF%BB%E8%AF%91,5))
(www.cnblogs.com,(https://www.cnblogs.com/MOBIN/p/5384543.html,40))
(ruozedata.com,(http://ruozedata.com/student.html ,56))
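
The same per-domain maximum can also be computed by merging records two at a time instead of collecting and sorting each group. The sketch below does this on plain Scala collections so that it runs without Spark. PairwiseMaxSketch, pickMax and maxPerDomain are hypothetical names, and the invalid counts are written as 0, matching what the program does. In Spark, a function shaped like pickMax is the kind of (V, V) => V argument that reduceByKey accepts, so it could possibly stand in for groupByKey plus sorting, but that variant is not part of the original program.

```scala
object PairwiseMaxSketch {
  type Visit = (String, Long) // (uri, visits)

  // Keeps whichever of two records has more visits; on a tie the left one wins
  def pickMax(a: Visit, b: Visit): Visit = if (b._2 > a._2) b else a

  // Local stand-in for a per-key reduction:
  // group by domain, then fold each group with pickMax
  def maxPerDomain(records: Seq[(String, Visit)]): Map[String, Visit] =
    records.groupBy(_._1).map { case (domain, group) =>
      domain -> group.map(_._2).reduce(pickMax)
    }

  // A few (domain, (uri, visits)) records from the sample input
  val sample: Seq[(String, Visit)] = Seq(
    ("ruozedata.com", ("http://ruozedata.com/teacher.html", 0L)),
    ("ruozedata.com", ("http://ruozedata.com/student.html", 56L)),
    ("ruozedata.com", ("http://ruozedata.com/advanced.html", 8L)),
    ("www.cnblogs.com", ("https://www.cnblogs.com/MOBIN/p/5384543.html", 40L)),
    ("www.cnblogs.com", ("https://www.cnblogs.com/huxiuqian/p/10152166.html", 4L))
  )

  def main(args: Array[String]): Unit =
    maxPerDomain(sample).foreach(println)
}
```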

Input file

https://segmentfault.com/q/1010000000318379 [2018-1202:00] 50
http://ruozedata.com/teacher.html  201802:00 j         
http://ruozedata.com/student.html  201802:00 56
https://www.cnblogs.com/MOBIN/p/5384543.html [2018-12-12 22:00:00] 40
https://www.cnblogs.com/huxiuqian/p/10152166.html 201802:00 4
https://www.cnblogs.com/littleorange7/p/10152286.html [2018-12-12 22:00:00] 7
http://ruozedata.com/advanced.html  [2018-12-14 22:02:00] 8
https://www.baidu.com/baidu?tn=monline_3_dg&ie=utf-8&wd=%E6%9C%89%E9%81%93%E7%BF%BB%E8%AF%91 [2018-1202:00] 5
https://blog.csdn.net/maybe_fly/article/details/77979867  201802:00 h
https://blog.csdn.net/bitcarmanlee/article/details/75949268  [2018-12-13 22:02:00] 40
https://blog.csdn.net/tswisdom/article/details/79882308 [2018-12-13 22:02:00] 30
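
Each line above is turned into a (domain, (uri, visits)) pair by the map step of the program. Here is a minimal standalone sketch of that step, assuming the three fields are separated by tabs (the program splits on "\t"). ParseLineSketch and parseLine are hypothetical names. Unlike the program, this version trims stray spaces around the uri and uses scala.util.Try instead of try/catch.

```scala
import scala.util.Try

object ParseLineSketch {
  // Turns one input line into (domain, (uri, visits)).
  // Assumes three tab-separated fields: uri, timestamp, visit count.
  def parseLine(line: String): (String, (String, Long)) = {
    val fields = line.split("\t")
    // Assumption: strip stray spaces around the uri (the program keeps them)
    val uri = fields(0).trim
    // A missing or non-numeric count (such as "j" or "h") falls back to 0
    val visits = Try(fields(2).trim.toLong).getOrElse(0L)
    // "https://blog.csdn.net/a/b" -> "blog.csdn.net"
    val domain = uri.split("//")(1).split("/")(0)
    (domain, (uri, visits))
  }

  def main(args: Array[String]): Unit = {
    println(parseLine("http://ruozedata.com/student.html \t201802:00\t56"))
    println(parseLine("http://ruozedata.com/teacher.html \t201802:00\tj"))
  }
}
```

A line whose first field has no "//" would still throw here, just as in the program, so a malformed uri is not handled by this sketch.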

Reposted from blog.csdn.net/qq_42694416/article/details/85142267