Spark Example Project: The Most-Visited URLs Under Each Domain

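Life is about accumulation. The difficulties and problems you run into all become part of your wealth, and a blog is the best record of your growth along the way.
-- Dedicated to you, doing your homework right now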
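This Spark project analyzes web URL data, as a way to deepen your understanding of RDDs.
The requirement: for each domain, find the three URLs with the most visits.
Data format (the code splits each line on a tab, so the fields are assumed to be tab-separated):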
https://blog.csdn.net/qq_43688472/article/details/84307884 [2015-07-07 13:52:58] 54
https://item.jd.com/26838388932.html [2008-03-04 15:47:37] 81
https://blog.csdn.net/qq_24073707/article/details/80665991 [5002-10-17 09:20:02] 73
https://hizero.taobao.com/?spm=a217m.8316598.682372.3.426d33d5KRgTIh [2009-05-29 20:02:43] 63
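To make the record layout concrete, here is a minimal sketch in plain Scala (no Spark required) that splits one line into fields and extracts the host with java.net.URL. It assumes the fields are tab-separated, matching the split("\t") used in the job. The names RecordParser and parseLine are hypothetical and not part of the original project. A line whose first field is not a valid URL yields None instead of throwing.

```scala
import java.net.URL
import scala.util.Try

// Hypothetical helper for illustration; not part of the original project.
object RecordParser {
  // Returns Some((host, url)) when the first tab-separated field is a valid URL
  // with a non-empty host, otherwise None.
  def parseLine(line: String): Option[(String, String)] = {
    val fields = line.split("\t")
    val url = fields(0).trim
    Try(new URL(url).getHost).toOption
      .filter(h => h != null && h.nonEmpty)
      .map(host => (host, url))
  }
}
```

In the Spark job, a helper like this could probably be applied with something like line.flatMap(x => RecordParser.parseLine(x)), so that malformed records are dropped before counting.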
Implementation:
import java.net.URL

import org.apache.spark.{SparkConf, SparkContext}

object UrlCount01 {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("UrlCount01")

    val sc = new SparkContext(sparkConf)

    val line = sc.textFile("file:///E:\\data.txt")
    // Filter the data first, because some records may not be in URL format
    // (left disabled here).
    //println("line: "+line)

    //    val rdd1 = line.filter(x=>{
    //      val tmp = x.split(":")
    //      if (tmp.length>=3||x.contains("[["))
    //        false
    //      else
    //        true
    //    }).map(
    //      x=>{
    //        val data = x.split("\t")
    //        val urls = new URL(data(0))
    //        val host = urls.getHost
    //        (data(0), 1)
    //      })

    // Map each record to (url, 1); the URL is the first tab-separated field.
    val rdd1 = line.map(x => {
      val data = x.split("\t")
      (data(0), 1)
    })

    // Sum the counts per URL.
    val rdd2 = rdd1.reduceByKey((x, y) => x + y)
    // Merging two Map objects (adding up the values of matching keys):
    //( map1 /: map2 ) { case (map, (k,v)) => map + ( k -> (v + map.getOrElse(k, 0)) ) }

    // Attach the host (domain) to each (url, count) pair.
    val rdd3 = rdd2.map { case (d, t) =>
      val host = new URL(d).getHost
      (host, d, t)
    }
    // Group the records by host.
    val rdd4 = rdd3.groupBy(_._1)
    // Within each group, sort by count in descending order and keep the top three.
    val rdd5 = rdd4.map(sx => {
      val key = sx._1
      val value = sx._2
      val sorval = value.toList.sortBy(-_._3).take(3)
      (key, sorval)
    })

    rdd5.foreach(println)

    // Save the processed data to a local directory.
    //rdd5.saveAsTextFile("E:\\data2")
    sc.stop()

  }
}
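The count, group, and top-three steps can also be tried without a Spark cluster. The following is a minimal sketch using plain Scala collections; the object name TopUrlsSketch and the function topUrlsPerHost are hypothetical, and it assumes every input string is a valid URL. It sorts by count in descending order and breaks ties by URL so the result is deterministic.

```scala
import java.net.URL

// Hypothetical, Spark-free sketch of the same count -> group -> top-N logic.
object TopUrlsSketch {
  // For each host, return up to `n` (url, count) pairs, highest count first.
  def topUrlsPerHost(urls: Seq[String], n: Int): Map[String, List[(String, Int)]] = {
    // Count occurrences of each URL (the role reduceByKey plays in the Spark job).
    val counts: Map[String, Int] =
      urls.groupBy(identity).map { case (u, us) => (u, us.size) }
    // Group by host, then sort each group by count descending (ties by URL) and keep n.
    counts.toList
      .groupBy { case (u, _) => new URL(u).getHost }
      .map { case (host, pairs) =>
        (host, pairs.sortBy { case (u, c) => (-c, u) }.take(n))
      }
  }
}
```

Calling TopUrlsSketch.topUrlsPerHost(urls, 3) on a small in-memory sample is a quick way to check the sorting direction before running the full job. Note that in Spark, groupBy gathers all values for one key together, so for domains with a very large number of distinct URLs a different aggregation strategy may be worth considering.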
Some parts of the code are left commented out; uncomment whichever you need.
Let's take a look at the result:
(wengna.taobao.com,List((wengna.taobao.com,https://wengna.taobao.com/?spm=a217m.8316598.682375.7.426d33d5KRgTIh,2)))
(takefired.taobao.com,List((takefired.taobao.com,https://takefired.taobao.com/?spm=a217m.8316598.711275.5.426d33d5KRgTIh,1)))
(blog.csdn.net,List((blog.csdn.net,https://blog.csdn.net/qq_43688472/article/details/84940873,1), (blog.csdn.net,https://blog.csdn.net/qq_24073707/article/details/80988329,1), (blog.csdn.net,https://blog.csdn.net/qq_24073707/article/details/80658301,1)))
(12cmlook.taobao.com,List((12cmlook.taobao.com,https://12cmlook.taobao.com/?spm=a217m.8316598.682375.9.426d33d5KRgTIh,1)))
(item.jd.com,List((item.jd.com,https://item.jd.com/25619900612.html,1), (item.jd.com,https://item.jd.com/19997245287.html,1), (item.jd.com,https://item.jd.com/100001625726.html,1)))
(unawares.taobao.com,List((unawares.taobao.com,https://unawares.taobao.com/?spm=a217m.8316598.682348.7.426d33d5KRgTIh,2)))
(mp.csdn.net,List((mp.csdn.net,https://mp.csdn.net/mdeditor/84307884#,1)))
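Effort always pays off. Believe in yourself: even if you are slow, what matters is that you keep moving forward. Don't be afraid, and keep going!
-- Dedicated to you, who are working hard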

Reposted from blog.csdn.net/qq_43688472/article/details/85134619