Spark GraphX [Graph Algorithms: Partitioning Strategy, PageRank, ConnectedComponents, TriangleCount]

I. Partitioning Strategy

  

  GraphX adopts a vertex-cut approach to distributed graph partitioning. Rather than splitting the graph along edges, GraphX partitions the graph along vertices, which reduces both communication and storage overhead. Logically, this corresponds to assigning edges to machines and allowing vertices to span multiple machines. The exact method of assigning edges depends on the PartitionStrategy, and the various heuristics make different trade-offs. Users can choose among strategies by repartitioning the graph with the Graph.partitionBy operator. The default strategy is simply to keep the initial partitioning of the edges as provided at graph construction; however, users can easily switch to 2D partitioning or the other heuristics included in GraphX.
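  A minimal sketch of switching strategies, assuming sc is an existing SparkContext and using the followers.txt edge list from the examples below; EdgePartition2D and the other names listed are the heuristics shipped with GraphX:

import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// load an edge list, then redistribute its edges with the 2D heuristic
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
val partitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D)
// other built-in strategies: EdgePartition1D, RandomVertexCut, CanonicalRandomVertexCut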

  

  Once the edges have been partitioned, the key challenge for efficient graph-parallel computation is joining vertex attributes with the edges. Because real-world graphs typically have far more edges than vertices, we move vertex attributes to the edges. And because not every partition contains edges adjacent to every vertex, GraphX internally maintains a routing table that identifies where to broadcast each vertex when implementing the join required by operations like triplets and aggregateMessages.
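  As a sketch of this triplet-side join in action, the snippet below counts each user's followers with aggregateMessages; it assumes the follower graph loaded in the examples later in this post:

// map: send the message 1 along each edge to its destination vertex;
// reduce: sum the messages that arrive at each vertex
val followerCounts = graph.aggregateMessages[Int](
  triplet => triplet.sendToDst(1),
  (a, b) => a + b
)
followerCounts.collect().foreach(println)  // (vertexId, number of followers)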

II. Test Data

  1. users.txt

    
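    For reference, the users.txt shipped with the Spark 2.4.x distribution under data/graphx/ holds comma-separated id,username[,full name] records:

1,BarackObama,Barack Obama
2,ladygaga,Goga Lady
3,jeresig,John Resig
4,justinbieber,Justin Bieber
6,matei_zaharia,Matei Zaharia
7,odersky,Martin Odersky
8,anonsys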

  2. followers.txt

    
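    For reference, the stock followers.txt is a space-separated edge list; each line srcId dstId is read here as "src follows dst":

2 1
4 1
1 2
6 3
7 3
7 6
6 7
3 7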

  3. Data Visualization

    

III. PageRank (Page Ranking)

  1. Introduction

    PageRank measures the importance of each vertex in a graph, on the assumption that an edge from u to v represents an endorsement of v's importance by u. For example, a Twitter user who is followed by many other users will be ranked highly. GraphX ships with both static and dynamic implementations of PageRank as methods on the PageRank object. Static PageRank runs for a fixed number of iterations, while dynamic PageRank runs until the ranks converge, i.e. until they change by less than a specified tolerance. GraphOps allows these algorithms to be called directly as methods on Graph.
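    A minimal sketch of the two variants, using a graph loaded as in the code below; pageRank and staticPageRank are both available on Graph via GraphOps:

// dynamic: iterate until every rank changes by less than the 0.0001 tolerance
val dynamicRanks = graph.pageRank(0.0001).vertices
// static: run exactly 20 iterations, whether or not the ranks have converged
val staticRanks = graph.staticPageRank(20).vertices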

  2. Code Implementation

package graphx

import org.apache.log4j.{Level, Logger}
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.sql.SparkSession

/**
  * Created by Administrator on 2019/11/27.
  */
object PageRank {
  Logger.getLogger("org").setLevel(Level.WARN)
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName(s"${this.getClass.getSimpleName}")
      .getOrCreate()
    val sc = spark.sparkContext
    // load the edge list as a graph
    val graph = GraphLoader.edgeListFile(sc, "D:\\software\\spark-2.4.4\\data\\graphx\\followers.txt")
    // run the PageRank algorithm until the ranks converge within 0.0001
    val ranks = graph.pageRank(0.0001).vertices
    // load the (id, username) pairs
    val users = sc.textFile("D:\\software\\spark-2.4.4\\data\\graphx\\users.txt").map(line => {
      val fields = line.split(",")
      (fields(0).toLong, fields(1))
    })
    // join the ranks with the usernames
    val ranksByUsername = users.join(ranks).map {
      case (id, (username, rank)) => (username, rank)
    }
    // print
    ranksByUsername.foreach(println)
  }
}

  3. Execution Results

    

IV. ConnectedComponents (Connected Components)

  1. Introduction

    The connected components algorithm partitions the graph into connected subgraphs [without splitting any vertices] and discards isolated "island" subgraphs [subgraphs containing only a single vertex]. Each component is labeled with the ID of its lowest-numbered vertex.
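    For instance, the number of distinct components can be read straight off the (vertexId, componentId) pairs computed in the code below; a small follow-on sketch:

// every vertex carries the smallest vertex ID in its component,
// so the number of distinct labels equals the number of components
val numComponents = cc.map { case (_, componentId) => componentId }.distinct().count()
println(s"number of connected components: $numComponents")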

  2. Code Implementation

package graphx

import org.apache.log4j.{Level, Logger}
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.sql.SparkSession

/**
  * Created by Administrator on 2019/11/27.
  */
object ConnectedComponents {
  Logger.getLogger("org").setLevel(Level.WARN)
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName(s"${this.getClass.getSimpleName}")
      .getOrCreate()
    val sc = spark.sparkContext
    // load the edge list as a graph
    val graph = GraphLoader.edgeListFile(sc, "D:\\software\\spark-2.4.4\\data\\graphx\\followers.txt")
    // run the connected components algorithm
    val cc = graph.connectedComponents().vertices
    // load the (id, username) pairs
    val users = sc.textFile("D:\\software\\spark-2.4.4\\data\\graphx\\users.txt").map(line => {
      val fields = line.split(",")
      (fields(0).toLong, fields(1))
    })
    // join the component IDs with the usernames
    val ccByUsername = users.join(cc).map {
      case (id, (username, componentId)) => (username, componentId)
    }
    val count = ccByUsername.count().toInt
    // swap to (componentId, username) and print, ordered by component ID
    ccByUsername.map(_.swap).takeOrdered(count).foreach(println)
  }
}

  3. Execution Results

    

V. TriangleCount (Triangle Counting)

  1. Introduction

    A vertex is part of a triangle when it has two adjacent vertices with an edge between them. GraphX implements the triangle counting algorithm in the TriangleCount object, which determines the number of triangles passing through each vertex and thereby provides a measure of clustering. Note that TriangleCount requires the edges to be in canonical orientation [srcId < dstId] and the graph to be partitioned using Graph.partitionBy.
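    Because each triangle is counted once at each of its three corner vertices, the per-vertex counts also yield the global triangle count; a small follow-on sketch using triCounts from the code below:

// sum the per-vertex counts, then divide by 3: every triangle
// contributes 1 to each of its three corners
val totalTriangles = triCounts.map { case (_, tc) => tc.toLong }.reduce(_ + _) / 3
println(s"triangles in the graph: $totalTriangles")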

  2. Code Implementation

package graphx

import org.apache.log4j.{Level, Logger}
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}
import org.apache.spark.sql.SparkSession

/**
  * Created by Administrator on 2019/11/27.
  */
object TriangleCount {
  Logger.getLogger("org").setLevel(Level.WARN)
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName(s"${this.getClass.getSimpleName}")
      .getOrCreate()
    val sc = spark.sparkContext
    // load the edges in canonical orientation (srcId < dstId) and partition the graph
    val graph = GraphLoader.edgeListFile(sc, "D:\\software\\spark-2.4.4\\data\\graphx\\followers.txt", true)
      .partitionBy(PartitionStrategy.RandomVertexCut)

    // run the triangle count algorithm
    val triCounts = graph.triangleCount().vertices
    // load the (id, username) pairs
    val users = sc.textFile("D:\\software\\spark-2.4.4\\data\\graphx\\users.txt").map(line => {
      val fields = line.split(",")
      (fields(0).toLong, fields(1))
    })
    // join the triangle counts with the usernames
    val triCountByUsername = users.join(triCounts).map { case (id, (username, tc)) =>
      (username, tc)
    }
    val count = triCountByUsername.count().toInt
    // swap to (count, username) and print, ordered by triangle count
    triCountByUsername.map(_.swap).takeOrdered(count).foreach(println)
  }
}

  3. Execution Results

    

Origin www.cnblogs.com/yszd/p/11943045.html