I. Partitioning strategy
GraphX distributes the graph across machines using a vertex-cut approach. Rather than splitting the graph along edges, GraphX partitions it along vertices, which can reduce both communication and storage overhead. Logically, this corresponds to assigning edges to machines while allowing vertices to span multiple machines. Exactly how edges are assigned depends on the PartitionStrategy, and the various heuristics involve different trade-offs. Users can repartition a graph with a different strategy via the Graph.partitionBy operator. The default strategy simply keeps the edge partitioning supplied when the graph was constructed; however, users can easily switch to 2D partitioning or the other heuristics included in GraphX.
Once the edges have been partitioned, the key challenge for efficient graph-parallel computation is effectively joining vertex attributes with edges. Because real-world graphs typically have far more edges than vertices, we move vertex attributes to the edges. Since not every partition contains edges adjacent to every vertex, we maintain an internal routing table that identifies where to broadcast each vertex when implementing the joins required by operations such as triplets and aggregateMessages.
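The 2D-partitioning heuristic mentioned above can be sketched in a few lines of plain Scala. This is a simplified illustration of the idea behind EdgePartition2D, not GraphX's exact hashing (the grid formula below is an assumption for illustration): edges are placed into a √N × √N grid, with the row chosen from the source vertex and the column from the destination vertex, so the edges of any single vertex span at most 2√N partitions.

```scala
object EdgePartition2DSketch {
  // Assign an edge (src, dst) to one of numParts partitions laid out as a
  // ceilSqrt x ceilSqrt grid: the row comes from the source vertex, the
  // column from the destination vertex. Any one vertex then touches at
  // most 2 * ceilSqrt grid cells.
  def partitionFor(src: Long, dst: Long, numParts: Int): Int = {
    val ceilSqrt = math.ceil(math.sqrt(numParts)).toInt
    val row = math.abs(src.hashCode) % ceilSqrt
    val col = math.abs(dst.hashCode) % ceilSqrt
    (row * ceilSqrt + col) % numParts
  }

  def main(args: Array[String]): Unit = {
    val edges = Seq((1L, 2L), (1L, 3L), (2L, 3L), (4L, 1L))
    edges.foreach { case (s, d) =>
      println(s"edge ($s,$d) -> partition ${partitionFor(s, d, 9)}")
    }
  }
}
```

Note the design trade-off: a pure edge-hash would scatter a high-degree vertex's edges over every partition, while the grid bounds that spread, which is why vertex replication stays manageable.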
II. Test Data
1. users.txt
2. followers.txt
3. Data Visualization
III. PageRank page ranking algorithm
1. Introduction
PageRank measures the importance of each vertex in a graph, on the assumption that an edge from u to v represents an endorsement of v's importance by u. For example, a Twitter user who is followed by many other users will be ranked highly. GraphX ships with static and dynamic implementations of PageRank as methods on the PageRank object. Static PageRank runs for a fixed number of iterations, while dynamic PageRank runs until the ranks converge [i.e., change by less than a specified tolerance]. GraphOps allows these algorithms to be called directly as methods on Graph.
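To make the dynamic (run-until-convergence) variant concrete, here is a minimal standalone PageRank loop in plain Scala with no Spark dependency. The tiny example graph, the damping factor of 0.85, and the starting rank of 1.0 are illustrative assumptions; this sketches the algorithm, not GraphX's implementation.

```scala
object PageRankSketch {
  // edges: src -> out-neighbors; returns vertexId -> rank.
  // Iterates until the total rank change falls below tol.
  def pageRank(edges: Map[Long, Seq[Long]],
               tol: Double = 1e-4,
               damping: Double = 0.85): Map[Long, Double] = {
    val vertices = (edges.keys ++ edges.values.flatten).toSet
    var ranks = vertices.map(v => v -> 1.0).toMap
    var delta = Double.MaxValue
    while (delta > tol) {
      // Each vertex sends rank / outDegree along each out-edge
      val contribs = edges.toSeq.flatMap { case (src, dsts) =>
        dsts.map(dst => dst -> ranks(src) / dsts.size)
      }.groupBy(_._1).mapValues(_.map(_._2).sum)
      val next = vertices.map { v =>
        v -> ((1 - damping) + damping * contribs.getOrElse(v, 0.0))
      }.toMap
      delta = vertices.toSeq.map(v => math.abs(next(v) - ranks(v))).sum
      ranks = next
    }
    ranks
  }

  def main(args: Array[String]): Unit = {
    // 2 and 3 both follow 1, and 3 also follows 2: vertex 1 ranks highest
    val edges = Map(2L -> Seq(1L), 3L -> Seq(1L, 2L))
    pageRank(edges).toSeq.sortBy(-_._2).foreach(println)
  }
}
```

The tolerance parameter plays the same role as the 0.0001 passed to graph.pageRank in the code below: a smaller value means more iterations and more precise ranks.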
2. Code implementation
package graphx

import org.apache.log4j.{Level, Logger}
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.sql.SparkSession

/**
  * Created by Administrator on 2019/11/27.
  */
object PageRank {
  Logger.getLogger("org").setLevel(Level.WARN)
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName(s"${this.getClass.getSimpleName}")
      .getOrCreate()
    val sc = spark.sparkContext
    // Load the follower edge list as a graph
    val graph = GraphLoader.edgeListFile(sc, "D:\\software\\spark-2.4.4\\data\\graphx\\followers.txt")
    // Run dynamic PageRank until convergence (tolerance 0.0001)
    val ranks = graph.pageRank(0.0001).vertices
    // Parse the users file into (id, username) pairs
    val users = sc.textFile("D:\\software\\spark-2.4.4\\data\\graphx\\users.txt").map(line => {
      val fields = line.split(",")
      (fields(0).toLong, fields(1))
    })
    // Join the ranks with the usernames
    val ranksByUsername = users.join(ranks).map {
      case (id, (username, rank)) => (username, rank)
    }
    // Print the results
    ranksByUsername.foreach(println)
  }
}
3. Execution result
IV. ConnectedComponents connected components algorithm
1. Introduction
The connected components algorithm partitions the graph into subgraphs [without splitting any vertex] and discards island subgraphs [subgraphs containing only a single vertex]. It labels each subgraph with the ID of its lowest-numbered vertex.
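The min-ID labeling described above can be sketched as a simple label-propagation loop in plain Scala (no Spark; this is an illustrative fixed-point iteration, not GraphX's Pregel-based implementation). Each vertex repeatedly adopts the smallest label among itself and its neighbors until nothing changes.

```scala
object ConnectedComponentsSketch {
  // Edges are treated as undirected; returns vertexId -> smallest
  // vertexId reachable from it (the component label).
  def connectedComponents(vertices: Set[Long],
                          edges: Seq[(Long, Long)]): Map[Long, Long] = {
    val neighbors = (edges ++ edges.map(_.swap))
      .groupBy(_._1).mapValues(_.map(_._2)).toMap
    // Every vertex starts labeled with its own ID
    var labels = vertices.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      changed = false
      for (v <- vertices) {
        // Take the minimum label among v and its neighbors
        val best = (neighbors.getOrElse(v, Seq.empty).map(labels) :+ labels(v)).min
        if (best < labels(v)) { labels = labels.updated(v, best); changed = true }
      }
    }
    labels
  }

  def main(args: Array[String]): Unit = {
    // Two components: {1,2,3} and {4,5}
    val labels = connectedComponents(Set(1L, 2L, 3L, 4L, 5L),
                                     Seq((1L, 2L), (2L, 3L), (4L, 5L)))
    println(labels) // every vertex maps to the min ID of its component
  }
}
```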
2. Code implementation
package graphx

import org.apache.log4j.{Level, Logger}
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.sql.SparkSession

/**
  * Created by Administrator on 2019/11/27.
  */
object ConnectedComponents {
  Logger.getLogger("org").setLevel(Level.WARN)
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName(s"${this.getClass.getSimpleName}")
      .getOrCreate()
    val sc = spark.sparkContext
    // Load the follower edge list as a graph
    val graph = GraphLoader.edgeListFile(sc, "D:\\software\\spark-2.4.4\\data\\graphx\\followers.txt")
    // Run the connected components algorithm
    val cc = graph.connectedComponents().vertices
    // Parse the users file into (id, username) pairs
    val users = sc.textFile("D:\\software\\spark-2.4.4\\data\\graphx\\users.txt").map(line => {
      val fields = line.split(",")
      (fields(0).toLong, fields(1))
    })
    // Join the component labels with the usernames
    val ranksByUsername = users.join(cc).map {
      case (id, (username, rank)) => (username, rank)
    }
    val count = ranksByUsername.count().toInt
    // Sort by component ID and print
    ranksByUsername.map(_.swap).takeOrdered(count).foreach(println)
  }
}
3. Execution result
V. TriangleCount triangle counting algorithm
1. Introduction
A vertex is part of a triangle when it has two adjacent vertices with an edge between them. GraphX implements a triangle counting algorithm in the TriangleCount object, which determines the number of triangles passing through each vertex, thereby providing a measure of clustering. Note that TriangleCount requires edges to be in canonical orientation [srcId < dstId] and the graph to be partitioned with Graph.partitionBy.
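The core idea is neighbor-set intersection: a triangle through v is an unordered pair of v's neighbors that are themselves adjacent. The plain-Scala sketch below illustrates this (no Spark; an illustrative implementation, not GraphX's). It also shows why canonical orientation matters: edges are first normalized to smaller-ID-first and de-duplicated so that a pair of reciprocal edges is not counted twice.

```scala
object TriangleCountSketch {
  // Count the triangles through each vertex. Edges are canonicalized
  // (smaller ID first), self-loops dropped, and duplicates removed,
  // mirroring the srcId < dstId requirement noted above.
  def triangleCounts(edges: Seq[(Long, Long)]): Map[Long, Int] = {
    val canonical = edges
      .map { case (a, b) => if (a < b) (a, b) else (b, a) }
      .filter { case (a, b) => a != b }
      .distinct
    // Undirected adjacency sets
    val adj: Map[Long, Set[Long]] = (canonical ++ canonical.map(_.swap))
      .groupBy(_._1).mapValues(_.map(_._2).toSet).toMap
    adj.map { case (v, nbrs) =>
      // Each unordered neighbor pair joined by an edge is one triangle
      val pairs = for {
        a <- nbrs.toSeq
        b <- nbrs.toSeq
        if a < b && adj(b).contains(a)
      } yield 1
      v -> pairs.sum
    }
  }

  def main(args: Array[String]): Unit = {
    // Triangle 1-2-3 plus a pendant edge 3-4
    println(triangleCounts(Seq((1L, 2L), (2L, 3L), (1L, 3L), (3L, 4L))))
  }
}
```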
2. Code implementation
package graphx

import org.apache.log4j.{Level, Logger}
import org.apache.spark.graphx.{PartitionStrategy, GraphLoader}
import org.apache.spark.sql.SparkSession

/**
  * Created by Administrator on 2019/11/27.
  */
object TriangleCount {
  Logger.getLogger("org").setLevel(Level.WARN)
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName(s"${this.getClass.getSimpleName}")
      .getOrCreate()
    val sc = spark.sparkContext
    // Load the edge list in canonical orientation (third argument),
    // then partition the graph as TriangleCount requires
    val graph = GraphLoader.edgeListFile(sc, "D:\\software\\spark-2.4.4\\data\\graphx\\followers.txt", true)
      .partitionBy(PartitionStrategy.RandomVertexCut)

    // Run the triangle counting algorithm
    val triCounts = graph.triangleCount().vertices
    // Parse the users file into (id, username) pairs
    val users = sc.textFile("D:\\software\\spark-2.4.4\\data\\graphx\\users.txt").map(line => {
      val fields = line.split(",")
      (fields(0).toLong, fields(1))
    })
    // Join the triangle counts with the usernames
    val triCountByUsername = users.join(triCounts).map { case (id, (username, tc)) =>
      (username, tc)
    }
    val count = triCountByUsername.count().toInt
    // Sort by triangle count and print
    triCountByUsername.map(_.swap).takeOrdered(count).foreach(println)
  }
}
3. Execution result