Spark GraphX
Spark GraphX is the Spark module for graph-centric, distributed graph computation. Under the hood, GraphX is built on RDDs and shares their storage model, so a graph can be viewed either as a pair of collections (vertices and edges) or as a graph.
Spark GraphX abstractions
(1) Vertex
Represented as RDD[(VertexId, VD)]:
VertexId is the vertex ID, of type Long.
VD is the vertex attribute and can be any type.
(2) Edge
Represented as RDD[Edge[ED]]:
Edge represents an edge and carries an attribute of type ED; it also contains the source vertex ID and the destination vertex ID.
(3) Triplet
Represented as RDD[EdgeTriplet[VD, ED]]:
An EdgeTriplet joins an edge with the attributes of its endpoints: it contains the edge attribute, the source vertex ID and attribute, and the destination vertex ID and attribute.
(4) Graph
Represented as Graph[VD, ED], built from the vertex and edge RDDs.
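Taken together, the four abstractions can be exercised in a few lines. The following minimal sketch (the vertex and edge data here are illustrative) builds a two-vertex graph and prints its triplet view:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.{SparkConf, SparkContext}

object GraphAbstractions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Abstractions").setMaster("local"))

    // Vertices: RDD[(VertexId, VD)] -- here VD is a String attribute
    val vertices = sc.parallelize(Seq((1L, "A"), (2L, "B")))
    // Edges: RDD[Edge[ED]] -- source ID, destination ID, edge attribute
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "knows")))
    // Graph: built from the vertex and edge RDDs
    val graph = Graph(vertices, edges)
    // Triplets: RDD[EdgeTriplet[VD, ED]] -- each joins an edge with both endpoint attributes
    graph.triplets.collect().foreach { t =>
      println(s"${t.srcAttr} -${t.attr}-> ${t.dstAttr}") // prints "A -knows-> B"
    }
    sc.stop()
  }
}
```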
Example of Spark GraphX
Scala code:
package Spark
import org.apache.log4j.{Level, Logger}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.{SparkConf, SparkContext}

object SparkGraph {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    // Create the Spark configuration
    val conf = new SparkConf().setAppName("Graph").setMaster("local")
    // Instantiate the SparkContext
    val sc = new SparkContext(conf)
    // Define the vertices
    val spot = sc.parallelize(Array((2L,("Lily","post")),(3L,("Tom","student")),(5L,("Andy","post")),(7L,("Mary","student"))))
    // Define the edges
    val edge = sc.parallelize(Array(Edge(2L,5L,"Colleague"),Edge(5L,3L,"Advisor"),Edge(5L,7L,"PI"),Edge(3L,7L,"Coll")))
    // Build the graph
    val graph = Graph(spot, edge)
    // Count the vertices whose position attribute is "post"
    val post_Count = graph.vertices.filter{case (id,(name,pos)) => pos == "post"}.count()
    // Print the result
    println("post count is "+post_Count)
    // Count the edges whose source ID is greater than the destination ID
    val edge_Count = graph.edges.filter(e => e.srcId > e.dstId).count()
    // Print the result
    println("the value is "+edge_Count)
    sc.stop()
  }
}
Result:
post count is 2
the value is 1
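Besides filtering vertices and edges, Graph also exposes degree views (inDegrees, outDegrees, degrees). As a sketch, reusing the same data as the example above, the combined degree of each person can be printed like this:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.{SparkConf, SparkContext}

object GraphDegrees {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Degrees").setMaster("local"))
    // Same vertex and edge data as the example above
    val spot = sc.parallelize(Array(
      (2L, ("Lily", "post")), (3L, ("Tom", "student")),
      (5L, ("Andy", "post")), (7L, ("Mary", "student"))))
    val edge = sc.parallelize(Array(
      Edge(2L, 5L, "Colleague"), Edge(5L, 3L, "Advisor"),
      Edge(5L, 7L, "PI"), Edge(3L, 7L, "Coll")))
    val graph = Graph(spot, edge)

    // degrees: VertexRDD[Int] holding in-degree + out-degree per vertex
    graph.degrees.collect().sortBy(_._1).foreach { case (id, d) =>
      println(s"vertex $id has degree $d")
    }
    sc.stop()
  }
}
```

On this graph, vertex 5 (Andy) has the highest degree (3), since it connects to three of the four vertices.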