First, the basic concepts of graphs
1. A graph is a network data structure made up of a set of vertices and a set of edges, where each edge is a relation between two vertices
- Usually written as a two-tuple: Graph = (V, E)
- Can model relationships between things
2. Application scenarios
- Finding the shortest path in a map application
- Relationships in a social network
- Hyperlinks between web pages
Second, graph terminology
1. Vertex, Edge
Graph = (V, E)
Set V = {v1, v2, v3}
Set E = {(v1, v2), (v1, v3), (v2, v3)}
As shown below:
2. Directed graph, undirected graph
- Directed graph
G = (V, E)
V = {A, B, C, D, E}
E = {<A,B>, <B,C>, <B,D>, <C,E>, <D,A>, <E,D>}
As shown below:
- Undirected graph
G = (V, E)
V = {A, B, C, D, E}
E = {(A,B), (A,D), (B,C), (B,D), (C,E), (D,E)}
As shown below:
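In code, the difference between the two edge notations can be sketched as follows. This is a minimal plain-Scala illustration, not part of GraphX: the object and method names (EdgeDirectionSketch, successors, neighbours) are made up for this example, and the edge lists are copied from the two definitions above.

```scala
object EdgeDirectionSketch {
  // Edge lists copied from the directed and undirected examples above.
  val directedEdges: Seq[(String, String)] =
    Seq("A" -> "B", "B" -> "C", "B" -> "D", "C" -> "E", "D" -> "A", "E" -> "D")
  val undirectedEdges: Seq[(String, String)] =
    Seq("A" -> "B", "A" -> "D", "B" -> "C", "B" -> "D", "C" -> "E", "D" -> "E")

  // Directed: an edge <u, v> can only be followed from u to v.
  def successors(v: String): Set[String] =
    directedEdges.collect { case (src, dst) if src == v => dst }.toSet

  // Undirected: an edge (u, v) makes u and v neighbours of each other.
  def neighbours(v: String): Set[String] =
    undirectedEdges.collect {
      case (a, b) if a == v => b
      case (a, b) if b == v => a
    }.toSet
}
```

For example, EdgeDirectionSketch.successors("A") contains only B, while EdgeDirectionSketch.neighbours("A") contains both B and D, because the undirected edge (A,D) can be followed from either end.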
3. Cyclic graph and acyclic graph
- Cyclic graph
Contains at least one cycle: a path along connected vertices that returns to the vertex where it started
- Acyclic graph
Contains no cycles. A DAG is a directed acyclic graph
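Whether a directed graph is cyclic can be checked with a depth-first search. The sketch below is one common way to do it, written in plain Scala under the assumption that the graph is given as a list of (source, destination) pairs; CycleCheckSketch and its method names are hypothetical and not part of any library.

```scala
import scala.collection.mutable

object CycleCheckSketch {
  // Build an adjacency list (source -> destinations) from directed edges.
  def adjacency(edges: Seq[(String, String)]): Map[String, List[String]] =
    edges.groupBy(_._1).map { case (src, es) => src -> es.map(_._2).toList }

  // Depth-first search: reaching a vertex that is still on the current
  // search path means the path has closed into a cycle.
  def hasCycle(edges: Seq[(String, String)]): Boolean = {
    val adj = adjacency(edges)
    val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    val onPath = mutable.Set.empty[String] // vertices on the current DFS path
    val done = mutable.Set.empty[String]   // vertices already fully explored

    def visit(v: String): Boolean =
      if (onPath(v)) true
      else if (done(v)) false
      else {
        onPath += v
        val found = adj.getOrElse(v, Nil).exists(visit)
        onPath -= v
        done += v
        found
      }

    vertices.exists(visit)
  }
}
```

Applied to the directed example above (<A,B>, <B,C>, <B,D>, <C,E>, <D,A>, <E,D>), hasCycle returns true, since A -> B -> D -> A is a cycle. An edge list such as A -> B, B -> C, A -> C has no cycle and is therefore a DAG.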
4. Degree
Degree: the total number of edges incident to a vertex
- Out-degree: the number of edges that point from the current vertex to other vertices
- In-degree: the number of edges that point from other vertices to the current vertex
As shown in the figure above:
The degree of vertex 2 is 1: out-degree 0, in-degree 1
The degree of vertex 8 is 3: out-degree 1, in-degree 2
The degree of vertex 10 is 2: out-degree 0, in-degree 2
The degree of vertex 11 is 5: out-degree 3, in-degree 2
…
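The three kinds of degree can be computed directly from an edge list. Because the figure with vertices 2, 8, 10 and 11 is not available here, this plain-Scala sketch uses the directed graph with vertices A to E from the terminology section instead; DegreeSketch is a hypothetical name chosen for the example.

```scala
object DegreeSketch {
  // Directed edges <source, destination> from the earlier directed graph example.
  val edges: Seq[(String, String)] =
    Seq("A" -> "B", "B" -> "C", "B" -> "D", "C" -> "E", "D" -> "A", "E" -> "D")

  // Out-degree: number of edges leaving each vertex.
  val outDegree: Map[String, Int] =
    edges.groupBy(_._1).map { case (v, es) => v -> es.size }

  // In-degree: number of edges arriving at each vertex.
  val inDegree: Map[String, Int] =
    edges.groupBy(_._2).map { case (v, es) => v -> es.size }

  // Degree: out-degree plus in-degree (0 if the vertex has no such edges).
  def degree(v: String): Int =
    outDegree.getOrElse(v, 0) + inDegree.getOrElse(v, 0)
}
```

For vertex B this gives out-degree 2 (edges to C and D), in-degree 1 (the edge from A), and degree 3.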
5. Classic graph representation: the adjacency matrix
- For each edge, the corresponding cell in the matrix has the value 1
- For each self-loop (an edge from a vertex to itself), the corresponding cell has the value 2, so that the degree of a vertex can be read off by summing its row or column
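The two rules above can be sketched in plain Scala for an undirected graph. The example reuses the undirected graph with vertices A to E from the terminology section; AdjacencyMatrixSketch, build and degrees are hypothetical names, and treating the value 2 as the entry for a self-loop is an assumption about what the rule means.

```scala
object AdjacencyMatrixSketch {
  // The vertex order fixes the row and column order of the matrix.
  val vertices: Vector[String] = Vector("A", "B", "C", "D", "E")
  val edges: Seq[(String, String)] =
    Seq("A" -> "B", "A" -> "D", "B" -> "C", "B" -> "D", "C" -> "E", "D" -> "E")

  def build(vs: Vector[String], es: Seq[(String, String)]): Array[Array[Int]] = {
    val index = vs.zipWithIndex.toMap
    val m = Array.fill(vs.size, vs.size)(0)
    es.foreach { case (a, b) =>
      val i = index(a)
      val j = index(b)
      if (i == j) m(i)(i) = 2           // self-loop: counts twice toward the degree
      else { m(i)(j) = 1; m(j)(i) = 1 } // undirected edge: two symmetric cells
    }
    m
  }

  // The degree of each vertex is the sum of its row.
  def degrees(vs: Vector[String], es: Seq[(String, String)]): Map[String, Int] = {
    val m = build(vs, es)
    vs.zipWithIndex.map { case (v, i) => v -> m(i).sum }.toMap
  }
}
```

With the edges above, the row sums give degree 2 for A, C and E and degree 3 for B and D.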
Third, Spark GraphX
1. Introduction to Spark GraphX
GraphX is the distributed graph computing API provided by Spark
GraphX features:
- In-memory processing allows data reuse and fast reads
- Unifies the graph view and the table view through the resilient distributed property graph (Property Graph)
- Integrates seamlessly with Spark Streaming, Spark SQL, Spark MLlib, etc.
2. GraphX core abstraction
Resilient Distributed Property Graph
- A directed multigraph with attributes on both vertices and edges
- One physical storage, two views (graph view and table view)
- Every operation on the graph view is ultimately carried out as RDD operations on the associated table view
3. GraphX API
- Graph[VD,ED]
- VertexRDD[VD]
- EdgeRDD[ED]
- EdgeTriplet[VD,ED]
- Edge: a case class
- VertexId: an alias of Long
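To show how these types fit together, here is a small sketch with explicit type annotations. It assumes a local Spark installation. In spark-shell the existing sc is reused by getOrCreate, and the vertex and edge values ("a", "b", "follows") are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Assumption: local mode; in spark-shell this returns the already running context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("graphx-types-sketch").setMaster("local[*]"))

val id: VertexId = 1L                            // VertexId is an alias of Long
val edge: Edge[String] = Edge(1L, 2L, "follows") // case class: srcId, dstId, attr

val vertexRdd: RDD[(VertexId, String)] = sc.makeRDD(Seq((1L, "a"), (2L, "b")))
val edgeRdd: RDD[Edge[String]] = sc.makeRDD(Seq(edge))

// Graph[VD, ED]: VD is the vertex attribute type, ED the edge attribute type.
val typedGraph: Graph[String, String] = Graph(vertexRdd, edgeRdd)

val vs: VertexRDD[String] = typedGraph.vertices                // table view of vertices
val es: EdgeRDD[String] = typedGraph.edges                     // table view of edges
val ts: RDD[EdgeTriplet[String, String]] = typedGraph.triplets // edge plus both endpoint attributes
```

Each EdgeTriplet extends Edge with the fields srcAttr and dstAttr, which is why triplets are convenient when an analysis needs the attributes of both endpoints.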
Fourth, creating and using a Graph
1. Import the Spark GraphX package
import org.apache.spark.graphx._
2. Create the vertex RDD
val vertices = sc.makeRDD(Seq((1L,1),(2L,2),(3L,3)))
3. Create the edge RDD
val edges = sc.makeRDD(Seq(Edge(1L,2L,1),Edge(2L,3L,2)))
4. Create a Graph object
val graph = Graph(vertices,edges)
5. Load a Graph object from a file
Contents of followers.txt:
2 3
3 4
1 4
2 4
Load statement (every vertex and edge attribute defaults to 1):
val graph = GraphLoader.edgeListFile(sc, "file:///opt/followers.txt")
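The loading step can be tried without placing a file under /opt. The sketch below writes the four lines of followers.txt to a temporary file and loads it. The temporary file location and the name followerGraph are assumptions for this example, and it is meant for local mode, where the driver and executors share a file system.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Assumption: local mode; in spark-shell the existing context is reused.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("edge-list-sketch").setMaster("local[*]"))

// Write the edge list shown above to a temporary file (hypothetical location).
val tmp = Files.createTempFile("followers", ".txt")
Files.write(tmp, "2 3\n3 4\n1 4\n2 4\n".getBytes("UTF-8"))

// Each line "src dst" becomes one edge; vertices are created from the ids seen.
val followerGraph: Graph[Int, Int] =
  GraphLoader.edgeListFile(sc, tmp.toUri.toString)

val vertexIds = followerGraph.vertices.map(_._1).collect().sorted  // ids 1, 2, 3, 4
val edgeAttrs = followerGraph.edges.map(_.attr).collect().distinct // only the default value 1
```

Since the file carries no attributes, a graph loaded this way usually needs a later join or mapVertices step to attach real vertex data.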
6. Get the vertex information of the Graph object
graph.vertices.collect.foreach(println)
7. Get the edge information of the Graph object
graph.edges.collect.foreach(println)
8. Get the triplet information of the Graph object (each edge together with its source and destination vertex attributes)
graph.triplets.collect.foreach(println)
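To make the content of a triplet more visible, the following sketch rebuilds the three-vertex graph from steps 2 to 4 and formats each triplet by hand. It assumes local mode; the name described and the output format are made up for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Assumption: local mode; in spark-shell the existing context is reused.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("triplets-sketch").setMaster("local[*]"))

// Same vertices and edges as in steps 2 to 4.
val vertices = sc.makeRDD(Seq((1L, 1), (2L, 2), (3L, 3)))
val edges = sc.makeRDD(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 2)))
val graph = Graph(vertices, edges)

// A triplet exposes srcId/srcAttr, dstId/dstAttr and the edge attribute attr.
val described = graph.triplets
  .map(t => s"${t.srcId}(${t.srcAttr}) -[${t.attr}]-> ${t.dstId}(${t.dstAttr})")
  .collect()
  .sorted

described.foreach(println)
// 1(1) -[1]-> 2(2)
// 2(2) -[2]-> 3(3)
```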
Fifth, build a property graph of user collaboration relationships
1. Create the vertex RDD
val users= sc.parallelize(Array((3L,("rxin","student")),
(7L,("jgonzal","postdoc")),
(5L,("franklin","professor")),
(2L,("istoica","professor"))))
2. Create the edge RDD
val relationship=sc.parallelize(Array(
Edge(3L,7L,"Colla"),
Edge(5L,3L,"Advisor"),
Edge(2L,5L,"Colleague"),
Edge(5L,7L,"Pi")))
3. Create a Graph object
val graphUser=Graph(users,relationship)
1) Vertex attributes
- User name
- Occupation
2) Edge attributes
- Collaboration relationship
4. Get the vertices, edges, and triplets of the graph
graphUser.vertices.collect.foreach(println)
graphUser.edges.collect.foreach(println)
graphUser.triplets.collect.foreach(println)
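Beyond printing, the two views can be queried separately. The sketch below filters the vertex view for professors and turns each triplet into a sentence. The data is copied from the steps above, local mode is assumed, and the names professors and sentences are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Assumption: local mode; in spark-shell the existing context is reused.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("property-graph-sketch").setMaster("local[*]"))

val users = sc.parallelize(Array(
  (3L, ("rxin", "student")),
  (7L, ("jgonzal", "postdoc")),
  (5L, ("franklin", "professor")),
  (2L, ("istoica", "professor"))))
val relationship = sc.parallelize(Array(
  Edge(3L, 7L, "Colla"),
  Edge(5L, 3L, "Advisor"),
  Edge(2L, 5L, "Colleague"),
  Edge(5L, 7L, "Pi")))
val graphUser = Graph(users, relationship)

// Vertex view: keep only users whose occupation is "professor".
val professors = graphUser.vertices
  .filter { case (_, (_, occupation)) => occupation == "professor" }
  .map { case (_, (name, _)) => name }
  .collect()
  .sorted

// Triplet view: combine both user names with the edge attribute.
val sentences = graphUser.triplets
  .map(t => s"${t.srcAttr._1} is the ${t.attr} of ${t.dstAttr._1}")
  .collect()

sentences.foreach(println)
```

Here professors contains franklin and istoica, and one of the printed sentences is "franklin is the Advisor of rxin".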
Sixth, build a user social network graph
- Vertex attributes: user name, age
- Edge attribute: number of calls
1. Create a Graph object
// Create the vertex RDD
val userRdd=sc.parallelize(Array(
(1L,("Alice",28)),
(2L,("Bob",27)),
(3L,("Charlie",65)),
(4L,("David",42)),
(5L,("Ed",55)),
(6L,("Fran",50))
))
// Create the edge RDD
val usercallRdd=sc.makeRDD(Array(
Edge(2L,1L,7),
Edge(3L,2L,4),
Edge(4L,1L,1),
Edge(2L,4L,2),
Edge(5L,2L,5),
Edge(5L,3L,8),
Edge(3L,6L,3),
Edge(5L,6L,3)
))
// Create the graph
val userCallGraph=Graph(userRdd,usercallRdd)
2. Data analysis
Find users older than 30
userCallGraph.vertices.filter { case (_, (_, age)) => age > 30 }.collect.foreach(println)
Analysis result:
(4,(David,42))
(5,(Ed,55))
(6,(Fran,50))
(3,(Charlie,65))
Find the pairs of users where one called the other more than 5 times
userCallGraph.triplets.filter(x=>x.attr>5).collect.foreach(x=>println(x.srcAttr._1+" like "+ x.dstAttr._1+" stage:"+x.attr))
Analysis result:
Bob like Alice stage:7
Ed like Charlie stage:8
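The degree concept from the terminology section can also be applied to this graph through GraphX's built-in inDegrees and outDegrees. The sketch below is an additional analysis that is not part of the original walkthrough. It assumes local mode, and the names received and made are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Assumption: local mode; in spark-shell the existing context is reused.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("call-degree-sketch").setMaster("local[*]"))

// Same vertices and edges as in step 1.
val userRdd = sc.parallelize(Array(
  (1L, ("Alice", 28)), (2L, ("Bob", 27)), (3L, ("Charlie", 65)),
  (4L, ("David", 42)), (5L, ("Ed", 55)), (6L, ("Fran", 50))))
val usercallRdd = sc.makeRDD(Array(
  Edge(2L, 1L, 7), Edge(3L, 2L, 4), Edge(4L, 1L, 1), Edge(2L, 4L, 2),
  Edge(5L, 2L, 5), Edge(5L, 3L, 8), Edge(3L, 6L, 3), Edge(5L, 6L, 3)))
val userCallGraph = Graph(userRdd, usercallRdd)

// In-degree: number of incoming edges per user (users with none are omitted).
val received: Map[VertexId, Int] = userCallGraph.inDegrees.collect().toMap
// Out-degree: number of outgoing edges per user (users with none are omitted).
val made: Map[VertexId, Int] = userCallGraph.outDegrees.collect().toMap

println(s"Alice has ${received.getOrElse(1L, 0)} incoming edges")
println(s"Ed has ${made.getOrElse(5L, 0)} outgoing edges")
```

Ed (vertex 5) has three outgoing edges and no incoming ones, so he is absent from received. That is why the lookups use getOrElse with a default of 0.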