Graph data analysis based on Spark GraphX

First, basic concepts of graphs

1. A graph is a network data structure made up of a set of vertices and a set of edges, where each edge is a relation between two vertices

  • Usually expressed as a two-tuple: Graph = (V, E)
  • Can model relationships between things

2. Application scenarios

  • Finding the shortest path in a map app
  • Relationships in a social network
  • Hyperlinks between web pages

Second, graph terminology

1. Vertex, Edge

Graph = (V, E)
V = {v1, v2, v3}
E = {(v1, v2), (v1, v3), (v2, v3)}

[Figure: the graph defined above]

2. Directed graph, undirected graph

  1. Directed graph
G = (V, E)
V = {A, B, C, D, E}
E = {<A,B>, <B,C>, <B,D>, <C,E>, <D,A>, <E,D>}

[Figure: directed graph]

  2. Undirected graph
G = (V, E)
V = {A, B, C, D, E}
E = {(A,B), (A,D), (B,C), (B,D), (C,E), (D,E)}

[Figure: undirected graph]
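
The difference between the two definitions can be sketched in plain Scala, without Spark. The names `GraphKinds`, `directed`, `undirected` and the two helper methods are made up for this illustration: a directed edge <A,B> is stored as an ordered pair, while an undirected edge (A,B) is stored as a two-element set, so its endpoints have no order.

```scala
object GraphKinds {
  // Vertex set shared by both examples
  val vertices: Set[String] = Set("A", "B", "C", "D", "E")

  // Directed edges: ordered pairs, so <A,B> is not the same edge as <B,A>
  val directed: Set[(String, String)] =
    Set(("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "A"), ("E", "D"))

  // Undirected edges: unordered pairs, modeled here as two-element sets
  val undirected: Set[Set[String]] =
    Set(Set("A", "B"), Set("A", "D"), Set("B", "C"),
        Set("B", "D"), Set("C", "E"), Set("D", "E"))

  def hasDirectedEdge(src: String, dst: String): Boolean =
    directed.contains((src, dst))

  def hasUndirectedEdge(a: String, b: String): Boolean =
    undirected.contains(Set(a, b))
}
```

With these definitions, `GraphKinds.hasDirectedEdge("B", "A")` is false even though <A,B> exists, while `GraphKinds.hasUndirectedEdge("B", "A")` is true.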

3. Cyclic and acyclic graphs

  1. Cyclic graph
    Contains at least one cycle: a path along edges that starts and ends at the same vertex

[Figure: cyclic graph]

  2. Acyclic graph
    Contains no cycles. A DAG is a directed acyclic graph

[Figure: acyclic graph]

4. Degree

Degree: the number of edges incident to a vertex

  • Out-degree: the number of edges pointing from the current vertex to other vertices
  • In-degree: the number of edges pointing from other vertices to the current vertex

[Figure: example directed graph]
In the figure above:
vertex 2 has degree 1 (out-degree 0, in-degree 1)
vertex 8 has degree 3 (out-degree 1, in-degree 2)
vertex 10 has degree 2 (out-degree 0, in-degree 2)
vertex 11 has degree 5 (out-degree 3, in-degree 2)
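
As a small illustration of these definitions, degrees can be counted directly from a list of directed edges. The edge list below is an assumption: the figure's edges are not given in the text, so this is one hypothetical set of edges chosen to be consistent with the degrees listed for vertices 2, 8, 10 and 11. The object and method names are also made up for the sketch.

```scala
object DegreeSketch {
  // Hypothetical directed edges (src, dst), chosen to match the listed degrees
  val edges: Seq[(Int, Int)] = Seq(
    (5, 11), (7, 11), (7, 8), (3, 8), (3, 10),
    (11, 2), (11, 9), (11, 10), (8, 9)
  )

  // Out-degree: number of edges leaving v
  def outDegree(v: Int): Int = edges.count(_._1 == v)

  // In-degree: number of edges arriving at v
  def inDegree(v: Int): Int = edges.count(_._2 == v)

  // Degree: all edges touching v
  def degree(v: Int): Int = inDegree(v) + outDegree(v)
}
```

For example, `DegreeSketch.degree(11)` counts two incoming and three outgoing edges.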

5. Classic representation of graphs: the adjacency matrix

[Figure: adjacency matrix example]

  • For each edge, the corresponding cell in the matrix is 1
  • For each self-loop, the corresponding cell is 2, so the degree of a vertex can be found by summing its row or column
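
The two rules above can be sketched in plain Scala for an undirected graph. The example reuses the undirected edge set from earlier and adds one self-loop on E, which is an assumption made only to show the "2 on the diagonal" rule; `AdjacencyMatrixSketch` and its members are hypothetical names.

```scala
object AdjacencyMatrixSketch {
  val names: Vector[String] = Vector("A", "B", "C", "D", "E")
  private val index: Map[String, Int] = names.zipWithIndex.toMap

  // Undirected edges from the earlier example, plus a hypothetical self-loop on E
  val edges: Seq[(String, String)] = Seq(
    ("A", "B"), ("A", "D"), ("B", "C"), ("B", "D"),
    ("C", "E"), ("D", "E"), ("E", "E")
  )

  val matrix: Array[Array[Int]] = {
    val m = Array.fill(names.size, names.size)(0)
    for ((a, b) <- edges) {
      val i = index(a)
      val j = index(b)
      if (i == j) {
        m(i)(i) += 2 // a self-loop adds 2 to the diagonal cell
      } else {
        m(i)(j) += 1 // an ordinary edge adds 1 to both symmetric cells
        m(j)(i) += 1
      }
    }
    m
  }

  // Degree of a vertex = sum of its row (equal to the column sum here)
  def degree(v: String): Int = matrix(index(v)).sum
}
```

Because the self-loop is stored as 2, summing the row for E gives its degree without any special case.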

Third, Spark GraphX

1. Introduction to Spark GraphX

GraphX is the distributed graph computing API provided by Spark

GraphX features:

  • In-memory computation enables data reuse and fast reads
  • Unifies the graph view and the table view through the resilient distributed property graph (Property Graph)
  • Integrates seamlessly with Spark Streaming, Spark SQL, Spark MLlib, etc.

2. GraphX core abstraction

Resilient Distributed Property Graph

  • A directed multigraph with attributes on both vertices and edges
  • One physical storage, two views (graph view and table view)
  • All operations on the graph view are ultimately carried out as RDD operations on the associated table view

3. GraphX API

  • Graph[VD,ED]: the graph, where VD and ED are the vertex and edge attribute types
  • VertexRDD[VD]
  • EdgeRDD[ED]
  • EdgeTriplet[VD,ED]
  • Edge: a case class
  • VertexId: an alias of Long

Fourth, create and use a Graph

1. Import the Spark GraphX package

import org.apache.spark.graphx._

2. Create the vertex RDD

val vertices = sc.makeRDD(Seq((1L,1),(2L,2),(3L,3)))

3. Create the edge RDD

val edges = sc.makeRDD(Seq(Edge(1L,2L,1),Edge(2L,3L,2)))

4. Create a Graph object

val graph = Graph(vertices,edges)
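
The statements above assume a spark-shell session, where `sc` already exists. As a rough sketch of how the same steps might look in a standalone program (the object name `CreateGraphSketch`, the `build` helper and the local master setting are assumptions made for this example, not part of the original steps):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

object CreateGraphSketch {
  // Builds the same three-vertex, two-edge graph as the steps above
  def build(sc: SparkContext): Graph[Int, Int] = {
    val vertices: RDD[(VertexId, Int)] = sc.makeRDD(Seq((1L, 1), (2L, 2), (3L, 3)))
    val edges: RDD[Edge[Int]] = sc.makeRDD(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 2)))
    Graph(vertices, edges)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally; getOrCreate reuses an existing context if there is one
    val conf = new SparkConf().setMaster("local[*]").setAppName("CreateGraphSketch")
    val sc = SparkContext.getOrCreate(conf)
    val graph = build(sc)
    println(s"vertices = ${graph.numVertices}, edges = ${graph.numEdges}")
    sc.stop() // skip this inside spark-shell, where it would stop the shell's context
  }
}
```

Writing the types out (`RDD[(VertexId, Int)]`, `RDD[Edge[Int]]`, `Graph[Int, Int]`) makes it visible that the vertex attribute type VD and the edge attribute type ED are both Int here.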

5. Load a Graph object from a file

Contents of followers.txt (one edge per line: source vertex ID, then destination vertex ID):

2 3
3 4
1 4
2 4

Load statement:

val graph =  GraphLoader.edgeListFile(sc, "file:///opt/followers.txt")
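
One detail worth knowing is that `GraphLoader.edgeListFile` returns a `Graph[Int, Int]` in which every vertex attribute and every edge attribute is set to 1, since the file only carries vertex IDs. The sketch below writes the four sample lines to a temporary file and loads them; the temporary file and the names `EdgeListSketch` and `load` are assumptions for this example, standing in for /opt/followers.txt.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, GraphLoader}

object EdgeListSketch {
  def load(sc: SparkContext): Graph[Int, Int] = {
    // Hypothetical temporary file standing in for /opt/followers.txt
    val path = Files.createTempFile("followers", ".txt")
    Files.write(path, "2 3\n3 4\n1 4\n2 4\n".getBytes("UTF-8"))
    // Each "src dst" line becomes one edge; vertex and edge attributes default to 1
    GraphLoader.edgeListFile(sc, path.toUri.toString)
  }

  def main(args: Array[String]): Unit = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[*]").setAppName("EdgeListSketch"))
    val graph = load(sc)
    graph.vertices.collect().sortBy(_._1).foreach(println) // vertex IDs 1 to 4
    graph.edges.collect().foreach(println)
    sc.stop() // skip this inside spark-shell
  }
}
```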

6. Get the vertex information of the Graph object

graph.vertices.collect.foreach(println)

7. Get the edge information of the Graph object

graph.edges.collect.foreach(println)

8. Get the triplet information of the Graph object (each edge together with its source and destination vertex attributes)

graph.triplets.collect.foreach(println)
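
Each element of `graph.triplets` is an `EdgeTriplet` with the fields `srcId`, `srcAttr`, `dstId`, `dstAttr` and `attr`. As a minimal sketch of reading those fields one by one instead of printing the whole object (the object name `TripletSketch`, the `describe` helper and the output format are assumptions for this example):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

object TripletSketch {
  // Formats every triplet as "srcId(srcAttr) -[edgeAttr]-> dstId(dstAttr)"
  def describe(graph: Graph[Int, Int]): Array[String] =
    graph.triplets
      .map(t => s"${t.srcId}(${t.srcAttr}) -[${t.attr}]-> ${t.dstId}(${t.dstAttr})")
      .collect()

  def main(args: Array[String]): Unit = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[*]").setAppName("TripletSketch"))
    // Same small graph as in steps 2 to 4
    val vertices = sc.makeRDD(Seq((1L, 1), (2L, 2), (3L, 3)))
    val edges = sc.makeRDD(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 2)))
    describe(Graph(vertices, edges)).foreach(println)
    sc.stop() // skip this inside spark-shell
  }
}
```

Mapping each triplet to a string before calling `collect()` keeps only the values that are needed on the driver.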

Fifth, build a user collaboration property graph

1. Create a vertex RDD

val users= sc.parallelize(Array((3L,("rxin","student")),
(7L,("jgonzal","postdoc")),
(5L,("franklin","professor")),
(2L,("istoica","professor"))))

2. Create an edge RDD

val relationship=sc.parallelize(Array(
Edge(3L,7L,"Colla"),
Edge(5L,3L,"Advisor"),
Edge(2L,5L,"Colleague"),
Edge(5L,7L,"Pi")))

3. Create a Graph object

val graphUser=Graph(users,relationship)

[Figure: user collaboration graph]
1) Vertex attributes

  • Username
  • Occupation

2) Edge attributes

  • Relationship between the two users

4. Get the vertices, edges, and triplets of the graph

graphUser.vertices.collect.foreach(println)
graphUser.edges.collect.foreach(println)
graphUser.triplets.collect.foreach(println)
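
The degree concepts from the terminology section are also available on a GraphX graph through `inDegrees`, `outDegrees` and `degrees`. The following is a sketch on the same user graph; the object name `UserDegreeSketch` and the `build` helper are assumptions for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

object UserDegreeSketch {
  // Rebuilds the user graph from steps 1 to 3
  def build(sc: SparkContext): Graph[(String, String), String] = {
    val users = sc.parallelize(Seq(
      (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
      (5L, ("franklin", "professor")), (2L, ("istoica", "professor"))))
    val relationship = sc.parallelize(Seq(
      Edge(3L, 7L, "Colla"), Edge(5L, 3L, "Advisor"),
      Edge(2L, 5L, "Colleague"), Edge(5L, 7L, "Pi")))
    Graph(users, relationship)
  }

  def main(args: Array[String]): Unit = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[*]").setAppName("UserDegreeSketch"))
    val g = build(sc)
    // inDegrees and outDegrees only list vertices with at least one such edge
    g.inDegrees.collect().foreach { case (id, d) => println(s"in-degree of $id = $d") }
    g.outDegrees.collect().foreach { case (id, d) => println(s"out-degree of $id = $d") }
    sc.stop() // skip this inside spark-shell
  }
}
```

Vertex 2 (istoica) has no incoming edge, so it does not appear in `inDegrees` at all rather than appearing with a value of 0.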

Sixth, build a user social network graph

[Figure: user social network]

  • Vertex: username, age
  • Edge: number of calls

1. Create a Graph object

// Create the vertex RDD
val userRdd=sc.parallelize(Array(
(1L,("Alice",28)),
(2L,("Bob",27)),
(3L,("Charlie",65)),
(4L,("David",42)),
(5L,("Ed",55)),
(6L,("Fran",50))
))
// Create the edge RDD
val usercallRdd=sc.makeRDD(Array(
Edge(2L,1L,7),
Edge(3L,2L,4),
Edge(4L,1L,1),
Edge(2L,4L,2),
Edge(5L,2L,5),
Edge(5L,3L,8),
Edge(3L,6L,3),
Edge(5L,6L,3)
))
// Create the graph
val userCallGraph=Graph(userRdd,usercallRdd)

2. Data analysis

Find users older than 30

userCallGraph.vertices.filter(v=>v._2._2.toInt>30).collect.foreach(println)

Analysis result:

(4,(David,42))
(5,(Ed,55))
(6,(Fran,50))
(3,(Charlie,65))

Find the pairs of users with more than 5 calls

userCallGraph.triplets.filter(x=>x.attr>5).collect.foreach(x=>println(x.srcAttr._1+" like "+ x.dstAttr._1+" stage:"+x.attr))

Analysis result:

Bob like Alice stage:7
Ed like Charlie stage:8
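
A further analysis in the same spirit is to total the edge attribute per receiving user with `aggregateMessages`. This is a sketch, not part of the original analysis: it assumes the edge attribute counts calls from the source user to the destination user, and the names `CallTotalsSketch`, `build` and `receivedByName` are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

object CallTotalsSketch {
  // Rebuilds userCallGraph from step 1
  def build(sc: SparkContext): Graph[(String, Int), Int] = {
    val userRdd = sc.parallelize(Seq(
      (1L, ("Alice", 28)), (2L, ("Bob", 27)), (3L, ("Charlie", 65)),
      (4L, ("David", 42)), (5L, ("Ed", 55)), (6L, ("Fran", 50))))
    val usercallRdd = sc.makeRDD(Seq(
      Edge(2L, 1L, 7), Edge(3L, 2L, 4), Edge(4L, 1L, 1), Edge(2L, 4L, 2),
      Edge(5L, 2L, 5), Edge(5L, 3L, 8), Edge(3L, 6L, 3), Edge(5L, 6L, 3)))
    Graph(userRdd, usercallRdd)
  }

  // Sum of the edge attributes arriving at each vertex, keyed by username
  def receivedByName(graph: Graph[(String, Int), Int]): Map[String, Int] = {
    val received: VertexRDD[Int] =
      graph.aggregateMessages[Int](ctx => ctx.sendToDst(ctx.attr), _ + _)
    received.join(graph.vertices)
      .map { case (_, (total, (name, _))) => (name, total) }
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[*]").setAppName("CallTotalsSketch"))
    receivedByName(build(sc)).foreach(println)
    sc.stop() // skip this inside spark-shell
  }
}
```

Users with no incoming edge (Ed in this data) receive no message and therefore do not appear in the result.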

Origin blog.csdn.net/qq_42578036/article/details/110184660