GraphX in practice: flight network analysis

1. Task description

Requirements overview

  • Explore the flight network data
  • Build a flight network graph
  • Use Spark GraphX to complete the following tasks
  • Count the number of airports in the flight network graph
  • Count the number of routes in the flight network graph
  • Find the longest flight route (point to point)
  • Find the busiest airport
  • Find the most important airport (PageRank)
  • Find the cheapest flight route (SSSP)

2. Specific analysis

Problem analysis 1: Data exploration

  • Data download link: https://pan.baidu.com/s/1fubnDM_sggw_MWS9iI1AoQ (extraction code: xnxv)
  • Data format: CSV, with fields separated by ","
  • Field order: day of month, day of week, airline, aircraft registration number (tail number), flight number, departure airport ID, departure airport, arrival airport ID, arrival airport, scheduled departure time (hours and minutes), actual departure time, departure delay (minutes), scheduled arrival time, actual arrival time, arrival delay (minutes), scheduled flight time, flight distance
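
To make the field positions concrete, here is a minimal plain-Scala sketch (no Spark required) that splits one line and picks out the five fields used later for the graph. The helper name FlightCsv, the Leg case class, and the sample row are all made up for illustration; only the column order comes from the list above.

```scala
// Hypothetical helper: field indices follow the column order listed above (0-based).
object FlightCsv {
  final case class Leg(originId: Long, origin: String, destId: Long, dest: String, distance: Int)

  // Split one CSV line and keep the fields the graph needs:
  // 5 = departure airport ID, 6 = departure airport,
  // 7 = arrival airport ID,   8 = arrival airport, 16 = flight distance.
  def parse(line: String): Leg = {
    val f = line.split(",")
    Leg(f(5).toLong, f(6), f(7).toLong, f(8), f(16).toInt)
  }

  // Made-up sample row with 17 fields, not taken from the real data set.
  val sample: Leg = parse("1,3,AA,N001AA,100,12478,JFK,12892,LAX,900,855,0,1230,1225,0,330,2475")
}
```

The same indices (5, 6, 7, 8 and 16) are the ones the GraphX code uses when it builds vertices and edges.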

Problem analysis 2: Build the flight network graph

  • Create a property graph Graph[VD, ED]
  • Load the CSV as an RDD, with each airport as a vertex. Key fields: departure airport ID, departure airport, arrival airport ID, arrival airport, flight distance
  • Initialize the vertex set airports: RDD[(VertexId, String)]; the vertex attribute is the airport name
  • Initialize the edge set lines: RDD[Edge[Int]]; the edge attribute is the flight distance
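// Import GraphX
import org.apache.spark.graphx._
// Load the data and split each line into fields
val flights = sc.textFile("file:///data/flight.csv").map(_.split(","))
// Airports (vertices): (airport ID, airport name), deduplicated
val airports = flights.flatMap(x => Array((x(5).toLong, x(6)), (x(7).toLong, x(8)))).distinct
// Routes (edges): (departure ID, arrival ID, distance), deduplicated
val lines = flights.map(x => (x(5).toLong, x(7).toLong, x(16).toInt)).distinct.map(x => Edge(x._1, x._2, x._3))
// Build the graph
val graph = Graph(airports, lines)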
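
The flatMap + distinct pattern used for the vertices and edges can be traced on a few rows with plain Scala collections. This is only a sketch of the deduplication logic, not GraphX itself; the object name TinyFlightGraph and the rows are invented for illustration.

```scala
// Plain-Scala sketch of the vertex/edge deduplication (no Spark needed).
object TinyFlightGraph {
  // Made-up rows: (departure ID, departure airport, arrival ID, arrival airport, distance)
  val rows: Seq[(Long, String, Long, String, Int)] = Seq(
    (1L, "AAA", 2L, "BBB", 500),
    (1L, "AAA", 2L, "BBB", 500), // a second flight on the same route
    (2L, "BBB", 3L, "CCC", 1200),
    (3L, "CCC", 1L, "AAA", 900)
  )

  // Every row contributes both endpoints; distinct keeps one entry per airport.
  val airports: Seq[(Long, String)] =
    rows.flatMap(r => Seq((r._1, r._2), (r._3, r._4))).distinct

  // Many flights may share a route; distinct keeps one edge per (src, dst, distance).
  val routes: Seq[(Long, Long, Int)] =
    rows.map(r => (r._1, r._3, r._5)).distinct
}
```

Four flight rows collapse to three airports and three routes here, which is why the vertex and edge counts of the graph describe airports and routes rather than individual flights.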

Problem analysis 3: Count the airports and routes in the flight network graph

  • Number of airports: the number of vertices, graph.numVertices
  • Number of routes: the number of edges, graph.numEdges
// Number of airports (both forms give the same result)
graph.vertices.count
graph.numVertices
// Number of routes
graph.numEdges

// Cross-check the counts with Spark SQL
val df = spark.read.format("csv").option("header","false").option("delimiter",",").load("file:///data/flight.csv").toDF("dom","dow","carrier","tail_num","fl_num","origin_id","origin","des_id","dest","crs_dep_time","dep_time","dep_delay_mins","crs_arr_time","arr_time","arr_delay_mins","crs_elapse_time","dist")
val df2 = df.select("origin_id","origin","des_id","dest")
// createOrReplaceTempView replaces the deprecated registerTempTable
df2.createOrReplaceTempView("flight")
// Number of departure airports
spark.sql("with t1 as (select origin_id from flight group by origin_id) select count(*) from t1").show // 301
df.select("origin_id").distinct.count // 301
// Number of destination airports
spark.sql("with t1 as (select des_id from flight group by des_id) select count(*) from t1").show // 301
df.select("des_id").distinct.count // 301
// Number of airports that appear as a departure or a destination
df.select("origin_id").union(df.select("des_id")).distinct.count // 301
// Number of routes
df.select("origin_id","des_id").distinct.count

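Problem analysis 4: Find the longest flight route

  • The largest edge attribute: sort the triplets by flight distance in descending order and take the first one
// Sort the triplets by distance, descending, and take the first one
graph.triplets.sortBy(_.attr, false).take(1)
// Same query, formatted as a readable sentence
graph.triplets.sortBy(_.attr, false).map(t => "The distance is %d from %s to %s.".format(t.attr, t.srcAttr, t.dstAttr)).take(1)(0)
// Degree queries (these relate to the busiest-airport question in Problem analysis 5)
graph.inDegrees.sortBy(_._2, false).take(1)
graph.outDegrees.filter(x => x._1 == 10397).collect
val (x, y) = graph.degrees.sortBy(_._2, false).take(1)(0)

// Maximum with reduce
lines.reduce((x, y) => if (x.attr > y.attr) x else y)
// Maximum with fold; the zero value is an edge with distance 0
lines.fold(Edge(0L, 0L, 0))((x, y) => if (x.attr > y.attr) x else y)
// Maximum with the DataFrame API; the "flight" view registered earlier has no dist column, so query df directly
df.orderBy(df("dist").cast("int").desc).show(1)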
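
The reduce and fold variants above differ only in how they start. A small sketch on plain Scala collections shows the two side by side; the Route case class and its values are made up, and Scala's Seq stands in for the RDD.

```scala
// Plain-Scala sketch: maximum by reduce versus maximum by fold.
object LongestRoute {
  final case class Route(src: Long, dst: Long, distance: Int)

  // Made-up routes
  val routes: Seq[Route] = Seq(Route(1L, 2L, 500), Route(2L, 3L, 1200), Route(3L, 1L, 900))

  // Keep whichever of two routes is longer.
  def longer(a: Route, b: Route): Route = if (a.distance > b.distance) a else b

  // reduce starts from the first element, so it needs a non-empty collection.
  val byReduce: Route = routes.reduce(longer)

  // fold starts from an explicit zero value (a route of distance 0),
  // which is also what it returns when the collection is empty.
  val byFold: Route = routes.fold(Route(0L, 0L, 0))(longer)
}
```

Both give the same route as long as at least one real distance is greater than the zero value's distance.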

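Problem analysis 5: Find the busiest airport

  • Which airport has the most arrivals: compute the in-degree of each vertex and take the largest. Because the edges were deduplicated when the graph was built, the in-degree counts distinct incoming routes rather than individual flights.
// By sorting
graph.inDegrees.sortBy(_._2, false).take(1)(0)
// By reduce
graph.inDegrees.reduce((x, y) => if (x._2 > y._2) x else y)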
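
What the in-degree measures can be checked by hand on a small edge list. The following plain-Scala sketch groups made-up routes by destination; it mirrors the idea behind graph.inDegrees but does not use GraphX.

```scala
// Plain-Scala sketch of in-degree: count incoming routes per destination.
object InDegreeSketch {
  // Made-up routes as (departure ID, arrival ID)
  val routes: Seq[(Long, Long)] = Seq((1L, 2L), (3L, 2L), (2L, 3L), (1L, 3L), (4L, 2L))

  // Group by destination and count; airports with no incoming route do not appear.
  val inDegrees: Map[Long, Int] =
    routes.groupBy(_._2).map { case (dst, incoming) => dst -> incoming.size }

  // The busiest airport is the one with the largest in-degree.
  val busiest: (Long, Int) = inDegrees.maxBy(_._2)
}
```

Airport 2 receives routes from airports 1, 3 and 4, so it has the largest in-degree in this toy data.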

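Problem analysis 6: Find the most important airport

  • PageRank with a convergence tolerance of 0.05
// Rank the airports, join the names back in, and take the top-ranked airport name
graph.pageRank(0.05).vertices.join(airports).sortBy(_._2._1, false).map(_._2._2).take(1)

// Same computation with a tighter tolerance of 0.0001
val g = graph.pageRank(0.0001)
g.vertices.take(10)
// Top-ranked vertex as (airport ID, rank)
val iv = g.vertices.reduce((x, y) => if (x._2 > y._2) x else y)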
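
To illustrate what the tolerance argument controls, here is a simplified PageRank iteration on plain Scala collections. It is a sketch under assumptions, not GraphX's actual implementation: it uses the update rank = resetProb + (1 - resetProb) * sum of (source rank / source out-degree), with resetProb = 0.15, and stops once no rank changes by more than the tolerance. The edge list and all names are made up.

```scala
// Simplified PageRank sketch (not GraphX's implementation).
object PageRankSketch {
  // Made-up directed routes; every vertex has at least one outgoing edge.
  val edges: Seq[(Long, Long)] = Seq((1L, 2L), (3L, 2L), (2L, 1L), (4L, 2L))
  val ids: Seq[Long] = edges.flatMap(e => Seq(e._1, e._2)).distinct
  val outDeg: Map[Long, Int] = edges.groupBy(_._1).map { case (s, es) => s -> es.size }

  // Iterate until the largest per-vertex change drops to tol or below.
  def run(tol: Double, resetProb: Double = 0.15): Map[Long, Double] = {
    var ranks: Map[Long, Double] = ids.map(id => id -> 1.0).toMap
    var delta = Double.PositiveInfinity
    while (delta > tol) {
      // Each vertex splits its rank evenly over its outgoing edges.
      val incoming: Map[Long, Double] = edges.groupBy(_._2).map {
        case (dst, es) => dst -> es.map(e => ranks(e._1) / outDeg(e._1)).sum
      }
      val next: Map[Long, Double] =
        ids.map(id => id -> (resetProb + (1 - resetProb) * incoming.getOrElse(id, 0.0))).toMap
      delta = ids.map(id => math.abs(next(id) - ranks(id))).max
      ranks = next
    }
    ranks
  }

  // Vertex with the highest rank at tolerance 0.0001.
  val top: Long = run(0.0001).maxBy(_._2)._1
}
```

A smaller tolerance means more iterations and ranks closer to the fixed point; a loose tolerance such as 0.05 stops earlier, which is usually enough when only the top-ranked vertex matters.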

Problem analysis 7: Find the cheapest flight route

  • Pricing model: price = 180.0 + distance * 0.15
  • SSSP (single-source shortest path) problem: the lowest total cost from a given source vertex to every other vertex
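
As a worked example of the pricing model, a route of distance 1000 costs 180 + 1000 × 0.15 = 330. Because every leg pays the fixed 180, splitting the same distance into two legs of 400 and 600 costs (180 + 60) + (180 + 90) = 510. The sketch below encodes this; the object and function names are hypothetical.

```scala
// Sketch of the pricing model; names are illustrative only.
object Pricing {
  // Price of a single leg: fixed fee plus a per-distance charge.
  def price(distance: Int): Double = 180.0 + distance * 0.15

  // Price of a multi-leg itinerary: each leg pays the fixed fee again.
  def itinerary(legs: Seq[Int]): Double = legs.map(price).sum
}
```

This is why the cheapest path is not simply the one with the smallest total distance: the number of legs matters as well, and the shortest-path search over edge prices accounts for both.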

Pregel

  • Initialize the source vertex to 0.0 and every other vertex to Double.PositiveInfinity
  • Initial message: Double.PositiveInfinity
  • vprog: keeps the minimum of the vertex's current value and the incoming message
  • sendMsg: sends a message along an edge only when it would lower the destination's value; once no messages are sent, the iteration stops
  • mergeMsg: merges the messages received by a vertex by taking the minimum
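// Total number of airports
val count = airports.count
// Sampling multiplier
var fraction = 1.0
// Randomly sample airports; the third argument is the random seed
var samples = airports.sample(false, fraction / count, count)
// Resample with a larger fraction until at least one airport is selected
while (samples.count == 0) {
  fraction = fraction + 1
  samples = airports.sample(false, fraction / count, count)
}
// Use the first sampled airport as the source vertex (14952 in the original run)
val source_id: VertexId = samples.first._1
// Set the initial vertex values and convert edge distances to prices
val init_graph = graph.mapVertices((id, _) => if (id == source_id) 0.0 else Double.PositiveInfinity).mapEdges(e => 180.toDouble + e.attr.toDouble * 0.15)
// Run Pregel
val pregel_graph = init_graph.pregel(Double.PositiveInfinity)(
  // vprog: keep the smaller of the current cost and the incoming cost
  (id, dist, new_dist) => math.min(dist, new_dist),
  // sendMsg: only send when the path through this edge is cheaper
  triplet => {
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else Iterator.empty
  },
  // mergeMsg: keep the cheapest incoming cost
  (a, b) => math.min(a, b)
)
// The three cheapest single routes (edges) by price
val cheap_lines = pregel_graph.edges.map {
  case Edge(src_id, dst_id, price) => (src_id, dst_id, price)
}.takeOrdered(3)(Ordering.by(_._3))
// The three airports cheapest to reach from the source (the source itself comes first, at cost 0.0)
val cheap_airports = pregel_graph.vertices.takeOrdered(3)(Ordering.by(_._2))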
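
The roles of vprog, sendMsg and mergeMsg can be traced on a tiny graph without Spark. The following is a plain-Scala sketch of the same relaxation loop, not the GraphX Pregel API: it scans every edge in each round, whereas Pregel only runs sendMsg on edges next to vertices that received a message. The edge prices and names are made up.

```scala
// Plain-Scala sketch of Pregel-style SSSP (not the GraphX API).
object SsspSketch {
  // Made-up priced routes as (departure ID, arrival ID, price)
  val edges: Seq[(Long, Long, Double)] =
    Seq((1L, 2L, 255.0), (2L, 3L, 240.0), (1L, 3L, 600.0), (3L, 4L, 210.0))

  def run(source: Long): Map[Long, Double] = {
    val ids = edges.flatMap(e => Seq(e._1, e._2)).distinct
    // Initialization: 0.0 at the source, infinity everywhere else.
    var dist: Map[Long, Double] =
      ids.map(id => id -> (if (id == source) 0.0 else Double.PositiveInfinity)).toMap
    var active = true
    while (active) {
      // sendMsg: an edge emits a message only if it lowers the destination's cost.
      val msgs = edges.collect { case (s, d, p) if dist(s) + p < dist(d) => (d, dist(s) + p) }
      // mergeMsg: keep the smallest message per destination.
      val merged: Map[Long, Double] =
        msgs.groupBy(_._1).map { case (d, ms) => d -> ms.map(_._2).min }
      // vprog: each vertex keeps the minimum of its current cost and the merged message.
      dist = dist.map { case (id, cur) => id -> math.min(cur, merged.getOrElse(id, Double.PositiveInfinity)) }
      // Stop once no edge produced a message.
      active = merged.nonEmpty
    }
    dist
  }
}
```

Starting from vertex 1, vertex 3 is first reached directly at 600.0 and then improved to 495.0 via vertex 2 in the next round, which in turn lowers vertex 4 from 810.0 to 705.0 before the loop runs out of messages.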

Origin blog.csdn.net/sun_0128/article/details/107926914