Spark广播变量之broadcast

Spark广播变量用于将一个小的数据集广播到每个节点上,从而使得每个task直接从对应的节点上拉取数据,而不用从Driver端拉取数据,可以节省数据的网络传输和带宽,提高job的运行效率
Spark 广播变量的主要使用是:使用sc.broadcast() 将数据广播出去,在Driver端直接.value 获取值
具体案例如下:

def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("BroadCastApp")
    val sc = new SparkContext(conf)

    val flights = sc.parallelize(List(
      ("SEA", "JFK", "DL", "7:00"),
      ("SFO", "LAX", "AA", "7:05"),
      ("SFO", "JFK", "VX", "7:05"),
      ("JFK", "LAX", "DL", "7:10"),
      ("LAX", "SEA", "DL", "7:10")
    ))

    val airports = sc.parallelize(List(
      ("JFK", "John F .Kennedy International Airport", "New York", "NY"),
      ("LAX", "Los Angels International Airport", "Los Angeles", "CA"),
      ("SEA", "Seattle-Tacoma International Airport", "Seattle", "WA"),
      ("SFO", "San Francisco International Airport", "San Francisco", "CA")
    ))

    val airlines = sc.parallelize(List(
      ("AA", "American Airlines"),
      ("DL", "Delta Airlines"),
      ("VX", "Virgin America")
    ))
    
    // 将数据广播到每个节点上面
    val airportsBC = sc.broadcast(airports.map(x => (x._1, x._3)).collectAsMap())
    val airlinesBC = sc.broadcast(airlines.collectAsMap())
	
	// 使用广播变量的值
    flights.map {
      case (a, b, c, d) => (airportsBC.value.get(a).get,
        airportsBC.value.get(b).get,
        airlinesBC.value.get(c).get,
        d
      )
    }.foreach(println)
    sc.stop()
  }

猜你喜欢

转载自blog.csdn.net/qq_43081842/article/details/105322866