Shared variables

By default, when you reference an external variable inside an operator function (such as `map` or `foreach`), a copy of that variable is shipped to each task. Each task then operates only on its own copy, so changes made in one task are invisible to the other tasks and to the driver. If you need multiple tasks to share state through one variable, this default behavior is not enough.
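The copy-per-task behavior can be seen in a short sketch (a minimal local example; the object and variable names here are illustrative, not from the original post):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ClosureCopy {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("closure-copy")
    val sc = new SparkContext(conf)

    var counter = 0
    // Each task receives its own serialized copy of `counter`;
    // increments happen on the copies, not on the driver-side variable.
    sc.parallelize(1 to 5).foreach(num => counter += num)

    // On a cluster this typically still prints 0 - the driver's copy was never touched.
    println(counter)

    sc.stop()
  }
}
```

(In local mode the tasks may share the driver's JVM, so the printed value is not guaranteed; the point is that this pattern cannot be relied on for aggregation, which is exactly what accumulators are for.)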

Spark provides two kinds of shared variables for this purpose: the broadcast variable and the accumulator. A broadcast variable is copied only once to each node (rather than once per task), which improves performance by reducing network traffic and memory consumption. An accumulator lets multiple tasks add to a common variable, and is mainly used for counters and sums.

Broadcast variables (read-only)

def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("shared")
    val sc = new SparkContext(conf)

    val factor = 3

    // Create the broadcast variable factorBroadcast
    val factorBroadcast = sc.broadcast(factor)

    // Array(1,2,3,4,5)
    val data = Array(1 to 5: _*)

    val rdd = sc.parallelize(data, 2)

    // factorBroadcast.value retrieves the broadcast value
    rdd.map(num => num * factorBroadcast.value).foreach(println)
    
    sc.stop()
}
3
6
9
12
15
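Broadcasting pays off most when the shared value is large, such as a lookup table consulted on every record. A hedged sketch of that pattern (the map contents and names below are illustrative, not from the original post):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookup {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("broadcast-lookup")
    val sc = new SparkContext(conf)

    // A lookup table that would otherwise be re-serialized into every task
    val countryNames = Map("CN" -> "China", "US" -> "United States")
    val namesBroadcast = sc.broadcast(countryNames)

    val codes = sc.parallelize(Seq("CN", "US", "CN"), 2)
    // Each node fetches the broadcast value once and reuses it across its tasks
    codes.map(code => namesBroadcast.value.getOrElse(code, "unknown")).foreach(println)

    // Release the broadcast copies on the executors when no longer needed
    namesBroadcast.unpersist()
    sc.stop()
  }
}
```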

Accumulators (write-only)

def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setMaster("local").setAppName("shared")
    val sc = new SparkContext(conf)

    // Create the shared accumulator
    val accumulator = sc.accumulator(0)
    // Array(1,2,3,4,5)
    val data = Array(1 to 5: _*)

    val rdd = sc.parallelize(data, 2)

    // accumulator += num adds each RDD element to the accumulator
    rdd.foreach(num => accumulator += num)

    // Error! Operator functions run on the executors, which cannot read the accumulator's value
    //    rdd.foreach(num => println(accumulator.value))

    // The accumulator's value can be read on the driver
    println("The value of the shared accumulator is: " + accumulator.value)

    sc.stop()
}
The value of the shared accumulator is: 15
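Note that `sc.accumulator(0)` has been deprecated since Spark 2.0. The same example can be written with the newer `longAccumulator` API (a sketch assuming Spark 2.x or later; the accumulator name "sum" is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object NewAccumulator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("shared")
    val sc = new SparkContext(conf)

    // longAccumulator replaces the deprecated sc.accumulator
    val acc = sc.longAccumulator("sum")

    sc.parallelize(1 to 5, 2).foreach(num => acc.add(num))

    // As before, the value is only readable on the driver
    println("sum = " + acc.value)

    sc.stop()
  }
}
```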

Origin www.cnblogs.com/studyNotesSL/p/11432902.html