1. Broadcast variables
1.1 Broadcast variable understanding diagram
1.2 Use of broadcast variables
object SparkBroadCast {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local")
    conf.setAppName("SparkBroadCast")
    val sparkContext = new SparkContext(conf)
    val list = List("spark", "hadoop", "flink")
    // broadcast the list
    val broadcast = sparkContext.broadcast(list)
    val wordList = List("spark", "hadoop", "java", "python")
    sparkContext.parallelize(wordList).filter(word => {
      // use the broadcast variable
      val value = broadcast.value
      value.contains(word)
    }).foreach(println)
  }
}
1.3 Notes and caveats
- Can an RDD itself be broadcast? No. An RDD does not hold the data itself, only the lineage to compute it. Collect the RDD's result on the Driver first, then broadcast that result.
- Broadcast variables can only be defined on the Driver side, never on the Executor side.
- The value of a broadcast variable can be modified on the Driver side; it cannot be modified on the Executor side.
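The first caveat above can be sketched as follows: collect the RDD's result to the Driver, then broadcast the resulting plain collection. This is a minimal illustration; the stop-word data and object name are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastRddResult {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("BroadcastRddResult")
    val sc = new SparkContext(conf)

    // An RDD cannot be broadcast directly; materialize its result on the Driver first.
    val stopWordsRdd = sc.parallelize(List("a", "an", "the"))
    val stopWords = stopWordsRdd.collect().toSet  // now a plain Scala collection

    // The collected result can be broadcast like any local value.
    val stopWordsBc = sc.broadcast(stopWords)

    val words = sc.parallelize(List("the", "spark", "a", "hadoop"))
    words.filter(w => !stopWordsBc.value.contains(w)).foreach(println)
    sc.stop()
  }
}
```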
2. Accumulator
2.1 Accumulator understanding diagram
2.2 Use of the accumulator
object SparkAccumulator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("SparkAccumulator")
    conf.setMaster("local")
    val sparkContext = new SparkContext(conf)
    // define a long accumulator
    val acc = sparkContext.longAccumulator
    val list = List("kafka", "java", "spark", "hadoop", "flink")
    // iterate over the RDD and count the words
    sparkContext.parallelize(list).foreach(word => acc.add(1))
    val value = acc.value
    println(value)
    sparkContext.stop()
  }
}
2.3 Custom accumulator
- Extend AccumulatorV2, with both the input and output type set to Int:

package com.abcft.spark.acc

import org.apache.spark.util.AccumulatorV2

class CustomAcc extends AccumulatorV2[Int, Int] {
  var acc = 0

  override def isZero: Boolean = acc == 0

  override def copy(): AccumulatorV2[Int, Int] = {
    val v = new CustomAcc()
    v.acc = this.acc
    v
  }

  // zero out this instance's state
  override def reset(): Unit = {
    acc = 0
  }

  override def add(v: Int): Unit = {
    acc = acc + v
  }

  override def merge(other: AccumulatorV2[Int, Int]): Unit = {
    acc = other.value + acc
  }

  override def value: Int = acc
}
- Use the custom accumulator:

import com.abcft.spark.acc.CustomAcc
import org.apache.spark.{SparkConf, SparkContext}

object SparkCustomAcc {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local")
    conf.setAppName("SparkCustomAcc")
    val sparkContext = new SparkContext(conf)
    // create and register the accumulator
    val acc = new CustomAcc()
    sparkContext.register(acc, "CustomAcc")
    val list = List("kafka", "java", "spark", "hadoop", "flink")
    // iterate over the RDD and count the words
    sparkContext.parallelize(list).foreach(word => acc.add(1))
    val value = acc.value
    println(value)
    sparkContext.stop()
  }
}
2.4 Key points
An accumulator is defined, and given its initial value, on the Driver side. It can only be read on the Driver side and only updated on the Executor side.
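The rule above also explains why a plain driver-side variable cannot be used for counting: each task receives its own serialized copy of the closure, so updates to a captured `var` are not reliably propagated back to the Driver, while accumulator updates are merged back after the action completes. A minimal sketch (object name is made up for the example; the plain counter's final value depends on the deploy mode, so it is not asserted):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorVsLocalVar {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("AccumulatorVsLocalVar")
    val sc = new SparkContext(conf)

    var plainCounter = 0                      // captured by value into each task's closure
    val acc = sc.longAccumulator("counter")   // registered with the Driver

    sc.parallelize(1 to 100).foreach { _ =>
      plainCounter += 1   // updates a task-local copy; the Driver's copy is not reliably changed
      acc.add(1)          // merged back to the Driver when the action completes
    }

    println(s"plainCounter = $plainCounter")  // unreliable across deploy modes
    println(s"accumulator  = ${acc.value}")   // 100, read on the Driver side
    sc.stop()
  }
}
```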