1. Broadcast Variables
1.1 How Broadcast Variables Work (diagram)
1.2 Using Broadcast Variables
import org.apache.spark.{SparkConf, SparkContext}

object SparkBroadCast {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local")
    conf.setAppName("SparkBroadCast")
    val sparkContext = new SparkContext(conf)
    val list = List("spark", "hadoop", "flink")
    // Broadcast the list to all executors
    val broadcast = sparkContext.broadcast(list)
    val wordList = List("spark", "hadoop", "java", "python")
    sparkContext.parallelize(wordList).filter(word => {
      // Read the broadcast variable's value on the executor
      val value = broadcast.value
      value.contains(word)
    }).foreach(println)
    sparkContext.stop()
  }
}
1.3 Notes
- Can an RDD itself be broadcast?
  No. An RDD does not store data, so it cannot be broadcast directly; you can, however, broadcast the computed result of an RDD (e.g., after collecting it to the Driver).
- Broadcast variables can only be defined on the Driver side, not on the Executor side.
- The value of a broadcast variable can be modified on the Driver side, but cannot be modified on the Executor side.
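The point of broadcasting is shipping cost: a variable captured in a closure is serialized once per task, while a broadcast variable is shipped once per executor and cached there. A minimal plain-Scala sketch of that accounting (no Spark required; the executor and task counts are illustrative assumptions, not Spark internals):

```scala
object BroadcastCostSketch {
  // Without broadcast: the driver serializes the captured variable
  // once for every task it schedules.
  def copiesWithoutBroadcast(executors: Int, tasksPerExecutor: Int): Int =
    executors * tasksPerExecutor

  // With broadcast: one copy is cached per executor and shared by
  // all of that executor's tasks.
  def copiesWithBroadcast(executors: Int): Int = executors

  def main(args: Array[String]): Unit = {
    println(copiesWithoutBroadcast(4, 25)) // 100 serialized copies
    println(copiesWithBroadcast(4))        // 4 serialized copies
  }
}
```

The saving grows with the size of the shared data and the number of tasks, which is why large lookup tables are the typical broadcast use case.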
2. Accumulators
2.1 How Accumulators Work (diagram)
2.2 Using Accumulators
import org.apache.spark.{SparkConf, SparkContext}

object SparkAccumulator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("SparkAccumulator")
    conf.setMaster("local")
    val sparkContext = new SparkContext(conf)
    // Define a long accumulator
    val acc = sparkContext.longAccumulator
    val list = List("kafka", "java", "spark", "hadoop", "flink")
    // Iterate over the RDD and count the number of words
    sparkContext.parallelize(list).foreach(word => acc.add(1))
    val value = acc.value
    println(value)
    sparkContext.stop()
  }
}
2.3 Custom Accumulators
- Extend AccumulatorV2, with Int as both the input and output type:

package com.abcft.spark.acc

import org.apache.spark.util.AccumulatorV2

class CustomAcc extends AccumulatorV2[Int, Int] {
  var acc = 0

  override def isZero: Boolean = acc == 0

  override def copy(): AccumulatorV2[Int, Int] = {
    val v = new CustomAcc()
    v.acc = this.acc
    v
  }

  override def reset(): Unit = {
    // Clear this accumulator's state. (The original version created a
    // new CustomAcc and discarded it, which left the state unchanged.)
    acc = 0
  }

  override def add(v: Int): Unit = {
    this.acc = this.acc + v
  }

  override def merge(other: AccumulatorV2[Int, Int]): Unit = {
    this.acc = other.value + this.acc
  }

  override def value: Int = this.acc
}
- Use the custom accumulator:

import com.abcft.spark.acc.CustomAcc
import org.apache.spark.{SparkConf, SparkContext}

object SparkCustomAcc {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local")
    conf.setAppName("SparkCustomAcc")
    val sparkContext = new SparkContext(conf)
    // Create the accumulator and register it with the SparkContext
    val acc = new CustomAcc()
    sparkContext.register(acc, "CustomAcc")
    val list = List("kafka", "java", "spark", "hadoop", "flink")
    // Iterate over the RDD and count the number of words
    sparkContext.parallelize(list).foreach(word => acc.add(1))
    val value = acc.value
    println(value)
    sparkContext.stop()
  }
}
2.4 Notes
An accumulator is defined and given its initial value on the Driver side. Its value can only be read on the Driver side, and it can only be updated on the Executor side.
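The update-on-Executor, read-on-Driver rule above can be pictured as each task working on its own copy of the accumulator, with the driver merging the per-task results via merge(). A plain-Scala sketch of that lifecycle (no Spark required; MiniAcc is a hypothetical stand-in for Spark's LongAccumulator, not a Spark class):

```scala
// Minimal stand-in mirroring an accumulator's add/merge operations.
class MiniAcc {
  var sum: Long = 0L
  def add(v: Long): Unit = sum += v                   // executor-side update
  def merge(other: MiniAcc): Unit = sum += other.sum  // driver-side merge
}

object AccumulatorMergeSketch {
  def count(partitions: List[List[String]]): Long = {
    // Each task updates its own fresh copy of the accumulator ...
    val taskCopies = partitions.map { part =>
      val local = new MiniAcc
      part.foreach(_ => local.add(1))
      local
    }
    // ... and the driver merges the per-task results into the final value.
    val driverAcc = new MiniAcc
    taskCopies.foreach(driverAcc.merge)
    driverAcc.sum
  }

  def main(args: Array[String]): Unit = {
    println(count(List(List("kafka", "java"), List("spark", "hadoop", "flink")))) // 5
  }
}
```

This also explains why reading acc.value inside a task is meaningless: a task only ever sees its own local copy, never the merged total.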