Spark shared variables: broadcast and accumulator

broadcast

Official documentation:

Broadcast a read-only variable to the cluster, returning a [[org.apache.spark.broadcast.Broadcast]] object for reading it in distributed functions. The variable will be sent to each cluster only once.

Source analysis:

A warning was originally used here instead of an exception so as not to interrupt the user's program, since some users may create broadcast variables but never actually use them; in the version shown below the check has become a hard require.


  /**
   * Broadcast a read-only variable to the cluster, returning a
   * [[org.apache.spark.broadcast.Broadcast]] object for reading it in distributed functions.
   * The variable will be sent to each cluster only once.
   */
  def broadcast[T: ClassTag](value: T): Broadcast[T] = {
    assertNotStopped()
    require(!classOf[RDD[_]].isAssignableFrom(classTag[T].runtimeClass),
      "Can not directly broadcast RDDs; instead, call collect() and broadcast the result.")
    val bc = env.broadcastManager.newBroadcast[T](value, isLocal)
    val callSite = getCallSite
    logInfo("Created broadcast " + bc.id + " from " + callSite.shortForm)
    cleaner.foreach(_.registerBroadcastForCleanup(bc))
    bc
  }
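
Regarding the require above: an RDD cannot be broadcast directly, so the usual pattern is to collect() it to the driver first and then broadcast the resulting local collection. A minimal sketch (assuming an existing JavaSparkContext named javaSparkContext, as in the examples below):

JavaRDD<Integer> rdd = javaSparkContext.parallelize(Arrays.asList(1, 2, 3));
// javaSparkContext.broadcast(rdd);  // would fail the require above
List<Integer> local = rdd.collect();  // bring the data to the driver
Broadcast<List<Integer>> ok = javaSparkContext.broadcast(local);  // broadcast the local copy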

Broadcast variables allow the programmer to cache a read-only variable on each machine rather than shipping a copy of it with every task. They can be used, for example, to give every node a copy of a large input dataset efficiently. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost. Spark actions are executed through a series of stages, separated by distributed shuffle operations, and Spark automatically broadcasts the common data needed by the tasks within each stage. Data broadcast this way is cached in serialized form and deserialized before each task runs. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data, or when caching the data in deserialized form is important.

Examples


List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 5);
final Broadcast<List<Integer>> broadcast = javaSparkContext.broadcast(data);
JavaRDD<Integer> result = javaRDD.map(new Function<Integer, Integer>() {
  @Override
  public Integer call(Integer v1) throws Exception {
    // Read the broadcast value inside the task so it is fetched on the
    // executor rather than serialized into the closure on the driver.
    List<Integer> iList = broadcast.value();
    Integer isum = 0;
    for (Integer i : iList)
      isum += i;
    return v1 + isum;
  }
});
// Each element is incremented by the sum of the list (19):
// prints [24, 20, 20, 23, 23, 21, 21]
System.out.println(result.collect());

accumulator


Source analysis:

// Methods for creating shared variables

  /**
   * Create an [[org.apache.spark.Accumulator]] variable of a given type, which tasks can "add"
   * values to using the `+=` method. Only the driver can access the accumulator's `value`.
   */
  @deprecated("use AccumulatorV2", "2.0.0")
  def accumulator[T](initialValue: T)(implicit param: AccumulatorParam[T]): Accumulator[T] = {
    val acc = new Accumulator(initialValue, param)
    cleaner.foreach(_.registerAccumulatorForCleanup(acc.newAcc))
    acc
  }

  /**
   * Create an [[org.apache.spark.Accumulator]] variable of a given type, with a name for display
   * in the Spark UI. Tasks can "add" values to the accumulator using the `+=` method. Only the
   * driver can access the accumulator's `value`.
   */
  @deprecated("use AccumulatorV2", "2.0.0")
  def accumulator[T](initialValue: T, name: String)(implicit param: AccumulatorParam[T])
    : Accumulator[T] = {
    val acc = new Accumulator(initialValue, param, Some(name))
    cleaner.foreach(_.registerAccumulatorForCleanup(acc.newAcc))
    acc
  }
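
Both overloads above are deprecated since 2.0.0 in favor of AccumulatorV2. A minimal sketch of the replacement path (assuming Spark 2.x and the same javaSparkContext; longAccumulator registers a built-in LongAccumulator under the given UI name):

LongAccumulator counter = javaSparkContext.sc().longAccumulator("counter");
javaSparkContext.parallelize(Arrays.asList(1, 2, 3, 4)).foreach(x -> counter.add(x));
System.out.println(counter.value());  // only the driver reads the value: 10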

Accumulators are variables that are only "added" to through an associative operation, and can therefore be efficiently supported in parallel. They can be used to implement counters and sums. Spark natively supports only numeric accumulators, but developers can add support for new types. If you give an accumulator a name when creating it, it will be displayed in the Spark UI, which helps in understanding the progress of each execution stage (not yet supported for Python).
An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running on the cluster can then accumulate into it with the add method or the += operator, but they cannot read its value. Only the driver can read the value, through the accumulator's value method.

class VectorAccumulatorParam implements AccumulatorParam<Vector> {
  // Merge two accumulated values:
  // r1 is one set of accumulated data, r2 is another.
  @Override
  public Vector addInPlace(Vector r1, Vector r2) {
    r1.addAll(r2);
    return r1;
  }

  // The initial (zero) value of the accumulator.
  @Override
  public Vector zero(Vector initialValue) {
    return initialValue;
  }

  // Add new data to the accumulated value:
  // t1 is the current accumulator, t2 is the value being added to it.
  @Override
  public Vector addAccumulator(Vector t1, Vector t2) {
    t1.addAll(t2);
    return t1;
  }
}
List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 5);

final Accumulator<Integer> accumulator = javaSparkContext.accumulator(0);
Vector initialValue = new Vector();
for (int i = 6; i < 9; i++)
  initialValue.add(i);
// Custom accumulator built on VectorAccumulatorParam.
final Accumulator<Vector> accumulator1 =
    javaSparkContext.accumulator(initialValue, new VectorAccumulatorParam());
JavaRDD<Integer> result = javaRDD.map(new Function<Integer, Integer>() {
  @Override
  public Integer call(Integer v1) throws Exception {
    accumulator.add(1);
    Vector term = new Vector();
    term.add(v1);
    accumulator1.add(term);
    return v1;
  }
});
// map is lazy: collect() triggers the job, so the accumulators are only
// populated after it runs; they are then read on the driver.
System.out.println(result.collect());
System.out.println("~~~~~~~~~~~~~~~~~~~~~" + accumulator.value());  // 7, one add per element
System.out.println("~~~~~~~~~~~~~~~~~~~~~" + accumulator1.value());
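
For comparison, the same vector-style accumulation under the AccumulatorV2 API can use the built-in CollectionAccumulator; a sketch assuming Spark 2.x:

CollectionAccumulator<Integer> elems = javaSparkContext.sc().collectionAccumulator("elems");
javaRDD.foreach(x -> elems.add(x));  // each task appends the elements it sees
System.out.println(elems.value());   // read on the driver; element order is nondeterministic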

Reference article:

https://www.cnblogs.com/jinggangshan/p/8117155.html
