Spark shared variables: the Accumulator

I. Introduction

To make it easier to maintain common counters and statistics across a job, Spark provides two kinds of shared variables: Broadcast (broadcast variables) and Accumulator (accumulators). They let you conveniently share variables or data with every node in the cluster. Today we look at the Accumulator.

The Accumulator as a whole is maintained on the Driver side, and its current value is read on the Driver side. Each Task on the Executors also maintains its own copy of the accumulator variable, but performs only local accumulation; when a Task finishes running, its local result is sent back to the Driver to be merged. An Accumulator has two properties:

1. It can only be accumulated, i.e. values are combined by adding/merging;

2. It does not change Spark's lazy execution model: until an action triggers the job, the accumulator may still hold only its initial value.
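These two properties can be illustrated with a plain-Java simulation of what Spark does (this is a sketch, not Spark API code): each Task accumulates into its own local copy, and the Driver only merges the results of finished Tasks.

```java
import java.util.Arrays;
import java.util.List;

public class AccumulatorMergeDemo {
    public static void main(String[] args) {
        // Two "partitions" of data, as Spark would hand to two Tasks
        List<int[]> partitions = Arrays.asList(
                new int[]{1, 2, 3},
                new int[]{4, 5});

        long driverTotal = 0L; // the Driver-side value
        for (int[] partition : partitions) {
            // each Task keeps its own local copy and only ever adds to it
            long localSum = 0L;
            for (int v : partition) {
                localSum += v;
            }
            // when the Task completes, its local result is merged on the Driver
            driverTotal += localSum;
        }
        System.out.println(driverTotal); // 15
    }
}
```

Until the loop (standing in for an action) runs, `driverTotal` still holds its initial value, which is exactly property 2.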

 

II. Accumulator types (Spark 2.x)

1. Accumulators that ship with Spark

     (1) LongAccumulator (accumulates Long integer values)

     (2) DoubleAccumulator (accumulates Double floating-point values)

     (3) CollectionAccumulator (accumulates elements into a collection)

They are created as follows:

LongAccumulator longAccumulator = sc.sc().longAccumulator("longAccumulator"); // "longAccumulator" is the accumulator's name shown on the web UI

 

2. Custom accumulators: inherit from the abstract class AccumulatorV2

Its abstract methods, including isZero(), copy(), reset(), add(), merge() and value(), must all be implemented;
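Before the full example, the shape of that contract can be sketched in plain Java (no Spark dependency; the class name and the long-sum payload here are illustrative, not Spark's actual LongAccumulator):

```java
// A plain-Java sketch of the AccumulatorV2 method contract, to show what
// each required method is responsible for.
public class LongAccV2Sketch {
    private long sum = 0L;

    public boolean isZero() { return sum == 0L; }       // is this the zero value?
    public LongAccV2Sketch copy() {                     // copy shipped to each Task
        LongAccV2Sketch c = new LongAccV2Sketch();
        c.sum = this.sum;
        return c;
    }
    public void reset() { sum = 0L; }                   // back to the zero value
    public void add(Long v) { sum += v; }               // Executor-side local add
    public void merge(LongAccV2Sketch other) {          // Driver-side merge
        sum += other.sum;
    }
    public Long value() { return sum; }                 // Driver-side read

    public static void main(String[] args) {
        LongAccV2Sketch driver = new LongAccV2Sketch();
        LongAccV2Sketch task1 = driver.copy(); task1.reset(); task1.add(3L);
        LongAccV2Sketch task2 = driver.copy(); task2.reset(); task2.add(4L);
        driver.merge(task1);
        driver.merge(task2);
        System.out.println(driver.value()); // 7
    }
}
```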

Below is a custom accumulator I implemented that accumulates counts into a concatenated string:

package com.renyang.sparkproject.spark.session;

import com.renyang.sparkproject.constant.Constants;
import com.renyang.sparkproject.util.StringUtils;
import org.apache.spark.util.AccumulatorV2;

public class SessionAggrStatAccumulatorV2 extends AccumulatorV2<String, String> {
    private static final long serialVersionUID = 6311074555136039130L;

    private String data = "session_count=0|1s_3s=0|4s_6s=3|7s_9s=0|10s_30s=0|30s_60s=0|1m_3m=0|3m_10m=0|10m_30m=0|30m=0|1_3=0|4_6=1|7_9=0|10_30=0|30_60=0|60=0";

    private String zero = data;

    @Override
    public boolean isZero() {
        return data.equals(zero);
    }

    @Override
    public AccumulatorV2<String, String> copy() {
        return new SessionAggrStatAccumulatorV2();
    }

    @Override
    public void reset() {
        data = zero;
    }

    @Override
    public void add(String v) {
        data = add(data, v);
    }

    @Override
    public void merge(AccumulatorV2<String, String> other) {
        SessionAggrStatAccumulatorV2 o =(SessionAggrStatAccumulatorV2)other;
        String[] words = data.split("\\|");
        String[] owords = o.data.split("\\|");
        for (int i = 0; i < words.length; i++) {
            for (int j = 0; j < owords.length; j++) {
                if (words[i].split("=")[0].equals(owords[j].split("=")[0])){
                    int value = Integer.valueOf(words[i].split("=")[1]) +Integer.valueOf(owords[j].split("=")[1]);
                    String ns = StringUtils.setFieldInConcatString(data, "\\|", owords[j].split("=")[0], String.valueOf(value));
                    // update the concatenated string on every merge
                    data = ns;
                }
            }
        }
    }

    @Override
    public String value() {
        return data;
    }

    /**
     * Accumulation logic for the session statistics.
     * @param v1 the concatenated statistics string
     * @param v2 the range key to increment
     * @return the updated concatenated string
     */
    private String add(String v1, String v2) {
        // guard: if v1 is empty, return v2 directly
        if (StringUtils.isEmpty(v1)) {
            return v2;
        }

        // use the StringUtils helper to extract the value for key v2 from v1, then add 1
        String oldValue = StringUtils.getFieldFromConcatString(v1, "\\|", v2);
        if (oldValue != null) {
            // increment the original value for this range by 1
            int newValue = Integer.valueOf(oldValue) + 1;
            // use the StringUtils helper to set v2's value in v1 to the new accumulated value
            return StringUtils.setFieldInConcatString(v1, "\\|", v2, String.valueOf(newValue));
        }

        return v1;
    }
}
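The interplay between add() and the string helpers can be exercised in isolation. The getField/setField methods below are simplified, hypothetical stand-ins for the project's com.renyang.sparkproject.util.StringUtils, whose source is not shown in this post:

```java
// Simplified stand-ins for getFieldFromConcatString / setFieldInConcatString,
// operating on "key=value|key=value" strings.
public class ConcatStringDemo {
    static String getField(String str, String delimiter, String field) {
        for (String kv : str.split(delimiter)) {
            String[] parts = kv.split("=");
            if (parts[0].equals(field)) return parts[1];
        }
        return null;
    }

    static String setField(String str, String delimiter, String field, String newValue) {
        String[] kvs = str.split(delimiter);
        for (int i = 0; i < kvs.length; i++) {
            if (kvs[i].split("=")[0].equals(field)) {
                kvs[i] = field + "=" + newValue;
            }
        }
        return String.join("|", kvs);
    }

    public static void main(String[] args) {
        String data = "session_count=0|1s_3s=0";
        // add("...", "session_count") bumps that one counter by 1
        String old = getField(data, "\\|", "session_count");
        data = setField(data, "\\|", "session_count", String.valueOf(Integer.valueOf(old) + 1));
        System.out.println(data); // session_count=1|1s_3s=0
    }
}
```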

III. How the Accumulator works

1. The Driver side defines and registers the accumulator

The accumulator is defined and initialized on the Driver side, and it must be registered with the SparkContext so that the accumulator variable can be distributed to every node in the cluster. After each Task finishes running, its accumulator result is sent back to the Driver and merged there. The merging happens per completed Task: as soon as a Task finishes, the accumulator variable is updated.

2. The Executor side

When an Executor receives a Task, it deserializes not only the RDD and the function but also the Accumulator. After the Executor finishes executing the Task, the Accumulator's result is returned to the Driver along with the Task's result.
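For the string accumulator above, the Driver-side merge of two returned Executor partial results is key-by-key addition. A minimal plain-Java sketch (assuming the same "key=value|key=value" format; this mirrors the merge() method rather than Spark internals):

```java
// Merge two partial statistics strings key by key, summing matching counters.
public class MergeByKeyDemo {
    static String merge(String a, String b) {
        String[] akvs = a.split("\\|");
        String[] bkvs = b.split("\\|");
        for (int i = 0; i < akvs.length; i++) {
            String key = akvs[i].split("=")[0];
            for (String bkv : bkvs) {
                if (bkv.split("=")[0].equals(key)) {
                    int sum = Integer.parseInt(akvs[i].split("=")[1])
                            + Integer.parseInt(bkv.split("=")[1]);
                    akvs[i] = key + "=" + sum;
                }
            }
        }
        return String.join("|", akvs);
    }

    public static void main(String[] args) {
        // two Executor-side partial results arriving back at the Driver
        System.out.println(merge("session_count=2|1s_3s=1", "session_count=3|1s_3s=0"));
        // session_count=5|1s_3s=1
    }
}
```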


Origin: www.cnblogs.com/renyang/p/12606725.html