Flink Reduce, GroupReduce, and GroupCombine notes

1, Reduce on a grouped DataSet (it can also be applied to an ungrouped DataSet)

A Reduce transformation applied to a grouped DataSet reduces each group to a single element using a user-defined reduce function. For each group of input elements, the reduce function successively combines pairs of elements into one element until only a single element remains per group. Note that for a ReduceFunction, the key fields of the returned object should match those of the input values. This is because reduce is implicitly combinable, and objects emitted by the combine operator are grouped by key again when they are passed to the reduce operator.
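To make the pairwise semantics concrete, here is a minimal plain-Java sketch with no Flink dependency. It only models what a grouped reduce computes; Flink's real execution is distributed and combines eagerly. The class `GroupedReduceSketch`, its nested `WC` type, and the helper `reduceByKey` are hypothetical names used for illustration only.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical, Flink-free model of a grouped reduce.
final class GroupedReduceSketch {

    // Hypothetical word-count element, mirroring a (word, count) POJO.
    static final class WC {
        final String word;
        final int count;

        WC(String word, int count) {
            this.word = word;
            this.count = count;
        }
    }

    // Folds the elements of each group pairwise until one element per key remains.
    static <T, K> Map<K, T> reduceByKey(List<T> input,
                                        Function<T, K> keyOf,
                                        BinaryOperator<T> reducer) {
        Map<K, T> result = new LinkedHashMap<>();
        for (T element : input) {
            // merge() stores the first element of a group as-is and combines
            // every later element with the current partial result.
            result.merge(keyOf.apply(element), element, reducer);
        }
        return result;
    }

    public static void main(String[] args) {
        List<WC> words = Arrays.asList(new WC("a", 1), new WC("b", 2), new WC("a", 3));
        Map<String, WC> reduced = reduceByKey(
                words,
                wc -> wc.word,
                // The result keeps the key of its inputs, as the note above requires.
                (x, y) -> new WC(x.word, x.count + y.count));
        reduced.forEach((word, wc) -> System.out.println(word + " -> " + wc.count));
    }
}
```

If the reducer returned an element with a different key, a partial result could end up in the wrong group when it is grouped again, which is why the key fields must be preserved.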

1.1 Reduce on a DataSet grouped by key expression

A key expression specifies one or more fields of each DataSet element. Each key expression is either the name of a public field or of a getter method. A dot can be used to drill down into nested objects. The key expression "*" selects all fields. The following code shows how to group a POJO DataSet with a key expression and reduce it with a reduce function.



import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.DataSet;

// some ordinary POJO
public class WC {
  public String word;
  public int count;
  // [...]
}

// ReduceFunction that sums Integer attributes of a POJO
public class WordCounter implements ReduceFunction<WC> {
  @Override
  public WC reduce(WC in1, WC in2) {
    return new WC(in1.word, in1.count + in2.count);
  }
}

// [...]
DataSet<WC> words = // [...]
DataSet<WC> wordCounts = words
                         // DataSet grouping on field "word"
                         .groupBy("word")
                         // apply ReduceFunction on grouped DataSet
                         .reduce(new WordCounter());

1.2 Reduce on a DataSet grouped by a KeySelector function

A key selector function extracts a key value from each element of a DataSet. The extracted key is used to group the DataSet. The following code shows how to group a POJO DataSet with a key selector function and reduce it with a reduce function.


import org.apache.flink.api.java.functions.KeySelector;

// some ordinary POJO
public class WC {
  public String word;
  public int count;
  // [...]
}

// ReduceFunction that sums Integer attributes of a POJO
public class WordCounter implements ReduceFunction<WC> {
  @Override
  public WC reduce(WC in1, WC in2) {
    return new WC(in1.word, in1.count + in2.count);
  }
}

// [...]
DataSet<WC> words = // [...]
DataSet<WC> wordCounts = words
                         // DataSet grouping with the SelectWord key selector
                         .groupBy(new SelectWord())
                         // apply ReduceFunction on grouped DataSet
                         .reduce(new WordCounter());

// KeySelector that extracts the word of a WC as the grouping key
public class SelectWord implements KeySelector<WC, String> {
  @Override
  public String getKey(WC w) {
    return w.word;
  }
}

1.3 Reduce on a DataSet of Tuples grouped by field position keys (the positions work like an index)

Field position keys specify one or more fields of a Tuple DataSet that are used as grouping keys.


DataSet<Tuple3<String, Integer, Double>> tuples = // [...]
DataSet<Tuple3<String, Integer, Double>> reducedTuples = tuples
                                         // group DataSet on first and second field of Tuple
                                         .groupBy(0, 1)
                                         // apply ReduceFunction on grouped DataSet
                                         .reduce(new MyTupleReducer());
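The notes do not show what `MyTupleReducer` does, so the following plain-Java sketch only illustrates the idea of a composite key built from field positions. It has no Flink dependency: tuples are modeled as `List<Object>`, and `PositionKeySketch`, `keyOf`, `groupAndReduce`, and the reducer `sumThirdField` (which keeps the key fields and sums the third field) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Hypothetical, Flink-free model of grouping tuples by field positions.
final class PositionKeySketch {

    // Builds the composite grouping key from the given field positions.
    static List<Object> keyOf(List<Object> tuple, int... positions) {
        List<Object> key = new ArrayList<>();
        for (int p : positions) {
            key.add(tuple.get(p));
        }
        return key;
    }

    // Groups tuples on the given positions and reduces each group pairwise.
    static List<List<Object>> groupAndReduce(List<List<Object>> tuples,
                                             BinaryOperator<List<Object>> reducer,
                                             int... positions) {
        Map<List<Object>, List<Object>> groups = new LinkedHashMap<>();
        for (List<Object> t : tuples) {
            groups.merge(keyOf(t, positions), t, reducer);
        }
        return new ArrayList<>(groups.values());
    }

    // Assumed reducer: keeps fields 0 and 1 (the key) and sums field 2.
    static List<Object> sumThirdField(List<Object> a, List<Object> b) {
        return Arrays.<Object>asList(a.get(0), a.get(1), (Double) a.get(2) + (Double) b.get(2));
    }

    public static void main(String[] args) {
        List<List<Object>> tuples = Arrays.asList(
                Arrays.<Object>asList("x", 1, 1.5),
                Arrays.<Object>asList("x", 2, 4.0),
                Arrays.<Object>asList("x", 1, 2.5));
        // Positions 0 and 1 together form the key, like groupBy(0, 1).
        System.out.println(groupAndReduce(tuples, PositionKeySketch::sumThirdField, 0, 1));
    }
}
```

Two tuples land in the same group only when they agree on every listed position, so `("x", 1, ...)` and `("x", 2, ...)` are reduced separately.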

1.4 Reduce on the full DataSet

The Reduce transformation can also apply a user-defined reduce function to all elements of a DataSet. The reduce function successively combines pairs of elements into one element until only a single element remains.

Reducing a full DataSet with the Reduce transformation means that the final Reduce operation cannot run in parallel. However, a reduce function is automatically combinable, so the Reduce transformation does not limit scalability for most use cases.

The following code shows how to sum all elements of an Integer DataSet:


// ReduceFunction that sums Integers
public class IntSummer implements ReduceFunction<Integer> {
  @Override
  public Integer reduce(Integer num1, Integer num2) {
    return num1 + num2;
  }
}

// [...]
DataSet<Integer> intNumbers = // [...]
DataSet<Integer> sum = intNumbers.reduce(new IntSummer());
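The scalability remark in this section can be illustrated with a small plain-Java model: each partition is reduced locally first (the combine step), so the final, non-parallel reduce only sees one partial result per partition. This is a sketch under simplifying assumptions, not Flink's implementation; `FullReduceSketch`, `reduceAll`, and `combineThenReduce` are hypothetical names, and partitions are modeled as plain lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

// Hypothetical, Flink-free model of a full reduce with local pre-combining.
final class FullReduceSketch {

    // Pairwise reduction of a non-empty list down to one element.
    static <T> T reduceAll(List<T> elements, BinaryOperator<T> reducer) {
        if (elements.isEmpty()) {
            throw new IllegalArgumentException("nothing to reduce");
        }
        T acc = elements.get(0);
        for (int i = 1; i < elements.size(); i++) {
            acc = reducer.apply(acc, elements.get(i));
        }
        return acc;
    }

    // Reduces each partition on its own, then reduces the partial results once.
    static <T> T combineThenReduce(List<List<T>> partitions, BinaryOperator<T> reducer) {
        List<T> partials = new ArrayList<>();
        for (List<T> partition : partitions) {
            if (!partition.isEmpty()) {
                // This part could run in parallel, one task per partition.
                partials.add(reduceAll(partition, reducer));
            }
        }
        // The final step is sequential but only sees one value per partition.
        return reduceAll(partials, reducer);
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(1, 2, 3), Arrays.asList(4, 5), Arrays.asList(6));
        System.out.println(combineThenReduce(partitions, Integer::sum));
    }
}
```

The pre-combining is only valid because the same function is applied in both phases, which requires it to be associative, as summation is.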

2, GroupReduce on a grouped DataSet

A GroupReduce transformation applied to a grouped DataSet calls a user-defined group-reduce function once for each group.
The difference from Reduce is that the user-defined function receives the whole group at once. The function is invoked with an Iterable over all elements of the group and can return an arbitrary number of result elements.

2.1 GroupReduce uses the same grouping keys as Reduce

The following code shows how to remove duplicate strings from a DataSet grouped by Integer.


public class DistinctReduce
         implements GroupReduceFunction<Tuple2<Integer, String>, Tuple2<Integer, String>> {

  @Override
  public void reduce(Iterable<Tuple2<Integer, String>> in, Collector<Tuple2<Integer, String>> out) {

    Set<String> uniqStrings = new HashSet<String>();
    Integer key = null;

    // add all strings of the group to the set
    for (Tuple2<Integer, String> t : in) {
      key = t.f0;
      uniqStrings.add(t.f1);
    }

    // emit all unique strings.
    for (String s : uniqStrings) {
      out.collect(new Tuple2<Integer, String>(key, s));
    }
  }
}

// [...]
DataSet<Tuple2<Integer, String>> input = // [...]
DataSet<Tuple2<Integer, String>> output = input
                           .groupBy(0)            // group DataSet by the first tuple field
                           .reduceGroup(new DistinctReduce());  // apply GroupReduceFunction

2.2 GroupReduce on sorted groups

A group-reduce function accesses the elements of a group through an Iterable. Optionally, the Iterable can hand out the elements of a group in a specified order. In many cases this helps to reduce the complexity of a user-defined group-reduce function and to improve its efficiency.

The following code shows another way to remove duplicate strings, this time from a DataSet that is grouped by Integer and sorted by String.


// GroupReduceFunction that removes consecutive identical elements
public class DistinctReduce
         implements GroupReduceFunction<Tuple2<Integer, String>, Tuple2<Integer, String>> {

  @Override
  public void reduce(Iterable<Tuple2<Integer, String>> in, Collector<Tuple2<Integer, String>> out) {
    Integer key = null;
    String comp = null;

    for (Tuple2<Integer, String> t : in) {
      key = t.f0;
      String next = t.f1;

      // emit the string only if it differs from the previous one
      if (comp == null || !next.equals(comp)) {
        out.collect(new Tuple2<Integer, String>(key, next));
        comp = next;
      }
    }
  }
}

// [...]
DataSet<Tuple2<Integer, String>> input = // [...]
DataSet<Tuple2<Integer, String>> output = input
                         .groupBy(0)                         // group DataSet by first field
                         .sortGroup(1, Order.ASCENDING)      // sort groups on second tuple field
                         .reduceGroup(new DistinctReduce());

3, Combinable GroupReduce functions

In contrast to a reduce function, a group-reduce function is not implicitly combinable. To make a group-reduce function combinable, it must also implement the GroupCombineFunction interface.

Important: the generic input and output types of the GroupCombineFunction interface must be equal to the generic input type of the GroupReduceFunction, as shown in the following example:


// Combinable GroupReduceFunction that computes a sum.
public class MyCombinableGroupReducer implements
  GroupReduceFunction<Tuple2<String, Integer>, String>,
  GroupCombineFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>
{
  @Override
  public void reduce(Iterable<Tuple2<String, Integer>> in,
                     Collector<String> out) {

    String key = null;
    int sum = 0;

    for (Tuple2<String, Integer> curr : in) {
      key = curr.f0;
      sum += curr.f1;
    }
    // concat key and sum and emit
    out.collect(key + "-" + sum);
  }

  @Override
  public void combine(Iterable<Tuple2<String, Integer>> in,
                      Collector<Tuple2<String, Integer>> out) {
    String key = null;
    int sum = 0;

    for (Tuple2<String, Integer> curr : in) {
      key = curr.f0;
      sum += curr.f1;
    }
    // emit tuple with key and sum
    out.collect(new Tuple2<>(key, sum));
  }
}
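The type constraint stated above follows from the data flow: the combine output is fed into reduce, so it has to look like ordinary reduce input. The plain-Java sketch below models that flow without Flink. `CombinableGroupReduceSketch`, `pair`, `combine`, and `reduce` are hypothetical names, `Map.Entry<String, Integer>` stands in for `Tuple2<String, Integer>`, and partitions are plain lists.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Flink-free model of a combinable group reduce.
final class CombinableGroupReduceSketch {

    static Map.Entry<String, Integer> pair(String key, int value) {
        return new AbstractMap.SimpleImmutableEntry<>(key, value);
    }

    // Combine step on one partition: same element type in and out.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> partition) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> e : partition) {
            sums.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : sums.entrySet()) {
            out.add(pair(e.getKey(), e.getValue()));
        }
        return out;
    }

    // Reduce step: consumes combined pairs from all partitions, emits "key-sum" strings.
    static List<String> reduce(List<List<Map.Entry<String, Integer>>> combinedPartitions) {
        Map<String, Integer> sums = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> partition : combinedPartitions) {
            for (Map.Entry<String, Integer> e : partition) {
                sums.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : sums.entrySet()) {
            out.add(e.getKey() + "-" + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> p1 = Arrays.asList(pair("a", 1), pair("b", 2), pair("a", 3));
        List<Map.Entry<String, Integer>> p2 = Arrays.asList(pair("a", 4), pair("b", 1));
        System.out.println(reduce(Arrays.asList(combine(p1), combine(p2))));
    }
}
```

Because `reduce` cannot tell whether a pair came straight from the input or from `combine`, skipping the combine step must give the same final result.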

4, GroupCombine on a grouped DataSet

The GroupCombine transformation is the generalized form of the combine step in a combinable GroupReduceFunction. It is generalized in the sense that it allows combining an input type I into an arbitrary output type O.
In contrast, the combine step in GroupReduce only allows combining from input type I to output type I, because the reduce step of the GroupReduceFunction expects input type I.
In some applications it is desirable to combine a DataSet into an intermediate format before performing additional transformations (e.g., to reduce the data size). This can be achieved at very low cost with a CombineGroup transformation.
Note: GroupCombine on a grouped DataSet is performed in memory with a greedy strategy, which may not process all data at once but in multiple steps. It is also performed on the individual partitions, without a data exchange like the one in a GroupReduce transformation. This can lead to partial results,
so GroupCombine is not a substitute for GroupReduce, even though the two operations may look the same. The following example demonstrates the use of a CombineGroup transformation for an alternative WordCount implementation.

DataSet<String> input = [..] // The words received as input

DataSet<Tuple2<String, Integer>> combinedWords = input
  .groupBy("*") // group identical words; a String DataSet has no tuple positions, so "*" selects the whole value
  .combineGroup(new GroupCombineFunction<String, Tuple2<String, Integer>>() {

    @Override
    public void combine(Iterable<String> words, Collector<Tuple2<String, Integer>> out) { // combine
        String key = null;
        int count = 0;

        for (String word : words) {
            key = word;
            count++;
        }
        // emit tuple with word and count
        out.collect(new Tuple2<>(key, count));
    }
});

DataSet<Tuple2<String, Integer>> output = combinedWords
  .groupBy(0)                              // group by words again
  .reduceGroup(new GroupReduceFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() { // group reduce with full data exchange

    @Override
    public void reduce(Iterable<Tuple2<String, Integer>> words, Collector<Tuple2<String, Integer>> out) {
        String key = null;
        int count = 0;

        for (Tuple2<String, Integer> word : words) {
            key = word.f0;
            count += word.f1; // add up the partial counts from the combine step
        }
        // emit tuple with word and total count
        out.collect(new Tuple2<>(key, count));
    }
});
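Why the final GroupReduce is still needed can be seen in a small plain-Java model of partial results. The sketch assumes, purely for illustration, that the greedy in-memory combine handles at most `bufferSize` words at a time; Flink's real strategy is more involved. `GroupCombineSketch`, `combineGreedy`, and `finalReduce` are hypothetical names, and `Map.Entry<String, Integer>` stands in for `Tuple2<String, Integer>`.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Flink-free model of partial results from a greedy combine.
final class GroupCombineSketch {

    // Combines at most bufferSize words at a time, so one word may be emitted several times.
    static List<Map.Entry<String, Integer>> combineGreedy(List<String> words, int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (int start = 0; start < words.size(); start += bufferSize) {
            int end = Math.min(start + bufferSize, words.size());
            Map<String, Integer> counts = new TreeMap<>();
            for (String word : words.subList(start, end)) {
                counts.merge(word, 1, Integer::sum);
            }
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                out.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
            }
        }
        return out;
    }

    // Final reduce: sums the partial counts per word (counting records would be wrong).
    static Map<String, Integer> finalReduce(List<Map.Entry<String, Integer>> partials) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> e : partials) {
            totals.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "a", "b");
        List<Map.Entry<String, Integer>> partials = combineGreedy(words, 2);
        System.out.println("partial: " + partials);
        System.out.println("final:   " + finalReduce(partials));
    }
}
```

The partial output can contain the same word more than once, so the second phase has to add up the partial counts instead of counting how many records it receives.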

 

 


Origin www.cnblogs.com/asker009/p/11111546.html