1, The reduce operation: it can be used on a grouped DataSet and also on a DataSet that is not grouped
A Reduce transformation applied to a grouped DataSet reduces each group to a single element using a user-defined reduce function. For each group of input elements, the reduce function repeatedly combines two elements into one until only a single element remains for the group. Note that for a ReduceFunction, the key fields of the returned object should match those of the input values. This is because reduce is implicitly combinable: objects emitted by the combine operator are grouped by key again when they are passed on to the reduce operator.
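The pairwise combining described above can be sketched in plain Java. This is only a model of the semantics, not Flink code: the class GroupedReduceSketch, the method reduceByKey and the nested WC class are hypothetical names made up for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Plain-Java model of a grouped reduce (an assumption-level sketch, not Flink code).
final class GroupedReduceSketch {

    // Hypothetical POJO with a key field (word) and a value field (count).
    static final class WC {
        final String word;
        final int count;

        WC(String word, int count) {
            this.word = word;
            this.count = count;
        }
    }

    // Combines the elements of each group pairwise until one element per key remains.
    static <T, K> Map<K, T> reduceByKey(List<T> input,
                                        Function<T, K> keyOf,
                                        BinaryOperator<T> reduceFn) {
        Map<K, T> result = new LinkedHashMap<>();
        for (T element : input) {
            // The first element of a group is stored as is;
            // every later element is combined with the current result.
            result.merge(keyOf.apply(element), element, reduceFn);
        }
        return result;
    }

    public static void main(String[] args) {
        List<WC> words = List.of(new WC("a", 1), new WC("b", 2), new WC("a", 3));
        Map<String, WC> counts = reduceByKey(
                words,
                wc -> wc.word,
                // The result keeps in1.word, so its key matches the key of the inputs.
                (in1, in2) -> new WC(in1.word, in1.count + in2.count));
        counts.forEach((word, wc) -> System.out.println(word + " -> " + wc.count));
    }
}
```

If the reduce function returned an object with a different key, a pre-combined result would land in the wrong group in the second step, which is why the key fields have to be preserved.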
1.1 Reduce on a DataSet grouped by key expressions
Key expressions specify one or more fields of each element of a DataSet. Each key expression is either the name of a public field or the name of a getter method. A dot can be used to drill down into nested objects. The key expression "*" selects all fields. The following code shows how to group a POJO DataSet using a key expression and how to reduce it with a reduce function.
// some ordinary POJO
public class WC {
  public String word;
  public int count;
  // [...]
}

// ReduceFunction that sums the Integer attribute of a POJO
public class WordCounter implements ReduceFunction<WC> {
  @Override
  public WC reduce(WC in1, WC in2) {
    // keep the key (word) and add up the counts
    return new WC(in1.word, in1.count + in2.count);
  }
}

// [...]
DataSet<WC> words = // [...]
DataSet<WC> wordCounts = words
                         // DataSet grouping on field "word"
                         .groupBy("word")
                         // apply ReduceFunction on grouped DataSet
                         .reduce(new WordCounter());
1.2 Reduce on a DataSet grouped by a KeySelector function
A key selector function extracts a key value from each element of a DataSet. The extracted key value is used to group the DataSet. The following code shows how to group a POJO DataSet using a key selector function and how to reduce it with a reduce function.

// some ordinary POJO
public class WC {
  public String word;
  public int count;
  // [...]
}

// ReduceFunction that sums the Integer attribute of a POJO
public class WordCounter implements ReduceFunction<WC> {
  @Override
  public WC reduce(WC in1, WC in2) {
    return new WC(in1.word, in1.count + in2.count);
  }
}

// [...]
DataSet<WC> words = // [...]
DataSet<WC> wordCounts = words
                         // DataSet grouping on the key extracted by SelectWord
                         .groupBy(new SelectWord())
                         // apply ReduceFunction on grouped DataSet
                         .reduce(new WordCounter());

// KeySelector that returns the word of a WC as the grouping key
public class SelectWord implements KeySelector<WC, String> {
  @Override
  public String getKey(WC w) {
    return w.word;
  }
}
1.3 Reduce on a Tuple DataSet: tuple fields can be addressed by their position number, similar to an index
Field position keys specify one or more fields of a Tuple DataSet that are used as grouping keys.

DataSet<Tuple3<String, Integer, Double>> tuples = // [...]
DataSet<Tuple3<String, Integer, Double>> reducedTuples = tuples
                                         // group DataSet on first and second field of Tuple
                                         .groupBy(0, 1)
                                         // apply ReduceFunction on grouped DataSet
                                         .reduce(new MyTupleReducer());
1.4 Reduce applied to a full DataSet
The Reduce transformation can also apply a user-defined reduce function to all elements of a DataSet. The reduce function then combines pairs of elements into one element until only a single element remains. Reducing a full DataSet with the Reduce transformation means that the final Reduce operation cannot be done in parallel. However, a reduce function is automatically combinable, so a Reduce transformation does not limit scalability for most use cases. The following code shows how to sum all elements of an Integer DataSet:

// ReduceFunction that sums Integers
public class IntSummer implements ReduceFunction<Integer> {
  @Override
  public Integer reduce(Integer num1, Integer num2) {
    return num1 + num2;
  }
}

// [...]
DataSet<Integer> intNumbers = // [...]
DataSet<Integer> sum = intNumbers.reduce(new IntSummer());
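Why a combinable reduce function keeps a full reduce scalable can be sketched in plain Java. This is a model under assumptions, not Flink code: the partitions are plain lists, and FullReduceSketch, reduceAll and reduceWithCombine are hypothetical names. The sketch also assumes that at least one partition is non-empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

// Plain-Java model of a full reduce with per-partition pre-aggregation (not Flink code).
final class FullReduceSketch {

    // Reduces one non-empty list pairwise to a single element.
    static <T> T reduceAll(List<T> elements, BinaryOperator<T> reduceFn) {
        T acc = elements.get(0);
        for (int i = 1; i < elements.size(); i++) {
            acc = reduceFn.apply(acc, elements.get(i));
        }
        return acc;
    }

    // Step 1: reduce every partition on its own (this part could run in parallel).
    // Step 2: reduce the few partial results in one final, non-parallel step.
    static <T> T reduceWithCombine(List<List<T>> partitions, BinaryOperator<T> reduceFn) {
        List<T> partials = new ArrayList<>();
        for (List<T> partition : partitions) {
            if (!partition.isEmpty()) {
                partials.add(reduceAll(partition, reduceFn));
            }
        }
        return reduceAll(partials, reduceFn);
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions =
                List.of(List.of(1, 2, 3), List.of(4, 5), List.of(6));
        System.out.println(reduceWithCombine(partitions, Integer::sum));
    }
}
```

The final step only sees one value per partition, so the non-parallel part stays small as long as the reduce function is associative, as integer addition is here.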
2, Reduce on whole groups, i.e. GroupReduce
A GroupReduce transformation applied to a grouped DataSet calls a user-defined group-reduce function once for each group.
The difference from Reduce is that the user-defined function gets the whole group at once. The function is called with an Iterable over all elements of the group and can return any number of result elements.
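The contrast with a pairwise reduce can be sketched in plain Java. This is a model of the calling convention only, not Flink code: GroupReduceSketch, reduceGroups and minAndMax are hypothetical names, and a java.util.function.Consumer stands in for the collector.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Consumer;
import java.util.function.Function;

// Plain-Java model of a group-reduce call (an assumption-level sketch, not Flink code).
final class GroupReduceSketch {

    // Builds the groups first, then calls groupFn once per group with the whole group.
    static <T, K, R> List<R> reduceGroups(List<T> input,
                                          Function<T, K> keyOf,
                                          BiConsumer<Iterable<T>, Consumer<R>> groupFn) {
        Map<K, List<T>> groups = new LinkedHashMap<>();
        for (T element : input) {
            groups.computeIfAbsent(keyOf.apply(element), k -> new ArrayList<>()).add(element);
        }
        List<R> out = new ArrayList<>();
        for (List<T> group : groups.values()) {
            // The function may emit zero, one, or many results through the consumer.
            groupFn.accept(group, out::add);
        }
        return out;
    }

    // Example group function: emits two results per group, the minimum and the maximum.
    static void minAndMax(Iterable<Integer> group, Consumer<Integer> out) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int n : group) {
            min = Math.min(min, n);
            max = Math.max(max, n);
        }
        out.accept(min);
        out.accept(max);
    }

    public static void main(String[] args) {
        List<Integer> numbers = List.of(1, 2, 3, 4, 5, 6);
        // Group by parity and emit min and max of the odd and of the even numbers.
        List<Integer> result = GroupReduceSketch.<Integer, Integer, Integer>reduceGroups(
                numbers, n -> n % 2, GroupReduceSketch::minAndMax);
        System.out.println(result);
    }
}
```

A pairwise reduce function could not express this directly, because it must return exactly one element of the input type for every two inputs.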
2.1 GroupReduce on a grouped DataSet: grouping keys work the same way as for reduce
The following code shows how to remove duplicate strings from a DataSet grouped by Integer.

public class DistinctReduce
         implements GroupReduceFunction<Tuple2<Integer, String>, Tuple2<Integer, String>> {

  @Override
  public void reduce(Iterable<Tuple2<Integer, String>> in, Collector<Tuple2<Integer, String>> out) {

    Set<String> uniqStrings = new HashSet<String>();
    Integer key = null;

    // add all strings of the group to the set
    for (Tuple2<Integer, String> t : in) {
      key = t.f0;
      uniqStrings.add(t.f1);
    }

    // emit all unique strings.
    for (String s : uniqStrings) {
      out.collect(new Tuple2<Integer, String>(key, s));
    }
  }
}

// [...]
DataSet<Tuple2<Integer, String>> input = // [...]
DataSet<Tuple2<Integer, String>> output = input
                           .groupBy(0)                        // group DataSet by the first tuple field
                           .reduceGroup(new DistinctReduce()); // apply GroupReduceFunction
2.2 GroupReduce on sorted groups
A group-reduce function accesses the elements of a group through an Iterable. Optionally, the Iterable can hand out the elements of a group in a specified order. In many cases this helps to reduce the complexity of a user-defined group-reduce function and to improve its efficiency. The following code shows another way to remove duplicate strings, this time from a DataSet that is grouped by Integer and sorted by String.

// GroupReduceFunction that removes consecutive identical elements
public class DistinctReduce
         implements GroupReduceFunction<Tuple2<Integer, String>, Tuple2<Integer, String>> {

  @Override
  public void reduce(Iterable<Tuple2<Integer, String>> in, Collector<Tuple2<Integer, String>> out) {
    Integer key = null;
    String comp = null;

    for (Tuple2<Integer, String> t : in) {
      key = t.f0;
      String next = t.f1;

      // check if strings are different
      if (comp == null || !next.equals(comp)) {
        out.collect(new Tuple2<Integer, String>(key, next));
        comp = next;
      }
    }
  }
}

// [...]
DataSet<Tuple2<Integer, String>> input = // [...]
DataSet<Tuple2<Integer, String>> output = input
                           .groupBy(0)                         // group DataSet by first field
                           .sortGroup(1, Order.ASCENDING)      // sort groups on second tuple field
                           .reduceGroup(new DistinctReduce());
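The reason sorting helps here can be sketched in plain Java: once equal strings are adjacent, a single remembered value replaces the HashSet of the unsorted variant. The class SortedDistinctSketch and its method distinctConsecutive are hypothetical names, and the sketch is not Flink code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java model of de-duplicating a sorted group (not Flink code).
final class SortedDistinctSketch {

    // Assumes the group is already sorted, so that equal strings are adjacent.
    // Only the previously emitted string is kept in memory.
    static List<String> distinctConsecutive(Iterable<String> sortedGroup) {
        List<String> out = new ArrayList<>();
        String prev = null;
        for (String next : sortedGroup) {
            if (prev == null || !next.equals(prev)) {
                out.add(next);
                prev = next;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> group = new ArrayList<>(List.of("b", "a", "b", "c", "a"));
        Collections.sort(group); // plays the role of sorting the group before the reduce
        System.out.println(distinctConsecutive(group));
    }
}
```

Without the sort step the function only removes duplicates that happen to be neighbours, so the sorted-group guarantee is part of its correctness and not just an optimization.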
3, Combinable GroupReduce functions
In contrast to a reduce function, a group-reduce function is not implicitly combinable. To make a group-reduce function combinable, it must also implement the GroupCombineFunction interface. Important: the generic input and output types of the GroupCombineFunction interface must both be equal to the generic input type of the GroupReduceFunction, as shown in the following example:

// Combinable GroupReduceFunction that computes a sum.
public class MyCombinableGroupReducer implements
  GroupReduceFunction<Tuple2<String, Integer>, String>,
  GroupCombineFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>
{
  @Override
  public void reduce(Iterable<Tuple2<String, Integer>> in,
                     Collector<String> out) {

    String key = null;
    int sum = 0;

    for (Tuple2<String, Integer> curr : in) {
      key = curr.f0;
      sum += curr.f1;
    }
    // concat key and sum and emit
    out.collect(key + "-" + sum);
  }

  @Override
  public void combine(Iterable<Tuple2<String, Integer>> in,
                      Collector<Tuple2<String, Integer>> out) {
    String key = null;
    int sum = 0;

    for (Tuple2<String, Integer> curr : in) {
      key = curr.f0;
      sum += curr.f1;
    }
    // emit tuple with key and sum
    out.collect(new Tuple2<>(key, sum));
  }
}
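The type rule stated above follows from the data flow: the output of combine is fed into reduce, so combine has to emit the type that reduce consumes. A plain-Java sketch of that flow is given here. It is not Flink code: CombinableSketch and its methods are hypothetical names, Map.Entry stands in for Tuple2, and plain lists stand in for partitions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of combine-then-reduce (an assumption-level sketch, not Flink code).
final class CombinableSketch {

    // Sums the values per key, keeping first-seen key order.
    private static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : pairs) {
            sums.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return sums;
    }

    // combine: (key, value) pairs in, (key, partialSum) pairs out. Same type in and out.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> partition) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : sumByKey(partition).entrySet()) {
            out.add(Map.entry(e.getKey(), e.getValue()));
        }
        return out;
    }

    // reduce: accepts raw or pre-combined pairs and emits "key-sum" strings.
    static List<String> reduce(List<Map.Entry<String, Integer>> pairs) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : sumByKey(pairs).entrySet()) {
            out.add(e.getKey() + "-" + e.getValue());
        }
        return out;
    }

    // Runs combine on every partition, then reduce on the collected partial results.
    static List<String> run(List<List<Map.Entry<String, Integer>>> partitions) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> partition : partitions) {
            shuffled.addAll(combine(partition));
        }
        return reduce(shuffled);
    }

    public static void main(String[] args) {
        List<List<Map.Entry<String, Integer>>> partitions = List.of(
                List.of(Map.entry("a", 1), Map.entry("b", 2), Map.entry("a", 3)),
                List.of(Map.entry("a", 4), Map.entry("b", 5)));
        System.out.println(run(partitions));
    }
}
```

Because reduce cannot tell whether a pair is raw input or a partial sum, the result is the same with or without the combine step; only the amount of data moved between the steps changes.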
4, GroupCombine on a grouped DataSet
The GroupCombine transformation is the generalized form of the combine step of a combinable GroupReduceFunction. It is generalized in the sense that it allows combining an input type I into an arbitrary output type O.
In contrast, the combine step of GroupReduce only allows combining from input type I to output type I. This is because the reduce step of the GroupReduceFunction expects input type I.
In some applications it is desirable to combine a DataSet into an intermediate format (for example, to reduce the data size) before performing additional transformations. This can be achieved at very low cost with a CombineGroup transformation.
Note: GroupCombine on a grouped DataSet is performed in memory with a greedy strategy, which may not process all data at once but in multiple steps.
It is also performed on the individual partitions, without a data exchange like the one in a GroupReduce transformation. This may lead to partial results in the output,
so GroupCombine is not a replacement for GroupReduce, even though the two operations may look alike.
The following example demonstrates the use of a CombineGroup transformation for an alternative WordCount implementation.
DataSet<String> input = [..] // The words received as input

DataSet<Tuple2<String, Integer>> combinedWords = input
  .groupBy("*") // group identical words ("*" selects the whole element; field positions only work on tuples)
  .combineGroup(new GroupCombineFunction<String, Tuple2<String, Integer>>() {

    @Override
    public void combine(Iterable<String> words, Collector<Tuple2<String, Integer>> out) { // combine
      String key = null;
      int count = 0;

      for (String word : words) {
        key = word;
        count++;
      }
      // emit tuple with word and (possibly partial) count
      out.collect(new Tuple2<>(key, count));
    }
  });

DataSet<Tuple2<String, Integer>> output = combinedWords
  .groupBy(0) // group by words again
  .reduceGroup(new GroupReduceFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() { // group reduce with full data exchange

    @Override
    public void reduce(Iterable<Tuple2<String, Integer>> words, Collector<Tuple2<String, Integer>> out) {
      String key = null;
      int count = 0;

      for (Tuple2<String, Integer> word : words) {
        key = word.f0;
        count += word.f1; // add up the partial counts instead of counting tuples
      }
      // emit tuple with word and total count
      out.collect(new Tuple2<>(key, count));
    }
  });
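Why the second groupBy and reduceGroup step is still needed can be illustrated with a small plain-Java model. It is not Flink code: GroupCombineSketch, combinePartition and reducePartials are hypothetical names, and plain lists stand in for the partitions on which the combine step runs locally.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of partial results from a per-partition combine (not Flink code).
final class GroupCombineSketch {

    // Local combine: String in, (word, count) out. Input and output types differ.
    static List<Map.Entry<String, Integer>> combinePartition(List<String> words) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.add(Map.entry(e.getKey(), e.getValue()));
        }
        return out;
    }

    // Final step after regrouping by word: adds up the partial counts of each word.
    static Map<String, Integer> reducePartials(List<Map.Entry<String, Integer>> partials) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : partials) {
            totals.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = List.of(
                List.of("to", "be", "to"),
                List.of("be", "to"));

        List<Map.Entry<String, Integer>> partials = new ArrayList<>();
        for (List<String> partition : partitions) {
            partials.addAll(combinePartition(partition));
        }
        // The same word appears once per partition here, each time with a partial count.
        System.out.println("partial: " + partials);
        System.out.println("total:   " + reducePartials(partials));
    }
}
```

In this model each word shows up once per partition after the combine step, so only the final step that sums the partial counts yields one correct total per word.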