An In-Depth Look at How Java 8 Streams Work

Stream Pipelines

We have already learned how to use the Stream API. It is pleasant to use, but its simple methods seem to hide endless secrets: how is such a powerful API implemented? For example, how does the pipeline execute, and does each method call trigger its own iteration? How does automatic parallelization work, and how many threads does it use? In this section we study how Stream pipelines work, which is the key to the Stream implementation.

First, let's review how containers execute lambda expressions, taking the ArrayList.forEach() method as an example. The code is as follows:

// ArrayList.forEach()
public void forEach(Consumer<? super E> action) {
    ...
    for (int i = 0; modCount == expectedModCount && i < size; i++) {
        action.accept(elementData[i]); // callback
    }
    ...
}

The main logic of ArrayList.forEach() is a for loop that repeatedly calls the action.accept() callback to traverse the elements. There is nothing new here; callbacks are widely used in Java GUI listeners. A lambda expression acts as a callback, which is easy to understand.
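To make the callback point concrete, here is a minimal sketch that passes the same logic to forEach() once as a lambda and once as an anonymous Consumer. The class and method names (ForEachCallbackDemo, viaLambda, viaAnonymousClass) are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

class ForEachCallbackDemo {
    // The callback is written as a lambda expression.
    static List<String> viaLambda(List<String> source) {
        List<String> seen = new ArrayList<>();
        source.forEach(s -> seen.add(s));
        return seen;
    }

    // The same callback written as an anonymous class, the style used by GUI listeners.
    static List<String> viaAnonymousClass(List<String> source) {
        List<String> seen = new ArrayList<>();
        source.forEach(new Consumer<String>() {
            @Override
            public void accept(String s) {
                seen.add(s);
            }
        });
        return seen;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(Arrays.asList("a", "b", "c"));
        System.out.println(viaLambda(names));
        System.out.println(viaAnonymousClass(names));
    }
}
```

Both methods visit the elements in the same order; only the syntax of the callback differs.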

The Stream API makes heavy use of lambda expressions as callbacks, but that is not the key point. To understand Stream, we care more about two other questions: pipelining and automatic parallelization. With Stream it is easy to write code like this:

int longestStringLengthStartingWithA
        = strings.stream()
              .filter(s -> s.startsWith("A"))
              .mapToInt(String::length)
              .max()
              .orElse(0); // max() returns an OptionalInt; use 0 when no string matches

This code finds the length of the longest string that starts with the letter A. A straightforward approach would perform one iteration per method call. That would work, but the efficiency would be unacceptable. The library uses a pipeline to avoid multiple iterations; the basic idea is to perform as many user-specified operations as possible in a single iteration. To help with the explanation, the table below summarizes all Stream operations.

Stream operation classification

Intermediate operations
    Stateless: unordered() filter() map() mapToInt() mapToLong() mapToDouble() flatMap() flatMapToInt() flatMapToLong() flatMapToDouble() peek()
    Stateful: distinct() sorted() limit() skip()
Terminal operations
    Non-short-circuiting: forEach() forEachOrdered() toArray() reduce() collect() max() min() count()
    Short-circuiting: anyMatch() allMatch() noneMatch() findFirst() findAny()

All Stream operations fall into two categories: intermediate operations and terminal operations. An intermediate operation is only a marker; only a terminal operation triggers the actual computation. Intermediate operations are further divided into stateless and stateful ones. For a stateless operation, processing an element is not affected by the elements before it. A stateful operation depends on the elements it has already seen, and some cannot produce a result until all elements have been processed; sorting is an example, because the sorted order cannot be determined before every element has been read. Terminal operations are divided into short-circuiting and non-short-circuiting ones. A short-circuiting operation can return a result without processing all elements, for example finding the first element that satisfies a condition. The classification is this fine-grained because the underlying implementation treats each case differently. To better understand intermediate and terminal operations, look at how the following two snippets execute.
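The claim that an intermediate operation is only a marker can be checked with a small experiment. The sketch below (the class name LazyEvaluationDemo and the log format are made up for this illustration) records when the filter() callback actually runs relative to the moment the pipeline is built.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazyEvaluationDemo {
    // Returns a log of events in the order they happened.
    static List<String> run() {
        List<String> log = new ArrayList<>();
        Stream<String> pipeline = Arrays.asList("Apple", "Banana", "Avocado").stream()
                .filter(s -> {
                    log.add("filter " + s);   // runs only once a terminal operation pulls elements
                    return s.startsWith("A");
                });
        log.add("pipeline built");            // nothing has been filtered at this point
        List<String> result = pipeline.collect(Collectors.toList()); // terminal operation
        log.add("result " + result);
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

The "pipeline built" entry is logged first; the three "filter ..." entries appear only after collect() is called.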

IntStream.range(1, 10)
   .peek(x -> System.out.print("\nA" + x))
   .limit(3)
   .peek(x -> System.out.print("B" + x))
   .forEach(x -> System.out.print("C" + x));

Output:

A1B1C1
A2B2C2
A3B3C3

Intermediate operations are lazy: they do nothing to the data until a terminal operation is reached. Terminal operations are eager: they pull elements through all the preceding intermediate operations. When the terminal forEach runs, it traces back to the operation before it, that operation traces back to the one before it, and so on until the first step is reached.

On the first pass, forEach traces back to the second peek, which traces back to limit, which traces back to the first peek. Nothing precedes that, so execution runs from the top down and prints A1B1C1.

On the second pass the same tracing happens, execution again runs from the top down, and A2B2C2 is printed.

...

On the fourth pass, forEach traces back to peek and peek traces back to limit, which finds that limit(3) has already done its job. This is equivalent to a break inside a loop: the traversal ends.
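The comparison with break can be written out directly. The following sketch (class and method names are made up for this illustration, and the hand-written loop is only a mental model, not the library's code) builds the same output once with the stream and once with a plain loop.

```java
import java.util.stream.IntStream;

class LimitAsLoopDemo {
    // The stream from the example, writing into a StringBuilder instead of System.out.
    static String viaStream() {
        StringBuilder out = new StringBuilder();
        IntStream.range(1, 10)
                .peek(x -> out.append("\nA").append(x))
                .limit(3)
                .peek(x -> out.append("B").append(x))
                .forEach(x -> out.append("C").append(x));
        return out.toString();
    }

    // A hand-written loop that models limit(3) as a counter plus break.
    static String viaLoop() {
        StringBuilder out = new StringBuilder();
        int remaining = 3;                    // state kept by limit(3)
        for (int x = 1; x < 10; x++) {
            if (remaining == 0) break;        // limit is satisfied: stop pulling elements
            out.append("\nA").append(x);      // first peek
            remaining--;                      // limit lets the element through
            out.append("B").append(x);        // second peek
            out.append("C").append(x);        // forEach
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(viaStream().equals(viaLoop()));
    }
}
```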

Now let's look at the second snippet:

IntStream.range(1, 10)
   .peek(x -> System.out.print("\nA" + x))
   .skip(6)
   .peek(x -> System.out.print("B" + x))
   .forEach(x -> System.out.print("C" + x));

Output:

A1
A2
A3
A4
A5
A6
A7B7C7
A8B8C8
A9B9C9

On the first pass, forEach traces back to the second peek, which traces back to skip, which traces back to the first peek. Nothing precedes that, so execution runs from the top down. When it reaches skip, the element is skipped and nothing after skip runs, which is equivalent to a continue inside a loop: this pass ends. Output: A1

On the second pass the same tracing happens. When execution reaches skip, it finds that this is the second element to skip, and the pass ends. Output: A2

...

On the seventh pass the same tracing happens. When execution reaches skip, it finds that this is the seventh element, which is more than 6, so skip(6) has finished its job. This time skip lets the element through, and the following operations run. Output: A7B7C7

... and so on until the traversal ends.
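The comparison with continue can be sketched the same way. Again the names (SkipAsLoopDemo, viaStream, viaLoop) are made up for this illustration, and the loop is only a mental model of what skip(6) does.

```java
import java.util.stream.IntStream;

class SkipAsLoopDemo {
    // The stream from the example, writing into a StringBuilder instead of System.out.
    static String viaStream() {
        StringBuilder out = new StringBuilder();
        IntStream.range(1, 10)
                .peek(x -> out.append("\nA").append(x))
                .skip(6)
                .peek(x -> out.append("B").append(x))
                .forEach(x -> out.append("C").append(x));
        return out.toString();
    }

    // A hand-written loop that models skip(6) as a counter plus continue.
    static String viaLoop() {
        StringBuilder out = new StringBuilder();
        int toSkip = 6;                       // state kept by skip(6)
        for (int x = 1; x < 10; x++) {
            out.append("\nA").append(x);      // the first peek sees every element
            if (toSkip > 0) {                 // skip swallows the first six elements
                toSkip--;
                continue;
            }
            out.append("B").append(x);        // second peek
            out.append("C").append(x);        // forEach
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(viaStream().equals(viaLoop()));
    }
}
```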

 

A straightforward implementation

[Figure: Stream_pipeline_naive]

Let's return to the longest-string program above. A straightforward implementation of the pipeline would perform one iteration per method call and store each result in an intermediate data structure (an array, a container, and so on). Specifically, filter() would execute immediately, select all strings starting with A, and put them in list1; list1 would then be passed to mapToInt(), which would also execute immediately and produce list2; finally, list2 would be traversed to find the maximum as the final result. The figure above shows this execution flow.
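Written out by hand, the naive approach might look like the sketch below. The variable names list1 and list2 follow the description above; the class and method names are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NaivePipelineDemo {
    static int longestStartingWithA(List<String> strings) {
        // Iteration 1: filter() runs eagerly and stores its result in list1.
        List<String> list1 = new ArrayList<>();
        for (String s : strings) {
            if (s.startsWith("A")) {
                list1.add(s);
            }
        }
        // Iteration 2: mapToInt() runs eagerly and stores its result in list2.
        List<Integer> list2 = new ArrayList<>();
        for (String s : list1) {
            list2.add(s.length());
        }
        // Iteration 3: max() traverses list2 to find the largest length.
        int longest = 0;
        for (int len : list2) {
            longest = Math.max(longest, len);
        }
        return longest;
    }

    public static void main(String[] args) {
        System.out.println(longestStartingWithA(Arrays.asList("Apple", "Banana", "Avocado")));
    }
}
```

Three method calls cost three traversals and two throwaway lists.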

This approach is simple and intuitive, but it has two obvious drawbacks:

  1. Many iterations. The number of iterations equals the number of method calls.
  2. Frequent intermediate results. Every method call produces an intermediate result, and the storage cost is unacceptable.

These drawbacks make the approach too inefficient to accept. Without the Stream API we all know how to do the above in a single iteration, roughly like this:

int longest = 0;
for (String str : strings) {
    if (str.startsWith("A")) {             // 1. filter(): keep strings that start with A
        int len = str.length();            // 2. mapToInt(): convert to a length
        longest = Math.max(len, longest);  // 3. max(): keep the greatest length
    }
}

This way we reduce the number of iterations and also avoid storing intermediate results. Clearly this is a pipeline, because all three operations run in one iteration. As long as we know the user's intent in advance, we can always implement the equivalent of the Stream API this way. The problem is that the Stream library designers do not know the user's intent. How to implement a pipeline without assuming anything about user behavior is the question they had to consider.

 

The Stream pipeline solution

Roughly speaking, the library should record every operation the user requests and, when the user calls a terminal operation, combine the recorded operations and perform them all in a single iteration. Along the way, several problems have to be solved:

  1. How are the user's operations recorded?
  2. How are the operations combined?
  3. How are the combined operations executed?
  4. Where does the result (if any) end up?

 

>> How operations are recorded

[Figure: Java_stream_pipeline_classes]

Note that the term "operation" here refers to a Stream intermediate operation. Many Stream operations need a callback (a lambda expression), so a complete operation is a triple of <data source, operation, callback>. Stream uses the concept of a Stage to describe a complete operation and represents a Stage with some instantiated PipelineHelper. Connecting the Stages in order forms the whole pipeline. The figure above shows the inheritance relationships among the Stream-related classes and interfaces.

IntPipeline, LongPipeline, and DoublePipeline are not drawn in the figure. These three classes are specialized for the three primitive types (not their wrapper types) and sit alongside ReferencePipeline. Head represents the first Stage, the one produced by calling a method such as Collection.stream(); this Stage obviously contains no operation. StatelessOp and StatefulOp represent stateless and stateful Stages, corresponding to stateless and stateful intermediate operations.

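The pipeline is organized as sketched in the figure below:

[Figure: Stream_pipeline_example]

In the figure, Collection.stream() yields the Head, that is, stage0. A series of intermediate operations follows, each producing a new Stream. These Stream objects are organized as a doubly linked list that forms the whole pipeline. Because each Stage records the previous Stage, its own operation, and its callback, this structure captures every operation on the data source. This is how Stream records operations.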
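The linked-list idea can be illustrated with a toy class. MiniStage below is a hypothetical, heavily simplified stand-in and not the JDK's AbstractPipeline; the opName field and the then() and describe() methods are inventions for this sketch, while previousStage, nextStage, and depth are named after the corresponding ideas in the text.

```java
import java.util.function.Function;

// Hypothetical, simplified stage: NOT the JDK's AbstractPipeline.
class MiniStage {
    final MiniStage previousStage;            // upstream link (null for the head)
    MiniStage nextStage;                      // downstream link, set when a stage is appended
    final String opName;                      // which operation this stage records
    final Function<Object, Object> callback;  // the user's callback (null for the head)
    final int depth;                          // 0 for the head, +1 for each appended stage

    private MiniStage(MiniStage previousStage, String opName, Function<Object, Object> callback) {
        this.previousStage = previousStage;
        this.opName = opName;
        this.callback = callback;
        this.depth = (previousStage == null) ? 0 : previousStage.depth + 1;
    }

    // stage0: represents the data source and contains no operation.
    static MiniStage head() {
        return new MiniStage(null, "source", null);
    }

    // Records an operation by appending a new stage; nothing is executed here.
    MiniStage then(String opName, Function<Object, Object> callback) {
        MiniStage next = new MiniStage(this, opName, callback);
        this.nextStage = next;
        return next;
    }

    // Walks the previousStage links back to the head to list the recorded operations.
    String describe() {
        StringBuilder sb = new StringBuilder(opName);
        for (MiniStage p = previousStage; p != null; p = p.previousStage) {
            sb.insert(0, p.opName + " -> ");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        MiniStage last = head().then("filter", x -> x).then("map", x -> x);
        System.out.println(last.describe());
    }
}
```

Each call to then() only adds a node to the list, which mirrors how calling an intermediate operation only records it.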

 

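>> How operations are combined

The above only solves the problem of recording operations. For the pipeline to do its job, we need a way to combine all the operations. You might think this is simple: just start from the head of the pipeline and perform each step's operation (including its callback) in turn. That sounds feasible, but it overlooks the fact that an earlier Stage does not know what operation a later Stage performs or what form its callback takes. In other words, only a Stage itself knows how to perform the action it contains. Some protocol is therefore needed to coordinate calls between adjacent Stages.

This protocol is provided by the Sink interface, whose methods are listed in the following table: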

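Method                              Purpose
void begin(long size)               Called before traversal starts, to tell the Sink to get ready.
void end()                          Called after all elements have been traversed, to tell the Sink there are no more elements.
boolean cancellationRequested()     Whether the operation can end now; lets short-circuiting operations finish early.
void accept(T t)                    Called for each element during traversal; accepts one element and processes it. A Stage wraps its operation and callback in this method, so the previous Stage only needs to call the current Stage's accept(T t).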

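With this protocol, calls between adjacent Stages are easy. Each Stage wraps its operation in a Sink; the previous Stage only needs to call the accept() method of the next Stage and does not need to know how it works internally. For stateful operations, the begin() and end() methods of Sink must also be implemented. For example, Stream.sorted() is a stateful intermediate operation: its Sink.begin() might create a container for the results, accept() adds elements to that container, and end() sorts the container. For short-circuiting operations, Sink.cancellationRequested() must also be implemented. For example, Stream.findFirst() is a short-circuiting operation: once an element has been found, cancellationRequested() should return true so that the caller can stop searching as soon as possible. The four Sink methods often cooperate to complete a computation. In fact, the essence of the Stream API's internal implementation is how these four Sink methods are overridden.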
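The four-method protocol can be tried out with a small stand-in. The JDK's own Sink interface is not public, so the sketch below defines a hypothetical MiniSink with the same four methods, a FirstSink written in the spirit of findFirst(), and a drive() helper that plays the role of the caller; all of these names are assumptions made for this illustration.

```java
import java.util.Arrays;
import java.util.List;

class SinkProtocolDemo {
    // Hypothetical simplified version of the protocol described in the table.
    interface MiniSink<T> {
        default void begin(long size) {}
        void accept(T t);
        default void end() {}
        default boolean cancellationRequested() { return false; }
    }

    // Terminal sink in the spirit of findFirst(): keeps one value, then asks to stop.
    static final class FirstSink<T> implements MiniSink<T> {
        T value;
        boolean hasValue;

        @Override
        public void accept(T t) {
            if (!hasValue) {
                hasValue = true;
                value = t;
            }
        }

        @Override
        public boolean cancellationRequested() {
            return hasValue;
        }
    }

    // Plays the caller: announces the traversal, pushes elements, and stops early if asked.
    // Returns how many elements were actually pushed.
    static <T> int drive(List<T> source, MiniSink<T> sink) {
        int pushed = 0;
        sink.begin(source.size());
        for (T t : source) {
            if (sink.cancellationRequested()) {
                break;
            }
            sink.accept(t);
            pushed++;
        }
        sink.end();
        return pushed;
    }

    public static void main(String[] args) {
        FirstSink<String> first = new FirstSink<>();
        int pushed = drive(Arrays.asList("x", "y", "z"), first);
        System.out.println(first.value + " after " + pushed + " element(s)");
    }
}
```

Because FirstSink reports cancellation as soon as it holds a value, the driver pushes only one of the three elements.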

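With operations wrapped in Sinks, the problem of calls between Stages is solved. Execution only needs to start from the head of the pipeline and call each Stage's Sink.{begin(), accept(), cancellationRequested(), end()} methods on the data source in turn. A possible flow for Sink.accept() looks like this: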

void accept(U u) {
    // 1. Process u with the callback wrapped by the current Sink
    // 2. Pass the result to the downstream Sink in the pipeline
}

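The other Sink methods follow the same [process -> forward] model. Let's use concrete examples to see how Stream intermediate operations wrap themselves in a Sink and how a Sink forwards its result to the next Sink. First, Stream.map():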

// Stream.map(): calling this method produces a new Stream
public final <R> Stream<R> map(Function<? super P_OUT, ? extends R> mapper) {
    ...
    return new StatelessOp<P_OUT, R>(this, StreamShape.REFERENCE,
                                 StreamOpFlag.NOT_SORTED | StreamOpFlag.NOT_DISTINCT) {
        @Override /* opWrapSink() returns a Sink that wraps the callback */
        Sink<P_OUT> opWrapSink(int flags, Sink<R> downstream) {
            return new Sink.ChainedReference<P_OUT, R>(downstream) {
                @Override
                public void accept(P_OUT u) {
                    R r = mapper.apply(u);  // 1. Process u with the callback mapper wrapped by this Sink
                    downstream.accept(r);   // 2. Pass the result to the downstream Sink
                }
            };
        }
    };
}

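This code looks complicated, but the logic is simple: it wraps the callback mapper in a Sink. Because Stream.map() is a stateless intermediate operation, map() returns a StatelessOp inner-class object (a new Stream); calling that Stream's opWrapSink() method yields a Sink that wraps the current callback.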
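The [process -> forward] pattern can be reproduced outside the JDK with plain java.util.function.Consumer objects standing in for Sinks. The helper names below (filterSink, mapSink, lengthsOfAStrings) are hypothetical and only illustrate the pattern for two stateless operations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

class ChainedSinkDemo {
    // filter as a sink: process (test the element), then forward only if it passes.
    static <T> Consumer<T> filterSink(Predicate<? super T> predicate, Consumer<? super T> downstream) {
        return t -> {
            if (predicate.test(t)) {
                downstream.accept(t);
            }
        };
    }

    // map as a sink: process (apply the mapper), then forward the mapped value.
    static <T, R> Consumer<T> mapSink(Function<? super T, ? extends R> mapper, Consumer<? super R> downstream) {
        return t -> downstream.accept(mapper.apply(t));
    }

    static List<Integer> lengthsOfAStrings(List<String> strings) {
        List<Integer> result = new ArrayList<>();
        Consumer<Integer> terminal = result::add;                       // end of the chain
        Consumer<String> mapStage = mapSink(String::length, terminal);  // wraps the terminal sink
        Consumer<String> chain = filterSink(s -> s.startsWith("A"), mapStage);
        strings.forEach(chain);                                         // one traversal drives both steps
        return result;
    }

    public static void main(String[] args) {
        System.out.println(lengthsOfAStrings(Arrays.asList("Apple", "Banana", "Avocado")));
    }
}
```

Neither sink knows what the next one does; each only calls accept() on its downstream.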

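Now a slightly more complex example. Stream.sorted() sorts the elements of a Stream. It is clearly a stateful intermediate operation, because the final order cannot be known before all elements have been read. Setting the boilerplate aside and going straight to the heart of the matter: how does sorted() wrap its operation in a Sink? One possible Sink for sorted() is the following: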

// Sink implementation used by Stream.sorted()
class RefSortingSink<T> extends AbstractRefSortingSink<T> {
    private ArrayList<T> list; // holds the elements to be sorted

    RefSortingSink(Sink<? super T> downstream, Comparator<? super T> comparator) {
        super(downstream, comparator);
    }

    @Override
    public void begin(long size) {
        ...
        // Create a list to hold the elements to be sorted
        list = (size >= 0) ? new ArrayList<T>((int) size) : new ArrayList<T>();
    }

    @Override
    public void end() {
        list.sort(comparator); // sorting can start only after all elements have been received
        downstream.begin(list.size());
        if (!cancellationWasRequested) { // downstream contains no short-circuiting operation
            list.forEach(downstream::accept); // 2. Pass the results to the downstream Sink
        }
        else { // downstream contains a short-circuiting operation
            for (T t : list) { // ask cancellationRequested() before each element whether processing can stop
                if (downstream.cancellationRequested()) break;
                downstream.accept(t); // 2. Pass the results to the downstream Sink
            }
        }
        downstream.end();
        list = null;
    }

    @Override
    public void accept(T t) {
        list.add(t); // 1. Process t with the action wrapped by this Sink: simply add it to the intermediate list
    }
}

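This code shows neatly how the four Sink methods work together:

  1. First, begin() tells the Sink how many elements will take part in the sort, which helps size the intermediate container;
  2. Then accept() adds elements to the intermediate result; during execution the caller keeps calling it until all elements have been traversed;
  3. Finally, end() tells the Sink that all elements have been traversed and starts the sort; once sorting is done, the results are passed to the downstream Sink;
  4. If the downstream Sink is short-circuiting, the Sink keeps asking downstream cancellationRequested() whether processing can stop while it passes the results along.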
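The four steps above can be exercised with a self-contained sketch. MiniSink, SortingSink, and TakeSink below are hypothetical simplifications written for this illustration (the real RefSortingSink relies on JDK-internal classes); TakeSink plays a short-circuiting downstream that wants only the first few elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SortingSinkDemo {
    // Hypothetical minimal sink protocol with the four methods from the text.
    interface MiniSink<T> {
        void begin(long size);
        void accept(T t);
        void end();
        boolean cancellationRequested();
    }

    // Short-circuiting terminal sink: keeps at most `max` elements, then asks to stop.
    static final class TakeSink<T> implements MiniSink<T> {
        final int max;
        final List<T> taken = new ArrayList<>();
        TakeSink(int max) { this.max = max; }
        @Override public void begin(long size) { taken.clear(); }
        @Override public void accept(T t) { if (taken.size() < max) taken.add(t); }
        @Override public void end() {}
        @Override public boolean cancellationRequested() { return taken.size() >= max; }
    }

    // Stateful sink: buffers in accept(), sorts and forwards in end().
    static final class SortingSink<T> implements MiniSink<T> {
        private final MiniSink<? super T> downstream;
        private final Comparator<? super T> comparator;
        private List<T> buffer;

        SortingSink(MiniSink<? super T> downstream, Comparator<? super T> comparator) {
            this.downstream = downstream;
            this.comparator = comparator;
        }

        @Override public void begin(long size) { buffer = new ArrayList<>(); }   // step 1
        @Override public void accept(T t) { buffer.add(t); }                      // step 2
        @Override public void end() {                                             // step 3
            buffer.sort(comparator);
            downstream.begin(buffer.size());
            for (T t : buffer) {
                if (downstream.cancellationRequested()) break;                    // step 4
                downstream.accept(t);
            }
            downstream.end();
            buffer = null;
        }
        @Override public boolean cancellationRequested() { return false; }        // sorting needs every element
    }

    static List<Integer> smallestTwo(List<Integer> input) {
        TakeSink<Integer> take = new TakeSink<>(2);
        SortingSink<Integer> sort = new SortingSink<>(take, Comparator.<Integer>naturalOrder());
        sort.begin(input.size());
        for (Integer i : input) sort.accept(i);
        sort.end();
        return take.taken;
    }

    public static void main(String[] args) {
        System.out.println(smallestTwo(Arrays.asList(5, 1, 4, 2)));
    }
}
```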

 

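>> How the combined operations are executed

[Figure: Stream_pipeline_Sink]

Sink neatly wraps each step of a Stream and provides the [process -> forward] model for combining operations. The gears are meshed; all that remains is the final push to set them turning. What starts this chain of operations? You may have guessed that the driving force is the terminal operation: as soon as a terminal operation is called, the whole pipeline executes.

No other operation can follow a terminal operation, so a terminal operation does not create a new pipeline Stage; put simply, the pipeline's linked list stops growing. A terminal operation creates a Sink that wraps its own operation. This is the last Sink in the pipeline; it only needs to process data and does not pass results to a downstream Sink (there is none). In the [process -> forward] model, the terminal operation's Sink is the exit of the call chain.

Now let's examine how an upstream Sink finds its downstream Sink. One option would be to put a Sink field in PipelineHelper, then locate the downstream Stage in the pipeline and read that field. The Stream library designers did not do this. Instead, they defined the method Sink AbstractPipeline.opWrapSink(int flags, Sink downstream) to obtain a Sink. It returns a new Sink object that contains the operation represented by the current Stage and can pass results to downstream. Why create a new object instead of returning a Sink field? Because opWrapSink() can combine the current operation with the downstream Sink (the downstream parameter above) into a new Sink. Starting from the last Stage of the pipeline and repeatedly calling opWrapSink() on the previous Stage until the beginning (excluding stage0, which represents the data source and contains no operation) yields a single Sink that represents every operation in the pipeline. In code: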

// AbstractPipeline.wrapSink()
// Wraps Sinks from downstream to upstream. If the sink passed in initially represents
// the terminal operation, the returned Sink represents every operation in the pipeline.
final <P_IN> Sink<P_IN> wrapSink(Sink<E_OUT> sink) {
    ...
    for (AbstractPipeline p = AbstractPipeline.this; p.depth > 0; p = p.previousStage) {
        sink = p.opWrapSink(p.previousStage.combinedFlags, sink);
    }
    return (Sink<P_IN>) sink;
}
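The backward walk in wrapSink() can be mimicked with a toy pipeline. In the sketch below, Stage is a hypothetical class (untyped, with Consumer<Object> standing in for Sink) whose wrapSink() follows the same loop shape: start at the last stage, wrap, step to previousStage, and stop before the head.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

class WrapSinkDemo {
    // Hypothetical stage: knows its upstream stage and how to wrap a downstream sink.
    static abstract class Stage {
        final Stage previousStage;
        final int depth;

        Stage(Stage previousStage) {
            this.previousStage = previousStage;
            this.depth = (previousStage == null) ? 0 : previousStage.depth + 1;
        }

        abstract Consumer<Object> opWrapSink(Consumer<Object> downstream);

        Stage filter(Predicate<Object> predicate) {
            return new Stage(this) {
                @Override
                Consumer<Object> opWrapSink(Consumer<Object> downstream) {
                    return t -> { if (predicate.test(t)) downstream.accept(t); };
                }
            };
        }

        Stage map(Function<Object, Object> mapper) {
            return new Stage(this) {
                @Override
                Consumer<Object> opWrapSink(Consumer<Object> downstream) {
                    return t -> downstream.accept(mapper.apply(t));
                }
            };
        }

        // Same loop shape as wrapSink(): walk upstream, excluding the head (depth 0).
        Consumer<Object> wrapSink(Consumer<Object> terminal) {
            Consumer<Object> sink = terminal;
            for (Stage p = this; p.depth > 0; p = p.previousStage) {
                sink = p.opWrapSink(sink);
            }
            return sink;
        }
    }

    // stage0: represents the data source, so it has no operation to wrap.
    static Stage head() {
        return new Stage(null) {
            @Override
            Consumer<Object> opWrapSink(Consumer<Object> downstream) {
                throw new IllegalStateException("the head stage has no operation");
            }
        };
    }

    static List<Object> run(List<?> source) {
        List<Object> out = new ArrayList<>();
        Stage last = head()
                .filter(o -> ((String) o).startsWith("A"))
                .map(o -> ((String) o).length());
        Consumer<Object> chain = last.wrapSink(out::add);  // out::add acts as the terminal sink
        source.forEach(chain);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("Apple", "Banana", "Avocado")));
    }
}
```

Wrapping proceeds from the last stage backwards, so the sink returned at the end is the one for the first operation, which is where the data enters.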

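Now every operation in the pipeline, from start to finish, is wrapped in one Sink. Executing this Sink is equivalent to executing the whole pipeline. The code that executes the Sink is as follows: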

// AbstractPipeline.copyInto(): applies the operations represented by wrappedSink to the data represented by spliterator.
final <P_IN> void copyInto(Sink<P_IN> wrappedSink, Spliterator<P_IN> spliterator) {
    ...
    if (!StreamOpFlag.SHORT_CIRCUIT.isKnown(getStreamAndOpFlags())) {
        wrappedSink.begin(spliterator.getExactSizeIfKnown()); // announce that traversal is starting
        spliterator.forEachRemaining(wrappedSink);            // iterate
        wrappedSink.end();                                    // announce that traversal has finished
    }
    ...
}

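The code first calls wrappedSink.begin() to tell the Sink that data is about to arrive, then calls spliterator.forEachRemaining() to iterate over the data (a Spliterator is a kind of iterator over a container), and finally calls wrappedSink.end() to tell the Sink that processing has finished. The logic is that clear.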
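The begin / iterate / end sequence can be reproduced with the public Spliterator API. In the sketch below, MiniSink and RecordingSink are hypothetical stand-ins for the JDK-internal Sink, and copyIntoWithCancel shows one way a short-circuiting traversal could be driven element by element; it is an illustration, not the library's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;
import java.util.function.Consumer;

class CopyIntoDemo {
    // Hypothetical minimal sink: a Consumer plus the begin/end/cancel hooks from the text.
    interface MiniSink<T> extends Consumer<T> {
        default void begin(long size) {}
        default void end() {}
        default boolean cancellationRequested() { return false; }
    }

    // Non-short-circuiting traversal: announce, iterate over everything, finish.
    static <T> void copyInto(MiniSink<T> sink, Spliterator<T> spliterator) {
        sink.begin(spliterator.getExactSizeIfKnown());
        spliterator.forEachRemaining(sink);
        sink.end();
    }

    // Short-circuiting traversal: advance one element at a time and stop when asked.
    static <T> void copyIntoWithCancel(MiniSink<T> sink, Spliterator<T> spliterator) {
        sink.begin(spliterator.getExactSizeIfKnown());
        while (!sink.cancellationRequested() && spliterator.tryAdvance(sink)) {
            // all the work happens in the loop condition
        }
        sink.end();
    }

    // Records every protocol call so the order can be inspected.
    static final class RecordingSink implements MiniSink<String> {
        final List<String> events = new ArrayList<>();
        @Override public void begin(long size) { events.add("begin(" + size + ")"); }
        @Override public void accept(String s) { events.add("accept(" + s + ")"); }
        @Override public void end() { events.add("end"); }
    }

    static List<String> trace(List<String> source) {
        RecordingSink sink = new RecordingSink();
        copyInto(sink, source.spliterator());
        return sink.events;
    }

    public static void main(String[] args) {
        System.out.println(trace(new ArrayList<>(Arrays.asList("a", "b"))));
    }
}
```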

 

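>> Where the result ends up

The last question: after all operations in the pipeline have run, where is the result the user wants (if any)? First, not every Stream terminal operation returns a result; some are used only for their side effects. Printing results with Stream.forEach() is a common use of side effects (in fact, side effects should be avoided in scenarios other than printing). For terminal operations that do return a result, where is it stored?

A special note: side effects should not be abused. You might think collecting elements inside Stream.forEach() is a good choice, as in the code below, but neither the correctness nor the efficiency of that usage can be guaranteed, because a Stream may execute in parallel. Most uses of side effects can be replaced by a reduction that is safer and more efficient.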

// Incorrect way to collect
ArrayList<String> results = new ArrayList<>();
stream.filter(s -> pattern.matcher(s).matches())
      .forEach(s -> results.add(s));  // Unnecessary use of side-effects!

// Correct way to collect
List<String> results =
    stream.filter(s -> pattern.matcher(s).matches())
          .collect(Collectors.toList());  // No side-effects!

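Back to the question of pipeline results: where is the result stored for pipelines that return one? That depends on the case. The following table lists the Stream terminal operations that return a result.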

Return type         Terminal operations
boolean             anyMatch() allMatch() noneMatch()
Optional            findFirst() findAny()
Reduction result    reduce() collect()
Array               toArray()
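  1. For the operations in the table that return a boolean or an Optional (an Optional is a container that holds a single value), only one value is returned, so it is enough to record that value in the corresponding Sink and return it when execution finishes.
  2. For reduction operations, the final result is placed in the container the user specifies in the call (the container type is specified through the collector). collect(), reduce(), max(), and min() are all reductions. Although max() and min() also return an Optional, they are in fact implemented by calling reduce().
  3. For operations that return an array, the result obviously ends up in an array. That is true, but before the array is finally returned, the result is actually stored in a data structure called Node. Node is a multi-way tree: elements are stored in the leaves, and one leaf can hold several elements. This makes parallel execution easier. The structure of Node is described in detail in the next section, which explores how Stream executes in parallel.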
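The relationship between max() and reduce() mentioned in item 2 can be seen from the public API alone. The sketch below (class and method names are made up for this illustration) computes the longest string once with max() and once with reduce() using BinaryOperator.maxBy().

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.function.BinaryOperator;

class MaxViaReduceDemo {
    static Optional<String> longestViaMax(List<String> strings) {
        Comparator<String> byLength = Comparator.comparingInt(String::length);
        return strings.stream().max(byLength);
    }

    // The same result expressed as a reduction that keeps the greater of two elements.
    static Optional<String> longestViaReduce(List<String> strings) {
        Comparator<String> byLength = Comparator.comparingInt(String::length);
        return strings.stream().reduce(BinaryOperator.maxBy(byLength));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Apple", "Avocado", "Fig");
        System.out.println(longestViaMax(words));
        System.out.println(longestViaReduce(words));
    }
}
```

Both calls return an Optional, which is empty when the stream has no elements.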

 

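Conclusion

This article has described in detail how a Stream pipeline is organized and executed. Studying it should help you understand the principles and write correct Stream code, and should dispel your concerns about the efficiency of the Stream API. As you can see, the Stream API is implemented so cleverly that equivalent code written by hand with external iteration is not necessarily more efficient.

Note: the JDK version used in this article is recorded here for readers who like to verify details: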

$ java -version
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) Server VM (build 25.101-b13, mixed mode)
Source: https://github.com/CarpenterLee/JavaLambdaInternals/blob/master/6-Stream%20Pipelines.md

Origin www.cnblogs.com/sunshinekevin/p/11576893.html