Java 8 new features: playing with Stream data operations (parallel)

The previous article gave a detailed introduction to serial stream programming: complex data-processing operations can be handled easily through a stream. When the amount of data is large, however, the intermediate operations can actually be executed in parallel by multiple threads, thereby improving efficiency.

Previous: Java 8 new features: playing with Stream data operations (serial)

0 Preface

  • Parallelism: emphasizes that multiple tasks (or the subtasks of one decomposed task) run literally at the same time on multiple CPU cores, producing multiple intermediate results that are then aggregated into a final result.
  • Concurrency: emphasizes interleaving the execution of multiple small tasks (e.g. by time-slicing) to reduce time wasted waiting on IO and the like; it usually involves context switching between multiple threads.

However, in the era of multi-core CPUs and distributed systems, the concepts of concurrency and parallelism have grown closer and closer. At least for Java's Stream, we can treat them as meaning the same thing: based on multithreading, a large task is split into multiple small tasks assigned to different threads for execution, and the multiple intermediate results are then aggregated into one final result, thereby improving efficiency.

The underlying implementation of Stream's parallel programming is based on ForkJoinPool, a task framework for parallel execution introduced in Java 7. Its core idea is to split a large task into multiple small tasks (fork) and then aggregate the results of those small tasks into one result (join). Programming directly against the Fork-Join framework in Java 7 is fairly involved; Java 8 applies this model to streams, which greatly simplifies the code.

1. Fork-Join mode

1.1 Fork/Join framework

Split (fork) a large task into several small tasks (until they can be split no further), then recursively join and aggregate the results of each small task.

The framework adopts a work-stealing strategy: when a thread executes a new task, it can split the task into smaller subtasks and push them onto its own queue; an idle thread then steals a task from another thread's queue and executes it. This reduces the time subtasks spend waiting and thus improves overall efficiency.
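As a rough sketch (the class name is ours), the snippet below shows that a parallel stream's subtasks are executed by the worker threads of the common ForkJoinPool, whose default parallelism is the number of CPU cores minus one:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

public class WorkStealingDemo {
    public static void main(String[] args) {
        // target parallelism of the shared pool (defaults to cores - 1, at least 1)
        System.out.println("parallelism = " + ForkJoinPool.commonPool().getParallelism());

        // the subtasks of a parallel stream are queued on the workers' deques;
        // idle workers steal queued subtasks from busy workers
        IntStream.range(0, 8)
                .parallel()
                .forEach(i -> System.out.println(i + " on " + Thread.currentThread().getName()));
    }
}
```

Running it typically prints a mix of ForkJoinPool.commonPool-worker thread names (plus the main thread, which also participates in the work).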

1.2 Differences between Fork-Join and Map-Reduce

Fork-Join is similar in spirit to Map-Reduce from the big-data world, but there are certain differences:

  • Different application scenarios:
    • FJ is designed to work within a single JVM;
    • MR is explicitly designed to work across large clusters of machines.
  • Different ways of dividing subtasks:
    • FJ splits a task into subtasks recursively, producing many layers, with the possibility of "cross-fork" communication between subtasks at each stage;
    • MR does just one big split: the mapped splits do not talk to each other until everything is reduced together. A single layer, no communication between splits before the reduce, and massively scalable.

1.3 Program Example

Here we test with an accumulation program, summing the integers from 1 to 30 billion.

  • RecursiveTask: the task has a return value
  • RecursiveAction: the task has no return value
public class ForkJoinCalculate extends RecursiveTask<Long> {

    private static final long serialVersionUID = 12345678925L;

    private long start;
    private long end;
    // split off a subtask for every 10,000 numbers to accumulate
    private static final long THRESHOLD = 10000L;

    public ForkJoinCalculate(long start, long end) {
        this.start = start;
        this.end = end;
    }

    @Override
    protected Long compute() {
        long length = end - start;
        if (length <= THRESHOLD) {
            long sum = 0L;
            for (long i = start; i <= end; i++) {
                sum += i;
            }
            return sum;
        } else {
            long mid = (start + end) / 2;
            ForkJoinCalculate left = new ForkJoinCalculate(start, mid);
            left.fork(); // fork splits off the subtask and pushes it onto the work queue

            ForkJoinCalculate right = new ForkJoinCalculate(mid + 1, end);
            right.fork();

            // join waits for each subtask recursively and merges the results
            return left.join() + right.join();
        }
    }
}

The three ways of submitting a task differ as follows:

  • execute(ForkJoinTask): executes the task asynchronously, with no return value
  • invoke(ForkJoinTask): executes synchronously; the calling thread blocks until the task finishes, and the result is returned to it
  • submit(ForkJoinTask): executes asynchronously and returns the ForkJoinTask, whose result can be retrieved (blocking) with task.get()
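A minimal sketch contrasting the three submission styles (the SubmitDemo class and its sumTask helper are illustrative, not part of the original example):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.RecursiveTask;

public class SubmitDemo {
    // a trivial summing task, just to compare the submission styles
    static RecursiveTask<Long> sumTask(long from, long to) {
        return new RecursiveTask<Long>() {
            @Override
            protected Long compute() {
                long s = 0;
                for (long i = from; i <= to; i++) s += i;
                return s;
            }
        };
    }

    public static void main(String[] args) throws ExecutionException, InterruptedException {
        ForkJoinPool pool = new ForkJoinPool();

        // invoke: synchronous -- blocks the caller and returns the result directly
        Long r1 = pool.invoke(sumTask(1, 100));

        // submit: asynchronous -- returns the task; get() (or join()) blocks for the result
        ForkJoinTask<Long> task = pool.submit(sumTask(1, 100));
        Long r2 = task.get();

        // execute: asynchronous, fire-and-forget -- nothing is returned,
        // so the caller must keep its own reference if it wants to join later
        ForkJoinTask<Long> fireAndForget = sumTask(1, 100);
        pool.execute(fireAndForget);
        Long r3 = fireAndForget.join();

        System.out.println(r1 + " " + r2 + " " + r3); // 5050 5050 5050
        pool.shutdown();
    }
}
```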
    @Test
    public void test15() {
        Instant start = Instant.now();

        ForkJoinPool pool = new ForkJoinPool();
        ForkJoinTask<Long> task = new ForkJoinCalculate(1, 30000000000L);
        Long sum = pool.invoke(task);
        System.out.println(sum);

        Instant end = Instant.now();
        System.out.println(Duration.between(start, end).toMillis());
    }

    @Test
    public void test16() {
        Instant start = Instant.now();

        long num = 30000000000L;
        long sum = 0L;
        for (long i = 1; i <= num; i++) {
            sum += i;
        }
        Instant end = Instant.now();
        System.out.println(Duration.between(start,end).toMillis());
    }

2. Creation of parallel streams

  1. A parallel stream can be obtained via Collection.parallelStream().
  2. Serial and parallel modes can be switched freely between intermediate operations:
    1. BaseStream.parallel(): sequential -> parallel
    2. BaseStream.sequential(): parallel -> sequential
  • BaseStream.isParallel() reports whether a stream is currently parallel.
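A small sketch (class name ours) of switching modes and checking them with isParallel():

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.LongStream;
import java.util.stream.Stream;

public class ParallelToggleDemo {
    public static void main(String[] args) {
        List<String> provinces = Arrays.asList("Guangdong", "Jiangsu", "Shandong");

        // Collection.parallelStream() yields a parallel stream directly
        System.out.println(provinces.parallelStream().isParallel()); // true

        // parallel()/sequential() flip the mode of the whole pipeline;
        // the last call before the terminal operation wins
        Stream<String> s = provinces.stream();  // sequential
        System.out.println(s.isParallel());     // false
        s = s.parallel();
        System.out.println(s.isParallel());     // true
        s = s.sequential();
        System.out.println(s.isParallel());     // false

        long sum = LongStream.rangeClosed(1, 1_000_000).parallel().sum();
        System.out.println(sum); // 500000500000
    }
}
```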
        Instant start = Instant.now();
        OptionalLong reduce = LongStream.rangeClosed(0, 30000000000L).parallel().reduce(Long::sum);
        System.out.println(reduce.getAsLong());

        Instant end = Instant.now();
        System.out.println(Duration.between(start,end).toMillis());

3. Ordering

  • Ordered: streams generated from a List or an array are ordered; calling BaseStream.unordered() removes the ordering constraint and makes the stream unordered.
  • Unordered: a stream generated from a HashSet is unordered; calling the sorting method sorted() forcibly imposes an encounter order on the stream and makes it ordered.

Note:

  • unordered() does not shuffle the elements; it merely lifts the ordering constraint so that the order is no longer guaranteed, which allows some later operations to apply special optimizations.
  • Take the very common Stream.forEach as an example: in parallel execution, even if the data source is a List, the order in which forEach processes elements is unspecified. To guarantee the processing order, use Stream.forEachOrdered instead.
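A short sketch (class name ours) contrasting forEach and forEachOrdered on a parallel stream:

```java
import java.util.stream.IntStream;

public class OrderDemo {
    public static void main(String[] args) {
        // forEach on a parallel stream: elements may be processed in any order
        IntStream.rangeClosed(1, 5).parallel().forEach(i -> System.out.print(i + " "));
        System.out.println();

        // forEachOrdered: respects the encounter order even in parallel,
        // at the cost of some of the parallel speedup
        IntStream.rangeClosed(1, 5).parallel().forEachOrdered(i -> System.out.print(i + " "));
        System.out.println(); // 1 2 3 4 5
    }
}
```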

4. Thread Safety

4.1 Pure functions

A pure function is one whose invocation does not change any state outside of itself; in other words, calling it does not modify variables defined outside the function.

To guarantee data safety, all stream operations should use pure functions.

In the following example, parallel execution cannot guarantee data safety:

    List<String> provinces = Arrays.asList("Guangdong", "Jiangsu", "Guangxi", "Jiangxi", "Shandong");

    ArrayList<String> results = new ArrayList<>();
    provinces.parallelStream()
            // filter out the provinces starting with "G"
            .filter(s -> !s.startsWith("G"))
            // the lambda modifies results, which is defined outside it,
            // so "s -> results.add(s)" is not a pure function;
            // this unnecessary "side effect" makes
            // parallel execution thread-unsafe
            .forEach(s -> results.add(s));

It should instead be written in a side-effect-free form:

    List<String> provinces = Arrays.asList("Guangdong", "Jiangsu", "Guangxi", "Jiangxi", "Shandong");

    List<String> results = provinces.parallelStream()
            // filter out the provinces starting with "G"
            .filter(s -> !s.startsWith("G"))
            // no "side effects"
            .collect(Collectors.toList());

4.2 Reduce operation (reduce)

T reduce(T identity, BinaryOperator<T> accumulator);
  • identity: the initial value of the reduction. For any value t, accumulator.apply(identity, t) == t must hold; otherwise the result will be wrong, because in parallel execution the identity may be combined with more than one element (once per subtask).
  • accumulator: a binary operator that must be associative; otherwise the result is incorrect whenever the evaluation order changes, as it does in out-of-order or parallel scenarios.
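To make the identity requirement concrete, a small sketch (class name ours):

```java
import java.util.stream.LongStream;

public class ReduceIdentityDemo {
    public static void main(String[] args) {
        // Correct: 0 is the identity of addition, so serial and parallel agree
        long ok = LongStream.rangeClosed(1, 100).parallel().reduce(0L, Long::sum);
        System.out.println(ok); // 5050

        // Wrong: 10 is NOT an identity for +. In parallel, each split chunk
        // starts from 10, so 10 is added once per chunk and the result
        // exceeds the serial answer of 5060 by an unpredictable amount.
        long bad = LongStream.rangeClosed(1, 100).parallel().reduce(10L, Long::sum);
        System.out.println(bad); // >= 5060, may vary from run to run
    }
}
```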

4.3 Collector (collect)

<R> R collect(Supplier<R> supplier,
              BiConsumer<R, ? super T> accumulator,
              BiConsumer<R, R> combiner);

  • R: the type of the result, usually a container (such as a Collection or Map).
  • T: the element type of the Stream.
  • supplier: a function that creates a new container instance.
  • accumulator: a function that folds one stream element into a container.
  • combiner: a function that merges two containers into one; it is used only during parallel execution.

In parallel execution there are additional requirements:

  1. The combiner function must be associative.
  2. The combiner and the accumulator must be compatible: for any r and t, accumulating t into a fresh container and merging that into r must leave r in the same state as accumulating t into r directly, i.e. combiner.accept(r, accumulator.accept(supplier.get(), t)) must be equivalent to accumulator.accept(r, t) (read in terms of the resulting container state, since BiConsumer mutates its argument rather than returning a value).
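A sketch (class name ours) of the three-argument collect, reworking the provinces example from section 4.1:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CollectDemo {
    public static void main(String[] args) {
        List<String> provinces = Arrays.asList("Guangdong", "Jiangsu", "Guangxi", "Jiangxi", "Shandong");

        // supplier:    creates a fresh container for each parallel subtask
        // accumulator: folds one element into a container
        // combiner:    merges two partial containers (used only when parallel)
        List<String> results = provinces.parallelStream()
                .filter(s -> !s.startsWith("G"))
                .collect(ArrayList::new,      // supplier
                         ArrayList::add,      // accumulator
                         ArrayList::addAll);  // combiner

        // collect preserves the encounter order even in parallel
        System.out.println(results); // [Jiangsu, Jiangxi, Shandong]
    }
}
```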


Origin blog.csdn.net/caoyuan666/article/details/124621402