Usage and Principles of collect in Java Stream

Getting to know Collector

Let's look at a simple scenario:

Given a list of all personnel in a group, select all personnel who belong to the Shanghai subsidiary.

Assume that the personnel information data is as follows:

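| Name | Subsidiary | Department | Age | Salary |
| --- | --- | --- | --- | --- |
| Dazhuang (大壮) | Shanghai company (上海公司) | R&D Dept. 1 (研发一部) | 28 | 3000 |
| Erniu (二牛) | Shanghai company (上海公司) | R&D Dept. 1 (研发一部) | 24 | 2000 |
| Tiezhu (铁柱) | Shanghai company (上海公司) | R&D Dept. 2 (研发二部) | 34 | 5000 |
| Cuihua | Nanjing company (南京公司) | Test Dept. 1 (测试一部) | 27 | 3000 |
| Lingling | Nanjing company (南京公司) | Test Dept. 2 (测试二部) | 31 | 4000 |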

If you have used Stream before, or have read my previous article on Stream usage, you can easily implement this requirement with Stream:

public void filterEmployeesByCompany() {
    List<Employee> employees = getAllEmployees().stream()
            .filter(employee -> "上海公司".equals(employee.getSubCompany()))
            .collect(Collectors.toList());
    System.out.println(employees);
}

In the code above, a stream is created first, then processed at the business level through an intermediate operation (the filter method), and finally the terminal operation (the collect method) outputs the processed result as a List object.
However, the requirements we actually face are often more complex, for example:

Given a list of all personnel in a group, select all personnel of the Shanghai subsidiary and group them by department.

This simply adds a grouping requirement: first implement the filtering logic as before, then group the results:

public void filterEmployeesThenGroup() {
    // First, filter
    List<Employee> employees = getAllEmployees().stream()
            .filter(employee -> "上海公司".equals(employee.getSubCompany()))
            .collect(Collectors.toList());
    // Then, group
    Map<String, List<Employee>> resultMap = new HashMap<>();
    for (Employee employee : employees) {
        // Create the department's list if the key is absent; otherwise return the existing list
        List<Employee> groupList = resultMap
                .computeIfAbsent(employee.getDepartment(), k -> new ArrayList<>());
        groupList.add(employee);
    }
    System.out.println(resultMap);
}

This looks fine, and I believe many readers handle it this way in real code. But we can also do it all directly with a Stream operation:

public void filterEmployeesThenGroupByStream() {
    Map<String, List<Employee>> resultMap = getAllEmployees().stream()
            .filter(employee -> "上海公司".equals(employee.getSubCompany()))
            .collect(Collectors.groupingBy(Employee::getDepartment));
    System.out.println(resultMap);
}

Both versions produce the same result:

{
    研发二部=[Employee(subCompany=上海公司, department=研发二部, name=铁柱, age=34, salary=5000)],
    研发一部=[Employee(subCompany=上海公司, department=研发一部, name=大壮, age=28, salary=3000),
             Employee(subCompany=上海公司, department=研发一部, name=二牛, age=24, salary=2000)]
}

Comparing the two versions, isn't the second much more concise? And doesn't it read almost like self-documenting code?

With reasonable and appropriate use of the collect method, Stream can be adapted to more practical scenarios and can greatly improve our development efficiency. Let's take a comprehensive look at collect and unlock more advanced operations.

collect, Collector, and Collectors: differences and relationship

When first encountering Stream collectors, many readers are confused by the three concepts collect, Collector, and Collectors. Even people who have used Stream for years often only know that collect takes something like **Collectors.toList()**, without understanding the details behind it.
Taking the simplest use of collect, stream.collect(Collectors.toList()), the relationship among the three is as follows:

1️⃣ collect is a terminal method of Stream. It uses the collector passed in as its argument to process the stream's elements into a result. That collector must be a concrete implementation of the Collector interface.
2️⃣ Collector is an interface; the collector passed to the collect method is a concrete implementation of it.
3️⃣ Collectors is a utility class that provides many static factory methods, each returning a ready-made implementation of the Collector interface. These are general-purpose collectors preset for programmers' convenience (you may also skip the Collectors class and implement the Collector interface yourself).
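
The relationship described above can be sketched in a few lines. This is only an illustration: the class name CollectRelationDemo and its method names are made up for this example. It shows that collect accepts any Collector, whether it comes from a Collectors factory method or is built by hand with the JDK's Collector.of:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.Collectors;

// Hypothetical demo class; the names are illustrative only.
class CollectRelationDemo {
    // A collector obtained from a static factory method of the Collectors utility class.
    static List<String> withFactory(List<String> names) {
        Collector<String, ?, List<String>> collector = Collectors.toList();
        return names.stream().collect(collector);
    }

    // A hand-built collector: collect() accepts it just the same, because
    // it only requires some implementation of the Collector interface.
    static List<String> withCustom(List<String> names) {
        Collector<String, List<String>, List<String>> collector = Collector.of(
                ArrayList::new,                                          // supplier
                List::add,                                               // accumulator
                (left, right) -> { left.addAll(right); return left; }); // combiner
        return names.stream().collect(collector);
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("a", "b", "c");
        System.out.println(withFactory(names)); // [a, b, c]
        System.out.println(withCustom(names));  // [a, b, c]
    }
}
```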

Collector use and analysis

So far we can see that the essence of Stream result collection is to process the elements of the Stream using the function logic defined by the collector, and then output the processed result.
According to the type of operation they perform, collectors can be divided into several categories, covered in turn below: identity processing, reduction and summary, and grouping and partitioning.

Identity Processing Collector

Identity processing means that the elements of the Stream are completely unchanged before and after being processed by the Collector's function. For example, toList() only takes the elements out of the Stream and puts them into a List object; it does not change the elements themselves.
Identity-processing collectors are the most commonly used in real code, for example:

list.stream().collect(Collectors.toList());
list.stream().collect(Collectors.toSet());
// toCollection requires a Supplier that creates the target collection
list.stream().collect(Collectors.toCollection(LinkedList::new));
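
As a small runnable sketch of these three collectors (the class and method names are hypothetical), note that toCollection takes a Supplier for the target collection, here TreeSet::new, and that none of the three changes the elements themselves:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical demo class; the names are illustrative only.
class IdentityCollectorsDemo {
    static List<Integer> collectToList(List<Integer> src) {
        return src.stream().collect(Collectors.toList());
    }

    static Set<Integer> collectToSet(List<Integer> src) {
        return src.stream().collect(Collectors.toSet());
    }

    // toCollection needs a Supplier that creates the target collection.
    static TreeSet<Integer> collectToTreeSet(List<Integer> src) {
        return src.stream().collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        List<Integer> src = Arrays.asList(3, 1, 3, 2);
        System.out.println(collectToList(src));       // [3, 1, 3, 2]
        System.out.println(collectToSet(src).size()); // 3 (duplicates removed)
        System.out.println(collectToTreeSet(src));    // [1, 2, 3]
    }
}
```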

Reduction Summary Collector

For reduction and summary operations, the elements in the Stream are traversed one by one; each enters the Collector's processing function and is merged with the result accumulated from the previous elements to produce a new result, and so on until the traversal is complete and the final result is output. For example, **Collectors.summingInt()** starts from 0 and adds the int value extracted from each element to the running total.
Continuing the example from the beginning of this article: to calculate the total monthly salary the Shanghai subsidiary must pay, use Collectors.summingInt():

public void calculateSum() {
    Integer salarySum = getAllEmployees().stream()
            .filter(employee -> "上海公司".equals(employee.getSubCompany()))
            .collect(Collectors.summingInt(Employee::getSalary));
    System.out.println(salarySum);
}

Note that the summary here is not just summation in the mathematical sense but a general notion: processing multiple elements to produce a single result. Finding the maximum value in a Stream, for example, also operates over multiple elements and ends with one result.
Still using the earlier example, suppose we now need the employee with the highest salary in the Shanghai subsidiary. We can do it like this:

public void findHighestSalaryEmployee() {
    Optional<Employee> highestSalaryEmployee = getAllEmployees().stream()
            .filter(employee -> "上海公司".equals(employee.getSubCompany()))
            .collect(Collectors.maxBy(Comparator.comparingInt(Employee::getSalary)));
    System.out.println(highestSalaryEmployee.get());
}

Because we want to demonstrate collect here, the version above is used. In practice, the JDK also provides a simplified wrapper for this logic: we can call the **max()** method directly, so the code above is equivalent to the following:

public void findHighestSalaryEmployee2() {
    Optional<Employee> highestSalaryEmployee = getAllEmployees().stream()
            .filter(employee -> "上海公司".equals(employee.getSubCompany()))
            .max(Comparator.comparingInt(Employee::getSalary));
    System.out.println(highestSalaryEmployee.get());
}

Group Partition Collector

The Collectors utility class provides the groupingBy method to obtain a grouping Collector.
The **groupingBy()** operation needs two key inputs, the grouping function and the value collector:

  • Grouping function: a function applied to each element that returns the value used for grouping (that is, the key in the resulting Map). Elements for which this function returns the same value are assigned to the same group.

  • Value collector: the logic for further processing and converting the elements of each group. It is an ordinary Collector, exactly the same kind as the one passed to the collect() method (think of Russian nesting dolls).

For groupingBy, both the grouping function and the value collector are required. For convenience, the Collectors utility class provides overloaded versions of groupingBy, one of which only takes a grouping function because it uses toList() as the default value collector.
For example, when you only need a regular grouping, you can pass in just the grouping function:

public void groupBySubCompany() {
    // Group employees by subsidiary
    Map<String, List<Employee>> resultMap =
            getAllEmployees().stream()
                    .collect(Collectors.groupingBy(Employee::getSubCompany));
    System.out.println(resultMap);
}

In this way, the result returned by collect is a Map (a HashMap in the current JDK implementation, although the API does not guarantee the concrete type), and each value in it is a List.
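
To make the default value collector concrete, here is a minimal sketch showing that the one-argument form behaves like passing toList() explicitly. The Emp class is a hypothetical, cut-down stand-in for the article's Employee, and the subsidiary names are written in English for brevity:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical demo class; Emp is a minimal stand-in for the article's Employee.
class GroupingDefaultDemo {
    static class Emp {
        final String subCompany;
        final String name;
        Emp(String subCompany, String name) { this.subCompany = subCompany; this.name = name; }
        String getSubCompany() { return subCompany; }
    }

    static final List<Emp> SAMPLE = Arrays.asList(
            new Emp("Shanghai", "Dazhuang"),
            new Emp("Shanghai", "Erniu"),
            new Emp("Nanjing", "Cuihua"));

    // One-argument form: only the grouping function is given.
    static Map<String, List<Emp>> shortForm(List<Emp> emps) {
        return emps.stream().collect(Collectors.groupingBy(Emp::getSubCompany));
    }

    // Two-argument form with toList() spelled out as the value collector.
    static Map<String, List<Emp>> longForm(List<Emp> emps) {
        return emps.stream()
                .collect(Collectors.groupingBy(Emp::getSubCompany, Collectors.toList()));
    }

    public static void main(String[] args) {
        System.out.println(shortForm(SAMPLE).equals(longForm(SAMPLE))); // true
        System.out.println(shortForm(SAMPLE).get("Shanghai").size());   // 2
    }
}
```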

If the grouped data also needs further processing, give both the grouping function and the value collector:

public void groupAndCaculate() {
    // Group by subsidiary and count the employees in each subsidiary
    Map<String, Long> resultMap = getAllEmployees().stream()
            .collect(Collectors.groupingBy(Employee::getSubCompany,
                    Collectors.counting()));
    System.out.println(resultMap);
}

This performs the grouping and the processing of each group's data at the same time:

{南京公司=2, 上海公司=3}

In the code above, Collectors.groupingBy() is a grouping Collector, and a reduction Collector, Collectors.counting(), is passed into it; that is, one collector is nested inside another.
Besides the scenarios shown above, there is a special grouping case whose key type is just Boolean. For this we can use the partitioning collector provided by Collectors.partitioningBy().
For example:

Count the employees of the Shanghai company and of non-Shanghai companies, where true means the Shanghai company and false means the others.

Using the partition collector, it can be implemented like this:

public void partitionByCompanyAndDepartment() {
    Map<Boolean, Long> resultMap = getAllEmployees().stream()
            .collect(Collectors.partitioningBy(e -> "上海公司".equals(e.getSubCompany()),
                    Collectors.counting()));
    System.out.println(resultMap);
}

The result is as follows:

{false=2, true=3}

The Collectors.partitioningBy() partitioning collector is used in the same way as the Collectors.groupingBy() grouping collector. Purely in terms of usage, a grouping collector whose grouping function returns a Boolean behaves much like a partitioning collector. One difference is that the Map returned by partitioningBy always contains both the true and false keys, even when one partition is empty.
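
The similarity between the two collectors, and one place where they differ, can be seen in a small sketch. The class name and the English subsidiary names "Shanghai" and "Nanjing" are assumptions made for this example. When every element falls on the same side, partitioningBy still reports the empty side, while groupingBy only creates keys that actually occur:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical demo class; the names and sample values are illustrative only.
class PartitionVsGroupDemo {
    static Map<Boolean, Long> partition(List<String> companies) {
        return companies.stream()
                .collect(Collectors.partitioningBy("Shanghai"::equals, Collectors.counting()));
    }

    static Map<Boolean, Long> group(List<String> companies) {
        return companies.stream()
                .collect(Collectors.groupingBy(c -> "Shanghai".equals(c), Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> onlyShanghai = Arrays.asList("Shanghai", "Shanghai");
        System.out.println(partition(onlyShanghai)); // {false=0, true=2}
        System.out.println(group(onlyShanghai));     // {true=2}
    }
}
```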

Nesting collectors

Sometimes we need to group by one dimension, then group further by a second dimension, and then process the grouped results. In this scenario we can nest Collectors.
For example, consider the following requirement:

There is a list of all employees of the entire group, and it is necessary to count the number of employees in each department of each subsidiary.

Using Stream's nested Collector, we can achieve this:

public void groupByCompanyAndDepartment() {
    // Group by subsidiary, then by department, and count the employees in each department
    Map<String, Map<String, Long>> resultMap = getAllEmployees().stream()
            .collect(Collectors.groupingBy(Employee::getSubCompany,
                    Collectors.groupingBy(Employee::getDepartment,
                            Collectors.counting())));
    System.out.println(resultMap);
}

The output meets expectations:

{
    南京公司={测试二部=1, 测试一部=1},
    上海公司={研发二部=1, 研发一部=2}
}

The code above is a typical example of nested Collectors and a typical implementation of multi-level grouping. The overall flow is: the outer groupingBy first assigns each employee to a group by subsidiary; within each subsidiary group, the inner groupingBy assigns employees to groups by department; and within each department group, counting() counts the elements.
By nesting multiple Collectors we can handle many complex scenarios. You can think of this as a nesting-doll operation, and nest as deeply as you like (though absurdly deep nesting is unlikely in practice).
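
The innermost collector of such a nesting can be swapped freely. As a sketch under assumptions (Emp is a hypothetical stand-in for Employee, and the sample values are English renderings of the personnel table at the start of this article), replacing counting() with summingInt() yields the total salary per department within each subsidiary:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical demo class; Emp is a minimal stand-in for the article's Employee.
class NestedGroupingDemo {
    static class Emp {
        final String subCompany;
        final String department;
        final int salary;
        Emp(String subCompany, String department, int salary) {
            this.subCompany = subCompany;
            this.department = department;
            this.salary = salary;
        }
        String getSubCompany() { return subCompany; }
        String getDepartment() { return department; }
        int getSalary() { return salary; }
    }

    static final List<Emp> SAMPLE = Arrays.asList(
            new Emp("Shanghai", "R&D 1", 3000),
            new Emp("Shanghai", "R&D 1", 2000),
            new Emp("Shanghai", "R&D 2", 5000),
            new Emp("Nanjing", "Test 1", 3000),
            new Emp("Nanjing", "Test 2", 4000));

    // Outer grouping by subsidiary, inner grouping by department, summing salaries innermost.
    static Map<String, Map<String, Integer>> salaryByCompanyAndDepartment(List<Emp> emps) {
        return emps.stream()
                .collect(Collectors.groupingBy(Emp::getSubCompany,
                        Collectors.groupingBy(Emp::getDepartment,
                                Collectors.summingInt(Emp::getSalary))));
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> result = salaryByCompanyAndDepartment(SAMPLE);
        System.out.println(result.get("Shanghai").get("R&D 1")); // 5000
        System.out.println(result.get("Nanjing").get("Test 2")); // 4000
    }
}
```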

Collectors provided by the Collectors class

For programmers' convenience, the Collectors utility class in the JDK provides many ready-made Collector implementations that can be used directly when coding. The commonly used ones are:

| Method | Meaning |
| --- | --- |
| toList | Collects the elements of the stream into a List |
| toSet | Collects the elements of the stream into a Set |
| toCollection | Collects the elements of the stream into a Collection created by the given supplier |
| toMap | Maps the elements of the stream to keys and values and collects them into a Map |
| counting | Counts the number of elements in the stream |
| summingInt | Computes the sum of the specified int field over the stream. There are variants for other numeric types, such as summingDouble |
| averagingInt | Computes the average of the specified int field over the stream. There are variants for other numeric types, such as averagingLong |
| joining | Concatenates the string values of all elements (or of a specified field of the elements) in the stream; a separator, and optionally a prefix and suffix, can be specified |
| maxBy | Selects the largest element according to the given comparator |
| minBy | Selects the smallest element according to the given comparator |
| groupingBy | Groups by the value of the given grouping function and outputs a Map |
| partitioningBy | Partitions by the value of the given predicate and outputs a Map whose keys are always Boolean |
| collectingAndThen | Wraps another collector and applies a secondary transformation to its result |
| reducing | Starting from the given initial value, processes the elements one by one and finally reduces all elements to a single output value |
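
A few of the collectors in this table (toMap, joining, averagingInt, and reducing) are not demonstrated elsewhere in this article, so here is a brief sketch of them. The class name and the sample words are invented for illustration:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical demo class; the names and sample words are illustrative only.
class MoreCollectorsDemo {
    static final List<String> WORDS = Arrays.asList("stream", "collect", "to");

    // toMap: key = the word itself, value = its length.
    static Map<String, Integer> lengths() {
        return WORDS.stream().collect(Collectors.toMap(w -> w, String::length));
    }

    // joining: separator ", " plus a prefix and a suffix.
    static String joined() {
        return WORDS.stream().collect(Collectors.joining(", ", "[", "]"));
    }

    // averagingInt: average of the word lengths.
    static Double averageLength() {
        return WORDS.stream().collect(Collectors.averagingInt(String::length));
    }

    // reducing: start from 0, map each word to its length, combine with Integer::sum.
    static Integer totalLength() {
        return WORDS.stream().collect(Collectors.reducing(0, String::length, Integer::sum));
    }

    public static void main(String[] args) {
        System.out.println(lengths().get("collect")); // 7
        System.out.println(joined());                 // [stream, collect, to]
        System.out.println(averageLength());          // 5.0
        System.out.println(totalLength());            // 15
    }
}
```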

Most of the methods above have been used in the previous examples. Here is a supplementary introduction to collectingAndThen.
collectingAndThen takes a downstream collector, which actually collects and processes the results, and a finisher function. After the downstream collector has computed its result, the finisher function performs secondary processing on it, and the processed value is returned as the final result.
Take the previous example again:

Given a list of all employees in a group, find the employee with the highest salary in the Shanghai company.

We can write the following code:

public void findHighestSalaryEmployee() {
    Optional<Employee> highestSalaryEmployee = getAllEmployees().stream()
            .filter(employee -> "上海公司".equals(employee.getSubCompany()))
            .collect(Collectors.maxBy(Comparator.comparingInt(Employee::getSalary)));
    System.out.println(highestSalaryEmployee.get());
}

But the result is an Optional<Employee>, which is more cumbersome to use. Can we return the Employee type directly? collectingAndThen makes this possible (note that Optional::get throws NoSuchElementException if the filtered stream is empty):

public void testCollectAndThen() {
    Employee employeeResult = getAllEmployees().stream()
            .filter(employee -> "上海公司".equals(employee.getSubCompany()))
            .collect(
                    Collectors.collectingAndThen(
                            Collectors.maxBy(Comparator.comparingInt(Employee::getSalary)),
                            Optional::get)
            );
    System.out.println(employeeResult);
}

That's it, isn't it super simple?
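
Two further uses of collectingAndThen, sketched with hypothetical names: wrapping toList() so the caller receives a read-only list, and unwrapping the Optional produced by maxBy() with a fallback value instead of Optional::get, so that an empty stream does not cause an exception:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collector;
import java.util.stream.Collectors;

// Hypothetical demo class; the names are illustrative only.
class CollectingAndThenDemo {
    // Collect into a list, then wrap it so callers cannot modify it.
    static List<Integer> readOnly(List<Integer> src) {
        return src.stream()
                .collect(Collectors.collectingAndThen(
                        Collectors.toList(), Collections::unmodifiableList));
    }

    // Find the maximum, then unwrap the Optional with a fallback for empty input.
    static Integer maxOrDefault(List<Integer> src, Integer fallback) {
        Collector<Integer, ?, Optional<Integer>> max =
                Collectors.maxBy(Comparator.<Integer>naturalOrder());
        return src.stream()
                .collect(Collectors.collectingAndThen(max, best -> best.orElse(fallback)));
    }

    public static void main(String[] args) {
        System.out.println(readOnly(Arrays.asList(3, 9, 4)));                    // [3, 9, 4]
        System.out.println(maxOrDefault(Arrays.asList(3, 9, 4), -1));            // 9
        System.out.println(maxOrDefault(Collections.<Integer>emptyList(), -1)); // -1
    }
}
```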

Develop a custom collector

Earlier we demonstrated many of the collectors provided by the Collectors utility class; those listed in the previous section cover the needs of most scenarios.
But a project may still present custom scenarios that the existing collectors cannot satisfy. In that case we can implement a custom collector ourselves.

Collector interface introduction

We know that a collector is a concrete implementation of the Collector interface. So to write your own collector, you first need to understand which methods of the Collector interface must be implemented and what each one does.
When we create a new MyCollector class and declare that it implements the Collector interface, we find that 5 methods must be implemented:

| Method | Description |
| --- | --- |
| supplier | Creates a new result container, which may be a collection or an accumulator instance; in short, it is what stores the result data |
| accumulator | The processing applied to each element entering the collector |
| finisher | The final processing applied to the result after all elements have been processed and before it is returned; it may also do nothing and return the result directly |
| combiner | How the partial results of sub-streams are merged. In a parallel stream, for example, the elements are split into segments that are processed in parallel, and the result of each segment must finally be merged into one overall result; this method specifies that merge logic |
| characteristics | Supplementary description of the collector's behavior, such as whether its container can be accumulated into concurrently and whether the finisher can be skipped. It returns a Set whose possible values are a few fixed options |

The possible values in the Set returned by characteristics are as follows:

| Value | Meaning |
| --- | --- |
| UNORDERED | Declares that the result of this collector does not depend on the traversal order of the Stream's elements and is not affected by the order in which they are processed |
| CONCURRENT | Declares that the accumulator may be called concurrently from multiple threads on the same result container, so a parallel stream can accumulate into one shared container instead of one container per segment |
| IDENTITY_FINISH | Declares that the finisher of this collector is an identity operation and can be skipped |

Now that we know the meaning and purpose of these five methods, how do they cooperate to collect Stream data into the required output? For a sequential stream, supplier is called once to create the result container, accumulator is called once per element to fold that element into the container, and finally finisher converts the container into the final result.
If the stream is parallel, the process is slightly different: the stream is split into segments, each segment gets its own container from supplier and fills it with accumulator, the partial containers are merged with combiner, and finisher is applied to the merged result.
To get an intuitive understanding of these methods, we can look at the source code of the Collectors.toList() collector:

static final Set<Collector.Characteristics> CH_ID
        = Collections.unmodifiableSet(EnumSet.of(Collector.Characteristics.IDENTITY_FINISH));

public static <T> Collector<T, ?, List<T>> toList() {
    return new CollectorImpl<>((Supplier<List<T>>) ArrayList::new, List::add,
                               (left, right) -> { left.addAll(right); return left; },
                               CH_ID);
}

Breaking the code above down:

  • supplier method: ArrayList::new, that is, a new ArrayList is used as the result container.
  • accumulator method: List::add, that is, for each element in the stream, list.add() is called to add it to the result container.
  • combiner method: (left, right) -> { left.addAll(right); return left; }, that is, the sub-ArrayLists produced by parallel processing are merged into the final result through list.addAll().
  • finisher method: not provided, so the default is used; no processing is needed because it is an identity operation.
  • characteristics: IDENTITY_FINISH is returned, meaning the container is returned directly as the final result without processing by the finisher. Note that CONCURRENT is not declared, because ArrayList is not thread-safe. In a parallel stream each segment therefore accumulates into its own ArrayList and the lists are merged by the combiner, rather than all threads writing to one shared list.

With this method-by-method description in mind, think about how Collectors.toList() behaves; you should now have a more intuitive understanding of each interface method.
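
To see how these methods cooperate, the following sketch drives a collector by hand instead of going through Stream.collect. It is a simplified illustration rather than the JDK's actual implementation: the helper names are made up, and the two-segment variant runs on a single thread only to show the order of the calls:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.Collectors;

// Hypothetical demo class; a simplified imitation of what collect() does.
class CollectorByHandDemo {
    // Sequential case: one container, one accumulator call per element, then the finisher.
    static <T, A, R> R collectSequentially(List<T> elements, Collector<T, A, R> collector) {
        A container = collector.supplier().get();
        for (T element : elements) {
            collector.accumulator().accept(container, element);
        }
        return collector.finisher().apply(container);
    }

    // Parallel-style case: each segment has its own container,
    // the containers are merged by the combiner, then the finisher is applied.
    static <T, A, R> R collectInTwoSegments(List<T> first, List<T> second,
                                            Collector<T, A, R> collector) {
        A left = collector.supplier().get();
        A right = collector.supplier().get();
        first.forEach(e -> collector.accumulator().accept(left, e));
        second.forEach(e -> collector.accumulator().accept(right, e));
        A merged = collector.combiner().apply(left, right);
        return collector.finisher().apply(merged);
    }

    public static void main(String[] args) {
        List<Integer> sequential =
                collectSequentially(Arrays.asList(1, 2, 3), Collectors.toList());
        List<Integer> segmented =
                collectInTwoSegments(Arrays.asList(1, 2), Arrays.asList(3), Collectors.toList());
        System.out.println(sequential); // [1, 2, 3]
        System.out.println(segmented);  // [1, 2, 3]
    }
}
```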

Implement the Collector interface

Now that the roles of the main methods of the Collector interface are clear, you can start writing your own collector: create a new class, declare that it implements the Collector interface, and implement each interface method.
As mentioned earlier, the Collectors.summingInt collector computes the sum of an int field across the elements. Suppose we need a new accumulation function:

Compute the sum of the squares of an int field across all elements in the stream.

Next, let's define a collector together to realize this function.

  • supplier method

The supplier method creates the container that stores and accumulates the result. Since we are accumulating multiple values, we could start from something like int sum = 0. A primitive int cannot serve as a mutable container, though, and we also want the container to be safe for concurrent accumulation, so we use the thread-safe AtomicInteger here.
The supplier method is therefore implemented as:

@Override
public Supplier<AtomicInteger> supplier() {
    // The container that collects the final result: start from new AtomicInteger(0) and accumulate into it
    return () -> new AtomicInteger(0);
}

  • accumulator method

The accumulator method implements the actual calculation and is where the core business logic of the whole Collector lives. During collection, the elements of the Stream enter the Collector one by one, and the accumulator processes each of them in turn:

@Override
public BiConsumer<AtomicInteger, T> accumulator() {
    // Applied to each incoming element: add the square of the current element's value to sum
    return (sum, current) -> {
        int intValue = mapper.applyAsInt(current);
        sum.addAndGet(intValue * intValue);
    };
}

Note that among the collector's methods, only the function returned by accumulator is executed repeatedly, once per element; the others do not deal directly with the elements of the Stream.

  • combiner method

As introduced above, a parallel stream may split the Stream into multiple segments, process each segment separately to obtain a partial result, and then merge these partial results into one overall result. The combiner defines how they are merged, which is what we implement here:

@Override
public BinaryOperator<AtomicInteger> combiner() {
    // How partial results from multiple segments are merged: simply add them
    return (sum1, sum2) -> {
        sum1.addAndGet(sum2.get());
        return sum1;
    };
}

Because we are summing squares, the partial results of the segments can simply be added together.

  • finisher method

The target result of our collector is an accumulated Integer value, but for thread safety under concurrent accumulation we use AtomicInteger as the result container. In the end we therefore need to convert the internal AtomicInteger into an Integer, so the finisher method is implemented as:

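@Override
public Function<AtomicInteger, Integer> finisher() {
    // Secondary processing of the result once accumulation is complete.
    // An AtomicInteger is used internally as the accumulator to support concurrent processing,
    // but the collector must return an Integer, so the result is converted here.
    return AtomicInteger::get;
}

  • characteristics method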

Here we declare some characteristics of the collector:

Because the result container is a thread-safe AtomicInteger that multiple threads may accumulate into at the same time, we declare the CONCURRENT characteristic;
because the operation sums numbers, the order in which elements are processed does not matter, so we also declare the UNORDERED characteristic;
because the finisher method converts the result rather than performing an identity operation, the IDENTITY_FINISH characteristic must not be declared.

Based on this analysis, the implementation of this method is as follows:

@Override
public Set<Characteristics> characteristics() {
    Set<Characteristics> characteristics = new HashSet<>();
    // This collector supports concurrent accumulation (the container is a thread-safe AtomicInteger)
    characteristics.add(Characteristics.CONCURRENT);
    // The order in which elements are processed does not affect the final result
    characteristics.add(Characteristics.UNORDERED);
    // Note: the line below is deliberately not added, because the finisher transforms the result (not an identity conversion)
    // characteristics.add(Characteristics.IDENTITY_FINISH);
    return characteristics;
}

Let's try out our custom collector:

public void testMyCollector() {
    Integer result = Stream.of(new Score(1), new Score(2), new Score(3), new Score(4))
            .collect(new MyCollector<>(Score::getScore));
    System.out.println(result);
}

Output result:

30

Exactly as expected (1 + 4 + 9 + 16 = 30): the custom collector works.
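
For reference, the fragments above can be assembled into one compilable sketch. The article does not show the class declaration, so several parts here are assumptions: the mapper field of type ToIntFunction<T> and the constructor are inferred from the call to mapper.applyAsInt and from new MyCollector<>(Score::getScore), the Score class is a minimal guess based on its usage, and MyCollectorDemo is a hypothetical helper:

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.function.ToIntFunction;
import java.util.stream.Collector;

// Assumed class shape: sums the squares of an int value extracted from each element.
class MyCollector<T> implements Collector<T, AtomicInteger, Integer> {
    private final ToIntFunction<T> mapper; // assumed field, inferred from mapper.applyAsInt

    MyCollector(ToIntFunction<T> mapper) { this.mapper = mapper; }

    @Override public Supplier<AtomicInteger> supplier() { return AtomicInteger::new; }

    @Override public BiConsumer<AtomicInteger, T> accumulator() {
        return (sum, current) -> {
            int value = mapper.applyAsInt(current);
            sum.addAndGet(value * value);
        };
    }

    @Override public BinaryOperator<AtomicInteger> combiner() {
        return (sum1, sum2) -> { sum1.addAndGet(sum2.get()); return sum1; };
    }

    @Override public Function<AtomicInteger, Integer> finisher() { return AtomicInteger::get; }

    @Override public Set<Characteristics> characteristics() {
        return EnumSet.of(Characteristics.CONCURRENT, Characteristics.UNORDERED);
    }
}

// Minimal guess at the Score class used in the article's test method.
class Score {
    private final int score;
    Score(int score) { this.score = score; }
    int getScore() { return score; }
}

// Hypothetical helper that runs the collector sequentially and in parallel.
class MyCollectorDemo {
    static Integer sumOfSquares(int... scores) {
        return Arrays.stream(scores).mapToObj(Score::new)
                .collect(new MyCollector<Score>(Score::getScore));
    }

    static Integer sumOfSquaresParallel(int... scores) {
        return Arrays.stream(scores).parallel().mapToObj(Score::new)
                .collect(new MyCollector<Score>(Score::getScore));
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(1, 2, 3, 4));         // 30
        System.out.println(sumOfSquaresParallel(1, 2, 3, 4)); // 30
    }
}
```

For a one-off collector like this, the JDK's Collector.of factory could be used instead of a dedicated class; the class form is kept here because it mirrors the step-by-step walkthrough above.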

Origin blog.csdn.net/doublepg13/article/details/128582800