Java 8: most efficient way to return duplicates from a list (not remove them)?

Jonathon Hoaglin :

I have an ArrayList of Strings, and I want to find and return all values that occur more than once in the list. Most examples I can find do the opposite (removing the duplicates, as distinct() does), so example code is hard to come by.

I was able to come up with this:

public synchronized List<String> listMatching(List<String> allStrings) {

    long startTime = System.currentTimeMillis();

    List<String> duplicates = allStrings.stream().filter(string -> Collections.frequency(allStrings, string) > 1)
            .collect(Collectors.toList());

    long stopTime = System.currentTimeMillis();
    long elapsedTime = stopTime - startTime;
    LOG.info("Time for Collections.frequency(): "+ elapsedTime);

    return duplicates;
}

But this uses Collections.frequency, which loops through the whole list for each item and counts every occurrence, so the work grows quadratically with the list size. This takes about 150ms on my current list of about 4,000 strings, which is a bit slow for me and will only get worse as the list grows. I took the frequency method and rewrote it to return immediately on the second occurrence:

protected boolean moreThanOne(Collection<?> c, Object o) {
    boolean found = false;
    if (o != null) {
        for (Object e : c) {
            if (o.equals(e)) {
                if (found) {
                    return found;
                } else {
                    found = true;
                }
            }
        }
    }
    return found;
}

and changed my method to use it:

public synchronized List<String> listMatching(List<String> allStrings)   {
    long startTime = System.currentTimeMillis();

    List<String> duplicates = allStrings.stream().filter(string -> moreThanOne(allStrings, string))
            .collect(Collectors.toList());

    long stopTime = System.currentTimeMillis();
    long elapsedTime = stopTime - startTime;
    LOG.info("Time for moreThanOne(): "+ elapsedTime);

    return duplicates;
}

This seems to work as expected, but does not improve the speed as much as I was hoping, clocking in at about 120ms. That is probably because it still has to scan the list for each item (the whole list whenever the item is unique), but I am not sure how to avoid that and still accomplish the task.

I know this might seem like premature optimization, but my List can easily hold 1 million+ items, and this method is a critical piece of my app that influences the timing of everything else.

Do you see any way that I could further optimize this code? Perhaps using some sort of fancy Predicate? An entirely different approach?

EDIT: Thanks to all your suggestions, I was able to come up with something significantly faster:

public synchronized Set<String> listMatching(List<String> allStrings) {

    Set<String> allItems = new HashSet<>();
    Set<String> duplicates = allStrings.stream()
            .filter(string -> !allItems.add(string))
            .collect(Collectors.toSet());

    return duplicates;
}

Running under the same conditions, this is able to go through my list in <5ms. All the HashMap suggestions would have been great though, if I had needed to know the counts. Not sure why the Collections.frequency() method doesn't use that technique.
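
For anyone who does need the counts, here is a minimal sketch of the map-based idea, assuming a sequential stream and no null elements in the list (groupingBy rejects null keys). The class name DuplicateCounts and the method duplicateCounts are made up for illustration and are not from any of the suggestions:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class DuplicateCounts {

    // Hypothetical helper: maps each value that occurs more than once to its count.
    static Map<String, Long> duplicateCounts(List<String> allStrings) {
        return allStrings.stream()
                // One pass over the list: group equal strings and count each group.
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                .entrySet().stream()
                // Keep only the values that were seen at least twice.
                .filter(entry -> entry.getValue() > 1)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "a", "c", "b", "a");
        // "a" occurs 3 times and "b" twice; "c" is unique and is dropped.
        System.out.println(duplicateCounts(input));
    }
}
```

This does more work than the HashSet version (it stores a counter per distinct value and makes a second pass over the map), so it is probably only worth it when the counts themselves are needed.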

dspano :

An easy way to find duplicates is to iterate over the list and use the add() method to add the item to some other temp set. It will return false if the item already exists in the set.

public synchronized Set<String> listMatching(List<String> allStrings) {
   // Every value seen so far.
   Set<String> tempSet = new HashSet<>();
   // Values seen at least twice.
   Set<String> duplicates = new HashSet<>();

   allStrings.forEach( item -> {
       // add() returns false when the item is already in tempSet.
       if (!tempSet.add(item)) duplicates.add(item);
   });

   return duplicates;
}
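
As a usage sketch of the add()-based approach, the standalone example below can be compiled and run on its own. It is a variation rather than the exact method above: the names FindDuplicates and findDuplicates are made up for illustration, and it assumes you want the duplicates back as a List in the order in which each value was first repeated, which is why it uses a LinkedHashSet.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class FindDuplicates {

    // Hypothetical helper: returns each repeated value once, in the order
    // in which its second occurrence appears in the input.
    static List<String> findDuplicates(List<String> allStrings) {
        Set<String> seen = new HashSet<>();
        // LinkedHashSet keeps insertion order; a plain HashSet would not.
        Set<String> duplicates = new LinkedHashSet<>();

        for (String item : allStrings) {
            // add() returns false when the item was already in 'seen'.
            if (!seen.add(item)) {
                duplicates.add(item);
            }
        }
        return new ArrayList<>(duplicates);
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("b", "a", "b", "a", "c", "b");
        System.out.println(findDuplicates(input)); // prints [b, a]
    }
}
```

Each element costs one hash lookup on average, so the whole pass is roughly linear in the size of the list instead of quadratic.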
