Deduplication using a Java Set

Michael Kay :

I have a collection of objects, let's call them A, B, C, D,... and some are equal to others. If A and C are equal, then I want to replace every reference to C with a reference to A. This means (a) object C can be garbage collected, freeing up memory, and (b) I can later use "==" to compare objects in place of an expensive equals() operation. (These objects are large and the equals() operation is slow.)

My instinct was to use a java.util.Set. When I encounter C I can easily see if there is an entry in the Set equal to C. But if there is, there seems to be no easy way to find out what that entry is, and replace my reference with a reference to the existing entry. Am I mistaken? Iterating over all the entries to find the one that matches is obviously a non-starter.

Currently, instead of a Set, I'm using a Map in which the value is always the same as the key. Calling map.get(C) then finds A. This works, but it feels incredibly convoluted. Is there a more elegant way of doing it?

Stephen C :

This problem is not simple de-duplication: it is a form of canonicalization.

The standard approach is to use a Map rather than a Set. Here's a sketch of how to do it:

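import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public <T> List<T> canonicalizeList(List<T> input) {
    // Maps each distinct value to the first instance seen (its canonical instance).
    Map<T, T> map = new HashMap<>();
    List<T> output = new ArrayList<>();
    for (T element : input) {
        T canonical = map.get(element);
        if (canonical == null) {
            // First occurrence: this element becomes the canonical instance.
            canonical = element;
            map.put(canonical, canonical);
        }
        output.add(canonical);
    }
    return output;
}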

Note that this is O(N), assuming hashCode() spreads the elements reasonably well. If you can safely assume that the percentage of duplicates in input is likely to be small, then you could set the initial capacity of map and output to the size of input.
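A rough sketch of the pre-sizing idea follows. The class name PresizedCanonicalizer and the main method are made up for illustration. One detail that is an addition here rather than part of the answer above: a HashMap resizes once it holds more than capacity * 0.75 entries (the default load factor), so this sketch asks for slightly more than input.size() buckets to avoid any rehash.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical wrapper class, not from the original answer.
public class PresizedCanonicalizer {

    public static <T> List<T> canonicalizeList(List<T> input) {
        int n = input.size();
        // Request enough buckets for n entries at the default 0.75 load factor.
        Map<T, T> map = new HashMap<>((int) (n / 0.75f) + 1);
        List<T> output = new ArrayList<>(n);
        for (T element : input) {
            T canonical = map.get(element);
            if (canonical == null) {
                // First occurrence becomes the canonical instance.
                canonical = element;
                map.put(canonical, canonical);
            }
            output.add(canonical);
        }
        return output;
    }

    public static void main(String[] args) {
        String a = new String("x");
        String b = new String("y");
        String c = new String("x"); // equals(a), but a distinct object
        List<String> out = canonicalizeList(List.of(a, b, c));
        System.out.println(out.get(0) == out.get(2)); // true: both refer to a
        System.out.println(out.get(2) == c);          // false: c was replaced by a
    }
}
```

After the call, nothing in the output refers to c, so it becomes eligible for garbage collection once the caller drops its own reference, and "==" is enough to compare the canonical instances.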


Now you seem to be saying that you are doing it this way already (last paragraph), and you are asking if there is a better way. As far as I know, there isn't one. (The HashSet API would let you test whether a set contains a value equal to element, but it does not let you retrieve that value in O(1).)

For what it is worth, under the hood the HashSet<T> class is implemented on top of a HashMap<T, Object> in which every key maps to the same dummy object. So you would not be saving time or space by using a HashSet directly ...
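If the objects are encountered one at a time rather than as a ready-made list, the same Map<T, T> trick can be hidden behind a small helper so the calling code reads less awkwardly. This is only a sketch: the class name Canonicalizer and the method name canonical are invented here, and it assumes non-null values whose equals() and hashCode() are consistent with each other.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; the names are illustrative, not from the original posts.
public class Canonicalizer<T> {
    // Each key maps to itself, so a lookup returns the stored instance.
    private final Map<T, T> pool = new HashMap<>();

    // Returns the first instance seen that equals() the argument,
    // registering the argument itself if no equal instance exists yet.
    public T canonical(T value) {
        T existing = pool.putIfAbsent(value, value);
        return existing != null ? existing : value;
    }

    public static void main(String[] args) {
        Canonicalizer<String> pool = new Canonicalizer<>();
        String a = new String("big");
        String dup = new String("big"); // equals(a), but a distinct object
        System.out.println(pool.canonical(a) == a);   // true: a is registered
        System.out.println(pool.canonical(dup) == a); // true: dup maps back to a
    }
}
```

Bear in mind that the map holds a strong reference to every canonical instance for as long as the Canonicalizer itself is reachable.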
