List collection deduplication optimization


Recently, during another round of code review, I came across some code that removes duplicates using list.contains...

[screenshot of the reviewed code omitted]

That got me thinking: do many beginners run into this same deduplication problem?

So I decided to sort it out and share it.


First, create a List of simulated data: 20,000 items in total, half of which (10,000 items) are duplicates:

public static List<String> getTestList() {
    List<String> list = new ArrayList<>();
    // 1..10000, ascending
    for (int i = 1; i <= 10000; i++) {
        list.add(String.valueOf(i));
    }
    // 10000..1, descending -- every value now appears exactly twice
    for (int i = 10000; i >= 1; i--) {
        list.add(String.valueOf(i));
    }
    return list;
}

Using contains to remove duplicates

Let's first look at the code that uses contains to remove duplicates:

/**
 * Deduplicate using list.contains
 *
 * @param testList
 */
private static void useContainDistinct(List<String> testList) {
    System.out.println("contains dedup start, count: " + testList.size());
    List<String> testListDistinctResult = new ArrayList<>();
    for (String str : testList) {
        // contains does a linear scan of the result list for every element
        if (!testListDistinctResult.contains(str)) {
            testListDistinctResult.add(str);
        }
    }
    System.out.println("contains dedup done, count: " + testListDistinctResult.size());
}

Let’s call it and see how long it takes:

public static void main(String[] args) {
    List<String> testList = getTestList();
    StopWatch stopWatch = new StopWatch();
    stopWatch.start();
    useContainDistinct(testList);
    stopWatch.stop();
    System.out.println("dedup total time: " + stopWatch.getTotalTimeMillis());
}
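ps: StopWatch here is, I assume, Spring's org.springframework.util.StopWatch (the original imports aren't shown); its start(), stop() and getTotalTimeMillis() methods match the calls above. Plain System.nanoTime() would work just as well.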

Elapsed time:

[screenshot of timing output omitted]

Verdict: list.contains is inefficient. My suggestion: know that it exists, but don't use it for deduplication.
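To put a rough number on it (my own back-of-the-envelope estimate; the exact figures aren't in the original): with 20,000 input elements and a result list that grows to 10,000 unique values, each contains call scans a list of several thousand entries on average, so the loop performs on the order of 10⁸ equality checks.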

Set to remove duplicates

As we all know, a Set holds no duplicate elements, so let's look at how HashSet performs at deduplication:

ps: the deduplication here happens through Set's add method; the HashSet constructor below calls add for every element of the input list.

/**
 * Deduplicate using a Set
 *
 * @param testList
 */
private static void useSetDistinct(List<String> testList) {
    System.out.println("HashSet.add dedup start, count: " + testList.size());
    // the HashSet constructor adds every element, silently dropping duplicates
    List<String> testListDistinctResult = new ArrayList<>(new HashSet<>(testList));
    System.out.println("HashSet.add dedup done, count: " + testListDistinctResult.size());
}
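Note that HashSet does not preserve encounter order. If you want an explicit set.add loop that also keeps first-occurrence order, a LinkedHashSet variant might look like this (my own sketch, not from the original post; assumes the usual java.util imports):

private static List<String> useLinkedHashSetDistinct(List<String> testList) {
    Set<String> seen = new LinkedHashSet<>(); // preserves insertion order
    for (String str : testList) {
        seen.add(str); // add returns false for a duplicate, so it is simply skipped
    }
    return new ArrayList<>(seen);
}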

Let's call it and see how long it takes:

public static void main(String[] args) {
    List<String> testList = getTestList();
    StopWatch stopWatch = new StopWatch();
    stopWatch.start();
    useSetDistinct(testList);
    stopWatch.stop();
    System.out.println("dedup total time: " + stopWatch.getTotalTimeMillis());
}

Elapsed time:

[screenshot of timing output omitted]

Verdict: HashSet is efficient. My suggestion: recommended.

Why is the time difference so big?

Without further ado, let’s look at the source code:

list.contains(o):

[screenshot of ArrayList.contains source omitted]

You can see that it calls indexOf(o) internally:

[screenshot of ArrayList.indexOf source omitted]

Time complexity: O(n), where n is the number of elements.
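For readers who can't see the screenshots, this is roughly what the ArrayList source looks like (paraphrased from OpenJDK 8; details vary by version):

public boolean contains(Object o) {
    return indexOf(o) >= 0;
}

public int indexOf(Object o) {
    if (o == null) {
        for (int i = 0; i < size; i++)
            if (elementData[i] == null)
                return i;
    } else {
        for (int i = 0; i < size; i++)
            if (o.equals(elementData[i])) // linear scan over the backing array: O(n)
                return i;
    }
    return -1;
}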

So let's see what set.add(o) looks like:

[screenshot of HashSet.add source omitted]

HashSet.add delegates to HashMap.put, and I won't rehash that cliché: the key is hashed and goes straight into its bucket. Time complexity: O(1) (amortized).
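Again paraphrasing OpenJDK 8 for readers without the screenshot (PRESENT is just a shared dummy value that HashSet stores against every key):

private transient HashMap<E, Object> map;
private static final Object PRESENT = new Object();

public boolean add(E e) {
    return map.put(e, PRESENT) == null; // hash the key, drop it into a bucket: O(1) amortized
}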

So which one is faster, O(n) or O(1)? The answer is obvious.


ps: while we're at it, a quick word on HashSet's contains.

[screenshot of HashSet.contains source omitted]

Its time complexity is also O(1).

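Paraphrased from the same OpenJDK 8 source (an approximation, not a verbatim copy):

public boolean contains(Object o) {
    return map.containsKey(o); // a single hash lookup: O(1) on average
}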

Finally, let's look at a couple of other deduplication approaches:

Double for loop with remove to deduplicate

/**
 * Deduplicate using a double for loop
 * @param testList
 */
private static void use2ForDistinct(List<String> testList) {
    System.out.println("list double loop dedup start, count: " + testList.size());
    for (int i = 0; i < testList.size(); i++) {
        for (int j = i + 1; j < testList.size(); j++) {
            if (testList.get(i).equals(testList.get(j))) {
                testList.remove(j);
                j--; // elements shift left after remove, so re-check this index
            }
        }
    }
    System.out.println("list double loop dedup done, count: " + testList.size());
}
public static void main(String[] args) {
    List<String> testList = getTestList();
    StopWatch stopWatch = new StopWatch();
    stopWatch.start();
    use2ForDistinct(testList);
    stopWatch.stop();
    System.out.println("dedup total time: " + stopWatch.getTotalTimeMillis());
}

Elapsed time:

[screenshot of timing output omitted]

Verdict: know it for fun, but don't use it. It's far too slow, the code is messy, and it's easy to get wrong: without the j-- above, the shift after remove would skip elements.

Stream distinct() deduplication:

/**
 * Deduplicate using Stream
 *
 * @param testList
 */
private static void useStreamDistinct(List<String> testList) {
    System.out.println("stream dedup start, count: " + testList.size());
    List<String> testListDistinctResult = testList.stream().distinct().collect(Collectors.toList());
    System.out.println("stream dedup done, count: " + testListDistinctResult.size());
}
public static void main(String[] args) {
    List<String> testList = getTestList();
    StopWatch stopWatch = new StopWatch();
    stopWatch.start();
    useStreamDistinct(testList);
    stopWatch.stop();
    System.out.println("dedup total time: " + stopWatch.getTotalTimeMillis());
}

Elapsed time:

[screenshot of timing output omitted]

Verdict: not bad. The main draw is that the code is concise, which makes it rather tempting.
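For intuition, distinct() on a sequential stream behaves roughly as if it filtered through a "seen" set, something like this sketch (my own approximation, not the actual library internals):

Set<String> seen = new HashSet<>();
List<String> result = testList.stream()
        .filter(seen::add) // add returns false once a value has been seen
        .collect(Collectors.toList());

Note that this stateful seen::add trick is only safe on sequential streams; for real code, stick with distinct().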


Origin: blog.csdn.net/weixin_44030143/article/details/131146229