I have a list, say
listA = [679, 890, 907, 780, 5230, 781]
and I want to delete the elements that exist in another list
listB = [907, 5230]
in minimum time complexity.
I can do this with two nested "for loops", i.e. O(n^2) time complexity, but I want to reduce this to O(n*log(n)) or O(n). Is that possible?
It's possible - if one of the lists is sorted. Assuming that list A is sorted and list B is unsorted, with respective sizes M and N, the minimum time complexity to remove all of list B's elements from list A is O((N+M)*log(M)). You can achieve this with binary search: each lookup of an element in list A takes O(log(M)) time, and there are N lookups (one for each element of list B). Since it takes O(M*log(M)) time to sort A, it's more efficient for huge lists to sort first and then remove all the elements, for a total time complexity of O((N+M)*log(M)).
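As a sketch of that idea (names and the mark-then-rebuild structure are my own; it copies A into an array so removal doesn't cost O(M) per element, and the output comes back sorted, which this approach needs anyway):

```java
import java.util.*;

public class BinarySearchRemoval {
    // Remove every element of b from a in O((N+M)*log(M)):
    // sort a once, then binary-search it for each element of b.
    static List<Integer> removeAllSorted(List<Integer> a, List<Integer> b) {
        int[] sorted = a.stream().mapToInt(Integer::intValue).toArray();
        Arrays.sort(sorted);                            // O(M*log(M))
        boolean[] removed = new boolean[sorted.length];
        for (int x : b) {                               // N lookups...
            int i = Arrays.binarySearch(sorted, x);     // ...each O(log(M))
            // Walk left and right from the hit to catch duplicates of x.
            for (int j = i; j >= 0 && sorted[j] == x; j--) removed[j] = true;
            for (int j = i + 1; i >= 0 && j < sorted.length && sorted[j] == x; j++) removed[j] = true;
        }
        List<Integer> result = new ArrayList<>();
        for (int j = 0; j < sorted.length; j++)
            if (!removed[j]) result.add(sorted[j]);
        return result;
    }

    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(679, 890, 907, 780, 5230, 781);
        List<Integer> b = Arrays.asList(907, 5230);
        System.out.println(removeAllSorted(a, b)); // [679, 780, 781, 890]
    }
}
```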
On the other hand, if you don't have a sorted list, just use Collection.removeAll, which has a time complexity of O(M*N) in this case. The reason is that removeAll does (by default) something like the following pseudocode:

public boolean removeAll(Collection<?> other)
    for each elem in this list
        if other contains elem
            remove elem from this list

Since contains has a time complexity of O(N) for lists, and you end up doing M iterations, this takes O(M*N) time in total.
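Concretely, the naive call looks like this (using the numbers from the question):

```java
import java.util.*;

public class NaiveRemoveAll {
    public static void main(String[] args) {
        List<Integer> listA = new ArrayList<>(Arrays.asList(679, 890, 907, 780, 5230, 781));
        List<Integer> listB = Arrays.asList(907, 5230);
        // Each contains() check scans listB in O(N); it runs for all M
        // elements of listA, giving O(M*N) overall.
        listA.removeAll(listB);
        System.out.println(listA); // [679, 890, 780, 781]
    }
}
```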
Finally, if you want to minimize the time complexity of removeAll (with possibly degraded real-world performance) you can do the following:

List<Integer> a = ...
List<Integer> b = ...
HashSet<Integer> lookup = new HashSet<>(b);
a.removeAll(lookup);

For bad values of b, constructing lookup could take up to O(N*log(N)) time, as shown here (see "pathologically distributed keys"). After that, each contains call during removeAll takes O(1), over M iterations, so removeAll executes in O(M) time. Therefore, the time complexity of this approach is O(M + N*log(N)).
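Filling in the placeholders with the question's data, a runnable version of the snippet above looks like this:

```java
import java.util.*;

public class HashSetRemoveAll {
    public static void main(String[] args) {
        List<Integer> a = new ArrayList<>(Arrays.asList(679, 890, 907, 780, 5230, 781));
        List<Integer> b = Arrays.asList(907, 5230);
        // Building the set costs O(N) expected time; each contains()
        // during removeAll is then O(1), so removeAll runs in O(M).
        HashSet<Integer> lookup = new HashSet<>(b);
        a.removeAll(lookup);
        System.out.println(a); // [679, 890, 780, 781]
    }
}
```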
So, there are three approaches here. One gives you time complexity O((N+M)*log(M)), another O(M*N), and the last O(M + N*log(N)). Considering that the first and last are similar in time complexity (as log tends to be very small even for large numbers), I would suggest going with the naive O(M*N) for small inputs, and the simple O(M + N*log(N)) approach for medium-sized inputs. At the point where memory usage starts to suffer from building a HashSet of B's elements (very large inputs), I would finally switch to the more complex O((N+M)*log(M)) approach.
You can find an AbstractCollection.removeAll implementation here.
Edit:
The first approach doesn't work so well for ArrayLists - removing from the middle of list A takes O(M) time. Instead, sort list B (O(N*log(N))), then iterate through list A, removing items as appropriate. This takes O((M+N)*log(N)) time and beats the O(M*N*log(M)) you end up with when using an ArrayList. Unfortunately, the "removing items as appropriate" part requires O(M) extra storage for the non-removed elements, since you don't have access to list A's internal data array. In that case, it's strictly better to go with the HashSet approach, because (1) the O((M+N)*log(N)) time complexity is actually worse than that of the HashSet method, and (2) the new algorithm doesn't save on memory. Therefore, only use the first approach when you have a List with O(1)-time removal (e.g. LinkedList) and a large amount of data. Otherwise, use removeAll - it's simpler, often faster, and supported by library designers (e.g. ArrayList has a custom removeAll implementation that takes linear instead of quadratic time using negligible extra memory).
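A sketch of the "sort B, iterate A" variant described in the edit (names are my own; it builds a new list rather than removing in place, which is where the O(M) extra storage comes from):

```java
import java.util.*;

public class SortBVariant {
    // O((M+N)*log(N)): sort b once, then binary-search it
    // for each element of a, keeping only the misses.
    static List<Integer> removeAll(List<Integer> a, List<Integer> b) {
        int[] sortedB = b.stream().mapToInt(Integer::intValue).toArray();
        Arrays.sort(sortedB);                             // O(N*log(N))
        List<Integer> result = new ArrayList<>(a.size()); // O(M) extra storage
        for (int x : a)                                   // M lookups, each O(log(N))
            if (Arrays.binarySearch(sortedB, x) < 0)      // negative => not in b
                result.add(x);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(removeAll(
                Arrays.asList(679, 890, 907, 780, 5230, 781),
                Arrays.asList(907, 5230))); // [679, 890, 780, 781]
    }
}
```

Unlike the first approach, this one preserves list A's original order.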