Why is the set() constructor slower than list()?

Ch3steR:

I timed the set() and list() constructors, and set() was significantly slower than list(). I benchmarked them using values with no duplicates. I know sets use hash tables; is that why set() is slower?

I'm using Python 3.7.5 [MSC v.1916 64 bit (AMD64)] on Windows 10, as of this writing (8th March).

# No significant difference observed.
timeit set(range(10))
517 ns ± 4.91 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
timeit list(range(10))
404 ns ± 4.71 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

As the size increases, set() becomes much slower than list().

# When size is 100
timeit set(range(100))
2.13 µs ± 12.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
timeit list(range(100))
934 ns ± 10.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

# When size is ten thousand.
timeit set(range(10000))
325 µs ± 2.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
timeit list(range(10000))
240 µs ± 2.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# When size is one million.
timeit set(range(1000000))
86.9 ms ± 1.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
timeit list(range(1000000))
37.7 ms ± 396 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Both of them are O(n) asymptotically. When there are no duplicates, shouldn't set(...) be approximately equal to list(...)?

To my surprise, set comprehensions and list comprehensions didn't show the huge gap that set() and list() did.

# When size is 100. 
timeit {i for i in range(100)}
3.96 µs ± 858 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
timeit [i for i in range(100)]
3.01 µs ± 265 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# When size is ten thousand.
timeit {i for i in range(10000)}
434 µs ± 5.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
timeit [i for i in range(10000)]
395 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# When size is one million.
timeit {i for i in range(1000000)}
95.1 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
timeit [i for i in range(1000000)]
87.3 ms ± 760 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Martijn Pieters:

Why should they be the same? Yes, they are both O(n) but set() needs to hash each element and needs to account for elements not being unique. This translates to a higher fixed cost per element.
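
One way to see that per-element hashing cost is to feed both constructors objects with a deliberately expensive hash (a sketch; the SlowHash class is hypothetical and only serves to exaggerate the cost that hashing adds):

import timeit

class SlowHash:
    """Hypothetical class with an artificially expensive hash."""
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return sum(range(1000)) + self.n   # deliberately slow hash
    def __eq__(self, other):
        return isinstance(other, SlowHash) and self.n == other.n

items = [SlowHash(i) for i in range(1000)]

# list() only copies references and never calls __hash__.
print(timeit.timeit(lambda: list(items), number=100))
# set() hashes every element, so the expensive __hash__ dominates.
print(timeit.timeit(lambda: set(items), number=100))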

Big O says nothing about absolute times, only how the time taken will grow as the size of the input grows. Two O(n) algorithms, given the same inputs, can take vastly different amounts of time to complete. All you can say is that when the size of the input doubles, the amount of time taken will (roughly) double, for both functions.
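
To make that concrete, here is a small sketch (absolute timings will vary by machine) showing that doubling the input roughly doubles the time for both constructors, while the ratio between them stays roughly constant:

import timeit

# Doubling n roughly doubles the time for both constructors;
# the constant factor between set() and list() stays roughly the same.
for n in (100_000, 200_000, 400_000):
    t_list = timeit.timeit(lambda: list(range(n)), number=20)
    t_set = timeit.timeit(lambda: set(range(n)), number=20)
    print(f"n={n:>7}: list {t_list:.4f}s  set {t_set:.4f}s  ratio {t_set / t_list:.2f}")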

If you want to understand Big O better, I highly recommend Ned Batchelder’s introduction to the subject.

When there are no duplicates, shouldn't set(...) be approximately equal to list(...)?

No, they are not equal, because list() doesn't hash. That there happen to be no duplicates doesn't change this; set() still has to hash every element to find that out.
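
You can check that duplicates don't help by timing an input that is all duplicates; set() still has to hash every element (a sketch, exact numbers will differ on your machine):

import timeit

data = [0] * 1_000_000   # every element is a duplicate

# list() copies one reference per element and never hashes anything.
print(timeit.timeit(lambda: list(data), number=10))
# set() still hashes all one million elements, even though the
# resulting set contains a single member.
print(timeit.timeit(lambda: set(data), number=10))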

To my surprise, set comprehensions and list comprehensions didn't show the huge gap that set() and list() did.

The additional loop executed by the Python interpreter adds overhead, and that overhead dominates the time taken. The higher fixed per-element cost of set() is then less prominent.
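
You can see that extra interpreter work in the bytecode: the comprehension runs an explicit FOR_ITER/LIST_APPEND loop once per element, while list(range(...)) is a single call that loops entirely in C (a sketch using the dis module; opcode names are as in CPython 3.7):

import dis

# The comprehension's nested code object contains a bytecode loop
# (FOR_ITER, LIST_APPEND) executed once per element.
dis.dis(compile("[i for i in range(100)]", "<demo>", "eval"))

# list(range(100)) is one function call; the looping happens in C.
dis.dis(compile("list(range(100))", "<demo>", "eval"))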

There are other factors that may play a role:

  • Given a sequence with a known length, list() can pre-allocate enough memory to fit those elements. Sets can't pre-allocate, as they can't know in advance how many duplicates there will be. Pre-allocating avoids the (amortised) cost of having to grow the list dynamically (see the sketch after this list).
  • List and set comprehensions add one element at a time, so list objects can't pre-allocate, increasing the fixed per-item cost slightly.
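
As a rough illustration of the first point, list() can ask the iterable how long it will be via its length hint, while a set has no way of knowing how many of those values will turn out to be unique (a sketch; operator.length_hint exposes the same length-hint protocol the list constructor consults):

from operator import length_hint

# range knows its length up front, so list() can allocate room for all
# elements in one step before copying them in.
print(length_hint(range(1_000_000)))        # -> 1000000

# A generator offers no hint, so even list() has to grow dynamically here,
# just as a set always does.
print(length_hint(i for i in range(10)))    # -> 0 (no hint available)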
