I happened to see an article claiming that a set is faster than a list, which puzzled me, so I ran a few experiments myself.
Experiment 1: comparing traversal performance of the three containers when the data has almost no duplicates
import time
import numpy as np
## prepare the data
mylist = [int(np.random.rand() * 10000000) for i in range(10000000)]
myset = set(mylist)
mydic = {i: i for i in mylist}
## list
st = time.time()
for i in mylist:
    t1 = i
print(time.time() - st)
# set
st = time.time()
for j in myset:
    t2 = j
print(time.time() - st)
# dict
st = time.time()
for m in mydic:
    t3 = mydic[m]
# print(t3)
print(time.time() - st)
Result:
1.7017035484313965
1.8434038162231445
2.733205556869507
We can see that when the data contains almost no duplicate values, dict is actually the slowest of the three to traverse.
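Note that the loops above measure traversal, not lookup. If "query" means a membership test (`x in container`), the hash-based containers beat the list by a huge margin even without duplicates, because `x in mylist` is a linear scan while `x in myset` is a hash probe. A minimal sketch (the sizes here are my own choice, for illustration):

```python
import time

n = 100_000
data = list(range(n))
data_set = set(data)

# Worst case for the list: the element we look for is at the end
targets = [n - 1] * 100

st = time.time()
for x in targets:
    x in data        # linear scan: O(n) per lookup
list_time = time.time() - st

st = time.time()
for x in targets:
    x in data_set    # hash probe: O(1) on average per lookup
set_time = time.time() - st

print(list_time, set_time)  # the set lookups should be dramatically faster
```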
Experiment 2: comparing traversal performance after increasing the proportion of duplicate data
import time
import numpy as np
## prepare the data
## np.random.rand() is multiplied by one fewer zero than in Experiment 1,
## to increase the proportion of duplicate data; the later experiments do the same
mylist = [int(np.random.rand() * 1000000) for i in range(10000000)]
myset = set(mylist)
mydic = {i: i for i in mylist}
## list
st = time.time()
for i in mylist:
    t1 = i
print(time.time() - st)
# set
st = time.time()
for j in myset:
    t2 = j
print(time.time() - st)
# dict
st = time.time()
for m in mydic:
    t3 = mydic[m]
# print(t3)
print(time.time() - st)
Result:
1.7004029750823975
0.2652003765106201
0.5343024730682373
Now the list is the slowest and the set is the fastest, and the gap is substantial. The reason is deduplication: building the set discards repeated values, so with values drawn from a range one tenth of the list's length, the set ends up holding roughly a tenth as many elements — if you don't believe it, compare len(myset) with len(mylist) yourself. (Note that CPython sets do not sort their contents; a set is an unordered hash table, and small integers merely tend to come out in ascending order because an int hashes to itself.) The dict is fast for the same reason: it is hash-based and stores each key only once. So what this experiment really shows is that with many repeated values the list stays full-size while the hash containers shrink, and traversing the full list is correspondingly slower.
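The deduplication effect can be verified directly, without any timing — with heavily duplicated data, the set (and the dict's key set) collapses to the number of distinct values. A tiny deterministic sketch (the sizes are my own, chosen for clarity):

```python
# A list with heavy duplication: 10 distinct values repeated 1000 times each
mylist = [i % 10 for i in range(10_000)]
myset = set(mylist)
mydic = {i: i for i in mylist}

print(len(mylist))  # 10000
print(len(myset))   # 10 -- duplicates are gone
print(len(mydic))   # 10 -- dict keys are deduplicated too
```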
Experiment 3: comparing traversal performance when the proportion of duplicate data is increased further
import time
import numpy as np
## prepare the data
mylist = [int(np.random.rand() * 100000) for i in range(10000000)]
myset = set(mylist)
mydic = {i: i for i in mylist}
## list
st = time.time()
for i in mylist:
    t1 = i
print(time.time() - st)
# set
st = time.time()
for j in myset:
    t2 = j
print(time.time() - st)
# dict
st = time.time()
for m in mydic:
    t3 = mydic[m]
# print(t3)
print(time.time() - st)
Result:
1.7057056427001953
0.031199932098388672
0.04680013656616211
With this data the effect is even more obvious: after deduplication the set is tiny (at most 100000 distinct values out of 10000000), so there is far less to traverse. Note that the dict benefits for the same reason — a dict comprehension stores each key only once, so mydic ends up with the same number of entries as myset, not as mylist. It is not that hashing makes full-size traversal faster; both hash containers are simply much smaller than the list here.
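One way to separate the "fewer elements" effect from any per-element cost is to normalize the loop time by the container's length. A rough sketch (this uses time.perf_counter rather than time.time for a bit more precision, and the data is scaled down from the experiment above — both are my own choices):

```python
import time

# Scaled-down version of the Experiment 3 data: heavy duplication
mylist = [i % 10_000 for i in range(1_000_000)]
myset = set(mylist)
mydic = {i: i for i in mylist}

def per_element(container):
    """Time a bare traversal and divide by the number of elements."""
    st = time.perf_counter()
    for _ in container:
        pass
    total = time.perf_counter() - st
    return total / len(container)

for name, c in (("list", mylist), ("set", myset), ("dict", mydic)):
    print(f"{name}: {len(c)} elements, {per_element(c):.2e} s/element")
```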
As a follow-up, is ordered data actually faster to traverse? A direct comparison of a list of consecutive integers against a list of random ones:
import time
import numpy as np
ls1 = [i for i in range(10000000)]
ls2 = [int(np.random.rand() * 10000000) for i in range(10000000)]
st = time.time()
for i in ls1:
    t1 = i
print(time.time() - st)
st = time.time()
for j in ls2:
    t2 = j
print(time.time() - st)
Result:
1.591202974319458
2.0755040645599365
The ordered list did traverse somewhat faster here, but the margin is modest and the exact cause is hard to pin down from this test alone — differences in how the int objects happen to be laid out in memory are one plausible factor — so take this result with a grain of salt.
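A side note on methodology: one-shot time.time() measurements like the ones above are noisy. The standard library's timeit module repeats the statement and lets you take the best of several runs, which gives more stable numbers. A minimal sketch (the sizes and run counts are my own choices):

```python
import timeit

setup = "data = list(range(100_000)); s = set(data)"

# Best of 5 runs, each run traversing the container 10 times
list_t = min(timeit.repeat("for x in data: pass", setup=setup, number=10, repeat=5))
set_t = min(timeit.repeat("for x in s: pass", setup=setup, number=10, repeat=5))

print(f"list: {list_t:.4f}s  set: {set_t:.4f}s")
```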