Falling in love with Python series: Python performance (12): the pd.eval() speed-up did not hold up in my experiments

pd.eval() is documented as using Numexpr, which is supposed to speed up operations between DataFrames.

In my own experiments, pd.eval() gave no measurable speed-up on the current version of pandas; the advertised gain seems to apply only to older versions.

Both pandas and Numexpr were the latest versions at the time of testing.
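
Since the post does not record the exact version numbers, the following is a minimal sketch of how they could be checked before running the timings. It only reads version strings; the variable names `pandas_version` and `numexpr_installed` are my own and not from the original experiments.

```python
import importlib.util

import pandas as pd

# Record the pandas version used for the measurements.
pandas_version = pd.__version__
print("pandas:", pandas_version)

# Numexpr is an optional dependency, so look for it without failing if it is absent.
numexpr_installed = importlib.util.find_spec("numexpr") is not None
if numexpr_installed:
    import numexpr

    print("numexpr:", numexpr.__version__)
else:
    # Without Numexpr, pd.eval() uses its pure-Python engine by default.
    print("numexpr is not installed")
```

If Numexpr turns out to be missing, the comparisons below would not be testing Numexpr at all, so this check is worth doing first.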

pd.eval() supports many operations, such as the four basic arithmetic operations, comparison operations, and bitwise operations.
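
As a quick illustration of these operation types, here is a small sketch that evaluates an arithmetic, a comparison, and a combined boolean expression with pd.eval() and checks each against the plain pandas result. The frames `a` and `b`, the seed, and the constant 50 are arbitrary choices for the example, not values from the experiments.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = pd.DataFrame(rng.integers(0, 100, size=(1000, 3)))
b = pd.DataFrame(rng.integers(0, 100, size=(1000, 3)))

# Arithmetic: element-wise addition.
sum_eval = pd.eval("a + b")
# Comparison: element-wise less-than, giving a boolean frame.
lt_eval = pd.eval("a < b")
# Bitwise AND of two boolean masks.
mask_eval = pd.eval("(a < b) & (b != 50)")

# Each pd.eval() result should hold the same values as the plain expression.
sum_matches = np.array_equal(sum_eval.to_numpy(), (a + b).to_numpy())
lt_matches = np.array_equal(lt_eval.to_numpy(), (a < b).to_numpy())
mask_matches = np.array_equal(mask_eval.to_numpy(), ((a < b) & (b != 50)).to_numpy())
print(sum_matches, lt_matches, mask_matches)
```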

I start by testing addition:

(1) Addition experiment 1:

import numexpr
import numpy as np
import pandas as pd

df1=pd.DataFrame([int(np.random.rand()*100) for i in range(1<<25)])
df2=pd.DataFrame([int(np.random.rand()*100) for i in range(1<<25)])

# Without pd.eval()
%timeit df1+df2
Result: 10 loops, best of 3: 144 ms per loop


# With pd.eval()
%timeit pd.eval('df1 + df2')
Result: 10 loops, best of 3: 142 ms per loop

The two timings are nearly identical.

(2) Addition experiment 2:

import numexpr
import numpy as np
import pandas as pd

df1=pd.DataFrame([int(np.random.rand()*100) for i in range(1<<20)])
df2=pd.DataFrame([int(np.random.rand()*100) for i in range(1<<20)])

# Without pd.eval()
%timeit df1+df2
Result: 100 loops, best of 3: 5.32 ms per loop


# With pd.eval()
%timeit pd.eval('df1 + df2')
Result: 100 loops, best of 3: 5.72 ms per loop

Nearly the same; pd.eval() is even slightly slower here.
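
The `%timeit` lines above only work inside IPython or Jupyter. For readers who want to repeat the comparison as a plain script, here is a rough sketch using the standard `timeit` module. The helper name `time_addition` and its `repeat`/`number` defaults are my own assumptions, and the numbers it prints will depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd


def time_addition(n_rows, repeat=3, number=5):
    """Return the best per-call time in seconds for df1 + df2, with and without pd.eval()."""
    rng = np.random.default_rng(0)
    df1 = pd.DataFrame(rng.integers(0, 100, size=n_rows))
    df2 = pd.DataFrame(rng.integers(0, 100, size=n_rows))
    # pd.eval() looks names up in the calling frame. Inside a lambda that lookup
    # would not see df1 and df2, so they are passed explicitly via local_dict.
    frames = {"df1": df1, "df2": df2}

    plain = min(timeit.repeat(lambda: df1 + df2, repeat=repeat, number=number)) / number
    evaled = min(
        timeit.repeat(
            lambda: pd.eval("df1 + df2", local_dict=frames),
            repeat=repeat,
            number=number,
        )
    ) / number
    return {"plain": plain, "eval": evaled}


timings = time_addition(1 << 20)
print(f"plain: {timings['plain'] * 1e3:.2f} ms, pd.eval(): {timings['eval'] * 1e3:.2f} ms")
```

Taking the minimum over several repeats mirrors the "best of 3" figure that `%timeit` reported in the results above.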

(3) Addition experiment 3:

import numexpr
import numpy as np
import pandas as pd

df1=pd.DataFrame(np.random.rand(1000000,100))
df2=pd.DataFrame(np.random.rand(1000000,100))

# Without pd.eval()
%timeit df1+df2
Result: 1 loop, best of 3: 406 ms per loop


# With pd.eval()
%timeit pd.eval('df1 + df2')
Result: 1 loop, best of 3: 408 ms per loop

Again, nearly identical.

(4) Addition experiment 4:

import numexpr
import numpy as np
import pandas as pd

df1=pd.DataFrame(np.random.rand(1000000,100))
df2=pd.DataFrame(np.random.rand(1000000,100))
df3=pd.DataFrame(np.random.rand(1000000,100))
df4=pd.DataFrame(np.random.rand(1000000,100))
df5=pd.DataFrame(np.random.rand(1000000,100))

# Without pd.eval()
%timeit df1+df2+df3+df4+df5
Result: 1 loop, best of 3: 1.67 s per loop


# With pd.eval()
%timeit pd.eval('df1+df2+df3+df4+df5')
Result: 1 loop, best of 3: 1.62 s per loop

Nearly identical, even with five operands.

(5) Addition experiment 5:

import numexpr
import numpy as np
import pandas as pd

df1=pd.DataFrame(np.random.rand(100000,100))
df2=pd.DataFrame(np.random.rand(100000,100))
df3=pd.DataFrame(np.random.rand(100000,100))
df4=pd.DataFrame(np.random.rand(100000,100))
df5=pd.DataFrame(np.random.rand(100000,100))

# Without pd.eval()
%timeit df1+df2+df3+df4+df5
Result: 10 loops, best of 3: 172 ms per loop


# With pd.eval()
%timeit pd.eval('df1+df2+df3+df4+df5')
Result: 10 loops, best of 3: 170 ms per loop

Multiplication experiment:

import numexpr
import numpy as np
import pandas as pd

df1=pd.DataFrame(np.random.rand(1000,1000))
df2=pd.DataFrame(np.random.rand(1000,1000))
df3=pd.DataFrame(np.random.rand(1000,1000))
df4=pd.DataFrame(np.random.rand(1000,1000))
df5=pd.DataFrame(np.random.rand(1000,1000))

# Without pd.eval()
%timeit df1*df2*df3*df4*df5
Result: 10 loops, best of 3: 20.1 ms per loop


# With pd.eval()
%timeit pd.eval('df1*df2*df3*df4*df5')
Result: 10 loops, best of 3: 22 ms per loop

Comparison experiment:

import numexpr
import numpy as np
import pandas as pd

df1=pd.DataFrame(np.random.rand(1000,1000))
df2=pd.DataFrame(np.random.rand(1000,1000))
df3=pd.DataFrame(np.random.rand(1000,1000))
df4=pd.DataFrame(np.random.rand(1000,1000))
df5=pd.DataFrame(np.random.rand(1000,1000))

# Without pd.eval()
%timeit (df1 < df2) & (df2 <= df3) & (df3 == df4) & (df4 != df5)
Result: 1 loop, best of 3: 1.94 s per loop


# With pd.eval()
%timeit pd.eval('(df1 < df2) & (df2 <= df3) & (df3 == df4) & (df4 != df5)')
Result: 1 loop, best of 3: 1.92 s per loop

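Perhaps pandas itself is now so well optimized that this kind of manual optimization is no longer needed. One likely reason is that pandas can already use Numexpr internally for large element-wise operations when it is installed, so a plain df1 + df2 may get much of the same benefit.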
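
One way to probe this guess is to switch off pandas' own use of Numexpr and time the plain expression again. The sketch below assumes the `compute.use_numexpr` option behaves as described in the pandas options documentation, and that it only has an effect when Numexpr is installed. The frame size and the helper `best_time` are arbitrary choices of mine. As far as I know, pandas only hands work to Numexpr above an internal size threshold, so the frames are kept fairly large.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.random((20_000, 100)))
df2 = pd.DataFrame(rng.random((20_000, 100)))


def best_time(func, repeat=3, number=5):
    """Best per-call time in seconds over several repeats."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


# Default settings: pandas may dispatch df1 + df2 to Numexpr if it is installed.
default_time = best_time(lambda: df1 + df2)
default_result = df1 + df2

# Temporarily disable pandas' internal use of Numexpr for comparison.
with pd.option_context("compute.use_numexpr", False):
    numpy_only_time = best_time(lambda: df1 + df2)
    numpy_only_result = df1 + df2

# The values should agree either way; only the timing may differ.
results_match = np.allclose(default_result.to_numpy(), numpy_only_result.to_numpy())
print(f"default: {default_time * 1e3:.2f} ms, numexpr off: {numpy_only_time * 1e3:.2f} ms")
print("results match:", results_match)
```

If the two timings differ noticeably, that would suggest plain pandas arithmetic is already benefiting from Numexpr, which would explain why pd.eval() adds nothing on top. If they do not differ, the explanation lies elsewhere.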


Origin blog.csdn.net/zhou_438/article/details/109317449