pd.eval() is documented as using Numexpr, which is supposed to speed up operations between DataFrames.
In my own experiments, however, the current version of pandas shows no speedup at all; the benefit seems to apply only to older versions of pandas.
Both pandas and Numexpr were the latest versions at the time of testing.
pd.eval() supports many operations, such as basic arithmetic, comparison operations, and bitwise operations.
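Before timing anything, it helps to confirm that pd.eval() returns the same results as the plain expressions for each kind of operation. The snippet below is a minimal sketch of that check; the small frames a and b, the seed, and the explicit local_dict argument are my own choices for illustration and are not part of the experiments in this post.

```python
import numpy as np
import pandas as pd

# Small integer frames; names and sizes are illustrative only.
rng = np.random.default_rng(0)
a = pd.DataFrame(rng.integers(0, 100, size=(1000, 4)))
b = pd.DataFrame(rng.integers(0, 100, size=(1000, 4)))

# Pass the variables explicitly so the lookup does not depend on the calling scope.
names = {"a": a, "b": b}

# Arithmetic
same_sum = pd.eval("a + b", local_dict=names).equals(a + b)
# Comparison
same_lt = pd.eval("a < b", local_dict=names).equals(a < b)
# Bitwise AND of two boolean frames
same_and = pd.eval("(a < b) & (b != 50)", local_dict=names).equals((a < b) & (b != 50))

print(same_sum, same_lt, same_and)
```

If all three values are True, any difference between the two styles is purely one of speed, which is what the experiments measure.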
I start by testing addition:
(1) Addition experiment 1:
import numexpr
import numpy as np
import pandas as pd
df1=pd.DataFrame([int(np.random.rand()*100) for i in range(1<<25)])
df2=pd.DataFrame([int(np.random.rand()*100) for i in range(1<<25)])
# Without pd.eval()
%timeit df1+df2
Result: 10 loops, best of 3: 144 ms per loop
# With pd.eval()
%timeit pd.eval('df1 + df2')
Result: 10 loops, best of 3: 142 ms per loop
The two timings are almost identical.
(2) Addition experiment 2:
import numexpr
import numpy as np
import pandas as pd
df1=pd.DataFrame([int(np.random.rand()*100) for i in range(1<<20)])
df2=pd.DataFrame([int(np.random.rand()*100) for i in range(1<<20)])
# Without pd.eval()
%timeit df1+df2
Result: 100 loops, best of 3: 5.32 ms per loop
# With pd.eval()
%timeit pd.eval('df1 + df2')
Result: 100 loops, best of 3: 5.72 ms per loop
Almost the same; pd.eval() is even slightly slower here.
(3) Addition experiment 3:
import numexpr
import numpy as np
import pandas as pd
df1=pd.DataFrame(np.random.rand(1000000,100))
df2=pd.DataFrame(np.random.rand(1000000,100))
# Without pd.eval()
%timeit df1+df2
Result: 1 loop, best of 3: 406 ms per loop
# With pd.eval()
%timeit pd.eval('df1 + df2')
Result: 1 loop, best of 3: 408 ms per loop
The two timings are almost identical.
(4) Addition experiment 4:
import numexpr
import numpy as np
import pandas as pd
df1=pd.DataFrame(np.random.rand(1000000,100))
df2=pd.DataFrame(np.random.rand(1000000,100))
df3=pd.DataFrame(np.random.rand(1000000,100))
df4=pd.DataFrame(np.random.rand(1000000,100))
df5=pd.DataFrame(np.random.rand(1000000,100))
# Without pd.eval()
%timeit df1+df2+df3+df4+df5
Result: 1 loop, best of 3: 1.67 s per loop
# With pd.eval()
%timeit pd.eval('df1+df2+df3+df4+df5')
Result: 1 loop, best of 3: 1.62 s per loop
The two timings are almost identical.
(5) Addition experiment 5:
import numexpr
import numpy as np
import pandas as pd
df1=pd.DataFrame(np.random.rand(100000,100))
df2=pd.DataFrame(np.random.rand(100000,100))
df3=pd.DataFrame(np.random.rand(100000,100))
df4=pd.DataFrame(np.random.rand(100000,100))
df5=pd.DataFrame(np.random.rand(100000,100))
# Without pd.eval()
%timeit df1+df2+df3+df4+df5
Result: 10 loops, best of 3: 172 ms per loop
# With pd.eval()
%timeit pd.eval('df1+df2+df3+df4+df5')
Result: 10 loops, best of 3: 170 ms per loop
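The %timeit magic only works inside IPython or Jupyter. To repeat the chained-addition comparison in a plain Python script, the standard timeit module can be used instead. This is only a sketch: the helper best_time, the smaller frame size, and the explicit local_dict are assumptions of mine, and the timings it prints will differ from the numbers in this post.

```python
import timeit

import numpy as np
import pandas as pd


def best_time(func, number=3, repeat=3):
    """Best average time per call in seconds, similar in spirit to %timeit."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


# Smaller than the frames used in the experiments; scale up as needed.
rows, cols = 100_000, 10
frames = {f"df{i}": pd.DataFrame(np.random.rand(rows, cols)) for i in range(1, 6)}
df1, df2, df3, df4, df5 = (frames[f"df{i}"] for i in range(1, 6))
expr = "df1 + df2 + df3 + df4 + df5"

# Time the plain expression and the pd.eval() version on the same data.
plain = best_time(lambda: df1 + df2 + df3 + df4 + df5)
with_eval = best_time(lambda: pd.eval(expr, local_dict=frames))

print(f"plain:   {plain * 1e3:.2f} ms per loop")
print(f"pd.eval: {with_eval * 1e3:.2f} ms per loop")
```

Passing callables to timeit.repeat keeps the DataFrame construction out of the measured code, so only the addition itself is timed.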
Multiplication experiment:
import numexpr
import numpy as np
import pandas as pd
df1=pd.DataFrame(np.random.rand(1000,1000))
df2=pd.DataFrame(np.random.rand(1000,1000))
df3=pd.DataFrame(np.random.rand(1000,1000))
df4=pd.DataFrame(np.random.rand(1000,1000))
df5=pd.DataFrame(np.random.rand(1000,1000))
# Without pd.eval()
%timeit df1*df2*df3*df4*df5
Result: 10 loops, best of 3: 20.1 ms per loop
# With pd.eval()
%timeit pd.eval('df1*df2*df3*df4*df5')
Result: 10 loops, best of 3: 22 ms per loop
Comparison experiment:
import numexpr
import numpy as np
import pandas as pd
df1=pd.DataFrame(np.random.rand(1000,1000))
df2=pd.DataFrame(np.random.rand(1000,1000))
df3=pd.DataFrame(np.random.rand(1000,1000))
df4=pd.DataFrame(np.random.rand(1000,1000))
df5=pd.DataFrame(np.random.rand(1000,1000))
# Without pd.eval()
%timeit (df1 < df2) & (df2 <= df3) & (df3 == df4) & (df4 != df5)
Result: 1 loop, best of 3: 1.94 s per loop
# With pd.eval()
%timeit pd.eval('(df1 < df2) & (df2 <= df3) & (df3 == df4) & (df4 != df5)')
Result: 1 loop, best of 3: 1.92 s per loop
Perhaps pandas itself is now optimized well enough that this kind of optimization is no longer needed.
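One possible explanation, which I offer as an assumption rather than something the timings above prove: pandas has a compute.use_numexpr option, documented as using Numexpr to accelerate computation when the library is installed, so ordinary expressions may already benefit from it. The sketch below prints that option and compares pd.eval() with engine='python' against the default engine; the frame size and variable names are illustrative only.

```python
import numpy as np
import pandas as pd

try:
    import numexpr  # noqa: F401  (only checking whether it is available)
    have_numexpr = True
except ImportError:
    have_numexpr = False

print("numexpr installed:  ", have_numexpr)
# Whether pandas may use Numexpr for ordinary operations when it is installed.
print("compute.use_numexpr:", pd.get_option("compute.use_numexpr"))

df1 = pd.DataFrame(np.random.rand(10_000, 10))
df2 = pd.DataFrame(np.random.rand(10_000, 10))
names = {"df1": df1, "df2": df2}

# engine="python" evaluates the expression as ordinary Python code.
by_python = pd.eval("df1 + df2", engine="python", local_dict=names)
# With no engine argument, pandas chooses the engine itself.
by_default = pd.eval("df1 + df2", local_dict=names)

engines_agree = np.allclose(by_python, by_default)
print("engines agree:      ", engines_agree)
```

Timing the plain expression with pd.set_option('compute.use_numexpr', False) and then with True would be one way to test this guess on a given machine.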