Pandas performance optimization study notes

Summary

This article describes common techniques for speeding up data processing with Pandas.

Lab environment

import numpy as np
import pandas as pd
print(np.__version__)
print(pd.__version__)
1.16.5
0.25.2

Performance analysis tools

For the performance analysis tools used in this article, see: Python performance evaluation study notes

Data preparation

tsdf = pd.DataFrame(np.random.randint(1, 1000, (1000, 3)), columns=['A', 'B', 'C'],
                    index=pd.date_range('1/1/1900', periods=1000))
tsdf['D'] = np.random.randint(1, 3, (1000, ))
tsdf.head(3)
              A    B    C  D
1900-01-01  820  827  884  1
1900-01-02  943  196  513  1
1900-01-03  693  194    6  2

Accelerating computation with NumPy arrays

For the differences between map, applymap, and apply, see: Difference Between Map, Applymap and Apply Methods in Pandas

apply(func, raw=True)

Finally, apply() takes an argument raw which is False by default, which converts each row or column into a Series before applying the function. When set to True, the passed function will instead receive an ndarray object, which has positive performance implications if you do not need the indexing functionality.
Pandas official documentation

DataFrame.apply() supports a raw parameter. When raw=True, each row or column is passed to the function directly as an ndarray, so the function avoids the overhead of building a Series and can use NumPy's vectorized computation.
How fast?

%%timeit
tsdf.apply(np.mean)  # raw=False (default)
# 740 µs ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
tsdf.apply(np.mean, raw=True)
# 115 µs ± 2.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The run time drops from 740 microseconds to 115 microseconds.
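The %%timeit cells above only run inside IPython or Jupyter. As a rough sketch, the same comparison can be reproduced in a plain script with the standard timeit module. The seeded frame and the column_mean helper are assumptions made for this illustration, and the timings will differ by machine and Pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Illustrative data with a fixed seed (an assumption, not the article's frame).
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randint(1, 1000, (1000, 3)), columns=['A', 'B', 'C'])


def column_mean(col):
    # Works for both a Series (raw=False) and an ndarray (raw=True).
    return col.mean()


# Both calls should produce the same column means.
default_result = df.apply(column_mean)
raw_result = df.apply(column_mean, raw=True)

# Total time in seconds for 200 calls of each variant.
t_default = timeit.timeit(lambda: df.apply(column_mean), number=200)
t_raw = timeit.timeit(lambda: df.apply(column_mean, raw=True), number=200)

print('raw=False: %.1f us per call' % (t_default / 200 * 1e6))
print('raw=True:  %.1f us per call' % (t_raw / 200 * 1e6))
```

Checking that both variants return the same values before comparing timings guards against measuring two different computations.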
Under what conditions can it be used?

  1. Only DataFrame.apply() supports raw; Series.apply() and Series.map() do not;
  2. func must not rely on the Series index.
tsdf.apply(np.argmax)  # raw=False, the index is preserved
A   2019-12-08
B   2021-03-14
C   2020-04-09
D   2019-11-30
dtype: datetime64[ns]
tsdf.apply(np.argmax, raw=True)  # the index is lost
A      8
B    470
C    131
D      0
dtype: int64
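The following minimal sketch illustrates the second condition. It uses a small hypothetical frame (not the article's tsdf) to show what kind of object the function receives in each mode, and one possible way to map the integer positions returned under raw=True back to index labels.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are chosen only for illustration.
df = pd.DataFrame(
    {'A': [3, 9, 1], 'B': [7, 2, 8]},
    index=pd.date_range('2020-01-01', periods=3),
)


def is_ndarray(col):
    # Reports what kind of object apply() hands to the function.
    return isinstance(col, np.ndarray)


got_ndarray_default = df.apply(is_ndarray)        # func receives a Series
got_ndarray_raw = df.apply(is_ndarray, raw=True)  # func receives an ndarray

# With raw=True, np.argmax sees no index, so it returns integer positions.
positions = df.apply(np.argmax, raw=True)

# Map the positions back to index labels when the labels are needed.
labels = pd.Series(df.index[positions.to_numpy()], index=positions.index)
print(labels)
```

If the labels are all that is needed, df.idxmax() returns them directly, so the manual mapping is only worthwhile when the raw=True speedup matters.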

.values

When a calculation involves several Series, you can convert each Series to an ndarray with .values before computing.

%%timeit
tsdf.A * tsdf.B
# 123 µs ± 2.86 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
tsdf.A.values * tsdf.B.values
# 11.1 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The run time drops from 123 microseconds to 11 microseconds.
Supplement
Note that Pandas 0.24.0 introduced .array and .to_numpy(). However, both are slower than .values in this test, so for numeric data it is still recommended to use .values.

%%timeit
tsdf.A.array * tsdf.B.array
# 37.9 µs ± 938 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
tsdf.A.to_numpy() * tsdf.B.to_numpy()
# 15.6 µs ± 110 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Both methods are slower than the 11 microseconds measured with .values.
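One caveat when dropping to ndarrays this way: arithmetic between Series aligns on the index, while arithmetic between ndarrays is purely positional. The sketch below uses two small hypothetical Series whose indexes are ordered differently to show the difference. In the article's example both columns come from the same DataFrame, so the two approaches agree.

```python
import pandas as pd

# Hypothetical Series with the same labels in a different order.
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['z', 'y', 'x'])

# Series arithmetic matches values by label.
aligned = a * b

# ndarray arithmetic matches values by position and ignores the labels.
positional = a.values * b.values

# If the positional result is the intended one, reattach an index explicitly.
rewrapped = pd.Series(positional, index=a.index)

print(aligned)
print(rewrapped)
```

The speedup from .values is therefore safe when the operands are known to share the same index in the same order, as columns of one DataFrame do.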

String operation optimization

Data preparation

tsdf['S'] = tsdf.D.map({1: '123_abc', 2: 'abc_123'})
%%timeit
tsdf.S.str.split('_', expand=True)[0]  # get the substring before '_'
# 1.44 ms ± 97.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

One optimization idea: in a specific case like this one, you can use partition instead of split:

%%timeit
tsdf.S.str.partition('_', expand=True)[0]
# 1.39 ms ± 44.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The speed improves slightly. Next, try apply:

%%timeit
tsdf.S.apply(lambda a: a.partition('_')[0])
# 372 µs ± 8.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

As the timing shows, apply is faster than Pandas' own string methods. This may be because Pandas supports many data types and therefore performs some redundant type checks.
Note that the original data contains only two distinct values. In theory, the result needs to be computed only once for each distinct value, and all other rows can simply be mapped to it. Therefore, consider switching to the Categorical type:

tsdf['S_category'] = tsdf.S.astype('category')
%%timeit
tsdf.S_category.apply(lambda a: a.partition('_')[0])
# 246 µs ± 3.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The processing time drops to 246 microseconds.
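The "compute once per distinct value, then map" idea can also be written out explicitly. The sketch below shows two possible variants on a hypothetical Series built like tsdf.S: a dictionary lookup via Series.map, and an index into the per-category results using the category codes. It assumes the data has no missing values (a missing value has code -1 and would be mapped incorrectly by the second variant). No timings are claimed here.

```python
import numpy as np
import pandas as pd

# Hypothetical data with only two distinct strings, similar to tsdf.S.
s = pd.Series(['123_abc', 'abc_123'] * 500)

# Variant 1: compute the prefix once per distinct value, then map every row.
prefix_by_value = {v: v.partition('_')[0] for v in s.unique()}
mapped = s.map(prefix_by_value)

# Variant 2: use the Categorical codes to index the per-category results.
cat = s.astype('category')
per_category = np.array(
    [c.partition('_')[0] for c in cat.cat.categories], dtype=object
)
codes = cat.cat.codes.to_numpy()  # integer position of each row's category
from_codes = pd.Series(per_category[codes], index=s.index)

# Reference result computed row by row, for comparison.
reference = s.apply(lambda a: a.partition('_')[0])
```

Both variants call partition only twice for this data, regardless of how many rows the Series has.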

IO optimization

Origin www.cnblogs.com/tac-kit/p/12114635.html