Pandas can process data very efficiently, even on fairly large datasets: with the right techniques, the same work can take a fraction of the time.
Pandas is built on NumPy's array structures, and many operations are carried out in C, either through NumPy or through Pandas' own library of extension modules, written in Cython and compiled to C. In principle, then, processing should be fast.
So why can two people process the same data on the same hardware and see vastly different running times?
To be clear, this is not a guide on how to over-optimize Pandas code. Pandas is already built to run fast when used properly, and there is a real difference between optimizing and writing clean code.
This is a guide to using Pandas in a Pythonic way, taking advantage of its powerful and easy-to-use built-in features.
Data preparation
The goal of this example is to apply a time-of-use energy tariff to calculate the total cost of a year's energy consumption. Because the price of electricity differs by time of day, the task is to multiply each hour's consumption by the price in effect during that hour.
The data comes from a CSV file with two columns: one holding the date and time, and one holding the electricity consumed in that hour, in kilowatt-hours (kWh).
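If the original demand_profile.csv is not available, a stand-in with the same two columns can be generated. The value range and the zero-padded timestamp format here are assumptions chosen to mirror the samples shown below:

```python
import numpy as np
import pandas as pd

# One year of hourly timestamps plus random consumption values
rng = pd.date_range('2013-01-01', periods=8760, freq='h')
demo = pd.DataFrame({
    'date_time': rng.strftime('%d/%m/%y %H:%M'),
    'energy_kwh': np.round(np.random.uniform(0.3, 0.8, size=len(rng)), 3),
})
demo.to_csv('demand_profile.csv', index=False)
```

Note that this writes zero-padded dates (01/01/13), whereas the original file uses unpadded ones (1/1/13); pd.to_datetime with format='%d/%m/%y %H:%M' parses both.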
Datetime data optimization
import pandas as pd
df = pd.read_csv('数据科学必备Pandas实操数据处理加速技巧汇总/demand_profile.csv')
df.head()
date_time energy_kwh
0 1/1/13 0:00 0.586
1 1/1/13 1:00 0.580
2 1/1/13 2:00 0.572
3 1/1/13 3:00 0.596
4 1/1/13 4:00 0.592
At first glance this looks good, but there is one small problem. Pandas and NumPy have a concept of dtypes (data types). If no arguments are specified, date_time takes the object dtype.
df.dtypes
date_time object
energy_kwh float64
dtype: object
type(df.iat[0, 0])
str
object is not just a container for str; it is the catch-all dtype for any column that doesn't fit neatly into a single data type. Handling dates as strings would be laborious and inefficient (and memory-inefficient as well). To process time series data, the date_time column needs to be converted to an array of datetime objects ( Timestamp ).
df['date_time'] = pd.to_datetime(df['date_time'])
df['date_time'].dtype
datetime64[ns]
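The memory saving from this conversion can be inspected directly with memory_usage(deep=True); a small sketch on synthetic data (the timestamps here are made up for illustration):

```python
import pandas as pd

# One year of hourly timestamps, once as strings and once as datetime64
s = pd.Series(pd.date_range('2013-01-01', periods=8760, freq='h'))
as_str = s.dt.strftime('%d/%m/%y %H:%M')                 # object dtype: one Python str per row
as_ts = pd.to_datetime(as_str, format='%d/%m/%y %H:%M')  # datetime64[ns]: 8 bytes per value

print(as_str.memory_usage(deep=True))  # much larger: counts every string object
print(as_ts.memory_usage(deep=True))   # compact fixed-width storage
```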
There is now a DataFrame called df with two columns and a numeric index to refer to the row.
df.head()
date_time energy_kwh
0 2013-01-01 00:00:00 0.586
1 2013-01-01 01:00:00 0.580
2 2013-01-01 02:00:00 0.572
3 2013-01-01 03:00:00 0.596
4 2013-01-01 04:00:00 0.592
Use Jupyter's built-in %%time cell magic to time the conversion.
def convert(df, column_name):
    return pd.to_datetime(df[column_name])

%%time
df['date_time'] = convert(df, 'date_time')
Wall time: 663 ms

def convert_with_format(df, column_name):
    return pd.to_datetime(df[column_name], format='%d/%m/%y %H:%M')

%%time
df['date_time'] = convert_with_format(df, 'date_time')
Wall time: 1.99 ms
Supplying the format string makes the conversion over 300 times faster. With large-scale data, that difference in processing time only grows.
Simple looping over the data
Now that the date and time formats are processed, it's time to start calculating your electricity bill. Costs vary by hour, so the cost factor needs to be conditionally applied to each hour of the day.
In this example, the time-of-use cost is defined in three tiers.
data_type = {
    "Peak": {"Cents per kWh": 28, "Time Range": "17:00 to 24:00"},
    "Shoulder": {"Cents per kWh": 20, "Time Range": "7:00 to 17:00"},
    "Off-Peak": {"Cents per kWh": 12, "Time Range": "0:00 to 7:00"},
}
If the price were a flat 28 cents per kWh for every hour of the day, the cost would be a single vectorized multiplication.
df['cost_cents'] = df['energy_kwh'] * 28
date_time energy_kwh cost_cents
0 2013-01-01 00:00:00 0.586 16.408
1 2013-01-01 01:00:00 0.580 16.240
2 2013-01-01 02:00:00 0.572 16.016
3 2013-01-01 03:00:00 0.596 16.688
4 2013-01-01 04:00:00 0.592 16.576
...
But the cost depends on the time of day. This is where you'll see many people use Pandas in ways it wasn't intended: writing a loop to do the conditional computation.
def apply_tariff(kwh, hour):
    """Calculate the cost of electricity for a given hour."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f'Invalid hour: {hour}')
    return rate * kwh
def apply_tariff_loop(df):
    energy_cost_list = []
    for i in range(len(df)):
        # Loop over the data and write the result back to df
        energy_used = df.iloc[i]['energy_kwh']
        hour = df.iloc[i]['date_time'].hour
        energy_cost = apply_tariff(energy_used, hour)
        energy_cost_list.append(energy_cost)
    df['cost_cents'] = energy_cost_list
%%time
apply_tariff_loop(df)
Wall time: 2.59 s
Looping with the .itertuples() and .iterrows() methods
Pandas actually makes the for i in range(len(df)) pattern unnecessary by providing the DataFrame.itertuples() and DataFrame.iterrows() methods, generators that yield one row at a time.
- .itertuples() yields a named tuple for each row, with the row's index value as the first element of the tuple. Named tuples come from Python's collections module and behave like ordinary tuples, but their fields are also accessible by attribute lookup.
- .iterrows() yields an (index, Series) pair for each row in the DataFrame.
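Only the .iterrows() version is timed below; for completeness, an .itertuples() variant might look like the following (apply_tariff is repeated so the snippet runs standalone). It is typically faster still, since no per-row Series has to be built:

```python
import pandas as pd

def apply_tariff(kwh, hour):
    # Same tariff logic as defined earlier in the article
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f'Invalid hour: {hour}')
    return rate * kwh

def apply_tariff_itertuples(df):
    energy_cost_list = []
    for row in df.itertuples():
        # Each row is a named tuple; columns are accessed as attributes
        energy_cost_list.append(apply_tariff(row.energy_kwh, row.date_time.hour))
    df['cost_cents'] = energy_cost_list
```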
def apply_tariff_iterrows(df):
    energy_cost_list = []
    for index, row in df.iterrows():
        energy_used = row['energy_kwh']
        hour = row['date_time'].hour
        energy_cost = apply_tariff(energy_used, hour)
        energy_cost_list.append(energy_cost)
    df['cost_cents'] = energy_cost_list
%%time
apply_tariff_iterrows(df)
Wall time: 808 ms
That alone is roughly 3 times faster than the manual loop.
.apply() method
This operation can be further improved using the .apply() method. Pandas' .apply() method takes functions (callables) and applies them along the DataFrame's axes (all rows or all columns).
The lambda function passes two columns of data to apply_tariff() .
def apply_tariff_withapply(df):
    df['cost_cents'] = df.apply(
        lambda row: apply_tariff(
            kwh=row['energy_kwh'],
            hour=row['date_time'].hour),
        axis=1)
%%time
apply_tariff_withapply(df)
Wall time: 181 ms
The syntactic advantage of .apply() is obvious: the code is concise, readable, and clear. In this case it takes about a quarter of the time of the .iterrows() method.
.isin() data selection
But how can the conditional computation be applied as a vectorized operation in Pandas? One trick is to select and group parts of the DataFrame by condition, then apply a vectorized operation to each selected group.
Use Pandas' .isin() method to select the rows, then apply the tariff in a vectorized operation. Before doing that, it is convenient to set the date_time column as the DataFrame's index.
df.set_index('date_time', inplace=True)
def apply_tariff_isin(df):
    peak_hours = df.index.hour.isin(range(17, 24))
    shoulder_hours = df.index.hour.isin(range(7, 17))
    off_peak_hours = df.index.hour.isin(range(0, 7))
    df.loc[peak_hours, 'cost_cents'] = df.loc[peak_hours, 'energy_kwh'] * 28
    df.loc[shoulder_hours, 'cost_cents'] = df.loc[shoulder_hours, 'energy_kwh'] * 20
    df.loc[off_peak_hours, 'cost_cents'] = df.loc[off_peak_hours, 'energy_kwh'] * 12
%%time
apply_tariff_isin(df)
Wall time: 53.5 ms
Each .isin() call returns a boolean array marking which rows fall in the given hour range.
[False, False, False, ..., True, True, True]
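On a small index the mask is easy to inspect; a minimal illustration with three made-up timestamps:

```python
import pandas as pd

# Three sample timestamps: one off-peak, one shoulder, one peak hour
idx = pd.DatetimeIndex(['2013-01-01 06:00', '2013-01-01 12:00', '2013-01-01 18:00'])
peak_hours = idx.hour.isin(range(17, 24))
print(peak_hours)  # only the 18:00 entry is True
```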
.cut() data binning
Setting up the hour boundaries and rates in a single pd.cut() call makes the computation even more compact, though the code can be harder for a novice to read.
def apply_tariff_cut(df):
    cents_per_kwh = pd.cut(x=df.index.hour,
                           bins=[0, 7, 17, 24],
                           include_lowest=True,
                           labels=[12, 20, 28]).astype(int)
    df['cost_cents'] = cents_per_kwh * df['energy_kwh']
%%time
apply_tariff_cut(df)
Wall time: 2.99 ms
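One subtlety worth knowing: pd.cut's bins are right-inclusive by default, so with bins=[0, 7, 17, 24] the boundary hours 7 and 17 land in the lower-rate bin, a slight difference from the apply_tariff logic above. A quick check on a few boundary values:

```python
import pandas as pd

# Hours straddling the tariff boundaries
hours = [0, 6, 7, 16, 17, 23]
rates = pd.cut(x=hours, bins=[0, 7, 17, 24],
               include_lowest=True, labels=[12, 20, 28]).astype(int).tolist()
print(rates)  # [12, 12, 12, 20, 20, 28] -- hours 7 and 17 get the lower rate
```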
NumPy processing
Pandas Series and DataFrames are designed on top of the NumPy library. This provides greater computational flexibility as Pandas works seamlessly with NumPy arrays and operations.
Use NumPy's digitize() function. It is similar to Pandas' cut() in that it bins the data, but here the result is an array of indices indicating which bin each hour belongs to. Those indices are then used to index into the price array.
import numpy as np
def apply_tariff_digitize(df):
    prices = np.array([12, 20, 28])
    bins = np.digitize(df.index.hour.values, bins=[7, 17, 24])
    df['cost_cents'] = prices[bins] * df['energy_kwh'].values
%%time
apply_tariff_digitize(df)
Wall time: 1.99 ms
Processing efficiency comparison
Comparing the efficiency of the methods above:
| Function | Running time |
|---|---|
| apply_tariff_loop() | 2.59 s |
| apply_tariff_iterrows() | 808 ms |
| apply_tariff_withapply() | 181 ms |
| apply_tariff_isin() | 53.5 ms |
| apply_tariff_cut() | 2.99 ms |
| apply_tariff_digitize() | 1.99 ms |
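Outside Jupyter, where %%time is unavailable, the same measurements can be reproduced with the standard timeit module. The DataFrame here is a synthetic stand-in, and only the fastest function is shown; the others can be dropped in the same way:

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the article's data: a year of hourly readings
idx = pd.date_range('2013-01-01', periods=8760, freq='h')
df = pd.DataFrame({'energy_kwh': np.random.uniform(0.3, 0.8, len(idx))}, index=idx)

def apply_tariff_digitize(df):
    prices = np.array([12, 20, 28])
    bins = np.digitize(df.index.hour.values, bins=[7, 17, 24])
    df['cost_cents'] = prices[bins] * df['energy_kwh'].values

# Best of 10 single runs, reported in milliseconds
best = min(timeit.repeat(lambda: apply_tariff_digitize(df), number=1, repeat=10))
print(f'{best * 1e3:.2f} ms')
```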
HDFStore prevents reprocessing
Often when building complex data models, it is convenient to preprocess the data. If you have 10 years of electricity usage data at minute resolution, simply converting the date and time columns to datetime could take 20 minutes even with the format parameter specified. You only want to do this once, not every time you run a model or analysis.
One very useful thing to do here is to preprocess once and then store the data in its processed form, ready whenever it is needed. But how can the data be stored in the correct format without having to reprocess it? Saving to CSV would lose the datetime objects, forcing a reparse on the next load.
Pandas has a built-in solution using HDF5, a high-performance storage format designed for storing tabular arrays of data. Pandas' HDFStore class lets a DataFrame be stored in an HDF5 file so that it can be accessed efficiently while preserving column types and other metadata. HDFStore is a dictionary-like class, so it can be read and written much like a Python dict.
Store the preprocessed power-consumption DataFrame df in an HDF5 file:
data_store = pd.HDFStore('processed_data.h5')
# Put the DataFrame in the store under the key preprocessed_df
data_store['preprocessed_df'] = df
data_store.close()
To read the data back from the HDF5 file, with the data types preserved:
data_store = pd.HDFStore('processed_data.h5')
preprocessed_df = data_store['preprocessed_df']
data_store.close()
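The same round trip can also be written with the DataFrame.to_hdf and pd.read_hdf convenience wrappers, which open and close the store automatically (both require the optional PyTables package). The small DataFrame here is a stand-in for illustration:

```python
import pandas as pd

# A small stand-in DataFrame with a datetime index
df = pd.DataFrame({'energy_kwh': [0.586, 0.580]},
                  index=pd.to_datetime(['2013-01-01 00:00', '2013-01-01 01:00']))

df.to_hdf('processed_data.h5', key='preprocessed_df')
preprocessed_df = pd.read_hdf('processed_data.h5', 'preprocessed_df')
```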