Data Science Essentials: A Summary of Practical Pandas Data Processing Acceleration Tips

Pandas is quite efficient at processing data. Even on large datasets, mastering the right techniques can save a great deal of processing time.

Pandas is built on NumPy array structures, and many operations are performed in C, either through NumPy or through Pandas' own library of Python extension modules written in Cython and compiled to C. In theory, the processing speed should be very fast.

So why, given the same data and the same hardware, can two people's processing times differ so dramatically?

To be clear, this is not a guide to over-optimizing Pandas code. Pandas is built to run fast when used properly, and there is a big difference between writing clean, efficient code and over-optimizing.

Here's a guide to using Pandas the Pythonic way to take advantage of its powerful and easy-to-use built-in features.

Data preparation

The goal of this example is to apply a time-of-use energy tariff to calculate the total cost of energy consumption for a year. Because the price of electricity varies at different times of day, the task is to multiply the electricity consumed in each hour by the correct price for that hour.

Read the data from a CSV file with two columns: one for the date and time, and one for the electricity consumed in kilowatt-hours (kWh).

Datetime data optimization


import pandas as pd
df = pd.read_csv('数据科学必备Pandas实操数据处理加速技巧汇总/demand_profile.csv')
df.head()
     date_time  energy_kwh
0  1/1/13 0:00       0.586
1  1/1/13 1:00       0.580
2  1/1/13 2:00       0.572
3  1/1/13 3:00       0.596
4  1/1/13 4:00       0.592

At first glance this looks good, but there is one small problem. Pandas and NumPy have a concept of dtypes (data types). If no arguments are specified, date_time takes the object dtype.

df.dtypes
date_time      object
energy_kwh    float64
dtype: object

type(df.iat[0, 0])
str

object is not just a container for str; it is a catch-all for any column that doesn't fit neatly into one data type. Handling dates as strings would be laborious and inefficient (and memory-inefficient as well). To process time series data, the date_time column needs to be converted to an array of datetime objects ( Timestamp ).

df['date_time'] = pd.to_datetime(df['date_time'])
df['date_time'].dtype
datetime64[ns]

There is now a DataFrame called df with two columns and a numeric index to refer to the rows.

df.head()
               date_time    energy_kwh
0    2013-01-01 00:00:00         0.586
1    2013-01-01 01:00:00         0.580
2    2013-01-01 02:00:00         0.572
3    2013-01-01 03:00:00         0.596
4    2013-01-01 04:00:00         0.592

Use Jupyter's built-in %%time cell magic for timing.

def convert(df, column_name):
	return pd.to_datetime(df[column_name])

%%time
df['date_time'] = convert(df, 'date_time')

Wall time: 663 ms

def convert_with_format(df, column_name):
	return pd.to_datetime(df[column_name],format='%d/%m/%y %H:%M')

%%time
df['date_time'] = convert_with_format(df, 'date_time')

Wall time: 1.99 ms

Specifying the format makes parsing nearly 350 times faster. With large-scale data, that difference in processing time is magnified enormously.
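As a quick self-contained illustration (the sample strings below are made up, not from the dataset): without the format argument, to_datetime must infer the layout of every string; with an explicit format it parses directly, which is where the speedup comes from.

```python
import pandas as pd

# Day-first strings in the same layout as the CSV's date_time column
raw = pd.Series(['1/1/13 0:00', '2/1/13 13:30'])

# An explicit format skips per-string format inference entirely
parsed = pd.to_datetime(raw, format='%d/%m/%y %H:%M')
print(parsed[0])  # 2013-01-01 00:00:00
print(parsed[1])  # 2013-01-02 13:30:00
```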

Simple loop of data

Now that the date and time formats are processed, it's time to start calculating your electricity bill. Costs vary by hour, so the cost factor needs to be conditionally applied to each hour of the day.

In this example, the usage time cost will be defined in three parts.

data_type = {
    # Peak
    "Peak": {"Cents per kWh": 28, "Time Range": "17:00 to 24:00"},
    # Shoulder (normal period)
    "Shoulder": {"Cents per kWh": 20, "Time Range": "7:00 to 17:00"},
    # Off-peak
    "Off-Peak": {"Cents per kWh": 12, "Time Range": "0:00 to 7:00"},
}

If the price were a flat 28 cents per kWh for every hour of the day, the cost would be a simple multiplication.

df['cost_cents'] = df['energy_kwh'] * 28

               date_time    energy_kwh       cost_cents
0    2013-01-01 00:00:00         0.586           16.408
1    2013-01-01 01:00:00         0.580           16.240
2    2013-01-01 02:00:00         0.572           16.016
3    2013-01-01 03:00:00         0.596           16.688
4    2013-01-01 04:00:00         0.592           16.576
...

But costing depends on the time of day. This is where you'll see a lot of people use Pandas in unexpected ways, by writing a loop to do conditional computations.

def apply_tariff(kwh, hour):
    """Calculate the electricity cost for a given hour."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f'Invalid hour: {hour}')
    return rate * kwh

def apply_tariff_loop(df):
    energy_cost_list = []
    for i in range(len(df)):
        # loop over the data row by row
        energy_used = df.iloc[i]['energy_kwh']
        hour = df.iloc[i]['date_time'].hour
        energy_cost = apply_tariff(energy_used, hour)
        energy_cost_list.append(energy_cost)

    df['cost_cents'] = energy_cost_list

%%time
apply_tariff_loop(df)

Wall time: 2.59 s

Looping with the .itertuples() and .iterrows() methods

Pandas actually makes the for i in range(len(df)) syntax redundant by providing the DataFrame.itertuples() and DataFrame.iterrows() methods, generator methods that yield one row at a time.

  • .itertuples() yields a named tuple for each row, with the row's index value as the first element of the tuple. Named tuples are a data structure from Python's collections module that behave like regular tuples but allow fields to be accessed by attribute.
  • .iterrows() yields an (index, Series) pair (a tuple) for each row in the DataFrame.

def apply_tariff_iterrows(df):
    energy_cost_list = []
    for index, row in df.iterrows():
        energy_used = row['energy_kwh']
        hour = row['date_time'].hour
        energy_cost = apply_tariff(energy_used, hour)
        energy_cost_list.append(energy_cost)
    df['cost_cents'] = energy_cost_list

%%time
apply_tariff_iterrows(df)

Wall time: 808 ms

The speed is increased by as much as 3 times.
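The text above also mentions .itertuples(), which is typically faster still than .iterrows(). Here is a self-contained sketch (not from the original article, with the tariff's rate logic inlined and a tiny made-up sample) showing how row fields are accessed as attributes on each named tuple:

```python
import pandas as pd

def apply_tariff_itertuples(df):
    energy_cost_list = []
    for row in df.itertuples():
        # columns become attributes on the yielded named tuple
        hour = row.date_time.hour
        if 0 <= hour < 7:
            rate = 12
        elif 7 <= hour < 17:
            rate = 20
        else:
            rate = 28
        energy_cost_list.append(rate * row.energy_kwh)
    df['cost_cents'] = energy_cost_list

# Tiny made-up sample to show the call shape
sample_df = pd.DataFrame({
    'date_time': pd.to_datetime(['2013-01-01 00:00', '2013-01-01 18:00']),
    'energy_kwh': [0.5, 1.0],
})
apply_tariff_itertuples(sample_df)
print(sample_df['cost_cents'].tolist())  # [6.0, 28.0]
```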

.apply() method

This operation can be further improved using the .apply() method. Pandas' .apply() method takes functions (callables) and applies them along the DataFrame's axes (all rows or all columns).

The lambda function passes two columns of data to apply_tariff() .

def apply_tariff_withapply(df):
    df['cost_cents'] = df.apply(
        lambda row: apply_tariff(
            kwh=row['energy_kwh'],
            hour=row['date_time'].hour),
        axis=1)

%%time
apply_tariff_withapply(df)

Wall time: 181 ms

The syntax advantage of .apply() is obvious, the code is concise, easy to read, and clear. The time taken in this case is about 1/4 the time of the .iterrows() method.

.isin() data selection

But how can conditional computation be applied as a vectorized operation in Pandas? One trick is to select and group parts of the DataFrame based on conditions, then apply a vectorized operation to each group.

Use Pandas' .isin() method to select rows, then assign costs in a vectorized operation. Before doing this, it is convenient to set the date_time column as the DataFrame's index.

df.set_index('date_time', inplace=True)

def apply_tariff_isin(df):
    peak_hours = df.index.hour.isin(range(17, 24))
    shoulder_hours = df.index.hour.isin(range(7, 17))
    off_peak_hours = df.index.hour.isin(range(0, 7))

    df.loc[peak_hours, 'cost_cents'] = df.loc[peak_hours, 'energy_kwh'] * 28
    df.loc[shoulder_hours,'cost_cents'] = df.loc[shoulder_hours, 'energy_kwh'] * 20
    df.loc[off_peak_hours,'cost_cents'] = df.loc[off_peak_hours, 'energy_kwh'] * 12

%%time
apply_tariff_isin(df)

Wall time: 53.5 ms

Each .isin() call above returns a boolean array:

[False, False, False, ..., True, True, True]
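A small self-contained example (with a made-up DatetimeIndex, not the article's data) of the boolean mask that .isin() produces over an index's hours:

```python
import pandas as pd

# Four consecutive hours starting at 15:00
idx = pd.date_range('2013-01-01 15:00', periods=4, freq='h')

# True for hours in the peak range 17:00-23:00
mask = idx.hour.isin(range(17, 24))
print(mask.tolist())  # [False, False, True, True]
```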

.cut() data binning

Setting up bins for the time slices and a single formula for the calculation makes the operation simpler, though the code can be a bit harder for a novice to read.

def apply_tariff_cut(df):
    # right=False makes the bins [0, 7), [7, 17), [17, 24),
    # matching the if/elif boundaries in apply_tariff()
    cents_per_kwh = pd.cut(x=df.index.hour,
                           bins=[0, 7, 17, 24],
                           right=False,
                           labels=[12, 20, 28]).astype(int)
    df['cost_cents'] = cents_per_kwh * df['energy_kwh']
    
%%time
apply_tariff_cut(df)

Wall time: 2.99 ms
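A quick self-contained sanity check of the bin boundaries on made-up boundary hours. Note that with pd.cut's default right=True, hour 7 would land in the first bin and hour 17 in the second; right=False gives the half-open ranges that match the loop's logic.

```python
import pandas as pd

# Hours on either side of each tariff boundary
hours = [0, 6, 7, 16, 17, 23]
rates = pd.cut(hours, bins=[0, 7, 17, 24],
               right=False, labels=[12, 20, 28]).astype(int)
print(list(rates))  # [12, 12, 20, 20, 28, 28]
```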

NumPy method processing

Pandas Series and DataFrames are designed on top of the NumPy library. This provides greater computational flexibility as Pandas works seamlessly with NumPy arrays and operations.

Use NumPy's digitize() function. It's similar to Pandas' cut() in that the data will be binned, but this time it will be represented by an array of indices indicating which bin each hour belongs to. These indices are then applied to the price array.

import numpy as np

def apply_tariff_digitize(df):
    prices = np.array([12, 20, 28])
    bins = np.digitize(df.index.hour.values, bins=[7, 17, 24])
    df['cost_cents'] = prices[bins] * df['energy_kwh'].values

%%time
apply_tariff_digitize(df)

Wall time: 1.99 ms
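A self-contained check of the digitize mapping on made-up boundary hours: with the default right=False, digitize maps hour h to the index i such that bins[i-1] <= h < bins[i], so hour 7 gets index 1 (20 cents) and hour 17 gets index 2 (28 cents), consistent with the loop.

```python
import numpy as np

hours = np.array([0, 6, 7, 16, 17, 23])
bins = np.digitize(hours, bins=[7, 17, 24])
prices = np.array([12, 20, 28])
print(prices[bins].tolist())  # [12, 12, 20, 20, 28, 28]
```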

Processing efficiency comparison

Compare the efficiency of the above several different processing methods.

Function                    Running time
apply_tariff_loop()         2.59 s
apply_tariff_iterrows()     808 ms
apply_tariff_withapply()    181 ms
apply_tariff_isin()         53.5 ms
apply_tariff_cut()          2.99 ms
apply_tariff_digitize()     1.99 ms

HDFStore prevents reprocessing

Often when building complex data models it is convenient to do some preprocessing on the data. If you have 10 years of minute-frequency electricity usage data, simply converting the date and time column to datetime might take 20 minutes even with the format parameter specified. You only want to do this once, not every time you run a model or analysis.

One very useful thing to do here is preprocess once and then store the data in its processed form so that it can be used whenever needed. But how can the data be stored in the correct format without having to reprocess it? Saving to CSV would lose the datetime objects, forcing reprocessing on the next read.

Pandas has a built-in solution using HDF5, a high-performance storage format designed for storing arrays of tabular data. Pandas' HDFStore class allows a DataFrame to be stored in an HDF5 file so that it can be accessed efficiently while still preserving column types and other metadata. HDFStore is a dictionary-like class, so it can be read from and written to much like a Python dict.

Store the preprocessed power consumption DataFrame df in an HDF5 file.

data_store = pd.HDFStore('processed_data.h5')

# put the DataFrame into the store, with the key set to preprocessed_df
data_store['preprocessed_df'] = df
data_store.close()

A method for accessing data from an HDF5 file, preserving the data type.

data_store = pd.HDFStore('processed_data.h5')

preprocessed_df = data_store['preprocessed_df']
data_store.close()
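If the HDF5 backend (PyTables) is unavailable, a pickle round trip is an alternative sketch (not from the original article) that also preserves dtypes, including a DatetimeIndex; the file name and tiny sample data below are hypothetical.

```python
import pandas as pd

small_df = pd.DataFrame(
    {'energy_kwh': [0.586, 0.580]},
    index=pd.to_datetime(['2013-01-01 00:00', '2013-01-01 01:00']),
)
small_df.to_pickle('processed_data.pkl')

# Reading back preserves the datetime64[ns] index without reparsing
restored = pd.read_pickle('processed_data.pkl')
print(restored.index.dtype)  # datetime64[ns]
```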

Origin blog.csdn.net/qq_20288327/article/details/124244006