Time series resampling and an introduction to the pandas resample method

Resampling is a fundamental technique in time series analysis. It converts time series data from one frequency to another: upsampling increases the frequency (finer granularity), while downsampling decreases it (coarser granularity). This article covers the key aspects of resampling in pandas.

Why is resampling important?

Time series data often arrives with timestamps that do not match the desired analysis interval. For example, data may be collected at irregular intervals but need to be modeled or analyzed at a consistent frequency.

Types of resampling

There are two main types of resampling:

1. Upsampling

Upsampling increases the frequency or granularity of your data, that is, it converts the data to smaller time intervals.

2. Downsampling

Downsampling reduces the frequency or granularity of the data, that is, it converts the data to larger time intervals.
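The two directions can be sketched on a tiny series. This is a minimal illustration with made-up values; the variable names and the '3D' and '12h' target frequencies are assumptions chosen for the example.

```python
import pandas as pd

# Six daily observations (hypothetical values, for illustration only).
idx = pd.date_range("2023-01-01", periods=6, freq="D")
daily = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Downsampling: daily -> 3-day bins, aggregated with sum.
downsampled = daily.resample("3D").sum()

# Upsampling: daily -> 12-hour steps. The new noon timestamps have no
# source data, so asfreq() leaves them as NaN until they are filled.
upsampled = daily.resample("12h").asfreq()

print(downsampled)
print(upsampled.head(4))
```

Downsampling always needs an aggregation to decide how several values collapse into one, while upsampling needs a fill or interpolation rule for the timestamps it creates.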

Resampling applications

Resampling has a wide range of applications:

In financial analysis, stock prices or other financial indicators may be recorded at irregular intervals. Resampling aligns this data with the time frame of the trading strategy (such as daily or weekly).

Internet of Things (IoT) devices often generate data at different frequencies. Resampling normalizes this data to consistent time intervals for analysis.

When creating time series visualizations, you often need to display data at different frequencies. Resampling allows you to adjust the level of detail in your drawing.

Many machine learning models require data with consistent time intervals. Resampling is essential when preparing time series data for model training.

Resampling process

The resampling process usually includes the following steps:

Start by selecting the time series data you want to resample. This data can be in a variety of formats, including numeric, text, or categorical data.

Determine the target frequency of the resampled data. It can be higher than the original frequency (upsampling) or lower (downsampling).

Select a resampling method. Common methods include averaging, summing, or using interpolation techniques to fill gaps in the data.

When upsampling, you may encounter missing data points between the original timestamps. Interpolation methods, such as linear or cubic spline interpolation, can be used to estimate these values.

For downsampling, data points are typically aggregated within each target interval. Common aggregate functions include sum, mean, or median.

Evaluate the resampled data to ensure it meets the analysis objectives. Check data for consistency, completeness and accuracy.
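The steps above can be sketched end to end on a small, irregularly spaced series. The readings are hypothetical, and the hourly target frequency, the mean aggregation, and linear interpolation are assumptions chosen for illustration rather than recommendations.

```python
import pandas as pd

# Step 1: irregularly spaced readings (hypothetical sensor values).
ts = pd.Series(
    [10.0, 12.0, 11.0, 15.0, 19.0],
    index=pd.to_datetime([
        "2023-01-01 00:05",
        "2023-01-01 00:40",
        "2023-01-01 01:10",
        "2023-01-01 03:20",
        "2023-01-01 03:50",
    ]),
)

# Steps 2-3: pick an hourly target frequency and aggregate with the mean.
hourly = ts.resample("60min").mean()

# Step 4: the 02:00 bin received no readings, so it is NaN; fill it linearly.
filled = hourly.interpolate(method="linear")

# Step 5: check that the result is regular and complete.
print(hourly)
print(filled)
print("missing after fill:", int(filled.isna().sum()))
```

Whether an empty bin should be interpolated, filled with a neighbor, or left missing depends on the analysis; the check in the last step is where that decision gets validated.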

resample() method in Pandas

resample() works on both pandas Series and DataFrame objects. It groups time series data into time bins so that you can aggregate, transform, downsample, or upsample it.

Below is the basic usage of the resample() method and some common parameters:

 import pandas as pd
 
 # Create a sample time series DataFrame
 data = {'date': pd.date_range(start='2023-01-01', end='2023-12-31', freq='D'),
         'value': range(365)}
 df = pd.DataFrame(data)
 
 # Set the date column as the index
 df.set_index('date', inplace=True)
 
 # Resample with resample()
 # Convert daily data to monthly data and compute the monthly sum
 monthly_data = df['value'].resample('M').sum()
 
 # Convert monthly data to quarterly data and compute the quarterly mean
 quarterly_data = monthly_data.resample('Q').mean()
 
 # Convert quarterly data to yearly data and compute the yearly maximum
 annual_data = quarterly_data.resample('Y').max()
 
 print(monthly_data)
 print(quarterly_data)
 print(annual_data)
 

In the above example, we first create a sample time series DataFrame and then use the resample() method to convert it to different time frequencies (monthly, quarterly, yearly), applying a different aggregation function at each step (sum, mean, maximum).

Parameters of the resample() method:

  • The first parameter (rule) is a frequency string that specifies the target frequency for resampling. Common options include 'D' (daily), 'M' (monthly), 'Q' (quarterly), and 'Y' (yearly). Since pandas 2.2, 'M', 'Q', and 'Y' are deprecated in favor of 'ME', 'QE', and 'YE', and 'H' in favor of 'h'.
  • The aggregate function is not a parameter of resample() in current pandas. Old versions accepted a how argument, but it has been removed. resample() returns a Resampler object, and you call an aggregation method such as sum(), mean(), or max() on it. There is no default aggregation.
  • The closed parameter specifies which side of each bin interval is closed. The accepted values are 'right' and 'left'. The default is 'left' for all frequencies except 'M', 'A', 'Q', 'BM', 'BA', 'BQ', and 'W', which default to 'right'.
  • The label parameter specifies which bin edge is used to label each resampled bucket. The accepted values are 'right' and 'left', with the same defaults as closed.
  • The loffset parameter, which adjusted the resampled timestamps, was deprecated in pandas 1.1 and removed in pandas 2.0. Add an offset to the index of the result instead.
  • Finally, some aggregation methods take their own arguments. For example, sum() accepts min_count, the minimum number of non-NA values required in a bin.
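As a small sketch of the last two points: min_count is passed to the aggregation method rather than to resample(), and because loffset is no longer available in pandas 2.x, the usual substitute is to shift the index of the result. The data and the 12-hour shift below are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2023-01-01", periods=4, freq="D")
s = pd.Series([1.0, 2.0, np.nan, np.nan], index=idx)

# Default behaviour: a bin that contains only NaN sums to 0.
default_sum = s.resample("2D").sum()

# min_count=1 requires at least one non-NA value, otherwise the bin is NaN.
strict_sum = s.resample("2D").sum(min_count=1)

# Stand-in for the removed loffset argument: shift the result's index.
shifted = default_sum.copy()
shifted.index = shifted.index + pd.Timedelta(hours=12)

print(default_sum)
print(strict_sum)
print(shifted)
```

Using min_count keeps "no data" distinguishable from "data that sums to zero", which matters when the result feeds further statistics.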

1. Specify column name

By default, resample() operates on the index of the DataFrame or Series, which must be datetime-like (a DatetimeIndex, PeriodIndex, or TimedeltaIndex). To resample a DataFrame by a datetime-like column instead, use the on parameter. It lets you select a specific column for resampling even if it is not the index.

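 # df is the sample frame returned by generate_sample_data_datetime(), defined later in this article
 # Move the unnamed DatetimeIndex into a regular column called 'index'
 df.reset_index(drop=False, inplace=True)
 df.resample('W', on='index')['C_0'].sum().head()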

In this code, weekly resampling is performed on the 'index' column using the resample() method, calculating the weekly sum of the 'C_0' column.
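A self-contained variant of the same idea is sketched below. The frame, the column name 'when', and the values are hypothetical; only the on parameter itself comes from the text above.

```python
import pandas as pd

# Timestamps live in an ordinary column, not in the index.
df_col = pd.DataFrame({
    "when": pd.date_range("2023-01-02", periods=14, freq="D"),  # starts on a Monday
    "C_0": range(1, 15),
})

# Weekly bins end on Sunday by default, so the two bins are
# Jan 2-8 and Jan 9-15, labelled with their closing Sunday.
weekly = df_col.resample("W", on="when")["C_0"].sum()
print(weekly)
```

The result is indexed by the resampled 'when' values, so the original frame does not need to be re-indexed first.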

2. Specify the closed side of the interval

The closed parameter controls which side of each bin interval is closed during resampling. By default, the frequencies 'M', 'A', 'Q', 'BM', 'BA', 'BQ', and 'W' are right-closed, which means the right boundary is included, while all other frequencies are left-closed and include the left boundary. You can set the closed side explicitly when converting data frequencies.

 df = generate_sample_data_datetime()
 pd.concat([df.resample('W', closed='left')['C_0'].sum().to_frame(name='left_closed'),
            df.resample('W', closed='right')['C_0'].sum().to_frame(name='right_closed')],
           axis=1).head(5)

In this code, we demonstrate the difference between left closed intervals and right closed intervals when converting daily frequencies to weekly frequencies.
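The effect is easier to check by hand on a small series. The sketch below uses 14 made-up daily values that start on a Sunday, so the first observation sits exactly on a weekly bin boundary.

```python
import pandas as pd

# 2023-01-01 is a Sunday, the default anchor day of the 'W' frequency.
idx = pd.date_range("2023-01-01", periods=14, freq="D")
daily = pd.Series(range(1, 15), index=idx)

# Right-closed (the default for 'W'): Jan 1 closes a bin of its own.
right_closed = daily.resample("W", closed="right").sum()

# Left-closed: Jan 1 opens the bin that runs through Jan 7.
left_closed = daily.resample("W", closed="left").sum()

print(pd.concat([left_closed.rename("left_closed"),
                 right_closed.rename("right_closed")], axis=1))
```

With right-closed bins the boundary day is counted with the week that ends on it; with left-closed bins it is counted with the week that starts on it, which is why the two columns differ.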

3. Control the output labels

The label parameter controls which bin edge labels each output row. By default, the frequencies that are right-closed ('M', 'A', 'Q', 'BM', 'BA', 'BQ', and 'W') use the right boundary of each bin as the label, while all other frequencies use the left boundary. When converting data frequency, you can specify whether the left or the right boundary is used as the output label.

 df = generate_sample_data_datetime()
 df.resample('W', label='left')['C_0'].sum().to_frame(name='left_boundary').head(5)
 df.resample('W', label='right')['C_0'].sum().to_frame(name='right_boundary').head(5)

In this code, the output label changes depending on whether "left" or "right" is specified in the label parameter. It is recommended to specify it explicitly in actual applications to reduce confusion.
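A compact way to see that label changes only the timestamps, not the values, is sketched below on made-up data with a 3-day frequency (an assumption chosen so that the default label is the left edge).

```python
import pandas as pd

idx = pd.date_range("2023-01-01", periods=6, freq="D")
daily = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Same bins, same sums; only the timestamp attached to each bin differs.
left_labels = daily.resample("3D", label="left").sum()
right_labels = daily.resample("3D", label="right").sum()

print(left_labels)
print(right_labels)
```

Right labels point at the end of each interval, which can be convenient when an aggregate should only be considered known once its interval has finished.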

4. Summary statistics

Resampling supports aggregate statistics in the same way as groupby: use aggregation methods such as sum, mean, min, and max to summarize the data within each resampling interval.

 df.resample('D').sum()
 df.resample('W').mean()
 df.resample('M').min()
 df.resample('Q').max()
 df.resample('Y').count()
 df.resample('W').std()
 df.resample('M').var()
 df.resample('D').median()
 df.resample('M').quantile([0.25, 0.5, 0.75])
 # Custom aggregation: the range (max - min) within each week
 custom_agg = lambda x: x.max() - x.min()
 df.resample('W').apply(custom_agg)

Upsampling and filling

In time series data analysis, upsampling and downsampling are techniques used to manipulate the frequency of data observations. These techniques are valuable for adjusting the granularity of time series data to match analytical needs.

Let's generate some data first:

 import pandas as pd
 import numpy as np
 
 
 def generate_sample_data_datetime():
     np.random.seed(123)
     number_of_rows = 365 * 2
     num_cols = 5
     start_date = '2023-09-15'  # You can change the start date if needed
     cols = ["C_0", "C_1", "C_2", "C_3", "C_4"]
     df = pd.DataFrame(np.random.randint(1, 100, size=(number_of_rows, num_cols)), columns=cols)
     df.index = pd.date_range(start=start_date, periods=number_of_rows)
     return df
 
 df = generate_sample_data_datetime()

Upsampling involves increasing the granularity of the data, which means converting the data from lower frequencies to higher frequencies.

Let's say you have the daily data generated above and want to convert it to a 12-hour frequency and calculate the sum of "C_0" in each interval:

 df.resample('12H')['C_0'].sum().head(10)

The code resamples the data into 12-hour intervals and sums 'C_0' within each interval. Because the source data is daily, every second 12-hour interval contains no observations, and its sum is 0. The .head(10) call displays the first 10 rows of the result.

When upsampling, that is, converting from a lower to a higher frequency, the new frequency introduces timestamps that have no data. These gaps need to be filled. The following methods are commonly used:

Forward fill - fills missing values with the previous available value. The number of consecutive values filled can be limited with the limit parameter.

 df.resample('8H')['C_0'].ffill(limit=1)

Backfill - fills missing values with the next available value.

 df.resample('8H')['C_0'].bfill(limit=1)

Nearest fill - fills missing values with the nearest available value, which may lie before or after the gap.

 df.resample('8H')['C_0'].nearest(limit=1)

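Fillna - combines the functionality of the previous three methods. You specify the method ('pad'/'ffill', 'backfill'/'bfill', or 'nearest') and can use the limit parameter to cap the number of values filled. Recent pandas releases deprecate Resampler.fillna in favor of calling ffill(), bfill(), or nearest() directly.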

 df.resample('8H')['C_0'].fillna(method='pad', limit=1)

Asfreq - converts to the new frequency without aggregating. Its fill_value argument fills every value that is missing because of the upsampling in one step. For example, you can use -999 to fill in the missing values.

 df.resample('8H')['C_0'].asfreq(-999)

Interpolation methods - Various interpolation algorithms can be applied.

 df.resample('8H').interpolate(method='linear').round(2)
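To compare the fill strategies side by side, the sketch below upsamples three made-up daily values to 8-hour steps. The values 0, 24, and 48 are chosen so that the linear result is easy to verify by eye; the variable names are illustrative.

```python
import pandas as pd

idx = pd.date_range("2023-01-01", periods=3, freq="D")
s = pd.Series([0.0, 24.0, 48.0], index=idx)

up = s.resample("8h")  # one Resampler object can be reused
comparison = pd.DataFrame({
    "asfreq": up.asfreq(),                       # gaps stay NaN
    "ffill": up.ffill(),                         # carry the last value forward
    "bfill": up.bfill(),                         # pull the next value backward
    "nearest": up.nearest(),                     # closest original timestamp
    "linear": up.interpolate(method="linear"),   # straight line between points
})
print(comparison)
```

Forward fill never uses information from the future, which makes it the cautious choice for forecasting features, while interpolation and backfill both look ahead.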

Some commonly used functions

1. Use agg for aggregation

 result = df.resample('W').agg(
     {
         'C_0': ['sum', 'mean'],
         'C_1': lambda x: np.std(x, ddof=1)
     }
 ).head()

Use the agg method to resample daily time series data to a weekly frequency while specifying different aggregate functions for different columns. For "C_0" the sum and mean are calculated, while for "C_1" the sample standard deviation is calculated.

2. Use apply for aggregation

 def custom_agg(x):
     agg_result = {
         'C_0_mean': round(x['C_0'].mean(), 2),
         'C_1_sum': x['C_1'].sum(),
         'C_2_max': x['C_2'].max(),
         'C_3_mean_plus1': round(x['C_3'].mean() + 1, 2)
     }
     return pd.Series(agg_result)
 
 result = df.resample('W').apply(custom_agg).head()

A custom aggregate function called custom_agg is defined which takes a DataFrame x as input and calculates various aggregations on different columns. Use the apply method to resample the data to a weekly frequency and apply a custom aggregation function.

3. Use transform for transformation

 df['C_0_cumsum'] = df.resample('W')['C_0'].transform('cumsum')
 df['C_0_rank'] = df.resample('W')['C_0'].transform('rank')
 result = df.head(10)

Use the transform method to calculate the cumulative sum and the rank of the 'C_0' variable within each weekly group. The original index structure of the DataFrame remains unchanged.
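Because transform returns a result aligned with the original index, it can be combined directly with the source series. The sketch below, on made-up data, computes a within-week cumulative sum and each day's share of its weekly total; the share calculation is an illustrative addition.

```python
import pandas as pd

idx = pd.date_range("2023-01-02", periods=10, freq="D")  # starts on a Monday
s = pd.Series(range(1, 11), index=idx, name="C_0")

# Cumulative sum that restarts at the beginning of every week.
weekly_cumsum = s.resample("W").transform("cumsum")

# Broadcast each weekly total back to its days, then divide.
share_of_week = s / s.resample("W").transform("sum")

print(weekly_cumsum)
print(share_of_week.round(3))
```

The output has one row per original observation, unlike an aggregation, which would return one row per week.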

4. Use pipe for pipeline operations

 # Select columns with a list; Resampler has no cumsum(), so use transform
 result = (
     df.resample('W')[['C_0', 'C_1']]
     .pipe(lambda r: r.transform('cumsum'))    # cumulative sum within each week
     .pipe(lambda t: t['C_1'] - t['C_0'])      # difference of the two cumulative sums
 )
 result = result.head(10)

Chain operations on the resampled 'C_0' and 'C_1' columns with pipe. The first step calculates the cumulative sum within each weekly group, and the second step calculates the difference between 'C_1' and 'C_0'. The operations run in sequence, like a pipeline.
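pipe is also handy when one Resampler should feed several aggregations inside a reusable function. The following is a minimal sketch on made-up data; the helper name weekly_range is hypothetical.

```python
import pandas as pd

idx = pd.date_range("2023-01-02", periods=14, freq="D")  # two full Mon-Sun weeks
df_small = pd.DataFrame({"C_0": range(14), "C_1": range(14, 28)}, index=idx)

def weekly_range(resampler):
    # The same Resampler is aggregated twice: max minus min per bin.
    return resampler.max() - resampler.min()

result = df_small.resample("W")[["C_0", "C_1"]].pipe(weekly_range)
print(result)
```

Passing the Resampler into a named function keeps the chain readable and avoids repeating the resample() call for each aggregation.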

Summary

Resampling of time series is the conversion of time series data from one time frequency (such as daily) to another time frequency (such as monthly or yearly), and is usually accompanied by an aggregation operation on the data. Resampling is a key operation in time series data processing, through which trends and patterns in the data can be better understood.

In Python, you can use the resample() method of the pandas library to perform time series resampling.

https://avoid.overfit.cn/post/cf6fba5f6cbe49619738f2181b1bbd70

Author: JIN

Origin blog.csdn.net/m0_46510245/article/details/133013860