Time series classification 05: A sliding window with Python yield for slicing time series data (adjustable step size)

Anyone working on time series forecasting or classification runs into the concept of a "sliding window" sooner or later; it comes up in most of the 20-plus articles I have written on these topics. Much like a convolution kernel moving across an image in a convolutional neural network, a sliding window moves along sequence data and cuts out fragments, reshaping the original series into samples of a specified length that a model can be trained on.
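
To make the idea concrete, here is a minimal sketch on a toy series. It uses NumPy's sliding_window_view (available in NumPy 1.20 and later) rather than the yield approach this article describes, and the width and step values are arbitrary choices for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.arange(10)   # toy series: 0, 1, ..., 9
width, step = 4, 2       # illustrative values, not taken from the article

# Build every window of length `width` (step 1), then keep every `step`-th one.
windows = sliding_window_view(series, width)[::step]

print(windows.shape)  # (4, 4)
print(windows)
# [[0 1 2 3]
#  [2 3 4 5]
#  [4 5 6 7]
#  [6 7 8 9]]
```

Each row is one sample of fixed length, which is the reshaping the paragraph above refers to.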

There are many ways to implement a sliding window. The simplest is a for loop, but that can become strained when the data set is large. This is where Python's yield comes in handy. Understanding how yield works and how to use it takes some background; I covered it in an earlier article, which you can refer to if you are interested: Analyze Python yield in a simple way.
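
As a rough illustration of why a generator helps, the sketch below yields window boundaries one at a time instead of building them all first. The function name window_bounds and the numbers are hypothetical, chosen only for this example.

```python
from itertools import islice

def window_bounds(n_rows, width, step=1):
    """Lazily yield (start, end) index pairs for each complete window."""
    for start in range(0, n_rows - width + 1, step):
        yield start, start + width

# Even for a very long (hypothetical) series, nothing is computed until asked for.
gen = window_bounds(10**12, width=100, step=40)
first_three = list(islice(gen, 3))
print(first_three)  # [(0, 100), (40, 140), (80, 180)]
```

Only the three requested pairs are ever produced, so memory use does not grow with the length of the series.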

In real projects, the sliding window function is usually defined separately from the code that splits the data into training and test sets. In simpler cases, both can be handled in a single function.


Method 1

The following sliding window implementation comes from a household electricity consumption forecasting example.

First, a look at the data set. The training set (train in the code below) contains 1442 sampling points and 8 features (8 columns), i.e. a flat shape of (1442, 8). Note that the function reshapes train from three dimensions to two, so it expects the data to be grouped into weeks already, with shape (weeks, days, features). The sampled data is divided into samples by combining a sliding window with slicing. The window width (sw_width) and the output length (n_out) are parameters, which makes it easy to tune them later and find the values that suit your business needs; the sliding step is fixed at 1 in this function. Code:

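import numpy as np

def sliding_window(train, sw_width=7, n_out=7, in_start=0):
    '''
    Slice sequence data with a sliding window of width sw_width (default 7)
    and a sliding step of 1.
    '''
    # Flatten the weekly samples into a single daily sequence
    data = train.reshape((train.shape[0] * train.shape[1], train.shape[2]))
    X, y = [], []

    for _ in range(len(data)):
        in_end = in_start + sw_width
        out_end = in_end + n_out

        # Keep a sample only if it is complete, i.e. its last index stays within
        # the original sequence; otherwise discard it.
        # out_end == len(data) is still a complete sample.
        if out_end <= len(data):
            # Training data is sliced with a sliding step of 1
            train_seq = data[in_start:in_end, 0]
            train_seq = train_seq.reshape((len(train_seq), 1))
            X.append(train_seq)
            y.append(data[in_end:out_end, 0])
        in_start += 1

    return np.array(X), np.array(y)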

With the comments above, the code should not be hard to follow for anyone with some Python basics. If you are interested in the full example, you can refer to this article: Time Series Prediction 15: Multi-input / Multi-head CNN to realize electricity consumption / generation forecast
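
Since Method 1 fixes the sliding step at 1, here is a sketch of how the same input/target slicing could be written as a generator with an adjustable step. The name iter_xy_windows, the step parameter, and the synthetic data (206 weeks of 7 days, which gives 1442 daily rows) are assumptions for illustration and are not part of the original project. The sketch counts the final window whose target ends exactly at the last row.

```python
import numpy as np

def iter_xy_windows(data, sw_width=7, n_out=7, step=1):
    """Yield (X, y) pairs from a 2-D array of shape (n_days, n_features).

    X is column 0 over sw_width days, shaped (sw_width, 1);
    y is column 0 over the following n_out days.
    """
    last_start = len(data) - sw_width - n_out  # last start with a complete target
    for in_start in range(0, last_start + 1, step):
        in_end = in_start + sw_width
        out_end = in_end + n_out
        yield data[in_start:in_end, 0].reshape(-1, 1), data[in_end:out_end, 0]

# Synthetic stand-in for the training set: 206 weeks x 7 days x 8 features.
rng = np.random.default_rng(0)
train = rng.random((206, 7, 8))
data = train.reshape(-1, train.shape[2])  # flatten weeks into days: (1442, 8)

pairs = list(iter_xy_windows(data, sw_width=7, n_out=7, step=1))
X = np.array([p[0] for p in pairs])
y = np.array([p[1] for p in pairs])
print(X.shape, y.shape)  # (1429, 7, 1) (1429, 7)
```

Passing a larger step, for example step=7, would produce non-overlapping weekly inputs instead.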


Method 2

This method slices a single column of data; to handle several columns, it can be combined with a for loop and np.vstack to stack the results. It is one piece of a project I am working on, which I will introduce in a future article once the project is finished. Both the window width and the sliding step can be specified. Again, looking at the data set first makes the function easier to understand.
The data set (test_concat_file in the test code below) has shape (875, 10): 875 sampling points and 10 features (10 columns). Code:

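def _slide_window(rows, sw_width, sw_steps):
    '''
    Purpose:
    Slice a single column of data with the specified window width and
    sliding step (yields the start and end index of each window).
    --------------------------------------------------
    Parameters:
    rows: number of rows in a single file;
    sw_width: width of the sliding window;
    sw_steps: sliding step of the window;
    '''
    start = 0
    s_num = (rows - sw_width) // sw_steps # number of slides
    new_rows = sw_width + (sw_steps * s_num) # rows covered by complete windows; trailing samples that cannot fill a window are discarded

    while True:
        if (start + sw_width) > new_rows: # stop once the window end passes the last usable index
            return
        yield start, start + sw_width
        start += sw_steps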

Use the data set above to test:

_test_list = []
for start, end in _slide_window(test_concat_file.shape[0], 100, 40):
    # A for loop (or another approach) can be added here to handle multiple columns
    _test_list.append(test_concat_file['ax'][start:end])
    # A condition can be added here to stack the arrays

Check the length of the generated sample list:

len(_test_list)

Output:

20
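
The count of 20 can be checked by repeating the arithmetic from _slide_window with the values used in this test (875 rows, window width 100, step 40):

```python
rows, sw_width, sw_steps = 875, 100, 40

s_num = (rows - sw_width) // sw_steps    # 775 // 40 = 19 slides
new_rows = sw_width + sw_steps * s_num   # 100 + 760 = 860 rows covered
n_windows = s_num + 1                    # the first window plus 19 slides
discarded = rows - new_rows              # trailing rows that fit no window

print(s_num, new_rows, n_windows, discarded)  # 19 860 20 15
```

The last window therefore starts at row 760 and ends at row 859, and the final 15 rows are dropped.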

Inspect the data contained in the last sample:

_test_list[19]

Output:

760      0.1245
761    0.124742
762    0.124991
763    0.125238
764    0.125482
         ...   
855    0.185577
856    0.185849
857    0.186137
858    0.186466
859    0.186857
Name: ax, Length: 100, dtype: object
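
The multi-column case mentioned earlier could look roughly like the sketch below. It is not the project code: the data is random, three columns stand in for the real ones, and slide_window is a compact range-based equivalent of _slide_window written for this example. It uses np.stack to keep one 3-D array of samples; np.vstack would instead concatenate the windows row-wise into a 2-D array.

```python
import numpy as np

def slide_window(rows, sw_width, sw_steps):
    """Yield (start, end) row indices of each complete window."""
    for start in range(0, rows - sw_width + 1, sw_steps):
        yield start, start + sw_width

# Synthetic stand-in for the data set: 875 rows, 3 made-up feature columns.
rng = np.random.default_rng(0)
data = rng.random((875, 3))

# Slice all columns at once for each window, then stack the windows.
samples = [data[start:end] for start, end in slide_window(len(data), 100, 40)]
stacked = np.stack(samples)   # (n_windows, sw_width, n_columns)

print(stacked.shape)               # (20, 100, 3)
print(np.vstack(samples).shape)    # (2000, 3)
```

With a pandas DataFrame, the same slice could be taken with .iloc[start:end] and converted with .to_numpy().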

There are of course many other ways to implement a sliding window. This article covers only the two methods I use, in the hope that they help those who need them.


Origin blog.csdn.net/weixin_39653948/article/details/105498685