4-3 Working with time series

Data download used in this article

Data from a Washington, D.C., bike-sharing system reporting the hourly count of rental bikes in 2011–2012 in the Capital Bikeshare system, along with weather and seasonal information.


Our goal will be to take a flat, 2D dataset and transform it into a 3D one, as shown in figure 4.5.
(figure 4.5 shows a transposed version of this to better fit on the printed page)
We want to change the row-per-hour organization so that we have one axis that increases at a rate of one day per index increment, and another axis that represents the hour of the day (independent of the date). The third axis will be our different columns of data (weather, temperature, and so on).


import numpy as np
bikes_numpy = np.loadtxt(
    "D:/bike-sharing-dataset/hour-fixed.csv",
    dtype=np.float32, delimiter=",", skiprows=1,
    converters={1: lambda x: float(x[8:10])},
)

The fields in the file are separated by commas, the first row (the header) is skipped, and the second column (index 1, counting from 0) gets a special converter: a lambda function that takes the characters at string indices 8 and 9 of the date string and converts them to a float.

Recap: lambda functions

Each date string has 10 characters in the format 2011-01-01. String indices start at 0, so indices 8 and 9 hold the day of the month, which the converter turns into a floating-point number.


Each row of the resulting array holds one hour of data.

Then we convert it to a tensor:

import torch
bikes=torch.from_numpy(bikes_numpy)
bikes


bikes.shape, bikes.stride()


Notes on some of the column headers in the table:
Season: season (1: spring, 2: summer, 3: fall, 4: winter)
Year: yr (0: 2011, 1: 2012)
Month: mnth (1 to 12)
Hour: hr (0 to 23)
Weather situation: weathersit (1: clear, 2:mist, 3: light rain/snow, 4: heavy rain/snow)
Temperature in °C: temp
Perceived temperature in °C: atemp
Humidity: hum
Wind speed: windspeed
Number of casual users: casual
Number of registered users: registered
Count of rental bikes: cnt

As the table shows, the data covers every hour of every day: 17,520 hours in total, with 17 columns per hour.

Now let’s reshape the data to have 3 axes—day, hour, and then our 17 columns:

daily_bikes=bikes.view(-1,24,bikes.shape[1])
daily_bikes.shape


We use -1 as a placeholder for “however many indexes are left, given the other dimensions and the original number of elements.”
-1 tells view to compute that dimension's size automatically from the other dimensions, keeping the total number of elements constant. Here it resolves to 730: the original 17,520 hourly rows are grouped into 730 blocks of 24 hours each (17,520 hours / 24 hours per day = 730 days).
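As a tiny illustration (not from the original text) of how view resolves -1:

import torch

t = torch.arange(24)        # 24 elements
t.view(-1, 4).shape         # torch.Size([6, 4]): -1 is resolved to 24 / 4 = 6
t.view(-1, 6, 2).shape      # torch.Size([2, 6, 2]): -1 is resolved to 24 / (6 * 2) = 2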

daily_bikes.stride()

The 408 is the stride along the day axis: each day consists of 24 hours with 17 values per hour, so consecutive days are 17 × 24 = 408 elements apart in storage.
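The whole stride tuple can be derived the same way; a quick check on a fresh contiguous tensor of the same shape (a sketch, independent of the actual data):

torch.zeros(730, 24, 17).stride()   # (408, 17, 1): 408 elements per day, 17 per hour, 1 per column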

As you learned in the previous chapter, calling view on a tensor returns a new tensor that changes the number of dimensions and the striding information, without changing the storage. This means we can rearrange our tensor at basically zero cost, because no data will be copied.
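One way to convince ourselves that no data was copied (a small sketch; data_ptr() returns the address of a tensor's first element in its underlying storage):

bikes.data_ptr() == daily_bikes.data_ptr()   # True: both tensors share the same storage
daily_bikes[0, 0, 5] == bikes[0, 5]          # the same element seen through the two views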

We see that the rightmost dimension is the number of columns in the original dataset. Then, in the middle dimension, we have time, split into chunks of 24 sequential hours. In other words, we now have N sequences of L hours in a day, for C channels.

After the view we have N blocks of L rows with C values each, i.e. our time series dataset is a three-dimensional tensor of shape N × L × C. C holds our 17 channels, L is the 24 hours of the day, and N is the number of days, 730.

To get to our desired N × C × L ordering, we need to transpose the tensor:

daily_bikes=daily_bikes.transpose(1,2)
daily_bikes.shape,daily_bikes.stride()
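The expected outcome, derived from the shape and strides above rather than copied from the original output: the day stride stays at 408 while the hour and channel strides swap, and the transposed view is no longer contiguous:

# shape becomes torch.Size([730, 17, 24]), stride becomes (408, 1, 17)
daily_bikes.is_contiguous()   # False: transpose only rearranges stride metadata, not storage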

Now let’s apply some of the techniques we learned earlier to this dataset.

In order to make it easier to render our data, we’re going to limit ourselves to the first day for a moment. We initialize a zero-filled matrix with a number of rows equal to the number of hours in the day and number of columns equal to the number of weather levels:

first_day=bikes[:24].long()
first_day.shape


If we decided to treat it as categorical, we would turn the variable into a one-hot-encoded vector and concatenate the resulting columns with the dataset. We focus on the weather situation column, which takes values from 1 to 4, so the one-hot matrix gets 4 columns:

weather_onehot=torch.zeros(first_day.shape[0],4)
weather_onehot


first_day[:,9]

Then we scatter ones into our matrix according to the corresponding level at each row.

weather_onehot.scatter_(dim=1,index=first_day[:,9].unsqueeze(1).long()-1,value=1.0)
# Decreases the values by 1 because weather situation ranges from 1 to 4, while indices are 0-based
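As a cross-check (an alternative not used in the text above), torch.nn.functional.one_hot builds the same matrix in one call from 0-based class indices:

import torch.nn.functional as F

weather_onehot_alt = F.one_hot(first_day[:, 9] - 1, num_classes=4).float()  # first_day is already a LongTensor
torch.equal(weather_onehot_alt, weather_onehot)   # True, given the scatter_ call above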

Last, we concatenate our matrix to our original dataset using the cat function.
Let’s look at the first of our results:

torch.cat((bikes[:24],weather_onehot),1)[:1]

bikes[:24] takes all of the data for the first 24 hours.
torch.cat concatenates bikes[:24] (24×17) and weather_onehot (24×4) along dimension 1, the column dimension.
Note: concatenating along the column dimension adds columns, i.e. the two tensors are joined side by side, giving a 24×21 result.

The trailing [:1] selects just row 0 of the concatenated tensor.
We could have done the same with the reshaped daily_bikes tensor.

daily_bikes.shape


daily_weather_onehot=torch.zeros(daily_bikes.shape[0],4,daily_bikes.shape[2])
daily_weather_onehot[0]


daily_weather_onehot.scatter_(1,daily_bikes[:,9,:].long().unsqueeze(1)-1,1.0)
daily_weather_onehot[0]

daily_weather_onehot[0] holds the one-hot encoding for the 24 hours of the first day: it is a 4×24 matrix in which row i contains 1.0 in every hour (column) whose weathersit value is i + 1. For example, the first row marks the hours when weathersit is 1, and the second row marks the hours when weathersit is 2.
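A quick sanity check (a sketch, not in the original): every hour has exactly one weather level, so summing the four one-hot rows over the channel dimension should give 1.0 for every day and hour:

daily_weather_onehot.sum(dim=1)                   # shape (730, 24)
(daily_weather_onehot.sum(dim=1) == 1.0).all()    # tensor(True): exactly one level per hour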


daily_bikes.shape,daily_weather_onehot.shape


daily_bikes=torch.cat((daily_bikes,daily_weather_onehot),dim=1)
daily_bikes[0]

After this concatenation, daily_bikes[0] has 21 rows: the original 17 channels followed by the 4 one-hot weather rows.

We mentioned earlier that this is not the only way to treat our “weather situation” variable. Indeed, its labels have an ordinal relationship, so we could pretend they are special values of a continuous variable. We could just transform the variable so that it runs from 0.0 to 1.0 (data normalization):

Method 1: Subtract the minimum value and divide by the range
The range is the difference between the maximum and the minimum of the data.
For example, weathersit can take the values 1 to 4 (the values shown above are 1 to 3), so we subtract the minimum 1 and divide by the range 3:

daily_bikes[:,9,:]=(daily_bikes[:,9,:]-1.0)/3

This also limits the values to the range 0 to 1: the weathersit values 1, 2, 3, 4 map to 0, 1/3, 2/3, 1.

The same approach can be applied to other columns; for example, here we rescale the temp channel (index 10):

temp=daily_bikes[:,10,:]
temp_min=torch.min(temp)
temp_max=torch.max(temp)
daily_bikes[:,10,:]=(daily_bikes[:,10,:]-temp_min)/(temp_max-temp_min)

Method 2: Subtract the mean and divide by the standard deviation

temp=daily_bikes[:,10,:]
daily_bikes[:,10,:]=(daily_bikes[:,10,:]-torch.mean(temp))/torch.std(temp)

Rescaling variables to the [0.0, 1.0] interval or the [-1.0, 1.0] interval is something we’ll want to do for all quantitative variables, like temperature (column 10 in our dataset). We’ll see why later; for now, let’s just say that this is beneficial to the training process.
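If we wanted to standardize all of the continuous channels in one go, a small loop like the following would do. The indices for atemp, hum, and windspeed are an assumption based on the standard UCI column order; only temp (index 10) is confirmed in the text above:

# Assumed channel indices: 10 temp, 11 atemp, 12 hum, 13 windspeed (UCI column order)
for c in (10, 11, 12, 13):
    channel = daily_bikes[:, c, :]
    daily_bikes[:, c, :] = (channel - torch.mean(channel)) / torch.std(channel)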

We’ve built another nice dataset, and we’ve seen how to deal with time series data.

Summary:
This article mainly introduces how to deal with time series data. In the previous section we covered how to represent data organized in flat tables: each row is independent and their order does not matter, or, equivalently, no column encodes information about which rows come before which.

Next, we introduce a new dataset, the Washington DC bike-sharing system dataset. This dataset reports the number of bicycles rented per hour in the capital's bike-sharing system between 2011 and 2012, along with weather and seasonal information. We transform this flat 2D dataset into a 3D time series dataset. A time series dataset is a three-dimensional tensor where one dimension represents the number of days, another dimension represents the hour of the day, and a third dimension represents the different columns of the data (e.g. weather, temperature, etc.).

To achieve this, we first load the data and convert it to a tensor in which the rows for each hour are stacked one after another. The view function then reshapes it into an N×L×C tensor, where N is the number of samples (days), C is the number of data columns, and L is the number of hours per day. Finally, the transpose function rearranges the data into the N×C×L form.

Next, we dealt with the weather information in the time series, which consists of ordered discrete values. We converted it into a one-hot encoding and concatenated it with the original dataset so that a neural network can process it.

Finally, we mentioned two ways of normalizing the data, which will help during training.


Origin blog.csdn.net/weixin_45825865/article/details/131988297