Time series data preprocessing

Time series data is everywhere, and before we can perform any time series analysis, the data must be preprocessed. The preprocessing techniques applied to a time series have a significant impact on the accuracy of any model built on it.

In this article, we will mainly discuss the following points:

  • Definition of time series data and its importance.

  • Preprocessing steps for time series data.

  • Structuring time series data, imputing missing values, denoising, and detecting the outliers present in a dataset.

First, let's understand the definition of a time series:

A time series is a series of uniformly spaced observations recorded over a specific time interval.

An example of a time series is the price of gold. In this case, our observations are gold prices collected over a period of time at a fixed interval. The time unit can be minutes, hours, days, years, and so on, but the time difference between any two consecutive samples is always the same.
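To make the definition concrete, here is a minimal sketch (with made-up prices) of a uniformly spaced series in pandas:

```python
import pandas as pd

# Hypothetical daily gold prices: one observation per day
dates = pd.date_range(start='2023-01-01', periods=5, freq='D')
gold_price = pd.Series([1825.0, 1830.5, 1828.2, 1835.1, 1840.0], index=dates)

# The gap between any two consecutive timestamps is constant
diffs = gold_price.index.to_series().diff().dropna()
print(diffs.unique())  # a single value: 1 day
```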

In this article, we will walk through common time series preprocessing steps and the common data problems they address, all of which should be handled before diving into the data modeling part.

Time series data preprocessing

Time series data contains a lot of information, but it is often not directly visible. Common problems associated with time series are out-of-order timestamps, missing values (or timestamps), outliers, and noise in the data. Of these, handling missing values is the most difficult, because traditional imputation (a technique for dealing with missing data by replacing missing values in a way that preserves most of the information) methods are not applicable to time series data. To walk through this preprocessing on real data, we will use Kaggle's Air Passengers dataset.

Time series data usually arrives in an unstructured format: timestamps may be mixed up and not ordered correctly. Also, in most cases the datetime column has a default string datatype, and you must convert it to a datetime datatype before applying any operations to it. Let's apply this to our dataset:

import pandas as pd

passenger = pd.read_csv('AirPassengers.csv')

# Convert the Date column from string to datetime, then sort chronologically
passenger['Date'] = pd.to_datetime(passenger['Date'])
passenger.sort_values(by=['Date'], inplace=True, ascending=True)


Missing values in time series

Handling missing values in time series data is a challenging task. Traditional imputation techniques are not suitable for time series data because the order in which values arrive matters. To solve this problem, we can use interpolation:

Interpolation is a commonly used imputation technique for time series missing values. It estimates a missing data point from the known data points on either side of it. It is simple and the most intuitive approach. The following variants can be used when working with time series data:

  • Time-based interpolation

  • Spline interpolation

  • Linear interpolation

Let's see what our data looks like before imputation:

from matplotlib.pyplot import figure
import matplotlib.pyplot as plt

figure(figsize=(12, 5), dpi=80, linewidth=10)
plt.plot(passenger['Date'], passenger['Passengers'])
plt.title('Air Passengers Raw Data with Missing Values')
plt.xlabel('Years', fontsize=14)
plt.ylabel('Number of Passengers', fontsize=14)
plt.show()

620770f266fcdf9e3c7695c208cdc09a.png

Let's see the results of the above three methods:

passenger['Linear'] = passenger['Passengers'].interpolate(method='linear')
passenger['Spline order 3'] = passenger['Passengers'].interpolate(method='spline', order=3)
# method='time' requires a DatetimeIndex, so interpolate on a Date-indexed view
passenger['Time'] = passenger.set_index('Date')['Passengers'].interpolate(method='time').values

methods = ['Linear', 'Spline order 3', 'Time']

from matplotlib.pyplot import figure
import matplotlib.pyplot as plt
for method in methods:
    figure(figsize=(12, 4), dpi=80, linewidth=10)
    plt.plot(passenger["Date"], passenger[method])
    plt.title('Air Passengers Imputation using: ' + method)
    plt.xlabel("Years", fontsize=14)
    plt.ylabel("Number of Passengers", fontsize=14)
    plt.show()


All three methods give decent results. They make the most sense when the missing-value window (the width of the gap) is small; when several consecutive values are missing, these methods have a much harder time estimating them.
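To see this gap-width effect on a toy series (values made up), note that pandas' `interpolate` can also cap how many consecutive missing values get filled via the `limit` parameter:

```python
import numpy as np
import pandas as pd

# Toy daily series with a gap of three consecutive missing values
idx = pd.date_range('2023-01-01', periods=8, freq='D')
s = pd.Series([10.0, 12.0, np.nan, np.nan, np.nan, 20.0, 22.0, 24.0], index=idx)

# Linear interpolation fills the whole gap with evenly spaced estimates
filled = s.interpolate(method='linear')
print(filled.tolist())  # [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0]

# limit caps how many consecutive NaNs are filled
partial = s.interpolate(method='linear', limit=1)
print(int(partial.isna().sum()))  # 2
```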

Time series denoising

Noise in a time series can cause serious problems, so noise is generally removed before any model is built. The process of minimizing noise is called denoising. Here are some methods commonly used to remove noise from a time series:

Rolling average

A rolling average is the average of a window of previous observations, where a window is a run of consecutive values from the time series. The mean is computed over each successive window, which can greatly help minimize noise in time series data.
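On a small made-up series, the idea looks like this: each output value is the mean of the last three observations, and the first two positions are undefined because a full window is not yet available.

```python
import pandas as pd

s = pd.Series([3.0, 5.0, 4.0, 6.0, 8.0, 7.0])
rolled = s.rolling(3).mean()
print(rolled.tolist())  # [nan, nan, 4.0, 5.0, 6.0, 7.0]
```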

Let's apply a rolling average on the Google stock price:

# Assumes google_stock_price is a DataFrame with 'Date' and 'Open' columns
rolling_google = google_stock_price['Open'].rolling(20).mean()
plt.plot(google_stock_price['Date'], google_stock_price['Open'])
plt.plot(google_stock_price['Date'], rolling_google)
plt.xlabel('Date')
plt.ylabel('Stock Price')
plt.legend(['Open','Rolling Mean'])
plt.show()


Fourier transform

The Fourier transform can help remove noise by converting time series data to the frequency domain, where noise frequencies can be filtered out. An inverse Fourier transform then recovers the filtered time series. Let's use this approach to denoise the Google stock price.
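The fft_denoiser helper used below is not defined in the original article; a minimal sketch of such a function (a hypothetical name and signature, using NumPy's FFT) could work like this: transform to the frequency domain, zero out components whose power spectral density falls below a threshold, and transform back.

```python
import numpy as np

def fft_denoiser(x, threshold, to_real=True):
    """Sketch: remove low-power frequency components from a 1-D signal."""
    n = len(x)
    fhat = np.fft.fft(x, n)            # frequency-domain representation
    psd = np.abs(fhat) ** 2 / n        # power spectral density per frequency
    fhat = fhat * (psd > threshold)    # zero out low-power (noise) frequencies
    cleaned = np.fft.ifft(fhat)        # inverse transform back to the time domain
    return cleaned.real if to_real else cleaned

# Demo on a noisy sine wave: the strong sine frequency survives
# the thresholding, the broadband noise does not
t = np.arange(400)
clean = np.sin(2 * np.pi * t / 50)
noisy = clean + np.random.default_rng(0).normal(0, 0.3, 400)
denoised = fft_denoiser(noisy, 5.0)
```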

# Assumes 'value' holds the first 300 'Open' prices and 'time' the matching dates
denoised_google_stock_price = fft_denoiser(value, 0.001, True)
plt.plot(time, google_stock['Open'][0:300])
plt.plot(time, denoised_google_stock_price)
plt.xlabel('Date', fontsize=13)
plt.ylabel('Stock Price', fontsize=13)
plt.legend(['Open', 'Denoised: 0.001'])
plt.show()


Outlier detection in time series

Outliers in a time series are sudden spikes or dips in a trendline. There can be a variety of factors that cause outliers. Let's take a look at the available methods for detecting outliers:

Methods based on rolling statistics

This method is the most intuitive and works for almost all types of time series. In this approach, upper and lower bounds are created from specific statistical measures, such as mean and standard deviation, Z and T scores, and percentiles of the distribution. For example, we can define upper and lower bounds as:

upper bound = rolling mean + 3 × rolling standard deviation
lower bound = rolling mean - 3 × rolling standard deviation

It is not advisable to take the mean and standard deviation of the entire series, because in that case the bounds would be static. Instead, the bounds should be computed on a rolling window basis: take a window of consecutive observations, compute the bounds from it, then slide the window forward. This is an efficient and simple outlier detection method.
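A minimal sketch of this idea on synthetic data (the 3-standard-deviation multiplier is a common but arbitrary choice):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
values = pd.Series(rng.normal(100, 5, 200))
values.iloc[120] = 160  # inject an obvious spike

# Rolling bounds: mean +/- 3 standard deviations over a sliding window
window = 20
roll_mean = values.rolling(window, min_periods=1).mean()
roll_std = values.rolling(window, min_periods=1).std()
upper = roll_mean + 3 * roll_std
lower = roll_mean - 3 * roll_std

# Points outside the rolling bounds are flagged as outliers
outliers = values[(values > upper) | (values < lower)]
print(outliers.index.tolist())  # includes 120
```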

Isolation Forest

As the name suggests, Isolation Forest is a decision-tree-based machine learning algorithm for anomaly detection. It works by isolating data points on a given set of features using decision tree partitions. In other words, it takes a sample from the dataset and builds trees on that sample until every point is isolated. To isolate a data point, the algorithm repeatedly picks a random feature and a random split value between that feature's minimum and maximum. This random partitioning produces noticeably shorter paths in the trees for anomalous data points, distinguishing them from the rest of the data.
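A sketch using scikit-learn's IsolationForest on synthetic data (contamination, the expected fraction of anomalies, is set to 1% here purely for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
series = rng.normal(100, 5, 300)
series[150] = 200  # inject an anomaly

X = series.reshape(-1, 1)  # one feature: the observed value
iso = IsolationForest(contamination=0.01, random_state=0)
labels = iso.fit_predict(X)  # -1 = anomaly, 1 = normal

anomalies = np.where(labels == -1)[0]
print(anomalies)  # includes index 150
```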


K-means clustering

K-means clustering is an unsupervised machine learning algorithm that is often used to detect outliers in time series data. The algorithm looks at the data points in the dataset and groups similar data points into K clusters. Anomalies are distinguished by measuring the distance of a data point to its nearest centroid. If the distance is greater than a certain threshold, the data point is marked as anomalous. The K-Means algorithm uses Euclidean distance for comparison.


Possible interview questions

If you list a time series project on your resume, an interviewer may ask questions like these:

  • What are the methods for preprocessing time series data, and how are they different from standard imputation methods?

  • What does time series window mean?

  • Have you heard of Isolation Forest? If yes, can you explain how it works?

  • What is Fourier Transform and why do we need it?

  • What are the different ways to fill missing values in time series data?

Summary

In this article, we examined some common time series preprocessing techniques. We started by sorting the time series observations, then looked at various missing value imputation techniques; because we are dealing with an ordered sequence of observations, time series imputation differs from traditional imputation. We also applied noise removal techniques to the Google stock price dataset, and finally discussed several time series outlier detection methods. Applying all of these preprocessing steps ensures high-quality data, ready for building complex models.

Author: Shashank Gupta


Origin blog.csdn.net/qq_33431368/article/details/123587578