Time series data is everywhere, and before we can perform time series analysis, we must preprocess the data. Time series preprocessing techniques have a significant impact on the accuracy of data modeling.
In this article, we will mainly discuss the following points:
Definition of time series data and its importance.
Preprocessing steps for time series data.
Building time series data, finding missing values, denoising features, and finding outliers present in the dataset.
First, let's understand the definition of a time series:
A time series is a sequence of observations recorded at uniform time intervals.
An example of a time series is the price of gold. In this case, our observations are gold prices collected over a period of time at a fixed time interval. The time unit can be minutes, hours, days, years, etc., but the time difference between any two consecutive samples is always the same.
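The gold price example can be sketched with pandas; the dates and prices below are made up purely for illustration:

```python
import pandas as pd

# Build a small uniformly spaced time series: one hypothetical gold price
# observation per day over ten days.
dates = pd.date_range(start="2021-01-01", periods=10, freq="D")
gold_price = pd.Series([1840, 1852, 1849, 1861, 1858,
                        1870, 1866, 1875, 1881, 1878], index=dates)

# The spacing between any two consecutive timestamps is constant:
deltas = gold_price.index.to_series().diff().dropna()
print(deltas.nunique())  # 1 -> a single, uniform time step
```

Because the index is uniformly spaced, the difference between consecutive timestamps takes exactly one value.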
In this article, we will walk through common time series preprocessing steps and the common problems with time series data that should be handled before diving into the data modeling part.
Time series data preprocessing
Time series data contains a lot of information, but it is often not immediately visible. Common problems associated with time series are unordered timestamps, missing values (or timestamps), outliers, and noise in the data. Of these, dealing with missing values is the most difficult, because traditional imputation methods (techniques that handle missing data by replacing missing values so as to preserve most of the information) do not apply to time series data. To demonstrate these preprocessing steps, we will use Kaggle's Air Passengers dataset.
Time series data usually arrives in an unstructured format, i.e. the timestamps may be mixed up and out of order. Also, in most cases the datetime column has the default string datatype, and you must convert it to a datetime datatype before applying any operations to it. Let's apply this to our dataset:
import pandas as pd
passenger = pd.read_csv('AirPassengers.csv')
passenger['Date'] = pd.to_datetime(passenger['Date'])
passenger.sort_values(by=['Date'], inplace=True, ascending=True)
Missing values in time series
Handling missing values in time series data is a challenging task. Traditional imputation techniques are not suitable for time series data because the order in which the values are received is important. To solve this problem, we have the following interpolation methods:
Interpolation is a commonly used time series missing value imputation technique. It helps to estimate the missing data point using the surrounding two known data points. This method is simple and the most intuitive. The following methods can be used when working with time series data:
Time-based interpolation
Spline interpolation
Linear interpolation
Let's see what our data looks like before imputation:
from matplotlib.pyplot import figure
import matplotlib.pyplot as plt
figure(figsize=(12, 5), dpi=80, linewidth=10)
plt.plot(passenger['Date'], passenger['Passengers'])
plt.title('Air Passengers Raw Data with Missing Values')
plt.xlabel('Years', fontsize=14)
plt.ylabel('Number of Passengers', fontsize=14)
plt.show()
Let's see the results of the above three methods:
passenger['Linear'] = passenger['Passengers'].interpolate(method='linear')
passenger['Spline order 3'] = passenger['Passengers'].interpolate(method='spline', order=3)
# Note: method='time' requires a DatetimeIndex, e.g. via passenger.set_index('Date')
passenger['Time'] = passenger['Passengers'].interpolate(method='time')
methods = ['Linear', 'Spline order 3', 'Time']
from matplotlib.pyplot import figure
import matplotlib.pyplot as plt
for method in methods:
    figure(figsize=(12, 4), dpi=80, linewidth=10)
    plt.plot(passenger['Date'], passenger[method])
    plt.title('Air Passengers Imputation using: ' + method)
    plt.xlabel('Years', fontsize=14)
    plt.ylabel('Number of Passengers', fontsize=14)
    plt.show()
All methods give decent results. These methods make more sense when the missing value window (the width of the missing data) is small. But if several consecutive values are missing, these methods have a harder time estimating them.
Time series denoising
Noise in a time series can cause serious problems, so noise is generally removed before any model is built. The process of minimizing noise is called denoising. Here are some methods commonly used to remove noise from a time series:
Rolling average
A rolling average is the average of a window of previous observations, where a window is a sequence of consecutive values from the time series. The mean is calculated for each successive window. This can greatly help minimize noise in time series data.
Let's apply a rolling average on the Google stock price:
import matplotlib.pyplot as plt

# google_stock_price is assumed to be a DataFrame with 'Date' and 'Open' columns
rolling_google = google_stock_price['Open'].rolling(window=20).mean()

plt.plot(google_stock_price['Date'], google_stock_price['Open'])
plt.plot(google_stock_price['Date'], rolling_google)
plt.xlabel('Date')
plt.ylabel('Stock Price')
plt.legend(['Open', 'Rolling Mean'])
plt.show()
Fourier transform
The Fourier transform can help remove noise: by transforming the time series into the frequency domain, we can filter out the noise frequencies. An inverse Fourier transform is then applied to recover the filtered time series. We will use the Fourier transform to denoise the Google stock price.
# fft_denoiser is a user-defined helper; time and value are assumed to hold
# the first 300 dates and opening prices from the Google stock data
denoised_google_stock_price = fft_denoiser(value, 0.001, True)

plt.plot(time, google_stock_price['Open'][0:300])
plt.plot(time, denoised_google_stock_price)
plt.xlabel('Date', fontsize=13)
plt.ylabel('Stock Price', fontsize=13)
plt.legend(['Open', 'Denoised: 0.001'])
plt.show()
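`fft_denoiser` is not a library function. A minimal sketch of what such a threshold-based denoiser might look like, assuming the second argument is a power-spectrum cutoff, demonstrated on a synthetic noisy sine wave:

```python
import numpy as np

def fft_denoiser(values, threshold, to_real=True):
    """Zero out frequency components whose power spectral density falls
    below `threshold`, then invert the FFT. A sketch, not necessarily
    the original author's implementation."""
    n = len(values)
    fft = np.fft.fft(values, n)
    psd = (fft * np.conj(fft)).real / n      # power spectral density
    fft_filtered = fft * (psd > threshold)   # keep only strong frequencies
    cleaned = np.fft.ifft(fft_filtered)
    return cleaned.real if to_real else cleaned

# Noisy sine wave: the low-frequency signal survives, the noise is damped.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 500)
noisy = np.sin(2 * np.pi * 5 * t) + 0.3 * rng.standard_normal(500)
denoised = fft_denoiser(noisy, threshold=10)
```

The sine component concentrates its power in two frequency bins, which pass the threshold, while the white noise spreads its power thinly across all bins and is filtered out.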
Outlier detection in time series
Outliers in a time series are sudden spikes or dips in a trendline. There can be a variety of factors that cause outliers. Let's take a look at the available methods for detecting outliers:
Methods based on rolling statistics
This method is the most intuitive and works for almost all types of time series. In this approach, upper and lower bounds are created from specific statistical measures, such as the mean and standard deviation, Z- and T-scores, or percentiles of the distribution. For example, we can define the upper bound as the rolling mean plus 3 times the rolling standard deviation, and the lower bound as the rolling mean minus 3 times the rolling standard deviation.
It is not advisable to take the mean and standard deviation of the entire series, because then the bounds would be static. The bounds should instead be created on a rolling window basis: take a set of consecutive observations, compute the bounds, then slide on to the next window. This is an efficient and simple outlier detection method.
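A sketch of rolling-statistics detection on synthetic data; the window size and the 3-sigma multiplier are illustrative choices:

```python
import numpy as np
import pandas as pd

# Synthetic series with two injected anomalies.
rng = np.random.default_rng(42)
series = pd.Series(rng.normal(100, 5, 365))
series.iloc[100] = 160   # inject an obvious spike
series.iloc[250] = 40    # and a dip

window = 30
rolling_mean = series.rolling(window, min_periods=1).mean()
rolling_std = series.rolling(window, min_periods=1).std()

# Rolling bounds: mean +/- 3 standard deviations.
upper = rolling_mean + 3 * rolling_std
lower = rolling_mean - 3 * rolling_std
outliers = series[(series > upper) | (series < lower)]
print(outliers.index.tolist())
```

Because the bounds are recomputed for every window, they adapt to local changes in the series rather than staying static.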
Isolation Forest
As the name suggests, Isolation Forest is a decision-tree-based machine learning algorithm for anomaly detection. It works by isolating data points on a given set of features using decision tree partitions. In other words, it takes a sample from the dataset and builds trees on that sample until every point is isolated. To isolate a data point, the algorithm repeatedly picks a random split value between the maximum and minimum of a feature. This random partitioning produces noticeably shorter paths in the tree for anomalous data points, distinguishing them from the rest of the data.
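A minimal sketch using scikit-learn's IsolationForest on synthetic univariate data; the contamination value is an assumed tuning choice:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic series with one injected anomaly at index 50.
rng = np.random.default_rng(0)
values = rng.normal(100, 5, 300)
values[50] = 170
X = values.reshape(-1, 1)   # sklearn expects a 2D feature matrix

# contamination is the assumed fraction of anomalies in the data.
clf = IsolationForest(contamination=0.01, random_state=0)
labels = clf.fit_predict(X)   # -1 = anomaly, 1 = normal
print(np.where(labels == -1)[0])
```

The injected spike sits far from the bulk of the data, so it is isolated in very few splits and flagged as an anomaly.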
K-means clustering
K-means clustering is an unsupervised machine learning algorithm that is often used to detect outliers in time series data. The algorithm looks at the data points in the dataset and groups similar data points into K clusters. Anomalies are distinguished by measuring the distance of a data point to its nearest centroid. If the distance is greater than a certain threshold, the data point is marked as anomalous. The K-Means algorithm uses Euclidean distance for comparison.
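A sketch of the distance-to-centroid approach with scikit-learn's KMeans on synthetic two-cluster data; the 3-sigma threshold rule is an illustrative choice:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two dense clusters of "normal" values plus one point far from both centers.
rng = np.random.default_rng(1)
values = np.concatenate([rng.normal(20, 2, 150),
                         rng.normal(80, 2, 150),
                         [55.0]])                # injected anomaly (index 300)
X = values.reshape(-1, 1)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Euclidean distance from each point to its nearest centroid.
distances = kmeans.transform(X).min(axis=1)

# Flag points whose distance exceeds an illustrative 3-sigma threshold.
threshold = distances.mean() + 3 * distances.std()
anomalies = np.where(distances > threshold)[0]
print(anomalies)
```

Points inside either cluster sit close to a centroid, while the injected point lies roughly midway between the two centroids and is flagged.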
Possible interview questions
If a candidate lists a time series project on their resume, an interviewer may ask questions such as:
What are the methods for preprocessing time series data, and how are they different from standard imputation methods?
What does a window mean in the context of time series?
Have you heard of Isolation Forest? If yes, can you explain how it works?
What is Fourier Transform and why do we need it?
What are the different ways to fill missing values in time series data?
Summary
In this article, we examined some common time series preprocessing techniques. We started by sorting the time series observations, then looked at various missing value imputation techniques. Because we are dealing with an ordered set of observations, time series imputation differs from traditional imputation techniques. We then applied some noise removal techniques to the Google stock price dataset, and finally discussed some time series outlier detection methods. All of these preprocessing steps help ensure high-quality data, ready for building complex models.
Author: Shashank Gupta