Time Series Classification


When I first encountered the concept of time series classification, my initial thought was: how do we classify time series? What does the data for time series classification look like?

As you can imagine, time series classification data is different from regular classification problems because the attributes have an ordered sequence. Let's look at some time series classification use cases to understand this difference.

1) ECG/EEG signal classification

An electrocardiogram (ECG) records the electrical activity of the heart and is widely used to diagnose various heart problems. ECG signals are captured with external electrodes. For example, consider the signal samples below, each representing the electrical activity of a heartbeat: the image on the left shows a normal heartbeat, while the image on the right shows a myocardial infarction.
[Figure: ECG traces of a normal heartbeat (left) and a myocardial infarction (right)]

  • The data collected from the electrodes forms a time series, and the signals can be classified into different categories. We can likewise classify electroencephalogram (EEG) signals, which record the electrical activity of the brain.

2) Image data

Images can also appear in a time-series format. Consider the following scenario:

The crops grown in a particular field depend on weather conditions, soil fertility, water availability, and other external factors. Suppose photographs of the field are taken daily for 5 years and labeled with the names of the crops grown there. The images in this dataset are taken at fixed time intervals and have a definite order, which is an important factor in classifying them.

3) Motion sensor data classification

Sensors generate high-frequency data that can identify the movement of objects within their range. By setting multiple wireless sensors and observing the changes in the signal strength of the sensors, the moving direction of the object can be identified.

Let's take the problem of "indoor user motion prediction" as an example.

In this challenge, multiple motion sensors are placed in different rooms, and the goal is to identify whether a person has moved within the rooms based on the frequency data captured from these sensors. There are four motion sensors in total (A1, A2, A3, A4) distributed across two rooms. The image below illustrates where the sensors are located in each room. The two-room setup is replicated in three different room groups (group1, group2, group3).
[Figure: sensor placement (A1–A4) in the two rooms]

[Figure: the three room groups (group1, group2, group3)]

Traditional methods

Global-feature classification algorithms use the complete time series as the feature and classify by computing the similarity between whole series, usually combining a distance measure with a 1-NN classifier. The main research direction for this family of methods is the distance function used to measure the similarity of complete time series.

Typical global-feature algorithm: DTW

If we allow one point of a sequence to correspond to multiple consecutive points of another sequence (equivalent to prolonging the duration of the tone represented by that point), and then compute the sum of the distances between corresponding points, we obtain the DTW (dynamic time warping) algorithm. Allowing a point at one moment in a sequence to match points at several consecutive moments in another sequence is what is called time warping.

[Figure: DTW alignment between two sequences]
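As a concrete illustration, here is a minimal sketch of the classic DTW dynamic-programming recurrence (a plain O(nm) implementation with an absolute-difference point cost; production libraries add windowing constraints and other optimizations):

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences.

    D[i][j] = cost of aligning a[:i] with b[:j]; each cell adds the
    local point cost to the cheapest of the three allowed moves
    (match, stretch a, stretch b)."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # a[i-1] repeats
                                 D[i][j - 1],      # b[j-1] repeats
                                 D[i - 1][j - 1])  # one-to-one match
    return D[n][m]
```

Note how warping absorbs timing differences: `dtw_distance([0, 0, 1, 2], [0, 1, 2])` is 0, because the single `0` in the second sequence is allowed to match both leading zeros of the first.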

Typical global-feature algorithm: difference distance

  • The difference distance method computes the first-order difference of the original time series, then measures the distance between the difference sequences of the two series; this is the difference distance.
  • The method uses the difference distance as a complement to the distance on the original sequences, making it a component of the final distance function.
  • By combining the original series in the time domain with the first-order difference series in the difference domain, the method improves classification performance.

The main research direction is how to reasonably combine the original sequence and the difference sequence. The evolution process of the difference distance method is shown in the figure.

[Figure: evolution of the difference distance method]
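A minimal sketch of the idea (the Euclidean base distance and the fixed weight `alpha` are illustrative assumptions; the published variants differ precisely in how the two distances are combined, which is the research direction mentioned above):

```python
import numpy as np

def difference_distance(a, b, alpha=0.5):
    """Weighted combination of the distance between two raw series
    and the distance between their first-order difference series."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    d_raw = np.linalg.norm(a - b)                      # time-domain distance
    d_diff = np.linalg.norm(np.diff(a) - np.diff(b))   # difference-domain distance
    return alpha * d_raw + (1.0 - alpha) * d_diff
```

The difference term captures shape: two series that differ only by a constant vertical shift have identical difference sequences, so `d_diff` is 0 and only the raw-distance term penalizes them.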

Local features

Local-feature classification algorithms use a subset of subsequences of the time series as features for classification. The key to this family of algorithms is finding local features that can discriminate between classes. Because the subsequences are short, the resulting classifier is fast, but finding the local features takes a certain amount of time.

Typical local-feature algorithm: intervals

The interval method in the local-feature family divides the time series into several intervals and extracts features from each interval.

  • This type of method is suited to long, noisy series in which the discriminative subsequences occur at phase-dependent (roughly fixed) positions.

The development process of interval-based time series classification algorithm is shown in the figure.

[Figure: development of interval-based time series classification algorithms]

TSF

TSF (Time Series Forest) is an ensemble learning algorithm for time series classification. It converts time series data into feature vectors and classifies them with a random forest.

TSF avoids the huge feature space of all possible intervals by using a random forest with per-interval statistics as features. Training a tree involves choosing √m random intervals (for series of length m), computing the mean, standard deviation and slope of each series over each random interval, and then growing the tree on the resulting 3√m features.
The main steps of the TSF algorithm are as follows:

Feature extraction: convert raw time series data into feature vectors. Commonly used feature extraction methods include Fourier transform, wavelet transform, etc.

Data set division: Divide the extracted feature vectors into several subsets.

Randomly select a subset: Randomly select a part of the divided subset for training.

Randomly select features: Randomly select a part of features from the feature vector for training.

Build a decision tree: Build a decision tree model based on selected subsets and features.

Integrating decision trees: Repeat steps 3-5 to build multiple decision trees and integrate them into a random forest model.

Classify: Use the built random forest model to classify new time series data.
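The interval-statistics idea at the core of these steps can be sketched as follows. This is a simplified illustration using scikit-learn's `RandomForestClassifier` on hand-rolled interval features, not the reference TSF implementation (packages such as sktime ship a faithful one); the toy dataset and the fixed interval widths are assumptions made for the example:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def interval_features(X, intervals):
    """Mean, standard deviation and slope of each series over each interval."""
    feats = []
    for start, end in intervals:
        seg = X[:, start:end]
        t = np.arange(end - start)
        # slope = leading coefficient of a degree-1 least-squares fit
        slope = np.array([np.polyfit(t, row, 1)[0] for row in seg])
        feats.extend([seg.mean(axis=1), seg.std(axis=1), slope])
    return np.column_stack(feats)

rng = np.random.default_rng(0)
n, m = 200, 50                        # 200 series of length 50
X = rng.normal(size=(n, m))
y = rng.integers(0, 2, n)
X[y == 1] += np.linspace(0, 2, m)     # class 1 carries a rising trend

# sqrt(m) random intervals of width >= 3, as described above
k = int(np.sqrt(m))
starts = rng.integers(0, m - 3, k)
widths = rng.integers(3, 10, k)
intervals = [(s, min(s + w, m)) for s, w in zip(starts, widths)]

Xf = interval_features(X, intervals)  # shape (n, 3 * k)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xf, y)
```

The slope feature is what separates the two classes here: class 1's rising trend produces consistently positive per-interval slopes, which the forest picks up easily.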

Advantages of the TSF algorithm:

  1. It is able to handle large-scale time series data and has good classification performance.
  2. It can reduce computation by randomly selecting subsets and features, and improve classification accuracy by integrating multiple decision trees.

However, the TSF algorithm also has some limitations:
1) It is sensitive to the length of time series data, and shorter time series may lead to a decrease in classification performance.
2) In addition, the TSF algorithm has strong assumptions about the distribution of time series data. If the data does not meet these assumptions, the performance of the algorithm may be affected.

Shapelet

A shapelet is a subsequence of a time series that constitutes the most distinctive feature of that series (like trend and periodic components, a shapelet is itself a component of the time series). Shapelets were proposed mainly to address the shortcomings of early KNN-based approaches to time series classification.

  • KNN
    • In time series classification, the idea of KNN is simple: the values of a sample's m time steps are treated as its m features, and KNN is run on them. Of course, conventional distance measures used this way discard the sequential-dependence information of the time series itself, so DTW, a distance formula designed for time series, is used instead to measure the distance between series.

The idea behind shapelets is simple and intuitive, reduces computational overhead, and is highly interpretable. The original paper gives a concrete example:

[Figure: two leaf outlines converted to time series, with the shapelet highlighted in red]

We convert the outline of each leaf into time series data (the coordinates of the outline points lie in an x-y plane), and then we need to classify the time series corresponding to these two leaves. As the figure shows, the shapelet is the red part of the blue curve, the most distinctive feature of the left leaf's series. We can use this subsequence in place of the left leaf's full series, and then apply a KNN classifier based on DTW distance.

The series for the two leaves are similar at many time steps, so a distance computed over the whole series is dominated by the many similar points. By extracting only the most distinctive part, the model can focus on the significant differences between samples.

  • If the DTW distance between the most distinctive features of the two leaves is large, the leaves clearly differ, which is easy to understand.
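The core primitive in shapelet methods is the distance from a shapelet to a full series: the minimum distance between the shapelet and any equal-length subsequence of the series. A minimal sketch, using plain Euclidean distance for simplicity (the discussion above pairs shapelets with DTW instead):

```python
import numpy as np

def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance between a shapelet and every
    equal-length sliding window of the series."""
    series = np.asarray(series, dtype=float)
    s = np.asarray(shapelet, dtype=float)
    L = len(s)
    return min(np.linalg.norm(series[i:i + L] - s)
               for i in range(len(series) - L + 1))
```

A series that actually contains the shapelet has distance 0 to it; classification then thresholds or feeds these distances to a downstream classifier.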


Multi-label classification problem

Handling label-imbalance weights in PyTorch:

Multi-label classification imbalance problem: https://discuss.pytorch.org/t/multi-label-multi-class-class-imbalance/37573
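One common remedy discussed in that thread is to weight each label's positive examples by the inverse of their frequency, e.g. via the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss`. A sketch of the weight computation (the NumPy part is self-contained; the loss construction shown in the comment assumes PyTorch is installed):

```python
import numpy as np

def pos_weights(Y):
    """Per-label weight for positive examples: negatives / positives.
    Y is a binary (n_samples, n_labels) label matrix."""
    pos = Y.sum(axis=0)
    neg = Y.shape[0] - pos
    return neg / np.maximum(pos, 1)   # guard against labels with no positives

# Usage with PyTorch (assumed installed):
#   import torch
#   criterion = torch.nn.BCEWithLogitsLoss(
#       pos_weight=torch.tensor(pos_weights(Y), dtype=torch.float32))
```

Rare labels get weights well above 1, so the loss penalizes missed positives for them more heavily, counteracting the imbalance.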

Multi-class:
[Figure: multi-class classification, where each sample has exactly one label]

Multi-label:
[Figure: multi-label classification, where a sample can carry several labels at once]

First, let me explain what a multi-label text classification problem is.

Here we combine a competition example on Kaggle.

The name of the competition is: Toxic Comment Classification Challenge


Keras sample weights for imbalanced multi-label datasets:

from sklearn.utils import class_weight

# `train` is assumed to be a DataFrame containing the six toxicity labels
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values

# one weight per sample, balancing over the observed label combinations
sample_weights = class_weight.compute_sample_weight('balanced', y)
model.fit(X_t, y, batch_size=batch_size, epochs=epochs,
          validation_split=0.1, sample_weight=sample_weights,
          callbacks=callbacks_list)


Model deployment related

Zhihu article: https://zhuanlan.zhihu.com/p/195750736


Origin blog.csdn.net/RandyHan/article/details/131763991