Getting Started with pyts for Time Series Classification --- The K-Nearest Neighbors Algorithm and Parameter Tuning Tips (2)

Introduction

This article picks up where the previous one (2021.11.05) left off. Having covered simple feature extraction for time series with pyts, we now turn to classification algorithms, starting with KNN (k-nearest neighbors). Because of some annoying things at work I haven't written in a while, and anyone following me may have forgotten why they did; my apologies, so let's get straight to the point. First, a general picture of the KNN classifier:

1. It is not the K-means algorithm (K-means is an unsupervised clustering algorithm; beginners often confuse the two).
2. It is a supervised learning algorithm.
3. K is the number of nearest neighbors that vote; the sample is assigned to whichever class receives the most votes. If K=1, the sample simply takes the class of its single closest neighbor.
4. There are many definitions of distance. The everyday "distance between two points in space" is the Euclidean distance.
5. KNN is sensitive to the local structure of the data set.
6. In practice, feature extraction (or dimensionality reduction) is usually applied to the data set first, which can improve the stability and accuracy of the algorithm; the Haar cascade features OpenCV uses for face detection are one example.

[Figure: example of k-nearest neighbors classification, from Wikipedia]
The picture above is from Wikipedia. The green circle is the sample to be classified. With K=3, the three nearest points are selected (inside the solid-line circle): one blue square and two red triangles, so the sample is classified as a red triangle.
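To make the voting rule concrete, here is a minimal sketch of k-NN classification in plain NumPy (not using pyts); the toy points and labels are invented purely for illustration.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify a single point x by majority vote among its k nearest training points."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                     # indices of the k closest points
    votes = Counter(y_train[nearest])                   # count the labels among those neighbors
    return votes.most_common(1)[0][0]                   # label with the most votes

# Toy data mimicking the figure: two triangles and one square near the query point.
X_train = np.array([[1.0, 1.0], [1.2, 0.9], [0.5, 2.5], [3.0, 3.0]])
y_train = np.array(['triangle', 'triangle', 'square', 'square'])
print(knn_predict(X_train, y_train, x=np.array([1.0, 1.1]), k=3))  # -> 'triangle'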

After this introduction you will probably notice that three parameters of the algorithm matter a great deal:

1. The number of neighbors (n_neighbors, i.e. K). The larger K is, the more stable the algorithm becomes, but some discriminative power is lost. In the extreme case where K equals the number of training samples, every point takes part in every vote, so every new sample receives exactly the same classification.

2. The choice of distance (metric). How the distance between points is measured determines which points count as "near" the sample. There are many classic distances, for example (a small NumPy sketch after this list illustrates the formulas):
Euclidean distance, sqrt(sum((x - y)^2)): the square root of the sum of squared differences.
Manhattan distance, sum(|x - y|): the sum of absolute differences.
Minkowski distance, (sum(w * |x - y|^p))^(1/p): p=1 gives the Manhattan distance and p=2 gives the Euclidean distance.

3. Weighting (weights). The idea behind weights is that nearby points should influence the classification of a sample more than distant ones. A common choice is 1/d, where d is the distance between a neighbor and the sample point.
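Here is a small NumPy sketch of these formulas, using made-up vectors; it only illustrates the definitions above, not how pyts computes them internally.

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))          # sqrt(1 + 4 + 9) ~= 3.742
manhattan = np.sum(np.abs(x - y))                  # 1 + 2 + 3 = 6
p = 3
minkowski = np.sum(np.abs(x - y) ** p) ** (1 / p)  # reduces to Manhattan at p=1, Euclidean at p=2

# Distance-based weights as in point 3: closer neighbors get larger votes (1/d scheme).
d = np.array([0.5, 1.0, 2.0])
weights = 1.0 / d                                  # array([2. , 1. , 0.5])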

Practice

from pyts.classification import KNeighborsClassifier
from pyts.datasets import load_gunpoint

# Load the GunPoint dataset and fit a KNN classifier with default parameters
X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
# score: 0.9133333333333333
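As a quick sanity check, the same score can be reproduced by hand via predict, assuming the usual scikit-learn-style estimator API that pyts follows:

import numpy as np

y_pred = clf.predict(X_test)      # predicted label for each test series
print(np.mean(y_pred == y_test))  # manual accuracy, same value as clf.score(...)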

That first run used only the default parameters and reached an accuracy of about 0.9133; next we will try to tune the parameters to improve it.
The default parameters, i.e. the values used when nothing is passed in, are as follows:
n_neighbors=1, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1.

The parameters mean the following:

n_neighbors=1: the number of neighbors K, an integer.
weights='uniform': no weighting by default; if set to 'distance', each neighbor is weighted by 1/d.
algorithm='auto': one of {'auto', 'ball_tree', 'kd_tree', 'brute'}, four methods for computing the nearest neighbors whose performance differs on large sample sets.
leaf_size=30: the leaf size passed to ball_tree or kd_tree; it affects the speed of tree construction and queries as well as performance.
metric='minkowski': the distance measure; other distances such as 'euclidean' or 'manhattan' can be passed in; see scikit-learn's DistanceMetric class for more.
p=2: the p of the Minkowski distance; p=2 is equivalent to the Euclidean distance, p=1 to the Manhattan distance.
metric_params=None: extra parameters required by some metrics; None by default.
n_jobs=1: the number of parallel jobs used for the neighbor search; -1 means use all available resources; the default is 1.
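As an illustration, and assuming these arguments are passed straight through to the underlying scikit-learn estimator, they can be set explicitly when constructing the classifier (the values below are arbitrary examples, not recommendations):

from pyts.classification import KNeighborsClassifier

clf_example = KNeighborsClassifier(
    n_neighbors=3,        # vote among the 3 nearest series
    weights='distance',   # weight each neighbor by 1/d
    metric='minkowski',   # distance family
    p=1,                  # p=1 -> Manhattan, p=2 -> Euclidean
    n_jobs=-1,            # use all cores for the neighbor search
)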

Now we will mainly vary the three parameters n_neighbors, weights and p and see how the results change:

from sklearn.model_selection import GridSearchCV  # grid search

# Define the parameter grid to search over
parameters = {
    'weights': ('uniform', 'distance'),
    'n_neighbors': [1, 2, 3, 4, 5],
    'p': [1, 2],
}
clf_grid = GridSearchCV(clf, parameters)  # wrap the estimator defined earlier with the parameter grid
clf_grid.fit(X_train, y_train)
clf_grid.cv_results_  # inspect the cross-validation results on the training set

We ran a grid search over the parameters to find the best combination. cv_results_ prints a long dictionary; here is part of it:

'params': [{'n_neighbors': 1, 'p': 1, 'weights': 'uniform'},
  {'n_neighbors': 1, 'p': 1, 'weights': 'distance'},
  {'n_neighbors': 1, 'p': 2, 'weights': 'uniform'},
  {'n_neighbors': 1, 'p': 2, 'weights': 'distance'},
  {'n_neighbors': 2, 'p': 1, 'weights': 'uniform'},
  {'n_neighbors': 2, 'p': 1, 'weights': 'distance'},
  {'n_neighbors': 2, 'p': 2, 'weights': 'uniform'},
  {'n_neighbors': 2, 'p': 2, 'weights': 'distance'},
  {'n_neighbors': 3, 'p': 1, 'weights': 'uniform'},
  {'n_neighbors': 3, 'p': 1, 'weights': 'distance'},
  {'n_neighbors': 3, 'p': 2, 'weights': 'uniform'},
  {'n_neighbors': 3, 'p': 2, 'weights': 'distance'},
  {'n_neighbors': 4, 'p': 1, 'weights': 'uniform'},
  {'n_neighbors': 4, 'p': 1, 'weights': 'distance'},
  {'n_neighbors': 4, 'p': 2, 'weights': 'uniform'},
  {'n_neighbors': 4, 'p': 2, 'weights': 'distance'},
  {'n_neighbors': 5, 'p': 1, 'weights': 'uniform'},
  {'n_neighbors': 5, 'p': 1, 'weights': 'distance'},
  {'n_neighbors': 5, 'p': 2, 'weights': 'uniform'},
  {'n_neighbors': 5, 'p': 2, 'weights': 'distance'}]

This long list enumerates every combination of the parameters you asked the search to cover; grid search simply trains and cross-validates a model for each combination to see which one performs best.

'rank_test_score': array([ 4,  4,  1,  1,  4,  4, 17,  1,  9,  9, 13, 13, 13,  9, 20,  9, 13,
         4, 19, 17])

These numbers are the ranking of the cross-validation scores of the combinations above. The third and fourth combinations (together with the eighth) share the best rank, and careful readers will also notice that the configurations with distance weighting generally rank better than those without. Since the third combination is identical to the default parameters, let's verify the fourth one on the test set:

clf = KNeighborsClassifier(n_neighbors=1,weights='distance',p=2)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
#0.91333333

You might ask: so nothing improved? What is the grid search for, then? Note that a good cross-validation result on the training set does not guarantee a good result on the test set, whereas a poor cross-validation result usually does translate into a poor test result, so grid search should be treated as a reference. Real-world machine learning is messier still: doing well on the test set does not mean doing well on new samples, but following this kind of basic common sense does improve generalization.

clf = KNeighborsClassifier(n_neighbors=1,weights='distance',p=1)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
#0.9533333333333334

In fact, simply switching the distance to Manhattan (p=1) raises the test-set accuracy considerably, to 0.9533. Keep in mind, though, that this is a small data set: the training set has shape (50, 150), i.e. 50 series of length 150.
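Finally, two small conveniences: assuming the standard scikit-learn GridSearchCV attributes, the winning configuration found above can be read off directly instead of scanning cv_results_ by hand, and the data-set shape quoted here can be confirmed in one line:

print(X_train.shape)          # (50, 150): 50 training series, each of length 150
print(clf_grid.best_params_)  # the parameter combination with the best cross-validation rank
print(clf_grid.best_score_)   # its mean cross-validation accuracy on the training set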

References

1. https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
2. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.DistanceMetric.html?highlight=distance#sklearn.metrics.DistanceMetric
3. https://pyts.readthedocs.io/en/stable/modules/classification.html
