1. K-Nearest Neighbor Classifier (K-NN) (supervised learning)

1. Algorithm idea

The k-nearest neighbor algorithm, i.e. K-NN.
In layman's terms: given a new element x, draw a circle centered on x and let its radius grow (with distance measured by, e.g., Euclidean distance) until the circle contains K existing elements, where K is a hyperparameter that must be set manually. Then look at which categories those K elements belong to and apply a decision rule (such as majority vote): whichever category most of the K elements belong to is the category x is assigned to.
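To make the idea concrete, here is a minimal NumPy sketch of exactly that procedure (Euclidean distance plus majority vote); the sample points and labels are made up for illustration.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distance from x to every known point
    dists = np.linalg.norm(X_train - x, axis=1)
    # the K nearest points ("the circle grows until it contains K elements")
    nearest = np.argsort(dists)[:k]
    # majority vote among the K neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# toy data: two known classes (0 and 1) and a new point
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0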

Application scenario: there are two known categories, green pentagons and blue hexagons, and a new orange element x arrives. Which category should x be assigned to?
(Figure: the two known classes and the new orange point x.)

Three elements: distance measure, K value, decision rule

①Distance measure

In the space, elements that are closer are more likely to be in the same category because their similarity is higher.
There are many distance measurement methods, commonly used ones include Euclidean distance, Manhattan distance, cosine distance, etc.

Ⅰ. Euclidean distance

The distance between two points $x=(x_1,\dots,x_n)$ and $y=(y_1,\dots,y_n)$:
$d(x,y)=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}$
The same formula works in any number of dimensions; the calculation is identical.

Ⅱ. Manhattan distance

The sum of the absolute coordinate differences between two points:
$d(x,y)=\sum_{i=1}^{n}|x_i-y_i|$

Ⅲ. Cosine distance

The cosine of the angle between two vectors in vector space:
$\cos\theta=\dfrac{x\cdot y}{\|x\|\,\|y\|}$
(as a distance, the convention $1-\cos\theta$ is commonly used)
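A quick sketch of the three measures with NumPy, using two made-up vectors a and b:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(a - b))           # sum of absolute axis differences
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine of the angle
print(euclidean, manhattan, 1 - cos_sim)    # 1 - cos_sim is the cosine distance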

The purpose of the distance measure is to find which existing category the unknown element is closest to; the new element can then be assigned to that category.

②K value

The K value can be understood as the stopping condition; it is a hyperparameter that must be set manually.
Different K values can produce different, sometimes drastically different, results.
Cross-validation is usually used to determine the optimal K value, as sketched below.
For cross-validation, please refer to the blog posts: 10. Evaluation indicators and 2. Model evaluation method.
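As a minimal sketch of that idea, score each candidate K with 5-fold cross-validation and keep the best one; the iris data set is used here only as a stand-in for your own data:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in data; replace with your own X and y

# mean 5-fold cross-validation accuracy for each candidate K
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 16)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])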

③Decision rules

Common decision rules include: majority vote (the minority obeys the majority), weighted voting, etc.

Ⅰ. The minority obeys the majority

This is easy to understand: among the K neighbors, whichever category appears most often is the category the new element is assigned to.

Ⅱ. Weighted average

Here each of the K neighbors casts a vote weighted by its distance to the new element, so that closer neighbors count more. The weights are summed per category, and the category with the largest total wins, as in the sketch below.
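A minimal sketch of inverse-distance weighted voting (the rule behind the weights='distance' option described later), again with made-up toy data:

import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=3):
    # inverse-distance weighted vote: closer neighbors count more
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = {}
    for i in nearest:
        w = 1.0 / (dists[i] + 1e-12)   # small epsilon avoids division by zero
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    return max(votes, key=votes.get)   # category with the largest total weight

X_train = np.array([[1.0, 1.0], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 1, 1])
# a plain majority vote over these 3 neighbors would answer 1; here the
# much closer class-0 point outweighs the two distant class-1 points
print(weighted_knn_predict(X_train, y_train, np.array([2.0, 2.0]), k=3))  # -> 0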

2. Official website API

Official website API

class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)

There are quite a few parameters here. For the details of each one, study the demos provided on the official website and experiment with them; the most commonly used parameters are explained below.
Import the package: from sklearn.neighbors import KNeighborsClassifier

①n_neighbors

This parameter is the K value, one of the three elements, and acts as the stopping condition: the search stops after K neighbors are found, the most common category among those K neighbors is determined, and the new element is assigned to that category. The default is 5.


Usage

KNeighborsClassifier(n_neighbors=2)

②weights

Selects the weight function:
'uniform': all neighbors get the same weight and are treated equally; this is the default
'distance': closer neighbors get larger weights
You can also pass a custom weight function, as in the sketch after the usage examples below.


Usage

KNeighborsClassifier(weights='distance')
KNeighborsClassifier(weights='uniform')
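A custom weight function is any callable that takes an array of neighbor distances and returns an array of the same shape with the corresponding weights; for example, a (hypothetical) Gaussian weighting:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def gaussian_weight(distances):
    # given an array of distances, return a same-shape array of weights
    return np.exp(-(distances ** 2) / 2.0)

knn = KNeighborsClassifier(n_neighbors=5, weights=gaussian_weight)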

③algorithm

Which algorithm is used to find the nearest neighbors?

The available options:
'auto': tries to choose the most appropriate algorithm automatically; this is the default
'ball_tree': solves with the BallTree algorithm
'kd_tree': solves with the KDTree algorithm
'brute': solves with a brute-force search

Usage

KNeighborsClassifier(algorithm="ball_tree")
KNeighborsClassifier(algorithm="kd_tree")
KNeighborsClassifier(algorithm="brute")
KNeighborsClassifier(algorithm="auto")

For the specific usage and purpose of the other parameters, see the official website.

④Finally build the model

KNeighborsClassifier(n_neighbors=4, algorithm="auto")

3. Code implementation

①Guide package

The model needs to be trained, evaluated, saved, and loaded, which requires the following packages. If an error is reported during import, just install the missing package with pip.

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import joblib
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

②Load the data set

The data set can simply be created by yourself in CSV format. The one used here has 6 independent variables X and 1 dependent variable Y.

fiber = pd.read_csv("./fiber.csv")
fiber.head(5) #show the first 5 rows of data


③Divide the data set

The first six columns are the independent variable X, and the last column is the dependent variable Y

Official API of the commonly used data-splitting function: train_test_split
test_size: proportion of the test set
train_size: proportion of the training set
random_state: random seed
shuffle: whether to shuffle the data
My data set here has 48 samples in total; with a 0.75 training and 0.25 test split, that gives 36 training samples and 12 test samples.

X = fiber.drop(['Grade'], axis=1)
Y = fiber['Grade']

X_train, X_test, y_train, y_test = train_test_split(X,Y,train_size=0.75,test_size=0.25,random_state=42,shuffle=True)

print(X_train.shape) #(36,6)
print(y_train.shape) #(36,)
print(X_test.shape) #(12,6)
print(y_test.shape) #(12,)

④Build KNN model

You can try setting and adjusting the parameters yourself.

knn = KNeighborsClassifier(n_neighbors=4,algorithm="auto")

⑤Model training

It's that simple: a single fit call trains the model.

knn.fit(X_train,y_train)

⑥Model evaluation

Throw the test set in and get the predicted results.

y_pred = knn.predict(X_test)

Check whether each prediction matches the actual test label: a match counts as 1, a mismatch as 0, and the mean is the accuracy.

accuracy = np.mean(y_pred==y_test)
print(accuracy) # 0.8333333333333333

Evaluation can also be done with score. The calculation and the idea are the same: both measure the proportion of samples the model predicts correctly. The score function is just a packaged version, though the arguments it takes are different. (There is also accuracy_score, imported with from sklearn.metrics import accuracy_score.)

score = knn.score(X_test,y_test)
print(score)
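Since confusion_matrix, classification_report, and accuracy_score were already imported above, a fuller picture of the same predictions is just a few calls away:

print(confusion_matrix(y_test, y_pred))       # rows: true class, columns: predicted class
print(classification_report(y_test, y_pred))  # per-class precision, recall, and F1
print(accuracy_score(y_test, y_pred))         # same value as knn.score above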

⑦Model testing

Take a single sample and evaluate it with the trained model.
A sample has six independent variables; pack them into an array, feed it into the model with knn.predict, and check whether the prediction matches the correct answer:

test = np.array([[16,18312.5,6614.5,2842.31,25.23,1147430.19]])
prediction = knn.predict(test)
print(prediction) #[2]

⑧Save the model

knn is the model object; the name must match the variable used above.
The second argument is the path where the model will be saved.

joblib.dump(knn, './knn.model') #save the model

⑨Load and use the model

knn_yy = joblib.load('./knn.model')

test = np.array([[11,99498,5369,9045.27,28.47,3827588.56]]) #an arbitrary data sample
prediction = knn_yy.predict(test) #feed in the data and predict
print(prediction) #[4]

Complete code

The complete code for model training and evaluation; steps ⑧ and ⑨ are not included.

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import joblib
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

fiber = pd.read_csv("./fiber.csv")

X = fiber.drop(['Grade'], axis=1)
Y = fiber['Grade']

X_train, X_test, y_train, y_test = train_test_split(X,Y,train_size=0.75,test_size=0.25,random_state=42,shuffle=True)

knn = KNeighborsClassifier(n_neighbors=4,algorithm="auto")

knn.fit(X_train,y_train)

y_pred = knn.predict(X_test)

accuracy = np.mean(y_pred==y_test)
print(accuracy) # 0.8333333333333333
score = knn.score(X_test,y_test)
print(score)
