K-Nearest Neighbors (KNN) is a widely used non-parametric supervised learning algorithm for classification and regression tasks. This article walks through how KNN works, from distance measures to the choice of K, to help readers understand both the principle and its application.
1. Overview of KNN algorithm
The KNN algorithm is based on a simple idea: samples that lie close together in feature space tend to have similar labels. To make a prediction, it calculates the distance between the new sample and every sample in the training set, then uses the K nearest samples for classification or regression.
2. Distance measure
In the KNN algorithm, the distance measure determines how the similarity between samples is judged. Commonly used measures include Euclidean distance, Manhattan distance, and Minkowski distance, which contains the first two as special cases (p = 2 and p = 1). Because the choice of measure decides which samples count as neighbors, it should match the characteristics of the problem and the properties of the data.
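The three measures named above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation; the function names are hypothetical and the inputs are assumed to be numeric feature vectors of equal length.

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two 1-D feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

def euclidean_distance(a, b):
    # Special case of Minkowski distance with p = 2
    return minkowski_distance(a, b, p=2)

def manhattan_distance(a, b):
    # Special case of Minkowski distance with p = 1
    return minkowski_distance(a, b, p=1)

a = [0.0, 0.0]
b = [3.0, 4.0]
print("Euclidean:", euclidean_distance(a, b))
print("Manhattan:", manhattan_distance(a, b))
```

For the points (0, 0) and (3, 4), the Euclidean distance is 5 (the 3-4-5 right triangle) and the Manhattan distance is 3 + 4 = 7, which shows how the same pair of samples can be closer or farther apart depending on the measure.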
3. K value selection
K is the key hyperparameter of the KNN algorithm: it sets how many neighbors take part in each classification or regression decision. Choosing an appropriate value of K is crucial to model performance. A small K makes the model sensitive to noise and prone to overfitting, while a large K smooths the prediction over too many samples, which blurs class boundaries and can lead to underfitting.
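One common way to pick K in practice is cross-validation: try several candidate values and keep the one with the best average validation score. The sketch below assumes scikit-learn and the Iris dataset; the candidate list and the 5-fold setting are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate K values (an arbitrary illustrative grid)
candidate_ks = [1, 3, 5, 7, 9, 11, 15, 21]

mean_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validated accuracy for this K
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[k] = scores.mean()

# Keep the K with the highest mean accuracy
best_k = max(mean_scores, key=mean_scores.get)
for k, score in mean_scores.items():
    print(f"K={k:2d}  mean accuracy={score:.3f}")
print("Best K:", best_k)
```

The best K found this way depends on the dataset and the split, so it should be treated as a starting point rather than a fixed rule.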
4. Classification tasks
Classification is the most common application of the KNN algorithm. Given a new sample, KNN calculates its distance to every training sample and selects the K nearest neighbors. The neighbors then vote according to their categories, and the new sample is assigned to the category with the most votes.
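The voting procedure described above can be written from scratch in a few lines. This is a minimal sketch assuming Euclidean distance and a brute-force search; the function name `knn_classify` and the toy data are hypothetical, and ties are simply resolved in favor of the label seen first among the nearest neighbors.

```python
from collections import Counter

import numpy as np

def knn_classify(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Euclidean distance from x_new to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances, nearest first
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and return the most frequent one
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters labeled "A" and "B"
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_toy = ["A", "A", "A", "B", "B", "B"]

label = knn_classify(X_toy, y_toy, [1.1, 1.0], k=3)
print("Predicted label:", label)
```

Here the query point sits inside the first cluster, so all three nearest neighbors vote for "A".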
5. Regression tasks
In addition to classification, the KNN algorithm can be applied to regression. As before, it calculates the distance between the new sample and every training sample and selects the K nearest neighbors. The prediction is then the average of the neighbors' target values, either a plain mean or a weighted mean in which closer neighbors count more.
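The averaging step can be illustrated with scikit-learn's `KNeighborsRegressor`, whose `weights` parameter switches between a plain mean (`"uniform"`) and an inverse-distance weighted mean (`"distance"`). The one-dimensional toy data below is made up purely for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy training data: one feature, targets follow y = x^2
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0, 16.0])

x_new = np.array([[2.2]])  # nearest neighbors are x=2 (distance 0.2) and x=3 (distance 0.8)

# Plain mean of the 2 nearest targets: (4 + 9) / 2
uniform_model = KNeighborsRegressor(n_neighbors=2, weights="uniform")
uniform_model.fit(X_train, y_train)
uniform_pred = uniform_model.predict(x_new)[0]

# Inverse-distance weighted mean: (4/0.2 + 9/0.8) / (1/0.2 + 1/0.8)
weighted_model = KNeighborsRegressor(n_neighbors=2, weights="distance")
weighted_model.fit(X_train, y_train)
weighted_pred = weighted_model.predict(x_new)[0]

print("Uniform average:  ", uniform_pred)
print("Distance-weighted:", weighted_pred)
```

Working the arithmetic by hand, the plain mean is 6.5 and the weighted mean is 31.25 / 6.25 = 5.0. The weighted prediction is pulled toward the target of the closer neighbor.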
6. Advantages and disadvantages of KNN
Advantages of the KNN algorithm:
- Simple and easy to understand; no explicit training phase is required, since the model just stores the training data.
- Ability to handle multi-category and multi-feature problems.
- It performs well when the sample distribution is relatively uniform.
Disadvantages of the KNN algorithm:
- For large-scale datasets, prediction is slow, because each new sample must be compared against every training sample.
- For high-dimensional data, distance computations are susceptible to the curse of dimensionality.
- For imbalanced datasets, the classification results may be biased towards the category with more samples.
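The curse of dimensionality mentioned in the list above can be observed with a small experiment: as the number of features grows, the nearest and farthest points from a query become almost equally far away, so "nearest" carries less information. The sketch below uses uniformly random points, which is an assumption made only for illustration; real data may behave differently.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_dims, n_points=1000):
    """Ratio of the smallest to the largest distance from a random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return float(distances.min() / distances.max())

# A ratio near 0 means neighbors are clearly distinguishable;
# a ratio near 1 means all points are roughly equally far away.
ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 10, 100, 1000)}
for d, ratio in ratios.items():
    print(f"dims={d:4d}  nearest/farthest distance ratio={ratio:.3f}")
```

In this setup the ratio climbs toward 1 as the dimension increases, which is why dimensionality reduction or feature selection is often applied before KNN on high-dimensional data.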
7. Application of KNN algorithm
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load the dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create the KNN model
model = KNeighborsClassifier(n_neighbors=3)
# Fit the model
model.fit(X_train, y_train)
# Predict on the test set
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
The code first loads the classic Iris dataset and splits it into a training set (80%) and a test set (20%). It then creates a KNN classifier with K = 3 and fits it on the training set. Finally, it predicts labels for the test set and computes the accuracy to evaluate the model.