K-Nearest Neighbor Algorithm for Regression and Classification

The k-nearest neighbor (k-NN) algorithm is a supervised learning method and one of the most basic algorithms for classification and regression.

The principle of the algorithm: to classify a new, unlabeled instance, take the k labeled instances nearest to it and let them vote. The class that the majority of those k nearest neighbors belong to is the class assigned to the new instance.

Put simply, it is the saying "one who stays near vermilion gets stained red, and one who stays near ink gets stained black": a point tends to share the class of its neighbors.
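
As a minimal sketch of this voting rule (the function name knn_predict and the toy data below are illustrative, not from the original post):

import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by a majority vote among its k nearest
    training points (Euclidean distance). Illustrative only."""
    # Distance from the query point to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Labels of the k closest training points
    nearest_labels = y_train[np.argsort(dists)[:k]]
    # The most frequent label wins the vote
    return np.bincount(nearest_labels).argmax()

# Toy data: two points near the origin (class 0), two near (5, 5) (class 1)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1])))  # prints 0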

In general, different choices of k lead to different results, so choosing an appropriate k is very important to the algorithm.

If k is too small, the model is overly complex and prone to overfitting: the prediction is easily disturbed by noise.

If k is too large, the model is overly simple and prone to underfitting: sample points far from the query instance also take part in the vote, which can make the prediction wrong.
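
One practical way to search for a good k is cross-validation. The sketch below is illustrative: it uses scikit-learn's cross_val_score on data generated the same way as the classification example later in this post.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=400, centers=4, cluster_std=1.3, random_state=6)

# Mean 5-fold cross-validation accuracy for several candidate k values;
# very small k tends to overfit, very large k tends to underfit
for k in (1, 5, 15, 50, 150):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f'k = {k:3d}, mean accuracy = {score:.3f}')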

Measure of Distance

Each time it classifies an unlabeled instance, the k-nearest neighbor algorithm must compute the distance from that instance to every labeled sample, sort the distances, and take the k closest samples to vote on the final class. Computing the distance is therefore a key part of the algorithm. Three metrics are commonly used:

  • Euclidean distance:

$L(x_i, x_j)=\left( \sum_{l=1}^{n}\left(x_i^{(l)}-x_j^{(l)}\right)^2 \right)^{\frac{1}{2}}$

  • Manhattan distance:

$L(x_i, x_j)=\sum_{l=1}^{n}\left|x_i^{(l)}-x_j^{(l)}\right|$

  • Chebyshev distance (the maximum difference over all coordinates):

$L(x_i, x_j)=\max_{l}\left|x_i^{(l)}-x_j^{(l)}\right|$
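
All three are special cases of the Minkowski distance (p = 2, p = 1, and p → ∞ respectively), which is how scikit-learn's p parameter selects between the first two. A small NumPy sketch for concreteness:

import numpy as np

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 0.0, 3.5])

euclidean = np.sqrt(np.sum((xi - xj) ** 2))  # Minkowski with p = 2
manhattan = np.sum(np.abs(xi - xj))          # Minkowski with p = 1
chebyshev = np.max(np.abs(xi - xj))          # Minkowski as p -> infinity

print(euclidean, manhattan, chebyshev)       # 3.640..., 5.5, 3.0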

Pros and Cons of the Algorithm

Advantages of the algorithm:

  • An appropriate value of k can be found through repeated experimentation;
  • The algorithm can achieve high accuracy;
  • It is tolerant of outliers and noise.

Disadvantages of the algorithm:

  • Every time an unlabeled sample is classified, its distance to all labeled samples must be computed and sorted (see the sketch after this list for a common mitigation);
  • An appropriate value of k is difficult to determine;
  • The computation requires a lot of memory.
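
The cost of a brute-force scan over all labeled samples is commonly reduced by indexing the training data; scikit-learn exposes this through the algorithm parameter. A brief illustrative sketch:

from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=400, centers=4, cluster_std=1.3, random_state=6)

# A KD-tree (or 'ball_tree') index avoids comparing a query against
# every training point, at the cost of building the tree during fit
model = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree', leaf_size=30)
model.fit(X, y)
print(model.predict(X[:3]))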

Algorithm Cases

Classification Task

The examples below are run in a Jupyter notebook environment.

Import the packages:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

Generate data:

X, y = make_blobs(n_samples=400, centers=4, cluster_std=1.3, random_state=6)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='rainbow')

(Figure: scatter plot of the four generated clusters)

Define the wrapped classification function:

def KneighborsClassifier(X, y, ax=None, n_neighbors=5, weights='uniform', algorithm='auto', cmap='rainbow'):
    """
    Classify with the k-nearest neighbor algorithm and plot the decision regions.
    :param X: 2-D matrix to process, where (X[:, 0], X[:, 1]) is a point in the plane
    :param y: supervised learning labels
    :param ax: a prepared axis object to draw on, or None for the current one
    :param n_neighbors: the most important hyperparameter, the value of k
    :param weights: 'uniform' gives every point the same weight
    :param algorithm: the neighbor-search algorithm, 'auto' by default
    :param cmap: the colormap to use
    :return: the fitted model
    """

    # Fall back to the current axes if none was passed in
    ax = ax or plt.gca()

    # Plot the training data and configure the axes
    ax.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=cmap)
    ax.axis('tight')
    ax.set_xlabel('x', fontsize=16)
    ax.set_ylabel('y', fontsize=16)
    ax.set_title('KNeighborsClassifier', fontsize=19)

    # Record the axis limits
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

    # Build the model and set its hyperparameters
    # p=2 selects the Euclidean distance
    model = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights,
                                 algorithm=algorithm, leaf_size=30, p=2)

    # Fit the estimator to the data
    model.fit(X, y)

    # Predict on a dense 2-D grid covering the plot
    xx, yy = np.meshgrid(np.linspace(*xlim, num=200),
                         np.linspace(*ylim, num=200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    # Color the predicted regions
    n_classes = len(np.unique(y))
    ax.contourf(xx, yy, Z, alpha=0.3, cmap=cmap, zorder=1,
                levels=np.arange(n_classes + 1) - 0.5)
    # Restore the axis limits
    ax.set(xlim=xlim, ylim=ylim)

    return model

View classification results:

KneighborsClassifier(X, y)

(Figure: decision regions learned by the classifier, overlaid on the training points)
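
To visualize the overfitting/underfitting trade-off described earlier, the wrapper can be called with a very small and a very large k (an illustrative comparison; X and y are the data generated above):

fig, axes = plt.subplots(1, 2, figsize=(14, 6))
for ax, k in zip(axes, (1, 100)):
    KneighborsClassifier(X, y, ax=ax, n_neighbors=k)
    # Label each panel with its k; k=1 gives jagged, overfit-prone
    # boundaries, while k=100 gives smooth, underfit-prone ones
    ax.set_title(f'n_neighbors = {k}', fontsize=19)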

Regression Task

Import the packages:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor

Generate data:

rng = np.random.RandomState(42)

x = rng.rand(100) * 10
y = np.sin(x) + 0.2 * rng.randn(100)
plt.scatter(x, y)

(Figure: scatter plot of the noisy sine-curve data)

Define the wrapped regression function:

def plusNewAxis(x):
    """
    Add one dimension to the array x.
    :param x: the array to process
    :return: np.ndarray
    """
    return x[:, np.newaxis]

def KneighborsRegression(x, y, k=5, ax=None):
    """
    Regression analysis with the k-nearest neighbor algorithm.
    :param x: x coordinates of the regression data
    :param y: y coordinates of the regression data
    :param k: the nearest-neighbor hyperparameter
    :param ax: a prepared axis object to draw on, or None for the current one
    :return: model
    """

    # Fall back to the current axes if none was passed in
    ax = ax or plt.gca()

    # Reshape to the 2-D layout the estimator expects
    if x.ndim == 1:
        X = plusNewAxis(x)
    else:
        X = x

    # Build the model and fit it
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(X, y)
    # Evaluate the model score
    print('Model score:', model.score(X, y))

    # Generate test data spanning slightly beyond the training range
    Min, Max = X.min() - 0.5, X.max() + 0.5
    xfit = np.linspace(Min, Max, 1000)
    Xfit = xfit[:, np.newaxis]
    yfit = model.predict(Xfit)

    # Plot the raw data
    ax.scatter(x, y, color='b')
    # Plot the regression curve
    ax.plot(xfit, yfit, color='r')

    # Set the title and axis labels
    ax.set_title('KNeighbors Regression', fontsize=19)
    ax.set_xlabel('x', fontsize=16)
    ax.set_ylabel('y', fontsize=16)

    # Add a legend
    ax.legend(['raw data', 'Regression'], loc='best')

    # Return the model
    return model

Complete the regression fit and calculate the score:

KneighborsRegression(x, y)
(Figure: the raw data with the fitted k-NN regression curve)
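
The same trade-off appears in regression: a very small k chases the noise, while a very large k oversmooths the sine curve. An illustrative comparison reusing the wrapper and the x, y data from above:

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for ax, k in zip(axes, (1, 50)):
    # Each call prints its own model score and draws one panel
    KneighborsRegression(x, y, k=k, ax=ax)
    ax.set_title(f'k = {k}', fontsize=19)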

Source: blog.csdn.net/wwx1239021388/article/details/130001296