Nearest Neighbors

Overview

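Supervised learning methods based on nearest neighbors fall into two categories: classification, for data with discrete labels, and regression, for data with continuous labels. Unsupervised nearest-neighbor methods are used for cluster analysis.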

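The principle of nearest-neighbor methods is to find, among the training samples, the points closest in distance to the query point, either a predefined number of them or all of those within a given range, and then to predict the query point's label from those points. The number of points can be a user-defined constant, which is k-nearest-neighbor learning (KNN), or it can follow from a user-defined radius around the query point, which is radius-based nearest-neighbor learning (RNN).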

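The distance between data points can be understood as a measure of their similarity: the smaller the distance, the more similar the points. Distance can be measured in many ways, such as Euclidean distance or Manhattan distance. Standard Euclidean distance is the most common choice.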
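
The two metrics named above can be illustrated with a minimal NumPy sketch. The function names `euclidean_distance` and `manhattan_distance` and the sample points are chosen here for illustration only and are not part of scikit-learn.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two points."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b)))

# illustrative points: the coordinate differences are 3 and 4
p = [1.0, 3.0]
q = [4.0, 7.0]
d_euclidean = euclidean_distance(p, q)  # sqrt(3^2 + 4^2) = 5.0
d_manhattan = manhattan_distance(p, q)  # 3 + 4 = 7.0
print(d_euclidean, d_manhattan)
```

The same pair of points is farther apart under the Manhattan metric, so the choice of metric can change which training points count as "nearest".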

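Nearest-neighbor methods are known as non-generalizing machine learning methods, because they simply "remember" all of their training data: they memorize every historical sample by rote and, when given new data, compare it against all of the historical data to find the most similar samples. A generalizing machine learning method, by contrast, forms a conceptual model after training on the given samples and, when given new data, derives its conclusion directly from that model.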
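
The "memorize and compare" behaviour can be sketched as a brute-force search over the stored samples. This is only an illustration under simple assumptions (Euclidean distance, a small array that fits in memory); the helper names `k_nearest` and `within_radius` are hypothetical, and scikit-learn may use tree-based search structures instead of this exhaustive scan.

```python
import numpy as np

def k_nearest(train, query, k):
    """Return (indices, distances) of the k stored points closest to query."""
    train = np.asarray(train, dtype=float)
    query = np.asarray(query, dtype=float)
    dists = np.sqrt(((train - query) ** 2).sum(axis=1))
    order = np.argsort(dists)[:k]  # closest first
    return order, dists[order]

def within_radius(train, query, radius):
    """Return (indices, distances) of all stored points within radius of query."""
    train = np.asarray(train, dtype=float)
    query = np.asarray(query, dtype=float)
    dists = np.sqrt(((train - query) ** 2).sum(axis=1))
    idx = np.flatnonzero(dists <= radius)
    return idx, dists[idx]

# "training" is nothing more than keeping the samples around
train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
query = np.array([0.0, 0.5])

knn_idx, knn_dist = k_nearest(train, query, k=2)
rad_idx, rad_dist = within_radius(train, query, radius=1.6)
print("k = 2 neighbors:", knn_idx, knn_dist)
print("radius 1.6 neighbors:", rad_idx, rad_dist)
```

No model is built at any point: every query triggers a comparison against all stored samples, which is exactly what distinguishes this approach from a generalizing method.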

Unsupervised Nearest Neighbors

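The task of unsupervised nearest neighbors is to find, among the training samples, the points closest in distance to the query point, either a predefined number of them or all of those within a given range. The number of points can be a user-defined constant, which is k-nearest neighbors (KNN), or it can follow from a user-defined radius around the new point, which is radius-based nearest neighbors (RNN).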

KNN unsupervised nearest neighbors example:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.neighbors import NearestNeighbors

# generate random 2-D points as the training data
x = 5 * np.random.random((50, 2))
# the two query points
y = np.array([[1, 3], [4, 2]])

# knn: number of neighbors to find for each query point
n_neighbors = 5

# create color maps
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

# fit the training data
nbrs = NearestNeighbors(n_neighbors=n_neighbors, algorithm='auto')
nbrs.fit(x)

# get the nearest neighbors (distances are sorted in ascending order)
distances, indices = nbrs.kneighbors(y)
print("distances:", distances)
print("indices:", indices)

# get the selection of nearest neighbors as a (queries x training points) 0/1 matrix
selected = nbrs.kneighbors_graph(y).toarray()
print("selected:", selected)

# plot the query points
plt.plot(y[:, 0], y[:, 1], 'g+')

# plot a circle around each query point that reaches its farthest selected neighbor
t = np.linspace(0, np.pi * 2, 50)
x_t = np.cos(t)
y_t = np.sin(t)
for i in range(y.shape[0]):
    plt.plot(x_t * distances[i, -1] + y[i, 0], y_t * distances[i, -1] + y[i, 1])

# mark every training point selected by either query point
# (np.bool was removed from recent NumPy releases; the builtin bool is used instead)
selected = selected[0, :].astype(bool) | selected[1, :].astype(bool)
selected = selected.astype(np.int32)

# plot the training points, colored by whether they were selected
plt.scatter(x[:, 0], x[:, 1], c=selected, cmap=cmap_bold, edgecolor='k', s=20)

plt.show()

RNN unsupervised nearest neighbors example:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.neighbors import NearestNeighbors

# generate random 2-D points as the training data
x = 5 * np.random.random((50, 2))
# the two query points
y = np.array([[1, 3], [4, 2]])

# rnn: search radius around each query point
radius = 1

# create color maps
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

# fit the training data
nbrs = NearestNeighbors(radius=radius, algorithm='auto')
nbrs.fit(x)

# get the neighbors within the radius (one variable-length array per query point)
distances, indices = nbrs.radius_neighbors(y)
print("distances:", distances)
print("indices:", indices)

# get the selection of neighbors as a (queries x training points) 0/1 matrix
selected = nbrs.radius_neighbors_graph(y).toarray()
print("selected:", selected)

# plot the query points
plt.plot(y[:, 0], y[:, 1], 'g+')

# plot the search circle around each query point
t = np.linspace(0, np.pi * 2, 50)
x_t = np.cos(t)
y_t = np.sin(t)
for i in range(y.shape[0]):
    plt.plot(x_t * radius + y[i, 0], y_t * radius + y[i, 1])

# mark every training point selected by either query point
# (np.bool was removed from recent NumPy releases; the builtin bool is used instead)
selected = selected[0, :].astype(bool) | selected[1, :].astype(bool)
selected = selected.astype(np.int32)

# plot the training points, colored by whether they were selected
plt.scatter(x[:, 0], x[:, 1], c=selected, cmap=cmap_bold, edgecolor='k', s=20)

plt.show()

Nearest Neighbors Classification

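Nearest-neighbor classification is a form of non-generalizing, or instance-based, learning: it does not try to learn a general conceptual model from the training data. The points that the nearest-neighbor search finds in the sample set vote, and the most represented label among them becomes the label of the query point. In general, the points found in the training set vote with uniform weights. In some cases weighting is needed: each point's weight is inversely proportional to its distance from the query point, so the farther a point is from the query point, the smaller its weight.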
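
The two voting schemes can be compared in a small sketch. The helper `vote` is hypothetical, and the `eps` term that guards against division by zero is an assumption made here for simplicity; scikit-learn handles zero distances in its own way. The neighbor labels and distances are made-up values.

```python
import numpy as np

def vote(labels, distances, weights="uniform", eps=1e-12):
    """Pick the label with the largest total vote among the neighbors."""
    labels = np.asarray(labels)
    distances = np.asarray(distances, dtype=float)
    if weights == "uniform":
        w = np.ones_like(distances)    # every neighbor counts equally
    elif weights == "distance":
        w = 1.0 / (distances + eps)    # closer neighbors count more
    else:
        raise ValueError("weights must be 'uniform' or 'distance'")
    classes = np.unique(labels)
    totals = np.array([w[labels == c].sum() for c in classes])
    return classes[np.argmax(totals)]

# three neighbors: one very close point of class 0, two farther points of class 1
neighbor_labels = [0, 1, 1]
neighbor_distances = [0.1, 1.0, 2.0]

uniform_label = vote(neighbor_labels, neighbor_distances, "uniform")
weighted_label = vote(neighbor_labels, neighbor_distances, "distance")
print(uniform_label, weighted_label)
```

With uniform weights class 1 wins by two votes to one. With inverse-distance weights the single close neighbor contributes about 1/0.1 = 10, which outweighs 1/1.0 + 1/2.0 = 1.5, so class 0 wins.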

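Classification based on the k nearest neighbors of a query point is the KNN algorithm; classification based on the neighbors within a fixed radius r of a query point is the RNN method.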

KNN nearest neighbors classification example:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets, neighbors

# import some data to play with
iris = datasets.load_iris()

# only take the first two features.
# we could avoid this ugly slicing by using a two-dim dataset
x = iris.data[:, :2]
y = iris.target

# knn: number of neighbors that vote on each prediction
n_neighbors = 15

# create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

for weights in ['uniform', 'distance']:
    # we create an instance of Neighbors classifier and fit the data
    clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(x, y)

    # plot the decision boundary.
    # For that, we will assign a color to each point in the mesh [x_min, x_max] * [y_min, y_max]
    x_min, x_max = x[:, 0].min() - 1, x[:, 0].max() + 1
    y_min, y_max = x[:, 1].min() - 1, x[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))

    # predict the class of every mesh point
    z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # put the result into a color plot
    z = z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, z, cmap=cmap_light)

    # plot also the training points
    plt.scatter(x[:, 0], x[:, 1], c=y, cmap=cmap_bold, edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i, weights = '%s')" % (n_neighbors, weights))

plt.show()

RNN nearest neighbors classification example

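Radius-based nearest-neighbor classification requires a radius parameter, and choosing a suitable radius is very difficult: the data must be normalized during preprocessing, and RNN also places requirements on the distribution of the data. An RNN classification example will be added after data normalization is introduced later.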
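
Until then, the idea can be previewed with a rough sketch. Everything specific here is an assumption rather than the author's method: `StandardScaler` stands in for the normalization step, the radius of 1.0 is an arbitrary value on the standardized scale, and predictions are made only on the training points so that every query is guaranteed at least one neighbor (itself) inside the radius.

```python
from sklearn import datasets
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# same data as the KNN classification example: first two iris features
iris = datasets.load_iris()
x = iris.data[:, :2]
y = iris.target

# assumed normalization step: zero mean and unit variance per feature
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)

# assumed radius, expressed in standardized units
radius = 1.0
clf = RadiusNeighborsClassifier(radius=radius)
clf.fit(x_scaled, y)

# predict on the training points only; each one lies within its own radius
predictions = clf.predict(x_scaled)
accuracy = (predictions == y).mean()
print("training accuracy: %.3f" % accuracy)
```

For query points that may have no training point inside the radius, the classifier's `outlier_label` parameter would likely need to be set, since the default behaviour is to raise an error for such points.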
