The k-Nearest Neighbors Algorithm
\[y = \underset{c_j}{\arg\max} \sum_{x_i\in N_k(x)} I(y_i=c_j),\quad i=1,2,\cdots,N;\; j=1,2,\cdots,K\]
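The decision rule above can be sketched as a brute-force classifier (a minimal illustration; `knn_predict` and `euclidean` are our own names, not from any library):

```python
from collections import Counter

def euclidean(a, b):
    # Plain Euclidean (L2) distance between two points
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def knn_predict(train, labels, x, k=3):
    # Indices of the k training points nearest to x
    neighbors = sorted(range(len(train)), key=lambda i: euclidean(train[i], x))[:k]
    # Majority vote: the class c_j maximizing the count of I(y_i = c_j)
    return Counter(labels[i] for i in neighbors).most_common(1)[0][0]
```

For example, with training points `[(0,0), (1,1), (5,5), (6,6)]` labeled `['a','a','b','b']`, the query `(0.5, 0.5)` is classified as `'a'` with `k=3`.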
The k-Nearest Neighbor Model
Model
The training instances partition the feature space into cells: each point in the feature space is assigned the class of its nearest training instance.
Distance metrics
The \(L_p\) distance
\[L_p(x_i,x_j) = (\sum_{l=1}^{n}|x_i^{l}-x_j^l|^p)^{\frac{1}{p}}\]
\[L_2(x_i,x_j) = (\sum_{l=1}^{n}|x_i^{l}-x_j^l|^2)^{\frac{1}{2}}\]
\[L_1(x_i,x_j) = \sum_{l=1}^{n}|x_i^{l}-x_j^l|\]
\[L_{\infty}(x_i,x_j) = \underset{l}{\max}\,|x_i^{l}-x_j^l|\]
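The whole \(L_p\) family can be computed in a few lines (a sketch; `lp_distance` is a hypothetical helper, with `p = float('inf')` giving the max-coordinate distance):

```python
def lp_distance(x, y, p):
    # L_inf distance: the largest coordinate-wise difference
    if p == float('inf'):
        return max(abs(a - b) for a, b in zip(x, y))
    # General L_p distance: (sum of |x_l - y_l|^p)^(1/p)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)
```

For `x = (1, 1)` and `y = (4, 5)` this gives \(L_1 = 7\), \(L_2 = 5\), and \(L_\infty = 4\), illustrating that the same pair of points has different distances under different \(p\).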
Choosing the value of k:
If k is too small, the model is overly complex and prone to overfitting: the approximation error decreases, but the prediction becomes sensitive to noise in the few nearest neighbors.
If k is too large, the model becomes overly simple and the approximation error increases, because distant, less relevant training instances also influence the prediction.
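In practice k is usually chosen by cross-validation. A leave-one-out sketch (all names here are our own, not from a library) could look like:

```python
from collections import Counter

def loo_error(train, labels, k):
    # Leave-one-out error: classify each point using all the others
    errors = 0
    for i, x in enumerate(train):
        others = sorted((j for j in range(len(train)) if j != i),
                        key=lambda j: sum((a - b) ** 2 for a, b in zip(train[j], x)))
        vote = Counter(labels[j] for j in others[:k]).most_common(1)[0][0]
        errors += vote != labels[i]
    return errors / len(train)

# One would then pick the k with the lowest leave-one-out error, e.g.:
# best_k = min(range(1, 10), key=lambda k: loo_error(train, labels, k))
```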
Classification decision rule
Majority voting rule
For a sample \(x\), its k nearest neighbors form the set \(N_k(x)\). If the predicted class is \(c_j\), the misclassification rate is
\[\frac{1}{k}\sum_{x_i\in N_k(x)} I(y_i\neq c_j) = 1-\frac{1}{k} \sum_{x_i \in N_k(x)}I(y_i=c_j)\]
Minimizing the misclassification rate is the same as minimizing the empirical risk, so we should maximize \(\sum_{x_i\in N_k(x)} I(y_i=c_j)\); hence the majority voting rule is equivalent to empirical risk minimization.
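A quick numeric check of this identity, assuming three neighbors with labels a, a, b: the misclassification rate for predicting class \(c\) is \(1 - (\text{count of } c)/k\), so the majority class achieves the minimum.

```python
from collections import Counter

neighbors = ['a', 'a', 'b']
k = len(neighbors)
# Misclassification rate for each candidate class: 1 - count/k
rates = {c: 1 - n / k for c, n in Counter(neighbors).items()}
# rates['a'] = 1/3, rates['b'] = 2/3: the majority class 'a' minimizes the rate
best = min(rates, key=rates.get)
```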
KD-Trees
Building a KD-tree
# Builds the KD-tree for fast lookup
def make_kd_tree(points, dim, i=0):
    if len(points) > 1:
        points.sort(key=lambda x: x[i])  # sort along the current dimension i
        i = (i + 1) % dim  # the splitting dimension cycles through the axes
        half = len(points) >> 1  # the median along the sort dimension becomes the root
        return [
            make_kd_tree(points[:half], dim, i),
            make_kd_tree(points[half + 1:], dim, i),
            points[half]  # each node is a 3-element list: [left subtree, right subtree, point]
        ]
    elif len(points) == 1:
        return [None, None, points[0]]
# Adds a point to the KD-tree
def add_point(kd_node, point, dim, i=0):
    if kd_node is not None:
        dx = kd_node[2][i] - point[i]  # signed offset along the splitting dimension
        i = (i + 1) % dim
        for j, c in ((0, dx >= 0), (1, dx < 0)):  # go left if the point is on the small side
            if c and kd_node[j] is None:
                kd_node[j] = [None, None, point]
            elif c:
                add_point(kd_node[j], point, dim, i)
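To actually use the tree for nearest-neighbor lookup we also need a search routine. A minimal single-nearest-neighbor sketch over the same `[left, right, point]` node layout (`get_nearest` is our own name, not from the original; it returns the squared distance together with the point):

```python
def get_nearest(node, point, dim, i=0, best=None):
    # node layout: [left subtree, right subtree, point], as built above
    if node is None:
        return best
    d = sum((a - b) ** 2 for a, b in zip(node[2], point))  # squared distance to this node
    if best is None or d < best[0]:
        best = (d, node[2])
    dx = node[2][i] - point[i]
    i2 = (i + 1) % dim
    near, far = (0, 1) if dx >= 0 else (1, 0)  # visit the query's side of the split first
    best = get_nearest(node[near], point, dim, i2, best)
    if dx * dx < best[0]:  # the splitting plane is closer than the current best: check the far side
        best = get_nearest(node[far], point, dim, i2, best)
    return best

# A tiny hand-built tree: root (5, 4) splitting on x, with two leaf children
tree = [[None, None, (2, 3)], [None, None, (9, 6)], (5, 4)]
```

Querying `get_nearest(tree, (8, 7), 2)` returns `(2, (9, 6))`: the branch on the query's side is searched first, and the far branch is pruned whenever the splitting plane is farther away than the best point found so far.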