Chapter III - KNN (a classification and regression model)

The previous chapter covered the perceptron model, its strategy, and its algorithm. The perceptron has its advantages for classification, but it rests on a strong assumption: the training data set must be linearly separable. When the data are not linearly separable, the k-nearest neighbor (KNN) method is another option. KNN is a basic classification and regression method: it can handle simple binary classification as well as more complex multi-class classification, and it can also be used for regression.

KNN model

The KNN model in effect corresponds to a partition of the feature space. Although there is no explicit mathematical abstraction describing the model itself, it still has three elements: the distance metric, the choice of the value of k, and the classification decision rule.

Distance Measurement

\[
\text{Let the feature space } \chi \text{ be the } n\text{-dimensional real vector space } R^n, \text{ with } x_i, x_j \in \chi, \\
x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(n)})^T, \quad x_j = (x_j^{(1)}, x_j^{(2)}, \ldots, x_j^{(n)})^T. \\
\text{The } L_p \text{ distance between } x_i \text{ and } x_j \text{ is defined as} \\
L_p(x_i, x_j) = \left( \sum_{l=1}^{n} |x_i^{(l)} - x_j^{(l)}|^p \right)^{\frac{1}{p}}. \\
\text{When } p = 1, \ L_1(x_i, x_j) = \sum_{l=1}^{n} |x_i^{(l)} - x_j^{(l)}|, \text{ called the Manhattan distance.} \\
\text{When } p = 2, \ L_2(x_i, x_j) = \left( \sum_{l=1}^{n} |x_i^{(l)} - x_j^{(l)}|^2 \right)^{\frac{1}{2}}, \text{ the } L_2 \text{ norm, known as the Euclidean distance and the most commonly used.} \\
\text{When } p = \infty, \text{ it is the maximum of the coordinate-wise distances: } L_{\infty}(x_i, x_j) = \max_l |x_i^{(l)} - x_j^{(l)}|.
\]
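
As a quick illustration (a minimal sketch, not part of the original text; the vectors x_i and x_j below are arbitrary example values), the three special cases of the L_p distance can be computed with NumPy:

import numpy as np

x_i = np.array([1.0, 2.0, 3.0])   # arbitrary example vectors
x_j = np.array([4.0, 0.0, 3.0])

# Manhattan distance (p = 1)
l1 = np.linalg.norm(x_i - x_j, ord=1)          # |1-4| + |2-0| + |3-3| = 5
# Euclidean distance (p = 2)
l2 = np.linalg.norm(x_i - x_j, ord=2)          # sqrt(9 + 4 + 0) ≈ 3.61
# Chebyshev distance (p = infinity)
l_inf = np.linalg.norm(x_i - x_j, ord=np.inf)  # max(3, 2, 0) = 3

print(l1, l2, l_inf)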

Select the K value

In addition to the distance metric, the choice of the value of k also has a significant impact on the results of the KNN algorithm.

  • If a smaller value of k is chosen, the prediction is in effect made with training examples from a smaller neighborhood. The approximation error decreases, because only examples close to the input instance influence the prediction, but the estimation error increases: the result becomes very sensitive to the nearby points, so a single noisy neighbor can flip the prediction.
  • If a larger value of k is chosen, the estimation error decreases, but the approximation error increases: training examples far from the input instance, which should have little bearing on it, also take part in the vote and can make the prediction wrong. A small sketch of choosing k by cross-validation follows this list.
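
In practice the value of k is usually chosen by cross-validation. A minimal sketch of this procedure (with made-up data, only to show the mechanics; it is not part of the original text):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data, only to illustrate the selection procedure.
rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = (X[:, 0] > 0.5).astype(int)

for k in range(1, 10, 2):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validation
    print("k={}, mean cross-validation accuracy={:.2f}".format(k, scores.mean()))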

Classification decision rule

The classification decision rule of KNN is usually majority voting: the class of the input instance is decided by a "vote" among its k nearest neighbors.

If the loss function is the 0-1 loss function and the classification function is
\[ f: R^n \rightarrow \{c_1, c_2, \ldots, c_K\} \]
then, for the set N_k(x) of the k training points nearest to x, the misclassification rate is
\[ \frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i \neq c_j) = 1 - \frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i = c_j) \]
To minimize the misclassification rate, the sum of indicators I(y_i = c_j) over N_k(x) must be as large as possible, so the majority voting rule is equivalent to empirical risk minimization.
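
As a small illustration of the majority-vote rule (the neighbor labels below are hypothetical), Python's Counter performs exactly this counting:

from collections import Counter

# Hypothetical class labels of the k nearest neighbors.
neighbor_labels = [1, -1, -1, 1, -1]

# most_common(1) returns [(label, count)] for the most frequent label.
predicted_label = Counter(neighbor_labels).most_common(1)[0][0]
print(predicted_label)   # -1, since -1 appears 3 times and 1 appears twice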

KNN algorithm

Algorithm Description

\[
\text{Input: training data set } T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}, \text{ where } x_i \in \chi \subseteq R^n \text{ is the feature vector of an instance and} \\
y_i \in Y = \{c_1, c_2, \ldots, c_K\} \text{ is its class, } i = 1, 2, \ldots, N; \text{ and an instance feature vector } x. \\
\text{Output: the class } y \text{ to which instance } x \text{ belongs.} \\
(1) \ \text{According to the given distance metric, find the } k \text{ points in the training set } T \text{ nearest to } x; \text{ the neighborhood of } x \text{ covering these } k \text{ points is denoted } N_k(x). \\
(2) \ \text{In } N_k(x), \text{ determine the class } y \text{ of } x \text{ by the classification decision rule (majority voting):} \\
y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j), \quad i = 1, 2, \ldots, N; \ j = 1, 2, \ldots, K
\]

The main issue in implementing KNN is how to search the training data quickly for the k nearest neighbors, especially when the feature space has many dimensions or the training set is large, in which case data storage also becomes a problem. The simplest implementation is a linear scan, but on large data sets this is very time-consuming. To improve search efficiency, the training data can be stored in a special structure, the kd tree (KD Tree). A kd tree is a tree data structure that stores points in k-dimensional space for fast retrieval; essentially, it is a binary tree that represents a partition of the k-dimensional space.
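
As a sketch of this idea, the KDTree class from sklearn (one readily available kd-tree implementation, reusing the small training set from the code below) can build the tree and answer nearest-neighbor queries:

import numpy as np
from sklearn.neighbors import KDTree

X_train = np.array([[5, 4], [9, 6], [4, 7], [2, 3], [8, 1], [7, 2]])

# Build a kd tree over the training points.
tree = KDTree(X_train)

# Query the 3 nearest neighbors of a new point.
dist, ind = tree.query(np.array([[5, 3]]), k=3)
print(dist)   # distances to the 3 nearest training points
print(ind)    # their row indices in X_train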

Code

Self-programmed implementation

import numpy as np
from collections import Counter


class KNN:
    """
    KNN algorithm implemented from scratch
    @author cecilia
    """
    def __init__(self, X_train, y_train, k=3):
        # Initialize the required parameters
        self.k = k                  # the chosen value of k
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X_new):
        # Compute the Euclidean distance from X_new to every training point
        dist_list = [(np.linalg.norm(X_new - self.X_train[i], ord=2), self.y_train[i])
                     for i in range(self.X_train.shape[0])]
        # [(d0, -1), (d1, 1), ...]
        # Sort all distances in ascending order
        dist_list.sort(key=lambda x: x[0])
        # Take the class labels (y values) of the k smallest distances
        y_list = [dist_list[i][-1] for i in range(self.k)]
        # [-1, 1, 1, -1, ...]
        # Count the class labels of these k points
        y_count = Counter(y_list).most_common()
        # [(-1, 3), (1, 2)]
        return y_count[0][0]

def main():
    # Initialize the training data
    X_train = np.array([[5, 4],
                        [9, 6],
                        [4, 7],
                        [2, 3],
                        [8, 1],
                        [7, 2]])
    y_train = np.array([1, 1, 1, -1, -1, -1])
    # Test data
    X_new = np.array([[5, 3]])
    # Effect of different (odd) values of k on the classification result
    for k in range(1, 6, 2):
        # Build a KNN instance
        clf = KNN(X_train, y_train, k=k)
        # Predict the class of the test data
        y_predict = clf.predict(X_new)
        print("k={}, class label is: {}".format(k, y_predict))


if __name__ == "__main__":
    main()

Sklearn library

import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def sklearn_knn():
    """
    KNN algorithm implemented with the sklearn library
    @author cecilia
    """
    X_train = np.array([[5, 4],
                        [9, 6],
                        [4, 7],
                        [2, 3],
                        [8, 1],
                        [7, 2]])
    y_train = np.array([1, 1, 1, -1, -1, -1])
    # Data to be predicted
    X_new = np.array([[5, 3]])
    # Effect of different values of k on the result
    for k in range(1, 6, 2):
        # Build the classifier
        clf = KNeighborsClassifier(n_neighbors=k, n_jobs=-1)
        # Fit the classifier (sklearn chooses a suitable search algorithm automatically)
        clf.fit(X_train, y_train)
        # print(clf.kneighbors(X_new))
        # Predict
        y_predict = clf.predict(X_new)
        # print(clf.predict_proba(X_new))
        print("accuracy: {:.0%}".format(clf.score([[5, 3]], [1])))
        print("k={}, class label is: {}".format(k, y_predict))


if __name__ == "__main__":
    sklearn_knn()

The results show:
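
The output can be checked by hand: the nearest point to (5, 3) is (5, 4) with label 1, while the next closest points (7, 2), (2, 3) and (8, 1) all carry label -1, so the from-scratch version should print roughly:

k=1, class label is: 1
k=3, class label is: -1
k=5, class label is: -1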

Think

Where is the model complexity of the KNN algorithm mainly reflected? Under what circumstances does it lead to overfitting?

The model complexity of KNN is mainly reflected in the value of k: the smaller k is, the more complex the model and the more easily it overfits; the larger k is, the simpler the model and the more easily it underfits.
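
A minimal sketch of this effect (with made-up noisy data, not from the original text): with k = 1 every training point is its own nearest neighbor, so the training accuracy is perfect while the test accuracy can drop, which is the signature of overfitting.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Made-up noisy data, only to illustrate the effect of k.
rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = (X[:, 0] + 0.3 * rng.randn(200) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 15):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print("k={}: train accuracy={:.2f}, test accuracy={:.2f}".format(
        k, clf.score(X_tr, y_tr), clf.score(X_te, y_te)))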
