[Machine Learning] k-Nearest Neighbor Learning

KNN Overview


The k-nearest neighbor (kNN, k-NearestNeighbor) algorithm is a basic classification and regression method; here we discuss only the kNN classification problem.

As a supervised classification algorithm, it is one of the simplest machine learning algorithms. As the name implies, its core idea is to determine an instance's category from the categories of its most similar neighbors. The algorithm's prerequisite is a training data set whose categories have already been labeled. Informally, the calculation process is:

Given a training data set and a new input instance, find the k instances in the training data set nearest to the new instance; if the majority of these k instances belong to some class, assign the input instance to that class.

One-sentence summary: birds of a feather flock together!

The input to the k-nearest neighbor algorithm is an instance's feature vector, which corresponds to a point in feature space; the output is the instance's category, which can be one of multiple classes. The algorithm assumes a given training data set in which every instance's category is already determined. To classify a new instance, it predicts the category from the categories of the instance's k nearest training instances, for example by majority vote. The k-nearest neighbor algorithm therefore has no explicit learning process.

In effect, the k-nearest neighbor algorithm uses the training data set to partition the feature vector space, and this partition serves as its classification "model". The choice of k, the distance metric, and the classification decision rule are the three basic elements of the algorithm.

 

An example scenario


Movies can be classified by subject matter; how, then, do we distinguish action films from romance films?

  1. Action films: more fight scenes
  2. Romance films: more kissing scenes

Based on the number of kisses and fights occurring in a movie, a program built with k-nearest neighbor can automatically classify the movie's genre.

Now compute the distance between the unknown movie and every movie in the data set, sort the distances in ascending order, and take the k nearest movies. Assume k = 3; the three closest movies are He's Not Really into Dudes, Beautiful Woman, and California Man. The kNN algorithm decides the unknown movie's type from the types of these three nearest films; since all three are romance films, we conclude that the unknown movie is a romance film.

 

Working principle


Calculation steps:

  1. Suppose there is a labeled sample data set (the training set) that records the correspondence between each data point and its category.
  2. After a new, unlabeled data point is entered, compare each feature of the new data point with the corresponding features of the samples in the data set.
    1. Compute the distance between the new data point and every sample in the data set.
    2. Sort all the distances in ascending order (the smaller the distance, the more similar the sample).
    3. Take the category labels of the first k samples (k is usually no greater than 20).
  3. Return the category label that appears most often among these k samples as the classification of the new data.
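
To make these steps concrete, here is a minimal NumPy sketch of the procedure on the movie example above; the fight/kiss counts are invented for illustration:

import numpy as np

# hypothetical training set: [number of fights, number of kisses]
train = np.array([[3, 104], [2, 100], [1, 81],    # romance films
                  [101, 10], [99, 5], [98, 2]])   # action films
labels = ["romance", "romance", "romance", "action", "action", "action"]

new_movie = np.array([18, 90])  # the unknown movie
k = 3

# step 2.1: Euclidean distance from the new movie to every training sample
distances = np.sqrt(((train - new_movie) ** 2).sum(axis=1))
# steps 2.2-2.3: sort ascending and take the labels of the k nearest samples
nearest = [labels[i] for i in distances.argsort()[:k]]
# step 3: majority vote among the k labels
prediction = max(set(nearest), key=nearest.count)
print(prediction)  # -> "romance"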

Development Process:

  • Collect data: any method
  • Prepare data: the values needed for distance calculation, preferably in a structured data format
  • Analyze data: any method
  • Train the algorithm: this step does not apply to k-nearest neighbor
  • Test the algorithm: compute the error rate
  • Use the algorithm: take structured sample data as input, run the k-nearest-neighbor algorithm to determine which category the input data belongs to, and finally perform any follow-up processing on the computed classification

 

Advantages and disadvantages of the KNN algorithm


1. Advantages

  • The idea is simple, easy to understand, and easy to implement;
  • High accuracy, insensitive to outliers, no assumptions about the input data, and no training required;
  • Especially suitable for multi-class problems.

2. Disadvantages

  • As a lazy algorithm, it does most of its computation at classification time: every training sample must be scanned to compute distances, so the memory overhead is large and scoring is slow;
  • When the samples are imbalanced, e.g. one category is much larger than the others, the majority class may dominate the k neighbors of a new sample and skew the classification result;
  • Poor interpretability: it cannot produce rules the way a decision tree can.

3. Issues to note

Setting the value of k

If k is too small, classification accuracy decreases; if k is too large, and the test sample's class is under-represented in the training set, the extra neighbors add noise and degrade the classification.

Typically, k is chosen by cross-validation (with k = 1 as a baseline).

A rule of thumb: k is generally kept below the square root of the number of training samples.
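
One practical way to follow both suggestions is to scan candidate k values with cross-validation. A minimal sketch, assuming scikit-learn is available (its bundled iris data stands in for real data here):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# try odd k from 1 up to about sqrt(n_samples), per the rule of thumb
candidates = range(1, int(np.sqrt(len(X))) + 1, 2)
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in candidates}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])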

Optimization

Compress (condense) the training sample set;

When determining the final category, do not use simple majority voting but weighted voting: the closer the neighbor, the higher its weight.
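
A minimal sketch of such distance-weighted voting, assuming a precomputed distances array and a labels list like those in the classify code later in this article; the 1/(distance + eps) weighting is one common choice, not the only one:

import numpy as np

def weighted_vote(distances, labels, k, eps=1e-8):
    # indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    weights = {}
    for i in nearest:
        # the closer the neighbor, the larger its weight
        weights[labels[i]] = weights.get(labels[i], 0.0) + 1.0 / (distances[i] + eps)
    # return the class with the largest total weight
    return max(weights, key=weights.get)

# example: the two closest points outvote the farther one
print(weighted_vote(np.array([0.1, 0.5, 0.2, 3.0]), ['a', 'b', 'a', 'b'], 3))  # -> 'a'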

 

KNN Projects


Improving match-making on a dating site

Project Overview

Helen has been using a dating site to look for dates. Over time, she found that the people she dated fell into three types:

  • People she does not like
  • People of average charm
  • Highly charismatic people

She hopes to:

  1. Date people of average charm on weekdays
  2. Date highly charismatic people on weekends
  3. Exclude the people she does not like

She has now collected some data that the dating site does not record, which further helps with classifying potential matches.

Development Process

  • Collect data: provided as a text file
  • Prepare data: parse the text file with Python
  • Analyze data: plot the points with Matplotlib
  • Train the algorithm: this step does not apply to k-nearest neighbor
  • Test the algorithm: use part of the data Helen provided as test samples. The difference between test samples and non-test samples is that test samples have already been classified; if the predicted category differs from the actual category, it is counted as an error.
  • Use the algorithm: build a simple command-line program so Helen can enter some feature data and find out whether the other person is a type she would like.

1. Collect the data

Helen extracted features from her dating records, labeled them with categories, and stored them in a text file. Each record contains the following three features:

  • frequent flyer miles earned per year
  • percentage of time spent playing video games
  • liters of ice cream consumed per week

40920   8.326976    0.953952    3
14488   7.153469    1.673904    2
26052   1.441871    0.805124    1
75136   13.147394   0.428964    1
38344   1.669788    0.134296    1
...

2. Prepare the data: parse the text file with Python

Parse the text records into a NumPy matrix:

from numpy import zeros

def file2matrix(filename):
    """
    Desc:
        Import the training data.
    parameters:
        filename: path to the data file
    return:
        the feature matrix and the corresponding category list class_label
    """
    with open(filename) as f:
        lines = f.readlines()

    row_num = len(lines)
    # zeros((2, 3)) creates a 2*3 matrix with every entry set to 0
    matrix = zeros((row_num, 3))
    index = 0
    class_label = []
    for line in lines:
        # strip leading/trailing whitespace, then split on tabs to get the feature values
        features = line.strip().split("\t")
        # fill in the feature matrix and the corresponding class label
        matrix[index, :] = features[0:3]
        class_label.append(int(features[-1]))  # the labels in the file are integers
        index += 1

    return matrix, class_label

3. Analyze the data

from numpy import array
import matplotlib.pyplot as plt

datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
fig = plt.figure()
ax = fig.add_subplot(111)
# scale marker size and color by class label so the three categories stand apart
ax.scatter(datingDataMat[:, 0], datingDataMat[:, 1], 15.0*array(datingLabels), 15.0*array(datingLabels))
plt.show()

Plotting the first and second columns of the matrix gives a good picture: three different classification regions are clearly identifiable, and people of different categories fall into different regions.

Normalizing the data

Normalization brings the features onto a unified scale; for more details see: https://www.zhihu.com/question/19951858

Consider the distance between sample 3 and sample 4: $$\sqrt{(0-67)^2 + (20000-32000)^2 + (1.1-0.1)^2}$$

The frequent-flyer-miles term dominates this sum. Normalizing the feature values eliminates such effects, which are caused by features of different orders of magnitude.

Definition of normalization: normalization transforms the data you need to process (via some algorithm) so that it is confined to a required range. Normalization first makes later data processing more convenient, and second speeds up convergence when the program runs. Common methods are as follows:

1) Linear (min-max) transformation, with the following expression:

y=(x-MinValue)/(MaxValue-MinValue)

Note: x and y are the values before and after the transformation; MaxValue and MinValue are the maximum and minimum of the sample.

2) Logarithmic transformation, with the following expression:

y=log10(x)

Note: a base-10 logarithm transformation.

As shown:

![log function graph](http://data.apachecn.org/img/AiLearning/ml/2.KNN/knn_1.png)

3) Arctangent transformation, with the following expression:

y=arctan(x)*2/PI

As shown:

![arctangent/arc-cotangent function graph](http://data.apachecn.org/img/AiLearning/ml/2.KNN/arctan_arccot.gif)

In statistics, the specific role of normalization is to summarize the statistical distribution of samples on a unified scale. Normalizing to the interval 0 to 1 yields a statistical probability distribution; normalizing to -1 to +1 yields a statistical coordinate distribution.
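
The project code below implements the linear (min-max) form; for the logarithmic and arctangent variants, a quick illustrative NumPy sketch (the values are made up):

import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])
y_log = np.log10(x)                  # logarithmic transform (requires x > 0)
y_atan = np.arctan(x) * 2 / np.pi    # squashes positive x into (0, 1)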

from numpy import tile

def normalize(matrix):
    """
    Desc:
        Normalize the feature values, eliminating the effects of features with different magnitudes.
    parameter:
        matrix: the data set
    return:
        the normalized data set norm_matrix, plus ranges and min_val
        (the per-feature range and minimum, needed later to normalize new inputs)

    Normalization formula:
        Y = (X-Xmin)/(Xmax-Xmin)
        where min and max are the smallest and largest values of each feature in the data set.
        This maps every numeric feature value into the interval 0 to 1.
    """

    # compute the per-column (per-feature) minimum, maximum, and range
    min_val = matrix.min(0)
    max_val = matrix.max(0)
    ranges = max_val - min_val
    m = matrix.shape[0]
    # build the matrix of differences from the per-feature minimum
    norm_matrix = matrix - tile(min_val, (m, 1))
    # divide the differences by the per-feature range, element-wise
    norm_matrix = norm_matrix / tile(ranges, (m, 1))

    return norm_matrix, ranges, min_val

4. Train the algorithm: this step does not apply to k-nearest neighbor

Because the test data must be compared against the full training data every time, no separate training process is necessary.

kNN algorithm pseudo code:

For every data point in the data set:
    compute the distance between the target data point (the one that needs classifying) and this data point
    sort the distances in ascending order
    take the k shortest distances
    find the most frequent class among these k
    return that class as the prediction for the target data point

import operator
from numpy import tile

def classify(input_data, dataset, labels, k):
    """
    kNN classification algorithm
    :param input_data: the target data to classify, i.e. one person's 1-D feature vector
    :param dataset: the classification data set
    :param labels: the class labels corresponding to dataset
    :param k: the number of nearest neighbors to use
    :return: the predicted class
    """
    msize = dataset.shape[0]
    # distance metric: Euclidean distance
    diff_mat = tile(input_data, (msize, 1)) - dataset
    sq_diff_mat = diff_mat ** 2
    sq_distances = sq_diff_mat.sum(axis=1)
    distances = sq_distances ** 0.5

    # sort the distances in ascending order
    sorted_index = distances.argsort()
    # take the k shortest distances and count the classes among them
    class_count = {}
    for i in range(k):
        label = labels[sorted_index[i]]
        class_count[label] = class_count.get(label, 0) + 1
    sorted_class = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)

    return sorted_class[0][0]
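
A quick sanity check of classify on toy data (the points and labels are invented for illustration):

from numpy import array

group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify(array([0.0, 0.2]), group, labels, 3))  # -> 'B'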

5. Test the algorithm

Use part of the data Helen provided as test samples. If the predicted category differs from the actual category, count it as an error.

def dating_class_test():
    """
    Desc:
        Test method for the dating-site classifier.
    parameters:
        none
    return:
        the number of errors
    """
    # fraction of the data used for testing (training fraction = 1 - ratio)
    ratio = 0.1  # one part for testing, the rest as training samples
    # load the data from the file (datingTestSet2.txt stores the labels as integers)
    feature_matrix, labels = file2matrix("data/datingTestSet2.txt")
    # normalize the data
    norm_mat, ranges, min_val = normalize(feature_matrix)

    # m is the number of rows, i.e. the first dimension of the matrix
    m = norm_mat.shape[0]
    # number of test samples
    test_num = int(m * ratio)
    error_count = 0
    for i in range(test_num):
        # classify the i-th test row against the remaining rows used as training data
        classifier_result = classify(norm_mat[i, :], norm_mat[test_num:m, :], labels[test_num:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifier_result, labels[i]))
        if classifier_result != labels[i]:
            error_count += 1
    print("the total error rate is: %f" % (error_count / float(test_num)))
    print(error_count)

6. Use the algorithm

Build a simple command-line program so Helen can enter some feature data and find out whether the other person is a type she would like.

Dating-site prediction function:

from numpy import array

def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = normalize(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    # normalize the input with the training set's minimum and range before classifying
    classifierResult = classify((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("You will probably like this person:", resultList[classifierResult - 1])

 

 

 
