19.6.29 k-nearest neighbor algorithm

## Machine learning: k-nearest neighbor

Machine learning converts unordered data into useful information. Its main tasks are classification and regression: classification assigns instance data to the appropriate category, while regression predicts numeric data.

The k-nearest neighbor algorithm (kNN) is a machine learning algorithm. Put simply, it classifies samples of nominal or numeric type by measuring the distance between different feature values. It works as follows: there is a sample data set, also called the training set, in which every sample carries a label. When new, unlabeled data arrives, each of its feature values is compared against the corresponding features of the samples in the set, and the algorithm extracts the class labels of the samples most similar to the new data (its nearest neighbors). Typically we consider only the k most similar samples in the data set, and the classification that appears most often among those k samples becomes the classification of the new data.
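The idea above can be sketched in a few lines of plain Python. This is a minimal illustration (not the implementation given later in the post), using hypothetical 1-D samples so the distance is just an absolute difference:

```python
from collections import Counter

def knn_predict(query, samples, labels, k=3):
    """Label `query` by majority vote among its k nearest 1-D samples."""
    # Sort sample indices by their distance to the query point.
    order = sorted(range(len(samples)), key=lambda i: abs(samples[i] - query))
    # Vote among the labels of the k closest samples.
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

samples = [1.0, 1.2, 3.0, 3.2, 3.1]
labels = ['A', 'A', 'B', 'B', 'B']
print(knn_predict(1.1, samples, labels, k=3))  # 'A'
```

The query point 1.1 sits closest to the two 'A' samples, so two of its three nearest neighbors vote 'A' and that label wins.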

The target variable is the result a machine learning algorithm predicts. In classification algorithms the target variable is usually nominal, while in regression algorithms it is usually continuous.
The general process of the k-nearest neighbor algorithm:
(1) Collect data
(2) Prepare data
(3) Analyze data
(4) Train the algorithm
(5) Test the algorithm
(6) Use the algorithm
The k-nearest neighbor algorithm:

import operator
from numpy import tile

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Compute the Euclidean distance from inX to every training sample
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)  # sum each row to get the squared distances
    distances = sqDistances ** 0.5
    # argsort() returns the indices that would sort the distances in ascending order
    sortedDistIndicies = distances.argsort()
    classCount = {}
    # Tally the class labels of the k nearest samples
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Sort classes by vote count, descending (items() replaces Python 2's iteritems())
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

The classify0() function takes four input parameters: inX, the input vector to classify; dataSet, the training set matrix; labels, the vector of class labels; and k, the number of nearest neighbors to use in the vote. The label vector must have as many elements as the dataSet matrix has rows.
The Euclidean distance formula calculates the distance between two vectors xA and xB; in two dimensions, d = sqrt((xA0 - xB0)^2 + (xA1 - xB1)^2).
After the distances from the input point to all training points are calculated, they are sorted in ascending order, and the majority class among the first k elements is determined, where k is always a positive integer. To do this, the classCount dictionary is decomposed into a list of tuples, the list is sorted by its second element (the vote count) using the itemgetter method imported from the operator module, and finally the label with the highest frequency of occurrence is returned.
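The final sorting step can be seen in isolation. Here classCount holds a hypothetical vote tally for the k nearest neighbors:

```python
import operator

# Hypothetical tally of votes among the k nearest neighbors
classCount = {'A': 1, 'B': 2}
# itemgetter(1) sorts the (label, count) pairs by count, in descending order
ranked = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
print(ranked)        # [('B', 2), ('A', 1)]
print(ranked[0][0])  # 'B'
```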
To predict the classification of a data point, enter the following command at the Python prompt:

kNN.classify0([0, 0], group, labels, 3)
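The command above assumes a training matrix group and a label list labels already exist. They could be built, for example, by a small helper such as this hypothetical createDataSet() with a toy data set:

```python
from numpy import array

def createDataSet():
    # Hypothetical toy training set: four 2-D points in two classes
    group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

group, labels = createDataSet()
```

With this data, classify0([0, 0], group, labels, 3) would be expected to return 'B', since the points nearest the origin are labeled 'B'.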

By running the classifier on a large amount of test data, we can obtain its error rate: the number of misclassified samples divided by the total number of tests performed.
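That error rate is straightforward to compute; a minimal sketch, using hypothetical prediction and ground-truth lists:

```python
def error_rate(predictions, truths):
    """Fraction of test samples the classifier got wrong."""
    errors = sum(1 for p, t in zip(predictions, truths) if p != t)
    return errors / len(truths)

# Hypothetical predictions versus true labels for 5 test samples
print(error_rate(['A', 'B', 'B', 'A', 'B'], ['A', 'B', 'A', 'A', 'B']))  # 0.2
```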


Origin: blog.csdn.net/dan_youshang/article/details/94164397