Vision Machine Learning 2: The KNN Algorithm


KNN is theoretically mature and is one of the simplest machine learning methods. It can be used for both classification and regression, and it is a supervised learning method.

The basic idea of the KNN algorithm: given a new, unlabeled (unclassified) sample, first extract its features and compare them with the features of every sample in the training set; then take the labels of the k nearest (most similar) training samples, count which label occurs most often among those k neighbors, and assign that label to the new sample.
Assuming each sample has m features, a sample can be represented as an m-dimensional vector X = (x1, x2, ..., xm); likewise, the test point can be written as Y = (y1, y2, ..., ym). Choosing an appropriate distance function can improve accuracy. KNN commonly uses the Euclidean, Manhattan, or Mahalanobis distance; only the Euclidean distance is used here.

The Euclidean distance is d(X, Y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xm - ym)^2 ). To implement the KNN algorithm, we only need to compute the distance from each sample point to the test point, select the k samples with the smallest distances, look up their labels, and return the label that occurs most often among those k samples.
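As a quick check of the distance formula, here is a minimal NumPy sketch that computes the Euclidean distance (and the Manhattan distance, for comparison) between two vectors; the sample values are made up for illustration:

```python
import numpy as np

# Two illustrative feature vectors (made-up values)
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(((x - y) ** 2).sum())

# Manhattan distance: sum of absolute differences
manhattan = np.abs(x - y).sum()

print(euclidean)  # 5.0
print(manhattan)  # 7.0
```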

# Normalize the data, i.e. rescale each feature to the range [0, 1]
def normData(dataSet):
    maxVals = dataSet.max(axis=0)  # per-column maximum
    minVals = dataSet.min(axis=0)  # per-column minimum
    ranges = maxVals - minVals     # per-column range (assumed nonzero)
    retData = (dataSet - minVals) / ranges  # min-max scale each column to [0, 1]
    return retData, ranges, minVals
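On the sample data used later in this post, the normalization works out as follows; this is a standalone sketch that restates normData so it can run on its own:

```python
import numpy as np

def normData(dataSet):
    # Per-column min-max scaling to [0, 1]
    maxVals = dataSet.max(axis=0)
    minVals = dataSet.min(axis=0)
    ranges = maxVals - minVals
    return (dataSet - minVals) / ranges, ranges, minVals

dataSet = np.array([[2, 3], [1, 1], [9, 9], [6, 8]])
normDataSet, ranges, minVals = normData(dataSet)
print(normDataSet)
# [[0.125 0.25 ]
#  [0.    0.   ]
#  [1.    1.   ]
#  [0.625 0.875]]
```

Note that if a feature is constant across all samples, its range is zero and the division would fail; that case would need special handling in practice.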

#implement the KNN algorithm
def kNN(dataSet, labels, testData, k):
    distSquareMat = (dataSet - testData) ** 2 # squared difference per feature (broadcasts over rows)
    distSquareSums = distSquareMat.sum(axis=1) # sum the squared differences in each row
    distances = distSquareSums ** 0.5 # take the square root to get each sample's distance to the test point
    sortedIndices = distances.argsort() # indices that would sort the distances in ascending order
    indices = sortedIndices[:k] # indices of the k nearest samples
    labelCount = {} # occurrence count for each label
    for i in indices:
        label = labels[i]
        labelCount[label] = labelCount.get(label, 0) + 1 # increment this label's count
    sortedCount = sorted(labelCount.items(), key=opt.itemgetter(1), reverse=True)
    # sort the labels by occurrence count, largest first
    return sortedCount[0][0] # return the most frequent label among the k neighbors
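As an aside, the majority-vote step can also be written with collections.Counter from the standard library, which avoids the manual dictionary and sort; this is an alternative sketch, not part of the original implementation:

```python
from collections import Counter

# Labels of the k nearest neighbors (illustrative values)
nearestLabels = ['a', 'b', 'a']

# most_common(1) returns [(label, count)] for the most frequent label
majorityLabel = Counter(nearestLabels).most_common(1)[0][0]
print(majorityLabel)  # a
```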

# Test with sample data
import numpy as np
import operator as opt
if __name__ == "__main__":
    dataSet = np.array([[2, 3],[1, 1],[9, 9],[6, 8]])
    normDataSet, ranges, minVals = normData(dataSet)
    labels = ['a','a', 'b','b']
    testData = np.array([3.9, 5.5])
    normTestData = (testData - minVals) / ranges
    result = kNN(normDataSet, labels, normTestData, 1)
    print(result)
The output is: a
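Putting the pieces together, here is a self-contained version of the whole example (same functions and data as above) that also tries k=3; with this data the three nearest neighbors vote 'a', 'b', 'a', so the majority label is still 'a':

```python
import operator as opt
import numpy as np

def normData(dataSet):
    # Min-max scale each column to [0, 1]
    maxVals = dataSet.max(axis=0)
    minVals = dataSet.min(axis=0)
    ranges = maxVals - minVals
    return (dataSet - minVals) / ranges, ranges, minVals

def kNN(dataSet, labels, testData, k):
    # Euclidean distance from every sample to the test point
    distances = (((dataSet - testData) ** 2).sum(axis=1)) ** 0.5
    indices = distances.argsort()[:k]  # indices of the k nearest samples
    labelCount = {}
    for i in indices:
        labelCount[labels[i]] = labelCount.get(labels[i], 0) + 1
    # Return the most frequent label among the k neighbors
    return sorted(labelCount.items(), key=opt.itemgetter(1), reverse=True)[0][0]

dataSet = np.array([[2, 3], [1, 1], [9, 9], [6, 8]])
labels = ['a', 'a', 'b', 'b']
normDataSet, ranges, minVals = normData(dataSet)
testData = np.array([3.9, 5.5])
normTestData = (testData - minVals) / ranges
for k in (1, 3):
    print(k, kNN(normDataSet, labels, normTestData, k))  # 'a' for both k values
```

Using an odd k avoids ties in binary classification; here both k=1 and k=3 agree on 'a'.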
