A Summary of the KNN Algorithm in Machine Learning

Foreword

The KNN (K-Nearest Neighbor) algorithm is one of the simplest and most practical machine learning algorithms, and it is the first algorithm introduced in the book "Machine Learning in Action". It is an instance-based supervised learning algorithm: it requires no training and produces no model that summarizes the characteristics of the data; it can be applied simply by choosing an appropriate parameter K. The goal of KNN is to find the K nearest neighbors of a new sample in the training data and predict the new sample's label from the labels of those neighbors. Every time KNN makes a prediction, all of the training data participates in the calculation.

KNN has many application scenarios:

  • Classification, including multi-class problems handled naturally, such as classifying music into different genres according to its features.
  • Recommendation systems: recommending similar items or services based on a user's historical behavior.
  • Image recognition, such as face recognition and license plate recognition.

1. Concept

1.1 Basic concepts of machine learning

Machine learning is an important branch of artificial intelligence that helps us discover patterns in large amounts of data and make predictions from them.

Machine learning can be divided into three types: supervised learning, unsupervised learning and semi-supervised learning.

  • Supervised learning: the training data is labeled with the correct answers; a model is trained on this data and then used to predict new data.
  • Unsupervised learning: the training data is not labeled; patterns in the data are discovered through operations such as clustering and dimensionality reduction.
  • Semi-supervised learning: a method between supervised and unsupervised learning that uses both labeled and unlabeled data.

The following table explains some basic concepts of machine learning.

| Concept | Explanation | Remark |
| --- | --- | --- |
| Classification | Dividing a dataset into different categories | Supervised learning |
| Clustering | Dividing a dataset into groups of similar objects | Unsupervised learning |
| Regression | Predicting continuous numeric values | Supervised learning |
| Sample set | The dataset used to train the model, generally divided into a training set and a test set | Each sample contains one or more features and a label |
| Feature | A property or characteristic used to describe a sample | Usually a column of the training set; each feature is an independent measurement, and multiple features together form a training sample |
| Label | The category or result a sample belongs to | |
| Model | A regularity or pattern learned from the training data | Used to predict labels or values for new data |
| Gradient | The rate of change of a function at a point | Used to optimize model parameters by minimizing a loss function |

1.2 k value

The k value means that the labels of the k most similar neighbors are used to determine the category of the current sample. k is usually an integer no greater than 20, and 3 or 5 are common choices.
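
For example, if k = 3 and the labels of the five nearest neighbors (ordered by distance) are 1, 1, 7, 1, 4, only the first three vote and the predicted class is 1. A minimal sketch of this majority vote, using made-up labels:

from collections import Counter

# hypothetical labels of the nearest neighbors, already ordered by distance
nearest_labels = [1, 1, 7, 1, 4]

k = 3
prediction = Counter(nearest_labels[:k]).most_common(1)[0][0]
print(prediction)  # 1, because two of the three nearest neighbors are labeled 1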

1.3 Distance Metrics

The distance metric is the method used to compute the distance between samples in the KNN algorithm. Commonly used distance metrics include Euclidean distance, Manhattan distance, Chebyshev distance, and Minkowski distance; a small NumPy sketch of these metrics follows the list below.

  • Euclidean distance

    • Two-dimensional plane

      $d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$

    • n dimensions

      $d = \sqrt{\sum_{i=1}^{n} \left| x_i - y_i \right|^2}$

  • Manhattan distance

    $d = \sum_{i=1}^{n} |x_i - y_i|$

  • Chebyshev distance

    $d = \max(|x_1 - y_1|, |x_2 - y_2|, \cdots, |x_n - y_n|)$

  • Minkowski distance

    $d = \sqrt[p]{\sum_{i=1}^{n} |x_i - y_i|^p}$
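
As a quick sketch, these four metrics can be computed with NumPy as follows; x and y are arbitrary example vectors, not data from the handwritten digit set.

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))            # sqrt(9 + 4 + 0) ≈ 3.606
manhattan = np.sum(np.abs(x - y))                    # 3 + 2 + 0 = 5
chebyshev = np.max(np.abs(x - y))                    # max(3, 2, 0) = 3
p = 3
minkowski = np.sum(np.abs(x - y) ** p) ** (1.0 / p)  # equals Manhattan when p = 1, Euclidean when p = 2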

1.4 Weighting method

The weighting method in the KNN algorithm refers to assigning different weights to neighbors at different distances when their votes or values are combined; the weights are computed from the distance between the new sample and each neighbor, so that closer neighbors typically carry more influence.

Commonly used numerical data weighting methods are as follows:

  1. Weighted average: the weighted average of the attribute values of the K neighbors is used as the predicted value of the new data point.
  2. Mean method: the average of the attribute values of the K neighbors is taken as the predicted value of the new data point.
  3. Worst value: the minimum and maximum of the attribute values of the K neighbors are taken, and their average is used as the predicted value of the new data point.

Common discrete data weighting methods are as follows:

  1. Inverse function
  2. Gaussian function
  3. Polynomial function

Different weighting methods can be selected according to the situation to achieve a better classification or prediction result; a small sketch of the inverse and Gaussian weighting functions follows.
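
For illustration, the inverse and Gaussian weights might be implemented like this; the constant added to the denominator and the sigma value are arbitrary choices, not values prescribed by the algorithm.

import math

def inverse_weight(dist, const=1.0):
    # closer neighbors get larger weights; const avoids division by zero when dist is 0
    return 1.0 / (dist + const)

def gaussian_weight(dist, sigma=10.0):
    # weight decays smoothly from 1 toward 0 as the distance grows
    return math.exp(-dist ** 2 / (2 * sigma ** 2))

Plugging either of these into the weight() function of the implementation in section 2.1 would turn the plain majority vote into a distance-weighted vote.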

2. Implementation

The handwritten digit dataset is the one provided with Chapter 2 of "Machine Learning in Action": https://github.com/pbharrin/machinelearninginaction

2.1 Implementation from scratch

import numpy as np
from collections import Counter
import operator
from os import listdir

# inX: the input vector to classify
# dataSet: the training set (one sample per row)
# labels: the labels of the training samples
# k: the number of nearest neighbors
# output: the predicted label
def classify0(inX, dataSet, labels, k):
    distances, sortedDistIndicies = euclideanDistance(inX, dataSet)
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        # each of the k nearest neighbors votes for its label, scaled by weight(distance)
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1.0 * weight(distances[sortedDistIndicies[i]])
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)

    return sortedClassCount[0][0]

def euclideanDistance(inX, dataSet):
    # Euclidean distance between inX and every row of dataSet
    dataSetSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    return distances, sortedDistIndicies

def weight(dist):
    # uniform weighting: every neighbor counts equally;
    # swap in a distance-based function (e.g. inverse or Gaussian) for weighted voting
    return 1


def classify1(test, train, trainLabel, k):
    # alternative implementation: unweighted majority vote using collections.Counter
    distances = []
    for i in range(len(train)):
        distance = np.sqrt(np.sum(np.square(test - train[i, :])))
        distances.append([distance, i])
    distances = sorted(distances)
    targets = [trainLabel[distances[i][1]] for i in range(k)]
    return Counter(targets).most_common(1)[0][0]

def img2vector(filename):
    # convert a 32x32 text "image" into a 1x1024 feature vector
    returnVect = np.zeros((1,1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0,32*i+j] = int(lineStr[j])
    return returnVect

def handWritingDataSet(inputDir):
    # load every text image in inputDir; file names such as "9_45.txt" encode the class label before the underscore
    hwLabels = []
    fileNames = []
    dataFileList = listdir(inputDir)           
    m = len(dataFileList)
    dataMat = np.zeros((m,1024))
    for i in range(m):
        fileNameStr = dataFileList[i]
        fileStr = fileNameStr.split('.')[0]     
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        fileNames.append(fileStr)
        dataMat[i,:] = img2vector( inputDir + '/%s' % fileNameStr)
    return dataMat,hwLabels,fileNames

trainMat, trainLabels, _ = handWritingDataSet('digits/trainingDigits/')
testMat, testLabels,testFileNames = handWritingDataSet('digits/testDigits/')

errorCount = 0
k = 3
for idx, testData in enumerate(testMat):
    predictedLabel = classify0(testData, trainMat, trainLabels, k)
    # predictedLabel = classify1(testData, trainMat, trainLabels, k)
    if testLabels[idx] != predictedLabel:
        errorCount += 1
        print("misclassified file: %s.txt, predicted digit: %d" % (testFileNames[idx], predictedLabel))
print("k: %d, errors: %d, error rate: %.3f%%" % (k, errorCount, errorCount / 1.0 / np.size(testMat, 0) * 100))


2.2 Using the Scikit-learn library

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier

from sklearn.neighbors import KNeighborsClassifier

trainMat, trainLabels, _ = handWritingDataSet('digits/trainingDigits/')
testMat, testLabels,testFileNames = handWritingDataSet('digits/testDigits/')

errorCount = 0
k = 3

neigh = KNeighborsClassifier(n_neighbors=k)
neigh.fit(trainMat, trainLabels)  # fit() stores the training data (and may build a search tree) for neighbor lookup

for idx, testData in enumerate(testMat):
    predictedLabel = neigh.predict([testData])[0]
    if testLabels[idx] != predictedLabel:
        errorCount += 1
        print("misclassified file: %s.txt, predicted digit: %d" % (testFileNames[idx], predictedLabel))
print("k: %d, errors: %d, error rate: %.3f%%" % (k, errorCount, errorCount / 1.0 / np.size(testMat, 0) * 100))


2.3 Test your own data

The handwritten digit dataset above contains 1934 training samples and 946 test samples, each a 32x32 image converted to text. To test digits written by yourself, first convert each handwritten-digit image into a 32x32-pixel image, and then convert that image into text. The following code performs the image-to-text conversion.

import cv2
import os

def img2txt(inputDir):
    dataFileList = os.listdir(inputDir)

    for file in dataFileList:
        if not file.endswith('png'):
            continue
        img = cv2.imread(inputDir + file, cv2.IMREAD_GRAYSCALE)
        fr = open(inputDir + file.split('.')[0] + '.txt', 'w')
        height, width = img.shape[0:2]

        # binarize: bright (background) pixels become '0', dark (ink) pixels become '1'
        for row in range(height):
            line = ''
            for col in range(width):
                if img[row, col] > 250:
                    line += '0'
                else:
                    line += '1'
            fr.write(line)
            fr.write('\n')

        fr.close()

if __name__ == '__main__':
    img2txt('img/')

Next, prepare ten digits (0-9) handwritten by yourself for testing. The digits below were drawn with the mouse in the Windows Paint tool on a canvas cropped to 32x32 pixels.

[Image: self-made handwritten digits 0-9]

Converting these 10 digits to text and running the test gives an error rate of 30%.


3. Summary

3.1 Analysis

  • When recognizing handwritten digits from the test set, some samples are always misclassified. Inspection shows that these samples are close in feature space to samples of other categories.
  • With the self-made handwritten digits, some errors do not involve visually similar digits; for example, a 4 is recognized as a 7. The cause is not obvious and may be related to how the samples were produced.

3.2 Advantages and disadvantages of KNN

  • Advantages
    1. The idea is simple and the theory is mature; KNN can be used for both classification and regression.
    2. It is relatively insensitive to outliers, since only the k nearest neighbors are considered.
  • Disadvantages
    1. KNN must compute the distance between each test sample and every training sample, so the time complexity and computational cost are high.
    2. It gives no information about the underlying structure of the data.
    3. The algorithm is relatively simple; when the training set is small, it struggles to distinguish very similar samples of different classes.

References

  1. https://github.com/pbharrin/machinelearninginaction
