R language study notes - the K-nearest neighbor algorithm

The K-nearest neighbor (KNN) algorithm rests on a simple idea: if most of the K samples nearest to a new sample in feature space belong to a certain category, then the new sample is assigned to that category and is assumed to share the characteristics of those samples. In other words, each sample can be represented by its K closest neighbors. KNN is suitable for both classification and regression, and is widely used in recommender systems, semantic search, and anomaly detection.

[KNN classification schematic: a green dot among red triangles and blue squares; the solid circle contains its 3 nearest neighbors, the dashed circle its 5 nearest neighbors]

Does the green dot in the figure belong with the red triangles or the blue squares? If K = 5 (the 5 neighbors closest to the green dot, inside the dashed circle), then 3 of those 5 nearest neighbors are blue squares (a ratio of 3/5), so the green dot is assigned to the blue square class. If K = 3 (the 3 neighbors closest to the green dot, inside the solid circle), then 2 of those 3 nearest neighbors are red triangles (a ratio of 2/3), so the green dot is assigned to the red triangle class.

As this example shows, the method decides the category of an unclassified sample purely from the categories of its nearest one or several samples.
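The schematic can be reproduced with a few lines of R. The coordinates below are made up for illustration; they simply arrange two red triangles and four blue squares so that the prediction for the green dot flips between K = 3 and K = 5, as in the figure:

library(class)
# Two red triangles and four blue squares (coordinates are illustrative)
train <- data.frame(x = c(1.0, 1.2, 2.5, 2.8, 3.0, 3.2),
                    y = c(1.0, 1.5, 2.0, 0.5, 1.8, 0.8))
cl <- factor(c("red_triangle", "red_triangle",
               "blue_square", "blue_square", "blue_square", "blue_square"))
green <- data.frame(x = 1.5, y = 1.2)
knn(train, green, cl, k = 3)  # 2 of 3 nearest neighbors are red -> red_triangle
knn(train, green, cl, k = 5)  # 3 of 5 nearest neighbors are blue -> blue_square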

KNN algorithm implementation steps:

1. Data preprocessing

2. Construct training set and test set data

3. Set the parameters, such as the value of K (K is usually chosen near the square root of the number of training samples, typically between 3 and 10)

4. Maintain a priority queue of size K, ordered by (Euclidean) distance from largest to smallest, to store the nearest-neighbor training tuples. Randomly select K tuples from the training set as the initial nearest neighbors, compute the distance from the test tuple to each of them, and store the training tuple labels and distances in the priority queue

5. Traverse the training tuple set, compute the distance L between each remaining training tuple and the test tuple, and compare L with the maximum distance Lmax in the priority queue

6. If L >= Lmax, discard the tuple and move on to the next one. If L < Lmax, remove the tuple with the largest distance from the priority queue and insert the current training tuple

7. After the traversal completes, take the majority class of the K tuples in the priority queue as the class of the test tuple

8. After the whole test set has been classified, compute the error rate, repeat the procedure with different K values, and keep the K value with the smallest error rate. A from-scratch sketch of steps 4-7 is given right after this list.
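To make steps 4-7 concrete, here is a from-scratch sketch (the function name knn_predict is made up for illustration). Rather than maintaining an explicit priority queue, it sorts all distances and keeps the K smallest, which yields the same neighbors on small data sets:

# Classify one test tuple by majority vote among its K nearest training tuples
knn_predict <- function(train_x, train_labels, test_row, k = 3) {
  # Euclidean distance from the test tuple to every training tuple
  d <- sqrt(rowSums(sweep(as.matrix(train_x), 2, as.numeric(test_row))^2))
  # Labels of the K nearest training tuples (steps 4-6 in one sort)
  nearest <- train_labels[order(d)[1:k]]
  # Majority class among the K neighbors (step 7)
  names(which.max(table(nearest)))
}
# Example on R's built-in iris data:
# knn_predict(iris[, 1:4], iris$Species, c(5.8, 3.0, 4.3, 1.3), k = 5)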

R language implementation process:

R provides several functions for K-nearest neighbor analysis: the knn function in the class package, the train function in the caret package, and the kknn function in the kknn package.

knn(train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE)
Parameter meanings:
train: matrix or data frame containing the training set
test: matrix or data frame containing the test set
cl: factor of true classifications for the training set
k: number of neighbors considered
l: minimum vote for a definite decision (otherwise the prediction is in doubt)
prob: whether to return the proportion of votes for the winning class as the "prob" attribute
use.all: controls the handling of ties; if several training points are exactly as far away as the Kth nearest neighbor, all of them are used by default,
             and when this parameter is set to FALSE, one of them is selected at random so that exactly K neighbors are used.
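As a quick illustration of these arguments (using R's built-in iris data rather than the dating data discussed below):

library(class)
set.seed(1)
idx  <- sample(nrow(iris), 100)                      # 100 training rows
pred <- knn(train = iris[idx, 1:4], test = iris[-idx, 1:4],
            cl = iris$Species[idx], k = 5, prob = TRUE)
table(pred, iris$Species[-idx])                      # confusion table
head(attr(pred, "prob"))                             # vote share of the winning class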

(Sample data description: the sample data describe a woman who sorts her dates into three preference categories based on three features: the number of frequent flyer miles a date earns, the percentage of time he spends playing video games, and the liters of ice cream he consumes per week.)

Code:

# Import analysis data
mydata <- read.table("C:/Users/Cindy/Desktop/婚恋/datingTestSet.txt")
str(mydata)
colnames(mydata) <- c('flight_miles', 'game_time_pct', 'icecream_liters', 'like_type')
head(mydata)
# Data preprocessing: min-max normalization to the [0, 1] interval
norfun <- function(x){
  z <- (x - min(x)) / (max(x) - min(x))
  return(z)
}
data <- as.data.frame(apply(mydata[, 1:3], 2, norfun))
data$like_type <- factor(mydata[, 4])
# Create test set and training set samples
library(caret)
set.seed(123)
ind <- createDataPartition(y = data$like_type, times = 1, p = 0.5, list = FALSE)
traindata <- data[ind, ]
testdata <- data[-ind, ]
# KNN algorithm
library(class)
kresult <- knn(train = traindata[, 1:3], test = testdata[, 1:3],
               cl = traindata$like_type, k = 3)
# Cross table of actual vs. predicted classes, and prediction accuracy
table(testdata$like_type, kresult)
sum(diag(table(testdata$like_type, kresult))) / sum(table(testdata$like_type, kresult))

Run result:

According to the output, the classification accuracy is 95%.
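Step 8 of the algorithm can be carried out with a short loop that retries several K values on this split and keeps the one with the lowest test error (a simple sketch reusing the variables above; k-fold cross-validation, e.g. via caret's train function, would give a more stable choice):

ks  <- 1:10
err <- sapply(ks, function(k) {
  pred <- knn(train = traindata[, 1:3], test = testdata[, 1:3],
              cl = traindata$like_type, k = k)
  mean(pred != testdata$like_type)   # test error rate for this K
})
ks[which.min(err)]                   # K with the smallest error rate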

Advantages and disadvantages of KNN algorithm:

Advantages:

1. Easy to understand and implement

2. Suitable for classifying rare events

3. Especially suitable for multi-class problems (multi-modal problems, where objects can carry multiple class labels), where KNN can perform better than SVM

Disadvantages:

The computational cost is high: the distance between each new data point and every sample in the training set must be computed to determine its K nearest neighbors.

Improvements:

For classification efficiency, delete beforehand the attributes that have little influence on the classification result; for classification quality, use the weighted K-nearest neighbor algorithm, which assigns sample points different weights according to their distances. The kknn function in the kknn package implements this weighted KNN algorithm; a minimal example follows.
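As a sketch of the weighted variant, the kknn function can be applied to the same training/test split built above; its kernel argument controls how neighbors are weighted by distance:

library(kknn)
# Weighted KNN: the "triangular" kernel gives closer neighbors larger weights
wfit <- kknn(like_type ~ ., train = traindata, test = testdata,
             k = 3, distance = 2, kernel = "triangular")
mean(fitted(wfit) == testdata$like_type)   # accuracy of the weighted model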


