Machine Learning Algorithms in Theory and Practice (1): The KNN Algorithm

Table of contents:

I. Introduction

II. Workflow

III. Example

IV. Implementation in Python

1. Simulate data and plot it

2. The KNN process

3. KNN in scikit-learn

V. Advantages and disadvantages of KNN

1. Advantages

2. Disadvantages

VI. Applications of KNN

1. Banking system

2. Calculating credit ratings

3. Politics

4. Other areas


I. Introduction

The K-Nearest Neighbors (KNN) algorithm is a supervised machine learning algorithm that can be used for both classification and regression problems. In industry, however, it is mainly used for classification. Two properties characterize KNN well:

  • Lazy learning algorithm: it has no dedicated training phase and uses all of the training data at classification time.
  • Non-parametric learning algorithm: it makes no assumptions about the underlying data.

II. Workflow

The KNN algorithm uses "feature similarity" to predict the value of a new data point: the new point is assigned a value according to how closely it matches the points in the training set. It works as follows:

Step 1: Load the training and test data.

Step 2: Choose the value of K, i.e., the number of nearest data points to consider (K can be any positive integer).

Step 3: For each point in the test data, do the following:

  1. Calculate the distance between the test point and each row of the training data using one of the following measures: Euclidean distance, Manhattan distance, or Hamming distance. Euclidean distance is the most commonly used.

  2. Sort the rows by distance in ascending order.

  3. Select the first K rows of the sorted array.

  4. Assign the test point the class that appears most frequently among those K rows.

Step 4: End.
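
The three distance measures named in step 3 can be sketched in a few lines of NumPy. The function names and sample vectors below are assumptions made for illustration and do not come from the original walkthrough; Hamming distance is shown as the count of differing positions, which is one common convention.

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    # Sum of the absolute coordinate differences
    return float(np.sum(np.abs(a - b)))

def hamming(a, b):
    # Number of positions at which the two vectors differ
    return int(np.sum(a != b))

# Hypothetical sample vectors, chosen only for illustration
p = np.array([1, 0, 3])
q = np.array([4, 0, 7])

print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
print(hamming(p, q))    # 2
```

Hamming distance is mostly useful for categorical or binary features, while the other two apply to numeric features.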

III. Example

The following example illustrates the concept of K and how the KNN algorithm works.

Suppose we have a dataset that can be plotted as follows:

[Figure: the concept of K]

Now we need to classify a new data point, the black dot at (60, 60), as either blue or red. We assume K = 3, so the algorithm finds the three nearest data points, as the figure shows:

[Figure: the KNN algorithm]

In the figure above we can see the three nearest neighbors of the black dot. Two of the three belong to the red class, so the black dot is assigned to the red class.
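
The vote in this example can be reproduced in a few lines of Python. The list below simply encodes the three neighbors described above (two red, one blue); the variable names are illustrative.

```python
from collections import Counter

# Classes of the three nearest neighbors from the example: two red, one blue
neighbor_labels = ["red", "red", "blue"]

# The most frequent class among the K = 3 neighbors wins the vote
predicted = Counter(neighbor_labels).most_common(1)[0][0]
print(predicted)  # red
```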

IV. Implementation in Python

1. Simulate data and plot it

# Import the required packages
import numpy as np
import matplotlib.pyplot as plt

# Simulated data: two features per sample
raw_data_X = [[3.39, 2.33],
              [3.11, 1.78],
              [1.34, 3.36],
              [3.58, 4.67],
              [2.28, 2.86],
              [7.42, 4.69],
              [5.74, 3.53],
              [9.17, 2.51],
              [7.79, 3.42],
              [7.93, 0.79]]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
X_train = np.array(raw_data_X)
y_train = np.array(raw_data_y)

# Draw the scatter plot (class 0 in green, class 1 in red)
plt.scatter(X_train[y_train==0, 0], X_train[y_train==0, 1], color='g')
plt.scatter(X_train[y_train==1, 0], X_train[y_train==1, 1], color='r')
plt.show()

# Add a new data point
X = np.array([8.09, 3.36])
# Draw the scatter plot again with the new point (in blue)
plt.scatter(X_train[y_train==0, 0], X_train[y_train==0, 1], color='g')
plt.scatter(X_train[y_train==1, 0], X_train[y_train==1, 1], color='r')
plt.scatter(X[0], X[1], color='b')
plt.show()

2. The KNN process

① Calculate the distances

# Distance between each training point and the new data point
from math import sqrt

# Method 1: explicit loop
distance = []
for x_train in X_train:
    d = sqrt(np.sum((x_train - X)**2))
    distance.append(d)
distance

# Method 2: list comprehension
distance = [sqrt(np.sum((x_train - X)**2)) for x_train in X_train]

[4.811538215581374,
 5.224633958470201,
 6.75,
 4.696402878799901,
 5.831474942070831,
 1.489227987918573,
 2.356140912594151,
 1.3743725841270265,
 0.30594117081556693,
 2.574975728040946]
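
As an alternative to the two loops above, the same distances can be computed in one vectorized call. This is a sketch using `np.linalg.norm` with `axis=1`, which takes the Euclidean norm of each row of the difference; the data is repeated so that the block runs on its own.

```python
import numpy as np

X_train = np.array([[3.39, 2.33], [3.11, 1.78], [1.34, 3.36], [3.58, 4.67],
                    [2.28, 2.86], [7.42, 4.69], [5.74, 3.53], [9.17, 2.51],
                    [7.79, 3.42], [7.93, 0.79]])
X = np.array([8.09, 3.36])

# Broadcasting subtracts X from every row; axis=1 yields one norm per row
distance = np.linalg.norm(X_train - X, axis=1)
print(distance.round(2))
```

The result is a NumPy array rather than a Python list, so it can be passed straight to `np.argsort` in the next step.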

② Sort by distance in ascending order

# Sort in ascending order (argsort returns the indices, not the values)
nearest = np.argsort(distance)
nearest

array([8, 7, 5, 6, 9, 3, 0, 1, 4, 2], dtype=int64)

③ Select the first K rows from the sorted array

# Set k to 6
k = 6

# Get the class labels of the first k rows of the sorted array
topK_y = [y_train[i] for i in nearest[:k]]
topK_y

[1, 1, 1, 1, 1, 0]

④ Assign the test point the class that appears most often among these rows

# Import Counter for tallying the votes
from collections import Counter

# Count how many times each class appears
votes = Counter(topK_y)
votes

Counter({1: 5, 0: 1})

Class 1 appears most often, with 5 votes, so the new data point is assigned to class 1 (red).

votes.most_common(2)         # returns [(1, 5), (0, 1)]
votes.most_common(1)         # returns [(1, 5)]
votes.most_common(1)[0][0]   # returns 1

# Predict the class of the new point
predict_y = votes.most_common(1)[0][0]
predict_y

1
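
Steps ① to ④ can be collected into a single function. The following is a minimal sketch; the name `knn_classify` and its signature are assumptions made for illustration and are not defined anywhere in the walkthrough above.

```python
from collections import Counter

import numpy as np

def knn_classify(X_train, y_train, x, k):
    """Predict the class of point x by majority vote of its k nearest neighbors."""
    # ① Euclidean distance from x to every training point
    distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    # ② Indices of the training points, nearest first
    nearest = np.argsort(distances)
    # ③ Labels of the k nearest points
    top_k = y_train[nearest[:k]]
    # ④ Most frequent label among them
    return Counter(top_k.tolist()).most_common(1)[0][0]

X_train = np.array([[3.39, 2.33], [3.11, 1.78], [1.34, 3.36], [3.58, 4.67],
                    [2.28, 2.86], [7.42, 4.69], [5.74, 3.53], [9.17, 2.51],
                    [7.79, 3.42], [7.93, 0.79]])
y_train = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

print(knn_classify(X_train, y_train, np.array([8.09, 3.36]), k=6))  # 1
```

With an even k such as 6, a tie between classes is possible. In that case `most_common` returns the tied label that was counted first, which in this sketch is the one that appears earliest among the nearest neighbors.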

3. KNN in scikit-learn

# Imports
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Set k to 6
kNN_classifier = KNeighborsClassifier(n_neighbors=6)

# Training data and fitting
raw_data_X = [[3.39, 2.33],
              [3.11, 1.78],
              [1.34, 3.36],
              [3.58, 4.67],
              [2.28, 2.86],
              [7.42, 4.69],
              [5.74, 3.53],
              [9.17, 2.51],
              [7.79, 3.42],
              [7.93, 0.79]]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
X_train = np.array(raw_data_X)
y_train = np.array(raw_data_y)
kNN_classifier.fit(X_train, y_train)

# New data point, reshaped into a 2D array with one row
X = np.array([8.09, 3.36])
X_predict = X.reshape(1, -1) 

# Predict the class of the new point
y_predict = kNN_classifier.predict(X_predict)
y_predict[0]

1
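
To check that scikit-learn arrives at the same neighbors as the manual walkthrough, the fitted classifier can be queried with `kneighbors` and `predict_proba`, both standard methods of `KNeighborsClassifier`. This is a sketch that repeats the data so that it runs on its own.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[3.39, 2.33], [3.11, 1.78], [1.34, 3.36], [3.58, 4.67],
                    [2.28, 2.86], [7.42, 4.69], [5.74, 3.53], [9.17, 2.51],
                    [7.79, 3.42], [7.93, 0.79]])
y_train = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

kNN_classifier = KNeighborsClassifier(n_neighbors=6)
kNN_classifier.fit(X_train, y_train)

X_predict = np.array([[8.09, 3.36]])

# Distances to, and training-set indices of, the 6 nearest neighbors
distances, indices = kNN_classifier.kneighbors(X_predict)
print(indices[0])  # [8 7 5 6 9 3]

# Share of the neighbors' votes per class, in the order given by classes_
proba = kNN_classifier.predict_proba(X_predict)[0]
print(kNN_classifier.classes_, proba)
```

The indices match the first six entries of the `np.argsort` result from the manual version, and class 1 receives 5 of the 6 votes.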

V. Advantages and disadvantages of KNN

1. Advantages

  • It is a very simple algorithm, easy to understand and interpret.

  • It is very useful for non-linear data, because the algorithm makes no assumptions about the data.

  • It is a versatile algorithm, because it can be used for both classification and regression.

  • It has relatively high accuracy, although there are supervised learning models that perform better than KNN.
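
For regression, a common variant replaces the majority vote with the mean of the neighbors' target values. The sketch below assumes that variant; the function name `knn_regress` and the training values are made up for illustration.

```python
import numpy as np

def knn_regress(X_train, y_train, x, k):
    """Predict a numeric target for x as the mean target of its k nearest neighbors."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]
    return float(np.mean(y_train[nearest]))

# Hypothetical one-feature training set with numeric targets
X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([10.0, 20.0, 30.0, 100.0])

# The three nearest neighbors of 2.1 are 2.0, 3.0 and 1.0
print(knn_regress(X_train, y_train, np.array([2.1]), k=3))  # 20.0
```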

2. Disadvantages

  • It is computationally somewhat expensive, because it stores all of the training data.

  • It requires more memory than other supervised learning algorithms.

  • Prediction is very slow when N is large.

  • It is very sensitive to the scale of the data and to irrelevant features.
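
The sensitivity to feature scale can be seen in a small experiment. In the sketch below all values are invented for illustration: one feature is in the tens of thousands and the other in single digits, so the large feature dominates the Euclidean distance until both are standardized.

```python
import numpy as np

# Hypothetical data: column 0 has a large scale, column 1 a small one
X_train = np.array([[50000.0, 1.0],
                    [51000.0, 9.0]])
x = np.array([50800.0, 1.5])

# Unscaled: column 0 dominates the distance, so row 1 looks nearest
raw = np.linalg.norm(X_train - x, axis=1)
print(raw.argmin())  # 1

# Standardize each column with the training mean and standard deviation
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
scaled = np.linalg.norm((X_train - mean) / std - (x - mean) / std, axis=1)
print(scaled.argmin())  # 0
```

After scaling, both features contribute comparably and the nearest neighbor changes, which is why features are usually standardized before KNN is applied.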

VI. Applications of KNN

Here are some areas in which KNN can be applied successfully:

1. Banking system

KNN can be used in a banking system to predict whether an individual is suitable for loan approval: does the individual have characteristics similar to those of people who have defaulted?

2. Calculating credit ratings

By comparing an individual with people who have similar characteristics, the KNN algorithm can be used to estimate that individual's credit rating.

3. Politics

With the KNN algorithm, potential voters can be divided into several categories, such as "will vote", "will not vote", and "will vote for" a particular party.

4. Other areas

Speech recognition, handwriting detection, video recognition and image recognition.

 

 

 


Origin blog.csdn.net/weixin_40431584/article/details/104671882