A Hands-On Guide to the KNN Algorithm (Python Implementation)

In the last post we gave a brief introduction to the KNN algorithm; now that the holiday is over, let's revisit it!

Basic principle of the KNN algorithm:

Suppose we have a dataset with the following two groups:

dataset = {'black': [[1, 2], [2, 3], [3, 1]], 'red': [[6, 5], [7, 7], [8, 6]]}

There is also a point marked in green at (3.5, 5.2). KNN's task is to decide which group this point (the green dot in the figure below) should belong to.


The KNN classification algorithm is extremely simple: using the two-point distance formula from middle school (the Euclidean distance formula), compute the distance from the green point to the points in each group and see which group it is closer to. K is the number of nearest neighbors considered: if the majority of those k points are red, we classify the green point as red; otherwise, as black. With two groups of data (as in the figure above), k should be at least 3, and odd, so that the majority vote always produces a winner.
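
For intuition, here is a minimal sketch of that calculation for the green point (3.5, 5.2) against the dataset above; it simply sorts all six distances and looks at the three smallest:

import numpy as np

dataset = {'black': [[1, 2], [2, 3], [3, 1]], 'red': [[6, 5], [7, 7], [8, 6]]}
green = np.array([3.5, 5.2])

# Euclidean distance from the green point to every labeled point, smallest first
distances = sorted(
    (np.linalg.norm(np.array(p) - green), color)
    for color in dataset
    for p in dataset[color]
)
print(distances[:3])  # the k=3 nearest neighbors: two red, one black, so red wins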

Besides K-Nearest Neighbors, there are other neighbor-based methods, such as Radius-Based Neighbors, which we will cover in a later post.
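
As a quick preview, scikit-learn already ships such a classifier, RadiusNeighborsClassifier; here is a minimal sketch on the toy data above (the radius of 4.0 is just an illustrative assumption):

from sklearn.neighbors import RadiusNeighborsClassifier

X = [[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]]
y = ['black', 'black', 'black', 'red', 'red', 'red']

# instead of voting among the k nearest points, vote among all points within a fixed radius
clf = RadiusNeighborsClassifier(radius=4.0)
clf.fit(X, y)
print(clf.predict([[3.5, 5.2]]))  # two red and one black point lie within the radius, so 'red'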

The implementation code is as follows:

import numpy as np
from matplotlib import pyplot
from collections import Counter
import warnings

# k-Nearest Neighbors algorithm
def k_nearest_neighbors(data, predict, k=5):

    if len(data) >= k:
        warnings.warn('k is set to a value less than the total number of groups!')

    # compute the distance from the predict point to every labeled point
    distances = []
    for group in data:
        for features in data[group]:
            # euclidean_distance = np.sqrt(np.sum((np.array(features) - np.array(predict))**2))  # also computes the Euclidean distance, but is slower than the line below
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])

    sorted_distances = [i[1] for i in sorted(distances)]
    top_nearest = sorted_distances[:k]

    # print(top_nearest)  # e.g. ['red', 'black', 'red']
    group_res = Counter(top_nearest).most_common(1)[0][0]
    confidence = Counter(top_nearest).most_common(1)[0][1] * 1.0 / k
    # confidence measures how certain this classification is: ('red', 'red', 'red') and
    # ('red', 'red', 'black') both vote red, but the former is more confident
    return group_res, confidence

if __name__ == '__main__':

    dataset = {'black': [[1, 2], [2, 3], [3, 1]], 'red': [[6, 5], [7, 7], [8, 6]]}
    new_features = [3.5, 5.2]  # decide which group this sample belongs to

    for i in dataset:
        for ii in dataset[i]:
            pyplot.scatter(ii[0], ii[1], s=50, color=i)

    which_group, confidence = k_nearest_neighbors(dataset, new_features, k=3)
    print(which_group, confidence)

    pyplot.scatter(new_features[0], new_features[1], s=100, color=which_group)

    pyplot.show()

The result is as follows:


The point is assigned to the red group with a confidence of 2/3 ≈ 0.6667.

Now let's apply the algorithm to a real dataset.

Dataset (Breast Cancer Wisconsin): https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29

Click Download: Data Folder / breast-cancer-wisconsin.data (copy and paste the contents into a txt file, then rename it).
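
If you would rather read the raw .data file directly, note that it has no header row; here is a minimal loading sketch (the column names below are my own shorthand for the attributes listed in the UCI documentation):

import pandas as pd

# the raw UCI file has 11 comma-separated columns and no header row
columns = ['id', 'clump_thickness', 'cell_size', 'cell_shape', 'adhesion',
           'epithelial_size', 'bare_nuclei', 'bland_chromatin',
           'normal_nucleoli', 'mitoses', 'class']
df = pd.read_csv('breast-cancer-wisconsin.data', names=columns)
print(df.head())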

The code is as follows (the code before if __name__ == '__main__': is the same as above):

import numpy as np
from collections import Counter
import warnings
import pandas as pd
import random

# k-Nearest Neighbors algorithm
def k_nearest_neighbors(data, predict, k=5):

    if len(data) >= k:
        warnings.warn('k is set to a value less than the total number of groups!')

    # compute the distance from the predict point to every labeled point
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])

    sorted_distances = [i[1] for i in sorted(distances)]
    top_nearest = sorted_distances[:k]

    group_res = Counter(top_nearest).most_common(1)[0][0]
    confidence = Counter(top_nearest).most_common(1)[0][1] * 1.0 / k

    return group_res, confidence

if __name__ == '__main__':
    # the raw file has no header row; these column names follow the UCI documentation
    columns = ['id', 'clump_thickness', 'cell_size', 'cell_shape', 'adhesion',
               'epithelial_size', 'bare_nuclei', 'bland_chromatin',
               'normal_nucleoli', 'mitoses', 'class']
    df = pd.read_csv('breast-cancer-wisconsin.data', names=columns)  # load the data
    # print(df.head())
    # print(df.shape)
    df.replace('?', np.nan, inplace=True)  # missing values are marked '?' (some tutorials use -99999 instead)
    df.dropna(inplace=True)  # drop invalid rows
    # print(df.shape)
    df.drop(['id'], axis=1, inplace=True)  # drop the id column (the first column)

    # split the data into two parts: training data and test data
    full_data = df.astype(float).values.tolist()

    random.shuffle(full_data)

    test_size = 0.2  # the test data takes 20%
    train_data = full_data[:-int(test_size * len(full_data))]
    test_data = full_data[-int(test_size * len(full_data)):]

    train_set = {2: [], 4: []}
    test_set = {2: [], 4: []}
    for i in train_data:
        train_set[i[-1]].append(i[:-1])
    for i in test_data:
        test_set[i[-1]].append(i[:-1])

    correct = 0
    total = 0

    for group in test_set:
        for data in test_set[group]:
            # try adjusting k to see how the accuracy changes; you can also plot
            # accuracy against k with matplotlib to find the best value (see below)
            res, confidence = k_nearest_neighbors(train_set, data, k=5)
            if group == res:
                correct += 1
            else:
                print(confidence)
            total += 1

    print(correct / total)  # accuracy

    print(k_nearest_neighbors(train_set, [4, 2, 1, 1, 1, 2, 3, 2, 1], k=5))  # predict one record

The result is as follows:


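As the comment in the loop above suggests, you can sweep k and plot the accuracy to find a good value. Here is a minimal sketch, assuming train_set, test_set, and k_nearest_neighbors from the script above are already defined:

from matplotlib import pyplot

# try odd values of k and record the accuracy of each
ks = list(range(1, 22, 2))
accuracies = []
for k in ks:
    correct = total = 0
    for group in test_set:
        for data in test_set[group]:
            res, _ = k_nearest_neighbors(train_set, data, k=k)
            correct += (res == group)
            total += 1
    accuracies.append(correct / total)

pyplot.plot(ks, accuracies, marker='o')
pyplot.xlabel('k')
pyplot.ylabel('accuracy')
pyplot.show()
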
Using the K-Nearest Neighbors Algorithm in scikit-learn

The code is as follows:

import numpy as np
from sklearn import neighbors
from sklearn.model_selection import train_test_split  # cross_validation is deprecated; use model_selection instead
import pandas as pd

# the raw file has no header row; these column names follow the UCI documentation
columns = ['id', 'clump_thickness', 'cell_size', 'cell_shape', 'adhesion',
           'epithelial_size', 'bare_nuclei', 'bland_chromatin',
           'normal_nucleoli', 'mitoses', 'class']
df = pd.read_csv('breast-cancer-wisconsin.data', names=columns)  # load the data
# print(df.head())
# print(df.shape)
df.replace('?', np.nan, inplace=True)  # missing values are marked '?' (some tutorials use -99999 instead)
df.dropna(inplace=True)
# print(df.shape)
df.drop(['id'], axis=1, inplace=True)

X = np.array(df.drop(['class'], axis=1))
Y = np.array(df['class'])

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, Y_train)

accuracy = clf.score(X_test, Y_test)
print(accuracy)

sample = np.array([4, 2, 1, 1, 1, 2, 3, 2, 1])
print(sample.reshape(1, -1))
print(clf.predict(sample.reshape(1, -1)))

The result is as follows (with the old cross_validation import there is a deprecation warning, but it does not affect the result):


The scikit-learn algorithm works on exactly the same principle as our implementation above; it is simply more efficient and supports a fuller set of parameters.
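
For instance, you can weight neighbors by distance instead of uniformly, switch the distance metric, or search over k with GridSearchCV. A minimal sketch, assuming the X and Y arrays from the previous script:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# search over k and the voting scheme; 'distance' gives closer neighbors a larger vote
params = {'n_neighbors': list(range(1, 22, 2)), 'weights': ['uniform', 'distance']}
search = GridSearchCV(KNeighborsClassifier(), params, cv=5)
search.fit(X, Y)
print(search.best_params_, search.best_score_)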

(The above content is based on material from 大熊猫.)


Reposted from blog.csdn.net/ai_mackey/article/details/80182409