k-Nearest Neighbors Algorithm: Classification Practice

A simple hands-on walkthrough of the k-nearest neighbors (k-NN) algorithm in Python

Preface

Recently I started learning some data mining algorithms, but I don't like typing in the book's code verbatim, so as I progress I plan to find my own data sources to model and analyze. The data source here is Kaggle. This post is the first entry in that record.

Introduction to the Algorithm

The k-nearest neighbors algorithm is one of the simplest classification algorithms. The idea: in feature space, if most of the k samples nearest to a query sample belong to a certain category, then the query sample also belongs to that category. The algorithm is mainly used for classification, and it applies to both binary and multi-class problems.
This article mainly records the hands-on steps. For the underlying theory, you can look it up yourself or refer to this blog post: KNN principle summary
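To make the idea concrete, here is a minimal from-scratch sketch of the nearest-neighbor vote on a made-up toy dataset (NumPy only; the toy points and the knn_predict helper are illustrative, not part of this article's model):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the query point to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Majority vote among the k nearest labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_toy = np.array([0, 0, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([0.95, 0.9])))  # 2 of the 3 nearest points are class 1 -> prints 1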

Data Source

Glass classification (Kaggle): https://www.kaggle.com/uciml/glass

The dataset describes glass samples by their composition and refractive index (RI, Na, Mg, Al, Si, K, Ca, Ba, Fe), and each sample carries a glass type (Type) numbered 1 to 7.
We will build a k-NN multi-class model on this data.

Data Mining

1. Import third-party libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split  # split into training and test sets
from sklearn.neighbors import KNeighborsClassifier    # the k-NN classifier
from sklearn.metrics import accuracy_score            # classification scoring metric

Import the modules needed for modeling, one by one. The first four are standard third-party tools for this kind of data mining; the one worth a closer look is accuracy_score:

sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)
normalize: defaults to True, which returns the proportion of correctly classified samples; if False, it returns the number of correctly classified samples

Classification accuracy is the percentage of predictions that are correct. It is easy to understand, but it tells you nothing about the distribution of the response values or about which kinds of errors the classifier makes. For the simple multi-class model built here, though, it is enough.
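A quick illustration of the normalize flag, on made-up labels:

from sklearn.metrics import accuracy_score

y_true = [1, 2, 3, 3]
y_pred = [1, 2, 3, 1]
print(accuracy_score(y_true, y_pred))                   # 0.75, the proportion correct
print(accuracy_score(y_true, y_pred, normalize=False))  # 3, the count correct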

2. Read the file

import winreg
# Look up the real desktop path in the Windows registry
real_address = winreg.OpenKey(winreg.HKEY_CURRENT_USER,
                              r'Software\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders')
file_address = winreg.QueryValueEx(real_address, "Desktop")[0]
file_origin = file_address + "\\源数据-分析\\glass.csv"  # absolute desktop path of the source data file
glass = pd.read_csv(file_origin)  # read the csv file

Every time I download data I would otherwise have to move the file into the Python working directory or read it out of the download folder, which is tedious. So I resolve the absolute desktop path via the winreg library; that way I only need to save the data to the desktop (or paste it into a specific desktop folder) to read it, and it won't get mixed up with other data.
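For readers not on Windows (or who would rather avoid the registry), a portable sketch using pathlib should work under the same desktop folder layout, with the caveat that Path.home() / "Desktop" can differ from the registry value on redirected desktops:

from pathlib import Path
import pandas as pd

file_origin = Path.home() / "Desktop" / "源数据-分析" / "glass.csv"  # same folder layout as above
glass = pd.read_csv(file_origin)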

3. Modeling


features = list(glass.columns)[:-1]  # every column except the last ("Type") is a feature
X_train, X_test, y_train, y_test = train_test_split(glass[features], glass["Type"], random_state=1)
# A fixed random seed keeps the same train/test split across later runs

Split the columns into feature values and the target class (Type), then divide the data into a training set and a test set.
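As a sanity check on the split sizes (and, optionally, to keep the class proportions similar in both halves), a stratified variant might look like this; note that the results below use the plain split above:

X_train, X_test, y_train, y_test = train_test_split(
    glass[features], glass["Type"], random_state=1, stratify=glass["Type"])  # optional stratified split
print(X_train.shape, X_test.shape)  # with the 214-row glass data and default test_size: (160, 9) (54, 9)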

knn=KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)

Instantiate the k-NN classifier with the number of neighbors set to 1, then fit it on the training data to obtain the model.
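One caveat with n_neighbors=1: every training point is its own nearest neighbor, so the training accuracy is essentially always perfect (barring duplicate rows with conflicting labels) and says nothing about generalization:

print(knn.score(X_train, y_train))  # accuracy on the training set itself; ~1.0 for k=1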

4. Scoring

prediction = knn.predict(X_test)    # predict labels for the test set
accuracy_score(y_test, prediction)  # score the result

Use knn.predict on the test features and compare the output with the held-out test labels to measure the model's accuracy.
The result is about 0.72, meaning the predictions match 72% of the test labels; put another way, the model's accuracy score is 72 points.
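Since accuracy alone cannot reveal which classes get confused with which (as noted earlier), a confusion matrix is a natural follow-up:

from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, prediction))  # rows: true glass types, columns: predicted types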

5. Simple parameter tuning

The neighbor parameter was set to 1 above; now try different values in turn to find the best one.

result = {}  # record each parameter value and its score in a dictionary
for i in range(1, 21):  # try k from 1 to 20
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    prediction = knn.predict(X_test)
    result[i] = accuracy_score(y_test, prediction) * 100
for i in result:
    if result[i] == max(result.values()):
        print("Best number of neighbors: " + str(i))
print("Model score: " + str(max(result.values())))

The results show that, as the parameter runs from 1 to 20, the best neighbor values are 1, 4, and 5, and the best accuracy score of the model is 72 points.
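A single train/test split is noisy, so the "best" k found this way may not hold up on other splits. A sketch of the same sweep with 5-fold cross-validation, averaging each k's score over five splits:

from sklearn.model_selection import cross_val_score

cv_result = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, glass[features], glass["Type"], cv=5)  # five train/validation splits
    cv_result[k] = scores.mean() * 100
best_k = max(cv_result, key=cv_result.get)
print("Best k:", best_k, "CV score:", cv_result[best_k])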

6. Summary
The point of the whole exercise is less the model itself than walking through the modeling process. k-NN is not a complicated algorithm, and this walkthrough did not involve data standardization or normalization, but a few questions are still worth discussing:

1. Can this algorithm be used to classify text data, and if so, how? (I tried some text data earlier, but the modeling kept failing with errors, whether I converted the data type to float or mapped the text to ASCII codes.)

2. Does a smaller neighbor parameter always give higher accuracy for k-NN, and how well does such a model generalize to other data?
3. Is there a more systematic parameter-tuning method that could improve model accuracy? (See the sketch after this list.)
4. Can the k-NN algorithm be used for regression problems? (Also touched on below.)
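On questions 3 and 4: scikit-learn's GridSearchCV can search k (and other parameters such as distance weighting) with cross-validation, and since k-NN is distance-based it usually benefits from the feature scaling this walkthrough skipped. A sketch under those assumptions, with KNeighborsRegressor as the drop-in counterpart for regression targets; for text data (question 1) the usual route is to vectorize first, e.g. with TfidfVectorizer:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([("scale", StandardScaler()),       # k-NN is distance-based, so scaling matters
                 ("knn", KNeighborsClassifier())])
param_grid = {"knn__n_neighbors": range(1, 21),
              "knn__weights": ["uniform", "distance"]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(glass[features], glass["Type"])
print(search.best_params_, search.best_score_)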

There are many places here that could be done better; suggestions from readers are welcome.
Below is my main contact. I hope to meet some friends to learn and exchange ideas about data analysis and data mining.
Github : https://github.com/yb705

Origin: blog.csdn.net/weixin_43580339/article/details/111628241