Implementing the K-nearest neighbor algorithm with sklearn: iris classification and Facebook check-in location prediction

1 K-nearest neighbor algorithm (KNN) concept

KNN algorithm flow

sklearn's KNN algorithm implementation

API: sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, algorithm='auto')

Parameter Description

  • n_neighbors:
    • int, optional (default = 5), the number of neighbors used for k-neighbors queries
  • algorithm: string, optional, one of 'auto', 'ball_tree', 'kd_tree', 'brute'. Among them:
    • The default is 'auto', which means the estimator decides on a suitable search algorithm by itself. Users can also explicitly choose the ball_tree, kd_tree, or brute method.
    • brute is brute-force search, i.e. a linear scan, which becomes very time-consuming when the training set is large.
    • kd_tree builds a kd tree, a binary tree data structure that stores the data for fast retrieval. The tree is constructed by splitting at medians, each node corresponds to a hyperrectangle, and it is efficient when the dimensionality is below about 20.
    • ball_tree was invented to overcome the failure of the kd tree in high dimensions. It partitions the sample space using a centroid C and a radius r, and each node is a hypersphere.
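As an aside, the tree structures described above can also be queried directly. Below is a minimal sketch using sklearn.neighbors.KDTree on made-up 2-D points, purely for illustration:

import numpy as np
from sklearn.neighbors import KDTree

# Made-up 2-D points
X = np.array([[0, 0], [1, 1], [2, 2], [8, 8], [9, 9]])

# Build a kd tree and query the 3 nearest neighbors of the first point
tree = KDTree(X)
dist, ind = tree.query(X[:1], k=3)
print(ind)   # indices of the 3 nearest neighbors
print(dist)  # corresponding distances

In everyday use, KNeighborsClassifier chooses and manages the search structure internally, as in the basic example below.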
from sklearn.neighbors import KNeighborsClassifier
import pickle


# Training set: x holds the feature values, y the labels
x = [[0], [1], [2], [10], [15], [11]]
y = [0, 0, 0, 1, 1, 1]

# Instantiate a KNN estimator with k=3
estimator = KNeighborsClassifier(n_neighbors=3, algorithm='auto')
# Train
estimator.fit(x, y)

# Predict
estimator.predict([[22]])

# Save the model
with open('./KNN.pkl', 'wb') as f:
    pickle.dump(estimator, f)

Model loading

import pickle


# Load the model
with open('./KNN.pkl', 'rb') as ff:
    model = pickle.load(ff)

# Predict with the loaded model
model.predict([[15]])
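As an alternative to pickle, a trained estimator can also be persisted with joblib, which handles the large NumPy arrays inside scikit-learn models more efficiently. A minimal sketch, with KNN.joblib as an arbitrary file name:

from joblib import dump, load

# Save the trained estimator
dump(estimator, './KNN.joblib')

# Load it back and predict
model = load('./KNN.joblib')
model.predict([[15]])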

2 Iris classification based on KNN

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier


# Load the dataset
iris = load_iris()

# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)

# Feature engineering: standardization
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)

# Machine learning
estimator = KNeighborsClassifier(n_neighbors=11, algorithm='auto')
estimator.fit(x_train, y_train)

# Predict
y_predict = estimator.predict(x_test)
print("Predictions on the test set:\n", y_predict)
print()
print(y_predict == y_test)

# Compute the model score
score = estimator.score(x_test, y_test)
print("Model score:", score)
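Note that any new sample has to be transformed with the same scaler that was fitted on the training data before calling predict. A minimal sketch with a made-up flower measurement (values in cm, chosen only for illustration):

import numpy as np

# A made-up flower: sepal length, sepal width, petal length, petal width (cm)
new_sample = np.array([[5.1, 3.5, 1.4, 0.2]])

# Apply the scaler fitted on the training set, then predict
new_sample_scaled = transfer.transform(new_sample)
pred = estimator.predict(new_sample_scaled)
print("Predicted class:", iris.target_names[pred])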

3 Facebook check-in location prediction

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
import numpy as np


# 1. Load the dataset
facebook = pd.read_csv("./train.csv")

# 2. Basic data processing
# 2.1 Narrow down the data range (copy to avoid a SettingWithCopyWarning when adding columns)
facebook_data = facebook.query("x>2.0 & x<2.5 & y>2.0 & y<2.5").copy()
# 2.2 Extract time features
time = pd.to_datetime(facebook_data["time"], unit="s")
time = pd.DatetimeIndex(time)
facebook_data["day"] = time.day
facebook_data["hour"] = time.hour
facebook_data["weekday"] = time.weekday
# 2.3 Drop places with few check-ins
place_count = facebook_data.groupby("place_id").count()
place_count = place_count[place_count["row_id"] > 3]
facebook_data = facebook_data[facebook_data["place_id"].isin(place_count.index)]
# 2.4 Select the feature values and the target value
x = facebook_data[["x", "y", "accuracy", "day", "hour", "weekday"]]
y = facebook_data["place_id"]
# 2.5 Split the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)

# 3. Feature engineering -- preprocessing (standardization)
# 3.1 Instantiate a transformer
transfer = StandardScaler()
# 3.2 Call fit_transform on the training set, transform on the test set
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)

# 4. Machine learning -- KNN + cross-validation
# 4.1 Instantiate an estimator
estimator = KNeighborsClassifier()
# 4.2 Wrap it in GridSearchCV
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}
estimator = GridSearchCV(estimator, param_grid=param_grid, cv=5)
# 4.3 Train the model
estimator.fit(x_train, y_train)

# 5. Model evaluation
# 5.1 Basic evaluation
score = estimator.score(x_test, y_test)
print("Final prediction accuracy:\n", score)

y_predict = estimator.predict(x_test)
print("Final predictions:\n", y_predict)
print("Comparison of predictions and true values:\n", y_predict == y_test)

# 5.2 Evaluation using the cross-validation results
print("Best score found during cross-validation:\n", estimator.best_score_)
print("Best parameter model:\n", estimator.best_estimator_)
print("Detailed validation results for each cross-validation split:\n", estimator.cv_results_)
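The best model found by the grid search can be persisted the same way as in section 1. A minimal sketch, with an arbitrary file name:

import pickle

# Persist the best estimator found by GridSearchCV (it is refit on the full training set by default)
with open('./facebook_knn.pkl', 'wb') as f:
    pickle.dump(estimator.best_estimator_, f)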

The training dataset is fairly large, so it is shared via a network disk link: https://pan.baidu.com/s/1IqeYKknJnxeTa9wieSTh7w (extraction code: 1111)

KNN algorithm summary

  • Advantages:

    • Simple and effective
    • Low retraining cost
    • Works well when class domains cross or overlap
      • KNN assigns a category mainly from the limited nearby samples rather than from a discriminant over the whole class domain, so for sample sets whose class domains cross or overlap heavily, KNN tends to do better than other methods.
    • Suitable for automatic classification of large sample sets
      • The algorithm is better suited to class domains with relatively large sample sizes; class domains with small sample sizes are more prone to misclassification.

  • Disadvantages:

    • Lazy learning
      • KNN is a lazy learning method (it does essentially no learning at training time), so prediction is much slower than with many eager learning algorithms.
    • Class scores are not normalized
      • Unlike classifiers that output probability scores.
    • The output is not very interpretable
      • The output of a decision tree, for example, is more interpretable.
    • Poor handling of imbalanced samples
      • When the classes are imbalanced, e.g. one class has a very large sample size while the others are very small, the K neighbors of a new sample may be dominated by the large-capacity class. The algorithm only considers the "nearest" neighbor samples: samples of a large class may or may not be close to the target sample, and quantity alone should not decide the result. This can be improved with a weighting scheme in which neighbors closer to the sample get larger weights (see the sketch after this list).
    • Large computational cost
      • A common remedy is to edit the known sample points in advance, removing samples that contribute little to classification.
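A minimal sketch of the weighting idea, reusing the toy data from section 1 (weights='distance' is scikit-learn's built-in distance weighting, so closer neighbors get a larger vote):

from sklearn.neighbors import KNeighborsClassifier

# Toy data from section 1
x = [[0], [1], [2], [10], [15], [11]]
y = [0, 0, 0, 1, 1, 1]

# weights='distance' gives neighbors closer to the query point a larger weight
estimator = KNeighborsClassifier(n_neighbors=3, weights='distance')
estimator.fit(x, y)
print(estimator.predict([[5]]))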

 
