Iris k Value Tuning

Introduction to Iris Dataset

The Iris dataset contains samples of three iris species: Setosa, Versicolor, and Virginica. Each sample has four features: sepal length, sepal width, petal length, and petal width. Our goal is to predict the species of an iris from these features.
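
As a quick orientation, here is a short snippet showing the dataset as scikit-learn ships it (150 samples, 50 per species):

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)  # ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
print(iris.data.shape)     # (150, 4): 150 samples, 4 features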

Introduction to K-Nearest Neighbor Algorithm

The K-nearest neighbor (KNN) algorithm is a simple and effective classification algorithm. Its basic idea: for an unknown sample, compute the distance between it and every sample in the training set, select the k nearest samples, and predict the unknown sample's category from the categories of those k samples (typically by majority vote).
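
To make the idea concrete, here is a minimal from-scratch sketch (not the scikit-learn implementation used later) that classifies one sample using Euclidean distance and a majority vote; the function name knn_predict is just for illustration:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    # Euclidean distance from x to every training sample
    distances = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k nearest training samples
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]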

Hyperparameter Search

In the K-nearest neighbor algorithm, the k value is an important hyperparameter: different k values can lead to significant changes in model performance. Therefore, we need to find the optimal k value through a hyperparameter search. Here we will use grid search.
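
Before turning to grid search proper, here is what such a search amounts to if done by hand: loop over candidate k values and score each with cross-validation (a sketch using scikit-learn's cross_val_score; the candidate list is arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate k with 5-fold cross-validation
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5)
    print(k, scores.mean())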

Grid Search

Grid search determines the optimal hyperparameters by traversing all possible combinations of the given candidate values. To perform a grid search for k, we first define the range of k values to search, then evaluate the model performance for each k value in that range, and finally select the k value with the best performance.

1. The Principle of Grid Search

The core idea of grid search is very simple: exhaustively try every given combination of hyperparameters, evaluate the performance of each combination through cross-validation, and finally select the best-performing combination. The process is similar to searching for the optimum point on a grid of parameter values, hence the name "grid search".

Suppose we have two hyperparameters to tune: parameter A and parameter B, with candidate values A1, A2, A3 and B1, B2, B3, respectively. Grid search tries the following parameter combinations in order:

  • (A1, B1), (A1, B2), (A1, B3)
  • (A2, B1), (A2, B2), (A2, B3)
  • (A3, B1), (A3, B2), (A3, B3)

For each parameter combination, we use cross-validation to evaluate the model's performance on the training set. Finally, the best-performing combination of hyperparameters is chosen for the final model.
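
The enumeration above is simply a Cartesian product of the candidate lists; a toy sketch with the placeholder values A1..B3:

from itertools import product

param_a = ['A1', 'A2', 'A3']
param_b = ['B1', 'B2', 'B3']

# Grid search enumerates the Cartesian product of the candidate values
for a, b in product(param_a, param_b):
    print(a, b)  # 9 combinations in total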

2. Advantages of Grid Search

Grid search has the following advantages as a parameter tuning method:

  • Comprehensiveness: grid search tries every combination of the given candidate values, so the best combination within the grid is never missed.
  • Simplicity: grid search is intuitive and easy to understand and implement.
  • Reproducibility: given the same hyperparameter range and step size, the results of grid search are reproducible.

However, grid search also has disadvantages, mainly its computational cost. As the number of hyperparameters increases, the search space grows exponentially: for example, five hyperparameters with ten candidate values each already yield 10^5 combinations. Therefore, grid search may not be the best choice for high-dimensional parameter spaces and large datasets.

The Code

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the range of k values to search
param_grid = {'n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15]}

# Create the K-nearest neighbor classifier
knn = KNeighborsClassifier()

# Initialize the grid search object with 5-fold cross-validation
grid_search = GridSearchCV(knn, param_grid, cv=5)

# Run the grid search on the training set
grid_search.fit(X_train, y_train)

# Option 1: compare predictions with the true labels directly
y_pred = grid_search.predict(X_test)
print(y_pred == y_test)
print("Accuracy:", sum(y_pred == y_test) / len(y_test))

# Option 2: compute the accuracy on the test set with score()
score = grid_search.score(X_test, y_test)
print("Accuracy:", score)

# Print the best k value and the corresponding mean cross-validation accuracy
print("Best k value:", grid_search.best_params_['n_neighbors'])
print("Best cross-validation accuracy:", grid_search.best_score_)
print("Best model:", grid_search.best_estimator_)

In this sample code, we used the GridSearchCV class from the sklearn library to perform the grid search. We specified the range of k values to search as [1, 3, 5, 7, 9, 11, 13, 15]; the grid search tries each of these k values and returns the optimal k value together with its mean cross-validation accuracy.
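
Beyond the best parameters, GridSearchCV also records the score of every candidate it tried; continuing from the code above, the per-k mean cross-validation accuracy can be inspected through cv_results_:

# Mean cross-validation accuracy for each candidate k
for params, mean_score in zip(grid_search.cv_results_['params'],
                              grid_search.cv_results_['mean_test_score']):
    print(params['n_neighbors'], round(mean_score, 4))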
