[Machine Learning] Classification Algorithms: Model Selection and Tuning with GridSearchCV (Grid Search)

"Author's Homepage": Shibie Sanri wyx
"Author's Profile": CSDN top100, Alibaba Cloud Blog Expert, Huawei Cloud Sharing Expert, and High-quality Creator in the Network Security Field
"Recommended Column": Zero-Basic Quick Start to Artificial Intelligence proficient"

The K in the K-nearest neighbor algorithm is the number of neighbors. Different K values give the algorithm different accuracy, so we need to adjust K to improve it. This tuning process relies on cross-validation.

1. Cross Validation

Cross-validation is a method commonly used in machine learning to build models and validate model parameters. It evaluates a model's performance metrics and is used for model selection.

The "basic idea" of cross-validation is to group the original data, one part as the training set, and the other part as the verification set, first use the training set to train the algorithm model, and then use the verification set to test the trained algorithm model.

For example, divide the data into four parts (folds). First use the first fold as the validation set and train on the remaining three; then use the second fold as the validation set and train on the other three; and so on, until every fold has served as the validation set once. The per-fold scores are then averaged to give the overall estimate.

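To make this concrete, here is a minimal sketch (not part of the original post) that runs four-fold cross-validation on the iris dataset with sklearn's cross_val_score; each of the four folds takes one turn as the validation set:

from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
knn = KNeighborsClassifier()  # default n_neighbors=5
# cv=4 -> four folds; returns one validation accuracy per fold
scores = cross_val_score(knn, iris.data, iris.target, cv=4)
print(scores)         # four per-fold accuracies
print(scores.mean())  # averaged accuracy estimate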

Cross-validation is often used in conjunction with grid search.

2. Grid Search

Grid search is also called hyperparameter search. Parameters that must be specified manually, such as the K value of the K-nearest neighbor algorithm, are called hyperparameters. Grid search takes several preset combinations of hyperparameter values, evaluates each combination with cross-validation, and selects the optimal combination to build the model.

sklearn's GridSearchCV implements grid search well: given an estimator and the candidate parameter values, it tries each combination automatically and reports the best score and parameters.


3. Model selection and tuning API

sklearn.model_selection.GridSearchCV(estimator, param_grid, cv)

  • estimator : the estimator (classifier) to tune
  • param_grid : the parameters to search, as a dictionary mapping parameter names to lists of candidate values, e.g. { "n_neighbors": [1, 3, 5] }
  • cv : the number of cross-validation folds

Attributes of the fitted GridSearchCV object:

  • best_params_ : (dict) the best parameter combination
  • best_score_ : (float) the best mean cross-validated score
  • best_estimator_ : (estimator) the estimator refit with the best parameters
  • cv_results_ : (dict) the full cross-validation results
  • best_index_ : (int) the index (into cv_results_) of the best parameter combination
  • n_splits_ : (int) the number of cross-validation splits (folds)
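A minimal usage sketch tying the parameters and attributes together (x_train and y_train are assumed to be an existing training split; the full workflow follows in section 4):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

grid = GridSearchCV(
    estimator=KNeighborsClassifier(),       # the classifier to tune
    param_grid={"n_neighbors": [1, 3, 5]},  # candidate values per hyperparameter
    cv=3,                                   # 3-fold cross-validation
)
grid.fit(x_train, y_train)
print(grid.best_params_, grid.best_score_)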

4. Case demonstration

Next, we use GridSearchCV to select the best K value for the K-nearest neighbor algorithm.

4.1. Dataset Acquisition and Splitting

Using the iris "data set" that comes with sklearn , the data set is divided into 60% training and 40% testing.

from sklearn import datasets
from sklearn import model_selection

# 1. Load the dataset
iris = datasets.load_iris()
# 2. Split the dataset
# x_train: training features, x_test: test features, y_train: training target, y_test: test target
x_train, x_test, y_train, y_test = model_selection.train_test_split(iris.data, iris.target, random_state=6)
print('Training set features:', len(x_train))
print('Test set features:', len(x_test))
print('Training set target:', len(y_train))
print('Test set target:', len(y_test))

output:

Training set features: 112
Test set features: 38
Training set target: 112
Test set target: 38

As the output shows, the training/test ratio matches the expected default split.
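If a different ratio is needed, train_test_split accepts a test_size argument; for instance, this sketch (an assumption on top of the original example) would produce a 60/40 split:

x_train, x_test, y_train, y_test = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=6)  # 40% of samples go to the test set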


4.2. Feature Standardization

Next, "normalize" the training set features and test set features

from sklearn import datasets
from sklearn import model_selection
from sklearn import preprocessing

# 1. Load the dataset
iris = datasets.load_iris()
# 2. Split the dataset
# x_train: training features, x_test: test features, y_train: training target, y_test: test target
x_train, x_test, y_train, y_test = model_selection.train_test_split(iris.data, iris.target, random_state=6)
# 3. Standardize the features
ss = preprocessing.StandardScaler()
x_train = ss.fit_transform(x_train)  # fit the scaler on the training set, then transform it
x_test = ss.transform(x_test)        # reuse the training-set statistics; do not refit on test data
print(x_train)

output:

[[-0.18295405 -0.192639    0.25280554 -0.00578113]
 [-1.02176094  0.51091214 -1.32647368 -1.30075363]
 [-0.90193138  0.97994624 -1.32647368 -1.17125638]

As you can see from the output, the features have been standardized.
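As a quick sanity check (a small sketch, not in the original post), each standardized training feature should now have a mean of roughly 0 and a standard deviation of roughly 1:

import numpy as np

print(np.round(x_train.mean(axis=0), 6))  # per-feature means, ≈ 0
print(np.round(x_train.std(axis=0), 6))   # per-feature standard deviations, ≈ 1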


4.3. KNN Algorithm Processing

Fit the KNN classifier on the standardized training set, then check its accuracy on the test set.

from sklearn import datasets
from sklearn import model_selection
from sklearn import preprocessing
from sklearn import neighbors

# 1. Load the dataset
iris = datasets.load_iris()
# 2. Split the dataset
# x_train: training features, x_test: test features, y_train: training target, y_test: test target
x_train, x_test, y_train, y_test = model_selection.train_test_split(iris.data, iris.target, random_state=6)
# 3. Standardize the features
ss = preprocessing.StandardScaler()
x_train = ss.fit_transform(x_train)
x_test = ss.transform(x_test)  # reuse the training-set statistics
# 4. KNN classification
knn = neighbors.KNeighborsClassifier(n_neighbors=2)
knn.fit(x_train, y_train)
print(knn.score(x_test, y_test))

output:

0.8947368421052632

As the output shows, the accuracy is about 89%, which is mediocre.
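For a classifier, score is simply the fraction of correct predictions on the test set; the same number can be recomputed by hand (a small sketch, assuming the variables from the script above):

import numpy as np

y_pred = knn.predict(x_test)      # predicted labels for the test features
print(np.mean(y_pred == y_test))  # fraction of correct predictions, equal to knn.score(x_test, y_test)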


4.4. Parameter Tuning

Wrap the candidate K values in a dictionary and pass it to GridSearchCV to search for the optimal parameter.

from sklearn import datasets
from sklearn import model_selection
from sklearn import preprocessing
from sklearn import neighbors

# 1. Load the dataset
iris = datasets.load_iris()
# 2. Split the dataset
# x_train: training features, x_test: test features, y_train: training target, y_test: test target
x_train, x_test, y_train, y_test = model_selection.train_test_split(iris.data, iris.target, random_state=6)
# 3. Standardize the features
ss = preprocessing.StandardScaler()
x_train = ss.fit_transform(x_train)
x_test = ss.transform(x_test)  # reuse the training-set statistics
# 4. KNN classifier
knn = neighbors.KNeighborsClassifier(n_neighbors=2)
# 5. Parameter tuning: each candidate K value is evaluated with 10-fold cross-validation
params = {"n_neighbors": [1, 3, 5, 7]}
knn = model_selection.GridSearchCV(knn, param_grid=params, cv=10)
knn.fit(x_train, y_train)
print('Best parameters:', knn.best_params_)
print('Best score:', knn.best_score_)
print('Best estimator:', knn.best_estimator_)

output:

Best parameters: {'n_neighbors': 5}
Best score: 0.9727272727272729
Best estimator: KNeighborsClassifier()

As the output shows, the optimal K value is 5, with a mean cross-validated accuracy of about 97%.
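Note that best_score_ is the mean cross-validated accuracy measured on the training data. Because GridSearchCV refits the best estimator on the full training set by default (refit=True), the tuned model can also be scored on the held-out test set (a small sketch; the exact value depends on the split):

print('Test set accuracy:', knn.score(x_test, y_test))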

Origin: blog.csdn.net/wangyuxiang946/article/details/131751353