Classification of the iris dataset using MLP

This article describes an experiment carried out and written up based on my own understanding; if there are any problems, please point them out.

Data preparation

First, the iris dataset that ships with sklearn is used; it is imported with datasets.load_iris(). The dataset contains 150 samples divided into 3 classes of 50 samples each, and every sample has 4 attributes. Next, StandardScaler is used to standardize the mean and variance. Viewing its source in PyCharm, the docstring reads "Standardize features by removing the mean and scaling to unit variance"; the transformation it applies is z = (x - u) / s, where u is the mean of the training samples (or zero if with_mean=False) and s is the standard deviation of the training samples (or one if with_std=False). Standardization makes the sample features more concentrated and regular, and puts features of different scales on a numerically comparable footing, which helps the classifier's accuracy. fit learns the scaling rule from the data, and that rule is then applied to the iris dataset imported above.
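
A minimal sketch of this preparation step, assuming the usual sklearn imports (variable names such as X_scaled are my own, not necessarily those of the original code):

from sklearn import datasets
from sklearn.preprocessing import StandardScaler

# Load the iris dataset: 150 samples, 3 classes, 4 features per sample
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Standardize each feature: z = (x - u) / s, with u and s learned by fit
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)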


MLP classifier creation

Next, an MLP classifier, i.e. a multi-layer perceptron classifier, is created with the MLPClassifier function. The network in the code has three hidden layers: the first with 30 neurons, the second with 20, and the third with 10. The maximum number of iterations is 800, the weight optimizer (solver) is adam, and the activation function is tanh.

from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(max_iter=800,                      # maximum number of iterations
                    solver='adam',                     # weight optimizer
                    activation='tanh',                 # activation function
                    hidden_layer_sizes=(30, 20, 10),   # 3 hidden layers: 30, 20 and 10 neurons
                    verbose=False)

Drawing the accuracy curve

Then the learning-curve function learning_curve is used. Its source describes it as "Determines cross-validated training and test scores for different training set sizes", i.e. it computes cross-validated training-set and validation-set scores for training sets of different sizes. The estimator argument is the model used for training and prediction; the MLP created above is passed in. X is the dataset and y its labels; the data and labels of the previously imported iris dataset are passed in. train_sizes gives the relative or absolute numbers of training samples used to generate the learning curve; its default is np.linspace(0.1, 1.0, 5), which is also the value passed here, meaning 5 training-set sizes taken linearly between 10% and 100% of the maximum available size. cv determines the cross-validation splitting strategy.

Cross-validation divides the original dataset into K equal parts ("folds"), uses one fold as the test set and the remaining folds as the training set, trains the model, and measures its accuracy on the held-out fold; this is repeated K times, each time with a different fold held out, and the average of the K accuracies is taken as the final estimate of the model's accuracy. With cv=5, the 150 samples are split into 5 folds of 30 samples each. Since 30 samples are always held out for validation, at most 150 - 30 = 120 samples are available for training, so the maximum value of train_sizes is 120.
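
A sketch of the learning_curve call described above, reusing the mlp classifier and the standardized data from the earlier steps (the exact variable names are assumptions):

import numpy as np
from sklearn.model_selection import learning_curve

# Cross-validated training and validation scores for 5 training-set sizes
# spaced linearly between 10% and 100% of the available data, with cv=5
train_sizes, train_scores, test_scores = learning_curve(
    estimator=mlp,
    X=X_scaled,
    y=y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5)
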
The cross-validated accuracy curves for the different training-set sizes are then drawn; the resulting graph is shown below.
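
One way to draw the curves, averaging the scores over the 5 folds at each training-set size (the matplotlib code is a sketch, not the original plotting code):

import matplotlib.pyplot as plt

train_mean = train_scores.mean(axis=1)   # mean training accuracy per size
test_mean = test_scores.mean(axis=1)     # mean cross-validation accuracy per size

plt.plot(train_sizes, train_mean, 'o-', label='training accuracy')
plt.plot(train_sizes, test_mean, 'o-', label='cross-validation accuracy')
plt.xlabel('number of training samples')
plt.ylabel('accuracy')
plt.legend()
plt.show()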

[Figure: learning curve — training-set and cross-validation accuracy versus the number of training samples]


Result analysis

As the number of training samples increases, the validation-set accuracy curve trends upward. The more samples the training set contains, the more closely they represent the full data distribution, i.e. more situations are covered. This makes the model generalize better and reduces the influence of individual noise points on training, so the accuracy, and hence the accuracy curve, rises.

From the running results, the training-set accuracy curve changes little, i.e. the number of samples has only a small effect on it. The role of the training set is to fit the parameters: features are extracted from its data and the model is iteratively optimized on it, so that its performance keeps improving and it can also predict data it has not seen. Because the training set is the data source the model is fitted and tuned on, accuracy on the training set is high and does not vary much with the number of samples.

On the cross-validation method: the dataset used here is not large, and with cross-validation all of the data is eventually used for training, with no samples left aside exclusively for testing. K-fold cross-validation is used to tune the model, i.e. to find the hyperparameter values with the best generalization performance, and to evaluate the model's predictive performance, in particular how the trained model behaves on new data; this reduces overfitting to a certain extent.
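
As an illustration of the k-fold idea (not part of the original post's code), the mean 5-fold cross-validated accuracy of the same classifier could also be computed directly:

from sklearn.model_selection import cross_val_score

# Each of the 5 folds (30 samples) is held out once as the validation set;
# the mean accuracy over the folds is the final estimate
scores = cross_val_score(mlp, X_scaled, y, cv=5)
print(scores.mean())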

Origin blog.csdn.net/qq_48068259/article/details/127881392