09 - Cross Validation & Recall & Regularization

Introduction

When we build a machine learning model, we generally split the data into two parts, for example 80% as a training set and the remaining 20% as a final test set: the 80% is used to build the model and the 20% is used to verify it. Note, however, that the 20% final test set is only used at the very end. Before that, the 80% of training data is itself randomly cut into three parts for cross validation: first, build the model on parts one and two and validate on part three; then build on parts one and three and validate on part two; finally, build on parts two and three and validate on part one. The held-out part at each step is where the validation set comes from. A minimal code sketch of this scheme follows below.
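The sketch below follows the 80/20 split plus 3-fold rotation described above; the breast cancer data set and the logistic regression classifier are placeholders I chose for illustration, not something from the original post.

```python
# Minimal sketch: 80/20 split, then 3-fold cross validation on the 80% part.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# 80% training data, 20% held-out final test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# cv=3: the training data is cut into three parts; each round trains on two
# parts and validates on the remaining one, exactly the rotation described above
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X_train, y_train, cv=3)
print(scores, scores.mean())

# the 20% test set is touched only once, at the very end
print(model.fit(X_train, y_train).score(X_test, y_test))
```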

sklearn.cross_validation

In code, we can use sklearn's train_test_split (historically in sklearn.cross_validation, now in sklearn.model_selection) to split the data set.
train_test_split() is a function provided by sklearn's model_selection module for randomly splitting data into a training set and a test set.
Prototype: train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
Parameters:
test_size: size of the test set. If a float, it must be between 0.0 and 1.0 and represents the proportion of the data put into the test set; if an int, it is the absolute number of test samples; if None, it is set to the complement of the train size. By default the value is 0.25.
train_size: size of the training set. If a float between 0.0 and 1.0, it represents the proportion of the data put into the training set; if an int, it is the absolute number of training samples; if None, it is set to the complement of the test size.
random_state: controls the randomness. An int, a RandomState instance, or None. If an int, it is used as the seed of the random number generator; if a RandomState instance, it is used as the random generator; if None, the default random generator is used with a randomly chosen seed.
shuffle: boolean, whether to shuffle the data before splitting. If shuffle=False, then stratify must be None.
stratify: array-like or None. If not None, the data is split in a stratified fashion, using this array as the class labels.
Return value: the resulting train and test subsets.
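A small usage sketch of the parameters listed above; the toy arrays X and y are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)                 # 10 samples, 2 features (toy data)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# 20% of the samples go to the test set; random_state fixes the shuffle so the
# split is reproducible; stratify=y keeps the 0/1 class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True, stratify=y)

print(X_train.shape, X_test.shape)               # (8, 2) (2, 2)
```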

Model Evaluation Criteria

Under normal circumstances we use accuracy to evaluate whether a model is good, but accuracy alone sometimes says nothing about model quality. For example, suppose we want to assess whether a person has cancer, and we take 1,000 people from a hospital as sample data, of whom 990 do not have cancer and only 10 do. If the model built on this data simply predicts that everyone is healthy, it is correct for the 990 healthy people, so looking only at accuracy it scores about 99%; yet inside that 99% it has not identified a single cancer patient, so this accuracy is meaningless.
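A quick numeric check of the example above; the 990/10 numbers come from the post, while the "predict healthy for everyone" model is just the illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# 990 healthy people (label 0) and 10 cancer patients (label 1)
y_true = np.array([0] * 990 + [1] * 10)

# a useless model that predicts "healthy" for everyone
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))   # 0.99 -> looks great, yet no cancer case is found
```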

Recall

What is used more often is recall. Recall measures coverage: of all the actual positive examples, how many are correctly predicted as positive, recall = TP / (TP + FN) = TP / P = sensitivity. For example, if 10 people have cancer and the model only finds 2 of them, the recall is 20%. The sketch below repeats this computation.
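Continuing the cancer example with the post's numbers (10 actual positives, 2 of them found by the model); the arrays here are just an illustrative reconstruction.

```python
import numpy as np
from sklearn.metrics import recall_score, confusion_matrix

# 990 healthy (0), 10 cancer (1); the model only flags 2 of the 10 cancer cases
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.array([0] * 990 + [1, 1] + [0] * 8)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn)                           # 2 8
print(recall_score(y_true, y_pred))     # 0.2 -> recall = TP / (TP + FN) = 2 / 10
```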

Overfitting

A situation often encountered when doing machine learning is that the model fits the training set well but fits the test set poorly. This situation is overfitting.
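As a small illustration of what overfitting looks like in practice (the high-degree polynomial setup here is my own choice, not from the post): the training score is high while the test score is much lower.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=30)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# a very high-degree polynomial tends to memorize the training points
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)

print("train R^2:", model.score(X_train, y_train))  # typically close to 1
print("test  R^2:", model.score(X_test, y_test))    # typically much lower -> overfitting
```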

Regularization

Cross validation was discussed earlier. When using cross validation, you obtain several different sets of parameter values, and each of these parameter sets may achieve a high fit on its training folds, so how do you choose among them? You can first look at all the parameters and observe how they are distributed: if a set of parameters has a very large spread, it is generally not the one to use, because overfitting typically shows up in the parameters the model learned. You should therefore prefer the set of parameters whose values are distributed more evenly.
Suppose, as in the figure from the original post, that the parameters of model A have a large spread while the parameters of model B have a relatively small spread. In such a case we want the penalty for model A to be large and the penalty for model B to be small.
L2 regularization: loss + (1/2) * ||w||^2
For the L2 regularization above, we can also put a penalty coefficient c in front of the regularization term. Its value can be chosen, usually from [0.01, 0.1, 1, 10, 100].
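A sketch of searching the penalty strength over the values mentioned above, using L2-penalized logistic regression; the data set is a placeholder, and note that in sklearn's LogisticRegression the parameter C is the inverse of the regularization strength, so small C means a strong penalty.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# L2-penalized logistic regression; try the usual grid of penalty values
pipe = make_pipeline(StandardScaler(), LogisticRegression(penalty="l2", max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}

# score each candidate with 3-fold cross validation on the training set,
# using recall as the selection criterion, as discussed earlier
search = GridSearchCV(pipe, param_grid, cv=3, scoring="recall")
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))
```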

Source: blog.csdn.net/Escid/article/details/90896585