2 Model evaluation
2.1 Splitting the data set into a training set and a test set
The split is commonly 7:3; if the data are already in random order, simply take the first 70% of the samples as the training set
2.1.1 Hold-out method
Directly partition the data set D into two mutually exclusive sets
- The split should keep the data distribution consistent (e.g. via stratified sampling)
- Since many different splits are possible, common practice is to repeat the experiment over several random splits and take the average as the evaluation result
- Common practice: use 2/3 to 4/5 of the samples for training and the rest for testing; the test set should contain at least 30 samples
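The hold-out split above can be sketched in a few lines. This is a minimal stdlib-only sketch; the function name `holdout_split` and the 7:3 default are illustrative, and real code would typically also stratify by label:

```python
import random

def holdout_split(data, labels, train_frac=0.7, seed=0):
    """Randomly partition (data, labels) into two mutually exclusive sets."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)          # random order instead of trusting input order
    cut = int(train_frac * len(data))
    train = [(data[i], labels[i]) for i in idx[:cut]]
    test = [(data[i], labels[i]) for i in idx[cut:]]
    return train, test

X = list(range(100))
y = [x % 2 for x in X]
train, test = holdout_split(X, y)
print(len(train), len(test))  # 70 30
```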
2.1.2 Cross-validation: k-fold cross-validation
Partition the data set D into k mutually exclusive subsets of similar size; in each round use k-1 subsets as the training set and the remaining one as the test set, giving k rounds of training and testing; report the mean of the k test results
- Typically k = 10
- Different partitions give different results, so the procedure is often repeated, e.g. 10-time 10-fold cross-validation
- Leave-one-out: with m samples, set k = m; however, the computational cost is too high when the number of samples is large
- A typical split: 60% as the training set, 20% as the validation set, 20% as the test set; select the model with the smallest validation-set error
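The k-fold partition described above can be sketched as follows. This is a stdlib-only sketch; `kfold_indices` is an illustrative name, and real code would shuffle the indices first:

```python
def kfold_indices(m, k=10):
    """Split indices 0..m-1 into k mutually exclusive folds of similar size;
    yield (train_idx, test_idx) for each of the k rounds."""
    sizes = [m // k + (1 if i < m % k else 0) for i in range(k)]
    idx = list(range(m))
    start = 0
    for size in sizes:
        test = idx[start:start + size]           # one fold held out
        train = idx[:start] + idx[start + size:] # the remaining k-1 folds
        yield train, test
        start += size

# The reported estimate is the mean of the k per-round test results.
for train_idx, test_idx in kfold_indices(10, k=5):
    print(len(train_idx), len(test_idx))  # 8 2 on each round
```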
2.1.3 Bootstrapping
Each time, draw one sample from the data set D uniformly at random and place a copy of it into the data set D'; repeat this m times to obtain the final training set D'. About 36.8% of the samples in D never appear in D', and these are used as the test set
- The resulting test result is called the out-of-bag estimate
- Bootstrapping changes the distribution of the initial data set, which introduces estimation bias; it is useful when the amount of data is insufficient
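The 36.8% figure (the limit of \((1-1/m)^m\), i.e. \(1/e\)) is easy to verify empirically. A minimal sketch; the function name `bootstrap_sample` is illustrative:

```python
import random

def bootstrap_sample(m, seed=0):
    """Draw m samples from {0, ..., m-1} with replacement (the training set D');
    the indices never drawn form the out-of-bag test set."""
    rng = random.Random(seed)
    train = [rng.randrange(m) for _ in range(m)]
    oob = sorted(set(range(m)) - set(train))
    return train, oob

train, oob = bootstrap_sample(10000)
print(len(oob) / 10000)  # close to 1/e, about 0.368
```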
2.2 Parameter tuning
- Common practice: search candidate values over a given range with a given step size
- Two kinds of parameters: algorithm parameters (hyperparameters) and model parameters; the former are few in number and set manually, the latter are numerous and produced by learning
- Model evaluation and selection use only part of the samples as the training set; once the model and parameters are chosen, retrain on all the samples to obtain the final model
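The range-and-step practice can be sketched as a grid search. Here `validation_error` is a toy stand-in, labeled as an assumption: in real code it would train a model with the candidate hyperparameters and return its validation-set error:

```python
import itertools

def validation_error(lam, degree):
    """Toy stand-in for: train with (lam, degree), return validation error.
    This surface is minimized at lam=0.1, degree=3 by construction."""
    return (lam - 0.1) ** 2 + (degree - 3) ** 2

lambdas = [0.01, 0.1, 1.0]   # candidate range for the algorithm parameter
degrees = [1, 2, 3, 4]       # candidate polynomial degrees (the step-size grid)
best = min(itertools.product(lambdas, degrees),
           key=lambda p: validation_error(*p))
print(best)  # (0.1, 3) for this toy error surface
```

After the best setting is found, the model would be retrained on all samples with it.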
2.3 Performance measures
The most commonly used performance measure for regression tasks is the mean squared error:
\[E(f;D)=\frac{1}{m}\sum_{i=1}^{m}(f(x_i)-y_i)^2\tag{2.1}\]
More generally, for a data distribution \(\mathcal{D}\) and probability density function \(p(\cdot)\), the mean squared error can be written as
\[E(f;\mathcal{D})=\int_{x\sim\mathcal{D}}(f(x)-y)^2\,p(x)\,dx\tag{2.2}\]
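Equation (2.1) translates directly into code. A minimal sketch; the toy predictor and data are illustrative:

```python
def mse(f, xs, ys):
    """Empirical mean squared error of predictor f on samples (xs, ys), as in Eq. (2.1)."""
    m = len(xs)
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / m

f = lambda x: 2 * x           # assumed toy predictor
xs, ys = [1, 2, 3], [2, 5, 6]
print(mse(f, xs, ys))         # (0 + 1 + 0) / 3 = 1/3
```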
2.3.1 F1 measure
F1 is the harmonic mean of precision P and recall R:
\[F1=\frac{2PR}{P+R}\]
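F1 = 2PR/(P+R) can be computed directly from the confusion-matrix counts. A minimal sketch; the function name is illustrative, and division-by-zero guards are omitted:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean F1, from
    true-positive, false-positive, and false-negative counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(p, r, f1)  # 0.8, 2/3, 8/11
```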
2.4 0/1 Test Error
\[ err(h_\theta(x),y)= \begin{cases} 1, & \mbox{if }h_\theta(x)\ge0.5,y=0\\&\mbox{ or if }h_\theta(x)<0.5,y=1 \\ 0, & \mbox{otherwise}\end{cases}\tag{2.3} \]
\[ \text{Test error}=\frac{1}{m_{test}}\sum_{i=1}^{m_{test}}err(h_\theta(x_{test}^{(i)}),y_{test}^{(i)})\tag{2.4} \]
2.5 Bias, variance, and noise
Bias: measures how far the learner's expected prediction deviates from the true result; it characterizes the fitting ability of the learning algorithm itself
Variance: measures the change in performance caused by different training sets of the same size; it characterizes the impact of perturbations in the data
Noise: expresses the lower bound of the expected generalization error that any learning algorithm can achieve on the current task; it characterizes the difficulty of the learning problem itself
Bias and variance are in conflict (the bias-variance trade-off)
If both the training error and the cross-validation error are large, the model has high bias and is underfitting
If the training error is small but the cross-validation error is large, the model has high variance and is overfitting
2.6 Learning curves
The abscissa is the number of training samples and the ordinate is the error; remember to set the regularization term to 0 when computing the training error and the validation error
- In the well-fitted case, as the number of samples grows the training-set error increases and the validation-set error decreases, because the parameters must fit more and more training data
- When programming, each round trains the parameters on a portion of the training set, then computes the cost function on ==all the samples in the validation set==
- When the learning curve shows high bias, increasing the number of samples leaves the error stabilized at a high value
- When the learning curve shows high variance, the cross-validation error is much larger than the training-set error; increasing the number of samples can reduce the validation-set error, because with more cases covered, the probability of the validation set containing unseen kinds of samples decreases
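The points of a learning curve can be computed as sketched below. As an assumption for illustration, the "model" is the trivial predict-the-training-mean estimator and the data are made up; what matters is the loop structure: fit on the first i training samples, always evaluate on the whole validation set:

```python
def fit_mean(ys):
    """Trivial illustrative model: predict the mean of the training labels."""
    return sum(ys) / len(ys)

def sq_error(pred, ys):
    """Squared-error cost, with the regularization term set to 0."""
    return sum((pred - y) ** 2 for y in ys) / (2 * len(ys))

y_train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # toy training labels
y_val = [2.5, 3.5, 4.5]                   # toy validation labels

# One learning-curve point per training-set size i.
for i in range(1, len(y_train) + 1):
    model = fit_mean(y_train[:i])
    print(i, sq_error(model, y_train[:i]), sq_error(model, y_val))
```

Running this shows the expected shape: the training error grows with i while the validation error shrinks.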
2.7 Ways to modify the model
- Increase the number of training samples: addresses high variance
- Reduce the number of features: prevents overfitting, addresses high variance
- Increase the number of features: raises the model's complexity, addresses high bias
- Add polynomial features: combines existing features, addresses high bias
- Decrease the regularization parameter \(\lambda\), i.e. enlarge the feature parameters: prevents underfitting, addresses high bias
- Increase the regularization parameter \(\lambda\), i.e. shrink the feature parameters: prevents overfitting, addresses high variance
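The role of \(\lambda\) in the last two items can be seen in the regularized squared-error cost. A sketch with a linear hypothesis and made-up data; by convention the intercept \(\theta_0\) is not penalized:

```python
def regularized_cost(theta, xs, ys, lam):
    """Squared-error cost plus L2 penalty; theta[0] (the intercept) is not penalized."""
    m = len(xs)
    def h(x):
        return theta[0] + theta[1] * x
    fit = sum((h(x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)
    penalty = lam * theta[1] ** 2 / (2 * m)
    return fit + penalty

theta = [0.5, 2.0]                          # toy parameters
xs, ys = [1.0, 2.0, 3.0], [2.4, 4.6, 6.4]   # toy data
# A larger lambda makes the same nonzero theta[1] more expensive,
# pushing the optimizer toward smaller feature parameters.
print(regularized_cost(theta, xs, ys, 0.1), regularized_cost(theta, xs, ys, 10.0))
```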
2.8 Decision boundary
All of these algorithms find the optimal parameters \(\theta\) according to some criterion; the curve \(\theta^Tx=0\), drawn in the feature space, is the decision boundary
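For a linear hypothesis the boundary \(\theta^Tx=0\) is easy to check point by point. The parameter values here are assumed purely for illustration:

```python
# Assumed toy parameters: theta = (-3, 1, 1) gives the boundary line x1 + x2 = 3.
theta = (-3.0, 1.0, 1.0)

def predict(x1, x2):
    """Classify by which side of the decision boundary theta^T x = 0 the point is on."""
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1 if z >= 0 else 0

print(predict(2.0, 2.0), predict(1.0, 1.0))  # 1 0
```

(2, 2) lies on the positive side of x1 + x2 = 3, while (1, 1) lies on the negative side.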