Testing and Validation of Machine Learning (Machine Learning Workshop 5)

Machine Learning Study 3 and 4 can be found on Autumn Code Records.

Testing and validation

The only way to know how well a model generalizes to new cases is to actually try it on new cases. One approach is to put the model into production and monitor its performance. This works, but if your model is really bad, your users will complain - not the best idea.

A better option is to split the data into two sets: a training set and a test set. As the name implies, you use the training set to train the model and the test set to test the model. The error rate on new cases is called the generalization error (or out-of-sample error), and you can get an estimate of this error by evaluating the model on the test set. This value tells you how well the model performs on instances it has never seen before.

If the training error is low (i.e., your model makes few mistakes on the training set) but the generalization error is high, it means your model is overfitting the training data.

Usually 80% of the data is used for training and 20% of the data is reserved for testing. However, this depends on the size of your dataset: if it contains 10 million instances, then keeping 1% means your test set will contain 100,000 instances, which may be enough to get a good estimate of generalization error.
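The original post shows no code, but a minimal sketch of an 80/20 split with scikit-learn's train_test_split might look like the following (the toy dataset X, y and the choice of a linear model are made up purely for illustration):

```python
# Minimal sketch of an 80/20 train/test split with scikit-learn.
# X and y are a made-up feature matrix and label vector.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))                                  # toy features
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.1, size=1000)

# Hold out 20% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

train_error = mean_squared_error(y_train, model.predict(X_train))
# The test error is an estimate of the generalization error.
test_error = mean_squared_error(y_test, model.predict(X_test))
print(f"training error: {train_error:.4f}, test error: {test_error:.4f}")
```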

Hyperparameter tuning and model selection

Evaluating a model is easy: just use the test set. But suppose you are torn between two types of models, say a linear model and a polynomial model: how do you decide between them? One option is to train both and compare how well they generalize using the test set.
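As an illustration (not from the original post), here is a short sketch comparing a linear model with a degree-2 polynomial model on the same test set, reusing the X_train/X_test split from the previous example:

```python
# Sketch: comparing a linear model and a polynomial model on the same test set.
# Reuses X_train, X_test, y_train, y_test from the earlier split.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

linear_model = LinearRegression()
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

for name, model in [("linear", linear_model), ("polynomial", poly_model)]:
    model.fit(X_train, y_train)
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: test MSE = {test_mse:.4f}")
```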

Now suppose the linear model generalizes better, but you want to apply some regularization to avoid overfitting. The question is, how do you choose the value of the regularization hyperparameter? One option is to train 100 different models with 100 different values of this hyperparameter. Suppose you find the best hyperparameter value, which produces the model with the lowest generalization error (for example, only 5% error). You put this model into production, but unfortunately it does not perform as expected and produces 15% error. What just happened?

The problem is that you measured the generalization error multiple times on the test set, and you tuned the model and hyperparameters to produce the best model for that particular set. This means the model is less likely to perform well on new data.

A common solution to this problem is called holdout validation (pictured below): you simply hold out part of the training set to evaluate several candidate models and choose the best one. The newly held-out set is called the validation set (or development set, or dev set). More specifically, you train multiple models with various hyperparameters on a reduced training set (that is, the full training set minus the validation set), and then select the model that performs best on the validation set. After this holdout validation process, you train the best model on the full training set (including the validation set), which gives you the final model. Finally, you evaluate this final model on the test set to obtain an estimate of the generalization error.

[Figure: model selection using holdout validation]
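A minimal sketch of this workflow, again with scikit-learn (the choice of Ridge regression and of the candidate alpha values is only an illustrative assumption; it reuses the X_train/y_train/X_test/y_test split from the earlier example):

```python
# Sketch of holdout validation: carve a validation set out of the training set,
# pick the best hyperparameter on it, then retrain on the full training set.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Reduced training set + validation set.
X_train_small, X_valid, y_train_small, y_valid = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)

# Try several values of the regularization hyperparameter alpha.
candidates = {}
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_train_small, y_train_small)
    candidates[alpha] = mean_squared_error(y_valid, model.predict(X_valid))

best_alpha = min(candidates, key=candidates.get)

# Retrain the winning model on the full training set (including the validation
# set), then get a final generalization estimate from the test set, used only once.
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
test_error = mean_squared_error(y_test, final_model.predict(X_test))
print(f"best alpha: {best_alpha}, test MSE: {test_error:.4f}")
```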

This solution usually works fairly well. However, if the validation set is too small, the model evaluations will be imprecise: you may end up selecting a suboptimal model by mistake. Conversely, if the validation set is too large, the remaining training set will be much smaller than the full training set. Why is this bad? Well, since the final model will be trained on the full training set, comparing candidate models trained on a much smaller training set is not ideal. It is like selecting the fastest sprinter to run a marathon. One way to solve this problem is to perform repeated cross-validation, using many small validation sets. Each model is evaluated once per validation set after being trained on the rest of the data. By averaging all the evaluations of a model, you get a much more accurate measure of its performance. However, there is a downside: the training time is multiplied by the number of validation sets.
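Scikit-learn's RepeatedKFold together with cross_val_score is one way to do this; the sketch below (an illustration, reusing X_train and y_train from above) averages each candidate's scores over many small validation sets:

```python
# Sketch of repeated cross-validation with many small validation sets:
# each candidate is trained and evaluated several times, and the scores averaged.
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.linear_model import Ridge

cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
for alpha in [0.1, 1.0, 10.0]:
    scores = cross_val_score(Ridge(alpha=alpha), X_train, y_train,
                             scoring="neg_mean_squared_error", cv=cv)
    print(f"alpha={alpha}: mean MSE = {-scores.mean():.4f} (+/- {scores.std():.4f})")
```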

Data mismatch

In some cases, it is easy to obtain large amounts of data for training, but this data may not perfectly represent the data used in production. For example, suppose you want to create a mobile application that takes photos of flowers and automatically determines their species. You can easily download millions of photos of flowers on the web, but they don't quite represent what you'd actually take with the app on your mobile device. Maybe you only have 1,000 representative photos (those that were actually taken using the app).

In this case, the most important rule to remember is that both the validation set and the test set must be as representative as possible of the data you expect to see in production, so they should be composed exclusively of representative images: you can shuffle them and put half in the validation set and half in the test set (making sure there are no duplicates or near-duplicates in either set). After training your model on the web images, if you observe that its performance on the validation set is disappointing, you will not know whether this is because your model has overfit the training set, or simply because of the mismatch between the web images and the mobile app pictures.

One solution is to hold out some of the training images (from the web) in yet another set, which Andrew Ng calls the train-dev set (pictured below). After the model is trained (on the training set, not on the train-dev set), you can evaluate it on the train-dev set. If it performs poorly, it must be overfitting the training set, so you should try to simplify or regularize the model, get more training data, and clean up the training data. But if it performs well on the train-dev set, then you can evaluate the model on the dev set. If it performs poorly there, the problem must come from the data mismatch. You can try to tackle this by preprocessing the web images to make them look more like the pictures the mobile app will take, and then retraining the model. Once you have a model that performs well on both the train-dev set and the dev set, you can evaluate it one last time on the test set to know how well it is likely to perform in production.

[Figure: When real data is scarce (right), you can train on similar abundant data (left) and hold out some of it in a train-dev set to evaluate overfitting; the real data is then used to evaluate data mismatch (dev set) and the performance of the final model (test set).]
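To make the splits concrete, here is a small sketch; web_photos and app_photos are hypothetical placeholders (dummy arrays here) standing in for the downloaded web images and the roughly 1,000 representative photos taken with the app:

```python
# Sketch of the splits described above, with dummy stand-in arrays.
import numpy as np
from sklearn.model_selection import train_test_split

web_photos = np.arange(100_000)    # placeholder for abundant web images
app_photos = np.arange(1_000)      # placeholder for representative app photos

# Hold out part of the web data as the train-dev set; train only on train_set.
train_set, train_dev_set = train_test_split(
    web_photos, test_size=0.1, random_state=42)

# Split the representative photos between the dev (validation) and test sets.
dev_set, test_set = train_test_split(
    app_photos, test_size=0.5, random_state=42)

# train_dev_set diagnoses overfitting of the training set,
# dev_set diagnoses data mismatch, and test_set gives the final estimate.
```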

The No Free Lunch theorem

A model is a simplified representation of data. Simplification is done to discard redundant details that are unlikely to generalize to new instances. When you choose a particular type of model, you are implicitly making assumptions about the data. For example, if you choose a linear model, you are implicitly assuming that the data is essentially linear, and that the distance between instances and lines is just noise, which can be safely ignored.

In a famous 1996 paper, David Wolpert demonstrated that if you make absolutely no assumptions about the data, then there is no reason to prefer one model over any other. This is the so-called "No Free Lunch" (NFL) theorem. For some datasets, the best model is a linear model, while for others it is a neural network. None of the models is a priori guaranteed to work better (hence the name of the theorem). The only way to determine which model is best is to evaluate them all. Since this is impossible, in practice you make some reasonable assumptions about the data and only evaluate a few reasonable models. For example, for simple tasks you can evaluate linear models with various levels of regularization, and for complex problems you can evaluate various neural networks.
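In that spirit, a sketch like the one below (purely illustrative, reusing X_train and y_train from the earlier examples) compares only a couple of reasonable candidates, here a regularized linear model and a small neural network, via cross-validation:

```python
# Sketch: evaluate a few reasonable models rather than all possible ones.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor

models = {
    "ridge": Ridge(alpha=1.0),
    "small_mlp": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                              random_state=42),
}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train,
                             scoring="neg_mean_squared_error", cv=5)
    print(f"{name}: mean MSE = {-scores.mean():.4f}")
```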

Exercises

We introduced some of the most important concepts in machine learning. We'll dig deeper and write more code in the upcoming labs, but before doing so, make sure you can answer the following questions:

  • 1. How would you define machine learning?
  • 2. Can you name the four application types it is best at?
  • 3. What is a labeled training set?
  • 4. What are the two most common supervisory tasks?
  • 5. Can you name four common unsupervised tasks?
  • 6. What type of algorithm would you use to make the robot walk in various unknown terrains?
  • 7. What type of algorithm will you use to classify customers into groups?
  • 8. Would you define the spam detection problem as a supervised learning problem or an unsupervised learning problem?
  • 9. What is an online learning system?
  • 10. What is out-of-core learning?
  • 11. What types of algorithms rely on similarity measures to make predictions?
  • 12. What is the difference between model parameters and model hyperparameters?
  • 13. What do model-based algorithms search for? What are their most common strategies for success? How do they make predictions?
  • 14. Can you name the four main challenges of machine learning?
  • 15. What happens if your model performs well on the training data but generalizes poorly to new instances? Can you name three possible solutions?
  • 16. What is a test set and why is it used?
  • 17. What is the purpose of the validation set?
  • 18. What is the train-dev set, when is it needed, and how to use it?
  • 19. What will happen if you use the test set to tune hyperparameters?

Origin blog.csdn.net/coco2d_x2014/article/details/132511371