1.5 Training / Development / Testing Partition-Deep Learning Lesson 3 "Structured Machine Learning Project" -Professor Stanford Wu Enda

Train / Dev / Test Distribution

The way you set up training sets, development sets, and test sets greatly affects the speed at which you or your team can make progress in building machine learning applications. The same team, even a team in a large company, will really slow down the team ’s progress rather than speed up the way they set up these data sets. Let ’s see how these data sets should be set up to maximize the efficiency of your team .

Insert picture description here

In this video, I want to focus on how to set up a development set and a test set. The dev set is also called a development set , sometimes called a hold out cross validation set . Then, the workflow in machine learning is that you try many ideas, train different models with the training set, and then use the development set to evaluate different ideas, then choose one, and then iterate to improve the performance of the development set until the end. You can get a cost that is satisfactory to you, and then you can use the test set to evaluate.

Now, for example, you want to develop a cat classifier, and then you operate in these regions, the United States, the United Kingdom, other European countries, South America, India, China, other Asian countries and Australia, then how should you set up a development set What about the test set?

Insert picture description here

One way is that you can choose 4 of them. I plan to use these four (the first four), but it can also be a randomly selected area, and then say that the data from these four areas constitute the development set. Then for the other four areas, I plan to use these four (the last four), or I can choose 4 at random. These data constitute the test set.

It turns out that this idea is very bad, because in this example, your development set and test set come from different distributions. I suggest you not to do this, but let your development set and test set come from the same distribution. I mean this, you have to remember, I think it ’s like setting up your development set plus a single real evaluation indicator, it ’s like setting a goal, and then telling your team, that ’s what you ’re aiming at, because Once you have established such a development set and indicators, the team can iterate quickly, try different ideas, and run experiments. You can quickly use the development set and indicators to evaluate different classifiers, and then try to select the best one. Therefore, the machine learning team is generally very good at using different methods to approach the target, and then iterate continuously to continue to approach the bullseye. Therefore, it is optimized for the indicators on the development set.

Insert picture description here

Then in the example on the left, there is a problem when setting up the development set and the test set. Your team may spend several months iteratively optimizing on the development set. It turns out that when you finally test the system on the test set, The data from these four countries or the following four regions (that is, the test set data) and the data in the development set may be very different, so you may reap "surprise surprises" and find that it took so many months Time spent optimizing for the development set, but it did not perform well on the test set. So, if your development and test sets come from different distributions, it ’s like you set a goal and let your team spend a few months trying to reach the bullseye. As a result, after a few months of work, you find that you say "wait" , During the test, "I want to move the target here", and the team may say "Well, why do you let us spend so many months to approach the bullseye, then suddenly you can move the bullseye to a different s position?".

So, to avoid this situation, what I suggest is that you randomly shuffle all the data into the development set and test set, so the development set and test set have data from eight regions, and the development set and test set are both From the same distribution, this distribution is all your data mixed together.

Here is another example. This is a true story, but some details have changed. So I know there is a machine learning team that spent months optimizing on the development set, which contains loan approval data for middle-income postal codes. Then the specific machine learning problem is that the input x x is the loan application, can you predict the output and and and and is they have the ability to repay the loan? So this system can help banks judge whether to approve loans. So the development set comes from loan applications. These loan applications come from middle-income postal codes. Thezip codeis the US postal code. But after several months of training on this, the team suddenly decided to test it on low-income postal code data. Of course, the middle-income and low-income postal code data in this distribution data are very different, and they spend a lot of time optimizing the classifier for the previous set of data, resulting in a poor system performance in the latter set of data. So this particular team actually wasted 3 months and had to return to do a lot of work again.

Course PPT

Insert picture description here
Insert picture description here
Insert picture description here
Insert picture description here

Published 241 original articles · Like9 · Visitors 10,000+

Guess you like

Origin blog.csdn.net/weixin_36815313/article/details/105491530