Machine Learning Strategy 2 - Optimizing Deep Learning Systems

Carrying Out Error Analysis

If you want a learning algorithm to perform a task that humans can do, but the algorithm has not yet reached human-level performance, then manually inspecting the mistakes it makes can give you an idea of what to do next. This process is called error analysis.

Suppose you are debugging a cat classifier and it achieves 90% accuracy, i.e., a 10% error rate, on your dev set, which is still far from your goal. A team member looks at examples the algorithm misclassifies, notices that it labels some dogs as cats, and suggests optimizing the algorithm for dog pictures. To make your cat classifier do better on dog pictures and stop mistaking dogs for cats, you could collect more pictures of dogs or design features that deal specifically with dogs. But should you actually start a project focused on dogs? Is it worth the months it might take to get the algorithm to make fewer errors on dog pictures? You might spend those months only to discover that it barely helps at all. Here is an error analysis procedure that lets you quickly tell whether this direction is worth the effort.

First, collect, say, 100 mislabeled dev set examples, then manually check them one at a time to see how many of them are dogs. Now suppose that only 5% of your 100 mislabeled examples are dogs, that is, 5 out of the 100. That means that even if you completely solved the dog problem, you would only fix 5 of those 100 errors. In other words, if only 5% of your errors involve dog pictures, then even a lot of work on the dog problem would only bring your error rate down from 10% to 9.5%. You might decide that this is not a good use of your time, or maybe it is, but at least the analysis gives you an upper bound on how much working on the dog problem could improve the algorithm's performance. In machine learning this is sometimes called a performance ceiling: the best you could possibly do, i.e., how much completely solving the dog problem could help you.

But now suppose something different happens: you look at these 100 mislabeled dev set examples and find that 50 of the pictures are actually of dogs, so 50% of the errors involve dog pictures. Now spending time on the dog problem may well pay off. In this case, if you actually solved the dog problem, your error rate might drop from 10% to 5%. You might then decide that the direction that halves your error rate is worth pursuing, and focus on reducing the number of mislabeled dog pictures.
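
To make the arithmetic concrete, here is a minimal Python sketch of the ceiling calculation described above; the function name is just an illustration covering both scenarios.

```python
def performance_ceiling(overall_error, fraction_of_errors_in_category):
    """Best error rate achievable if one error category were fixed completely."""
    return overall_error * (1 - fraction_of_errors_in_category)

# 5% of the 100 mislabeled dev examples are dogs: the ceiling is 9.5% error.
print(performance_ceiling(0.10, 0.05))   # 0.095
# 50% of the errors are dogs: the ceiling is 5% error.
print(performance_ceiling(0.10, 0.50))   # 0.05
```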

In machine learning we are sometimes dismissive of manual work or of hand-built heuristics. But if you are building an application system, this simple manual counting step of error analysis can save a lot of time and quickly tell you which direction is the most important or the most promising.

When doing error analysis, you can also evaluate several ideas in parallel. To carry out error analysis, find a set of error examples, perhaps in your dev set or test set, look at the misclassified examples, including false positives and false negatives, and count how many errors fall into each category. In the process you may be inspired to define new error categories: if, while going through the examples, you notice that some other factor is confusing the classifier, you can create a new category on the spot. In short, counting what percentage of the total each error category accounts for helps you decide which problems to tackle first, and may suggest new directions for optimization.
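
As a rough illustration of this counting step, here is a small sketch, assuming you record the categories you check off for each inspected example; the category names and data structure are hypothetical, not part of any particular tool.

```python
from collections import Counter

# Hypothetical notes from manually inspecting mislabeled dev set examples.
# Each entry lists the categories a reviewer checked off for that example.
error_notes = [
    {"dog"}, {"blurry"}, {"dog", "blurry"}, {"incorrect_label"}, {"dog"},
    # ... one entry per inspected example
]

counts = Counter()
for categories in error_notes:
    counts.update(categories)

total = len(error_notes)
for category, n in counts.most_common():
    print(f"{category}: {n}/{total} = {100 * n / total:.1f}%")
```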

Cleaning Up Incorrectly Labeled Data

The data for your supervised learning problem consists of inputs x and output labels y. If you look at your data and find that some of the output labels are wrong, is it worth spending time to fix them?

First, consider the training set. It turns out that deep learning algorithms are quite robust to random errors in the training set. As long as the mislabeled examples are reasonably close to random, perhaps the labeler wasn't paying attention or accidentally pressed the wrong key, it is often fine to leave these errors alone rather than spend too much time fixing them.

Of course it doesn't hurt to go through the training set, check the labels, and fix them. Sometimes correcting these errors is valuable, and sometimes it is fine to leave them alone, as long as the total data set is large enough that the actual error rate is not too high.

Deep learning algorithms are robust to random errors, but much less robust to systematic errors. For example, if the labeler keeps tagging white dogs as cats, that is a problem, because after training your classifier will learn to classify all white dogs as cats. Random errors, or near-random errors, however, are not much of a problem for most deep learning algorithms.

What if there are examples in the dev set and test set with these label errors? If you are concerned about the impact of incorrectly labeled examples on the dev set or test set, the usual recommendation is to add an extra column during error analysis so that you can also count the number of incorrectly labeled examples. For example, while examining 100 mislabeled examples, where your classifier's output disagrees with the dev set label, you may find that for a few of them the disagreement is because the label is wrong, not because your classifier is wrong. In those cases you mark the extra column and then compute the percentage of errors due to incorrect labels.

Suppose the final count shows that 6% of the dev set errors are due to incorrect labels. Is it worth correcting these mislabeled examples? If the labeling errors seriously affect your ability to evaluate algorithms on the dev set, then you should spend the time to fix them. But if they don't significantly affect your ability to use the dev set to estimate and compare errors, then they are probably not worth spending valuable time on.

Here is a concrete example of deciding whether it is worth manually correcting mislabeled data. First look at the overall dev set error rate: suppose the system reaches 90% overall accuracy, so a 10% error rate. Then count the share of errors caused by incorrect labels: if 6% of the errors come from labeling errors, then 6% of 10% is 0.6%. At that point you should look at the errors caused by other reasons: of the 10% dev set error, 0.6% is due to labeling errors and the remaining 9.4% is due to other causes, such as mistaking dogs for cats. So in this case the 9.4% is where you should concentrate your effort, and the errors caused by incorrect labels are only a small part of the total. If you really want to, you can still manually correct the various wrong labels, but it is probably not the most important task right now.
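
Here is a tiny sketch that reproduces the arithmetic above; the variable names are only illustrative.

```python
overall_dev_error = 0.10           # 10% overall dev set error
fraction_due_to_bad_labels = 0.06  # 6% of the errors come from incorrect labels

error_from_bad_labels = overall_dev_error * fraction_due_to_bad_labels   # 0.006 -> 0.6%
error_from_other_causes = overall_dev_error - error_from_bad_labels      # 0.094 -> 9.4%

print(f"due to incorrect labels: {error_from_bad_labels:.1%}")
print(f"due to other causes:     {error_from_other_causes:.1%}")
```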

Training and Testing on Different Distributions

Deep learning algorithms work best when you can collect enough labeled data for the training set. This has led many teams to gather data by whatever means they can and pile it into the training set to make it larger, even when some of that data, or even most of it, comes from a different distribution than the dev set and test set.
Say you're developing a mobile app where users upload photos taken with their phones, and you want to recognize whether the uploaded images are of cats. You now have two sources of data. One is the distribution you really care about: the photos uploaded from the app, which are generally more amateurish, poorly framed, and sometimes even blurry. The other source is web pages you can mine with a crawler, from which you can directly download many cat pictures that are professionally framed, high resolution, and professionally shot. If your app doesn't have many users yet, you may only be able to collect 10,000 user-uploaded photos, but by crawling the web you can download a massive number of cat pictures, say more than 200,000. What you really care about is how well your final system handles the distribution of images coming from your app, because those are the images your users will upload, and your classifier must perform well on that task. So you are stuck: you have a relatively small data set of only 10,000 examples from the distribution you care about, and a much larger data set from another distribution whose images do not look like the ones you actually need to process. You don't want to use only the 10,000 images, because then your training set would be too small, and the 200,000 images seem like they should help; the dilemma is that they don't come from exactly the distribution you want. So what can you do?

One option is to merge the two data sets, giving you 210,000 photos, and randomly shuffle them into training, dev, and test sets. Setting up your data this way has advantages and disadvantages. The advantage is that your training, dev, and test sets all come from the same distribution, which is easier to manage. The disadvantage, and it is not a small one, is that if you look at the dev set, most of its images will be images downloaded from the web, which is not the distribution you actually care about; what you really want to handle are images from mobile phones. Remember, the purpose of the dev set is to tell your team what target to aim at, and with this setup most of your effort would go into optimizing for images downloaded from the web, which is not what you want. So this approach is not recommended.
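
To see how small the relevant part of such a dev set would be, here is a quick calculation with the numbers above, assuming the 210,000 images are shuffled uniformly at random:

```python
app_photos = 10_000
web_photos = 200_000
dev_size = 2_500

app_fraction = app_photos / (app_photos + web_photos)   # about 4.8%
expected_app_in_dev = app_fraction * dev_size            # about 119 photos

# On average only ~119 of the 2,500 dev images would come from the app,
# so the dev set would mostly measure performance on web images.
print(f"{app_fraction:.1%} of a shuffled dev set would be app photos "
      f"(about {expected_app_in_dev:.0f} of {dev_size})")
```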

Instead, the following split is recommended. The training set contains 205,000 images: the 200,000 web images plus 5,000 images from the mobile app. The dev set is 2,500 app images, and the test set is another 2,500 app images. The advantage of dividing the data this way is that the target you are aiming at is now the target you actually want to hit: you can tell your team that every image in the dev set comes from a phone upload, which is the image distribution you really care about, so the goal is to build a learning system that performs well on phone-uploaded images. The disadvantage is that your training set distribution now differs from your dev and test set distributions. But it turns out that splitting the data into training, dev, and test sets like this gives you better system performance in the long run.
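
A minimal sketch of this recommended split, assuming the images are simply lists of file paths (the names are placeholders):

```python
import random

# Hypothetical lists of image file paths.
web_images = [f"web_{i}.jpg" for i in range(200_000)]   # crawled, high quality
app_images = [f"app_{i}.jpg" for i in range(10_000)]    # uploaded by users

random.seed(0)
random.shuffle(app_images)

# Training set: all web images plus 5,000 app images (205,000 total).
train_set = web_images + app_images[:5_000]
# Dev and test sets: app images only, 2,500 each, matching the target distribution.
dev_set = app_images[5_000:7_500]
test_set = app_images[7_500:]

print(len(train_set), len(dev_set), len(test_set))  # 205000 2500 2500
```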

Bias and Variance with Mismatched Data Distributions

Estimating the bias and variance of a learning algorithm can really help you decide what to prioritize next. But when your training set comes from a different distribution than your dev and test sets, the way you analyze bias and variance changes.

If your dev set comes from the same distribution as your training set, and your training error is 1% while your dev error is 10%, you would say there is a large variance problem: the algorithm is not generalizing well from the training set; it handles the training set well but does much worse on the dev set.

But if your training data and dev data come from different distributions, you can no longer safely draw that conclusion. Perhaps the algorithm does well on the training set simply because the training set is easy to recognize, all high-resolution, very clear images, while the dev set is much harder to recognize. Maybe there is no variance problem at all; the gap simply reflects that the dev set contains images that are harder to classify accurately. The problem with this analysis is that two things change at once: first, the algorithm has seen the training data but not the dev data; second, the dev data comes from a different distribution. You can't tell how much of the gap is due to each.

To separate these two effects, we define a new set of data called the training-dev set. It is a new subset of data carved out of the training set distribution, but it will not be used to train your network.

So the setup is now: the dev set and the test set come from the same distribution, while the training set comes from a different distribution. What you do is randomly shuffle the training set and set aside part of it as the training-dev set. Just as the dev set and test set come from the same distribution, the training set and the training-dev set also come from the same distribution.
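
A minimal sketch of carving out such a training-dev set, assuming the training-distribution examples are collected in one list; the 10,000-example size is only illustrative:

```python
import random

# Hypothetical pool of training-distribution examples (e.g., mostly web images).
training_pool = [f"train_{i}.jpg" for i in range(205_000)]

random.seed(1)
random.shuffle(training_pool)

# Hold out a slice of the training distribution as the training-dev set.
# It comes from the same distribution as the training set but is never used
# for backpropagation; it is only used to measure error.
train_dev_set = training_pool[:10_000]
train_set = training_pool[10_000:]

print(len(train_set), len(train_dev_set))  # 195000 10000
```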

The difference is that you train your neural network only on the training set; you never run backpropagation on the training-dev set. To carry out error analysis, you then look at the classifier's error on the training set, its error on the training-dev set, and its error on the dev set.

For example, suppose the training error is 1%, the error on the training-dev set is 9%, and the dev set error is 10%. You can conclude that when you move from the training data to the training-dev data, the error rate goes up a lot. The difference between the training data and the training-dev data is that your neural network trained directly on the first but never on the second, yet both come from the same distribution. This tells you the algorithm has a variance problem: even though your neural network performs well on the training set, it does not generalize well to training-dev data drawn from the same distribution.

Now consider a different case: the training error is 1%, the training-dev error is 1.5%, but when you evaluate on the dev set the error jumps to 10%. Here the variance problem is small, because going from the training data the network has seen to the training-dev data it has not seen only raises the error a little. But moving to the dev set raises the error sharply, so this is a data mismatch problem. Your learning algorithm was not trained directly on either the training-dev set or the dev set, but those two sets come from different distributions, and whatever the algorithm learned works well on the training-dev set but not on the dev set. In short, your algorithm is good at handling a distribution different from the data you actually care about. We call this a data mismatch problem.
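
Here is a small sketch that turns these gaps into a crude diagnosis; the threshold and function name are assumptions for illustration, not a standard recipe:

```python
def diagnose(train_err, train_dev_err, dev_err, gap_threshold=0.03):
    """Crude reading of the error gaps described above (threshold is illustrative)."""
    variance_gap = train_dev_err - train_err   # same distribution, unseen data
    mismatch_gap = dev_err - train_dev_err     # different distribution
    findings = []
    if variance_gap > gap_threshold:
        findings.append(f"variance problem (gap {variance_gap:.1%})")
    if mismatch_gap > gap_threshold:
        findings.append(f"data mismatch problem (gap {mismatch_gap:.1%})")
    return findings or ["no large gap between these three errors"]

print(diagnose(0.01, 0.09, 0.10))    # ['variance problem (gap 8.0%)']
print(diagnose(0.01, 0.015, 0.10))   # ['data mismatch problem (gap 8.5%)']
```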

Addressing Data Mismatch

If you find a serious data mismatch problem, the usual first step is to perform manual error analysis to try to understand the specific differences between the training set and the dev/test sets. (Technically, to avoid overfitting the test set, you should do this error analysis on the dev set rather than the test set.) Then see whether there is a way to collect more data that looks like the dev set for training.

One approach is artificial data synthesis, and artificial data synthesis really does work. But when you use it you must be careful and keep in mind that you may be simulating only a small part of the full space of possibilities, and your learning algorithm can end up overfitting to that small subset.
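
As one hypothetical example for the cat-app setting above, you might degrade high-quality web photos so they look more like amateur phone uploads. This sketch uses Pillow and is only an illustration of the idea, not a method from the course; it also shows exactly the risk just mentioned, since it simulates only one kind of degradation.

```python
from PIL import Image, ImageFilter  # Pillow

def make_phone_like(src_path, dst_path, blur_radius=1.5, jpeg_quality=40):
    """Degrade a high-quality web photo so it resembles an amateur phone upload.

    Only a single degradation style is simulated here; relying on one blur radius
    and one compression level covers just a small part of the possibility space.
    """
    img = Image.open(src_path).convert("RGB")
    img = img.resize((img.width // 2, img.height // 2))              # lower resolution
    img = img.filter(ImageFilter.GaussianBlur(radius=blur_radius))   # slight blur
    img.save(dst_path, format="JPEG", quality=jpeg_quality)          # compression artifacts

make_phone_like("web_cat.jpg", "synthetic_phone_cat.jpg")  # hypothetical file names
```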
