Coursera Andrew Ng: Structuring Machine Learning Projects (notes)

I couldn't find these notes on the official website, so here is a brief summary.
I didn't absorb the course well: I was very sleepy while watching and couldn't concentrate. It covers many methods, some of which I can't use yet; rewatch the videos if anything is unclear. There may be omissions in these notes.

Week 1

1.2 Orthogonalization

Orthogonalization: think of adjusting the picture on an old-fashioned TV. One knob moves the picture horizontally, one moves it vertically, one rotates it. If a single knob changed two properties at once, tuning the picture would be very hard. That is the idea of orthogonalization: one knob adjusts exactly one variable.
The same holds for ML: each goal in the chain (fit the training set, then the dev set, then the test set, then perform well in the real world) should have its own set of knobs, adjusted one at a time. A rough understanding is enough here; the details come later.

1.3 Single number evaluation metric

A single quantitative evaluation metric.
Example: judging whether an image is a cat. Precision is the fraction of images the classifier labels "cat" that really are cats (here 95%); Recall is the fraction of actual cat images that the classifier correctly labels "cat" (here 90%). We want both to be high, so we combine them into the F1 score (the harmonic mean, the approach mentioned in an earlier lesson, which is low if either of the two is low) and compare classifiers on that single number; classifier A then comes out better.
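A single-number metric like F1 can be computed directly from precision and recall. A minimal sketch; classifier A's numbers are from the notes above, while classifier B's numbers are illustrative, not from the notes:

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall: low if either one is low.
    return 2 * precision * recall / (precision + recall)

f1_a = f1(0.95, 0.90)  # classifier A, numbers from the notes above
f1_b = f1(0.98, 0.85)  # classifier B (illustrative numbers)
best = "A" if f1_a > f1_b else "B"  # a single number decides the comparison
```

With these inputs A's F1 is higher, matching the conclusion in the notes.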
Another example: for each classifier you track the error percentage in four different countries/regions; a simple single metric is the average of the four, and you pick the algorithm with the lowest average error.

1.4 Satisficing and Optimizing metric

When there are multiple metrics, e.g. accuracy and running time, pick one as the optimizing metric and treat the rest as satisficing metrics: optimize the former subject to the latter being satisfied.

For example, require the running time to be within 100 ms; classifier C is then excluded, and of A and B, B has the higher accuracy, so B is chosen. In general, with N metrics, use 1 optimizing metric and N-1 satisficing metrics.
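The "one optimizing metric plus N-1 satisficing metrics" rule is easy to express in code. A small sketch; the accuracy and latency numbers are made up for illustration:

```python
# (name, accuracy, running_time_ms): illustrative numbers.
classifiers = [("A", 0.90, 80), ("B", 0.92, 95), ("C", 0.95, 1500)]

# Satisficing metric: running time must be <= 100 ms (excludes C).
feasible = [c for c in classifiers if c[2] <= 100]
# Optimizing metric: among the feasible ones, maximize accuracy.
best = max(feasible, key=lambda c: c[1])
```

Here `best` is classifier B: C is fastest-growing in accuracy but fails the satisficing constraint.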

1.5 Train/dev/test distributions

How to build the train/dev/test sets. (Note: "dev" is the development set, i.e. the cross-validation set from earlier courses.)
The lecture's bad example: build the dev set from four countries/regions and the test set from four others. That is very bad, because the algorithm gets tuned to fit the first four regions only, and the last four are not necessarily similar to them. The correct approach is to shuffle the data from all eight regions and then split it into dev and test, so that both sets contain data from all eight.
Choosing the wrong target is like practicing on the target on the left and then being scored, in actual competition, on the target on the right.
In summary: the dev set and test set must come from the same distribution.
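The fix described above (shuffle all regions together before splitting) can be sketched as follows; the region names and sizes are made up:

```python
import random

# Hypothetical data: 100 examples from each of 8 regions.
regions = ["US", "UK", "India", "China", "Japan", "Brazil", "Egypt", "Australia"]
data = [f"{r}_img{i}" for r in regions for i in range(100)]

# Shuffle everything first, then split, so dev and test share one distribution.
random.Random(0).shuffle(data)
dev, test = data[:400], data[400:]
```

Splitting by region instead (first four regions to dev, last four to test) would give the two sets different distributions.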

1.6 Size of the dev and test sets

How to divide the three sets: the old rule is 60/20/20, but because modern deep learning has access to much more data, and the dev/test sets only need to be large enough to be convincing, a 98/1/1 split is fine: with 1,000,000 examples, 10,000 each for the dev and test sets is enough. Two sets also work, but three are recommended.

1.7 When to change dev/test sets and metrics

When to change the metric.
Example: an algorithm that pushes cat pictures to users needs to recognize cats. Classifier A has a lower error rate than B, but A sometimes pushes pornographic images while B never does, so B is in fact better. In this case, adjust the error formula: give the error of predicting "cat" on a pornographic image a weight of 10 or 100, so that the metric selects classifiers that rarely push pornographic images while still identifying cat pictures well. (The exact formula is not elaborated here.)
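The reweighting described above can be written as a weighted error. A sketch, under the assumption that each example carries a flag saying whether it is a pornographic image; the weight of 100 follows the "10 or 100" in the notes:

```python
def weighted_error(examples):
    """examples: list of (y_true, y_pred, is_porn) tuples.
    A misclassification on a pornographic image counts 100x."""
    total_w, err_w = 0.0, 0.0
    for y_true, y_pred, is_porn in examples:
        w = 100.0 if is_porn else 1.0
        total_w += w
        if y_true != y_pred:
            err_w += w
    return err_w / total_w

# One porn image wrongly pushed as "cat" dominates the whole error.
err = weighted_error([(1, 1, False), (1, 0, False), (0, 1, True), (1, 1, False)])
```

A classifier that never pushes pornographic images gets a far smaller weighted error even if its plain error rate is slightly higher.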
Defining a metric to rank classifiers is itself an example of orthogonalization, done in two separate steps: first, define the metric (the formula for computing the error); second, separately, work out how to improve the system's performance on that metric.

1.8 Why human-level performance?

Bayes optimal error (Bayes error): the lowest error rate that is theoretically achievable on a task; no classifier can do better.
After an AI's accuracy surpasses human-level, progress slows, and it never quite reaches the Bayes optimal error. One reason is that human accuracy is already close to Bayes error, so there is little room left between the two. Another is that while accuracy is below human-level, certain "tools" can be used to improve it; once it surpasses humans, those tools no longer help.
The "tools" in question, available only while performance is below human-level: get labeled data from humans, gain insight from manual error analysis (why did a person get this example right?), and better analysis of bias and variance.

1.9 Avoidable bias

Avoidable bias.
After getting initial results from the cat classifier, compare them with human error. Higher accuracy is not always better, and sometimes we should not expect to surpass humans: if an image is so blurry that a human cannot tell whether it contains a cat, it is (in my understanding) not meaningful to demand that the machine can. Human error is close to Bayes error, and Bayes error cannot be beaten except by overfitting, so we would not want to go below Bayes/human error. The methods for choosing what to focus on follow; note that "focus on bias" in the slides means working to reduce bias, and likewise for variance.
We call the gap between training error and human error the avoidable bias, and the gap between training error and dev error the variance. Because human error is usually close to Bayes error, human error is used as a proxy for Bayes error.

In the left column, the avoidable bias is larger than the variance, so there is more potential improvement there: focus on reducing bias for the bigger payoff. In the right column, focus on variance.
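The bias-versus-variance decision above reduces to comparing two gaps, with human error standing in for Bayes error. A minimal sketch; the error rates in the two usage lines mirror the lecture's two-column example:

```python
def diagnose(human_err, train_err, dev_err):
    avoidable_bias = train_err - human_err  # human error as a Bayes-error proxy
    variance = dev_err - train_err
    return "bias" if avoidable_bias > variance else "variance"

left_column = diagnose(0.01, 0.08, 0.10)    # big bias gap: focus on bias
right_column = diagnose(0.075, 0.08, 0.10)  # big variance gap: focus on variance
```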

1.10 Understanding human-level performance

Choosing the human-level error matters when accuracy is already high. For radiology images, different groups of humans have different error rates (in the lecture: a typical person 3%, a typical doctor 1%, an experienced doctor 0.7%, a team of experienced doctors 0.5%), and which one you choose affects the analysis, especially when your own accuracy is high. We do know that Bayes error must be at or below the lowest human error rate, so here it is at most 0.5%.
In the left and middle columns of the example, it does not matter whether you use the error rate of a typical person or of the expert team: either way the comparison tells you whether to focus on bias or on variance. In the right column, however, the training and dev errors are both very close to human-level error, so different choices of "human error" flip the conclusion between bias and variance, and the decision becomes hard. This also explains why progress tends to stall once accuracy approaches human level.
The key difference from the previous section is that there we implicitly compared the bias against 0%, i.e. treated human error as 0%, and then asked whether bias or variance was larger. That works for tasks humans do well, but for things like hard-to-distinguish audio, where human error is clearly above 0%, you should use human error as your estimate of Bayes error. Using human error as a proxy for Bayes error gives better estimates of avoidable bias and variance, and hence better optimization strategies. (Rewatch video 8 if this is still unclear.)

1.11 Surpassing human-level performance

In scenario A, although the training error is lower than a typical human's error, you can still conclude, as before, that variance is the direction to work on. In scenario B, the algorithm has surpassed human performance, and without a human benchmark (and with limited information about Bayes error) the usual analysis no longer works, so progress slows. Note that on problems that are not natural-perception tasks, e.g. online advertising, product recommendation, logistics, and loan approval (structured-data problems), machines often outperform humans.

1.12 Improving your model performance

You have now learned about orthogonalization, how to set up dev and test sets, human-level performance as a proxy for Bayes error, and how to estimate avoidable bias and variance. Let's combine these into a set of guidelines for improving a learning algorithm.

The two fundamental conditions of supervised learning: first, fit the training set well (low avoidable bias); second, make the training-set performance generalize to the dev and test sets (low variance).
Finally, the slides summarize which methods to try for each diagnosis.
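The summary table lives in the slides; as a rough reconstruction, the mapping from diagnosis to tactics can be written down directly (the tactic lists paraphrase the ones the course mentions):

```python
TACTICS = {
    "avoidable_bias": [
        "train a bigger model",
        "train longer / use a better optimizer (momentum, RMSprop, Adam)",
        "try another architecture / hyperparameter search",
    ],
    "variance": [
        "get more data",
        "regularization (L2, dropout, data augmentation)",
        "try another architecture / hyperparameter search",
    ],
}

def suggest(human_err, train_err, dev_err):
    # Compare the two gaps, then return the matching tactic list.
    bias_gap = train_err - human_err
    var_gap = dev_err - train_err
    return TACTICS["avoidable_bias"] if bias_gap > var_gap else TACTICS["variance"]
```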

test

Even though adding the new data changes the distribution slightly, it should still be added.
Because the classifier chosen on the dev set is evaluated on the test set, the "more data" in that question is added to the test set rather than the training set.
Adjusting the evaluation metric is the first thing to do.
Buying a better machine (more compute) is also an option.

Week 2

2.1 Carrying out error analysis

Error Analysis.
Suppose your cat classifier is not performing well and there are several candidate fixes: some dog images are misclassified as cats, some big cats (lions, leopards) are misclassified as cats, and so on. One method: manually inspect about 100 misclassified images, record each cause of error, and compute the percentage for each cause; then prioritize the causes with the most mistakes. In the lecture's table, the second and third causes should be solved first.

This matters because optimizing blindly wastes effort: if you attack the first error cause and it accounts for only 8% of the mistakes, then even solving it perfectly reduces the error by at most that 8% share.
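The manual counting procedure above is simple to mechanize. A sketch with made-up tags (in practice an image can carry more than one tag, so the shares may sum past 100%):

```python
from collections import Counter

# Hypothetical tags from manually reviewing 100 misclassified dev images.
tags = ["dog"] * 8 + ["great_cat"] * 43 + ["blurry"] * 61

counts = Counter(tags)
shares = {cause: n / 100 for cause, n in counts.items()}
priority = [cause for cause, _ in counts.most_common()]  # biggest bucket first
```

Here `priority[0]` is `"blurry"`: fixing the "dog" bucket first could cut total error by at most 8%.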

2.2 Cleaning up incorrectly labeled data

In supervised learning, if some labels Y are wrong, is it worth your time to fix them?
If the errors are random/accidental, you can usually leave them be, since deep learning is fairly robust to random label noise; if they affect a large share of the data (or are systematic), then there is a problem worth fixing. This applies to the train, dev, and test sets alike.

For example, in the lecture's blue comparison: on the left, incorrect labels account for a small share of the overall error, so it is better to work on other causes first; on the right, they account for a larger share and are easy to act on, so fix them first.
If you decide to fix labels in the dev/test sets, keep a few things in mind.
First, whatever fixing process you apply, make sure the dev and test sets still come from the same distribution; the training distribution, however, is allowed to differ slightly from dev/test (cf. question 5 of the last quiz: new images with a slightly different distribution may be added to the training set but not to the test set).
Second, examine examples the algorithm got right as well as the ones it got wrong; sometimes it was simply lucky. In practice this step is often skipped because there is too much data.
Third, since the training set is much larger, it is usually not worth the time to fix its labels; and because deep learning is robust to a slightly different training distribution, leaving them is fine.

2.3 Build your first system quickly, then iterate

If the problem is new to you, quickly build a simple network first, then iterate; avoid building a very complex network on the first attempt. Once you have results, use bias/variance analysis and the error analysis from the previous two sections to find what to prioritize, and optimize gradually.

2.4 Training and testing on different distributions

How to handle training and test sets with different distributions.
Suppose you want to recognize cat pictures uploaded from a mobile app. You have 200,000 clear pictures crawled from the web and 10,000 blurry pictures from the app, and the two distributions differ somewhat. How should they be allocated?

Option 1 (not recommended): randomly distribute all 210,000 images across the three sets. Then most of the dev/test images are web images, which does not match your actual target (the app's cat pictures).
Option 2 (better): split the 10,000 app images, putting 5,000 into the training set, 2,500 into the dev set, and 2,500 into the test set. Alternatively, put 5,000 into dev and 5,000 into test, with none in the training set.

2.5 Bias and Variance with mismatched data distributions

Estimating the bias and variance of a learning algorithm helps you decide what to prioritize next, but when the training set comes from a different distribution than the dev/test sets, the analysis has to change accordingly.

Take the cat-app example from the previous section. When the training data and the dev data have different distributions, add a training-dev set: data drawn from the same distribution as the training set but held out from training. The gap between training-dev error and dev error is called the data mismatch (problem); it reflects how well the algorithm handles the distribution you actually care about rather than the one it was trained on.
The gap between dev error and test error shows the degree of overfitting to the dev set; if it is large, collect more dev data.
Finally, there is no systematic solution to data mismatch, but there are things worth trying (next section).
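The four gaps in the error chain above can be packaged into one helper. A sketch; the example numbers are illustrative:

```python
def error_gaps(human, train, train_dev, dev, test):
    """Split the human -> test error chain into its four named gaps."""
    return {
        "avoidable_bias": train - human,    # human vs training
        "variance": train_dev - train,      # training vs training-dev
        "data_mismatch": dev - train_dev,   # training-dev vs dev
        "overfit_to_dev": test - dev,       # dev vs test
    }

gaps = error_gaps(human=0.0, train=0.01, train_dev=0.015, dev=0.10, test=0.105)
worst = max(gaps, key=gaps.get)  # the largest gap is the thing to work on
```

With these numbers the dominant gap is `"data_mismatch"`, the case this section is about.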

2.6 Addressing data mismatch

When data mismatch occurs: first, manually analyze the errors to understand how the training data differs from the dev/test data; then either make the training data more similar to the dev/test data, or collect more data that resembles the dev/test sets.
For example, a rear-view-mirror speech assistant has a distribution problem: the dev/test audio contains noise while the training audio is clean, so noise should be added to the clean audio. If there are 10,000 hours of clean audio but only 1 hour of car noise, you can repeat the 1 hour 10,000 times and mix it with the clean audio to form a new training set; the data distributions are now closer. But there is a risk: the algorithm may overfit to that specific 1 hour of noise. It sounds like ordinary noise to the human ear, yet it is only a tiny sample of the space of all car noise, so overfitting is easy; mixing 10,000 hours of distinct noise 1:1 with the clean audio would be better. This technique is called artificial data synthesis; it is very useful, but this is its downside.
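A toy sketch of the synthesis idea, with strings standing in for audio clips; all names are made up. The point is to draw noise from as varied a pool as possible rather than reusing a single clip:

```python
import random

def synthesize(clean_clips, noise_pool, seed=0):
    """Pair every clean clip with a noise clip drawn at random from the pool."""
    rng = random.Random(seed)
    return [(clean, rng.choice(noise_pool)) for clean in clean_clips]

clean = [f"speech_{i}" for i in range(6)]
noise_pool = [f"car_noise_{i}" for i in range(3)]  # the larger, the less overfitting risk
training_pairs = synthesize(clean, noise_pool)
```

With `noise_pool` of size 1 this degenerates into the risky "repeat one hour of noise" setup described above.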

2.7 Transfer learning

Transfer learning: modify the output layer (or the last few layers) of a trained neural network so that it recognizes something related, e.g. replacing the last layers of a cat-recognition network to produce a radiology-image network. Note the direction: transfer goes from the data-rich task to the data-poor task, not the other way around.

Transfer learning applies when the original network was trained on a lot of data. Here there are many cat pictures but only a few radiology pictures; although the two tasks differ, the cat network has already learned general image-recognition knowledge in its early layers, which paves the way for recognizing radiology images. Speech recognition may be an even better example: a network originally trained to recognize human speech is adapted to detect a particular wake word, and the unmodified layers have already learned the characteristics of human voices.

If you only have a small radiology dataset, retrain only the weights of the last layer (or the last one or two layers) and keep all the other parameters fixed. If you have a lot of data, you can retrain all the parameters of the network.

In transfer learning, when you retrain all the parameters, the initial training phase (on, say, the image-recognition data) is called pre-training, because it pre-initializes the network's weights; the subsequent training on the radiology data is then called fine-tuning.
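A pure-Python toy of the "freeze everything, retrain the head" case described above. The "pretrained" hidden layer is random here just to keep the sketch self-contained; in a real setting it would come from the data-rich task, and the labels below are purely hypothetical:

```python
import math
import random

rng = random.Random(0)

# Stand-in for pretrained hidden-layer weights (frozen during fine-tuning).
W_hidden = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
W_snapshot = [row[:] for row in W_hidden]  # kept to check nothing changes

def features(x):
    # Frozen feature extractor: one ReLU layer using the pretrained weights.
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W_hidden]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny target-task dataset (hypothetical labels: sign of the input sum).
data = []
for _ in range(40):
    x = [rng.uniform(-1, 1) for _ in range(3)]
    data.append((x, 1.0 if sum(x) > 0 else 0.0))

# Only the new output layer's weights receive gradient updates.
w_out = [0.0] * 4
for _ in range(100):
    for x, y in data:
        h = features(x)
        p = sigmoid(sum(w * hi for w, hi in zip(w_out, h)))
        w_out = [w - 0.1 * (p - y) * hi for w, hi in zip(w_out, h)]
```

The frozen layer plays the role of the pre-trained network; updating `w_out` alone is the small-dataset regime, and unfreezing everything would be full fine-tuning.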

2.8 Multi-task learning

Transfer learning is one-to-one: task A is transferred to task B. Multi-task learning has a single network do several things at once, e.g. deciding whether a picture contains each of four things: pedestrians, cars, stop signs, and traffic lights. The difference from softmax is that softmax assigns one class per picture (e.g. "this is a car"), whereas in multi-task learning one picture can carry multiple labels. You could also train 4 separate networks, but since the early layers learn features shared across these recognition tasks, a single 4-in-1 network usually works as well or better.

Some entries of Y are given neither 0 nor 1 (a "?", meaning the label was never annotated), because the loss here is defined to sum only over entries that are labeled 0 or 1.
Multi-task learning is useful when the following three things hold: (1) the tasks share lower-level features; (2) the amount of data for each task is similar; (3) you can train a network big enough to do well on all the tasks.

The second point is not a hard requirement, but in the successful multi-task systems I have seen, the tasks have roughly similar amounts of data. Recall transfer learning from task A to task B: with 1,000,000 examples for A and only 1,000 for B, the knowledge from the million examples can boost the data-poor task B. In multi-task learning you generally have more than two tasks (four in the earlier example); say there are 100 tasks with 1,000 examples each. If you care about the accuracy of just one of them, say task 100, then training it alone gives you only its 1,000 examples, but multi-task training lets the other 99 tasks contribute their combined 99,000 examples to help it, and symmetrically every task gets help from the rest. The key is that, for whichever single task you focus on, the other tasks together have far more data than that task alone.

The only situation where multi-task learning hurts performance (compared with training separate networks) is when the network is not big enough.
In practice, multi-task learning is used much less than transfer learning. Transfer learning appears wherever you want to solve a problem with little data: find a related problem with a lot of data, learn from it, and transfer to the new problem. So if your dataset is relatively small, transfer learning can really help.
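The "sum only over labeled entries" rule from the loss definition above can be sketched directly. Averaging over the labeled tasks here for readability (the lecture sums over examples and tasks); the probabilities below are made up:

```python
import math

def multitask_loss(y_hat, y):
    """Binary cross-entropy over the tasks; entries of y may be None ('?')."""
    total, labeled = 0.0, 0
    for p, t in zip(y_hat, y):
        if t is None:          # unlabeled entry: skipped entirely
            continue
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
        labeled += 1
    return total / max(labeled, 1)

# pedestrian, car, stop sign, traffic light; the stop-sign label is missing.
loss = multitask_loss([0.9, 0.2, 0.5, 0.1], [1, 0, None, 0])
```

The missing stop-sign label contributes nothing, so the example still trains the other three tasks.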

2.9 What is end-to-end deep learning?

For example, in speech recognition, the traditional pipeline uses MFCC to extract low-level features, then ML to find phonemes, then assembles phonemes into words. End-to-end learning instead maps the audio directly to the transcript in one step (or at least with far fewer hand-designed stages).
Face recognition at an access gate is split into two steps: first detect and crop the face in the image, then compare the cropped face against the known employee faces. It is split this way because both sub-tasks have plenty of data, while there is very little data mapping raw camera photos directly to employee identities; when the one-step mapping lacks data, decompose the problem into sub-problems.
End-to-end and the traditional decomposed approach each have advantages. With a lot of data, end-to-end can be used (it need not literally be a single step, just much simpler than the traditional pipeline); with less data, the traditional approach is still very accurate.

2.10 Whether to use end-to-end deep learning

End-to-end pros and cons and guidelines for use

The second disadvantage is that end-to-end excludes potentially useful hand-designed components. A well-designed component can help a lot; but if it constrains the algorithm, e.g. forcing it to think in terms of phonemes when it has already found a better internal representation on its own, it becomes harmful. Hand design is thus a double-edged sword, though with a small training set, hand-designed components are more likely to help.

We might like to solve everything in one step (end-to-end), but when there is not enough data, or the network cannot learn a sufficiently complex function, the problem still has to be split into separate stages.

test

Even if the training set and dev/test sets have slightly different distributions, deep learning is robust to this.


Origin blog.csdn.net/Only_Wolfy/article/details/91427450