Week 6 | Stanford CS229 Machine Learning

foreword

This article contains study notes for Stanford University's CS229 Machine Learning course.

The main body of this article is reprinted from Dr. Huang Haiguang's notes; the link is given at the end of the article. If you are interested, you can visit the notes' homepage directly to download the corresponding course materials and homework code.

Course official website: CS229: Machine Learning (stanford.edu)

Course Video: Stanford CS229: Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018) - YouTube

Summary of Notes: Summary of Notes | Stanford CS229 Machine Learning_ReturnTmp's Blog-CSDN Blog

Week 6

10. Advice for Applying Machine Learning

10.1 Deciding what to do next

Reference video: 10 - 1 - Deciding What to Try Next (6 min).mkv

We’ve covered many different learning algorithms so far, and if you’ve been following along through these videos, you’ll find yourself becoming an expert at many advanced machine learning techniques before you know it.

However, there is still a big gap between people who understand machine learning: some have truly mastered how to apply these learning algorithms efficiently and effectively, while others may not fully understand how to apply them and end up wasting time on pointless attempts. What I want to do is make sure that, when you design a machine learning system, you understand how to choose the most appropriate path. So in this video and the next few, I'll give you some practical advice and guidance on how to make these choices. Specifically, the question I will focus on is this: if you are developing a machine learning system, or trying to improve its performance, how should you decide what to try next? To explain this, let's return to the example of predicting housing prices. Suppose you have implemented regularized linear regression, that is, minimized the cost function J to obtain your learned parameters. If you then test your hypothesis on a new set of housing samples and find that its predictions have large errors, what should you try next to improve the algorithm?

In fact, you can think of many ways to improve the performance of this algorithm. One is to use more training samples: perhaps you could get more and different house-sale data through telephone or door-to-door surveys. Unfortunately, I have seen many people spend a great deal of time collecting more training samples, thinking that if only they had twice or even ten times as much data the problem would surely be solved. But sometimes getting more training data doesn't actually help, and in the next few videos we'll see why.

We'll also see how to avoid wasting too much time collecting training data that doesn't really help. Another approach you might think of is to try a smaller set of features. If you have features like x_1, x_2, x_3 and so on, perhaps a large number of them, you could spend some time carefully picking a small subset to prevent overfitting. Or maybe you need more features: perhaps the current feature set is not informative enough, and you want to collect more data in the sense of obtaining additional features. Again, this could grow into a large project, such as conducting telephone surveys to get more housing cases, or doing land surveys to get more information about the land, and so on. Likewise, we would very much like to know how well such an effort is likely to work before spending a lot of time on it. We could also try adding polynomial features, such as the square of x_1, the square of x_2, or the product x_1 x_2, and we could spend a lot of time considering that option. We could also consider decreasing or increasing the regularization parameter λ. Each item on this list could easily turn into a project of six months or more. Unfortunately, the criterion most people use to choose among these methods is gut feeling: they pick one at random. Someone says, "Oh, let's go get some more data," and then spends six months collecting it; then maybe someone else says, "Okay, let's extract some more features from these houses." I have sadly seen, more than once, people spend at least six months on a randomly chosen method, only to discover after all that time that it was a dead end. Fortunately, there are some simple techniques that let you do more with less: they can rule out at least half of the methods on your list and leave only the ones that are really promising. There is a very simple method that, if you use it, can easily eliminate many options without much effort, saving you a great deal of unnecessary time and ultimately helping you improve the performance of your machine learning system.

Suppose we have used a linear regression model to predict housing prices, and when we apply the trained model to unseen data we find large errors. What could we try next?

  1. Obtain more training samples - usually effective, but expensive; the methods below may also be effective, so consider them first.

  2. Try a smaller set of features

  3. Try to get additional features

  4. Try adding polynomial features

  5. Try decreasing the regularization parameter λ

  6. Try increasing the regularization parameter λ

Instead of randomly picking one of the above methods to improve our algorithm, we should use some machine learning diagnostics to help us know which of the above methods are effective for our algorithm.

In the next two videos, I'll first describe how to evaluate the performance of a machine learning algorithm, and then in the videos after that I'll discuss these techniques, which are called "machine learning diagnostics". By "diagnostic" I mean a test you can run to gain insight into whether or not an algorithm is working, and which usually also tells you what kinds of changes are worth trying in order to improve its performance. In this series of videos we will introduce specific diagnostics. I want to point out in advance that implementing these diagnostics takes time, sometimes a lot of time to understand and implement, but it is time well spent, because these methods can save you months when developing a learning algorithm. So in the next few lessons I will first introduce how to evaluate your learning algorithm, and after that I will cover some diagnostics that will hopefully make it clearer which next step is the most meaningful to try.

10.2 Evaluating a hypothesis

Reference video: 10 - 2 - Evaluating a Hypothesis (8 min).mkv

In this video I want to explain how to evaluate a hypothesis function learned by your algorithm. In later videos, we will build on this to discuss how to avoid overfitting and underfitting.

When we fit the parameters of a learning algorithm, we choose the parameters to minimize the training error. Some people think that a very small training error must be a good thing, but we already know that a very small training error does not by itself mean the hypothesis is good. We have also seen examples of overfit hypothesis functions, which fail to generalize to new examples.

So how do you tell whether a hypothesis is overfitting? For a simple example we could plot the hypothesis h(x) and look at its shape, but in general, for problems with more than one feature, or with very many features, plotting the hypothesis becomes difficult or even impossible.

Therefore, we need another way to evaluate whether our hypothesis is overfitting.

To test whether the algorithm is overfitting, we divide the data into a training set and a test set, typically using 70% of the data as the training set and the remaining 30% as the test set. It is very important that both the training set and the test set contain a mix of all types of data, so we usually "shuffle" the data before splitting it.

Test set evaluation: after the model has learned its parameters from the training set, we apply it to the test set and can compute the error in two ways:

  1. For a linear regression model, we use the test set data to compute the cost function J (the squared error).
  2. For a logistic regression model, besides computing the cost function on the test set:

$$J_{test}(\theta) = -\frac{1}{m_{test}} \sum_{i=1}^{m_{test}} \left[ y_{test}^{(i)} \log h_{\theta}\left(x_{test}^{(i)}\right) + \left(1 - y_{test}^{(i)}\right) \log\left(1 - h_{\theta}\left(x_{test}^{(i)}\right)\right) \right]$$

we can also compute the misclassification rate (0/1 error). For each test-set example we compute:

$$err\left(h_{\theta}(x), y\right) = \begin{cases} 1 & \text{if } h_{\theta}(x) \ge 0.5 \text{ and } y = 0, \text{ or } h_{\theta}(x) < 0.5 \text{ and } y = 1 \\ 0 & \text{otherwise} \end{cases}$$

The results are then averaged over the test set.
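As an illustration (not part of the original notes), here is a minimal NumPy sketch of the 70/30 split and the two test-set error measures; the helper names, and the assumption that `theta` comes from an already-trained linear or logistic model, are mine:

```python
import numpy as np

def train_test_split(X, y, test_ratio=0.3, seed=0):
    """Shuffle the data, then hold out the last `test_ratio` fraction as a test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(len(y) * test_ratio)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

def linear_test_error(theta, X_test, y_test):
    """Squared-error test cost J_test for linear regression."""
    residual = X_test @ theta - y_test
    return residual @ residual / (2 * len(y_test))

def logistic_misclassification_error(theta, X_test, y_test):
    """0/1 misclassification rate for logistic regression with a 0.5 threshold."""
    h = 1.0 / (1.0 + np.exp(-(X_test @ theta)))
    predictions = (h >= 0.5).astype(int)
    return np.mean(predictions != y_test)
```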

10.3 Model selection and cross-validation sets

Reference video: 10 - 3 - Model Selection and Train_Validation_Test Sets (12 min).mkv

Suppose we want to choose among ten polynomial models of increasing degree (d = 1 through 10):

Obviously, a higher-degree polynomial fits our training set better, but fitting the training data well does not mean the model generalizes; we should choose a model that works well in general. For this we need a cross-validation set to help with model selection.

That is: use 60% of the data as the training set, 20% as the cross-validation set, and 20% as the test set.


The method of model selection is:

  1. Use the training set to train the 10 models

  2. Use the 10 models to compute the cross-validation error (the value of the cost function) on the cross-validation set

  3. Choose the model with the smallest cross-validation error

  4. Use the model selected in step 3 to compute the generalization error (the value of the cost function) on the test set

Train/validation/test error

Training error:

$$J_{train}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_{\theta}\left(x^{(i)}\right) - y^{(i)} \right)^{2}$$

Cross-validation error:

$$J_{cv}(\theta) = \frac{1}{2m_{cv}} \sum_{i=1}^{m_{cv}} \left( h_{\theta}\left(x_{cv}^{(i)}\right) - y_{cv}^{(i)} \right)^{2}$$

Test error:

$$J_{test}(\theta) = \frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} \left( h_{\theta}\left(x_{test}^{(i)}\right) - y_{test}^{(i)} \right)^{2}$$
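For concreteness, here is a minimal sketch of the degree-selection procedure above, assuming a single input feature, polynomial feature expansion, and regularized normal-equation fits; all function names are illustrative rather than taken from the course code:

```python
import numpy as np

def poly_features(x, degree):
    """Map a 1-D input array to columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def fit_linear_regression(X, y, lam=0.0):
    """Regularized normal equation; the bias term (first column) is not regularized."""
    reg = lam * np.eye(X.shape[1])
    reg[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + reg, X.T @ y)

def squared_error(theta, X, y):
    """Unregularized squared-error cost, used for CV and test evaluation."""
    r = X @ theta - y
    return r @ r / (2 * len(y))

def select_degree(x_train, y_train, x_cv, y_cv, x_test, y_test, max_degree=10):
    thetas, cv_errors = [], []
    for d in range(1, max_degree + 1):                                   # step 1
        theta = fit_linear_regression(poly_features(x_train, d), y_train)
        thetas.append(theta)
        cv_errors.append(squared_error(theta, poly_features(x_cv, d), y_cv))  # step 2
    best = int(np.argmin(cv_errors))                                     # step 3
    d_best, theta_best = best + 1, thetas[best]
    test_error = squared_error(theta_best, poly_features(x_test, d_best), y_test)  # step 4
    return d_best, cv_errors, test_error
```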

10.4 Diagnostic Bias and Variance

Reference video: 10 - 4 - Diagnosing Bias vs. Variance (8 min).mkv

When you run a learning algorithm and its performance is not what you hoped for, it is almost always because of one of two things: either the bias is high or the variance is high. In other words, you have either an underfitting problem or an overfitting problem. Knowing which of these two situations you are in is very important, because it is a very effective indicator of which ways of improving the algorithm are worth trying. In this video I want to dig a little deeper into bias and variance, so that you understand them better, know how to evaluate a learning algorithm, and can judge whether it has a bias problem or a variance problem. This matters because it is central to figuring out how to improve a learning algorithm's performance; high bias and high variance are essentially underfitting and overfitting problems, respectively.

To diagnose these problems, we usually plot the training-set error and the cross-validation-set error of the cost function against the degree d of the polynomial:

Bias/variance

Training error:

$$J_{train}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_{\theta}\left(x^{(i)}\right) - y^{(i)} \right)^{2}$$

Cross-validation error:

$$J_{cv}(\theta) = \frac{1}{2m_{cv}} \sum_{i=1}^{m_{cv}} \left( h_{\theta}\left(x_{cv}^{(i)}\right) - y_{cv}^{(i)} \right)^{2}$$

For the training set, when d is small the model underfits and the training error is large; as d increases, the fit improves and the training error decreases.

For the cross-validation set, when d is small the model underfits and the error is large; as d increases, the error first decreases and then increases again. The turning point is where the model starts to overfit the training set.

If our cross-validation set has a large error, how do we judge whether it is variance or bias? From the graph above, we know that:

When the training-set error and the cross-validation-set error are both high and close to each other: high bias / underfitting.

When the cross-validation-set error is much larger than the training-set error: high variance / overfitting.
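These two rules can be captured in a tiny helper. This is only an illustrative sketch of mine; the `high_error` level and the `gap_ratio` threshold are arbitrary assumptions, not values from the course:

```python
def diagnose(train_error, cv_error, high_error, gap_ratio=2.0):
    """Rough bias/variance diagnosis from training and cross-validation errors.

    `high_error` is what counts as an unacceptably large error for this problem,
    and `gap_ratio` is an arbitrary illustrative threshold; both are assumptions.
    """
    if cv_error > gap_ratio * train_error:
        return "high variance / overfitting: CV error far exceeds training error"
    if train_error > high_error:
        return "high bias / underfitting: training and CV errors both high"
    return "looks fine: both errors are low and close together"
```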

10.5 Regularization and bias/variance

Reference video: 10 - 5 - Regularization and Bias_Variance (11 min).mkv

When training a model, we generally use regularization to prevent overfitting. But the degree of regularization may be too high or too low; that is, when choosing the value of λ we face the same kind of question as when choosing the polynomial degree above.

We choose a series of λ values to try, usually values between 0 and 10 that roughly double each time (for example: 0, 0.01, 0.02, 0.04, 0.08, 0.15, 0.32, 0.64, 1.28, 2.56, 5.12, 10, twelve values in total). We again split the data into a training set, a cross-validation set, and a test set.

The method for selecting λ is:

  1. Use the training set to train the 12 models with different degrees of regularization
  2. Use the 12 models to compute the cross-validation error on the cross-validation set
  3. Choose the model that yields the smallest cross-validation error
  4. Use the model selected in step 3 to compute the generalization error on the test set; we can also plot the training-set and cross-validation-set cost-function errors against the value of λ:

• When λ is small, the training-set error is small (overfitting) and the cross-validation-set error is large.

• As λ increases, the training-set error keeps increasing (underfitting), while the cross-validation-set error first decreases and then increases.
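Here is a minimal sketch of the λ-selection loop above, assuming linear regression fitted with a regularized normal equation; note that the cross-validation and test errors are computed without the regularization term. The helper names are illustrative, not course code:

```python
import numpy as np

LAMBDAS = [0, 0.01, 0.02, 0.04, 0.08, 0.15, 0.32, 0.64, 1.28, 2.56, 5.12, 10]

def fit_ridge(X, y, lam):
    """Regularized normal equation; the bias column X[:, 0] is not regularized."""
    reg = lam * np.eye(X.shape[1])
    reg[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + reg, X.T @ y)

def squared_error(theta, X, y):
    """Unregularized squared error, used for cross-validation and test evaluation."""
    r = X @ theta - y
    return r @ r / (2 * len(y))

def select_lambda(X_train, y_train, X_cv, y_cv, X_test, y_test, lambdas=LAMBDAS):
    thetas = [fit_ridge(X_train, y_train, lam) for lam in lambdas]   # step 1
    cv_errors = [squared_error(t, X_cv, y_cv) for t in thetas]       # step 2
    best = int(np.argmin(cv_errors))                                 # step 3
    test_error = squared_error(thetas[best], X_test, y_test)         # step 4
    return lambdas[best], cv_errors, test_error
```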

10.6 Learning Curve

Reference video: 10 - 6 - Learning Curves (12 min).mkv

The learning curve is a very useful tool; I often use learning curves to judge whether a learning algorithm suffers from a bias problem or a variance problem. A learning curve is a good sanity check for a learning algorithm. It is a plot of the training error and the cross-validation error as functions of the number of training examples (m).

That is, if we have 100 rows of data, we start by training on just 1 row and gradually use more. The idea is that when the model is trained on only a few examples, it can fit that small training set perfectly, but it will not fit the cross-validation or test data well.

How to use learning curves to identify high bias/underfitting: as an example, suppose we try to fit a straight line to the data below. No matter how many training examples we add, both errors remain high and hardly change:

That is, in the case of high bias/underfitting , adding data to the training set may not necessarily help .

How to use learning curves to identify high variance/overfitting: suppose we use a very high-degree polynomial model with very little regularization. When the cross-validation error is much larger than the training error, adding more data to the training set can improve the model:

That is to say, in the case of high variance/overfitting , adding more data to the training set may improve the performance of the algorithm .
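A learning curve can be generated by training on progressively larger prefixes of the training set and recording both errors. In this sketch (my own illustration), `fit` and `error` are placeholders for whatever training routine and cost function you are using:

```python
import numpy as np

def learning_curve(X_train, y_train, X_cv, y_cv, fit, error):
    """Train on the first m examples for m = 1..len(y_train); return both error curves.

    `fit(X, y)` returns model parameters; `error(theta, X, y)` returns the
    (unregularized) cost. Both are placeholders for your own routines.
    """
    train_errors, cv_errors = [], []
    for m in range(1, len(y_train) + 1):
        theta = fit(X_train[:m], y_train[:m])
        train_errors.append(error(theta, X_train[:m], y_train[:m]))  # error on the m examples used
        cv_errors.append(error(theta, X_cv, y_cv))                   # error on the full CV set
    return np.array(train_errors), np.array(cv_errors)
```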

10.7 Deciding what to do next

Reference video: 10 - 7 - Deciding What to Do Next Revisited (7 min).mkv

We have introduced how to evaluate a learning algorithm, we have discussed the problem of model selection, bias and variance. So how do these diagnostic rules help us determine which methods may help improve the performance of learning algorithms, and which may be futile?

Let's go back to our original example and look for the answer there. Looking back at the six candidate next steps listed in section 10.1, let's see when each one should be chosen:

  1. Get more training examples - addresses high variance

  2. Try a smaller set of features - addresses high variance

  3. Try to get additional features - addresses high bias

  4. Try adding polynomial features - addresses high bias

  5. Try decreasing the regularization parameter λ - addresses high bias

  6. Try increasing the regularization parameter λ - addresses high variance

Variance and bias of a neural network:

Using a smaller neural network (similar to having fewer parameters) is prone to high bias and underfitting, but is computationally cheaper. Using a larger neural network (similar to having more parameters) is prone to high variance and overfitting; although it is computationally more expensive, it can be adjusted with regularization to fit the data better.

It is usually better to choose a larger neural network and use regularization than to use a smaller neural network.

When choosing the number of hidden layers in a neural network, we usually start with one hidden layer and gradually increase the number. To make a better choice, split the data into a training set, a cross-validation set, and a test set, train neural networks with different numbers of hidden layers, and then select the network with the smallest cross-validation cost.

That concludes our introduction to bias and variance and to the learning-curve method for diagnosing them. You can use everything above to decide which avenues are likely to help when improving the performance of a learning algorithm, and which are probably pointless. If you understand the content of the last few videos and know how to apply it, then you can already use machine learning methods to solve practical problems effectively, like many machine learning practitioners in Silicon Valley who use these algorithms every day. I hope that the techniques covered in these sections, about bias, variance, and diagnostics such as learning curves, will really help you apply machine learning more effectively and make it work well.

11. Machine Learning System Design

11.1 What to do first

Reference video: 11 - 1 - Prioritizing What to Work On (10 min).mkv

In the next videos I will talk about machine learning system design. These videos touch on the main issues you will encounter when designing a complex machine learning system, and we will try to give some advice on how to build such a system cleverly. The following lessons may be less mathematical, but the material is very useful and may save you a great deal of time when building large-scale machine learning systems.

This week we discuss a spam classifier algorithm as an example.

To solve such a problem, the first decision we have to make is how to choose and represent the feature vector x. We can take a list of the 100 words that appear most often in spam and build our feature vector from whether each of these words appears in the email (1 if it appears, 0 if it does not), giving a vector of size 100×1.
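As a minimal sketch (my own illustration, not the course's actual vocabulary or tokenizer), the feature mapping could look like this:

```python
import re
from collections import Counter

def build_vocabulary(spam_emails, size=100):
    """Take the `size` most frequent words across the spam examples as the vocabulary."""
    counts = Counter()
    for email in spam_emails:
        counts.update(re.findall(r"[a-z]+", email.lower()))
    return [word for word, _ in counts.most_common(size)]

def email_to_feature_vector(email, vocabulary):
    """x_j = 1 if vocabulary word j appears in the email, else 0 (a 100x1 vector)."""
    words = set(re.findall(r"[a-z]+", email.lower()))
    return [1 if word in words else 0 for word in vocabulary]
```

For example, `email_to_feature_vector("buy cheap watches now", vocab)` would return a 0/1 list whose length equals the vocabulary size.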

In order to build this classifier algorithm, we can do many things, such as:

  1. Collect more data so we have more samples of spam and non-spam

  2. Develop a complex set of features based on mail routing information

  3. Develop a sophisticated set of features based on the email body, including how truncated or stemmed words are handled

  4. Develop sophisticated algorithms for detecting deliberate misspellings (write watch as w4tch )

Among the options above it is very difficult to decide which one to spend time and energy on, and choosing wisely is better than going with a gut feeling. When we work on machine learning problems we can always "brainstorm" a list of things to try; in fact, if you even get as far as listing different ways to improve accuracy, you are probably already ahead of many people. Most people don't list the possible options; they simply wake up one morning and, for some reason, act on a whim: "Let's try to collect a lot of data with a honeypot project."

We'll talk about error analysis in a later lesson, where I'll show you how to choose among a bunch of different approaches in a more systematic way, so that you are more likely to pick a genuinely good approach, one that is worth days, weeks, or even months of in-depth work.

11.2 Error Analysis

Reference video: 11 - 2 - Error Analysis (13 min).mkv

In this lesson we will talk about the concept of error analysis, which will help you make decisions more systematically. If you are going to work on machine learning, or build a machine learning application, the best practice is not to start by building a very complicated system with lots of sophisticated features, but to start with a simple algorithm that you can implement quickly.

Whenever I work on a machine learning problem, I spend at most one day, literally 24 hours, trying to get a first result quickly, even if it doesn't work well. Frankly, there is no complicated system at all, just a quick-and-dirty implementation. Even if it isn't perfect, run it, and then evaluate it on the cross-validation data. Once that's done, you can plot learning curves; by plotting learning curves and examining the errors you can find out whether your algorithm has a high-bias or high-variance problem, or some other problem. After such an analysis you can decide whether it is worthwhile to train with more data or to add more features.

The reason for working this way is that when you are new to a machine learning problem you can't know in advance whether you need more complex features, more data, or something else. It is very hard to know what to do in advance, because you lack evidence and you lack learning curves, so it is hard to know where to spend your time to improve performance. But once you have even a quick, imperfect implementation, you can make further choices by plotting learning curves. In this way you also avoid what in computer programming is called premature optimization: the idea is to let evidence, rather than intuition, guide how we allocate our time to optimize the algorithm, because intuition is often wrong.

Besides plotting learning curves, another very useful thing is error analysis. By that I mean: when building a spam classifier, I would look at my cross-validation set and examine for myself which emails the algorithm misclassified. By looking at the spam and non-spam emails that were misclassified, you can often find systematic patterns: which types of emails are always misclassified. Doing this regularly can inspire you to construct new features, or tell you what the shortcomings of the current system are and how to improve it.

The recommended way to build a learning algorithm is:

1. Start with a simple algorithm: choose a simple algorithm that can be implemented quickly, implement it, and test it on the cross-validation data

2. Draw the learning curve and decide whether to add more data, or add more features, or other options

3. Perform error analysis : Manually check the samples in the cross-validation set that produce prediction errors in our algorithm to see if there is a systematic trend in these samples

Taking our spam filter as an example, error analysis means examining all the emails in the cross-validation set that our algorithm misclassified and seeing whether these emails can be grouped into categories, for example pharmaceutical spam, counterfeit-goods spam, or password-stealing emails. Then see which category the classifier makes the most errors on, and start optimizing there.

Think about how you can improve your classifier. For example, to find out if certain features are missing, count the number of times these features occur.

For example, record how many times misspellings occur, how many times abnormal mail routing occurs, etc., and then start optimizing from the most frequent occurrences.

Error analysis does not always tell us what action to take. Sometimes we need to try different models and compare them; when comparing models, use a numerical evaluation to judge which model is better and more effective, usually the error on the cross-validation set.

In our spam classifier example: should we treat discount/discounts/discounted/discounting as the same word? If doing so would improve our algorithm, we could use stemming software. Error analysis cannot help us make this kind of judgment; we can only try the two options, with and without stemming, and then decide which one is better based on the numerical test results.
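A sketch of such a numerical comparison is below; `preprocess_a`/`preprocess_b` (for example, with and without stemming), `train`, and `cv_error` are placeholders for your own pipeline, not course code:

```python
def compare_preprocessing(train_emails, train_labels, cv_emails, cv_labels,
                          preprocess_a, preprocess_b, train, cv_error):
    """Train the same classifier with two preprocessing choices (e.g. with and
    without stemming) and report the cross-validation error of each."""
    model_a = train([preprocess_a(e) for e in train_emails], train_labels)
    model_b = train([preprocess_b(e) for e in train_emails], train_labels)
    err_a = cv_error(model_a, [preprocess_a(e) for e in cv_emails], cv_labels)
    err_b = cv_error(model_b, [preprocess_b(e) for e in cv_emails], cv_labels)
    return {"option A": err_a, "option B": err_b}
```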

Therefore, when you are developing a learning algorithm, you will always be trying lots of new ideas and implementing many versions of it. If every time you try a new idea you have to manually inspect examples to see whether it performs better or worse, it is hard to make decisions, for example whether to use stemming or whether to be case-sensitive. But with a single quantitative evaluation metric you can simply look at the number and see whether the error got bigger or smaller. You can iterate on new ideas much faster, because the metric tells you immediately whether your idea improved the algorithm or made it worse, which greatly speeds you up. So I strongly recommend performing error analysis on the cross-validation set rather than on the test set. Some people do error analysis on the test set, but this is not appropriate from a mathematical point of view, so I recommend doing your error analysis on the cross-validation set.

To sum up, when you start on a new machine learning problem, I always recommend implementing a quick, simple algorithm, even if it is imperfect. I almost never see people do this; what people often do instead is spend a lot of time building what they consider a simple method. So don't worry about your algorithm being too simple or too imperfect; implement it as fast as you can. Once you have an initial implementation, it becomes a very powerful tool for deciding what to do next, because you can look at the errors it makes and, through error analysis, see what kinds of mistakes it makes and then decide how to optimize. Another point: if you have a quick, imperfect implementation together with a numerical evaluation metric, it will help you try new ideas and quickly find out whether they improve performance, so you can decide faster what to discard and what to keep. Error analysis helps us choose what to do in a systematic way.

11.3 Error Metrics for Skewed Classes

Reference video: 11 - 3 - Error Metrics for Skewed Classes (12 min).mkv

In previous lessons I mentioned error analysis and the importance of error metrics: picking a single real number that evaluates your learning algorithm and measures its performance. One important caveat about choosing an appropriate error metric, which can affect your learning algorithm in subtle ways, is the problem of skewed classes. Skewed classes means that in our training set there are very many examples of one class and very few (or no) examples of the other.

For example, suppose we want to use an algorithm to predict whether a tumor is malignant. In our training set only 0.5% of the examples are malignant tumors. Suppose we write a non-learning algorithm that always predicts the tumor is benign; its error is only 0.5%, whereas the neural network we trained has a 1% error. In this situation, the error rate alone cannot be used to judge the effectiveness of an algorithm.

Precision and Recall: we divide the algorithm's predictions into four cases:

  1. True Positive (TP): predicted positive, actually positive

  2. True Negative (TN): predicted negative, actually negative

  3. False Positive (FP): predicted positive, actually negative

  4. False Negative (FN): predicted negative, actually positive

Then: Precision = TP / (TP + FP). For example, of all the patients we predicted to have a malignant tumor, the proportion that actually do; the higher, the better.

Recall = TP / (TP + FN). For example, of all patients who actually have a malignant tumor, the proportion we successfully predicted as malignant; the higher, the better.

Thus, for the algorithm that always predicts the tumor is benign, the recall is 0.

|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | TP                 | FN                 |
| Actual Negative | FP                 | TN                 |
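A minimal sketch (my own illustration) of computing precision and recall from 0/1 predictions:

```python
def precision_recall(predictions, labels):
    """Compute precision and recall from 0/1 predictions and true 0/1 labels."""
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall
```

For the classifier that always predicts "benign", tp is 0, so the recall is 0, matching the observation above.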

11.4 The Trade-Off Between Precision and Recall

Reference video: 11 - 4 - Trading Off Precision and Recall (14 min).mkv

In previous lessons, we talked about precision and recall as evaluation metrics for problems with skewed classes. In many applications, we hope to ensure the relative balance between precision and recall.

In this lesson I'll show you how to do that, and also show some more effective ways to use precision and recall to evaluate algorithms. Continuing with the tumor-prediction example: suppose our algorithm outputs a value between 0 and 1, and we use a threshold of 0.5 to decide between positive and negative predictions.

Precision = TP / (TP + FP): of all the patients we predicted to have a malignant tumor, the proportion that actually do; the higher, the better.

Recall = TP / (TP + FN): of all patients who actually have a malignant tumor, the proportion successfully predicted as malignant; the higher, the better.

If we want to predict positive (malignant) only when we are very confident, that is, if we want higher precision, we can use a threshold greater than 0.5, such as 0.7 or 0.9. Doing so reduces the number of patients incorrectly told they have cancer (false positives), but increases the number of malignant tumors we fail to detect (false negatives).

If we want to increase recall, so that as many potentially malignant patients as possible are flagged for further examination and diagnosis, we can use a threshold smaller than 0.5, such as 0.3.

We can plot recall against precision for different thresholds; the shape of the curve depends on the data:

We would like a method that helps us choose this threshold. One method is to compute the F1 score, defined as:

$$F_1\ \text{Score} = 2\,\frac{PR}{P+R}$$

We choose the threshold that results in the highest F1 value.
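A minimal sketch of choosing the threshold by maximizing F1 on the cross-validation set, reusing the `precision_recall` helper from the earlier sketch (illustrative, not course code):

```python
def f1_score(precision, recall):
    """F1 = 2PR / (P + R), defined as 0 when both precision and recall are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def best_threshold(probabilities, labels, thresholds=(0.3, 0.5, 0.7, 0.9)):
    """Pick the prediction threshold with the highest F1 on the cross-validation set."""
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        predictions = [1 if p >= t else 0 for p in probabilities]
        p, r = precision_recall(predictions, labels)  # helper from the sketch in 11.3
        f1 = f1_score(p, r)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```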

11.5 Data for Machine Learning

Reference video: 11 - 5 - Data For Machine Learning (11 min).mkv

In the previous video we discussed evaluation metrics. In this video I want to switch gears a bit and discuss another important aspect of machine learning system design: how much data to use for training. In some earlier videos I warned against blindly starting out by spending a lot of time collecting huge amounts of data, because data only sometimes helps. But it turns out that under certain conditions, which I'll describe in this video, getting a lot of data and training a certain type of learning algorithm on it can be a very effective way to obtain a high-performance learning algorithm, namely when those conditions hold for your problem and you are able to get a lot of data. So in this video let's talk about that.

Many years ago, two researchers I know, Michele Banko and Eric Brill, did an interesting study in which they used machine learning algorithms to distinguish between commonly confused words. They tried many different algorithms and found that all of them performed well when the amount of data was very large.

For example, in a sentence like "For breakfast I ate __ eggs" (to, two, too), "I ate two eggs for breakfast" is an instance of a confusable-word problem. They treated problems like this as supervised learning problems and tried to classify which word is appropriate at a particular position in an English sentence. They used several different learning algorithms, all of which were considered leading when the study was done in 2001: a variant of logistic regression called the perceptron; the Winnow algorithm, which is similar to regression in some respects but differs in others, and was used more in the past than it is now; a memory-based learning algorithm, also less used now; and a naive algorithm. The details of these specific algorithms are not that important; what we want to explore is when we should prefer getting more data over modifying the algorithm. What they did was vary the size of the training set and train each of these algorithms on training sets of different sizes, and this is what they found.

The trends are very clear. First, most of the algorithms have similar performance. Second, as the training set grows (the horizontal axis shows the training-set size in millions, from 0.1 million up to 1,000 million, i.e., a billion examples), the performance of all the algorithms improves accordingly.

In fact, if you pick any one of the algorithms, you may be picking an "inferior" one, yet if you give that inferior algorithm more data, it will quite likely end up outperforming a "superior" algorithm trained on less data. Because this original study was so influential, a series of later studies have shown similar results. These results indicate that many different learning algorithms sometimes perform very similarly, depending on details, and what really improves performance is being able to give the algorithm a lot of training data. Results like this have led to a common saying in machine learning: "It's not who has the best algorithm that wins, it's who has the most data."

So when is this statement true and when is it false? Because if we have a learning algorithm, and if that statement is true, then getting a lot of data is often the best way to ensure that we have a high-performance algorithm, rather than debating what algorithm should be used.

So when does having a very large training set help? Suppose that in our machine learning problem the features x contain enough information to predict y accurately. For example, for the confusable words two/to/too, suppose the features x capture the words surrounding the blank to be filled. Then given "For breakfast I ate __ eggs", the surrounding words provide plenty of information that the word to fill in is "two", not "to" or "too". In other words, the features capture enough of the surrounding context to determine the label y, i.e. which of the three confusable words should fill the blank.

So let's look at when large amounts of data help. Assume the features have enough information to predict y, and suppose we use a learning algorithm with many parameters, such as logistic regression or linear regression with many features, or a neural network with many hidden units. These are powerful learning algorithms with many parameters that can fit very complex functions, so I'll think of them as low-bias algorithms. If we run such an algorithm on the data, it will most likely fit the training set well, so the training error will be low.

Now suppose we use a very, very large training set. In that case, even though the model has many parameters, if the training set is as large as or much larger than the number of parameters, the algorithm is unlikely to overfit; that is, the training error will hopefully be close to the test error.

Another way to think about this is that, to have a high-performance learning algorithm, we want it to have neither high bias nor high variance.

We address the bias problem by using a learning algorithm with many parameters, which gives us a low-bias algorithm; and we address the variance problem by using a very large training set, which keeps the variance low. Putting these two together, we end up with a learning algorithm with low bias and low variance, which will also perform well on the test set. Fundamentally, these are the two key conditions: the features carry enough information and we have a powerful class of functions, which is why we can achieve low error; and we have a very large training set, which guarantees low variance. So these are the conditions under which, if you have a large amount of data and you train a learning algorithm with many parameters, you are likely to obtain a high-performance learning algorithm.

I find two key tests useful. First, given the features x, could a human expert confidently predict y? This shows that y can in fact be predicted accurately from x. Second, can we actually obtain a huge training set and train a learning algorithm with many parameters on it? If you can do both, then more often than not you will end up with a learning algorithm that performs well.


Origin blog.csdn.net/m0_63748493/article/details/132088973