Andrew Ng Machine Learning (XI): System Design

1. Building a Spam Classifier
Consider an example of spam classification: suppose you want to build a spam classifier, and you have a labeled training set in which spam is labeled y = 1 and non-spam is labeled y = 0.

How do we use supervised learning to construct a classifier that distinguishes spam from non-spam?

To apply supervised learning, we must first decide how to represent an email as a feature vector x. Given a training set of feature vectors x and labels y, we can then train a classifier, for example using logistic regression.

One way to choose features is to pick a set of words whose presence suggests spam or non-spam. For example, an email containing words like "deal", "buy", or "discount" is quite likely to be spam, while an email containing my name, "Andrew", is less likely to be spam. For some reason I think the word "now" suggests an email may not be spam, because I often receive genuinely urgent messages; of course there are other useful words as well.

We might select a hundred such words. Given an email, we can represent it as a feature vector as follows:

Having chosen 100 words that indicate whether an email is spam, the feature vector x has dimension 100, and each feature xj is 1 if word j appears in the email and 0 otherwise.

Although I have described this as manually choosing 100 words, in practice the most common approach is to traverse the entire training set (which contains a lot of spam) and pick the n most frequently occurring words, where n is typically between 10,000 and 50,000. Rather than being hand-picked, these words then form the features, and you can use them to classify spam.
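This feature construction can be sketched in a few lines. The vocabulary and email below are made up for illustration; a real system would extract the vocabulary from the training set as described above.

```python
# A minimal sketch of turning an email into a binary feature vector,
# given a fixed vocabulary (these five words are hypothetical examples).
def email_to_features(email_text, vocabulary):
    """Return x where x[j] = 1 if vocabulary[j] appears in the email, else 0."""
    words = set(email_text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

vocab = ["andrew", "buy", "deal", "discount", "now"]
x = email_to_features("Huge DISCOUNT - buy now", vocab)
# x -> [0, 1, 0, 1, 1]
```

A real vocabulary would have tens of thousands of entries, but x is built the same way: one binary entry per vocabulary word.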

If you are building a spam classifier, you will face this question: of all the possible improvements, which should you pursue to make your classifier more accurate? The intuitive answer is to collect lots of data, and many people really do exactly that, believing that more data always makes the algorithm perform better.

In spam classification there are so-called "honeypot" projects: create fake email addresses, deliberately leak them to spammers, and collect the large volume of spam they send. In this way you can obtain a lot of spam to train the learning algorithm. However, as we saw in earlier lessons, more data may help, or it may not.

For most machine learning problems there are many possible ways to improve performance. For spam, you might think of more sophisticated features, such as the email routing information that appears in the message headers. Spammers always try to obscure the origin of their mail: they use fake headers or route messages through unusual servers along unusual paths. That information still ends up in the header, so we could construct more complex features from the header's routing information to help decide whether a message is spam.

You might also look for more complex features in the message body. For instance, should "discount" and "discounts" be treated as the same word? Should "deal" and "dealer" be considered equivalent? What about upper versus lower case, or features built from punctuation, since spam tends to contain many exclamation marks? None of this is certain.

Similarly, we might build a more sophisticated algorithm to detect and correct deliberate misspellings such as "m0rtgage", "med1cine", and "w4tches". Spammers do this to slip past filters: a simple classifier of the kind described earlier will not treat "w4tches" and "watches" as the same word, so it can hardly recognize such intentionally misspelled spam. When working on a machine learning problem, you can always brainstorm a list of ideas like these.

Incidentally, I once spent quite a lot of time studying spam classification. Even though I understand the problem and know something about this area, I would still find it hard to tell you which of these four approaches you should pursue. Frankly, the most common situation is that a team just picks one approach more or less at random, and that choice is often not the most productive one.

In fact, if you even brainstorm a list of different ways to improve accuracy, you are probably already ahead of most people. Sadly, most teams never list the possible options; someone simply wakes up one morning and, for whatever reason, decides "let's collect lots of data with a honeypot project", or picks some other approach at random and works on it for six months. I think we can do better.

In the next section, on error analysis, I will describe a more systematic way to choose among a set of different approaches.
2. Error Analysis
If you are starting work on a machine learning problem or building a machine learning application, the best practice is not to begin with a very complex system full of elaborate features, but to build a simple algorithm that you can implement quickly. Whenever I work on a machine learning problem, I spend at most a day getting a quick first implementation running, even if it performs poorly; a fast, imperfect result beats having no result from a complex system. Then I evaluate it on cross-validation data.

Once that is done, you can plot learning curves and examine the errors to determine whether your algorithm suffers from high bias or high variance, and then decide whether collecting more training data or adding more features is likely to help.

This is a good approach when you are new to a machine learning problem, because you cannot know in advance whether you need complex features, more data, or something else. Knowing what to do ahead of time is very hard, because you lack evidence and lack learning curves, so it is difficult to tell where your time would best improve the algorithm's performance. Once you have a simple, even imperfect, implementation running, you can make these further choices by plotting learning curves.

The philosophy is to let evidence, rather than gut feeling, guide decisions about how to allocate your time when optimizing the algorithm; intuition is generally wrong. Besides plotting learning curves, another very useful tool is error analysis.

When building a spam classifier, I would look at my cross-validation set and manually examine the emails the algorithm misclassified, both spam and non-spam. Often you will discover certain systematic patterns: particular types of messages that are consistently misclassified. This process can inspire you to construct new features, or reveal the current weaknesses of the system and suggest how to improve it.

Concretely, suppose you are building a spam classifier, your cross-validation set has 500 examples, and the algorithm has a high error rate: it misclassifies 100 of them. I would then manually examine those 100 errors and categorize them by hand, asking: what type of message is each one, and what features might have helped classify it correctly? Among the 100 misclassified emails I might find that 12 are selling pharmaceuticals, 4 are selling counterfeit goods such as fake watches, 53 are phishing emails trying to trick you into revealing your password, and the remaining 31 are other types. Counting how many emails fall into each category, I might discover that the algorithm consistently performs badly on phishing emails. That tells me I should spend more time on that category and see whether I can construct better features that classify it correctly; at the same time, I would note which features might help the algorithm classify those messages.
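The tallying step above is simple enough to sketch directly; the category names and counts come from the example in the text, and the hand-labeling itself is of course manual.

```python
from collections import Counter

# Sketch of the manual error-analysis tally: each of the 100 misclassified
# emails has been labeled by hand; we then count the categories.
labels = (["pharma"] * 12 + ["replica"] * 4
          + ["phishing"] * 53 + ["other"] * 31)
tally = Counter(labels)
worst = tally.most_common(1)[0]
# worst -> ("phishing", 53): spend effort on features that catch phishing mail
```

The point is not the code but the discipline: counting errors per category turns "the classifier seems bad" into "53 of 100 errors are phishing", which tells you where to work next.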

Next, suppose the candidate improvements are: detecting deliberate misspellings, examining unusual message routing, and exploiting spam's distinctive punctuation, such as many exclamation marks. Again I would go through the misclassified emails by hand and count: say 5 involve deliberate misspellings, 16 have unusual routing, 32 have unusual punctuation, plus some other types. If this is what you find on your cross-validation set, then deliberate misspellings occur rarely, and it may not be worth your time to write an algorithm that detects them; but if a lot of the spam has unusual punctuation, that is a strong signal that you should spend time building more sophisticated punctuation-based features. This kind of manual error analysis, inspecting the algorithm's mistakes, often points you toward the most effective improvements, which is why I always recommend implementing a quick, even imperfect, algorithm first. What we really want to find is the type of message the algorithm finds hardest to classify; different learning algorithms generally struggle with the same examples, so even a quick, imperfect implementation lets you locate the errors and the cases the algorithm handles badly, and concentrate on the real problem.

Finally, when building a machine learning algorithm, another useful practice is to make sure you have a single numerical way to evaluate it.

Consider this example: suppose we are deciding whether words like "discount", "discounts", "discounter", and "discounting" should be treated as equivalent. One way is to check only the first few letters of each word: words that begin with the same letters probably have the same meaning.

In natural language processing this is called stemming, and it is done with software; if you want to try it yourself, search online for the "Porter Stemmer", a fairly good stemming tool. It treats "discount", "discounts", and so on as the same word. Stemming software that examines only the first few letters of a word can be quite useful.

But it can also cause problems: for example, such software would treat "universe" and "university" as the same word, because the two words begin with the same letters. So when deciding whether to use stemming in your classifier, it is hard to tell in advance, and error analysis alone does not really help you decide whether stemming is a good idea.
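Both effects can be seen with a toy stemmer. This deliberately naive "first few letters" truncation is only an illustration of the idea in the text; a real system would use the Porter stemmer rather than this.

```python
# A deliberately naive stemmer that keeps only the first few letters, to show
# why stemming can both help ("discount"/"discounts" merge correctly) and
# hurt ("universe"/"university" collide incorrectly).
def naive_stem(word, prefix_len=6):
    return word[:prefix_len]

assert naive_stem("discount") == naive_stem("discounts")   # helpful merge
assert naive_stem("universe") == naive_stem("university")  # harmful collision
```

Whether the helpful merges outweigh the harmful collisions is exactly what a numerical evaluation, described next, settles.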

Instead, the best way to discover whether stemming helps your classifier is to try it quickly and see how it performs, and for that a numerical evaluation of your algorithm is very useful. Concretely, compare the cross-validation error with and without stemming. If your algorithm gets a 5% classification error without stemming and a 3% error with stemming, that substantial reduction tells you stemming is a good idea for this particular problem. With a single-number evaluation such as the cross-validation error rate (which, as we will see in later courses, sometimes needs a bit of care in this setting), you can make decisions like whether to use stemming much more quickly.

If every time you try a new idea you have to manually inspect examples and judge whether performance got better or worse, it is hard to decide whether to, say, use stemming or be case-sensitive. With a quantitative evaluation, you simply look at one number and see whether the error got larger or smaller.

This lets you try out new ideas much faster: it tells you almost immediately whether an idea improved the algorithm's performance or made it worse, which greatly speeds up iteration. I strongly recommend doing error analysis on the cross-validation set rather than on the test set. Some people do error analysis on the test set, but even mathematically speaking that is not appropriate, so do your error analysis on the cross-validation set.

In conclusion:

When you start on a new machine learning problem, I always recommend implementing a simple, fast, even imperfect algorithm first. Almost nobody does this; instead, people often spend a great deal of time building the elaborate system they believe the problem requires. Do not worry that your algorithm is too simple or too imperfect: get it running as quickly as possible. Once you have an initial implementation, it becomes a very powerful tool for deciding what to do next, because error analysis on its mistakes shows where the algorithm goes wrong and which fixes are most promising. And with a fast, imperfect algorithm plus a numerical evaluation, you can try out new ideas and quickly find out whether they improve performance, so you can decide quickly what to discard from the algorithm and what to adopt.

3. Handling Skewed Classes
With evaluation and error metrics in hand, there is one important caveat: use a suitable error metric, because the choice of metric can sometimes have a very subtle effect on your learning algorithm. The key issue is the problem of skewed classes:

Consider a cancer classification problem. We have features describing medical patients and want to predict whether each one has cancer, with y = 1 meaning the patient has cancer and y = 0 meaning they do not.

Suppose we train a logistic regression model, test it on a test set, and find that it has only 1% error, so its diagnoses are 99% correct. That looks like a very good result, until we notice that only 0.5% of patients in the test set actually have cancer. In that light, a 1% error rate no longer looks so good.

For a concrete example, one line of code that always predicts y = 0, i.e. that no one has cancer, achieves only 0.5% error, which is even better than the 1% error we obtained before. When the number of positive examples is very small compared with the number of negative examples, we say the classes are skewed: one class has far more examples than the other, and an algorithm that always predicts y = 0 (or always predicts y = 1) may appear to perform very well.
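The pitfall is easy to reproduce numerically. The toy test set below mirrors the 0.5%-positive example from the text.

```python
# Skewed-class pitfall: on a test set where only 0.5% of patients have
# cancer, the trivial "always predict y = 0" classifier reaches 99.5%
# accuracy without learning anything at all.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1] * 5 + [0] * 995          # 0.5% positives, as in the text
always_zero = [0] * 1000              # the one-line "classifier"
print(accuracy(y_true, always_zero))  # 0.995, i.e. only 0.5% error
```

A 99.5% accuracy here tells us nothing about the model's quality, which is exactly why a different metric is needed.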

Using classification error or classification accuracy as the evaluation metric can therefore cause the following problem:

Say your algorithm has 99.2% accuracy, i.e. only 0.8% error. Suppose you change the algorithm and now get 99.5% accuracy, only 0.5% error. Is this really an improvement? The benefit of a single real-number evaluation metric is that it helps us decide quickly whether a change to the algorithm is an improvement. But did going from 99.2% to 99.5% accuracy reflect a genuine improvement, or did we merely replace the code with something like "always predict y = 0"? With skewed classes, classification accuracy is not a good metric: you can get very high accuracy and a very low error rate without knowing whether the classification model actually improved, because always predicting y = 0 is not a good classifier, yet it drives the error down to 0.5%.

When we face skewed classes, we want a different error metric: precision and recall. Suppose we are evaluating a binary classification model on a test set; for each example, the learning algorithm makes a prediction. If the actual class is 1 and the predicted class is 1, we call that example a true positive. If the prediction is 0 and the actual class is indeed 0, it is a true negative. If the prediction is 1 but the actual class is 0, it is a false positive. If the prediction is 0 but the actual class is 1, it is a false negative. This gives us a 2x2 table of predicted versus actual classes, and with it another way to evaluate the algorithm's performance.

We compute two numbers. The first is called precision: of all the patients we predicted to have cancer, what fraction actually have cancer?

Precision = true positives / predicted positives = true positives / (true positives + false positives)

The higher the precision, the better.

The second number is called recall: of all the patients in the dataset who actually have cancer, what fraction did we correctly predict to have cancer?

Recall = true positives / actual positives = true positives / (true positives + false negatives)

Similarly, the higher the recall, the better.

By computing precision and recall, we get a much better sense of how good the classification model really is.
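The two formulas above translate directly into code. The counts here are made up for illustration; in practice they come from the 2x2 table on your cross-validation set.

```python
# Precision and recall from the 2x2 table of true/false positives/negatives,
# matching the formulas above (tp, fp, fn counts are hypothetical).
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

tp, fp, fn = 80, 20, 40
print(precision(tp, fp))  # 0.8: of all predicted positives, 80% are real
print(recall(tp, fn))     # ~0.667: of all actual positives, 2/3 are caught
```

Note that the "always predict y = 0" classifier from the previous section has tp = 0, so both its precision and its recall are 0 (or undefined), exposing it immediately.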

4. Trading Off Precision and Recall
In the cancer classification example, suppose we want to predict that a patient has cancer only when we are very confident. One way to do this is to modify the algorithm: instead of using a threshold of 0.5, we predict y = 1 only when h(x) is greater than or equal to 0.7. The model then has higher precision and lower recall.

This is because we now predict y = 1 for only a small fraction of patients. Exaggerating further, suppose we set the threshold to 0.9, predicting y = 1 only when we are at least 90% sure the patient has cancer. A very large fraction of those patients really do have cancer, so this is a high-precision model; but the recall becomes low, because we want to detect all patients with cancer and this model misses many of them.

Now consider a different goal: suppose we want to avoid missing patients who have cancer, that is, to avoid false negatives. Specifically, if a patient actually has cancer but we fail to tell them so, the consequences could be serious. In this case we set a lower threshold, such as 0.3, and the model has higher recall but lower precision.

In general, therefore, for most classifiers you have to trade precision against recall:

As you vary the threshold, you trace out a curve that trades off precision against recall. One end of the curve corresponds to a very high threshold, perhaps 0.99: we predict y = 1 only when we are at least 99% certain, giving high precision and low recall. The other end corresponds to a very low threshold, say 0.01, where we predict y = 1 almost without hesitation; then precision is very low but recall is high.
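The trade-off can be demonstrated by sweeping the threshold over a handful of predictions. The probability scores and labels below are hypothetical.

```python
# Sketch of the precision/recall trade-off: vary the prediction threshold
# over (probability score, true label) pairs (values are made up).
def pr_at_threshold(scores, labels, threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    prec = tp / (tp + fp) if tp + fp else 1.0  # convention when no positives predicted
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

scores = [0.95, 0.85, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
for t in (0.3, 0.5, 0.9):
    print(t, pr_at_threshold(scores, labels, t))
```

Even on this tiny example, raising the threshold from 0.3 to 0.9 moves the model from high recall toward high precision, which is the curve described above.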

By varying the threshold you can, if you wish, plot this curve for your classifier and see the whole range of precision and recall values you can obtain. Incidentally, the precision-recall curve can take many different shapes, depending on the specific classifier. This raises another interesting question: is there a way to choose the threshold automatically, or more broadly, if we have different algorithms or different ideas, how do we compare their precision and recall?

Concretely, suppose we have three different learning algorithms, or the same algorithm with three different thresholds. How do we decide which is best? One thing we discussed earlier is the importance of a single evaluation metric.

The idea is to summarize how well your classifier performs in one specific number. With precision and recall we have two numbers to judge by, so we repeatedly face questions like: when comparing algorithm 1 and algorithm 2, is precision 0.5 with recall 0.4 better, or precision 0.7 with recall 0.1? If every time you design a new algorithm you have to sit down and ponder whether 0.5/0.4 beats 0.7/0.1, that slows down your decisions about which changes are worth incorporating into the algorithm. By contrast, a single evaluation number that tells us whether algorithm 1 or algorithm 2 is better helps us decide faster, and helps us evaluate more quickly which changes to adopt. So how do we obtain such a metric?

One thing you might try is the average of precision and recall: writing P for precision and R for recall, compute their average and pick the model with the highest value. But this is not a good solution. As in our earlier example, a model that always predicts y = 1 gets very high recall and very low precision; conversely, a model that almost never predicts y = 1, corresponding to a very high threshold, ends up with very high precision and very low recall.

These two extremes, a very high threshold and a very low threshold, are both bad models, yet either one can score well on the simple average.

Instead, there is a different way of combining precision and recall, called the F score, with the formula F1 = 2PR / (P + R). Suppose algorithm 1 has the highest F score, algorithm 2 the second highest, and algorithm 3 the lowest; judging by the F score, we would choose algorithm 1 among these. The F score is also called the F1 score, usually written "F1 score" but generally just called the F score; it can be thought of as a kind of average of precision and recall, but one that gives a low score whenever either precision or recall is low.

Note that the numerator of the F score is the product of precision and recall, so if either precision or recall equals 0, the F score equals 0. It combines the two in such a way that a large F score requires both precision and recall to be large.

There are many possible formulas for combining precision and recall, and the F score is just one of them, used in machine learning largely for historical reasons and out of habit. The terms "F score" and "F1 score" have no special significance, so do not worry about why it is called that; what matters is that it gives you the single effective number you need.

Because the F score is very low whenever either precision or recall equals 0, a high F score requires both precision and recall to be reasonably close to 1. Specifically, if P = 0 or R = 0 then F = 0, while for a perfect classifier with precision 1 and recall 1, F = 2 * (1 * 1) / (1 + 1) = 1. In practice, classifiers most often score somewhere between 0 and 1.
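The formula and its boundary behavior can be checked in a few lines:

```python
# The F1 score combines precision P and recall R into one number:
# F1 = 2PR / (P + R). It is 0 if either P or R is 0, and 1 only
# when both are 1.
def f1(p, r):
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

print(f1(1.0, 1.0))   # 1.0   (perfect precision and recall)
print(f1(0.5, 0.4))   # ~0.444
print(f1(0.02, 1.0))  # ~0.039: "always predict y = 1" scores near zero
```

The last line shows why F1 beats the simple average from the previous paragraphs: the degenerate high-recall model that averages to over 0.5 gets an F1 near 0.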

In this section we discussed how to trade off precision and recall, and how changing the threshold controls whether we predict y = 1 or y = 0, for example requiring 70% or 90% confidence before predicting y = 1. We also introduced the F score, which weighs precision against recall and gives you a single evaluation metric. If your goal is to choose the threshold automatically, a better approach is to evaluate a range of different thresholds on the cross-validation set and pick the one with the highest F score.

5. Using Large Data Sets
It turns out that, under certain conditions, obtaining a large amount of data and training a certain type of learning algorithm on it can be an effective way to get a well-performing learning algorithm. When those conditions hold for your problem and you can get a lot of data, this can be a very good way to obtain a high-performance learning algorithm.

Let me tell a story. Many years ago, two researchers I know, Michele Banko and Eric Brill, carried out an interesting study. They were interested in how different learning algorithms perform as the size of the training data set varies, and they compared the algorithms on the problem of classifying between confusable words.

For example, in the sentence "I ate ___ eggs for breakfast", should the blank be "to", "two", or "too"? Here the answer is "two": I ate two eggs for breakfast. This is one example of a set of confusable words, and there are other such sets. They framed this as a supervised learning problem: classify which English word is appropriate at a specific position in a sentence.

They chose four classification algorithms (the specific algorithms are not important), varied the size of the training data set, and trained each algorithm on training sets of different sizes. Their results show two very clear trends: first, most of the algorithms have similar performance; second, as the training set grows, the performance of all the algorithms rises correspondingly.

In fact, if you take an "inferior" algorithm and give it more data, it may well outperform a "superior" algorithm with less data. Results like these gave rise to a common saying in machine learning: "It's not who has the best algorithm that wins, it's who has the most data." So when is this statement true, and when is it false?

If you have a lot of data and you train a learning algorithm with many parameters on it, that is likely to be a good way to obtain a high-performance learning algorithm.

I think the key test, which I often ask myself, has two parts. First, could a human expert, looking at the features x, confidently predict the value of y? This establishes that y can in principle be predicted accurately from the features x. Second, can we actually obtain a large training set and train a learning algorithm with many parameters on it? If you can do both, then more often than not you will end up with a very well-performing learning algorithm.
Reference: Andrew Ng, Machine Learning, Lecture 10: Machine Learning System Design



Origin blog.csdn.net/linjpg/article/details/104176061