Andrew Ng Study Notes (11): Machine Learning System Design

11. Machine Learning System Design

11.0 Prioritizing What to Work On: A Spam Classification Example

We label spam as 1 and non-spam as 0.

Given a set of emails labeled as spam or non-spam, how do we train a spam classifier? This is clearly a supervised learning problem. Suppose we choose logistic regression to train the classifier; we must first select appropriate features. Here we define:

  • x = features of the email;
  • y = spam (1) or not spam (0)

We can select 100 words that are typical of spam or non-spam, such as deal, buy, discount, andrew, now, etc., sorted in alphabetical order. For each labeled training email, if word j of the 100-word vocabulary appears in the email, we set xj = 1 in the feature vector x; otherwise xj = 0. In this way each training email is represented by a feature vector x.

Note that in practice we do not manually pick 100 typical words; instead we take the n words that occur most frequently in the training set, where n is typically between 10,000 and 50,000.
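As an illustration, here is a minimal sketch of this feature construction in Python; the tokenization, helper names, and example emails are my own and are not part of the course material:

```python
import re
from collections import Counter

def build_vocabulary(emails, n_words=100):
    """Collect the n most frequent words across the training emails."""
    counts = Counter()
    for email in emails:
        counts.update(re.findall(r"[a-z]+", email.lower()))
    return [word for word, _ in counts.most_common(n_words)]

def email_to_features(email, vocabulary):
    """x_j = 1 if word j of the vocabulary appears in the email, else 0."""
    words = set(re.findall(r"[a-z]+", email.lower()))
    return [1 if word in words else 0 for word in vocabulary]

emails = [
    "Buy now! Huge discount on watches, limited deal.",
    "Hi Andrew, the meeting moved to 3pm, see you there.",
]
vocab = build_vocabulary(emails, n_words=100)
x = email_to_features(emails[0], vocab)  # binary feature vector, one entry per vocabulary word
```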

So, how do we efficiently train a spam classifier so that it achieves high accuracy and a low error rate?

  • First, it is natural to consider collecting more data, for example through a "honeypot" project that specializes in collecting spammer IP addresses and spam content;
  • But as the previous chapter showed, more data does not always help, so we can also consider designing other, more sophisticated features, such as features based on the email's routing information, which appears in the header at the top of the message;
  • We can also design features based on the message body: should "discount" and "discounts" be treated as the same word? How should "deal" and "Dealer" be handled? And should punctuation be used as a feature?
  • Finally, we can consider sophisticated algorithms to detect deliberate misspellings (spammers intentionally misspell words such as m0rtgage, med1cine, w4tches to evade spam filters).

11.1 What to do first

Take a spam classifier algorithm as an example for discussion.

In order to solve this problem, the first decision we have to make is how to select and represent the feature vector x. We can choose a list of the 100 words that appear most frequently in spam emails, and build our feature vector according to whether each of these words appears in the email (1 if it appears, 0 otherwise), giving a vector of size 100×1.

In order to build this classifier algorithm, we can do many things, such as:

  1. Collect more data, so that we have more spam and non-spam samples
  2. Develop a series of complex features based on mail routing information
  3. Develop a series of sophisticated features based on the email body, including how to handle word truncation/stemming
  4. Develop complex algorithms to detect deliberate spelling errors (w4tch for watch)
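As a toy illustration of option 4 (and only that; the course does not prescribe a particular algorithm), a simple character mapping can undo common digit-for-letter substitutions:

```python
# Naive normalization of deliberate digit-for-letter tricks (m0rtgage, med1cine, w4tches).
SUBSTITUTIONS = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s"})

def normalize_token(token):
    """Undo common digit-for-letter substitutions before feature extraction."""
    return token.lower().translate(SUBSTITUTIONS)

print(normalize_token("w4tch"))     # -> watch
print(normalize_token("m0rtgage"))  # -> mortgage
print(normalize_token("med1cine"))  # -> medicine
```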

Among these options, it is very hard to decide which one is worth spending time and effort on; making a deliberate, informed choice is better than simply following a gut feeling.

We will discuss error analysis shortly, which gives a more systematic way to pick among a set of different options. With it, the approach you end up spending days, weeks, or even months researching in depth is more likely to be a genuinely good one.

11.2 Error Analysis

The recommended method for constructing a learning algorithm is:

  1. Start with a simple algorithm that can be implemented quickly; implement it and test it on the cross-validation data
  2. Plot learning curves to decide whether to add more data, add more features, or try something else
  3. Perform error analysis: manually examine the cross-validation examples that our algorithm misclassifies, and look for any systematic pattern in them

Suppose the cross-validation set contains 500 email samples and the algorithm misclassifies 100 of them. We then manually examine these 100 bad cases and categorize them by:

  • (i) What type of email is it?
  • (ii) What clues or features might have helped the algorithm classify it correctly?
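This manual categorization is mostly bookkeeping; here is a small sketch of tallying the errors by category (the category labels are made up for illustration):

```python
from collections import Counter

# Hand-assigned category for each misclassified cross-validation email
# (illustrative labels, not from the course).
error_categories = ["pharma", "phishing", "pharma", "replica", "phishing", "pharma"]

for category, count in Counter(error_categories).most_common():
    print(f"{category}: {count} of {len(error_categories)} errors")
```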

The importance of numerical evaluation:

After analyzing the bad cases, we may consider methods such as:

  • Should discount/discounts/discounted/discounting be treated as the same word?
  • Should we use a stemming toolkit, such as the Porter stemmer, to reduce words to their stems?

Error analysis by itself cannot tell us whether these methods will work; it only suggests directions to try and serves as a reference. Only by actually trying them can we see whether they help.
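For instance, stemming itself takes only a few lines. The sketch below uses NLTK's PorterStemmer, assuming the nltk package is installed; this is just one possible toolkit, not the only choice:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["discount", "discounts", "discounted", "discounting"]:
    print(word, "->", stemmer.stem(word))  # all four should reduce to the same stem
```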

So we need a numerical evaluation of the algorithm (such as the cross-validation error) to see how it performs with and without a given change, for example:

  • Without stemming: 5% error rate vs. with stemming: 3% error rate
  • Distinguishing upper and lower case (Mom / mom): 3.2% error rate
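The numerical evaluation itself is simple: the cross-validation error is just the fraction of misclassified examples. A minimal sketch comparing two variants, with made-up labels and predictions:

```python
import numpy as np

def cv_error(y_pred, y_true):
    """Fraction of cross-validation examples that are misclassified."""
    return float(np.mean(np.asarray(y_pred) != np.asarray(y_true)))

# Made-up labels and predictions for two variants of the classifier:
y_cv = np.array([1, 0, 1, 1, 0, 0, 1, 0])
pred_without_stemming = np.array([1, 0, 0, 1, 0, 1, 1, 0])
pred_with_stemming    = np.array([1, 0, 1, 1, 0, 1, 1, 0])
print(cv_error(pred_without_stemming, y_cv))  # 0.25
print(cv_error(pred_with_stemming, y_cv))     # 0.125
```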

11.3 Error metrics for skewed classes

What are skewed classes?

Take cancer prediction (classification) as an example: we train a logistic regression model where y = 1 if the patient has cancer and y = 0 otherwise.
On the test set, we find that the error rate of this model is only 1% (it is correct 99% of the time), which looks like a very good result.
But in fact only 0.5% of the patients actually have cancer. If we use no learning algorithm at all and simply predict y = 0 (no cancer) for every patient in the test set, the error rate of this trivial predictor is only 0.5%, which is even better than the logistic regression model we carefully trained. This is an example of skewed classes, and for such problems it is risky to look only at the error rate.
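A quick numerical check of this point, using synthetic labels rather than real patient data:

```python
import numpy as np

y = np.zeros(1000, dtype=int)
y[:5] = 1                           # 0.5% of the patients actually have cancer
always_negative = np.zeros_like(y)  # "predict y = 0 for everyone"
print(np.mean(always_negative != y))  # 0.005, i.e. a 0.5% error rate
```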

Now consider a standard pair of evaluation metrics: precision and recall (Precision/Recall).

First, define the four prediction outcomes as follows:

True Positive (TP): a positive example that the model predicts as positive.

True Negative (TN): a negative example that the model predicts as negative.

False Positive (FP): a negative example that the model incorrectly predicts as positive.

False Negative (FN): a positive example that the model incorrectly predicts as negative.

                          Predicted value
                          Positive    Negative
Actual value  Positive       TP          FN
              Negative       FP          TN

So for the example of cancer prediction we can define:

Precision: the number of patients who actually have cancer among those we predicted to have cancer (true positives), divided by the total number of patients we predicted to have cancer:

\frac{True\ Positives}{Predicted\ Positives} = \frac{True\ Positives}{True\ Positives + False\ Positives}

Precision = TP/(TP+FP)

Recall: the number of patients we correctly predicted to have cancer (true positives), divided by the number of patients who actually have cancer:

\frac{True\ Positives}{Actual\ Positives} = \frac{True\ Positives}{True\ Positives + False\ Negatives}

Recall = TP/(TP+FN)
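Putting the two definitions into code, a minimal sketch (the function name and the counts are purely illustrative):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Illustrative counts only:
p, r = precision_recall(tp=20, fp=10, fn=5)
print(p, r)  # 0.666..., 0.8
```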

11.4 Trading off precision and recall

Assume our classifier is a logistic regression model whose output hθ(x) lies between 0 and 1. A common way to decide between positive and negative predictions is to set a threshold, such as 0.5:

  • If hθ(x) ≥ 0.5, predict 1, a positive example;
  • If hθ(x) < 0.5, predict 0, a negative example;

At this point, we can calculate the precision and recall of this classifier:

  • Precision = TP/(TP+FP). Of all the patients we predicted to have malignant tumors, the fraction who actually have malignant tumors; the higher the better.
  • Recall = TP/(TP+FN). Of all the patients who actually have malignant tumors, the fraction we successfully predicted; the higher the better.

Different thresholds lead to different precision and recall, so how do we trade off these two values?

For this example of cancer prediction:

Suppose we only want to predict that a patient has cancer (y = 1) when we are very confident. In that case we set the threshold high, which gives higher precision but lower recall.

Suppose instead we want to avoid missing too many actual cancer cases (i.e., avoid false negatives, where a patient who really has cancer is classified as not having it). Then we set the threshold low, which gives higher recall but lower precision.

These trade-offs can be summarized by a precision-recall curve, or PR curve.
A natural idea is to compare classifiers by the mean of precision and recall, but this is problematic. For example, suppose three threshold choices give Precision/Recall pairs whose means are 0.45, 0.4, and 0.51; the last looks best, but is it really? If we set the threshold very low, even to 0, we predict y = 1 for the entire test set, so the recall is 1.0 even though the precision is tiny. We would not need any machine learning algorithm at all; simply always predicting y = 1 would be enough. So the mean of precision and recall is not a good criterion.

We would like a single number that helps us choose the threshold. One option is the F score, also called the F1 score:

The calculation formula is: F_1\ Score = 2\frac{PR}{P+R}

We choose the threshold that makes the F1 value the highest.

The F1 score is a good trade-off between precision and recall, and it also handles the two extreme cases well: if either precision or recall is 0, F1 is 0, and F1 is 1 only when both are 1.
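A sketch of choosing the threshold by F1: sweep candidate thresholds, compute precision and recall on the cross-validation set at each one, and keep the threshold with the best F1. The predicted probabilities and labels below are made up for illustration:

```python
import numpy as np

def f1(precision, recall):
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

# Made-up predicted probabilities and true labels for a small cross-validation set:
probs  = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
y_true = np.array([1,   1,   0,   1,   0,   1,   0,   0])

best_threshold, best_f1 = None, -1.0
for threshold in np.arange(0.1, 1.0, 0.1):
    y_pred = (probs >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if f1(precision, recall) > best_f1:
        best_threshold, best_f1 = float(threshold), f1(precision, recall)

print(best_threshold, best_f1)  # threshold with the highest F1 on this toy set
```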

11.5 The importance of data for machine learning (Data for machine learning)

When designing a high-accuracy machine learning system, how much does the amount of data matter? In 2001, Banko and Brill ran an experiment on classifying confusable words, that is, choosing the right word for the context of a sentence, for example:
For breakfast I ate ___ eggs

Given {to, two, too}, choose a suitable word.
They used the following machine learning algorithms:

  • Perceptron (logistic regression)
  • Winnow
  • Memory-based
  • Naïve Bayes

They recorded the accuracy of each algorithm for training sets of different sizes and plotted the results (figure omitted: accuracy vs. training-set size for the four algorithms). The final conclusion was:

"It's not who has the best algorithm that wins. It's who has the most data."

Why does having a lot of data help?

Assume that the features x contain enough information to predict y accurately. In the confusable-word example above, the whole surrounding sentence is available as context, so this assumption holds.

Conversely, consider predicting housing prices when the only feature is the size of the house and there are no other features; can the prediction be accurate?

A simple test for this condition: given these features, could a human expert accurately predict y?

If a learning algorithm has many parameters (for example, logistic regression or linear regression with many features, or a neural network with many hidden units), its training-set error will be small, but it can easily overfit. If we additionally train it on a very large data set, it becomes hard to overfit, and the training-set error and test-set error will both be small and approximately equal. So large amounts of data really do matter for machine learning.
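A rough way to see this effect is to train the same model on increasingly large subsets of the data and compare training and test error. The sketch below uses synthetic data and scikit-learn's LogisticRegression purely as a convenient stand-in classifier, not anything specified in the course:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, of which only 2 actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=5000) > 0).astype(int)
X_test, y_test = X[4000:], y[4000:]

for m in [50, 200, 1000, 4000]:
    clf = LogisticRegression(max_iter=1000).fit(X[:m], y[:m])
    train_err = 1 - clf.score(X[:m], y[:m])
    test_err = 1 - clf.score(X_test, y_test)
    print(f"m={m:5d}  train error {train_err:.3f}  test error {test_err:.3f}")
```

As the training-set size m grows, the gap between training error and test error shrinks, which is the behavior described above.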


Source: blog.csdn.net/qq_44082148/article/details/104347864