Machine Learning System Design

Foreword

       This chapter follows directly from the previous one. There we discussed what to do when, after building a learning model, we find that its predictions on new data have large error, and how to improve it. We also covered diagnosing whether a problem is one of high bias or high variance, and how to use learning curves to judge whether an improvement method is effective. In this chapter I will talk about what work should be done before building a learning system.

      Finally, if I have misunderstood anything, I hope you will point it out and correct me. Thank you!

Chapter IX Machine Learning System Design

9.1 What to Prioritize

      Here we use an example to illustrate what we should prioritize when building a learning system. In the first chapter, when introducing supervised learning, I gave you the example of the email classification problem; here we take it up for detailed discussion. When you receive an email, how do we judge whether it is spam (junk mail) or non-spam? Figure 1 shows two emails we received.

                                                                                             Figure 1 Two example emails

      For the two emails shown in Figure 1, the one on the left is spam (y = 1) and the one on the right is non-spam (y = 0). How do we choose the input features x used to distinguish them? Here we might check whether certain words appear: the recipient's name, such as andrew, or words like buy, deal, discount, now (words that tend to be related to money and spam). We define x_{j}=\begin{cases}1 & \text{if word } j \text{ appears in the email}\\ 0 & \text{otherwise}\end{cases}. Then for the email shown in Figure 2 we have x=\begin{bmatrix}0\\ 1\\ 1\\ 1\\ \vdots\\ 1\\ \vdots\end{bmatrix}\Leftarrow \begin{bmatrix}\text{andrew}\\ \text{buy}\\ \text{deal}\\ \text{discount}\\ \vdots\\ \text{now}\\ \vdots\end{bmatrix},\quad x\in \mathbb{R}^{100}, where we selected 100 words as features for this illustration. In practice we usually choose many more, generally from 10,000 to 50,000 words, chosen as the most frequently occurring words according to statistics on the training set, which makes the features more general.

                                                                                     Figure 2 A spam email
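As a concrete sketch of this feature construction (the tiny five-word vocabulary and the example email text below are made up for illustration; a real system would use the 10,000 to 50,000 most frequent words):

```python
import re

# Illustrative vocabulary; in practice this would hold 10,000-50,000 words.
vocabulary = ["andrew", "buy", "deal", "discount", "now"]

def email_to_features(email_text):
    # Tokenize: lowercase the text and keep alphanumeric runs only
    words = set(re.findall(r"[a-z0-9]+", email_text.lower()))
    # x_j = 1 if word j appears in the email, 0 otherwise
    return [1 if w in words else 0 for w in vocabulary]

x = email_to_features("Buy now! Great deal on a discount mortgage.")
print(x)  # [0, 1, 1, 1, 1]
```

The email contains buy, deal, discount and now but not andrew, so the corresponding entries of x are 1 and the first entry is 0.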

    What can we do to make our system achieve a lower error rate?

1. Collect large amounts of data

2. Develop sophisticated features based on the email header (routing) information

3. Develop features for the message body, e.g. should discount and discounts be treated as the same word? deal and dealer? Should we build features around punctuation?

4. Develop a sophisticated algorithm to detect deliberate misspellings (such as m0rtgage, med1cine)

...

For the options above, we will not answer directly here which ones are useful; we will explain everything in the later analysis.

9.2 Error Analysis

      Based on the previous analysis, when we carry out a machine learning project our recommendation is:

1. Start with a simple algorithm that you can implement quickly; the sooner you get it running, the better, and then test it on the cross-validation data. In the case described here, starting with a complex algorithm is not the better choice; quickly getting a working solution is what we want, and we then improve it guided by testing.

2. Plot learning curves to decide whether getting more data, adding more features, and so on is likely to help.

3. Error analysis: manually examine the examples the algorithm misclassified, and gather statistics by hand on which features or categories of email it gets wrong.

      Below I explain the third step above, error analysis, in detail. Suppose we have a cross-validation set with m_{cv}=500, of which 100 emails are misclassified. The first step: manually examine these 100 emails and see what categories they fall into. After analysis we find pharma (drug) emails, replica/fake emails, steal-passwords (phishing) emails, and so on. We can also gather statistics on features such as deliberate misspellings, the routing path in the header, and unusual punctuation. The statistics are shown in Figure 3.

                                                                         Figure 3 Statistics of features in the misclassified emails

The second step is then to decide which features to use so that these emails can be classified correctly.
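The first step of this error analysis can be sketched as a simple tally. The category labels follow the text above, while the counts per category are hypothetical, invented only so the totals come to 100 misclassified emails:

```python
from collections import Counter

# Hypothetical manual labels for the 100 misclassified cross-validation
# emails; the categories come from the error analysis above, the counts
# are invented for illustration.
error_categories = (["pharma"] * 12 + ["replica/fake"] * 4 +
                    ["steal passwords"] * 53 + ["other"] * 31)

counts = Counter(error_categories)
for category, n in counts.most_common():
    print(category, n)
```

Whichever category dominates the tally (here, the hypothetical steal-passwords group) is where designing new features is most likely to pay off.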

      As for whether a given idea is useful, our intuitive judgment may not be reliable; we need to let the data speak. Consider the question of whether words such as discount and discounts should be treated as the same word. There is existing software for this called "stemming", which extracts the stem of a word, so the words mentioned above would be judged to be the same word. But if we have universe and university, they may also be judged to be the same word, and an error occurs. So we cannot decide by intuition alone whether stemming helps; what we need to do is run both versions on the cross-validation data and let the resulting numbers decide. Suppose that after testing, the error rate without stemming is 5% while with stemming it is 3%; the error rate has dropped, so stemming is useful here. Later, whenever we need to decide whether a method is effective, we can test it on the cross-validation data in the same way.
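To see why stemming can both help and hurt, here is a deliberately crude suffix-stripping stemmer. Real stemmers (e.g. the Porter stemmer) use far more careful rules; this toy version exists only to reproduce the discount/discounts merge and the universe/university collision mentioned above:

```python
def crude_stem(word):
    # Strip a few common suffixes, keeping at least 4 leading characters.
    # This is a toy rule set for illustration, not a real stemming algorithm.
    for suffix in ("ing", "ed", "s", "ity", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[:-len(suffix)]
    return word

# Desired merge: both map to the stem "discount"
print(crude_stem("discounts") == crude_stem("discount"))   # True

# Undesired collision: both map to the stem "univers"
print(crude_stem("universe") == crude_stem("university"))  # True
```

Because the same mechanism produces both good merges and bad collisions, only the measured cross-validation error can tell us whether the net effect is positive.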

9.3 Error Metrics for Skewed Classes

      There is a special kind of data, which we explain using the example of cancer prediction. In the cancer problem, y = 1 means the patient has cancer and y = 0 means no cancer. On the test data we find that our classifier misjudges 1% of the cases and judges 99% correctly. But in fact only 0.5% of the patients have cancer; a problem where one class is so much rarer than the other is called a skewed classes problem. If we were writing the algorithm ourselves, we could simply do this:

function y = predictCancer(x)

          y = 0;   % ignore x: always predict "no cancer"

return

This classifier ignores the input entirely and always outputs 0, yet its error rate is only 0.5%.
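A quick numerical check of this point, assuming a test set of 1,000 patients of whom 5 (0.5%) actually have cancer:

```python
y_true = [1] * 5 + [0] * 995   # 0.5% of patients actually have cancer
y_pred = [0] * 1000            # predictCancer: always output 0, ignoring x

errors = sum(t != p for t, p in zip(y_true, y_pred))
print(errors / len(y_true))  # 0.005 -> only 0.5% error, yet nobody is detected
```

The error rate looks excellent, but every patient who has cancer is missed, which is exactly why plain accuracy is misleading here.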

    For this problem, what we need to examine is the performance on y = 1, so clearly we cannot proceed as above. Here we define two new quantities based on the table shown in Figure 4: precision and recall.

precision=\frac{\text{True Pos}}{\text{True Pos}+\text{False Pos}} (of all the patients for whom we predicted cancer, what fraction actually has cancer?)

recall=\frac{\text{True Pos}}{\text{True Pos}+\text{False Neg}} (of all the patients who actually have cancer, what fraction did we correctly detect?)

                                                                                  Figure 4 Relationship between actual and predicted classes
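The two formulas above translate directly into code; the small example labels below are made up only to exercise the function:

```python
def precision_recall(y_true, y_pred):
    # Count the cells of the table in Figure 4
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels: 3 actual positives; the classifier finds 2 of them
# plus 1 false alarm, so precision = recall = 2/3
p, r = precision_recall([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
print(p, r)
```

Note that the always-predict-0 classifier from above gets precision 0 and recall 0 under these definitions, which exposes it immediately despite its low error rate.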

    Here we certainly want to predict accurately which patients truly have cancer, i.e. high precision, and we also want every patient who has cancer to be detected by us, i.e. high recall. In practice it is hard to make both especially high, so how do we trade them off? In the earlier classification setting, 0 \leq h_{\theta}(x) \leq 1 and we set the threshold at 0.5: when h_{\theta}(x) \geq 0.5 we predict 1, and when h_{\theta}(x) < 0.5 we predict 0. Suppose we raise the threshold to 0.9. Then we predict 1 (cancer) less readily, so our precision will be high but our recall will be low, because with such a high threshold we miss more patients. Suppose instead we set the threshold to 0.3; the result is exactly the opposite: precision will be low and recall will be high. Since the two seem hard to choose between, the first idea that comes to mind is probably to take their average, \frac{P+R}{2}. For the P and R values shown in Figure 5, the averages of the three pairs are 0.45, 0.4 and 0.51, so the third pair looks best; however, the third pair corresponds to predicting 1 for every input, which is clearly unacceptable. So here we introduce a new evaluation measure called the F score, usually written F_{1}=2\frac{PR}{P+R}. This avoids rewarding a classifier whose output is always the same result, because then P or R will be very small and therefore F will also be very small, marking it as unsuitable. For example, the F scores for Figure 5 are 0.444, 0.175 and 0.039: the third pair is clearly unsuitable, and the first pair is the most suitable.

                                                                                Figure 5 Precision and Recall
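Computing the average and the F score for three (P, R) pairs whose averages match the 0.45, 0.4 and 0.51 quoted above; the pairs themselves, (0.5, 0.4), (0.7, 0.1) and (0.02, 1.0), are an assumption chosen to be consistent with those numbers:

```python
def f1_score(p, r):
    # F1 = 2PR / (P + R); defined as 0 when both P and R are 0
    return 2 * p * r / (p + r) if p + r else 0.0

pairs = [(0.5, 0.4), (0.7, 0.1), (0.02, 1.0)]
for p, r in pairs:
    print(f"P={p}, R={r}: average={(p + r) / 2:.3f}, F1={f1_score(p, r):.3f}")
# The third pair has the highest average (0.510) but by far the lowest F1,
# while the first pair wins on F1 -- matching the discussion above.
```

Because F1 is a harmonic-style mean, it collapses toward zero whenever either P or R is tiny, which is exactly the behavior we want for rejecting degenerate classifiers.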

9.4 Data for Machine Learning

      Earlier I also explained that when the learning model we build does not perform well, we often go and get more data. A large amount of data is indeed a great help to a machine learning system; it has even been said that the best systems are not the ones with the best learning algorithm but the ones with the most data. The features, however, must carry enough information. For example, in the house-price prediction problem, if the only data we have is the size of each house, can we predict prices accurately? The answer is no, because we do not know how many rooms a house has, whether it is new or old, and so on. When the features are informative and we have large amounts of data, we can judge much better. As discussed earlier, with large amounts of data the overfitting problem is alleviated: J_{train} stays small and J_{test} approaches J_{train}, and so on. So we really do want more data, so that we can be more confident in the learning system we build.

 


Origin blog.csdn.net/qq_36417014/article/details/84197620