Ten Classic Machine Learning Algorithms: Bayesian Decision Making Explained Simply (Bayes' formula, minimum-risk Bayes, minimum-error Bayes)

Preface


   People often say that you only truly understand something when you can explain it to others in simple terms. I happen to be studying pattern recognition, so I am using it to practice my writing. Bayesian decision making is one of the ten classic machine learning algorithms and a typical example of statistical machine learning. In fact, we use it in our everyday judgments all the time without noticing.

   Suppose you have 10 exam papers in front of you. The teacher tells you that 5 of them belong to students who never review (the "Xuezha", or poor students) and 5 belong to students who review seriously (the "Xueba", or top students). You draw one at random and it scores 90+. Without even looking at the name, you would probably say it is a top student's paper. You may not have realized it, but at the moment you made that judgment you were already using Bayes.

Related posts:
   Bayesian decision making in practice: Ten Classic Machine Learning Algorithms: Naive Bayes Image Segmentation in Practice: Nemo Fish Image Segmentation

   EM and Bayes: Ten Classic Machine Learning Algorithms: A New Path: EM Algorithm + Gaussian Mixture Model in Practice



Bayesian Inverse Probability Formula


  The Bayes formula, also known as the Bayesian inverse probability formula, was published by Mr. Bayes in the middle of the 18th century. It attracted little attention when it first appeared, but today it is one of the best-known formulas in statistics. The formula is designed to start from a known result and, combined with some empirical or statistical information, infer the most likely cause of that result, i.e. to reason from effect back to cause. The observed data we obtain are often a mixture from multiple sources, and determining the true source of each data point is exactly our job: classification.

  In the example quoted above, the known result is that the drawn paper scored 90+. The empirical information is that a top student is far more likely to score 90+ than a poor student. The statistical information is that there are 10 papers, half in each group. A paper drawn at random from the 10 should be equally likely to belong to either group, but once the result is known, the empirical information tells us that this paper is more likely to come from the Xueba group. How much more likely, exactly? That is what the Bayes formula is for.



   The Bayes formula is built from a prior probability, a total probability and two conditional probabilities, as shown in formula (1). Let us now use the formula to symbolize and quantify the reasoning in the example.

P(wi|x) = P(x|wi)P(wi) / P(x) = P(x|wi)P(wi) / Σj P(x|wj)P(wj)        (1)

   Suppose w1 denotes the Xuezha group (class 1) and w2 denotes the Xueba group (class 2); x=0 denotes the event that a paper's score does not exceed 90, x=1 denotes the event that the score is 90+, and U denotes the total number of papers.

   Let P(wi) denote the proportion of papers in each of the two groups (classes); then P(w1)=0.5 and P(w2)=0.5, i.e. half each. This probability is called the prior probability.

   Suppose that, based on all previous exam records, the probability of scoring 90+ in group w1 is 0.2 and in group w2 is 0.8, i.e. P(x=1|w1)=0.2 and P(x=1|w2)=0.8. This probability is usually called the class-conditional probability. It reflects the most essential difference between the two classes: here, the probability of scoring 90+, which is the most important basis for classification.

   Use P(x=1) to denote the overall probability that a paper scores 90+, summed over the two groups w1 and w2; this is the total probability.
   What we finally want is the probability that a 90+ paper comes from group (class) w1 or w2, namely P(w1|x=1) and P(w2|x=1). These are also conditional probabilities, usually called posterior probabilities.


   It is easy to see that the posterior probability is really a measure of how much each component contributes to the observed result: a large posterior means that component (class) accounts for a large share of all the ways the result could have occurred. Before the score of the drawn paper is known, the paper could belong to any of the 10 students, i.e. in expectation each of the two components (classes) contributes 5 papers, 0.5 each. Once we know that the paper scored 90+, the contributions quietly change.

   In expectation, group (class) w1 contributes U × P(w1) × P(x=1|w1) = 10 × 0.5 × 0.2 = 1 paper;

   In expectation, group (class) w2 contributes U × P(w2) × P(x=1|w2) = 10 × 0.5 × 0.8 = 4 papers;

   The expected number of 90+ papers is the sum over the two groups (classes): U × P(w1) × P(x=1|w1) + U × P(w2) × P(x=1|w2) = 5 papers.

  Then the share contributed by group (class) w1 is its posterior probability, P(w1|x=1) = 1/5 = 0.2, and the share contributed by group (class) w2 is its posterior probability, P(w2|x=1) = 4/5 = 0.8. Since the posterior probability of the Xueba group (class) w2 is much larger, it is reasonable in the quoted example to conclude that the 90+ paper most likely belongs to a top student.
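
  To make the arithmetic above concrete, here is a minimal Python sketch of the same computation. The numbers (priors of 0.5 each, 90+ probabilities of 0.2 and 0.8) come from the example; the variable names are my own.

```python
# Priors and class-conditional probabilities from the exam example.
priors = {"w1": 0.5, "w2": 0.5}             # P(w1), P(w2)
likelihood_90plus = {"w1": 0.2, "w2": 0.8}  # P(x=1 | class)

# Total probability P(x=1).
p_x1 = sum(priors[c] * likelihood_90plus[c] for c in priors)

# Posterior probabilities P(class | x=1) via Bayes' formula.
posteriors = {c: priors[c] * likelihood_90plus[c] / p_x1 for c in priors}

print(p_x1)        # 0.5
print(posteriors)  # {'w1': 0.2, 'w2': 0.8}
```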

  Similarly, if the drawn paper scores no more than 90, the Bayes formula gives the posterior probabilities P(w1|x=0) = 0.8 and P(w2|x=0) = 0.2, so the paper is much more likely to belong to the Xuezha group.


Bayesian decision


  Bayesian decision making uses the Bayes formula to compute posterior probabilities and then makes an assignment decision, i.e. classification. In the example above, the decision is whether a paper scoring 90+ (or not exceeding 90) should be assigned to group (class) w1 or to group (class) w2. There are two main decision rules: minimum-error Bayesian decision and minimum-risk Bayesian decision. The former is made under the ideal condition that all classes have equal status, while the latter also takes into account the cost of each decision and the unequal status of the classes.


  • Minimum error Bayesian decision

  Choose the group (class) with the largest posterior probability. This maximizes the probability of a correct judgment and minimizes the probability of a mistake, so minimum-error Bayesian decision = maximum-posterior Bayesian decision. Since probabilities are non-negative, if the error probability of each individual decision is minimized, the total error rate is minimized as well.

  In the example above, when x=1 we have P(w1|x=1)=0.2 and P(w2|x=1)=0.8. Because P(w1|x=1) < P(w2|x=1), assigning a 90+ paper to w2 gives the smallest probability of error. That error probability is the probability that the 90+ paper actually belongs to w1, i.e. P(w1|x=1) = 0.2 = 1 − P(w2|x=1). Likewise, when x=0, assigning a paper scoring 90 or below to w1 gives the smallest probability of error, namely P(w2|x=0) = 0.2 = 1 − P(w1|x=0). If you make 10 such decisions, the average error rate is (5 × 0.2 + 5 × 0.2) / 10 = 0.2.
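
  Here is a small sketch of this minimum-error rule for the two-valued example, reusing the numbers from the text; the function names are my own and not from any particular library.

```python
# Minimum-error (maximum-posterior) Bayesian decision for the discrete example.
priors = {"w1": 0.5, "w2": 0.5}
# Class-conditional probabilities P(x | class) for x in {0, 1}.
cond = {"w1": {0: 0.8, 1: 0.2}, "w2": {0: 0.2, 1: 0.8}}

def posteriors(x):
    """Posterior probabilities P(class | x) for both classes."""
    p_x = sum(priors[c] * cond[c][x] for c in priors)  # total probability P(x)
    return {c: priors[c] * cond[c][x] / p_x for c in priors}

def decide_min_error(x):
    """Assign x to the class with the largest posterior."""
    post = posteriors(x)
    return max(post, key=post.get)

# For each x, the probability of error is 1 minus the largest posterior;
# the average error rate weights this by P(x).
avg_error = sum(
    sum(priors[c] * cond[c][x] for c in priors) * (1 - max(posteriors(x).values()))
    for x in (0, 1)
)

print(decide_min_error(1))  # 'w2': a 90+ paper goes to the top-student group
print(decide_min_error(0))  # 'w1': a paper at 90 or below goes to the other group
print(avg_error)            # approximately 0.2
```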

  Looking closely at the example, you may feel it is more like a simple arithmetic problem than a pattern recognition problem. In real pattern recognition, the data x to be classified usually does not take just two values [0, 1] but a whole range of values, such as [60, 62, 80, 95, 90, ...]; and the corresponding class-conditional probability is then not a few isolated impulses but a continuous probability density function (PDF), as shown in Figure 1.


Figure 1. Class-conditional probability and class-conditional probability density function

  
  Note: I wonder whether some readers have the same question I did: the data themselves are discrete, so why estimate a probability density function? Clearly one could just count the frequency of each x and obtain a discrete probability mass function, yet the textbook uses maximum likelihood to estimate a continuous density. My personal understanding is that the model is meant to be used beyond the sample, in order to better approximate the distribution of the whole population. The sample cannot contain all possible data, so a smooth, continuous PDF estimated from the sample by maximum likelihood will usually match the population distribution better than the discrete distribution obtained by directly counting the sample.
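
  As a rough illustration of this note, the sketch below estimates a continuous density from discrete samples by maximum likelihood. It assumes, purely for illustration, that each group's scores are Gaussian; the sample scores are made-up numbers, not data from this post.

```python
import numpy as np

# Hypothetical score samples for the two groups (made-up numbers).
scores_w1 = np.array([55, 60, 62, 70, 58, 66], dtype=float)
scores_w2 = np.array([85, 90, 95, 88, 92, 97], dtype=float)

# For a Gaussian, the maximum-likelihood estimates of the parameters are the
# sample mean and the (biased, ddof=0) sample standard deviation.
mu1, sigma1 = scores_w1.mean(), scores_w1.std()
mu2, sigma2 = scores_w2.mean(), scores_w2.std()

def gaussian_pdf(x, mu, sigma):
    """Estimated class-conditional probability density p(x | class)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# The fitted PDF is smooth and assigns density even to scores that never
# appeared in the sample, e.g. x = 75:
print(gaussian_pdf(75.0, mu1, sigma1), gaussian_pdf(75.0, mu2, sigma2))
```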


  If the class-conditional probability is a continuous density function, the probability of error is no longer the expectation of the per-decision error probabilities over a few discrete sample values (such as x=1 and x=0 in the example), i.e. a sum, but the expectation over a continuum of samples, i.e. an integral.

  From formula (1) we can see that, since the denominator P(x) is the same for every class, comparing posteriors only requires comparing the numerators. Figure 2 therefore plots the class-conditional probability density functions multiplied by a constant (the prior probability, which does not change with the observation and is a fixed value).

  Clearly, t in the figure is the decision point. When x < t, w1 contributes more to generating the data x than w2, so the minimum-error Bayesian decision assigns x to w1. When x > t, w2 contributes more to generating x than w1, so the minimum-error Bayesian decision assigns x to w2.


Figure 2. Two types of error rates

  The result of the minimum-error Bayesian decision is that every x falling in region R1 is assigned to w1, including the w2 component mixed into that region, and every x falling in region R2 is assigned to w2, including the w1 component mixed into it.

  So for each x falling in region R1, the probability of being judged wrongly is 1 − P(w1|x) = P(w2|x). The average error contributed by R1 is therefore the expectation of P(w2|x) over x < t, which is the diagonally hatched area in the figure, as shown in equation (2).
P1(e) = ∫_R1 P(w2|x) p(x) dx = ∫_{-∞}^{t} p(x|w2) P(w2) dx        (2)
  Similarly, the average error contributed by R2 is the expectation of P(w1|x) over x > t, which is the grid-shaded area in the figure.
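
  To make equation (2) concrete, the sketch below reproduces the setting of Figure 2 numerically: two Gaussian class-conditional densities scaled by their priors, the crossing point t, and the two error areas obtained by numerical integration. All distribution parameters here are assumptions chosen for illustration, not values from this post.

```python
import numpy as np

P1, P2 = 0.5, 0.5            # priors P(w1), P(w2)
mu1, s1 = 62.0, 8.0          # assumed Gaussian parameters for p(x|w1)
mu2, s2 = 91.0, 5.0          # assumed Gaussian parameters for p(x|w2)

def pdf(x, mu, s):
    """Gaussian class-conditional density p(x | class)."""
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

x = np.linspace(0.0, 130.0, 20001)
dx = x[1] - x[0]
g1 = P1 * pdf(x, mu1, s1)    # P(w1) * p(x|w1)
g2 = P2 * pdf(x, mu2, s2)    # P(w2) * p(x|w2)

# Decision point t: where the two scaled densities cross, between the two means.
between = (x > mu1) & (x < mu2)
t = x[between][np.argmin(np.abs(g1[between] - g2[between]))]

# Equation (2): the error contributed by R1 (x < t) is the w2 mass falling there;
# symmetrically, the error contributed by R2 (x > t) is the w1 mass falling there.
err_R1 = g2[x < t].sum() * dx    # diagonally hatched area in Figure 2
err_R2 = g1[x >= t].sum() * dx   # grid-shaded area in Figure 2

print(t)                   # decision threshold
print(err_R1 + err_R2)     # total minimum probability of error
```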

  • Minimum risk Bayesian decision

  Choose the group (class) with the smallest decision risk. When different decisions carry different costs, we attach different weights (losses) to each decision. For example, if a person has early-stage cancer and the model judges them to be healthy, the decision may cost a life, so it must be given a high weight; if a healthy person is judged to have early-stage cancer, the cost is at most some extra examination fees and a few days of worry, which is negligible compared with a life, so that decision gets a low weight. Obviously a correct judgment carries no cost, so its weight is 0.

  Let us dig deeper with the exam example. Suppose that wrongly judging a Xuezha (w1) paper as a Xueba (w2) paper deals 10 points of damage to both parties, while the reverse mistake deals only 1 point, and a correct judgment leaves everyone in peace.

  So we assign a weight of 10 to the decision of judging a Xuezha (w1) paper as Xueba (w2), a weight of 1 to judging a Xueba (w2) paper as Xuezha (w1), and a weight of 0 to correct decisions, which gives the following decision (loss) table.

| Decision \ True class | w1 (Xuezha) | w2 (Xueba) |
| --- | --- | --- |
| Judge as w1 | 0 | 1 |
| Judge as w2 | 10 | 0 |

  Suppose we again draw a 90+ paper, i.e. x=1. From the analysis above, this paper comes from w1 with probability 0.2 and from w2 with probability 0.8; its state is a mixture of w1 and w2. Under the minimum-error Bayesian decision, judging the paper as Xueba (w2) was the best choice. Now, if we still judge it as w2, the conditional risk is R2 = 10 × 0.2 + 0 × 0.8 = 2, while judging it as w1 gives R1 = 0 × 0.2 + 1 × 0.8 = 0.8. Under the minimum-risk Bayesian decision, judging it as w1 is therefore the best choice. The plot takes a 180-degree turn, which is perhaps why big shots can turn the tide: their weight is heavy.
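
  The same computation as a minimal sketch in code, using the loss table and the posteriors from the example above:

```python
# Minimum-risk Bayesian decision for a 90+ paper (x = 1).
# loss[decision][true_class]: judging a w1 paper as w2 costs 10, the reverse
# costs 1, and correct decisions cost 0 (the decision table above).
loss = {
    "w1": {"w1": 0.0, "w2": 1.0},
    "w2": {"w1": 10.0, "w2": 0.0},
}

# Posterior probabilities for x = 1, from the earlier Bayes computation.
posterior = {"w1": 0.2, "w2": 0.8}

# Conditional risk of each decision a: R(a | x) = sum over classes c of
# loss[a][c] * P(c | x).
risk = {a: sum(loss[a][c] * posterior[c] for c in posterior) for a in loss}
best = min(risk, key=risk.get)

print(risk)  # {'w1': 0.8, 'w2': 2.0}
print(best)  # 'w1': the minimum-risk decision flips the minimum-error one
```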


References


  Zhang Xuegong. Pattern Recognition (3rd Edition). Tsinghua University Press, 2010.
