Bayes' formula / Bayes' rule / Bayes' theorem

Introduction

Bayes' rule

What is Bayes' theorem used for? Simply put, predicting probabilities: given that certain conditions hold, what is the probability that an event occurs?

According to Wikipedia, the motivation for the theorem is clear: it addresses the inverse probability problem.

Before Bayes wrote his paper, people already knew how to compute "forward probability", for example: "Suppose there are N white balls and M black balls in a bag. You reach in and draw one ball. What is the probability that it is black?"

The natural question runs the other way: "If we do not know the ratio of black to white balls in the bag in advance, but draw one (or several) balls with our eyes closed and observe their colors, what can we infer about the proportion of black and white balls in the bag?" This is the so-called inverse problem.

Understanding the formula

Given that event B occurs, the probability that event A occurs is:

P(A|B) = P(A∩B) / P(B)

Similarly, given that event A occurs, the probability that event B occurs is:

P(B|A) = P(A∩B) / P(A)

Combining the two definitions of conditional probability, it is easy to derive:

P(A|B) * P(B) = P(A∩B) = P(B|A) * P(A)

Assuming P(A) ≠ 0, we obtain Bayes' theorem for predicting probability:

P(B|A) = P(A|B) * P(B) / P(A)
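As a quick numeric check of the relation P(B|A) = P(A|B) * P(B) / P(A), here is a minimal Python sketch. The three probabilities are made-up illustrative numbers, and the function name `bayes` is hypothetical.

```python
# Hypothetical probabilities for two events A and B (illustrative numbers only).
p_a_and_b = 0.08   # P(A∩B)
p_a = 0.20         # P(A)
p_b = 0.10         # P(B)

# Conditional probabilities straight from their definitions.
p_a_given_b = p_a_and_b / p_b   # P(A|B)
p_b_given_a = p_a_and_b / p_a   # P(B|A)


def bayes(p_a_given_b, p_b, p_a):
    """Return P(B|A) = P(A|B) * P(B) / P(A); requires P(A) != 0."""
    if p_a == 0:
        raise ValueError("P(A) must be non-zero")
    return p_a_given_b * p_b / p_a


# Both routes to P(B|A) should agree up to floating-point rounding.
print(p_b_given_a, bayes(p_a_given_b, p_b, p_a))
```

Computing P(B|A) from the definition and through Bayes' theorem gives the same value, which is all the theorem claims.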

The theorem extends naturally to multiple conditions. For example, with 2 conditions:

P(A|B,C) = P(A) * P(B|A) * P(C|A,B) / [P(B) * P(C|B)]
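One common way to write the two-condition case is P(A|B,C) = P(A) * P(B|A) * P(C|A,B) / [P(B) * P(C|B)]. The sketch below checks that identity numerically on a randomly generated joint distribution over three binary events. The helper names `prob` and `cond` and the random seed are assumptions made for illustration.

```python
import itertools
import random

random.seed(0)  # arbitrary seed, only for reproducibility

# Random joint distribution over three binary events (A, B, C).
outcomes = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in outcomes]
total = sum(weights)
joint = {o: w / total for o, w in zip(outcomes, weights)}

INDEX = {"a": 0, "b": 1, "c": 2}


def prob(**events):
    """Probability that the named events take the given 0/1 values."""
    return sum(p for o, p in joint.items()
               if all(o[INDEX[name]] == value for name, value in events.items()))


def cond(target, given):
    """P(target | given), both passed as dicts such as {"a": 1}."""
    return prob(**target, **given) / prob(**given)


lhs = cond({"a": 1}, {"b": 1, "c": 1})
rhs = (prob(a=1) * cond({"b": 1}, {"a": 1}) * cond({"c": 1}, {"a": 1, "b": 1})
       / (prob(b=1) * cond({"c": 1}, {"b": 1})))
print(abs(lhs - rhs) < 1e-12)
```

The numerator multiplies out to P(A∩B∩C) and the denominator to P(B∩C), so both sides reduce to the definition of P(A|B,C).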

Classic cases

What is the probability that a person of a certain religion is a terrorist?

Suppose 100% of terrorists believe in a certain religion. The fact that someone believes in that religion does not mean the person is 100% certain to be a terrorist; the prior probability has to be taken into account. Assume there are 70,000 terrorists in the world (global population 7 billion) and that 1/3 of the world's people believe in that religion. What is the probability that a given believer is a terrorist?

Solution:

The probability we want is: P(terrorist | believer)

Applying the formula gives:

P(terrorist | believer)

= P(believer | terrorist) * P(terrorist) / P(believer)

= 100% * (70,000 / 7,000,000,000) / (1/3)

= 0.003%

That is, a probability of 3 in 100,000.
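The arithmetic can be reproduced with a few lines of Python. This is only a sketch of the calculation above; the variable names are chosen for illustration.

```python
terrorists = 70_000
population = 7_000_000_000

p_terrorist = terrorists / population   # prior: 1 in 100,000
p_believer = 1 / 3                      # share of the world population
p_believer_given_terrorist = 1.0        # the 100% assumption from the text

posterior = p_believer_given_terrorist * p_terrorist / p_believer
print(f"{posterior:.3%}")  # 0.003%
```

The tiny prior dominates the result: even a likelihood of 100% cannot lift the posterior above 3 in 100,000.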

Extending this further: from a mathematical point of view, the Democratic Party is right not to target a particular religious group, although the assumption that 100% of terrorists believe in a certain religion is relatively…

What is the probability of drug use by an employee who tests positive?

Assume that a routine test has a sensitivity and a specificity of 99% each: a drug user tests positive (+) with probability 99%, and a non-user tests negative (-) with probability 99%. Suppose a company tests all of its employees, and 0.5% of employees are known to use drugs. What is the probability that an employee who tests positive actually uses drugs?

Solution:

The probability we want is: P(drug user | positive test)

Applying the formula gives:

P(drug user | positive test)

= P(positive test | drug user) * P(drug user) / P(positive test)

= 99% * 0.5% / [P(positive test ∩ drug user) + P(positive test ∩ non-user)]

= 99% * 0.5% / [P(positive test | drug user) * P(drug user) + P(positive test | non-user) * P(non-user)]

= 99% * 0.5% / [99% * 0.5% + 1% * 99.5%]

= 0.3322

In other words, even though the test is 99% accurate, Bayes' theorem tells us that if someone tests positive, the probability that they use drugs is only about 33%. It is more likely (about 67%) that they do not.

Note, however, that the accuracy of the test strongly affects the result. If the test accuracy reaches 99.9%, the probability that an employee who tests positive uses drugs rises to 83.39%.
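Both figures can be checked with a small function that expands P(positive test) using the law of total probability. This is a sketch; the function and parameter names are chosen for illustration.

```python
def p_user_given_positive(prevalence, sensitivity, specificity):
    """P(drug user | positive test) via Bayes' theorem."""
    true_positive = sensitivity * prevalence                # P(+ ∩ user)
    false_positive = (1 - specificity) * (1 - prevalence)   # P(+ ∩ non-user)
    return true_positive / (true_positive + false_positive)


print(round(p_user_given_positive(0.005, 0.99, 0.99), 4))    # 0.3322
print(round(p_user_given_positive(0.005, 0.999, 0.999), 4))  # 0.8339
```

With only 0.5% of employees using drugs, false positives from the 99.5% of non-users outnumber true positives unless the false-positive rate is very small.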

Spam filtering

This is the approach described by Paul Graham in Hackers and Painters. Here too the problem can be framed in the inverse direction. The probability we want is: P(spam | some feature detected).

The feature can be a keyword, a sending time, a frequency, an attachment type, or any combination of these.

Let's start with the simplest case, a keyword. In my personal experience, a typical Chinese spam email is likely to contain the two-character word 发票 ("invoice"). The probability that an email is spam then becomes P(spam | "invoice" detected). According to Bayes' theorem:

P(spam | "invoice" detected)

= P("invoice" detected | spam) * P(spam) / P("invoice" detected)

This raises a question: how do we know the probability that the "invoice" keyword appears in spam?

And how do we know the probability that it appears in email overall? In theory we cannot, unless we count every email ever sent. So we compromise and make an engineering approximation. We collect a set of real emails, split them into a group of normal emails and a group of spam emails, and then measure how often the word "invoice" appears in spam and how often it appears in normal mail.

Obviously, the larger the training set, the closer the estimated probabilities are to the true values. Paul Graham used 4,000 normal emails and 4,000 spam emails. If a word appears only in spam, Paul Graham assumes its frequency in normal mail is 1%, and vice versa, to avoid a probability of 0. As the number of emails grows, the estimates adjust automatically.
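The estimation step can be sketched as follows. The two tiny message lists are made-up stand-ins for a real training corpus, and `keyword_frequency` is a hypothetical helper; the only rule taken from the text is the 1% fallback for a word that never appears in one of the groups.

```python
def keyword_frequency(keyword, messages, floor=0.01):
    """Fraction of messages that contain `keyword` as a whole word.

    If the keyword never appears, return `floor` (1%) instead of 0.
    """
    if not messages:
        raise ValueError("need at least one message")
    hits = sum(1 for text in messages if keyword in text.lower().split())
    return hits / len(messages) if hits else floor


# Hypothetical, hand-written training data.
spam_mail = ["cheap invoice available", "invoice for you", "win a prize now"]
normal_mail = ["meeting at noon", "lunch tomorrow", "project status update"]

p_kw_given_spam = keyword_frequency("invoice", spam_mail)      # 2 of 3 messages
p_kw_given_normal = keyword_frequency("invoice", normal_mail)  # unseen, so 0.01
print(p_kw_given_spam, p_kw_given_normal)
```

Splitting on whitespace is a deliberate simplification; a real filter would need a proper tokenizer, especially for Chinese text.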

With these estimates, the formula expands as follows:

P(spam | "invoice" detected)

= P("invoice" detected | spam) * P(spam) / P("invoice" detected)

= P("invoice" detected | spam) * P(spam) / [P("invoice" detected ∩ spam) + P("invoice" detected ∩ normal mail)]

= P("invoice" detected | spam) * P(spam) / [P("invoice" detected | spam) * P(spam) + P("invoice" detected | normal mail) * P(normal mail)]

The initial value can then be computed from the probabilities estimated on the training set. After that, feedback from a large number of users (marking messages in the spam folder as normal, or moving normal messages to spam) drives repeated retraining and correction until the value approaches a reasonable one.
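A minimal sketch of this compute-then-correct loop for a single keyword is shown below. The class `KeywordModel`, its starting counts, and the default prior `p_spam=0.5` are all assumptions made for illustration, not values from the text. The posterior uses the product form P(keyword | spam) * P(spam) / [P(keyword | spam) * P(spam) + P(keyword | normal mail) * P(normal mail)].

```python
class KeywordModel:
    """Counts for one keyword in spam and in normal mail."""

    def __init__(self, spam_total, spam_hits, normal_total, normal_hits, p_spam=0.5):
        self.spam_total = spam_total
        self.spam_hits = spam_hits
        self.normal_total = normal_total
        self.normal_hits = normal_hits
        self.p_spam = p_spam  # assumed prior P(spam)

    def record(self, is_spam, has_keyword):
        """Fold one user-labelled message into the counts."""
        if is_spam:
            self.spam_total += 1
            self.spam_hits += int(has_keyword)
        else:
            self.normal_total += 1
            self.normal_hits += int(has_keyword)

    def posterior(self):
        """P(spam | keyword detected); assumes the keyword was seen at least once."""
        p_kw_spam = self.spam_hits / self.spam_total
        p_kw_normal = self.normal_hits / self.normal_total
        spam_term = p_kw_spam * self.p_spam
        normal_term = p_kw_normal * (1 - self.p_spam)
        return spam_term / (spam_term + normal_term)


model = KeywordModel(spam_total=4000, spam_hits=200, normal_total=4000, normal_hits=4)
before = model.posterior()

# Users rescue 100 messages containing the keyword from the spam folder.
for _ in range(100):
    model.record(is_spam=False, has_keyword=True)
after = model.posterior()
print(before, after)
```

Each correction shifts the counts, so the posterior for the keyword drifts toward whatever the users' labels support.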

There is still a problem, though: no matter how high the probability for a single keyword (a single condition), the email may still not be spam. When applying Bayes' theorem here, we therefore need to use many conditions, that is, to compute this probability:

P(spam | keyword "A" detected, keyword "B" detected, keyword "C" detected, ...)

Paul Graham's approach is to select the 15 words with the highest P(spam | keyword "X" detected) and compute their joint probability. (If a keyword is seen for the first time, Paul Graham assumes its value is 0.4, that is, it is assumed to lean toward normal mail.)
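The text does not spell out how the joint probability is computed. The sketch below assumes the combination rule from Paul Graham's essay "A Plan for Spam", P = p1 * ... * pn / (p1 * ... * pn + (1 - p1) * ... * (1 - pn)), which treats the words as independent. The function name `score_message` and the per-word table are hypothetical.

```python
from math import prod


def score_message(words, spam_prob_by_word, top_n=15, unseen=0.4):
    """Combine per-word spam probabilities for one message.

    Words missing from the table get `unseen` (0.4). Only the `top_n`
    highest probabilities are combined. Probabilities are assumed to lie
    strictly between 0 and 1, so the denominator is never zero.
    """
    probs = [spam_prob_by_word.get(word, unseen) for word in set(words)]
    top = sorted(probs, reverse=True)[:top_n]
    spam_side = prod(top)
    normal_side = prod(1 - p for p in top)
    return spam_side / (spam_side + normal_side)


# Hypothetical per-word values of P(spam | word detected).
table = {"invoice": 0.98, "free": 0.90, "meeting": 0.05}

score = score_message(["invoice", "free", "hello"], table)
print(score)
```

Several moderately suspicious words reinforce each other under this rule, which is why one strong keyword alone does not have to decide the outcome.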

Origin http://43.154.161.224:23101/article/api/json?id=325504234&siteId=291194637