【Data Analysis】Bayes' Theorem

Article source: Official Account-Intelligent IT System.


Bayes' theorem can be viewed as inverting a conditional probability: the posterior probability P(a|b) is derived from the prior probability P(a) and the likelihood P(b|a). Its formula is as follows:

P(a|b) = P(b|a) P(a) / P(b)


In big data analysis, this theorem is well suited to inference and prediction. Many e-commerce and user-targeting applications can use it to infer unknown data from existing data and to classify users for follow-up operations.


For example, the website of a home-buying agency already has 8 customers, with the following information (income is given in W, i.e. units of 10,000):


User ID   Age   Gender   Income   Married   Bought a house
1         27    Male     15W      No        No
2         47    Female   30W      Yes       Yes
3         32    Male     12W      No        No
4         24    Male     45W      No        Yes
5         45    Male     30W      Yes       No
6         56    Male     32W      Yes       Yes
7         31    Male     15W      No        No
8         23    Female   30W      Yes       No


Now a new customer arrives who has not yet bought a house. Her information is as follows:

Age   Gender   Income   Married
34    Female   31W      No


How do we judge whether she will buy a house, and whether to send her home-buying recommendations?


We use Bayes' theorem to estimate the probability. Each of the 8 existing customers is described by four dimensions: age, gender, income, and marital status. These four dimensions are the features used to judge whether a customer ends up buying a house. We first split the records into two tables according to whether the customer bought a house:

Those who bought a house (Exhibit 1):

User ID   Age   Gender   Income   Married   Bought a house
2         47    Female   30W      Yes       Yes
4         24    Male     45W      No        Yes
6         56    Male     32W      Yes       Yes

Those who did not buy a house (Exhibit 2):

User ID   Age   Gender   Income   Married   Bought a house
1         27    Male     15W      No        No
3         32    Male     12W      No        No
5         45    Male     30W      Yes       No
7         31    Male     15W      No        No
8         23    Female   30W      Yes       No


Let a1 denote buying a house and a2 denote not buying one. From the two tables, the prior probabilities are P(a1) = 3/8 and P(a2) = 5/8.
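The split into two tables and the resulting priors can be reproduced with a short Python sketch. The tuple layout and the variable names (`CUSTOMERS`, `bought`, `not_bought`, `p_a1`, `p_a2`) are assumptions made for illustration; only the data comes from the table above.

```python
from fractions import Fraction

# The eight existing customers from the table above:
# (user_id, age, gender, income_in_W, married, bought_house)
CUSTOMERS = [
    (1, 27, "male", 15, False, False),
    (2, 47, "female", 30, True, True),
    (3, 32, "male", 12, False, False),
    (4, 24, "male", 45, False, True),
    (5, 45, "male", 30, True, False),
    (6, 56, "male", 32, True, True),
    (7, 31, "male", 15, False, False),
    (8, 23, "female", 30, True, False),
]

# Split the records by outcome (Exhibit 1 and Exhibit 2).
bought = [c for c in CUSTOMERS if c[5]]
not_bought = [c for c in CUSTOMERS if not c[5]]

# Prior probability of each class = class size / total number of records.
p_a1 = Fraction(len(bought), len(CUSTOMERS))      # P(a1): bought a house
p_a2 = Fraction(len(not_bought), len(CUSTOMERS))  # P(a2): did not buy

print(p_a1, p_a2)  # 3/8 5/8
```

Using `Fraction` keeps the values exact, so they can be compared directly with the fractions quoted in the text.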


We now look at each of the four dimensions in turn:

age:

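Here we divide age into three bands: 20-30, 30-40, and 40+. The new customer falls in the 30-40 band.

P(b1|a1) --- the probability that a customer who bought a house is aged 30-40, taken as 1/3 here (a strict count over Exhibit 1, whose buyers are aged 47, 24, and 56, gives 0/3)

P(b1|a2) --- the probability that a customer who did not buy a house is aged 30-40: 2/5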

income:

Here we divide income into three levels: 10-20, 20-40, and 40+ (in W). The new customer's income is in the 20-40 range.

P(b2|a1) --- the probability that a customer who bought a house has an income of 20-40: 2/3

P(b2|a2) --- the probability that a customer who did not buy a house has an income of 20-40: 2/5

marital status:

The new customer is unmarried.

P(b3|a1) --- the probability that a customer who bought a house is unmarried: 1/3
P(b3|a2) --- the probability that a customer who did not buy a house is unmarried: 3/5

gender:

The new customer is female.

P(b4|a1) --- the probability that a customer who bought a house is female: 1/3
P(b4|a2) --- the probability that a customer who did not buy a house is female: 1/5
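The per-dimension conditional probabilities can be checked by counting rows in each exhibit. The following Python sketch does this. The helper names (`age_band`, `income_band`, `features`, `likelihoods`) are hypothetical, as is the choice to treat the lower bound of each band as inclusive. Note that a strict count over the buyers gives 0/3 for the 30-40 age band, where the text uses 1/3; all other values match.

```python
from fractions import Fraction

# (user_id, age, gender, income_in_W, married, bought_house), from the table above
CUSTOMERS = [
    (1, 27, "male", 15, False, False),
    (2, 47, "female", 30, True, True),
    (3, 32, "male", 12, False, False),
    (4, 24, "male", 45, False, True),
    (5, 45, "male", 30, True, False),
    (6, 56, "male", 32, True, True),
    (7, 31, "male", 15, False, False),
    (8, 23, "female", 30, True, False),
]

def age_band(age):
    # Bands from the text: 20-30, 30-40, 40+ (lower bound inclusive: an assumption).
    if age < 30:
        return "20-30"
    if age < 40:
        return "30-40"
    return "40+"

def income_band(income):
    # Levels from the text, in W: 10-20, 20-40, 40+ (lower bound inclusive: an assumption).
    if income < 20:
        return "10-20"
    if income < 40:
        return "20-40"
    return "40+"

def features(age, gender, income, married):
    # Order matches b1..b4 in the text: age, income, marital status, gender.
    return (age_band(age), income_band(income), married, gender)

# The new customer: 34, female, 31W, unmarried.
NEW_CUSTOMER = features(34, "female", 31, False)

def likelihoods(records):
    # For each feature of the new customer, the fraction of `records` that share it.
    rows = [features(*r[1:5]) for r in records]
    return [
        Fraction(sum(row[i] == value for row in rows), len(rows))
        for i, value in enumerate(NEW_CUSTOMER)
    ]

given_a1 = likelihoods([c for c in CUSTOMERS if c[5]])      # bought a house
given_a2 = likelihoods([c for c in CUSTOMERS if not c[5]])  # did not buy

print([str(p) for p in given_a1])  # ['0', '2/3', '1/3', '1/3']
print([str(p) for p in given_a2])  # ['2/5', '2/5', '3/5', '1/5']
```

A zero count like this is a common issue with small samples, because a single empty band forces the whole product to zero; smoothing the counts is one usual remedy, though the text does not say which approach it takes.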


Now we combine the results:

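The (unnormalized) probability that the new customer buys a house is P(b|a1)P(a1). Treating the four dimensions as conditionally independent given the outcome, P(b|a1) = P(b1|a1)P(b2|a1)P(b3|a1)P(b4|a1), so the result is 1/3 * 2/3 * 1/3 * 1/3 * 3/8 = 1/108 ≈ 0.0093.


The (unnormalized) probability that the new customer does not buy a house is P(b|a2)P(a2), where P(b|a2) = P(b1|a2)P(b2|a2)P(b3|a2)P(b4|a2), so the result is 0.4 * 0.4 * 0.6 * 0.2 * 5/8 = 0.012.


Both values omit the same denominator P(b) from Bayes' theorem, so they can be compared directly. The value for not buying is larger, so this customer is classified as unlikely to buy a house.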
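The combination step can be sketched in Python as follows. The sketch takes the conditional probabilities exactly as stated in the text. The dictionary names and the normalization at the end (dividing by the sum of the two scores so that they add up to 1) are additions for illustration, not part of the original calculation.

```python
from fractions import Fraction
from math import prod  # requires Python 3.8+

# Priors P(a1), P(a2) as stated in the text.
prior = {"buy": Fraction(3, 8), "not buy": Fraction(5, 8)}

# Conditional probabilities as stated in the text, in the order
# age 30-40, income 20-40, unmarried, female.
likelihood = {
    "buy": [Fraction(1, 3), Fraction(2, 3), Fraction(1, 3), Fraction(1, 3)],
    "not buy": [Fraction(2, 5), Fraction(2, 5), Fraction(3, 5), Fraction(1, 5)],
}

# Naive Bayes score: product of the per-dimension probabilities times the prior.
# This assumes the dimensions are conditionally independent given the class.
score = {c: prod(likelihood[c]) * prior[c] for c in prior}

# Optional normalization: divide by the sum of scores to get posteriors.
total = sum(score.values())
posterior = {c: score[c] / total for c in score}

# Pick the class with the larger score.
prediction = max(score, key=score.get)

for c in score:
    print(c, round(float(score[c]), 4), round(float(posterior[c]), 4))
print("prediction:", prediction)  # prediction: not buy
```

With the stated inputs, the scores are 1/108 and 3/250, and normalizing gives roughly 0.44 for buying against 0.56 for not buying, so the margin between the two classes is modest.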




