Article source: Official Account-Intelligent IT System.
Bayes' theorem inverts a conditional probability: it derives the posterior probability of a class from its prior probability and the likelihood of the observed data. For a class a and an observation b, the formula is as follows:
P(a|b) = P(b|a) P(a) / P(b)
In big data analysis, this theorem is well suited to inference and prediction. E-commerce companies and user-targeting systems, for example, can use it to infer unknown data from existing data and classify users for follow-up operations.
For example, suppose the website of a home-buying agency already has 8 customers, with the following information (income is given in W, i.e. units of 10,000 yuan):
User ID | Age | Gender | Income | Married | Bought a house |
1 | 27 | Male | 15W | No | No |
2 | 47 | Female | 30W | Yes | Yes |
3 | 32 | Male | 12W | No | No |
4 | 24 | Male | 45W | No | Yes |
5 | 45 | Male | 30W | Yes | No |
6 | 56 | Male | 32W | Yes | Yes |
7 | 31 | Male | 15W | No | No |
8 | 23 | Female | 30W | Yes | No |
Now a new customer arrives who has not yet bought a house, with the following information:
Age | Gender | Income | Married |
34 | Female | 31W | No |
How do we judge whether she is likely to buy, and whether it is worth sending her a home-buying recommendation?
We use Bayes' theorem to estimate the probability. The 8 existing customers are described by four dimensions: age, gender, income, and marital status. These four dimensions are the features we use to judge whether a customer ends up buying a house. We split the records into two tables according to whether the customer bought a house:
Those who bought a house (Exhibit 1):
User ID | Age | Gender | Income | Married | Bought a house |
2 | 47 | Female | 30W | Yes | Yes |
4 | 24 | Male | 45W | No | Yes |
6 | 56 | Male | 32W | Yes | Yes |
Those who did not buy a house (Exhibit 2):
User ID | Age | Gender | Income | Married | Bought a house |
1 | 27 | Male | 15W | No | No |
3 | 32 | Male | 12W | No | No |
5 | 45 | Male | 30W | Yes | No |
7 | 31 | Male | 15W | No | No |
8 | 23 | Female | 30W | Yes | No |
Let P(a1) denote the prior probability of buying a house, which is 3/8, and P(a2) the prior probability of not buying a house, which is 5/8.
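The split into two exhibits and the two priors can be reproduced with a few lines of Python. This is only a minimal sketch: the tuple layout and the variable names (`customers`, `bought`, `not_bought`, `p_a1`, `p_a2`) are assumptions made for illustration, and income is stored as a plain number in units of W.

```python
from fractions import Fraction

# Records from the customer table:
# (user_id, age, gender, income_in_w, married, bought_house)
customers = [
    (1, 27, "male", 15, False, False),
    (2, 47, "female", 30, True, True),
    (3, 32, "male", 12, False, False),
    (4, 24, "male", 45, False, True),
    (5, 45, "male", 30, True, False),
    (6, 56, "male", 32, True, True),
    (7, 31, "male", 15, False, False),
    (8, 23, "female", 30, True, False),
]

# Split by outcome, mirroring Exhibit 1 and Exhibit 2.
bought = [c for c in customers if c[5]]
not_bought = [c for c in customers if not c[5]]

# Class priors as exact fractions.
p_a1 = Fraction(len(bought), len(customers))      # P(a1): bought a house
p_a2 = Fraction(len(not_bought), len(customers))  # P(a2): did not buy

print(p_a1, p_a2)  # 3/8 5/8
```

Using `Fraction` instead of floats keeps the values identical to the hand-computed ones, so rounding does not get in the way of checking them.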
We now go through the four dimensions in turn:
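age:
Here we divide age into three bands: 20-30, 30-40, and 40+. The new customer, aged 34, falls in the 30-40 band.
P(b1|a1) --- The probability that a buyer is aged 30-40, taken as 1/3 in the calculation below (a strict count from Exhibit 1 gives 0/3, since the three buyers are aged 47, 24 and 56, and that would make the whole product for a1 zero)
P(b1|a2) --- The probability that a non-buyer is aged 30-40 is 2/5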
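The banding and counting step for age can be sketched as follows. The helper names `age_band` and `band_fraction` are hypothetical, and treating the bands as half-open intervals (30 belongs to 30-40, 40 belongs to 40+) is an assumption, since the text does not say where the boundary ages go. Note that counting directly from the two exhibits reproduces 2/5 for non-buyers but yields 0/3 for buyers rather than the 1/3 quoted above.

```python
from fractions import Fraction

def age_band(age):
    """Map an age to one of the three bands (half-open intervals assumed)."""
    if age < 20:
        raise ValueError("ages below 20 are outside the bands used here")
    if age < 30:
        return "20-30"
    if age < 40:
        return "30-40"
    return "40+"

def band_fraction(ages, band):
    """Fraction of the given ages that fall into the given band."""
    hits = sum(1 for a in ages if age_band(a) == band)
    return Fraction(hits, len(ages))

buyer_ages = [47, 24, 56]              # Exhibit 1
non_buyer_ages = [27, 32, 45, 31, 23]  # Exhibit 2

new_band = age_band(34)                # the new customer is 34
print(new_band)                                  # 30-40
print(band_fraction(non_buyer_ages, new_band))   # 2/5
print(band_fraction(buyer_ages, new_band))       # 0
```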
income:
Here we divide income into three levels: 10-20W, 20-40W, and 40W+. The new customer's income of 31W is in the 20-40W range.
P(b2|a1) --- The probability that a buyer's income is 20-40W is 2/3
P(b2|a2) --- The probability that a non-buyer's income is 20-40W is 2/5
marital status:
The new customer is unmarried.
P(b3|a1) --- The probability that a buyer is unmarried is 1/3
P(b3|a2) --- The probability that a non-buyer is unmarried is 3/5
gender:
The new customer is female.
P(b4|a1) --- The probability that a buyer is female is 1/3
P(b4|a2) --- The probability that a non-buyer is female is 1/5
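The income, marital status and gender fractions above are all obtained the same way: count the matching rows in each exhibit and divide by the size of that exhibit. A minimal sketch, assuming the income has already been mapped to its level and using a hypothetical helper named `conditional`:

```python
from fractions import Fraction

# (income_level, married, gender) for each customer, grouped by outcome.
bought = [            # Exhibit 1: users 2, 4, 6
    ("20-40", True, "female"),
    ("40+", False, "male"),
    ("20-40", True, "male"),
]
not_bought = [        # Exhibit 2: users 1, 3, 5, 7, 8
    ("10-20", False, "male"),
    ("10-20", False, "male"),
    ("20-40", True, "male"),
    ("10-20", False, "male"),
    ("20-40", True, "female"),
]

# The new customer: income 31W, unmarried, female.
new_customer = ("20-40", False, "female")

def conditional(rows, index, value):
    """Fraction of rows whose feature at `index` equals `value`."""
    return Fraction(sum(1 for r in rows if r[index] == value), len(rows))

names = ["income", "marital status", "gender"]
likelihoods = {
    name: (
        conditional(bought, i, new_customer[i]),      # P(b|a1)
        conditional(not_bought, i, new_customer[i]),  # P(b|a2)
    )
    for i, name in enumerate(names)
}

for name, (given_buy, given_not_buy) in likelihoods.items():
    print(name, given_buy, given_not_buy)
# income 2/3 2/5
# marital status 1/3 3/5
# gender 1/3 1/5
```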
Now we combine the results:
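The probability that the new customer buys a house is proportional to P(b|a1)P(a1), where, treating the four dimensions as independent, P(b|a1) = P(b1|a1)P(b2|a1)P(b3|a1)P(b4|a1). This gives 0.33*0.66*0.33*0.33*3/8 = 0.0089 (about 0.0093 with exact fractions).
The probability that the new customer does not buy a house is proportional to P(b|a2)P(a2), where P(b|a2) = P(b1|a2)P(b2|a2)P(b3|a2)P(b4|a2). This gives 0.4*0.4*0.6*0.2*5/8 = 0.012.
The second value is larger, so the customer is more likely not to buy a house and can be classified into the "will not buy" category.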
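The final combination can be checked with a short script. This is a sketch that plugs in the conditional probabilities exactly as quoted in the text (including 1/3 for the age dimension among buyers); the dictionary names are assumptions. Dividing each score by their sum, which plays the role of P(b), is an extra step not carried out above; it turns the two scores into posteriors that add up to 1.

```python
from fractions import Fraction
from math import prod

# Priors P(a1), P(a2) and the per-dimension conditionals quoted in the text,
# in the order: age, income, marital status, gender.
prior = {"buy": Fraction(3, 8), "not buy": Fraction(5, 8)}
likelihoods = {
    "buy": [Fraction(1, 3), Fraction(2, 3), Fraction(1, 3), Fraction(1, 3)],
    "not buy": [Fraction(2, 5), Fraction(2, 5), Fraction(3, 5), Fraction(1, 5)],
}

# Unnormalized scores P(b|a) * P(a), assuming independent dimensions.
scores = {label: prod(likelihoods[label]) * prior[label] for label in prior}

# Normalize so that the two posteriors sum to 1.
total = sum(scores.values())
posterior = {label: score / total for label, score in scores.items()}

# Pick the class with the larger score.
prediction = max(scores, key=scores.get)

for label in scores:
    print(f"{label}: score={float(scores[label]):.4f} "
          f"posterior={float(posterior[label]):.4f}")
# buy: score=0.0093 posterior=0.4355
# not buy: score=0.0120 posterior=0.5645
print(prediction)  # not buy
```

With exact fractions the "buy" score is 1/108, slightly above the 0.0089 obtained from the rounded factors 0.33 and 0.66; the classification is the same either way.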