Naive Bayes model and case (Python)

Table of contents

1 Algorithm Principle of Naive Bayes

2 Bayesian model under one-dimensional feature variables

3 Bayesian model under two-dimensional feature variables

4 Bayesian model under n-dimensional feature variables

5 sklearn implementation of Naive Bayesian model

6 Case: Tumor Prediction Model

6.1 Reading data and partitioning

6.1.1 Reading data

6.1.2 Divide feature variables and target variables

6.2 Model construction and use

6.2.1 Divide training set and test set

6.2.2 Model building

6.2.3 Model prediction and evaluation

Reference books


1 Algorithm Principle of Naive Bayes

Bayesian classification is one of the most widely used classification algorithms in machine learning.

Naive Bayes is the simplest type of Bayesian model, and the core of the algorithm is Bayes' formula:

P(A|B) = P(B|A)P(A) / P(B)

Here P(A) is the probability that event A occurs, P(B) is the probability that event B occurs, P(A|B) is the probability that event A occurs given that event B has occurred, and likewise P(B|A) is the probability that event B occurs given that event A has occurred.

A simple example: it is known that the probability P(A) that a person catches a cold (event A) in winter is 40%, the probability P(B) that a person sneezes (event B) is 80%, and the probability P(B|A) of sneezing given a cold is 100%. What is the probability P(A|B) that a person who starts sneezing has a cold? By Bayes' formula:

P(A|B) = P(B|A)P(A) / P(B) = (100% × 40%) / 80% = 50%

2 Bayesian model under one-dimensional feature variables

Now consider a more detailed example: judging whether a person has a cold. Suppose there are already 5 groups of sample data.

Only one feature variable is used, sneezing (X1): a value of 1 means the person sneezes and 0 means not. The target variable is cold (Y): 1 means the person has a cold and 0 means not. The samples (inferred here from the probabilities used in the calculations that follow) are:

| Sample | Sneezing (X1) | Cold (Y) |
| --- | --- | --- |
| 1 | 1 | 1 |
| 2 | 1 | 1 |
| 3 | 1 | 1 |
| 4 | 0 | 1 |
| 5 | 1 | 0 |

Now, based on the above data, use the Bayesian formula to predict whether a person has a cold.

For example, if a person sneezes (X1 = 1), does he have a cold? This amounts to predicting the probability P(Y|X1). Substituting the feature variable and the target variable into Bayes' formula gives:

P(Y|X1) = P(X1|Y)P(Y) / P(X1)

From the data above, the probability of having a cold (Y = 1) given sneezing (X1 = 1) is:

P(Y=1|X1=1) = P(X1=1|Y=1)P(Y=1) / P(X1=1) = (3/4 × 4/5) / (4/5) = 3/4

Here P(X1=1|Y=1) is the probability of sneezing given a cold: 3 of the 4 cold samples sneezed, so it is 3/4. P(Y=1) is the probability of a cold across all samples: 4 of the 5 samples have a cold, so it is 4/5. P(X1=1) is the probability of sneezing across all samples: 4 of the 5 samples sneezed, so it is also 4/5.

Similarly, the probability of not having a cold (Y = 0) given sneezing (X1 = 1) is:

P(Y=0|X1=1) = P(X1=1|Y=0)P(Y=0) / P(X1=1) = (1 × 1/5) / (4/5) = 1/4

Here P(X1=1|Y=0) is the probability of sneezing given no cold, which is 1 (the single cold-free sample sneezed); P(Y=0) is the probability of no cold across all samples, which is 1/5; and P(X1=1) is again 4/5.

Because 3/4 is greater than 1/4, a person who sneezes is more likely to have a cold than not.
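These hand calculations can be verified with a short script. The five samples below are the ones inferred from the probabilities quoted above, and exact fractions are used to avoid rounding:

```python
from fractions import Fraction

# Five samples inferred from the probabilities in the text: (sneezing X1, cold Y)
samples = [(1, 1), (1, 1), (1, 1), (0, 1), (1, 0)]
n = len(samples)

p_y1 = Fraction(sum(y for _, y in samples), n)   # P(Y=1) = 4/5
p_y0 = 1 - p_y1                                  # P(Y=0) = 1/5
p_x1 = Fraction(sum(x for x, _ in samples), n)   # P(X1=1) = 4/5

cold = [x for x, y in samples if y == 1]
not_cold = [x for x, y in samples if y == 0]
p_x1_given_y1 = Fraction(sum(cold), len(cold))          # P(X1=1|Y=1) = 3/4
p_x1_given_y0 = Fraction(sum(not_cold), len(not_cold))  # P(X1=1|Y=0) = 1

# Bayes' formula for both target values
p_y1_given_x1 = p_x1_given_y1 * p_y1 / p_x1
p_y0_given_x1 = p_x1_given_y0 * p_y0 / p_x1
print(p_y1_given_x1, p_y0_given_x1)  # 3/4 1/4
```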

3 Bayesian model under two-dimensional feature variables

Building on the single feature above, add a second feature variable, headache (X2): a value of 1 means the person has a headache and 0 means not; the target variable is still cold (Y). In the new sample data, 3 of the 4 cold samples have a headache, while the single cold-free sample both sneezes and has a headache.

Based on this data, Bayes' formula is again used to predict whether a person has a cold. For example, if a person sneezes and has a headache (X1 = 1, X2 = 1), does he have a cold? This amounts to predicting the probability P(Y|X1, X2). Substituting the feature variables and the target variable into Bayes' formula gives:

P(Y|X1, X2) = P(X1, X2|Y)P(Y) / P(X1, X2)

To compare P(Y=1|X1, X2) with P(Y=0|X1, X2), note from the formula above that the two share the same denominator P(X1, X2), so it suffices to compute and compare the numerators P(X1, X2|Y)P(Y).

Before calculating, we need the independence assumption of the Naive Bayes model: the features are assumed to be conditionally independent of each other, that is, P(X1, X2|Y) = P(X1|Y)P(X2|Y). The numerator can therefore be rewritten as:

P(X1|Y)P(X2|Y)P(Y)

Under the independence assumption, computing the probability P(Y=1|X1=1, X2=1) of a cold (Y = 1) given sneezing and headache (X1 = 1, X2 = 1) reduces to computing the numerator:

P(X1=1|Y=1)P(X2=1|Y=1)P(Y=1) = 3/4 × 3/4 × 4/5 = 9/20

Similarly, computing the probability P(Y=0|X1=1, X2=1) of no cold (Y = 0) under the same conditions reduces to computing:

P(X1=1|Y=0)P(X2=1|Y=0)P(Y=0) = 1 × 1 × 1/5 = 1/5

Because 9/20 is greater than 1/5, a person who sneezes and has a headache is more likely to have a cold than not.
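As a quick check (a sketch, using the conditional probabilities implied by the sample data), the two numerators can be computed directly:

```python
from fractions import Fraction

# Probabilities implied by the sample data in the text
p_y1, p_y0 = Fraction(4, 5), Fraction(1, 5)        # P(Y=1), P(Y=0)
p_x1_y1, p_x2_y1 = Fraction(3, 4), Fraction(3, 4)  # P(X1=1|Y=1), P(X2=1|Y=1)
p_x1_y0, p_x2_y0 = Fraction(1), Fraction(1)        # P(X1=1|Y=0), P(X2=1|Y=0)

# Numerators under the naive independence assumption
num_cold = p_x1_y1 * p_x2_y1 * p_y1      # P(X1|Y=1)P(X2|Y=1)P(Y=1)
num_no_cold = p_x1_y0 * p_x2_y0 * p_y0   # P(X1|Y=0)P(X2|Y=0)P(Y=0)
print(num_cold, num_no_cold)  # 9/20 1/5
```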

4 Bayesian model under n-dimensional feature variables

Extending from 2 feature variables to n feature variables X1, X2, ..., Xn, Bayes' formula becomes:

P(Y|X1, X2, ..., Xn) = P(X1, X2, ..., Xn|Y)P(Y) / P(X1, X2, ..., Xn)

The Naive Bayes model assumes that the features are conditionally independent of each other given the target value, so the numerator can be written as:

P(X1|Y)P(X2|Y)...P(Xn|Y)P(Y)

Since P(X1|Y), P(X2|Y), ..., P(Xn|Y) and P(Y) can all be estimated from the known data, the probability of each target value given the observed feature values can be computed and compared, and the sample is assigned to the class with the higher probability.
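As an illustrative sketch (not code from the original book), this decision rule can be written as a small function that multiplies the per-feature conditional probabilities by the class prior and returns the class with the larger numerator; the two-feature cold example above serves as a check:

```python
from fractions import Fraction

def naive_bayes_class(priors, cond_probs, x):
    """Return (class, scores) where scores[c] = P(X1|Y=c)...P(Xn|Y=c)P(Y=c).

    priors: {class: P(Y=class)}
    cond_probs: {class: [P(Xi=1 | Y=class) for each feature i]}
    x: observed binary feature values (0 or 1)
    """
    scores = {}
    for c, prior in priors.items():
        score = prior
        for p, xi in zip(cond_probs[c], x):
            # P(Xi=xi | Y=c) is p when xi=1 and 1-p when xi=0
            score *= p if xi == 1 else 1 - p
        scores[c] = score
    return max(scores, key=scores.get), scores

# The two-feature cold example from the text
priors = {1: Fraction(4, 5), 0: Fraction(1, 5)}
cond_probs = {1: [Fraction(3, 4), Fraction(3, 4)], 0: [Fraction(1), Fraction(1)]}
label, scores = naive_bayes_class(priors, cond_probs, [1, 1])
print(label, scores[1], scores[0])  # 1 9/20 1/5
```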

5 sklearn implementation of Naive Bayesian model

# A Gaussian Naive Bayes model is used here
from sklearn.naive_bayes import GaussianNB

X = [[1,2],[3,4],[5,6],[7,8],[9,10]]
y = [0,0,0,1,1]

model = GaussianNB()
model.fit(X,y)

model.predict([[5,5]])

# Output:
# array([0])

6 Case: Tumor Prediction Model

Taking a typical tumor prediction model from the medical industry as an example, this article shows how to apply the Naive Bayes model in practice to predict whether a tumor is benign or malignant.

The nature of a tumor affects the choice of treatment and the patient's speed of recovery. Traditionally, doctors judge the nature of a tumor from dozens of indicators; the quality of that prediction depends on the doctor's personal experience, and the process is inefficient. With machine learning, we can expect to predict the nature of a tumor quickly.

6.1 Reading data and partitioning

6.1.1 Reading data

First, the data covering 6 feature dimensions and the tumor nature of breast tumor patients from a hospital is imported: 569 patients in total, of whom 358 have benign tumors and 211 have malignant tumors.

The six feature variables are "maximum perimeter", "maximum concavity", "average concavity", "maximum area", "maximum radius", and "average gray value".

The target variable is "tumor nature", where 0 means the tumor is malignant and 1 means the tumor is benign.
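The data-reading code itself is not reproduced above. As a sketch: in the original case the data would be loaded from a file (the filename below is an assumption), and a small stand-in DataFrame with the same column layout is built here so the snippet is self-contained:

```python
import pandas as pd

# In the original case the data would be read from a file, e.g.:
# df = pd.read_excel('tumor_data.xlsx')  # hypothetical filename

# Small stand-in with the six feature columns and the target column
df = pd.DataFrame({
    'maximum perimeter': [182.1, 74.2, 94.5],
    'maximum concavity': [0.26, 0.08, 0.12],
    'average concavity': [0.10, 0.03, 0.06],
    'maximum area': [2019.0, 428.5, 609.1],
    'maximum radius': [25.4, 11.9, 14.2],
    'average gray value': [17.3, 12.1, 13.5],
    'tumor nature': [0, 1, 1],  # 0 = malignant, 1 = benign
})
print(df.shape)  # (3, 7)
```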

6.1.2 Divide feature variables and target variables
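The splitting code is not shown above; a common pattern, sketched here with a minimal stand-in DataFrame and an assumed target column name of 'tumor nature', is:

```python
import pandas as pd

# Minimal stand-in for the tumor DataFrame (column names assumed)
df = pd.DataFrame({
    'maximum radius': [25.4, 11.9, 14.2],
    'average gray value': [17.3, 12.1, 13.5],
    'tumor nature': [0, 1, 1],
})

# Feature variables: every column except the target; target variable: 'tumor nature'
X = df.drop(columns='tumor nature')
y = df['tumor nature']
print(X.shape, y.shape)  # (3, 2) (3,)
```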

6.2 Model construction and use

6.2.1 Divide training set and test set
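The original train/test-split code is not included; a typical sketch using scikit-learn's train_test_split (the random_state value is an arbitrary choice, and the data here is a stand-in) looks like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in feature matrix and target (the real case has 569 samples)
X = pd.DataFrame({'maximum radius': range(10), 'average gray value': range(10, 20)})
y = pd.Series([0, 1] * 5)

# Hold out 20% of the samples as a test set; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)
print(len(X_train), len(X_test))  # 8 2
```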

6.2.2 Model building
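The model-building code is not shown; a minimal sketch with stand-in training data (the real case would fit on the X_train and y_train produced by the split) is:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Stand-in training data: two well-separated clusters
X_train = np.array([[1.0, 2.0], [1.2, 1.8], [8.0, 9.0], [7.5, 9.5]])
y_train = np.array([0, 0, 1, 1])

# Fit a Gaussian Naive Bayes model, as in section 5
model = GaussianNB()
model.fit(X_train, y_train)
print(model.predict([[8.0, 9.2]]))  # [1]
```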

6.2.3 Model prediction and evaluation

Use the model fitted to the training set to make predictions on the test set.

Using a pandas DataFrame, summarize the predicted values y_pred and the actual values y_test of the test set side by side.

Note: y_pred is a one-dimensional array while y_test is a Series object, so the two need to be converted to a common type before being combined.
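A sketch of that summary step, using stand-in values in place of model.predict(X_test) and the real y_test:

```python
import numpy as np
import pandas as pd

# Stand-ins for the test-set predictions and actual values
y_pred = np.array([1, 1, 0, 1, 1])
y_test = pd.Series([1, 1, 0, 1, 0], index=[104, 12, 377, 56, 201])

# y_pred is a plain array while y_test is a Series with its own index,
# so convert both to lists to line them up in one DataFrame
a = pd.DataFrame()
a['prediction'] = list(y_pred)
a['actual'] = list(y_test)
print(a.head())
```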

It can be seen that 4 of the first 5 predictions match the actual values, an accuracy of 80%.

The prediction accuracy over the entire test set can also be computed, for example with accuracy_score from sklearn.metrics.
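A sketch of that evaluation, again with stand-in values (the real case would pass y_test and model.predict(X_test)):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Stand-in values: 4 of the 5 predictions match the actual values
y_test = np.array([1, 1, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 1])

score = accuracy_score(y_test, y_pred)
print(score)  # 0.8
```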

The Naive Bayes model is a classification model, so the ROC curve can also be used to evaluate its predictive performance.

from sklearn.metrics import roc_curve

# Predicted probability of each class for the test set;
# column 1 is the probability of the positive class (benign, 1)
y_pred_proba = model.predict_proba(X_test)
fpr, tpr, thres = roc_curve(y_test, y_pred_proba[:, 1])

import matplotlib.pyplot as plt
plt.plot(fpr, tpr)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.show()

To sum up, the Naive Bayes model is a classic machine learning model. It is based mainly on Bayes' formula and, during application, treats the features in the data set as independent of each other without considering relationships between them, which makes computation fast. Compared with other classic machine learning models, the generalization ability of the Naive Bayes model is slightly weaker, but its predictions remain good as the number of samples and features grows.

Reference books

"Python Big Data Analysis and Machine Learning Business Case Practice"


Origin blog.csdn.net/qq_42433311/article/details/124193724