Data Mining Learning - Naive Bayesian Classification Algorithm: Breast Cancer Hands-On Practice

Table of contents

1. Statistical knowledge related to Naive Bayesian classification algorithm

2. Naive Bayesian Classifier

3. Python implementation of naive Bayesian classifier

(1) Call the sklearn library, which needs to be installed

(2) Example 1 (check the data distribution and data format)

 (3) Example 2 (Analyze the entire breast cancer data set with the naive Bayesian classification algorithm, and train to obtain a model for judging whether the tumor is benign or malignant)

1. View the features and labels of the dataset

 2. Divide the entire data set into a training set and a test set by train_test_split() and view the data form

 3. Fit the training data set with the Gaussian Naive Bayesian classification algorithm

 4. Draw the confusion matrix

 5. Complete Gaussian Naive Bayesian classification algorithm training breast cancer dataset code


1. Statistical knowledge related to Naive Bayesian classification algorithm

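  (1) Independence formula: if X and Y are independent, P(X,Y)=P(X)*P(Y). Naive Bayes applies this conditionally on the class: P(X1,...,Xn|Y)=P(X1|Y)*...*P(Xn|Y)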

(2) Conditional probability formula: P(X|Y)=P(X,Y)/P(Y), P(Y|X)=P(X,Y)/P(X)

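  (3) Total probability formula: P(X)=Σ_k P(X|Y_k)*P(Y_k), where the events Y_k are mutually exclusive and Σ_k P(Y_k)=1

  (4) Bayesian formula: P(Y_k|X)=P(X|Y_k)*P(Y_k) / [Σ_j P(X|Y_j)*P(Y_j)]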
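
To make the total probability and Bayesian formulas concrete, here is a small numeric sketch for two classes. The prior and likelihood values are made up for illustration and do not come from the breast cancer data set.

```python
# Hypothetical values, chosen only to illustrate the formulas
priors = {"benign": 0.6, "malignant": 0.4}        # P(Y_k)
likelihoods = {"benign": 0.2, "malignant": 0.7}   # P(X | Y_k) for one observed X

# Total probability formula: P(X) = sum_k P(X | Y_k) * P(Y_k)
evidence = sum(likelihoods[k] * priors[k] for k in priors)

# Bayesian formula: P(Y_k | X) = P(X | Y_k) * P(Y_k) / P(X)
posteriors = {k: likelihoods[k] * priors[k] / evidence for k in priors}

print(evidence)
print(posteriors)
```

With these numbers P(X) is about 0.4 and the posterior probabilities are about 0.3 for benign and 0.7 for malignant, so observing X shifts the belief toward the malignant class even though its prior is lower.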

2. Naive Bayesian Classifier

Basic idea: assume that the samples to be classified follow a certain probability distribution and that, given the class, the features are independent of each other (this is the "naive" assumption). First, use the already classified (labeled) samples to estimate the prior probability of each class and the conditional probability of the features within each class. Then use the Bayesian formula to compute the posterior probability that an unclassified sample belongs to each class. Finally, assign the sample to the class with the largest posterior probability.
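
The idea can be sketched from scratch with NumPy for the Gaussian case. This is only an illustrative sketch, not sklearn's implementation: the function names fit_gaussian_nb and predict_gaussian_nb, the var_eps constant, and the toy data are all hypothetical.

```python
import numpy as np

def fit_gaussian_nb(X, y, var_eps=1e-9):
    """Estimate class priors and per-class feature means/variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])           # P(Y_k)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # var_eps is a small constant (an assumption) that avoids division by zero
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + var_eps
    return classes, priors, means, variances

def predict_gaussian_nb(model, X):
    """Pick the class with the largest (log) posterior for each row of X."""
    classes, priors, means, variances = model
    diff = X[:, None, :] - means[None, :, :]                        # shape (n, K, d)
    # Sum of per-feature Gaussian log densities: the naive independence assumption
    log_lik = -0.5 * np.sum(
        np.log(2 * np.pi * variances)[None, :, :] + diff ** 2 / variances[None, :, :],
        axis=2,
    )
    log_post = np.log(priors)[None, :] + log_lik                    # up to a constant
    return classes[np.argmax(log_post, axis=1)]

# Toy data: two well-separated classes with two features each
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.1],
                  [3.0, 4.0], [3.2, 3.9], [2.9, 4.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

model = fit_gaussian_nb(X_toy, y_toy)
preds = predict_gaussian_nb(model, np.array([[1.1, 2.0], [3.1, 4.1]]))
print(preds)
```

The denominator of the Bayesian formula is the same for every class, so the sketch skips it and compares log prior plus log likelihood directly. Working in logarithms avoids numerical underflow when many feature probabilities are multiplied.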

3. Python implementation of naive Bayesian classifier

(1) Call the sklearn library, which needs to be installed

Data set description: this article uses the breast cancer data set that ships with the sklearn library (breast cancer patient data: 569 instances in total, 212 malignant and 357 benign). Each instance has 30 attribute values computed from digitized images of fine-needle aspirates (FNA) of breast masses: the mean, standard error, and worst (largest) value of 10 features, which include radius, perimeter, area, and others.
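
The size of the data set and the number of instances per class can be checked directly from the loaded object. A minimal sketch, where the variable names counts and class_counts are only illustrative:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Number of instances and attributes
print(cancer.data.shape)

# Count how many instances carry each label and pair the counts with the class names
counts = np.bincount(cancer.target)
class_counts = {str(name): int(n) for name, n in zip(cancer.target_names, counts)}
print(class_counts)
```

In this data set label 0 stands for malignant and label 1 for benign, which matters later when reading the confusion matrix.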

(2) Example 1 (check the data distribution and data format)

Example code:

import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer=load_breast_cancer()
cancerdf=pd.DataFrame(cancer.data,columns=cancer.feature_names)
print(cancerdf.head())  # head() shows the first 5 rows by default

  Running results (display the first 5 rows of data):
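
Beyond the first five rows, a summary table gives a quicker feel for the data distribution. The following sketch is an optional addition to the example; the names summary and missing are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
cancerdf = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# Per-feature mean, standard deviation, minimum and maximum (one row per feature)
summary = cancerdf.describe().T[["mean", "std", "min", "max"]]
print(summary.head(10))

# Total number of missing values across the whole table
missing = int(cancerdf.isna().sum().sum())
print(missing)
```

The features live on very different scales (for example, area versus smoothness). Gaussian naive Bayes estimates a separate mean and variance for every feature, so it can be fitted without rescaling them first.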

 (3) Example 2 (Analyze the entire breast cancer data set with the naive Bayesian classification algorithm, and train to obtain a model for judging whether the tumor is benign or malignant)

1. View the features and labels of the dataset

code:

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from matplotlib import pyplot as plt
import seaborn as sns

cancer=load_breast_cancer()
print("Tumor classes:",cancer['target_names'])
print("Tumor features:",cancer['feature_names'])

Running results:

 2. Divide the entire data set into a training set and a test set by train_test_split() and view the data form

code:

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from matplotlib import pyplot as plt
import seaborn as sns
cancer=load_breast_cancer()
x,y=cancer.data,cancer.target
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=3)
print(x_train.shape)  # shape of the training set
print(x_test.shape)  # shape of the test set

Running results:
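
With 569 samples and test_size=0.3, the split leaves 398 rows for training and 171 rows for testing, each with 30 features. As an optional variation that the code above does not use, train_test_split() also accepts a stratify argument that keeps the benign/malignant ratio roughly equal in both subsets. A minimal sketch:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
x, y = cancer.data, cancer.target

# stratify=y asks for the same class proportions in the training and test sets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=3, stratify=y)

print(x_train.shape, x_test.shape)
# Share of label 1 (benign) in each subset
print(np.mean(y_train), np.mean(y_test))
```

Stratifying is a design choice rather than a requirement here: the two classes are only moderately imbalanced, so a plain random split is usually close to the same proportions anyway.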

 3. Fit the training data set with the Gaussian Naive Bayesian classification algorithm

code:

clf=GaussianNB()
clf.fit(x_train,y_train)  # fit the model on the training set
print(clf.score(x_train,y_train))  # accuracy on the training set
print(clf.score(x_test,y_test))  # accuracy on the test set

Running results:

(As you can see, the accuracy on the test set reaches about 0.947.)
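
A single train/test split can be lucky or unlucky, so one way to check how stable the accuracy is would be k-fold cross-validation. The following sketch is an addition to the article's workflow and assumes 5 folds:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

cancer = load_breast_cancer()
x, y = cancer.data, cancer.target

# Train and evaluate GaussianNB on 5 different train/validation splits
scores = cross_val_score(GaussianNB(), x, y, cv=5)

print(scores)         # accuracy of each fold
print(scores.mean())  # average accuracy over the folds
```

If the fold accuracies stay close to the single-split test accuracy, the result does not depend much on the particular random_state used for the split.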

 4. Draw the confusion matrix

code:

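pred=clf.predict(x_test)
# Note: confusion_matrix expects (y_true, y_pred); passing pred first puts the predictions on the rows
cm=confusion_matrix(pred,y_test)
plt.figure(dpi=300)
sns.heatmap(cm,cmap=sns.color_palette("Blues"),annot=True,fmt='d')
plt.xlabel('Actual class')  # columns of cm correspond to y_test
plt.ylabel('Predicted class')  # rows of cm correspond to pred
plt.show()

Running results: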
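
The counts in a confusion matrix can also be turned into numbers such as accuracy, recall, and precision. The sketch below is an addition to the article. It uses the conventional argument order confusion_matrix(y_true, y_pred), so rows are actual classes and columns are predicted classes, which is the transpose of a matrix built with the arguments swapped. The variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=3)

clf = GaussianNB().fit(x_train, y_train)
pred = clf.predict(x_test)

# Rows = actual class, columns = predicted class; label 0 = malignant, 1 = benign
cm = confusion_matrix(y_test, pred)

# Accuracy: correctly classified samples (the diagonal) over all samples
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
# Recall for malignant: share of actual malignant tumors that were detected
malignant_recall = cm[0, 0] / cm[0].sum()
# Precision for malignant: share of malignant predictions that were correct
malignant_precision = cm[0, 0] / cm[:, 0].sum()

print(cm)
print(accuracy, malignant_recall, malignant_precision)
```

For a medical task like this one, the recall of the malignant class is often more informative than overall accuracy, because a malignant tumor classified as benign is the costlier mistake.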

 5. Complete Gaussian Naive Bayesian classification algorithm training breast cancer dataset code

Wrap the code above into functions and add a function that visualizes the confusion matrix to obtain the complete training script, as follows:

(The blogger wrote the following code to suit personal needs; adjust it to your own needs.)

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from matplotlib import pyplot as plt
import seaborn as sns

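#Model training function
def model_fit(x_train,y_train,x_test,y_test):
    clf=GaussianNB()
    clf.fit(x_train,y_train)  # fit the model on the training set
    print(clf.score(x_train,y_train))  # accuracy on the training set
    print(clf.score(x_test,y_test))  # accuracy on the test set
    pred=clf.predict(x_test)
    # confusion_matrix takes (y_true, y_pred): rows are actual classes, columns are predicted classes
    cm=confusion_matrix(y_test,pred)
    return cm

#Confusion matrix visualization
def matplotlib_show(cm):
    plt.figure(dpi=100)  # set the figure resolution
    plt.title('Confusion Matrix')

    labels = ['malignant', 'benign']  # class names in label order (0, 1), as in cancer.target_names
    sns.heatmap(cm, cmap=sns.color_palette("Blues"), annot=True, fmt='d',
                xticklabels=labels, yticklabels=labels)
    plt.ylabel('real_type')  # y axis: actual class (rows of cm)
    plt.xlabel('pred_type')  # x axis: predicted class (columns of cm)
    plt.show()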

if __name__ == '__main__':
    cancer = load_breast_cancer()
    x, y = cancer.data, cancer.target
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=3)
    cm=model_fit(x_train,y_train,x_test,y_test)
    matplotlib_show(cm)

Running results:

 

 

Origin blog.csdn.net/weixin_52135595/article/details/126689536