Naive Bayes-based classification of breast cancer datasets

1. About the author

Gong Yinghao, male, School of Electronic Information, Xi'an Polytechnic University, graduate student, class of 2021
Research direction: Machine Vision and Artificial Intelligence
Email: [email protected]

Wu Yanzi, female, School of Electronic Information, Xi'an Polytechnic University, graduate student, class of 2021, Zhang Hongwei Artificial Intelligence Research Group
Research direction: Pattern Recognition and Artificial Intelligence
Email: [email protected]

2. Naive Bayes Algorithm

2.1 Bayesian algorithm

Let the class set be C = {y(1), y(2), …, y(m)} and the feature vector be X = {x(1), x(2), …, x(n)}. The Bayesian classifier works as follows:
1. Compute the probability of each category given X:

P(y(i)|X), i = 1, 2, …, m

2. The category with the highest posterior probability is the predicted category of X:

y = argmax over y(i) of P(y(i)|X)

However, in practice P(y(i)|X) usually cannot be obtained directly, so the Bayes formula converts the maximization into quantities that can be estimated from data:

P(y(i)|X) = P(X|y(i)) P(y(i)) / P(X)

P(X) is independent of the category and is a fixed value, so it does not affect the maximization. The decision rule therefore simplifies to the following form:

y = argmax over y(i) of P(X|y(i)) P(y(i))

Summary: a Bayesian classifier is a maximum a posteriori (MAP) probability estimate.

2.2 Naive Bayes Algorithm

Why "naive"? Naive means that the features x(1), x(2), …, x(n) in the feature vector X are assumed to be conditionally independent of each other given the class. The class-conditional probability then factorizes into a product over features, and the naive Bayes classifier can be expressed as:

y = argmax over y(i) of P(y(i)) · ∏ (j = 1 to n) P(x(j)|y(i))
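As a concrete illustration of the factorization above, the following sketch picks the class that maximizes P(y) · ∏ P(x(j)|y) for a hypothetical two-class problem with two binary features (all priors and likelihoods here are made-up numbers, not taken from any dataset):

```python
# Hypothetical two-class problem with two binary features.
# Priors P(y) and per-feature likelihoods P(x(j)=1 | y) are made up.
priors = {"benign": 0.6, "malignant": 0.4}
likelihood = {  # P(x(j) = 1 | y) for j = 0, 1
    "benign": [0.2, 0.3],
    "malignant": [0.8, 0.7],
}

def naive_bayes_predict(x):
    """Return the class maximizing P(y) * prod_j P(x(j) | y)."""
    scores = {}
    for y, prior in priors.items():
        score = prior
        for j, xj in enumerate(x):
            p1 = likelihood[y][j]
            score *= p1 if xj == 1 else (1 - p1)
        scores[y] = score
    return max(scores, key=scores.get)

print(naive_bayes_predict([1, 1]))  # both features present -> "malignant"
print(naive_bayes_predict([0, 0]))  # both absent -> "benign"
```

For [1, 1] the scores are 0.6·0.2·0.3 = 0.036 for benign versus 0.4·0.8·0.7 = 0.224 for malignant, so malignant wins; the reverse holds for [0, 0].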

3. Naive Bayes Algorithm in Sklearn

sklearn provides several implementations of the naive Bayes algorithm. They differ mainly in the distribution assumed for P(xi|y), and hence in the parameter estimation method used. The core of the naive Bayes algorithm is the computation of P(xi|y): once P(xi|y) is determined, the probability of belonging to each category follows directly.
Three commonly used Naive Bayes are:
◎Gaussian Naive Bayes (GaussianNB)
◎Bernoulli Naive Bayes (BernoulliNB)
◎Multinomial Naive Bayes (MultinomialNB)

3.1 Gaussian Naive Bayes Algorithm

Gaussian naive Bayes is suitable for continuous variables. It assumes that each feature xi follows a normal distribution under each category y, and internally uses the probability density function of the normal distribution to compute the likelihood:

P(xi|y) = (1 / √(2π σ²_y)) · exp(−(xi − μ_y)² / (2 σ²_y))

where μ_y and σ²_y are the mean and variance of feature xi estimated from the training samples of class y.
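A minimal sketch of this density computation (toy one-dimensional data invented for illustration; assumes scikit-learn is installed) shows that per-class means and variances recover the prediction of sklearn's GaussianNB:

```python
import math
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy 1-D data: class 0 clustered near 0, class 1 near 5.
X = np.array([[0.0], [0.5], [1.0], [4.5], [5.0], [5.5]])
y = np.array([0, 0, 0, 1, 1, 1])

def gaussian_pdf(x, mu, var):
    """Normal density used by Gaussian naive Bayes for P(xi | y)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Per-class mean and variance estimated from the training data.
stats = {c: (X[y == c].mean(), X[y == c].var()) for c in (0, 1)}

def predict(x):
    # Equal priors here (3 samples per class), so comparing the
    # likelihoods alone decides the class.
    return max(stats, key=lambda c: gaussian_pdf(x, *stats[c]))

model = GaussianNB().fit(X, y)
for x in (0.7, 4.8):
    print(predict(x), model.predict([[x]])[0])  # hand-rolled vs sklearn
```

GaussianNB additionally applies a tiny variance-smoothing term, so exact agreement is only expected on well-separated data like this.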

3.2 Multinomial Naive Bayes Algorithm

Multinomial naive Bayes is suitable for discrete variables (typically counts). It assumes that each feature xi follows a multinomial distribution under each category y, so feature values cannot be negative. The smoothed likelihood is estimated as:

P(xi|y) = (N_yi + α) / (N_y + αn)

where N_yi is the total count of feature i over the samples of class y, N_y is the total count over all n features in class y, and α is the smoothing parameter (α = 1 is Laplace smoothing).
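The smoothed estimate can be checked by hand against sklearn's MultinomialNB on toy count data (the counts below are invented for illustration):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count data: 2 samples per class, 3 features (e.g. word counts).
X = np.array([[3, 0, 1],
              [2, 1, 0],
              [0, 4, 2],
              [1, 3, 3]])
y = np.array([0, 0, 1, 1])
alpha = 1.0  # Laplace smoothing

model = MultinomialNB(alpha=alpha).fit(X, y)

# Hand computation of P(xi|y) = (N_yi + alpha) / (N_y + alpha * n)
# for each class, compared with the model's fitted log-probabilities.
for c in (0, 1):
    counts = X[y == c].sum(axis=0)  # N_yi per feature
    theta = (counts + alpha) / (counts.sum() + alpha * X.shape[1])
    assert np.allclose(np.log(theta), model.feature_log_prob_[c])
print("hand-computed smoothed P(xi|y) matches feature_log_prob_")
```

For class 0 the feature counts are [5, 1, 1] with total 7, so the smoothed estimates are [6, 2, 2] / 10 = [0.6, 0.2, 0.2].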

3.3 Bernoulli Naive Bayes

If a trial E has only two possible outcomes, A and its complement A̅, then E is called a Bernoulli trial. Bernoulli naive Bayes is suitable for discrete variables and assumes that each feature xi follows a Bernoulli (two-outcome) distribution under each category y. Because a Bernoulli trial has only two results, the algorithm first binarizes the feature values (say, to 1 and 0), and then computes the likelihood as:

P(xi|y) = P(i|y) · xi + (1 − P(i|y)) · (1 − xi)

On the training set, P(i|y) is estimated as the (smoothed) fraction of class-y samples in which feature i equals 1.
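The binarization step can be made explicit with toy data (invented for illustration; assumes scikit-learn is installed): training BernoulliNB with a binarize threshold is equivalent to thresholding the features yourself and training on the resulting 0/1 matrix.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy continuous features; BernoulliNB(binarize=2.0) maps each value
# to 1 if it exceeds 2.0 and to 0 otherwise before fitting.
X = np.array([[0.5, 3.0],
              [1.0, 4.0],
              [3.5, 0.0],
              [4.0, 1.0]])
y = np.array([0, 0, 1, 1])

model = BernoulliNB(binarize=2.0).fit(X, y)

# Binarizing by hand and training on the 0/1 matrix gives the same model.
Xb = (X > 2.0).astype(float)
model_b = BernoulliNB(binarize=None).fit(Xb, y)
print(np.allclose(model.feature_log_prob_, model_b.feature_log_prob_))  # True
```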

4. Experimental procedure

4.1 Dataset Introduction

The breast cancer dataset built into scikit-learn comes from the Breast Cancer Wisconsin (Diagnostic) dataset in the UC Irvine Machine Learning Repository, which originated from the University of Wisconsin Clinical Sciences Center. Each record represents a follow-up sample of breast cancer data. These are medical indicators collected by Dr. Wolberg during 1984-1995 follow-up of consecutive breast cancer patients, including only cases with invasive breast cancer and no distant metastasis.
Number of sample instances in the dataset: 569, of which 357 are benign and 212 are malignant.
Number of features (attributes): 30 feature attributes and 2 target classes (malignant, benign).
Feature (attribute) information: the 30 numerical measurements are the mean, standard error, and worst (i.e., largest) value of 10 different features of the digitized cell nuclei. The 10 features are:
· radius: mean of distances from center to points on the perimeter
· texture: standard deviation of gray-scale values
· perimeter
· area
· smoothness: local variation in radius lengths
· compactness: perimeter^2 / area - 1.0
· concavity: severity of concave portions of the contour
· concave points: number of concave portions of the contour
· symmetry
· fractal dimension: "coastline approximation" - 1
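These counts can be checked directly from the bundled dataset:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancers = load_breast_cancer()
print(cancers.data.shape)           # (569, 30)
print(list(cancers.target_names))   # ['malignant', 'benign']
# Class counts: target 0 = malignant, target 1 = benign.
print(np.bincount(cancers.target))  # [212 357]
```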

4.2 Code Implementation

# Import the breast cancer dataset bundled with sklearn, classify it with the GaussianNB, MultinomialNB, and BernoulliNB classifiers, and compare their prediction accuracies.

from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split

# Load the dataset and split it
cancers = datasets.load_breast_cancer()
X = cancers.data
Y = cancers.target
#print(X.shape)
#print(Y.shape)
#print(cancers.DESCR)
# Split the dataset into a training set (80%) and a test set (20%)
# Note the return values: x_train, y_train for training; x_test, y_test for testing
# x_train/x_test hold the feature values; y_train/y_test hold the target values
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

# Three naive Bayes classification approaches

# Gaussian naive Bayes classifier
model_linear1 = GaussianNB()
model_linear1.fit(x_train, y_train)
train_score1 = model_linear1.score(x_train, y_train)
test_score1 = model_linear1.score(x_test, y_test)
print('Gaussian NB training accuracy: %f; test accuracy: %f' % (train_score1, test_score1))
preresult = model_linear1.predict(x_test)
print(preresult)
print(y_test)

# Multinomial naive Bayes classifier
model_linear2 = MultinomialNB()
model_linear2.fit(x_train, y_train)
train_score2 = model_linear2.score(x_train, y_train)
test_score2 = model_linear2.score(x_test, y_test)
print('Multinomial NB training accuracy: %f; test accuracy: %f' % (train_score2, test_score2))
preresult = model_linear2.predict(x_test)
print(preresult)
print(y_test)

# Bernoulli naive Bayes classifier
model_linear3 = BernoulliNB()
model_linear3.fit(x_train, y_train)
train_score3 = model_linear3.score(x_train, y_train)
test_score3 = model_linear3.score(x_test, y_test)
print('Bernoulli NB training accuracy: %f; test accuracy: %f' % (train_score3, test_score3))
preresult = model_linear3.predict(x_test)
print(preresult)
print(y_test)
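Because a single random train/test split can swing the reported accuracies from run to run, one possible extension (not part of the original script) is to average 5-fold cross-validation scores for each classifier:

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

cancers = datasets.load_breast_cancer()
X, Y = cancers.data, cancers.target

# Mean accuracy over 5 folds for each of the three classifiers.
for name, model in [("GaussianNB", GaussianNB()),
                    ("MultinomialNB", MultinomialNB()),
                    ("BernoulliNB", BernoulliNB())]:
    scores = cross_val_score(model, X, Y, cv=5)
    print("%s: mean accuracy %.4f" % (name, scores.mean()))
```

With default parameters, BernoulliNB performs poorly here: its default binarize=0.0 threshold maps every (strictly positive) feature to 1, leaving it nothing to discriminate on, so it falls back to predicting the majority class.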

4.3 Experimental results

[Screenshot omitted: console output showing the training and test accuracies of the three classifiers]


Origin: blog.csdn.net/m0_37758063/article/details/123645319