Exploring Bayes - Multinomial/Bernoulli/Complement Naive Bayes

Table of contents

1. Multinomial Naive Bayes and its variations

1.1 Multinomial Naive Bayes MultinomialNB

2. Bernoulli Naive Bayes BernoulliNB

3. Complement Naive Bayes ComplementNB

4. Exploring Bayes: the sample imbalance problem


1. Multinomial Naive Bayes and its variations

1.1 Multinomial Naive Bayes MultinomialNB

class sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)

        Multinomial naive Bayes is also based on Bayes' theorem, but it assumes that the features follow a multinomial distribution. The multinomial distribution comes from the multinomial experiment in statistics: an experiment consists of n repeated trials, each trial has several possible outcomes, and in any given trial the probability of a particular outcome is constant.
        For example, if a feature matrix represents the results of tossing a coin, the probability of heads is P(X=heads|Y) = 0.5 and the probability of tails is P(X=tails|Y) = 0.5. There are only these two possible outcomes, they do not interfere with each other, and their probabilities sum to 1; this is the binomial distribution. In this case, a feature matrix suitable for multinomial naive Bayes would look like this:

trial number    X1: heads appear    X2: tails appear
0               0                   1
1               1                   0
2               1                   0
3               0                   1

Suppose another feature matrix represents the results of throwing a die: the outcome can take values in [1, 2, 3, 4, 5, 6], the six outcomes do not interfere with each other, and as long as the sample size is large enough each outcome occurs with probability 1/6. This is a multinomial distribution, and its feature matrix should look like this:

trial number    1 appears    2 appears    3 appears    4 appears    5 appears    6 appears
0               1            0            0            0            0            0
1               0            0            0            0            0            1
2               0            0            1            0            0            0
……
m               0            0            0            0            0            1

It can be seen that the multinomial distribution is well suited to categorical variables. In its underlying assumptions, the probabilities P(X_{i}|Y) are discrete, and the conditional probabilities P(X_{i}|Y) for different features X_{i} are independent of each other. Although MultinomialNB in sklearn can also take continuous input, in practice, if we really need to handle continuous variables we should use Gaussian naive Bayes instead. The outcomes of a multinomial experiment are very concrete: the features involved are usually counts, frequencies, or numbers of occurrences, all of which are discrete non-negative integers. For this reason, multinomial naive Bayes in sklearn does not accept negative input.
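For instance, a raw column of die-roll outcomes could be turned into exactly this kind of indicator matrix with one-hot encoding. A minimal sketch (the rolls below are made up for illustration):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical raw outcomes of four die rolls, values in 1..6
rolls = np.array([[1], [6], [3], [6]])

# Each output column is an indicator: "did face i appear in this trial?"
encoder = OneHotEncoder(categories=[[1, 2, 3, 4, 5, 6]])
X = encoder.fit_transform(rolls).toarray()
print(X)  # non-negative 0/1 entries, the kind of input MultinomialNB expects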

        Because of these properties, the feature matrix for multinomial naive Bayes is often (though not always) sparse, and the method is frequently used for text classification. We can pair it with the well-known TF-IDF vectors, or with the common and simple word-count vectors. Both are standard text feature extraction techniques that are easy to implement with sklearn.
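As a hedged illustration (the tiny corpus and labels below are invented for demonstration and are not part of the original example), this is roughly how word-count or TF-IDF features can be fed to MultinomialNB:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy corpus: 1 = positive review, 0 = negative review
docs = ["good movie great plot", "great acting good fun",
        "boring movie bad plot", "bad acting boring scenes"]
labels = [1, 1, 0, 0]

# Word-count features: discrete, non-negative counts
vec = CountVectorizer()
X_counts = vec.fit_transform(docs)
clf = MultinomialNB().fit(X_counts, labels)
print(clf.predict(vec.transform(["good plot great acting"])))

# TF-IDF features also work: entries are non-negative real-valued weights
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)
clf2 = MultinomialNB().fit(X_tfidf, labels)
print(clf2.predict(tfidf.transform(["boring bad movie"])))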

        From a mathematical point of view, under a label category Y=c there is a set of parameters corresponding to the features, \theta_{c}=(\theta_{c1},\theta_{c2},...,\theta_{cn}), where n is the total number of features and \theta_{ci} is the parameter corresponding to the i-th feature under this label category. This parameter is defined as:

\theta_{ci} = P(X_{i}|Y=c), that is, the probability of the value taken by feature X_{i} given that the condition Y=c holds. For a feature matrix of shape (m, n) under this label category, we have:

X_{y}=\begin{bmatrix} x_{11} & x_{12} & x_{13} & ... & x_{1n}\\ x_{21} & x_{22} & x_{23} & ... & x_{2n}\\ x_{31} & x_{32} & x_{33} & ... & x_{3n}\\ & & ... & & \\ x_{m1} & x_{m2} & x_{m3} & ... & x_{mn} \end{bmatrix}

Each x_{ji} here is the value of feature X_{i} for sample j. Based on these definitions, the parameters \theta_{c} are estimated by smoothed maximum likelihood estimation:

\hat{\theta}_{ci} = \frac{N_{ci} + \alpha}{N_{c} + \alpha n}

where N_{ci} is the total count of feature i over all samples belonging to class c, and N_{c} = \sum_{i=1}^{n} N_{ci} is the total count of all features for that class.

\alpha is known as the smoothing parameter. Setting \alpha > 0 prevents feature values (such as words) that never appear for some class in the training data from being assigned zero probability, thereby avoiding the situation where a parameter \theta is 0. If \alpha = 1, the smoothing is called Laplace smoothing; if \alpha < 1, it is called Lidstone smoothing. Both are smoothing techniques commonly used in statistics and natural language processing.
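To make the formula concrete, here is a small numerical check against sklearn's fitted attribute feature_log_prob_ (the toy count matrix below is invented for illustration):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up count matrix: 4 samples, 3 features, 2 classes
X = np.array([[2, 1, 0],
              [3, 0, 1],
              [0, 2, 4],
              [1, 1, 3]])
y = np.array([0, 0, 1, 1])
alpha = 1.0  # Laplace smoothing

# Smoothed maximum likelihood estimate of theta for class c = 0
N_ci = X[y == 0].sum(axis=0)      # per-feature counts within class 0
N_c = N_ci.sum()                  # total count of all features within class 0
theta_c0 = (N_ci + alpha) / (N_c + alpha * X.shape[1])

clf = MultinomialNB(alpha=alpha).fit(X, y)
print(np.allclose(np.log(theta_c0), clf.feature_log_prob_[0]))  # expected: True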

In sklearn, the class MultinomialNB that implements multinomial naive Bayes has the following parameters:

Parameters

alpha : float, optional (default 1.0)
The smoothing parameter \alpha for Laplace or Lidstone smoothing. Setting it to 0 means no smoothing at all. Note that smoothing is equivalent to artificially adding some noise to the probabilities, so the larger \alpha is set, the lower the accuracy of multinomial naive Bayes becomes (although the effect is usually small).

fit_prior : bool, optional (default True)
Whether to learn the class prior probabilities P(Y=c). If set to False, all classes share the same prior, i.e. each label class is assumed to appear with probability 1/n_classes.

class_prior : array-like of shape (n_classes,), optional (default None)
The prior probabilities P(Y=c) of the classes. If no specific priors are given, they are computed automatically from the data.

 Usually, when instantiating multinomial naive Bayes, we leave all parameters at their defaults. Let's first build a simple multinomial naive Bayes example:

from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_blobs
from sklearn.metrics import brier_score_loss
class_1=500
class_2=500 # 500 samples for each of the two classes
centers=[[0.0,0.0],[2.0,2.0]] # centers of the two classes
clusters_std=[0.5,0.5] # cluster standard deviations of the two classes
x,y=make_blobs(n_samples=[class_1,class_2],
              centers=centers,
              cluster_std=clusters_std,
              random_state=0,shuffle=False)
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.3,random_state=420)
# Min-max scale so that the input matrix contains no negative values
mms=MinMaxScaler().fit(xtrain)
xtrain_=mms.transform(xtrain)
xtest_=mms.transform(xtest)
# Build the multinomial naive Bayes classifier
mnb=MultinomialNB().fit(xtrain_,ytrain)
(ytrain==1).sum()/ytrain.shape[0] # proportion of training samples with label 1
0.49857142857142855
(ytrain==0).sum()/ytrain.shape[0] # proportion of training samples with label 0
0.5014285714285714
mnb.score(xtest_,ytest)
0.5433333333333333
brier_score_loss(ytest,mnb.predict_proba(xtest_)[:,0],pos_label=0)
0.24977828412546035

The Brier score is quite high and the accuracy is low. Let's convert the features into categorical data by binning xtrain:

from sklearn.preprocessing import KBinsDiscretizer # bin the continuous variables
kbs=KBinsDiscretizer(n_bins=10,encode='onehot').fit(xtrain)
xtrain_=kbs.transform(xtrain)
xtest_=kbs.transform(xtest)
# Build the multinomial naive Bayes classifier on the binned data
mnb=MultinomialNB().fit(xtrain_,ytrain)
mnb.score(xtest_,ytest)
0.9966666666666667
brier_score_loss(ytest,mnb.predict_proba(xtest_)[:,0],pos_label=0)
0.001459393277821188

As you can see, the basic operations and code for multinomial naive Bayes are fairly simple. On the same data, once the features are binned into one-hot dummy variables, the performance of multinomial naive Bayes improves by leaps and bounds.

2. Bernoulli Naive Bayes BernoulliNB

class sklearn.naive_bayes.BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None)

        Multinomial naive Bayes can handle both the binomial distribution (coin tossing) and the multinomial distribution (dice throwing). The binomial distribution, also called the Bernoulli distribution, is common in practice and has many convenient mathematical properties. So, since there is a multinomial naive Bayes, it is natural to also have a naive Bayes for the binomial distribution: Bernoulli naive Bayes.
        The Bernoulli naive Bayes class BernoulliNB assumes that the data follow a multivariate Bernoulli distribution and applies the naive Bayes training and classification procedure on that basis. Put simply, a multivariate Bernoulli distribution means that the data set can have many features, but each feature takes only two values, which can be represented by Boolean variables, by {0, 1}, by {-1, 1}, or by any other two-category encoding. Therefore, this class requires the samples to be represented as binary feature vectors; if the data are not already binary, the binarize parameter of the class can be used to binarize them.
        Bernoulli naive Bayes is very similar to multinomial naive Bayes, and both are commonly used for text classification. But because Bernoulli naive Bayes models a binomial distribution, it cares about "present or absent" rather than "how many times something appears"; this is the fundamental difference between Bernoulli Bayes and multinomial Bayes. In text classification, Bernoulli naive Bayes can be trained on word-occurrence vectors instead of word-count vectors, and it tends to work better on data sets with shorter documents.
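As a hedged sketch of this "occurrence rather than count" idea (the toy documents and labels below are invented, not from the original post), a word-occurrence representation can be produced with CountVectorizer(binary=True):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Made-up toy corpus: 1 = spam, 0 = not spam
docs = ["win free money offer", "free offer win prize",
        "project meeting agenda notes", "meeting notes project today"]
labels = [1, 1, 0, 0]

# binary=True records only whether a word occurs, not how many times
vec = CountVectorizer(binary=True)
X_bin = vec.fit_transform(docs)

bnl = BernoulliNB().fit(X_bin, labels)
print(bnl.predict(vec.transform(["free money offer today"])))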

Parameters

alpha : float, optional (default 1.0)
The smoothing parameter \alpha for Laplace or Lidstone smoothing. Setting it to 0 means no smoothing at all. Note that smoothing is equivalent to artificially adding some noise to the probabilities, so the larger \alpha is set, the lower the accuracy of Bernoulli naive Bayes becomes (although the effect is usually small), and the Brier score also rises gradually.

binarize : float or None, optional (default 0.0)
The threshold used to binarize the features. If set to None, the features are assumed to be binarized already.

fit_prior : bool, optional (default True)
Whether to learn the class prior probabilities P(Y=c). If set to False, the priors are not learned from the data; a uniform prior is used instead, i.e. each label class is assumed to appear with probability 1/n_classes.

class_prior : array-like of shape (n_classes,), optional (default None)
The prior probabilities P(Y=c) of the classes. If no specific priors are given, they are computed automatically from the data.

 In sklearn, the implementation of Bernoulli Naive Bayes is also very simple:

from sklearn.naive_bayes import BernoulliNB
# Normally the features would be binarized with sklearn.preprocessing.Binarizer,
# but that is rather inefficient here, so instead we min-max scale the data
# and then set a threshold directly through the binarize parameter
mms=MinMaxScaler().fit(xtrain)
xtrain_=mms.transform(xtrain)
xtest_=mms.transform(xtest)
# Without binarization
bnl_=BernoulliNB().fit(xtrain_,ytrain)
bnl_.score(xtest_,ytest)
0.49666666666666665
brier_score_loss(ytest,bnl_.predict_proba(xtest_)[:,1],pos_label=1)
0.25000009482193225
# With the binarization threshold set to 0.5
bnl_=BernoulliNB(binarize=0.5).fit(xtrain_,ytrain)
bnl_.score(xtest_,ytest)
0.983333333333333
brier_score_loss(ytest,bnl_.predict_proba(xtest_)[:,1],pos_label=1)
0.010405875827339534

Like multinomial Bayes, the results of Bernoulli Bayes are strongly affected by the structure of the data. Choosing the Bayes variant that matches the form of the data is therefore a very important part of Bayesian model selection.

3. Complement Naive Bayes ComplementNB

Complement naive Bayes (ComplementNB) is an improvement on multinomial naive Bayes that is particularly suited to imbalanced data sets, and its parameters are very similar to those of MultinomialNB.
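In current sklearn versions the class also exposes a norm parameter, but otherwise its interface mirrors MultinomialNB. A minimal sketch (assuming the binned xtrain_, xtest_, ytrain, ytest and brier_score_loss from the example in section 1.1 are still in scope), dropping it in as a direct replacement:

from sklearn.naive_bayes import ComplementNB

# Same requirements as MultinomialNB: non-negative features, alpha smoothing
cnb = ComplementNB().fit(xtrain_, ytrain)
print(cnb.score(xtest_, ytest))
print(brier_score_loss(ytest, cnb.predict_proba(xtest_)[:, 0], pos_label=0))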

4. Exploring Bayes: the sample imbalance problem

from sklearn.naive_bayes import MultinomialNB,GaussianNB,BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_blobs
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.metrics import brier_score_loss as BS,recall_score,roc_auc_score as AUC
class_1=50000 # majority class: 50000 samples
class_2=500 # minority class: 500 samples
centers=[[0.0,0.0],[5.0,5.0]]
clusters_std=[3,1] # cluster standard deviations of the two classes
x,y=make_blobs(n_samples=[class_1,class_2],
               centers=centers,
               cluster_std=clusters_std,
               random_state=0,shuffle=False
              )
from sklearn.naive_bayes import ComplementNB
from time import time
import datetime
names=["Multinomial","GaussianNB","Bernoulli","Complement"]
models=[MultinomialNB(),GaussianNB(),BernoulliNB(),ComplementNB()]
for name,clf in zip(names,models):
    times=time()
    xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.3,random_state=420)
    # Preprocessing: bin and one-hot encode for every model except GaussianNB
    if name!="GaussianNB":
        kbs=KBinsDiscretizer(n_bins=10,encode='onehot').fit(xtrain)
        xtrain=kbs.transform(xtrain)
        xtest=kbs.transform(xtest)

    clf.fit(xtrain,ytrain)
    y_pred=clf.predict(xtest)
    proba=clf.predict_proba(xtest)[:,1]
    score=clf.score(xtest,ytest)
    print(name)
    print("\tBrier:{:.3f}".format(BS(ytest,proba,pos_label=1)))
    print("\tAccuracy:{:.3f}".format(score))
    print("\tRecall:{:.3f}".format(recall_score(ytest,y_pred)))
    print("\tAUC:{:.3f}".format(AUC(ytest,proba)))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
    
Multinomial
	Brier:0.007
	Accuracy:0.990
	Recall:0.000
	AUC:0.991
00:00:040009
GaussianNB
	Brier:0.006
	Accuracy:0.990
	Recall:0.438
	AUC:0.993
00:00:023005
Bernoulli
	Brier:0.009
	Accuracy:0.987
	Recall:0.771
	AUC:0.987
00:00:035008
Complement
	Brier:0.038
	Accuracy:0.953
	Recall:0.987
	AUC:0.991
00:00:033008

It can be seen that complement naive Bayes sacrifices some overall accuracy and Brier score, but obtains a very high recall, capturing 98.7% of the minority class, while keeping an AUC essentially consistent with that of multinomial naive Bayes. Compared with the other Bayes variants, complement naive Bayes is also quite fast. If our goal is to capture the minority class, we will undoubtedly want to choose complement naive Bayes as our algorithm.

Origin blog.csdn.net/weixin_60200880/article/details/129307096