9. Bernoulli Naive Bayes algorithm (Bernoulli NB, Bernoulli Naive Bayes) (supervised learning)

Bernoulli Naive Bayes: Naive Bayes classifier for multivariate Bernoulli models

1. Algorithm idea

Like the multinomial classifier (MultinomialNB), this classifier works on discrete data. The difference is that the multinomial classifier works on occurrence counts, while the Bernoulli classifier works on binary/boolean features.
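
To make the idea concrete, here is a minimal NumPy sketch of a Bernoulli model. It is not scikit-learn's actual implementation, and the toy data and function names (fit_bernoulli, predict_bernoulli) are made up for illustration. For each class it estimates the smoothed probability that each binary feature equals 1, then scores a sample with the Bernoulli log-likelihood, which also accounts for features that are absent.

```python
import numpy as np

def fit_bernoulli(X, y, alpha=1.0):
    """Estimate class log-priors and smoothed P(feature = 1 | class)."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # (count of ones + alpha) / (class size + 2 * alpha): each feature has two outcomes
    prob = np.array([
        (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
        for c in classes
    ])
    return classes, log_prior, prob

def predict_bernoulli(X, classes, log_prior, prob):
    """Pick the class with the highest joint log-probability."""
    # x * log(p) covers present features, (1 - x) * log(1 - p) covers absent ones
    log_lik = X @ np.log(prob).T + (1 - X) @ np.log(1 - prob).T
    return classes[np.argmax(log_lik + log_prior, axis=1)]

# Hypothetical binary data: class 0 tends to set feature 0, class 1 tends to set feature 2
X = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
y = np.array([0, 0, 1, 1])

classes, log_prior, prob = fit_bernoulli(X, y)
pred = predict_bernoulli(np.array([[1, 0, 0], [0, 0, 1]]), classes, log_prior, prob)
print(prob)
print(pred)
```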

2. Official API

Import: from sklearn.naive_bayes import BernoulliNB

class sklearn.naive_bayes.BernoulliNB(*, alpha=1.0, force_alpha='warn', binarize=0.0, fit_prior=True, class_prior=None)

There are several parameters here. For the details of each one, work through the demos in the official documentation and try them out. The commonly used parameters are explained below.

①Smoothing parameter alpha

Additive (Laplace/Lidstone) smoothing parameter (set alpha=0 and force_alpha=True for no smoothing)
A floating point number, default 1.0
You can also pass an array-like of shape (n_features,), giving one smoothing value per feature

Instructions

BernoulliNB(alpha=1.2)
or
alphas = [1.0, 1.2, 0.8, 1.0, 1.5, 1.0]  # one non-negative float per feature (illustrative values)
bernoulli = BernoulliNB(alpha=alphas)
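
The effect of alpha is easiest to see on a single feature. The sketch below applies the additive smoothing formula for a binary feature, (count + alpha) / (n + 2 * alpha), to a feature that never appears in 10 samples of a class. The helper name smoothed_probability and the numbers are hypothetical, chosen only for illustration.

```python
def smoothed_probability(count_ones, n_samples, alpha):
    """Smoothed estimate of P(feature = 1 | class) for one binary feature."""
    return (count_ones + alpha) / (n_samples + 2 * alpha)

# A feature that was never 1 in 10 samples of a class
for alpha in (0.0, 0.5, 1.0, 1.2):
    print(alpha, smoothed_probability(0, 10, alpha))
```

With alpha=0 the estimate is exactly 0, so its logarithm is undefined; any positive alpha keeps the estimate away from 0.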

②force_alpha

If False and alpha is less than 1e-10, alpha is set to 1e-10
If True, alpha is left unchanged
If alpha is too close to 0, this may cause numerical errors

Instructions

BernoulliNB(force_alpha=True)

③Sample feature binarization threshold binarize

binarize: the threshold for binarizing sample features (mapping them to Boolean values)
If None, the input is assumed to already consist of binary vectors
The parameter is a float or None, and the default value is 0.0

Instructions

BernoulliNB(binarize=1.0)
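
As a rough illustration of what the threshold does, the sketch below mimics binarization with plain NumPy: values strictly greater than the threshold become 1, everything else becomes 0. The helper binarize_features is hypothetical, and the row is the sample used in the model testing step of this article; the second threshold (5000.0) is an arbitrary value chosen for contrast.

```python
import numpy as np

def binarize_features(X, threshold):
    """Map values strictly greater than the threshold to 1, all others to 0."""
    return (X > threshold).astype(int)

row = np.array([[16, 18312.5, 6614.5, 2842.31, 25.23, 1147430.19]])

ones_everywhere = binarize_features(row, 1.0)   # every value in this row exceeds 1.0
mixed = binarize_features(row, 5000.0)          # only the larger features exceed 5000.0
print(ones_everywhere)
print(mixed)
```

For features on this scale, a threshold of 1.0 turns every feature into 1, so it is worth checking the threshold against the range of your own data.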

④fit_prior

fit_prior: whether to learn the class prior probabilities from the data. If False, a uniform prior is used. The default is True

Instructions

BernoulliNB(fit_prior=False)

⑤Class prior probability class_prior

class_prior: the prior probabilities of the classes, as an array-like of shape (n_classes,). If specified, the priors are not adjusted according to the data. The default is None

Instructions

priors = [0.3, 0.7]  # one probability per class, summing to 1 (illustrative values)
bernoulli = BernoulliNB(class_prior=priors)
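
To see how fit_prior and class_prior interact, the following sketch fits three models on a tiny made-up binary data set (3 samples of class 0, 1 sample of class 1) and reads back the fitted class_log_prior_ attribute. The data and the prior values [0.3, 0.7] are assumptions for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary data: 3 samples of class 0, 1 sample of class 1
X = np.array([[1, 0], [1, 1], [1, 0], [0, 1]])
y = np.array([0, 0, 0, 1])

learned = BernoulliNB().fit(X, y)                       # prior from class frequencies
uniform = BernoulliNB(fit_prior=False).fit(X, y)        # uniform prior
fixed = BernoulliNB(class_prior=[0.3, 0.7]).fit(X, y)   # user-supplied prior

# class_log_prior_ stores log-probabilities, so exponentiate to read them
learned_prior = np.exp(learned.class_log_prior_)
uniform_prior = np.exp(uniform.class_log_prior_)
fixed_prior = np.exp(fixed.class_log_prior_)
print(learned_prior, uniform_prior, fixed_prior)
```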

⑥Finally build the model

BernoulliNB(alpha=1.2,force_alpha=True,binarize=1.0,fit_prior=False)

3. Code implementation

①Import packages

This step brings in everything needed to train, evaluate, save and load the model. If an import raises an error, install the missing package with pip.

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import joblib
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

②Load the data set

You can create the data set yourself as a simple CSV file. The one used here has 6 independent variables X and 1 dependent variable Y.

fiber = pd.read_csv("./fiber.csv")
fiber.head(5)  # show the first 5 rows

③Divide the data set

The first six columns are the independent variable X, and the last column is the dependent variable Y

Official API of the commonly used data set splitting function: train_test_split
test_size: proportion of the data used for the test set
train_size: proportion of the data used for the training set
random_state: random seed
shuffle: whether to shuffle the data before splitting
My data set has 48 samples in total; with a training share of 0.75 and a test share of 0.25, that gives 36 training samples and 12 test samples

X = fiber.drop(['Grade'], axis=1)
Y = fiber['Grade']

X_train, X_test, y_train, y_test = train_test_split(X,Y,train_size=0.75,test_size=0.25,random_state=42,shuffle=True)

print(X_train.shape) #(36,6)
print(y_train.shape) #(36,)
print(X_test.shape) #(12,6)
print(y_test.shape) #(12,)
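
If you do not have the CSV at hand, the split sizes can be checked on stand-in data. The sketch below uses random values of the same shape (48 rows, 6 columns), not the fiber data, and passes only test_size; when train_size is left unset, train_test_split uses the complement of the test share.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a 48-row data set (random values, not the fiber data)
rng = np.random.default_rng(0)
X = rng.random((48, 6))
y = rng.integers(1, 5, size=48)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, shuffle=True
)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```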

④Construct the BernoulliNB model

You can try setting and adjusting the parameters yourself.

bernoulli = BernoulliNB(alpha=1.2,force_alpha=True,binarize=1.0,fit_prior=False)

⑤Model training

It's that simple: a single call to fit trains the model

bernoulli.fit(X_train,y_train)

⑥Model evaluation

Pass the test set to the model to get its predictions

y_pred = bernoulli.predict(X_test)

Check whether each prediction matches the actual test label: a match counts as 1, a mismatch as 0. The mean of these values is the accuracy.

accuracy = np.mean(y_pred==y_test)
print(accuracy)

You can also evaluate the model with its score method. The idea and the result are the same: the fraction of test samples the model predicts correctly. The score method wraps this calculation, so the arguments differ: it takes the test features and the true labels. No extra import is needed for it; accuracy_score from sklearn.metrics is the standalone equivalent that compares y_test with y_pred.

score = bernoulli.score(X_test,y_test)  # accuracy on the test set
print(score)
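
The metrics imported at the top (confusion_matrix, classification_report, accuracy_score) give a more detailed view than a single accuracy number. The sketch below shows the calls on made-up labels, not on the fiber predictions; in the tutorial you would pass y_test and y_pred instead.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Hypothetical labels, only to demonstrate the calls
y_true = np.array([1, 2, 2, 3, 3, 3])
y_hat = np.array([1, 2, 3, 3, 3, 2])

acc = accuracy_score(y_true, y_hat)     # same value as np.mean(y_true == y_hat)
cm = confusion_matrix(y_true, y_hat)    # rows: true class, columns: predicted class
print(acc)
print(cm)
print(classification_report(y_true, y_hat, zero_division=0))
```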

⑦Model testing

Take a single sample and predict its class with the trained model.
Here the six independent variables are picked arbitrarily and passed to the model.
Print the prediction to see whether it matches the correct result.

test = np.array([[16,18312.5,6614.5,2842.31,25.23,1147430.19]])
prediction = bernoulli.predict(test)
print(prediction) #[2]
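
Besides the predicted label, predict_proba returns the estimated probability of each class, which shows how confident the model is. The sketch below uses a tiny made-up binary data set rather than the fiber model, so the numbers are only illustrative.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary data with two classes labelled 1 and 2
X = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
y = np.array([1, 1, 2, 2])
model = BernoulliNB().fit(X, y)

sample = np.array([[1, 0, 0]])
proba = model.predict_proba(sample)   # one column per entry of model.classes_
label = model.predict(sample)
print(model.classes_)
print(proba)
print(label)
```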

⑧Save the model

bernoulli is the trained model object, so the name must match the variable used above
The second argument is the path where the model is saved

joblib.dump(bernoulli, './bernoulli.model')  # save the model

⑨Load and use the model

bernoulli_yy = joblib.load('./bernoulli.model')

test = np.array([[11,99498,5369,9045.27,28.47,3827588.56]])  # an arbitrary sample
prediction = bernoulli_yy.predict(test)  # feed in the data and predict
print(prediction) #[4]
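
A quick way to confirm that saving and loading worked is to compare the predictions of the original and the restored model. The sketch below does this on a made-up toy model and writes to a temporary directory, so it does not touch ./bernoulli.model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy model, standing in for the trained bernoulli model
X = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
y = np.array([1, 1, 2, 2])
model = BernoulliNB().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bernoulli.model")
    joblib.dump(model, path)        # save
    restored = joblib.load(path)    # load back

same = np.array_equal(model.predict(X), restored.predict(X))
print(same)
```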

Complete code

The complete code covers model training and evaluation; it does not include steps ⑧ and ⑨.

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import joblib
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

fiber = pd.read_csv("./fiber.csv")
fiber.head(5)  # show the first 5 rows

X = fiber.drop(['Grade'], axis=1)
Y = fiber['Grade']

X_train, X_test, y_train, y_test = train_test_split(X,Y,train_size=0.75,test_size=0.25,random_state=42,shuffle=True)

print(X_train.shape) #(36,6)
print(y_train.shape) #(36,)
print(X_test.shape) #(12,6)
print(y_test.shape) #(12,)

bernoulli = BernoulliNB(alpha=1.2,force_alpha=True,binarize=1.0,fit_prior=False)

bernoulli.fit(X_train,y_train)

y_pred = bernoulli.predict(X_test)
accuracy = np.mean(y_pred==y_test)
print(accuracy)

score = bernoulli.score(X_test,y_test)  # accuracy on the test set
print(score)

test = np.array([[16,18312.5,6614.5,2842.31,25.23,1147430.19]])
prediction = bernoulli.predict(test)
print(prediction) #[2]

Origin blog.csdn.net/qq_41264055/article/details/133235317