Getting Started with Python Machine Learning - Bayesian Algorithm Study Notes


Foreword

The Bayesian formula is a theorem of probability and statistics that we already meet in high school. What has always struck me about it is that it lets us infer the unknown from the known; there is an almost indescribable sense of mystery to it! In everyday thinking this kind of prediction of the future may seem absurd and illogical, but Bayes captured, in a single mathematical model, the reasoning process of going from a prior probability to a posterior probability distribution, which is truly remarkable.


1. Introduction to Bayesian Algorithms

The Bayesian algorithm is a statistical learning method, based on Bayes' theorem, for classification, prediction and inference problems. Its basic idea is to combine the prior probability with sample data to compute the posterior probability, which is then used for classification or prediction.


Specifically, the Bayesian algorithm assumes that the classification result is jointly determined by multiple features, and that these features are independent of each other. By analyzing training data whose classes are already known, we can estimate the conditional probability of each feature for each class, that is, the probability that a given feature appears given a certain class. Combining these conditional probabilities with the prior probability of each class (the probability of each class before any data is observed), Bayes' theorem yields the posterior probability of each class, that is, the probability of each class given the observed features. Finally, the sample is classified or predicted according to which posterior probability is largest.

There are two commonly used families of Bayesian algorithms: Naive Bayes and Bayesian networks. The Naive Bayes algorithm assumes that all features are mutually independent, which greatly simplifies the computation. A Bayesian network is a graphical model that describes the relationships between variables and can handle dependencies between features.

Bayesian algorithms are widely used in text classification, spam filtering, recommendation systems and other fields. They not only achieve high classification accuracy, but can also handle multi-class and high-dimensional data.

2. Mathematical Principles of the Bayesian Algorithm

1. Conditional probability


A conditional probability is a posterior probability: it expresses the probability that event A occurs given that another event B has already occurred. We write this conditional probability as P(A|B):
P(A|B) = \frac{P(AB)}{P(B)}


The conditional probability formula describes how a posterior probability can be derived from prior probabilities: if we know the probability of event B and the probability that events A and B occur together, we can work backwards to the probability that A occurs given that B has already occurred.
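As a quick illustration, here is a minimal sketch with made-up numbers (two fair dice):

# A toy check of P(A|B) = P(AB) / P(B) with two fair dice.
# A = "the two dice sum to 8", B = "the first die shows 3".
p_B = 1 / 6         # P(B)
p_AB = 1 / 36       # P(AB): the first die is 3 and the second die is 5
p_A_given_B = p_AB / p_B
print(p_A_given_B)  # 1/6, i.e. about 0.1667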

2. Total probability formula

Conditional probability alone is not enough. If we want to apply the idea of conditional probability to a whole set of events, we also need the total probability formula. If the events B_1, B_2, B_3, \cdots, B_n form a complete partition of the sample space and each has positive probability, then the probability of event A can be expressed with the total probability formula:

P(A) = \sum_{i=1}^{n} P(A|B_i)\, P(B_i)

The total probability formula combines conditional probabilities with prior probabilities. It extends the conditional probability of a single event to a whole partition of events, and shows how conditional probabilities can be converted back into an unconditional one.
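Again as a small sketch with made-up numbers (three machines producing parts, with assumed defect rates):

# Total probability: machines B1, B2, B3 produce 50%, 30% and 20% of all parts,
# with assumed defect rates of 1%, 2% and 3%. A = "a randomly chosen part is defective".
priors = [0.5, 0.3, 0.2]            # P(B_i)
defect_rates = [0.01, 0.02, 0.03]   # P(A | B_i)
p_A = sum(p_cond * p_prior for p_cond, p_prior in zip(defect_rates, priors))
print(p_A)   # 0.017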

3. Bayesian formula

Bayes studied a very interesting question: if one conditional probability can be transformed into the complementary (reversed) conditional probability, we can predict some very interesting things. This is how Bayes' theorem was born, and its idea can be summarized as prior probability + data = posterior probability. The Bayes formula states that the probability of event A, given that event B has occurred, equals the probability of B given A times the prior probability of A, divided by the probability of B:

P(A|B) = \frac{P(B|A)\, P(A)}{P(B)}

In machine learning, event A and event B correspond to the label and the features, so Bayes' theorem gives the basic principle of Bayesian algorithms: the posterior probability of a label given the features is proportional to the prior probability of the label times the probability of the features given the label. Almost all Bayesian algorithms are built and optimized on this principle.
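For intuition, here is a minimal sketch with made-up numbers for a hypothetical spam-filtering scenario:

# Bayes' formula: A = "the email is spam", B = "the email contains the word 'free'".
p_spam = 0.2                   # prior P(A), an assumed value
p_free_given_spam = 0.6        # P(B|A), assumed
p_free_given_ham = 0.05        # P(B|not A), assumed
# Total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))   # 0.75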


4. Naive Bayes Classifier

In real life, predicting whether something will happen requires taking many factors and features into account. Naive Bayes helps us simplify the otherwise cumbersome relationships among these feature factors. The Naive Bayes algorithm is a classification algorithm based on Bayes' theorem together with the assumption that features are conditionally independent. Its core idea is to use a sample data set with known categories to predict the category of a new sample by computing conditional probabilities over its features. The conditional probability of a category label y given several feature factors can be expressed as

P(y|x_1, x_2, ..., x_n) = \frac{P(y)\, P(x_1, x_2, ..., x_n | y)}{P(x_1, x_2, ..., x_n)}

Here y denotes the category, x_1, x_2, ..., x_n is the feature vector, and P(y|x_1, x_2, ..., x_n) is the probability that the sample belongs to category y given the feature vector x_1, x_2, ..., x_n.

The Naive Bayes algorithm assumes that all features are independent; that is, when we consider the several influencing factors of one thing, we treat each factor as an independent individual that does not affect the others. Thanks to this assumption, the number of conditional probabilities the model must estimate is greatly reduced, and both learning and prediction become much simpler. Under the independence assumption, P(x_1, x_2, ..., x_n | y) expands into the product of the conditional probabilities of the individual features given the category:

P(x_1, x_2, ..., x_n | y) = \prod_{i=1}^{n} P(x_i | y)

Here P(y) is the prior probability of category y in the sample, P(x_i|y) is the probability that feature x_i occurs given category y, and P(x_1, x_2, ..., x_n) is the probability that the feature vector x_1, x_2, ..., x_n occurs. Substituting the product back into Bayes' formula gives

P(y|x_1, x_2, ..., x_n) = \frac{P(y) \prod\limits_{i=1}^{n} P(x_i|y)}{P(x_1, x_2, ..., x_n)}

In practice, since P(x_1, x_2, ..., x_n) is the same for every category, the denominator can be omitted; only the numerator matters, and the category with the largest posterior probability is chosen as the prediction. Writing \hat{y} for the predicted category, we get

\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i | y)
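To make the argmax concrete, here is a minimal from-scratch sketch on hypothetical binary features: the priors P(y) and conditionals P(x_i|y) are estimated by counting (with Laplace smoothing), and the class with the largest product is returned.

import numpy as np

# Hypothetical toy data: 5 samples, 3 binary features, 2 classes.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1],
              [1, 1, 1]])
y = np.array([1, 1, 0, 0, 1])

classes = np.unique(y)
# Prior P(y): relative frequency of each class.
priors = {c: np.mean(y == c) for c in classes}
# Conditional P(x_i = 1 | y): per-feature frequency within each class,
# with Laplace smoothing so no probability is exactly 0 or 1.
cond = {c: (X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2) for c in classes}

def predict(x):
    # Pick the class maximizing P(y) * prod_i P(x_i | y).
    scores = {c: priors[c] * np.prod(np.where(x == 1, cond[c], 1 - cond[c]))
              for c in classes}
    return max(scores, key=scores.get)

print(predict(np.array([1, 0, 1])))   # predicted label for a new sample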

5. Gaussian Naive Bayes Classifier and Bernoulli Naive Bayes Classifier

Gaussian Naive Bayes is a common form of the Naive Bayes algorithm, suited to feature variables that take continuous values. It assumes that, within each category, each feature follows a Gaussian distribution, so the probability density function of the Gaussian distribution can be used to compute the conditional probability. The predicted category \hat{y} is then

\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma_{y,i}^2}} \exp\left(-\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2}\right)
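A minimal sketch of this likelihood on a couple of made-up continuous samples: per-class means and variances are estimated from the data and plugged into the normal density.

import numpy as np

# Made-up continuous data: 4 samples, 2 features, 2 classes.
X = np.array([[1.0, 2.1], [0.9, 1.9], [3.2, 4.0], [3.0, 4.2]])
y = np.array([0, 0, 1, 1])

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x_new = np.array([1.1, 2.0])
for c in np.unique(y):
    Xc = X[y == c]
    mu, var = Xc.mean(axis=0), Xc.var(axis=0) + 1e-9   # small epsilon avoids zero variance
    score = np.mean(y == c) * np.prod(gaussian_pdf(x_new, mu, var))
    print(c, score)   # the class with the larger score would be predicted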

Bernoulli Naive Bayes is another common form of the Naive Bayes algorithm, suited to binary feature variables, i.e. features that take only two values such as 0 and 1. The conditional probability of each feature therefore has only two values, corresponding to the feature being 1 or 0. Writing P_{i|y} for the probability that feature i equals 1 given category y, the predicted category \hat{y} is

\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P_{i|y}^{x_i} \,(1 - P_{i|y})^{1-x_i}
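This decision rule is what scikit-learn's BernoulliNB evaluates (with Laplace smoothing, alpha=1.0, by default). A minimal usage sketch on made-up binary data:

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Made-up binary data, e.g. whether each of 3 words appears in a document.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
y = np.array([1, 1, 0, 0])

clf = BernoulliNB()                    # alpha=1.0 Laplace smoothing by default
clf.fit(X, y)
print(clf.predict([[1, 0, 0]]))        # predicted class for a new sample
print(clf.predict_proba([[1, 0, 0]]))  # posterior probabilities for each class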

3. Implementing Naive Bayes Classification in Python

The idea of the Python implementation is as follows: first import the packages; this time we use the moons dataset and the blobs dataset together with the Gaussian and Bernoulli Naive Bayes classifiers. We then instantiate the model objects and create a figure, and define a plotting function that standardizes the dataset, builds a mesh grid, and draws a scatter plot of the training and test points in one subplot. Finally it trains the model, predicts probabilities over the flattened grid, and reshapes them to visualize the decision boundary as filled contours in a second subplot.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_blobs
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB


# Names of the models
names = ["Gaussian", "Bernoulli"]
# Create the model objects
classifiers = [GaussianNB(), BernoulliNB()]
# Create the datasets
datasets = [make_moons(noise=0.2, random_state=0), make_blobs(centers=2, random_state=2)]
# Create the figure (canvas)
figure = plt.figure(figsize=(12, 8))


def plot_clf(NB_clf, dataset, name, i):
    X, y = dataset
    # Standardize the dataset
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=42)
    # Build a mesh grid over the canvas
    x1_min, x1_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    x2_min, x2_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    array1, array2 = np.meshgrid(np.arange(x1_min, x1_max, 0.2),
                                 np.arange(x2_min, x2_max, 0.2))
    # Colormaps: RdBu for the decision surface, bright colors for the scatter points
    cm = plt.cm.RdBu
    cm_bright = ListedColormap(['#fafab0', '#9898ff'])
    i += 1
    ax = plt.subplot(len(datasets), 2, i)  # one row per dataset, two columns
    ax.set_title("dataset")
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train,
               cmap=cm_bright, edgecolors='k')
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test,
               cmap=cm_bright, alpha=0.6, edgecolors='k')
    ax.set_xlim(array1.min(), array1.max())
    ax.set_ylim(array2.min(), array2.max())
    ax.set_xticks(())
    ax.set_yticks(())
    i += 1
    ax = plt.subplot(len(datasets), 2, i)
    clf = NB_clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)          # accuracy on the test set
    # Predicted probability of the positive class at every grid point,
    # reshaped back to the grid shape so it can be drawn as filled contours
    Z = clf.predict_proba(np.c_[array1.ravel(), array2.ravel()])[:, 1]
    Z = Z.reshape(array1.shape)
    ax.contourf(array1, array2, Z, cmap=cm, alpha=.8)
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,
               edgecolors='k')
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
               edgecolors='k', alpha=0.6)
    ax.set_xlim(array1.min(), array1.max())
    ax.set_ylim(array2.min(), array2.max())
    ax.set_xticks(())
    ax.set_yticks(())
    ax.set_title(name)
    # Annotate the subplot with the test accuracy
    ax.text(array1.max() - .3, array2.min() + .3, ('{:.1f}%'.format(score * 100)),
            size=15, horizontalalignment='right')


for i in range(2):
    plot_clf(classifiers[i], datasets[i], names[i], 2*i)
    plt.tight_layout()
plt.show()
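As an optional quick check (a small sketch reusing the objects defined above), the same test accuracies can also be printed without any plotting:

# Refit each classifier on its dataset and print the test accuracy directly.
for name, clf, data in zip(names, classifiers, datasets):
    X, y = data
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=42)
    print(name, clf.fit(X_train, y_train).score(X_test, y_test))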



Summary

The above is the entire content of these Bayesian algorithm study notes. They briefly introduce the mathematical principles of the Bayesian algorithm and the ideas behind the Python implementation. The Naive Bayes algorithm has many advantages: good interpretability, since it can show how strongly each feature influences the classification, making it easy to understand and explain; fast computation, making it suitable for large-scale and high-dimensional data sets; good robustness to noisy and missing data; and strong performance in natural language processing tasks such as text classification and sentiment analysis. In general, the value of the Bayesian algorithm is hard to overstate.


Origin: blog.csdn.net/m0_55202222/article/details/130792139