[AI] Implementing the Naive Bayes Algorithm in Python

Naive Bayes Algorithm

1. Algorithm derivation

The Naive Bayes algorithm is a classification method based on Bayes' theorem and the assumption of conditional independence between features. This assumption avoids the overfitting that can arise when the joint distribution is estimated from the sample information alone. The algorithm achieves a high accuracy rate on large data sets, and the algorithm itself is relatively simple.

The goal of the Naive Bayes algorithm is to compute the posterior probability of each category given the input features, and to select the category with the largest posterior probability as the model output. The posterior probability is obtained from the conditional probabilities via Bayes' rule. Assume each sample contains $n$ features, there are $m$ classes in total, and $c_k$ denotes one of the categories. The formula derivation of the Naive Bayes algorithm is as follows:

① Prior probability:
$$P(Y=c_k)$$

② Conditional probability:
$$P(X=x \mid Y=c_k)=P(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)} \mid Y=c_k)$$

Suppose each feature $x^{(j)}$ can take $S$ possible values. Estimating the conditional probability above directly would then require on the order of $S^n$ parameters for the $n$ features, which is far too many: in practice the training set is rarely large enough to estimate so many parameters, and the features of the data often exhibit a certain degree of independence. We therefore make the bold assumption that the $n$ features are conditionally independent of each other given the class. Under this assumption, the conditional probability becomes:

$$P(X=x \mid Y=c_k)=P(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)} \mid Y=c_k)=\prod_{j=1}^{n}P(X^{(j)}=x^{(j)} \mid Y=c_k)$$
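For example, with $n=2$ binary features the assumption states that $P(X^{(1)}=1, X^{(2)}=0 \mid Y=c_k) = P(X^{(1)}=1 \mid Y=c_k)\,P(X^{(2)}=0 \mid Y=c_k)$, so each class requires only on the order of $n$ per-feature probabilities instead of roughly $S^n$ joint ones.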

③ Posterior probability:
$$P(Y=c_k \mid X=x)=\frac{P(X=x \mid Y=c_k)\,P(Y=c_k)}{\sum_{k}P(X=x \mid Y=c_k)\,P(Y=c_k)}$$

Substituting the conditional probability formula into the posterior probability formula, the posterior probability expression is transformed into:
$$P(Y=c_k \mid X=x)=\frac{P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)} \mid Y=c_k)}{\sum_{k}P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)} \mid Y=c_k)}$$

④ Since the denominator above is the same for every class $c_k$, it can be ignored when comparing classes. The optimization objective of the Naive Bayes classifier can therefore be expressed as:
$$y=\mathop{\arg\max}\limits_{c_k}P(Y=c_k \mid X=x)=\mathop{\arg\max}\limits_{c_k}P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)} \mid Y=c_k)$$
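To make the decision rule concrete, here is a minimal Python sketch; the priors and per-feature conditional probabilities below are made-up illustrative numbers, not values estimated from any data set:

import numpy as np

# Hypothetical model: 2 classes and 3 binary features.
# priors[k]          = P(Y = c_k)
# conditionals[k][j] = P(X^(j) = 1 | Y = c_k)
priors = np.array([0.6, 0.4])
conditionals = np.array([[0.2, 0.7, 0.5],
                         [0.8, 0.3, 0.4]])

x = np.array([1, 0, 1])  # an observed sample

# Per-class score: P(Y = c_k) * prod_j P(X^(j) = x^(j) | Y = c_k)
feature_probs = np.where(x == 1, conditionals, 1 - conditionals)
scores = priors * feature_probs.prod(axis=1)
print(scores.argmax())  # index of the class with the largest score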

2. Algorithm process

We have derived the optimization objective of the Naive Bayes algorithm above. The following steps give the operational flow of the algorithm (a minimal counting sketch of steps ① and ② follows the steps):

① Compute the prior probability, where the denominator $N$ is the number of training samples and the numerator indicator $I(y_i=c_k)$ takes the value 1 if sample $i$ has label $y_i=c_k$ and 0 otherwise:

$$P(Y=c_k)=\frac{\sum_{i=1}^{N}I(y_i=c_k)}{N}$$

② Compute the conditional probability, i.e. the probability that the $j$-th feature takes the value $x^{(j)}=a_{jl}$ given the category $c_k$:

$$P(X^{(j)}=x^{(j)} \mid Y=c_k)=\frac{\sum_{i=1}^{N}I(X_i^{(j)}=a_{jl},\,y_i=c_k)}{\sum_{i=1}^{N}I(y_i=c_k)}$$

③ For a given sample $x=(x^{(1)},x^{(2)},\dots,x^{(n)})^T$, compute the following quantity for each class $c_k$, yielding a vector of size $m$:

$$P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)} \mid Y=c_k)$$

④ Take the $\arg\max$ over these $m$ values to determine the classification of sample $x$:

$$y=\mathop{\arg\max}\limits_{c_k}P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)} \mid Y=c_k)$$
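Here is a minimal counting sketch of steps ① and ②; the toy arrays X and y below are made up for illustration, and Laplace smoothing (which the code in section 3 adds) is omitted:

import numpy as np

# Toy data: 4 samples, 3 binary features, labels in {0, 1}
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 1]])
y = np.array([1, 0, 1, 0])
N = len(y)

for c in np.unique(y):
    mask = (y == c)
    prior = mask.sum() / N                          # step 1: P(Y = c)
    conditional = X[mask].sum(axis=0) / mask.sum()  # step 2: P(X^(j) = 1 | Y = c)
    print(c, prior, conditional)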

3. Algorithm example

Take online community messages as an example. In order not to harm the development of the community, we need to block insulting speech, so we need to build a quick filter: if a message uses negative or insulting language, it is marked as inappropriate. Filtering this type of content is a very common need. For this problem we establish two classes, insulting and non-insulting, represented by 1 and 0 respectively.

We treat text as word vectors (or term vectors), that is, we convert sentences into vectors. We first collect the words that appear in all documents and decide which ones to include in the vocabulary, and then each document is converted into a vector over that vocabulary.

The function below takes a simple training data set and vectorizes it so the algorithm can process it easily. For example, if the training data has two samples [['a', 'c', 'g'], ['a', 'b', 'c', 'm']], then the vocabulary is ['a', 'b', 'c', 'g', 'm']; the position corresponding to each vocabulary word contained in a sample is set to 1, otherwise it is set to 0. The two samples are therefore vectorized as [[1, 0, 1, 1, 0], [1, 1, 1, 0, 1]], so that every sample has 5 features and subsequent processing can be unified.

import numpy as np


def vectorization(word_list, vocabulary):
    # Map a list of words to a 0/1 vector over the vocabulary:
    # position j is 1 if vocabulary[j] appears in word_list, otherwise 0.
    return_vector = [0] * len(vocabulary)
    for word in word_list:
        if word in vocabulary:
            return_vector[vocabulary.index(word)] = 1
    return return_vector


def load_training_data():
    training_data = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                     ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                     ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                     ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                     ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                     ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    training_labels = [0, 1, 0, 1, 0, 1]  # 1 = insulting, 0 = non-insulting

    # Build the vocabulary as the union of all words in the training data
    vocabulary = set()
    for item in training_data:
        vocabulary = vocabulary | set(item)
    vocabulary = list(vocabulary)

    # Vectorize the training data over the vocabulary
    training_mat = []
    for item in training_data:
        training_mat.append(vectorization(item, vocabulary))

    return training_mat, training_labels, vocabulary
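As a quick check, the toy example from the paragraph above can be reproduced by calling vectorization with a fixed vocabulary (inside load_training_data the vocabulary is built from a set, so its word order is arbitrary):

vocab = ['a', 'b', 'c', 'g', 'm']
print(vectorization(['a', 'c', 'g'], vocab))       # [1, 0, 1, 1, 0]
print(vectorization(['a', 'b', 'c', 'm'], vocab))  # [1, 1, 1, 0, 1]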

The following function implements the execution flow of the Naive Bayes algorithm, arranged according to the algorithm flow in section 2 above. The one difference is that Laplace smoothing is applied in the code: it prevents a zero probability from appearing in the multiplication and making the entire product zero, which would obviously be unreasonable.

def naive_bayes(test_data):
    training_mat, training_labels, vocabulary = load_training_data()
    feature_size = len(training_mat[0])
    training_size = len(training_mat)
    # Compute the prior probabilities (with Laplace smoothing)
    p1_prior = (sum(training_labels) + 1) / (float(training_size) + 2)
    p0_prior = 1 - p1_prior
    # Compute the conditional probabilities (with Laplace smoothing)
    feature_cnt1 = np.zeros(feature_size)
    feature_cnt0 = np.zeros(feature_size)
    for i in range(training_size):
        if training_labels[i] == 1:
            feature_cnt1 += training_mat[i]
        else:
            feature_cnt0 += training_mat[i]
    p1_condition = (feature_cnt1 + 1) / (feature_cnt1.sum() + feature_size)
    p0_condition = (feature_cnt0 + 1) / (feature_cnt0.sum() + feature_size)
    # Evaluate the objective for each class: the prior times the conditional
    # probability of every word that appears in the test message
    p1_pred, p0_pred = p1_prior, p0_prior
    test_data = vectorization(test_data, vocabulary)
    for i in range(feature_size):
        if test_data[i] == 1:
            p1_pred *= p1_condition[i]
            p0_pred *= p0_condition[i]
    return 1 if p1_pred > p0_pred else 0
    

test_data = ['stupid', 'stop', 'how', 'problems']
pred_label = naive_bayes(test_data)
print('test_data = ', test_data)
print('pred_label = ', pred_label)

Running the code given above produces the following output:

test_data =  ['stupid', 'stop', 'how', 'problems']
pred_label =  1
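One practical refinement worth noting (not part of the code above): for longer messages, multiplying many small conditional probabilities can underflow to zero in floating point. A common remedy is to compare sums of log-probabilities instead of products, which leaves the argmax unchanged because the logarithm is monotonically increasing. A minimal self-contained sketch with made-up numbers:

import numpy as np

# Hypothetical priors and per-word conditional probabilities for the words
# present in a message (illustrative values only, not taken from the example above).
p1_prior, p0_prior = 0.5, 0.5
p1_condition = np.array([0.08, 0.04, 0.02, 0.02])  # P(word present | class 1)
p0_condition = np.array([0.02, 0.04, 0.04, 0.04])  # P(word present | class 0)

# Sum of logs replaces the product of probabilities
p1_score = np.log(p1_prior) + np.log(p1_condition).sum()
p0_score = np.log(p0_prior) + np.log(p0_condition).sum()
print(1 if p1_score > p0_score else 0)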
