Naive Bayes: theorem, examples, and Python implementation

Preliminary understanding: given a set of inputs, many different outputs are possible. We calculate the probability of each output given the input, and the output with the highest probability is taken as the result for that input.

So, how do we solve this problem?

Bayes offers one approach: judge according to the historical record.

The idea is this:

1. According to Bayes' formula: P(output | input) = P(input | output) * P(output) / P(input)

2. P(input) = the proportion of all historical samples that have this input;

3. P(output) = the proportion of all historical samples that have this output;

4. P(input | output) = among the historical samples that have this output, the proportion that also have this input. For example, in "a 30-year-old man eats noodles at noon", [30-year-old man] is the input and [eats noodles at noon] is the output.
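The four steps above can be sketched with plain counting. The records below are hypothetical and exist only to illustrate the arithmetic; the function name `posterior` is also an assumption, not something from the original post.

```python
from collections import Counter

# Hypothetical history of (input, output) pairs, made up for illustration only.
history = [
    ("30-year-old man", "noodles"),
    ("30-year-old man", "noodles"),
    ("30-year-old man", "rice"),
    ("20-year-old woman", "rice"),
    ("20-year-old woman", "noodles"),
    ("20-year-old woman", "rice"),
]

def posterior(history, given_input):
    """Return {output: P(output | input)} computed via Bayes' formula.

    Assumes given_input occurs at least once in history (otherwise P(input) is 0).
    """
    n = len(history)
    input_counts = Counter(i for i, _ in history)
    output_counts = Counter(o for _, o in history)
    joint_counts = Counter(history)

    p_input = input_counts[given_input] / n          # step 2
    result = {}
    for output, count in output_counts.items():
        p_output = count / n                         # step 3
        # Step 4: share of the samples with this output that also have the input
        p_input_given_output = joint_counts[(given_input, output)] / count
        # Step 1: Bayes' formula
        result[output] = p_input_given_output * p_output / p_input
    return result

probs = posterior(history, "30-year-old man")
best = max(probs, key=probs.get)   # output with the highest probability
print(probs, best)
```

With these made-up records, two of the three "30-year-old man" samples ate noodles, so the sketch picks "noodles" for that input.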

 

First, the definition of conditional probability and Bayes' formula
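
For reference, the standard textbook forms are given below. The event names A and B are notation chosen here, and P(B) > 0 is assumed.

```latex
\begin{aligned}
P(A \mid B) &= \frac{P(A \cap B)}{P(B)} && \text{(conditional probability)} \\
P(A \mid B) &= \frac{P(B \mid A)\,P(A)}{P(B)} && \text{(Bayes' formula)}
\end{aligned}
```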

Second, naive Bayes classification algorithm

The naive Bayes classification algorithm is a supervised learning method that can handle binary or multi-class classification. An example data set is shown below:

 

 

Now there is a new sample, X = (age: <=30, income: medium, student: yes, credit rating: medium), and the goal is to classify it with a naive Bayes classifier. Suppose the class C takes two values (c1 = Yes, c2 = No). Our goal is to compute P(c1 | X) and P(c2 | X); X is assigned to whichever class has the larger probability.

Here is the naive Bayes classification process expressed as formulas.
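
A standard way to write the process is sketched below, assuming X = (x_1, ..., x_n) are the attribute values of the sample and c_i ranges over the classes (the index names are chosen here). The "naive" assumption is that the attributes are conditionally independent given the class. Since P(X) is the same for every class, only the numerators need to be compared.

```latex
\begin{aligned}
P(c_i \mid X) &= \frac{P(X \mid c_i)\,P(c_i)}{P(X)} \\
P(X \mid c_i) &= \prod_{k=1}^{n} P(x_k \mid c_i) \\
\hat{c} &= \arg\max_{c_i} \; P(c_i) \prod_{k=1}^{n} P(x_k \mid c_i)
\end{aligned}
```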

 

 

Third, examples

Here, the data set above serves as the training set, and the new sample X = (age: <=30, income: medium, student: yes, credit rating: ) is classified as the test sample.

We can map the descriptive attributes and the class attribute in this example onto the formulas and carry out the computation.

Python reference implementation code

# coding: utf-8
# Maximum likelihood estimation, naive Bayes algorithm
import pandas as pd
import numpy as np

class NaiveBayes(object):
    def getTrainSet(self):
        dataSet = pd.read_csv('F://aaa.csv')
        dataSetNP = np.array(dataSet)  # convert the DataFrame to a NumPy array
        trainData = dataSetNP[:,0:dataSetNP.shape[1]-1]   # training features x1, x2, ... (all columns but the last)
        labels = dataSetNP[:,dataSetNP.shape[1]-1]        # class label Y of each training sample (last column)
        return trainData, labels

    def classify(self, trainData, labels, features):
        # Prior probability of each label
        labels = list(labels)    # convert to a list
        labelset = set(labels)
        P_y = {}       # label -> prior probability
        for label in labelset:
            P_y[label] = labels.count(label)/float(len(labels))   # P(y) = count(y) / N
            print(label,P_y[label])

        # Joint probability of each feature value and each label
        P_xy = {}
        for y in P_y.keys():
            y_index = [i for i, label in enumerate(labels) if label == y]  # indices of samples whose label is y
            for j in range(len(features)):      # indices of samples whose j-th column equals features[j]
                x_index = [i for i, feature in enumerate(trainData[:,j]) if feature == features[j]]
                xy_count = len(set(x_index) & set(y_index))   # samples where both conditions hold
                pkey = (j, features[j], y)   # keep the column index so equal values in different columns do not collide
                P_xy[pkey] = xy_count / float(len(labels))
                print(pkey,P_xy[pkey])

        # Conditional probabilities
        P = {}
        for y in P_y.keys():
            for j, x in enumerate(features):
                pkey = (j, x, y)
                P[pkey] = P_xy[pkey] / float(P_y[y])    # P(xj | y) = P(xj, y) / P(y)
                print(pkey,P[pkey])

        # Score of the query sample (e.g. [2,'S']) for each class
        F = {}   # class -> score of the query sample
        for y in P_y:
            F[y] = P_y[y]
            for j, x in enumerate(features):
                # P(y | X) = P(X | y) * P(y) / P(X); the denominator is the same for every class,
                # so comparing numerators is enough: F = P(x1 | y) * P(x2 | y) * P(y)
                F[y] = F[y]*P[(j, x, y)]
                print(str(x),str(y),F[y])

        features_label = max(F, key=F.get)  # class with the largest score
        return features_label


if __name__ == '__main__':
    nb = NaiveBayes()
    # Training data
    trainData, labels = nb.getTrainSet()
    # Query sample: one value per feature column (x1, x2, ...)
    features = [8]
    # Which class does this sample belong to?
    result = nb.classify(trainData, labels, features)
    print(features,'belongs to',result)
    
    
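# coding: utf-8
# Naive Bayes algorithm with Bayesian estimation, λ=1, K=2, S=3; λ=1 is Laplace smoothing
import pandas as pd
import numpy as np

class NavieBayesB(object):
    def __init__(self):
        self.A = 1    # λ = 1
        self.K = 2    # number of classes
        self.S = 3    # number of possible values of a feature

    def getTrainSet(self):
        trainSet = pd.read_csv('F://aaa.csv')
        trainSetNP = np.array(trainSet)     # convert the DataFrame to a NumPy array
        trainData = trainSetNP[:,0:trainSetNP.shape[1]-1]     # training features x1, x2, ... (all columns but the last)
        labels = trainSetNP[:,trainSetNP.shape[1]-1]          # class label Y of each training sample (last column)
        return trainData, labels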

    def classify(self, trainData, labels, features):
        labels = list(labels)    # convert to a list
        # Smoothed prior probabilities
        P_y = {}
        for label in set(labels):   # visit each distinct label once
            P_y[label] = (labels.count(label) + self.A) / float(len(labels) + self.K*self.A)

        # Smoothed conditional probabilities
        P = {}
        for y in P_y.keys():
            y_index = [i for i, label in enumerate(labels) if label == y]   # indices of samples whose label is y
            y_count = labels.count(y)     # number of samples with label y
            for j in range(len(features)):
                pkey = (j, features[j], y)   # keep the column index so equal values in different columns do not collide
                x_index = [i for i, x in enumerate(trainData[:,j]) if x == features[j]]   # indices of samples whose j-th column equals features[j]
                xy_count = len(set(x_index) & set(y_index))   # number of samples where x and y occur together
                P[pkey] = (xy_count + self.A) / float(y_count + self.S*self.A)   # smoothed conditional probability

        # Class of the query sample
        F = {}
        for y in P_y.keys():
            F[y] = P_y[y]
            for j, x in enumerate(features):
                F[y] = F[y] * P[(j, x, y)]

        features_y = max(F, key=F.get)   # class with the largest score
        return features_y


if __name__ == '__main__':
    nb = NavieBayesB()
    # Training data
    trainData, labels = nb.getTrainSet()
    # Query sample: one value per feature column (x1, x2, ...)
    features = [10]
    # Which class does this sample belong to?
    result = nb.classify(trainData, labels, features)
    print(features,'belongs to',result)
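
The smoothed estimates used in the second script can be isolated into two small functions. This is a minimal sketch: the function names and the example counts are hypothetical, while `lam`, `K`, and `S` mirror λ, K, and S from the script (K classes, S possible values of the feature).

```python
def smoothed_prior(class_count, n_samples, K, lam=1):
    """Smoothed prior: (count(y) + lam) / (N + K * lam)."""
    return (class_count + lam) / (n_samples + K * lam)

def smoothed_conditional(joint_count, class_count, S, lam=1):
    """Smoothed conditional: (count(x, y) + lam) / (count(y) + S * lam)."""
    return (joint_count + lam) / (class_count + S * lam)

# Hypothetical counts: 15 samples, 9 of them in class y,
# and a feature value that never occurs together with y.
prior = smoothed_prior(9, 15, K=2)       # (9 + 1) / (15 + 2)
cond = smoothed_conditional(0, 9, S=3)   # (0 + 1) / (9 + 3)
print(prior, cond)
```

With λ = 1 (Laplace smoothing), a feature value that was never seen with a class still gets a non-zero conditional probability, so a single zero count no longer forces the whole product for that class to zero.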

 

Reference links:

https://blog.csdn.net/ten_sory/article/details/81237169

https://www.cnblogs.com/yiyezhouming/p/7364688.html

Origin www.cnblogs.com/cqliu/p/11200100.html