[Reprinted] Naive Bayes Prediction in Python: A Detailed Explanation of the Naive Bayes Classifier

Reference link: Python Naive Bayes Classifier

This article describes how to implement a naive Bayes classifier in Python, shared here for your reference.

Bayes' Theorem

Bayes' theorem describes how a prior probability distribution of an event (an initial judgment made before the data are seen) is revised in light of observations, and it occupies an important place in probability theory.

The prior probability distribution (marginal probability) is the distribution assumed before the sample is observed, based on prior knowledge rather than on the sample itself; the posterior probability (conditional probability) is the conditional distribution of the unknown parameters obtained by combining the prior distribution with the observed sample.

 Bayesian formula:

 P(A∩B) = P(A)*P(B|A) = P(B)*P(A|B)

Rearranging gives:

 P(A|B)=P(B|A)*P(A)/P(B)

where:

 P(A) is the prior probability or marginal probability of A. It is called "prior" because it does not consider the B factor.

 P(A|B) is the conditional probability of A after B occurs, and is also called the posterior probability of A.

P(B|A) is the conditional probability of B given that A has occurred, also called the posterior probability of B; in this context it plays the role of the likelihood.

P(B) is the prior probability or marginal probability of B, and acts here as a normalizing constant.

P(B|A)/P(B) is called the standardized likelihood.
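To make the formula concrete, here is a minimal Python sketch with made-up numbers (purely illustrative, not taken from this article's data) that computes the posterior P(A|B) from the prior P(A), the likelihood P(B|A), and the normalizing constant P(B):

# Hypothetical numbers for illustration only
p_a = 0.01              # prior P(A)
p_b_given_a = 0.9       # likelihood P(B|A)
p_b_given_not_a = 0.05  # P(B | not A), needed to compute P(B)

# Normalizing constant P(B) via the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)      # about 0.154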

Naive Bayes Classification

When estimating the class-conditional probabilities, the naive Bayes classifier assumes that the attributes are conditionally independent of one another given the class.

First, define:

x = {a1, a2, ...} is a sample vector, where each a is a feature attribute

div = {d1 = [l1, u1], ...} is a partition of a feature attribute's value range into intervals

class = {y1, y2, ...} is the set of categories a sample can belong to

 Algorithm flow:

 (1) Calculate the prior probability p(y[i]) for each category through the distribution of categories in the sample set

 (2) Calculate the frequency p(a[j] in d[k] | y[i]) of each feature attribute division under each category

 (3) Calculate p(x|y[i]) for each sample

 p(x|y[i]) = p(a[1] in d | y[i]) * p(a[2] in d | y[i]) * ...

All feature attributes of the sample are known, so the interval d to which each feature attribute belongs is known.

 The value of p(a[k] in d | y[i]) can be determined by (2), and p(x|y[i]) can be obtained.

 (4) According to Bayes' theorem:

 p(y[i]|x) = ( p(x|y[i]) * p(y[i]) ) / p(x)

Because the denominator p(x) is the same for every class, only the numerators need to be compared.

p(y[i]|x) is the probability that the observed sample belongs to class y[i]; the class with the largest value is taken as the classification result.

 Example:

Load the data set (20 samples, 10 per class):

{a1 = 0, a2 = 0, C = 0}   {a1 = 0, a2 = 0, C = 1}

{a1 = 0, a2 = 0, C = 0}   {a1 = 0, a2 = 0, C = 1}

{a1 = 0, a2 = 0, C = 0}   {a1 = 0, a2 = 0, C = 1}

{a1 = 1, a2 = 0, C = 0}   {a1 = 0, a2 = 0, C = 1}

{a1 = 1, a2 = 1, C = 0}   {a1 = 0, a2 = 0, C = 1}

{a1 = 1, a2 = 1, C = 0}   {a1 = 1, a2 = 0, C = 1}

{a1 = 1, a2 = 1, C = 0}   {a1 = 1, a2 = 0, C = 1}

{a1 = 1, a2 = 1, C = 0}   {a1 = 1, a2 = 1, C = 1}

{a1 = 1, a2 = 1, C = 0}   {a1 = 1, a2 = 1, C = 1}

{a1 = 1, a2 = 1, C = 0}   {a1 = 1, a2 = 1, C = 1}

Calculate the prior probability of each category:

 P(C = 0) = 0.5

 P(C = 1) = 0.5

 Calculate the conditional probability of each feature attribute:

 P(a1 = 0 | C = 0) = 0.3

 P(a1 = 1 | C = 0) = 0.7

 P(a2 = 0 | C = 0) = 0.4

 P(a2 = 1 | C = 0) = 0.6

 P(a1 = 0 | C = 1) = 0.5

 P(a1 = 1 | C = 1) = 0.5

 P(a2 = 0 | C = 1) = 0.7

 P(a2 = 1 | C = 1) = 0.3

 Test sample:

x = {a1 = 0, a2 = 1}

p(x | C = 0) = p(a1 = 0 | C = 0) * p(a2 = 1 | C = 0) = 0.3 * 0.6 = 0.18

p(x | C = 1) = p(a1 = 0 | C = 1) * p(a2 = 1 | C = 1) = 0.5 * 0.3 = 0.15

Compute the posterior numerators P(C) * p(x | C):

P(C = 0) * p(x | C = 0) = 0.5 * 0.18 = 0.09

P(C = 1) * p(x | C = 1) = 0.5 * 0.15 = 0.075

Since 0.09 > 0.075, the test sample is classified as C = 0.
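The hand calculation above can be reproduced with a short script. The following is only an illustrative sketch that follows steps (1)-(4) of the algorithm flow on the twenty samples listed above; it is separate from the class-based implementation given below:

# The twenty (a1, a2, C) samples from the table above
data = [(0, 0, 0), (0, 0, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0),
        (1, 1, 0), (1, 1, 0), (1, 1, 0), (1, 1, 0), (1, 1, 0),
        (0, 0, 1), (0, 0, 1), (0, 0, 1), (0, 0, 1), (0, 0, 1),
        (1, 0, 1), (1, 0, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1)]

x = (0, 1)  # test sample {a1 = 0, a2 = 1}

scores = {}
for c in (0, 1):
    subset = [s for s in data if s[2] == c]
    prior = len(subset) / len(data)        # step (1): P(C = c)
    likelihood = 1.0                       # steps (2)-(3): product of per-feature frequencies
    for j, v in enumerate(x):
        likelihood *= sum(1 for s in subset if s[j] == v) / len(subset)
    scores[c] = prior * likelihood         # step (4): numerator of the posterior

print(scores)                        # roughly {0: 0.09, 1: 0.075}
print(max(scores, key=scores.get))   # 0, i.e. the sample is assigned to C = 0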

 Python implementation

Training a naive Bayes classifier amounts to computing the probability tables of steps (1) and (2); applying it amounts to computing steps (3) and (4) and picking the class with the maximum value.

As before, the classifier is wrapped in a class with a consistent interface:

from numpy import zeros
from functools import reduce  # reduce must be imported explicitly in Python 3


class NaiveBayesClassifier(object):
    def __init__(self):
        self.dataMat = list()
        self.labelMat = list()
        self.pLabel1 = 0
        self.p0Vec = list()
        self.p1Vec = list()

    def loadDataSet(self, filename):
        # each line: whitespace-separated feature values, with the label in the last column
        with open(filename) as fr:
            for line in fr.readlines():
                lineArr = line.strip().split()
                dataLine = list()
                for i in lineArr:
                    dataLine.append(float(i))
                label = dataLine.pop()  # pop the last column, which holds the label
                self.dataMat.append(dataLine)
                self.labelMat.append(int(label))

    def train(self):
        dataNum = len(self.dataMat)
        featureNum = len(self.dataMat[0])
        # prior probability of class 1 (labels are 0/1)
        self.pLabel1 = sum(self.labelMat) / float(dataNum)
        p0Num = zeros(featureNum)
        p1Num = zeros(featureNum)
        p0Denom = 1.0
        p1Denom = 1.0
        for i in range(dataNum):
            if self.labelMat[i] == 1:
                p1Num += self.dataMat[i]
                p1Denom += sum(self.dataMat[i])
            else:
                p0Num += self.dataMat[i]
                p0Denom += sum(self.dataMat[i])
        # per-feature conditional probability vectors for class 0 and class 1
        self.p0Vec = p0Num / p0Denom
        self.p1Vec = p1Num / p1Denom

    def classify(self, data):
        # numerator of the posterior for each class; the shared denominator p(x) is omitted
        p1 = reduce(lambda x, y: x * y, data * self.p1Vec) * self.pLabel1
        p0 = reduce(lambda x, y: x * y, data * self.p0Vec) * (1.0 - self.pLabel1)
        if p1 > p0:
            return 1
        else:
            return 0

    def test(self):
        self.loadDataSet('testNB.txt')
        self.train()
        print(self.classify([1, 2]))


if __name__ == '__main__':
    NB = NaiveBayesClassifier()
    NB.test()
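The data file testNB.txt is not included with the article. Judging from loadDataSet, each line is expected to hold whitespace-separated numeric feature values followed by the class label (0 or 1) in the last column, and classify([1, 2]) implies two features; a hypothetical file in that format might look like:

1 2 1
0 1 0
2 3 1
1 0 0

Note that this implementation treats feature values as counts (it accumulates them into p0Num/p1Num and divides by the total per-class count) and uses neither Laplace smoothing nor log probabilities, so it is best read as a teaching sketch rather than production code.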

MATLAB

MATLAB's Statistics and Machine Learning Toolbox provides built-in support for naive Bayes classifiers:

trainData = [0 1; -1 0; 2 2; 3 3; -2 -1; -4.5 -4; 2 -1; -1 -3];
group = [1 1 -1 -1 1 1 -1 -1]';
model = fitcnb(trainData, group)
testData = [5 2; 3 1; -4 -3];
predict(model, testData)

fitcnb trains the model and predict makes predictions on new data; by default, fitcnb fits a normal (Gaussian) distribution to each numeric predictor.

I hope this article is helpful for your Python programming.


Origin blog.csdn.net/u013946150/article/details/112976672