My understanding of the naive Bayes algorithm: decide whether a sample C belongs to class A or class B by comparing probabilities. The specific code follows (tested under Python 3.5).
First, a walk-through of the steps:
1) Load the training data and the category label of each training document
2) Generate the vocabulary list: the union of all words in the training data
3) Turn the training data into a set of vectors containing only 0s and 1s
4) Compute the probability vectors from the training data
5) Load the test data
6) Turn the test data into a 0/1 vector in the same way
7) Multiply the test-data vector by each class's probability vector element-wise and sum the results
8) The larger sum gives the category of the test data
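As a concrete illustration of steps 2) and 3), here is a minimal sketch of building the vocabulary and encoding one document as a 0/1 vector (the two toy documents are made up for this example):

```python
# Minimal illustration of steps 2) and 3): vocabulary + 0/1 vector.
docs = [['my', 'dog', 'is', 'cute'],
        ['stupid', 'dog']]

# Step 2: the vocabulary is the sorted union of all words.
vocab = sorted(set(word for doc in docs for word in doc))
print(vocab)  # ['cute', 'dog', 'is', 'my', 'stupid']

# Step 3: a document becomes a vector with a 1 at each word's vocabulary index.
vec = [0] * len(vocab)
for word in docs[1]:          # encode the second document: ['stupid', 'dog']
    vec[vocab.index(word)] = 1
print(vec)  # [0, 1, 0, 0, 1]
```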
The specific code implementation:
```python
from numpy import *

# Naive Bayes algorithm
def loadDataSet():
    trainData = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    labels = [0, 1, 0, 1, 0, 1]  # 1 means insulting speech, 0 means normal speech
    return trainData, labels

# Generate the vocabulary list
def createVocabList(trainData):
    VocabList = set([])
    for item in trainData:
        VocabList = VocabList | set(item)  # take the union of the two sets
    return sorted(list(VocabList))         # sort the result before returning

# Turn the training data into a set of vectors containing only 0s and 1s
def createWordSet(VocabList, trainData):
    VocabList_len = len(VocabList)  # length of the vocabulary list
    trainData_len = len(trainData)  # number of training documents
    # rows: one per training document; columns: one per vocabulary word
    WordSet = zeros((trainData_len, VocabList_len))
    for index in range(0, trainData_len):
        for word in trainData[index]:
            if word in VocabList:
                # positions of words present in the document become 1, the rest stay 0
                WordSet[index][VocabList.index(word)] = 1
    return WordSet

# Compute the probability vector for each class
def opreationProbability(WordSet, labels):
    WordSet_col = len(WordSet[0])
    labels_len = len(labels)
    WordSet_labels_0 = zeros(WordSet_col)
    WordSet_labels_1 = zeros(WordSet_col)
    num_labels_0 = 0
    num_labels_1 = 0
    for index in range(0, labels_len):
        if labels[index] == 0:
            WordSet_labels_0 += WordSet[index]  # vector addition
            num_labels_0 += 1                   # count
        else:
            WordSet_labels_1 += WordSet[index]  # vector addition
            num_labels_1 += 1                   # count
    p0 = WordSet_labels_0 * num_labels_0 / labels_len
    p1 = WordSet_labels_1 * num_labels_1 / labels_len
    return p0, p1

trainData, labels = loadDataSet()
VocabList = createVocabList(trainData)
train_WordSet = createWordSet(VocabList, trainData)
p0, p1 = opreationProbability(train_WordSet, labels)
# At this point the training is complete

# Start testing
testData = [['not', 'take', 'ate', 'my', 'stupid']]  # test data
test_WordSet = createWordSet(VocabList, testData)    # vector set of the test data
res_test_0 = sum(p0 * test_WordSet)
res_test_1 = sum(p1 * test_WordSet)
if res_test_0 > res_test_1:
    print("belongs to category 0")
else:
    print("belongs to category 1")
```
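For comparison: the score used above (the class word-frequency vector scaled by the class count) is not the textbook naive Bayes estimate, which may explain why the numbers differ from other write-ups even though the final class comes out the same. Below is a sketch of the more standard formulation, with Laplace smoothing and log probabilities to avoid underflow; the function names `trainNB` and `classifyNB` are my own, and the inputs are assumed to have the same shapes as `createWordSet` produces:

```python
from numpy import array, log, ones

# Standard naive Bayes training, for comparison with the scoring above.
# WordSet is the 0/1 document matrix; labels is the list of class labels.
def trainNB(WordSet, labels):
    num_docs = len(WordSet)
    num_words = len(WordSet[0])
    p_class1 = sum(labels) / num_docs  # prior P(class = 1)
    p0_num = ones(num_words)           # Laplace smoothing: counts start at 1...
    p1_num = ones(num_words)
    p0_denom = 2.0                     # ...and denominators at 2
    p1_denom = 2.0
    for i in range(num_docs):
        if labels[i] == 1:
            p1_num += WordSet[i]
            p1_denom += sum(WordSet[i])
        else:
            p0_num += WordSet[i]
            p0_denom += sum(WordSet[i])
    # log space, so products of many small probabilities do not underflow
    return log(p0_num / p0_denom), log(p1_num / p1_denom), p_class1

def classifyNB(vec, log_p0, log_p1, p_class1):
    score0 = sum(vec * log_p0) + log(1.0 - p_class1)
    score1 = sum(vec * log_p1) + log(p_class1)
    return 1 if score1 > score0 else 0
```

With the training data above, classifying the test vector would look like `classifyNB(test_WordSet[0], *trainNB(train_WordSet, labels))`.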
A disclaimer:
I found that the results I compute differ from other people's, although the final conclusion is the same. I do not know the exact reason; this is simply how I understand the algorithm, and my understanding may be wrong. I would appreciate guidance from anyone more experienced.
Parts of this are based on this expert's blog post:
Link: https://blog.csdn.net/moxigandashu/article/details/71480251