Statement
This article follows the code in the book "Machine Learning in Action", combines it with the book's explanations, and adds my own understanding and commentary.
Machine Learning in Action series posts:
- Machine Learning in Action - the k-nearest neighbor algorithm improves the matching results of a dating site
- Machine Learning in Action - building and plotting decision trees, with an example: predicting contact lens types
- Machine Learning in Action - the Naive Bayes algorithm and an example: filtering spam with a Bayes classifier
- Machine Learning in Action - Logistic regression, with an example: estimating the mortality of horses with colic
Defining Naive Bayes
Suppose there is a jar containing 7 stones, 3 gray and 4 black. If we draw one stone at random, what is the probability that it is gray? Since there are 7 equally likely outcomes and 3 of them are gray, the probability of drawing a gray stone is 3/7. Likewise, the probability of drawing a black stone is 4/7. We write P(gray) for the probability of drawing a gray stone; its value is the number of gray stones divided by the total number of stones.
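The counting argument above can be checked with a few lines of Python (a trivial sketch; the jar contents are hard-coded from the example):

```python
from fractions import Fraction

# The jar from the example: 3 gray and 4 black stones
stones = ["gray"] * 3 + ["black"] * 4

def prob(color, stones):
    """P(color): stones of that color divided by the total count."""
    return Fraction(stones.count(color), len(stones))

print(prob("gray", stones))   # 3/7
print(prob("black", stones))  # 4/7
```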
Now suppose the 7 stones are split between two buckets: bucket A holds 2 gray and 2 black stones, and bucket B holds 1 gray and 2 black stones. How should the probabilities above be computed?
To compute P(gray) or P(black) correctly, we need to know which bucket the stone comes from, because that information changes the answer. The probability of drawing a gray stone given that it comes from bucket B is called a conditional probability, written P(gray | bucketB) and read as "the probability of drawing a gray stone given that the stone comes from bucket B". It is easy to verify that P(gray | bucketA) = 2/4 and P(gray | bucketB) = 1/3.
Bayes' rule tells us how to swap the condition and the outcome in a conditional probability. If P(x | c) is known and we want P(c | x), we can compute:

P(c | x) = P(x | c) P(c) / P(x)
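Applying Bayes' rule to the bucket example: given that a gray stone was drawn, what is the probability it came from bucket B? The quantities below follow from the bucket contents stated earlier (bucket B holds 3 of the 7 stones, 1 of them gray):

```python
from fractions import Fraction

p_gray_given_B = Fraction(1, 3)   # P(gray | bucketB): 1 of B's 3 stones is gray
p_B = Fraction(3, 7)              # P(bucketB): 3 of the 7 stones sit in B
p_gray = Fraction(3, 7)           # P(gray): 3 of the 7 stones are gray

# Bayes' rule: P(bucketB | gray) = P(gray | bucketB) * P(bucketB) / P(gray)
p_B_given_gray = p_gray_given_B * p_B / p_gray
print(p_B_given_gray)  # 1/3 -- indeed, of the 3 gray stones exactly 1 is in B
```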
Document classification using Naive Bayes
An important application of machine learning is automatic document classification. In document classification each document (such as an email) is an instance, and elements of the document (here, the words it contains) form the features. Below we train a Naive Bayes classifier on some pre-labeled documents and then use it to classify previously unseen ones.
Construct training data
def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]    # 1 is abusive, 0 not
    return postingList,classVec
postingList is a matrix of words; each row is a sentence, which we also call a document. The i-th value of classVec records whether the i-th row of the matrix contains insulting language: 1 means yes, 0 means no.
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)
Here a set is built and all words are added to it; duplicates collapse automatically, so the resulting list enumerates the distinct words in the corpus.
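A quick check of the de-duplication (the function body is repeated so the snippet runs on its own; the two toy documents are made up for illustration):

```python
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)   # set union collapses duplicates
    return list(vocabSet)

# Two tiny documents sharing the word 'dog'
docs = [['my', 'dog', 'barks'], ['dog', 'park']]
vocab = createVocabList(docs)
print(sorted(vocab))  # ['barks', 'dog', 'my', 'park'] -- 'dog' appears once
```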
def setOfWords2Vec(vocabList,inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec
This converts each document into a vector of 0s and 1s over the vocabulary: position i is 1 if the i-th vocabulary word appears in the document and 0 otherwise.
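A small demonstration (function body repeated so the snippet is self-contained; the four-word vocabulary is made up). Note that the repeated 'dog' still only sets its slot to 1, which is exactly the set-of-words behavior revisited later:

```python
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1   # presence only, not a count
    return returnVec

vocab = ['my', 'dog', 'stupid', 'park']
print(setOfWords2Vec(vocab, ['my', 'dog', 'dog']))  # [1, 1, 0, 0]
```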
import numpy as np

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = np.zeros(numWords); p1Num = np.zeros(numWords)
    p0Denom = 0.0; p1Denom = 0.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num/p1Denom
    p0Vect = p0Num/p0Denom
    return p0Vect,p1Vect,pAbusive
The function takes the document matrix trainMatrix and the vector trainCategory of class labels, one per document. First it computes the probability that a document is abusive (class = 1), i.e. P(1). To compute p(wi | c1) and p(wi | c0), the numerator and denominator variables must be initialized. Because w has many elements, NumPy arrays let us update all of them at once: the numerator variables p0Num and p1Num are NumPy arrays whose length equals the vocabulary size. The for loop walks over every document in trainMatrix; whenever a word (insulting or normal) appears in a document, the corresponding entry of p1Num or p0Num is incremented, and the total word count of that class (p1Denom or p0Denom) grows by the document's word count. The same bookkeeping is done for both classes. Finally, each element is divided by the total number of words in its class.
When classifying a document with the Bayes classifier, we multiply many probabilities together to obtain the probability that the document belongs to a class, i.e. p(w0 | 1) p(w1 | 1) p(w2 | 1)... If any one of these probabilities is 0, the whole product is 0. To lessen this effect, we initialize every word count to 1 and every denominator to 2. In addition, we take the logarithm of the probabilities, which turns products into sums and also avoids numerical underflow when many small probabilities are multiplied.
def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = np.ones(numWords); p1Num = np.ones(numWords)
    p0Denom = 2.0; p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = np.log(p1Num/p1Denom)
    p0Vect = np.log(p0Num/p0Denom)
    return p0Vect,p1Vect,pAbusive
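The underflow problem that the log transform solves is easy to see directly (a standalone demonstration, not part of the classifier; the probability 0.01 and the count 200 are arbitrary choices):

```python
import numpy as np

probs = np.full(200, 0.01)        # 200 word probabilities of 0.01 each
raw_product = np.prod(probs)      # true value 1e-400 underflows to 0.0
log_sum = np.sum(np.log(probs))   # about -921.03, perfectly representable

print(raw_product)  # 0.0
print(log_sum)
```

Since log(a*b) = log(a) + log(b) and the logarithm is monotonically increasing, comparing sums of logs yields the same classification decision as comparing the raw products would, without ever underflowing.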
At this point the data is ready; let's build the classifier.
Building a Bayesian classifier
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)    # element-wise mult
    p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0
The classifier itself is very simple: it applies the Bayes formula directly. Because the denominator P(w) is the same for both classes, it can be dropped from the comparison, and since the probabilities are stored as logarithms, the product of word probabilities becomes the sum computed above.
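The decision rule can be exercised on a toy two-word vocabulary (function body repeated so the snippet runs on its own; the probability values are invented for illustration, with word 0 favoring class 0 and word 1 favoring class 1):

```python
import numpy as np

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    return 1 if p1 > p0 else 0

# Log-probabilities of each word under each class (made-up values)
p0Vec = np.log(np.array([0.6, 0.1]))
p1Vec = np.log(np.array([0.1, 0.6]))

print(classifyNB(np.array([1, 0]), p0Vec, p1Vec, 0.5))  # 0
print(classifyNB(np.array([0, 1]), p0Vec, p1Vec, 0.5))  # 1
```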
Test classifier
def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(np.array(trainMat), np.array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
Testing the classifier on two inputs gives the following results:
['love', 'my', 'dalmation'] classified as: 0
['stupid', 'garbage'] classified as: 1
So far we have treated the presence of each word as a feature; this is called the set-of-words model. If a word appears more than once in a document, that repetition carries information which mere presence or absence cannot capture; counting occurrences instead is called the bag-of-words model. In a bag of words each word may appear multiple times, while in a set of words each word appears at most once. To support the bag-of-words model, setOfWords2Vec() needs only a small change; the modified function is called bagOfWords2VecMN().
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
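The difference between the two models shows up as soon as a word repeats (both function bodies repeated so the snippet is self-contained; the three-word vocabulary is made up):

```python
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1    # presence only
    return returnVec

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1   # counts occurrences
    return returnVec

vocab = ['stupid', 'dog', 'my']
doc = ['stupid', 'stupid', 'dog']
print(setOfWords2Vec(vocab, doc))    # [1, 1, 0] -- the repeat is lost
print(bagOfWords2VecMN(vocab, doc))  # [2, 1, 0] -- the repeat is kept
```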
Example: Using Naive Bayes to filter spam
Splitting the email text
import re

def textParse(bigString):    # input is a big string, output is a word list
    regEx = re.compile(r'\W')
    listOfTokens = regEx.split(bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]
A regular expression splits the string on every non-alphanumeric character; tokens shorter than three characters are dropped and everything is converted to lowercase.
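For example (function body repeated so the snippet runs on its own; the sample sentence is made up):

```python
import re

def textParse(bigString):
    regEx = re.compile(r'\W')          # split on any non-word character
    listOfTokens = regEx.split(bigString)
    # lowercase everything, drop fragments shorter than 3 characters
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

print(textParse("This book is the BEST book on M.L. I have ever read!"))
# ['this', 'book', 'the', 'best', 'book', 'have', 'ever', 'read']
```

Notice that short tokens such as "is", "on", "M", "L", and "I" are filtered out.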
Mail classification
import random

def spamTest():
    docList=[]; classList = []; fullText =[]
    for i in range(1,26):
        wordList = textParse(open("./email/spam/%d.txt" % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('./email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)        # create vocabulary
    trainingSet = list(range(50)); testSet=[]
    for i in range(10):                         # hold out 10 random docs as the test set
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:                # train the classifier (get probs) with trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(np.array(trainMat),np.array(trainClasses))
    errorCount = 0
    for docIndex in testSet:                    # classify the held-out items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(np.array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
            print("classification error",docList[docIndex])
    print('the error rate is: ',float(errorCount)/len(testSet))
The final result is:
classification error ['yeah', 'ready', 'may', 'not', 'here', 'because', 'jar', 'jar', 'has', 'plane', 'tickets', 'germany', 'for']
the error rate is: 0.1
In other words, on the ten randomly selected test emails the classifier reached 90% accuracy. Because the test set is chosen at random, the error rate varies from run to run; averaging over many runs gives a more reliable estimate.
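Averaging could be sketched as follows, assuming spamTest() were modified to return its error rate instead of only printing it. Since running it requires the email corpus on disk, a hypothetical stand-in that returns plausible error rates is used here so the sketch is self-contained:

```python
import random

def spamTest_stub():
    """Hypothetical stand-in for a spamTest() that returns its error rate.
    The real function would need ./email/spam and ./email/ham on disk."""
    return random.choice([0.0, 0.1, 0.2])

random.seed(42)   # fixed seed so the sketch is reproducible
runs = 10
avg_error = sum(spamTest_stub() for _ in range(runs)) / runs
print('average error rate over %d runs: %.3f' % (runs, avg_error))
```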