Machine Learning in Action - the Naive Bayes algorithm in practice: using a Bayesian classifier to filter spam

Note

        This article is based on the code in the book "Machine Learning in Action", combined with the book's explanations and my own understanding and elaboration.

Machine Learning in Action blog series

 

Naive Bayesian definition

        Suppose there is a jar containing 7 stones, 3 of them gray and 4 black. If a stone is drawn at random from the jar, what is the probability that it is gray? Since there are 7 possible stones to draw and 3 of them are gray, the probability of drawing a gray stone is 3/7. What is the probability of drawing a black stone? Obviously, 4/7. We write P(gray) for the probability of drawing a gray stone; its value is obtained by dividing the number of gray stones by the total number of stones.
 

                          P(gray) = number of gray stones / total number of stones = 3/7

        If these 7 stones are placed in two buckets, how should the above probability be calculated?

        To calculate P(gray) or P(black), we need to know which bucket the stone comes from, because that information changes the result. Suppose bucket A holds 2 gray and 2 black stones, and bucket B holds 1 gray and 2 black stones. The probability of drawing a gray stone given that we draw from bucket B is a conditional probability, written P(gray | bucketB): "the probability of drawing a gray stone, given that the stone is known to come from bucket B". It is not hard to see that P(gray | bucketA) = 2/4 and P(gray | bucketB) = 1/3.
        Bayes' rule tells us how to swap the condition and the result in a conditional probability: if P(x | c) is known and we need P(c | x), it can be computed as follows:
                                          P(c | x) = P(x | c) P(c) / P(x)

Document classification using Naive Bayes

        An important application of machine learning is the automatic classification of documents. In document classification, an entire document (such as an email) is an instance, and elements of the email, such as individual words, constitute its features. Below we use some pre-labeled documents to train a Naive Bayes classifier and then use it to classify unseen documents.
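        Concretely, for a document represented by a word vector w, Bayes' rule gives the per-class score we will compare, and the "naive" part of Naive Bayes is the assumption that the words are conditionally independent given the class:

                                          P(ci | w) = P(w | ci) P(ci) / P(w)
                                          P(w | ci) = p(w0 | ci) p(w1 | ci) p(w2 | ci) ... p(wN | ci)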

Construct training data

def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
    return postingList,classVec

        postingList is a list of word lists; each row is a sentence, which can also be called a document. The i-th value of classVec indicates whether the i-th document contains abusive language: 1 means yes, 0 means no.

 

def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)

Here a set is created and every word from every document is added to it with the union operator |, so the returned vocabulary list contains each distinct word exactly once.

 

def setOfWords2Vec(vocabList,inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

Here each document is converted into a vector of 0s and 1s: position j is 1 if the j-th word of the vocabulary appears in the document, and 0 otherwise.
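A quick interactive check (assuming the three functions above are already defined) shows what the vocabulary and a document vector look like; note that the order of words inside the vocabulary depends on Python's set ordering and may differ between runs:

listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)            # all distinct words in the corpus
docVec = setOfWords2Vec(myVocabList, listOPosts[0])
print(len(myVocabList))    # 32 distinct words in this toy data set
print(sum(docVec))         # 7: the first post contains 7 vocabulary words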

 

import numpy as np

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = np.zeros(numWords); p1Num = np.zeros(numWords)
    p0Denom = 0.0; p1Denom = 0.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num/p1Denom
    p0Vect = p0Num/p0Denom
    return p0Vect,p1Vect,pAbusive

        The function's input parameters are the document matrix trainMatrix and the vector trainCategory holding the class label of each document. First, the probability that a document is abusive (class = 1), i.e. P(1), is computed as pAbusive. To estimate p(wi | c1) and p(wi | c0) we need numerator and denominator variables. Because w contains many elements, NumPy arrays let us compute these values quickly: the numerator variables p0Num and p1Num are NumPy arrays whose length equals the vocabulary size, while the denominator variables p0Denom and p1Denom are scalars counting the total number of words seen in each class. The for loop traverses all documents in trainMatrix. Whenever a word appears in a document, the corresponding entry of that class's word-count vector (p1Num or p0Num) is incremented, and the class's total word count is increased by the number of words in the document. The same bookkeeping is done for both classes. Finally, each element is divided by the total number of words in its class.
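        In other words, the vectors returned by this version of trainNB0 are plain frequency estimates:

                                          p(wi | c1) = (count of word i across all class-1 documents) / (total number of words in class-1 documents)

        and likewise for class 0.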

        When classifying a document with the Bayesian classifier, many probabilities are multiplied together to obtain the probability that the document belongs to a class, i.e. p(w0 | 1) p(w1 | 1) p(w2 | 1) ... If any one of these probabilities is 0, the whole product becomes 0. To reduce this effect, the count of every word can be initialized to 1 and the denominators to 2. In addition, multiplying many small probabilities can underflow to 0, so we take the logarithm of each probability, which turns the product into a sum (since ln(a * b) = ln(a) + ln(b)).
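        With these two changes, each entry stored in p1Vect is the smoothed log-probability

                                          ln( (count of word i in class-1 documents + 1) / (total number of words in class-1 documents + 2) )

        and likewise for p0Vect: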

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = np.ones(numWords); p1Num = np.ones(numWords)     
    p0Denom = 2.0; p1Denom = 2.0                       
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = np.log(p1Num/p1Denom)
    p0Vect = np.log(p0Num/p0Denom)
    return p0Vect,p1Vect,pAbusive

        At this point the data is ready; let's build the classifier.

Building a Bayesian classifier

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)  # element-wise mult
    p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

        The classifier is very simple: since the denominator P(w) in the Bayes formula is the same for both classes, it is enough to compare ln P(w | c1) + ln P(c1) with ln P(w | c0) + ln P(c0) and return the class with the larger value.

Test classifier

def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(np.array(trainMat), np.array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))

        Testing the classifier on two example documents gives the following results:

['love', 'my', 'dalmation'] classified as:  0
['stupid', 'garbage'] classified as:  1

        So far we have treated the presence or absence of each word as a feature; this can be described as the set-of-words model. If a word appears more than once in a document, that fact may carry information that mere presence or absence cannot express. This approach is called the bag-of-words model. In the bag-of-words model each word can appear multiple times, while in the set-of-words model each word can appear only once. To support the bag-of-words model, the function setOfWords2Vec() needs a slight modification; the modified function is called bagOfWords2VecMN().

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
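A small illustrative comparison (the example document here is made up, not taken from the book's data) shows the difference between the two models:

myVocabList = createVocabList(loadDataSet()[0])
doc = ['stupid', 'stupid', 'dog']                 # 'stupid' appears twice
print(setOfWords2Vec(myVocabList, doc))           # 1 at the 'stupid' position
print(bagOfWords2VecMN(myVocabList, doc))         # 2 at the 'stupid' position, 1 at 'dog'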

 

Example: Using Naive Bayes to filter spam

Tokenizing the text

import re

def textParse(bigString):    #input is big string, #output is word list
    regEx = re.compile('\\W')
    listOfTokens = regEx.split(bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

        Here a regular expression splits the string on non-alphanumeric characters; tokens of two characters or fewer are discarded and the remaining tokens are lowercased.
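        For example (an illustrative string, not taken from the data set):

print(textParse('Hi Peter, the Python book is due back on the 5th.'))
# ['peter', 'the', 'python', 'book', 'due', 'back', 'the', '5th']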

 

Mail classification

import random

def spamTest():
    docList=[]; classList = []; fullText =[]
    for i in range(1,26):
        wordList = textParse(open("./email/spam/%d.txt" % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('./email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    trainingSet = list(range(50)); testSet=[]           #create test set
    for i in range(10):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(np.array(trainMat),np.array(trainClasses))
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(np.array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
            print("classification error",docList[docIndex])
    print('the error rate is: ',float(errorCount)/len(testSet))

        The final result is:

classification error ['yeah', 'ready', 'may', 'not', 'here', 'because', 'jar', 'jar', 'has', 'plane', 'tickets', 'germany', 'for']
the error rate is:  0.1

        That is, on the ten randomly selected test emails the error rate is 0.1, i.e. 90% accuracy; because the test set is chosen at random, the exact error rate varies from run to run.
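        To reproduce the results above as a single script (assuming all of the functions in this post live in one file, and that ./email/spam/ and ./email/ham/ each contain the book's 25 numbered .txt files), it is enough to add an entry point:

if __name__ == '__main__':
    testingNB()    # toy abusive-post example
    spamTest()     # spam-filtering example; the error rate varies with the random split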


Origin blog.csdn.net/qq_41685265/article/details/105317607