Naive Bayes in practice: distinguishing Sun Xiaochuan edc fans from fans (Python implementation)

Reading notes on Machine Learning in Action, plus my own hands-on practice.

I. Introduction to the Naive Bayes algorithm

Suppose we have a data set consisting of two classes (a simplification), and the class of each sample is already known. The data are distributed as below (figure taken from MLiA):

(Figure: scatter plot of the red and blue classes)

Now there is a new point new_point(x, y) whose class is unknown. We can use p1(x, y) to denote the probability that the data point (x, y) belongs to the red class, and p2(x, y) to denote the probability that it belongs to the blue class. Which class should new_point be assigned to, red or blue?

We propose the following rule:

If p1(x, y) > p2(x, y), then (x, y) belongs to the red class.

If p1(x, y) < p2(x, y), then (x, y) belongs to the blue class.

To describe this rule in plain language: choose the class with the higher probability as the label of the new point. This is the core idea of Bayesian decision theory: choose the decision with the highest probability.

The Bayesian classification criterion can also be stated with conditional probabilities:

If p(red | x, y) > p(blue | x, y), then (x, y) belongs to the red class.

If p(red | x, y) < p(blue | x, y), then (x, y) belongs to the blue class.

In other words, when a new point needs to be classified, we only need to compute

max( p(c1 | x, y), p(c2 | x, y), p(c3 | x, y), ..., p(cn | x, y) )

for that point; the class with the highest probability is the label of the new point.

So the question becomes: how do we compute p(ci | x, y) for each class i?

Yes, with Bayes' formula:

p(ci | x, y) = p(x, y | ci) p(ci) / p(x, y)
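As a minimal illustration of this decision rule (the priors and likelihoods below are made-up numbers, not from any real data), we can compute a score proportional to each posterior and pick the class with the largest one:

# Minimal sketch of the Bayesian decision rule: pick the class whose posterior
# p(ci | x, y) is largest. Priors and likelihoods are made up for illustration.
priors = {'red': 0.6, 'blue': 0.4}           # p(ci)
likelihoods = {'red': 0.02, 'blue': 0.05}    # p(x, y | ci) for the new point

# p(ci | x, y) = p(x, y | ci) * p(ci) / p(x, y); the denominator p(x, y) is the
# same for every class, so it can be ignored when comparing classes.
scores = {c: likelihoods[c] * priors[c] for c in priors}
label = max(scores, key=scores.get)
print(label)   # 'blue' here, since 0.05 * 0.4 = 0.020 > 0.02 * 0.6 = 0.012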

Example: determining whether a comment document is abusive

Training data: 6 comment documents taken from a website, each already known to be either normal or abusive. Tokenize them to build a vocabulary, and compute each word's frequency within each class.

Test data: another comment document taken from the site, class unknown. Tokenize it into words, compare them against the vocabulary, and obtain a vector marking which vocabulary words appear in it.
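A minimal sketch of that word-vector idea (toy English documents and whitespace tokenization are assumed here purely for illustration; the real code below tokenizes Chinese with jieba and uses setOfWords2Vec):

# Build a vocabulary from training comments, then turn a test comment into a
# 0/1 vector marking which vocabulary words it contains.
train_docs = ["you are great", "you are stupid"]
vocab = sorted(set(word for doc in train_docs for word in doc.split()))

test_doc = "you are really great"
vec = [1 if word in test_doc.split() else 0 for word in vocab]
print(vocab)   # ['are', 'great', 'stupid', 'you']
print(vec)     # [1, 1, 0, 1]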

Here X denotes the feature vector made up of the individual features (words) x1, x2, x3, ...

p(bad | X) means: given the comment X, the probability that this comment is abusive.

Applying Bayes' formula, this converts to:

P(bad|X) = p(X|bad) p(bad) / p(X)

P(normal | X) = p(X|normal)p(normal) / p(X)

Compare the two probabilities above: if p(bad|X) > p(normal|X), the comment is abusive; otherwise it is not.

The formal definition of the naive Bayes classifier is as follows:

      1. Let x = {a1, a2, ..., am} be an item to be classified, where each a is a feature attribute of x.

      2. There is a set of classes C = {y1, y2, ..., yn}.

      3. Compute p(y1 | x), p(y2 | x), ..., p(yn | x).

      4. If p(yk | x) = max{ p(y1 | x), p(y2 | x), ..., p(yn | x) }, then x belongs to class yk.

      So the key now is how to calculate the conditional probabilities in step 3. We can do it like this:

      1. Find a set of items whose classes are already known; this set is called the training set.

      2. From the training set, obtain statistical estimates of the conditional probability of each feature attribute under each class, i.e. p(a1 | y1), p(a2 | y1), ..., p(am | y1); ...; p(a1 | yn), p(a2 | yn), ..., p(am | yn) (a tiny sketch of this estimation is given right after this list).

      3. If the feature attributes are conditionally independent, then by Bayes' theorem, p(yi | x) = p(x | yi) p(yi) / p(x). Because the denominator is a constant for all classes, we only need to maximize the numerator. And because the feature attributes are conditionally independent, we have p(x | yi) p(yi) = p(a1 | yi) p(a2 | yi) ... p(am | yi) p(yi).
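Here is that tiny sketch of the estimation in step 2 (the toy documents are made up; for this post's data the estimation is done by trainNB0 further below): each p(word | class) is just the word's relative frequency within that class.

# Estimate p(word | class) as a word's relative frequency within one class.
abusive_docs = [["you", "idiot"], ["stupid", "idiot"]]

counts = {}
total_words = 0
for doc in abusive_docs:
    for word in doc:
        counts[word] = counts.get(word, 0) + 1
        total_words += 1

p_word_given_bad = {w: c / total_words for w, c in counts.items()}
print(p_word_given_bad)   # {'you': 0.25, 'idiot': 0.5, 'stupid': 0.25}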

 

Here we introduce the naive Bayes assumption. If each word is treated as an independent feature, the comment content can be expanded into its words (x1, x2, x3, ..., xn), which gives the following derivation (the denominator p(X) is dropped because it is the same for both classes, so ∝ is used instead of =):

 

p(bad|X) ∝ p(X|bad) p(bad) = p(x1, x2, x3, x4, ..., xn | bad) p(bad)

Assuming the words are all conditionally independent of each other, this splits further into:

p(bad|X) ∝ p(x1|bad) p(x2|bad) p(x3|bad) ... p(xn|bad) p(bad)

Looking at the formula above: each p(xi|bad) is easy to estimate, since it is just the frequency with which word xi appears in the abusive documents of the training data. Note that for a test comment, each term is weighted by 0 or 1 according to whether the word appears in it; a word whose estimated probability is 0 would drive the whole product to 0 (which is why the counts in the code below are initialized to ones), and multiplying many small probabilities underflows, so we take the natural log of everything and multiplication becomes addition: ln(p(X|bad)) = Σ (0 or 1) · ln(p(xi|bad)).

So the whole expression becomes ln(p(bad|X)) ∝ ln(p(X|bad)) + ln(p(bad)) = Σ (0 or 1) · ln(p(xi|bad)) + ln(p(bad)).

p(bad) is simply the proportion of abusive documents in the training set. With that, our problem is solved.
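A quick numeric illustration (with made-up probabilities) of why the log is taken: multiplying many small conditional probabilities underflows to 0 in floating point, while summing their logs stays usable and still lets us compare the two classes.

import math

# 300 made-up per-word conditional probabilities, each 0.01.
probs = [0.01] * 300

product = 1.0
for p in probs:
    product *= p
print(product)    # prints 0.0 -- the product has underflowed

log_sum = sum(math.log(p) for p in probs)
print(log_sum)    # about -1381.6, still fine for comparing classes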

The comments used in the Machine Learning in Action book are all in English; here I change the dataset to comments from the Sun Xiaochuan / edc feud, and the word-segmentation parts of the code are modified accordingly (jieba is used for the Chinese tokenization).

Label 1 marks abusive sentences and 0 marks fan-circle sentences (the title is a bit of clickbait: we are not really telling the two fan groups apart, just whether a comment is abusive or not).

The overall code is as follows:

from numpy import *
import jieba
import jieba.posseg as pseg
import re
 
import jieba.analyse

def fenci(str):
    # Tokenize a Chinese sentence with jieba and keep only word tokens
    tags = pseg.cut(str)    # jieba word segmentation with POS tags
    res = []
    for t in tags:
        if(t.flag != 'w' and t.flag != 'x'):   # skip punctuation and other symbol tags
            res.append(t.word)
    return res

def loadDataSet():
    coachData=['这群楼上都他妈屌丝的不行',
               '一群屌丝。弱智一样。',
               '宝宝,好好休息',
               '纯路人,你处世性格就像个铁憨憨',
               '就想说,不要在意那些有恶意的人,如果你在意那就让他们得意了,您应该时刻记着还有我们在时刻支持您,不管以前的路怎么走,往后的日子会越来越好,加油!',
               '女菩萨,看看?嗷',
               '透你们?的',
               '女菩萨可以嗦一下你的??',
               '在我心中你是最好的 ',
               '宝宝注意身体 工作加油',
               '啊啊啊啊啊注意身体 好好休息',
               '早早早 好好休息好好照顾自己']
    classVec = [1,1,0,1,0,1,1,1,0,0,0,0]    #1 is abusive, 0 not
    postingList=[]
    for i in range(len(coachData)):
        
        postingList.append(fenci(coachData[i]))
    return postingList,classVec

def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        #else: print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)    # p(bad): fraction of abusive documents
    p0Num = ones(numWords); p1Num = ones(numWords)       # init counts to 1 so no word gets probability 0
    p0Denom = 2.0; p1Denom = 2.0                         # denominators start at 2 to match
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)          # log(p(word|abusive)) vector; log avoids underflow
    p0Vect = log(p0Num/p0Denom)          # log(p(word|normal)) vector
    return p0Vect,p1Vect,pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else: 
        return 0

listOPosts,listClasses=loadDataSet()
myVocabList=createVocabList(listOPosts)
trainMat=[]
for postinDoc in listOPosts:
    trainMat.append(setOfWords2Vec(myVocabList,postinDoc))
p0V,p1V,pAb=trainNB0(trainMat,listClasses)
testStr='铁憨憨,给老子爬'

testEntry=fenci(testStr)
thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
print(testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb))
testStr='宝宝,好好照顾好自己呀'
testEntry=fenci(testStr)
thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
print(testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb))
testStr='天空一声惊雷响,新津降下孙笑川。 年少不知精子贵,欲把香?都透穿。 手持雷霆双节棍,亲?莱莱一锅端。 步入中年威名响,敢骂冠希铁憨憨。 自称地下Rap皇,电鳗 ? 恰的酸。 老来亲属无人怜,日日走访鬼门关。 终因纵欲无节制,小命丧于红塔山。'
testEntry=fenci(testStr)
thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
print(testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb))

Origin blog.csdn.net/weixin_40631132/article/details/89052629