[Plain-Language Explanation] The Naive Bayes Model: Principles and Applications

0x00 Summary

The naive Bayes model is a frequently mentioned machine learning concept, but I suspect many readers know the "what" without the "why." This article tries to explain the principles of the naive Bayes model in a plain, accessible way, and then deepens that understanding through a concrete application scenario and source code.

0x01 Related IT Concepts

1. Classification

  • Given m samples (x1, y1), ..., (xm, ym), where x is the feature variable and y is the corresponding category, we want a mapping rule or model function h such that for a new sample xt we can predict yt = h(xt) as accurately as possible.

  • We can also look at this problem from a probabilistic point of view. Suppose the labels take values in a set of categories, i.e. each yi ∈ {C1, ..., Ck}. For a new sample xt, if we can compute the conditional probability of every category, P(C1 | xt), ..., P(Ck | xt), then we can take the category with the largest probability as the category xt belongs to.

h is called a classifier. The job of a classification algorithm is to construct such a classifier h.

  • An intuitive way to understand a classifier: it computes the posterior probability of each category and outputs the category with the highest probability.
  • The core task of any classification algorithm is: given the features, decide the category. This is the key question in every classification problem, and different classification algorithms answer it with different core ideas.

2. Naive Bayes

Naive Bayes is a classification method whose theoretical foundation is Bayes' theorem together with the conditional independence assumption. "Naive" means that the features are assumed to be conditionally independent of one another.

The basic idea of a Bayes classifier: using statistics gathered over some feature attributes, compute the probability of each category and pick the category with the largest probability. In other words, a Bayes classifier predicts the probability that an object belongs to each category, and then predicts its category accordingly:

  • Find a collection of items whose categories are already known; this collection is called the training set.
  • From statistics on the training set, estimate the conditional probability of each feature within each category.
  • For a new item, pick the category with the largest probability.

3. The formulas explained

3.1 Bayes Theorem

Continuing the Hu Yanzhuo example from the previous article:
Question to solve (A): Hu Yanzhuo wants to know whether he is one of Brother Gongming's (Song Jiang's) trusted confidants. Let A denote "big brother regards you as a confidant."
Known result (B): big brother bows to you. Record this as event B.
Inference P(A | B): given that big brother has bowed to you, the re-evaluated probability that he regards you as a confidant.

Then:

P(A|B) = P(B|A)P(A)/P(B)
P(A|B) is the re-evaluation of the probability of event A ("big brother regards you as a confidant") after event B ("big brother bows to you") has occurred.

In fact, the formula above already hints at classification: it can divide people into two categories, big brother's confidants versus ordinary subordinates.
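
As a quick numeric illustration (these numbers are purely hypothetical and not taken from the previous article): suppose P(A) = 0.2, P(B|A) = 0.9 and P(B) = 0.3. Then

P(A|B) = P(B|A)P(A)/P(B) = 0.9 × 0.2 / 0.3 = 0.6

so observing B raises the estimate that you are a confidant from 0.2 to 0.6.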

3.2 Re-reading the formula as classification

The Bayes formula can therefore be re-read from a classification point of view.

We interpret B as "having a certain feature" and A as "the category label." In the simplest binary classification problem (a yes/no judgment), A is understood as the label "belongs to a certain class."

P(category | features) = P(features | category) P(category) / P(features)
  • P(A) is the prior probability: the probability distribution over the categories.
  • P(B|A) is the conditional probability (likelihood): the probability that something happens given a certain category. It can be obtained from statistics; this is where the concept of maximum likelihood estimation comes in.
  • P(A|B) is the posterior probability: given that an event has occurred, the probability that it belongs to a certain category. Samples are classified using the posterior probability: the larger the posterior probability, the more likely the item belongs to that category, and the more justified we are in assigning it to that category.

3.3 Extending to multiple conditions (features)

Up to now we assumed that A is affected by only a single condition B, but in practice a thing is rarely determined by a single feature; there are usually several influencing factors. Suppose B consists of n factors: b1, b2, ..., bn.

Then P(A|B) can be written as:

P(A|b1,b2,...,bn) = P(A) P(b1,b2,...,bn|A) / P(b1,b2,...,bn)

Since the features b1 through bn are assumed to be conditionally independent given the class, with each bi unrelated to the other features, we can make the following conversion:

P(b1,b2,...,bn|A) = P(b1|A)P(b2|A)...P(bn|A)

This conversion is just "the joint distribution of independent variables equals the product of their individual distributions." The probabilities here are conditional, but because every factor conditions on the same A, within the sample space restricted to A this is still simply turning a joint distribution into a product of individual distributions.

Substituting back into Bayes' theorem gives:

P(A|b1,b2,...,bn) = P(A) [P(b1|A)P(b2|A)...P(bn|A)] / P(b1,b2,...,bn)
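
To make this product form concrete, here is a minimal Python sketch of the resulting decision rule "pick the class A that maximizes P(A) · ∏ P(bi|A)". It is my own illustration, not code from this article or from any particular library; the function name and the tiny smoothing floor for unseen features are my choices.

from math import prod

def naive_bayes_predict(prior, likelihood, features):
    """prior: {class: P(class)}; likelihood: {class: {feature: P(feature | class)}}."""
    scores = {
        c: prior[c] * prod(likelihood[c].get(f, 1e-9) for f in features)  # tiny floor for unseen features
        for c in prior
    }
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())  # predicted class and normalized posterior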

0x02 How Hu Yanzhuo Applies the Naive Bayes Model to Classification

In the previous article, "[Plain-Language Explanation] Bayes' Theorem in Simple Terms," Hu Yanzhuo used Bayes' theorem to conclude that he is not one of Brother Gongming's confidants. Although somewhat dejected, he was also curious about the power of Bayes' theorem, so he decided to use a naive Bayes model to classify the cavalry leaders and infantry leaders.

1. A minimalist naive Bayes classification model

Suppose there is a minimalist version of the naive Bayes classifier: it distinguishes two classes (A1, A2) using two features (B1, B2).
So the formula is:

P(A|B1,B2) = P(A) [P(B1|A)P(B2|A)] / P(B1,B2)

For classification, the denominator is also expanded into a product of the individual feature probabilities (treating the features as independent):

P(A|B1,B2) = P(A) [P(B1|A)P(B2|A)] / P(B1,B2) = P(A) [P(B1|A)P(B2|A)] / [P(B1) P(B2)]

Here B1 and B2 are the feature variables, Ai denotes a class, and P(Ai|B1,B2) is the probability of assigning the item to class Ai given features B1 and B2.

To review again: a naive Bayes classifier predicts the probability that an object belongs to each category, and then predicts its category:

  • Find a collection of items whose categories are already known; this collection is called the training set.
  • From statistics on the training set, estimate the conditional probability of each feature within each category.
  • For a new item, pick the category with the largest probability.

2. Known conditions

The sample consists of 10 cavalry leaders and 10 infantry leaders, set up as follows:

There are two known classes:
A1 = cavalry leader
A2 = infantry leader

Two features used for classification:
F1 = tattoo
F2 = troublemaking

The features take the following values:
f11 = has a tattoo
f12 = no tattoo
f21 = makes trouble
f22 = does not make trouble

With the classifier model and these preconditions in place, let's see how the classifier's parameters are derived.

3. The training process and data

The following statistics come from the known data. This is the training step: the classification parameters are estimated from actual observed values.

Suppose that among the cavalry leaders, 2 have tattoos and 1 makes trouble, while among the infantry leaders, 7 have tattoos and 6 make trouble. The statistics are then:

P(has a tattoo) = P(f11) = (7+2)/20 = 9/20 = 0.45
P(no tattoo) = P(f12) = 11/20 = 0.55
P(makes trouble) = P(f21) = 7/20 = 0.35
P(does not make trouble) = P(f22) = 13/20 = 0.65

P(F1=f11|A=A1) = P(has a tattoo | cavalry leader) = 2/10 = 0.2
P(F1=f12|A=A1) = P(no tattoo | cavalry leader) = 8/10 = 0.8
P(F1=f11|A=A2) = P(has a tattoo | infantry leader) = 7/10 = 0.7
P(F1=f12|A=A2) = P(no tattoo | infantry leader) = 3/10 = 0.3
P(F2=f21|A=A1) = P(makes trouble | cavalry leader) = 1/10 = 0.1
P(F2=f22|A=A1) = P(does not make trouble | cavalry leader) = 9/10 = 0.9
P(F2=f21|A=A2) = P(makes trouble | infantry leader) = 6/10 = 0.6
P(F2=f22|A=A2) = P(does not make trouble | infantry leader) = 4/10 = 0.4

(Each conditional probability is computed within its own class, which contains 10 leaders.)

These statistics are the trained parameters of the classification model.

They can now be plugged into the classifier from before:

P(A|F1,F2) = P(A) [P(F1|A)P(F2|A)] / P(F1,F2) = P(A) [P(F1|A)P(F2|A)] / [P(F1) P(F2)]

and used to process the data waiting to be classified.

4. How to classify

Suppose some leader x has no tattoo and does not make trouble. We run the computation twice, once for each class (cavalry leader, infantry leader), and compare the two values.

How likely is x (no tattoo, no trouble) to be a cavalry leader?

P(cavalry leader | no tattoo, no trouble) = P(cavalry leader) [P(no tattoo | cavalry leader) P(no trouble | cavalry leader)] / [P(no tattoo) P(no trouble)]

Numerator: P(A=A1) P(F1=f12|A=A1) P(F2=f22|A=A1) = 0.5 × 0.8 × 0.9 = 0.36

How likely is x (no tattoo, no trouble) to be an infantry leader?

P(infantry leader | no tattoo, no trouble) = P(infantry leader) [P(no tattoo | infantry leader) P(no trouble | infantry leader)] / [P(no tattoo) P(no trouble)]

Numerator: P(A=A2) P(F1=f12|A=A2) P(F2=f22|A=A2) = 0.5 × 0.3 × 0.4 = 0.06

The denominator P(no tattoo) P(no trouble) = 0.55 × 0.65 is the same in both cases, so it is enough to compare the numerators: 0.36 > 0.06, and x is far more likely to be a cavalry leader.

The great advantage of Bayes' theorem here is that it lets us compute an unknown probability from known frequencies: we simply use observed frequencies as probabilities.
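
The whole worked example fits in a few lines of Python. This is only a sketch to double-check the arithmetic above; the English class and feature names are my own labels.

# Compare P(class) * P(no tattoo | class) * P(no trouble | class) for the two classes.
prior        = {"cavalry": 10 / 20, "infantry": 10 / 20}
p_no_tattoo  = {"cavalry": 8 / 10,  "infantry": 3 / 10}
p_no_trouble = {"cavalry": 9 / 10,  "infantry": 4 / 10}

scores = {c: prior[c] * p_no_tattoo[c] * p_no_trouble[c] for c in prior}
total = sum(scores.values())
for c, s in scores.items():
    # numerator, and the posterior obtained by normalizing over the two classes
    print(c, round(s, 2), round(s / total, 3))
# cavalry 0.36 0.857
# infantry 0.06 0.143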

0x03 Reading the snowNLP Source Code

We can deepen our understanding of the naive Bayes model by reading the snowNLP source code.

The Bayes object has two attributes, total and d. d is a data dictionary, and total stores the total word count summed over all categories. After the train method has been run on the training dataset, the keys of d are the classification labels and each value is an AddOneProb object.

The code here simply uses frequencies as probabilities. Training amounts to counting how many words correspond to each classification label (key).

1. Source

# Train on a dataset
def train(self, data):
    # Iterate over the dataset; data contains both positive and negative samples
    for d in data:  # data is a list
        # d[0]: the word list produced by segmentation
        # d[1]: the label, i.e. the classification category (positive / negative sample)
        c = d[1]
        # Check whether the data dictionary already has this label
        if c not in self.d:
            # If not, add the label; its value is an AddOneProb object
            self.d[c] = AddOneProb()  # class initialization
        # d[0] is the word list of the review; iterate over the words
        for word in d[0]:
            # Call AddOneProb's add method to add the word
            self.d[c].add(word, 1)
    # Total word count = sum over the positive and negative classes
    self.total = sum(map(lambda x: self.d[x].getsum(), self.d.keys()))  # sum of the sums over all entries in d
                
class AddOneProb(BaseProb):
    def __init__(self):
        self.d = {}
        self.total = 0.0
        self.none = 1

    # Add a word
    def add(self, key, value):
        # Update the total word count for this category
        self.total += value
        # If the word has not been seen before, create a new key
        if not self.exists(key):
            # Add the word to this label's data dictionary with value 1
            self.d[key] = 1
            # Update the total word count
            self.total += 1
        # Increment the word's count by value
        self.d[key] += value
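
For orientation, here is a rough usage sketch. The loop above suggests that each training sample is a (word list, label) pair; the import path, the no-argument Bayes() constructor, and the "pos"/"neg" labels below are my own assumptions, not something stated in this article or verified against the library.

# Hypothetical usage sketch; treat the import path and data layout as assumptions.
from snownlp.classification.bayes import Bayes  # assumed module path

classifier = Bayes()
data = [
    (["电影", "很", "好看"], "pos"),   # toy positive sample: segmented words + label
    (["剧情", "很", "无聊"], "neg"),   # toy negative sample
]
classifier.train(data)
print(list(classifier.d))   # the label keys, e.g. ['pos', 'neg']
print(classifier.total)     # total count accumulated over both labels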

The actual classification computes a probability for each classification label:

# Bayes classification
def classify(self, x):
    tmp = {}
    # Iterate over each classification label
    for k in self.d:  # the positive and negative classes
        # Total word count under this label vs. over all labels; the log difference
        # equals log(word count of label k / total word count)
        tmp[k] = log(self.d[k].getsum()) - log(self.total)
        for word in x:
            # Accumulate the log frequency of each word under label k
            tmp[k] += log(self.d[k].freq(word))
    # Turn the log scores into probabilities and keep the best class
    ret, prob = 0, 0
    for k in self.d:
        now = 0
        try:
            for otherk in self.d:
                now += exp(tmp[otherk] - tmp[k])
            now = 1 / now
        except OverflowError:
            now = 0
        if now > prob:
            ret, prob = k, now
    return (ret, prob)
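
Note that classify works with sums of logarithms and only exponentiates the differences tmp[otherk] - tmp[k]. The reason is numerical: multiplying many small word probabilities underflows to zero, while their log sum stays in a usable range. A small illustration (the numbers are arbitrary):

from math import log, prod

probs = [1e-4] * 200                  # 200 word likelihoods of 1e-4 each
print(prod(probs))                    # 0.0 -- the direct product underflows
print(sum(log(p) for p in probs))     # about -1842.07 -- the log-domain score is still comparable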

2. The formula derivation behind the source

For two classes c1 and c2, with mutually independent features (words) w1, ⋯, wn, the basic naive Bayes expression for belonging to class c1 is:

P(c1∣w1,⋯,wn)=P(w1,⋯,wn∣c1)⋅P(c1) / P(w1,⋯,wn)
    
For sentence classification, this can be read as: given that the words w1, w2, ..., wn appear, the probability that the sentence is assigned to class c1.

where:

P(w1,⋯,wn)=P(w1,⋯,wn∣c1)⋅P(c1) + P(w1,⋯,wn∣c2)⋅P(c2)

The prediction step therefore evaluates:

\[ P(c1∣w1,⋯,wn)=\frac{P(w1,⋯,wn∣c1)⋅P(c1)}{P(w1,⋯,wn∣c1)⋅P(c1)+P(w1,⋯,wn∣c2)⋅P(c2)} \]

which can be transformed step by step:

\[ =\frac{1}{1+\frac{P(w1,⋯,wn∣c2)⋅P(c2)}{P(w1,⋯,wn∣c1)⋅P(c1)}} \]

\[ =\frac{1}{1+exp[log(\frac{P(w1,⋯,wn∣c2)⋅P(c2)}{P(w1,⋯,wn∣c1)⋅P(c1)})]} \]

\[ =\frac{1}{1+exp[log(P(w1,⋯,wn∣c2)⋅P(c2))−log(P(w1,⋯,wn∣c1)⋅P(c1))]} \]

The 1 in the denominator can itself be rewritten as:

\[ 1=exp[log(P(w1,⋯,wn∣c1)⋅P(c1))−log(P(w1,⋯,wn∣c1)⋅P(c1))] \]
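
As a quick sanity check of this rearrangement (with purely illustrative numbers): let p1 stand for P(w1,⋯,wn∣c1)⋅P(c1) and p2 for P(w1,⋯,wn∣c2)⋅P(c2); the direct ratio and the exp/log form must agree.

from math import log, exp

p1 = 0.03   # stands in for P(w1,...,wn|c1) * P(c1)
p2 = 0.01   # stands in for P(w1,...,wn|c2) * P(c2)
print(p1 / (p1 + p2))                    # ≈ 0.75
print(1 / (1 + exp(log(p2) - log(p1))))  # ≈ 0.75, the same value via the log/exp form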

3. The code in detail, combined with the formulas

According to the formulas above, for each of c1 and c2 we need to:

a. Compute the numerator

\[ P(w1,⋯,wn∣c1)⋅P(c1) \]

b. which in turn requires the class prior

\[ P(c1) \]

Mapping this onto the code:

p(Ck) = the probability of class Ck (estimated as its share of all words) = self.d[k].getsum() / self.total
p(w1|Ck) = the probability that the word w1 appears in class Ck = self.d[k].freq(word)
k = 1, 2

c. Then take the logarithm:

\[ log(P(w1,⋯,wn∣c1)⋅P(c1)) \]

This is the formula

\[ log(P(w1∣c1)⋯P(wn∣c1)⋅P(c1)) \]

which equals

\[ \sum_{i=1}^{n} log(P(wi∣c1)) + log(P(c1)) \]

Substituting the estimates the code uses (where k is the label corresponding to the class), this finally becomes

\[ \sum_{i=1}^{n} log(P(wi∣ck)) + log(self.d[k].getsum()) − log(self.total) \]

which is exactly tmp[k] in the code below: the first statement in the outer loop sets tmp[k] to log(P(ck)), and after the inner loop over the words tmp[k] corresponds to log(P(w1,⋯,wn∣ck)⋅P(ck)). The value left after both steps is the final tmp[k].

def classify(self, x):
    tmp = {}
    for k in self.d:  # the positive and negative classes
        tmp[k] = log(self.d[k].getsum()) - log(self.total)  # log(word count of label k) - log(total word count)
        for word in x:
            tmp[k] += log(self.d[k].freq(word))  # log of the word's frequency under label k
    ret, prob = 0, 0
    for k in self.d:
        now = 0
        try:
            for otherk in self.d:
                now += exp(tmp[otherk] - tmp[k])  # when otherk == k the exponent is 0 and exp(0) = 1: the 1 in the denominator above
            now = 1 / now
        except OverflowError:
            now = 0
        if now > prob:
            ret, prob = k, now
    return (ret, prob)

Original article: www.cnblogs.com/rossiXYZ/p/12148809.html