Learn Naive Bayes Together

 Zhai Cunqi  360 Cloud Computing


Editor's Note

Recently, the editor has also started learning some machine learning, so I began with Naive Bayes and compiled the relevant material here for everyone to learn from.

PS: Rich front-line technology in diverse forms of presentation, all in "HULK Front-line Technology Talk". Please follow us!

Introduction

The Naive Bayes method is a classification method based on Bayes' theorem and the assumption of conditional independence between features. For a given training data set, it first learns the joint probability distribution of the input and output under the feature conditional independence assumption; then, based on this model, for a given input x it uses Bayes' theorem to find the output y with the largest posterior probability. The Naive Bayes method is simple to implement, and both learning and prediction are very efficient, which is why it is so commonly used.

About Thomas Bayes

Thomas Bayes was a British mathematician. Born in London in 1701, he was a clergyman, and in 1742 he became a Fellow of the Royal Society. He died on April 7, 1763. Bayes mainly studied probability. He was the first to apply inductive reasoning to the foundations of probability theory and created Bayesian statistical theory, contributing to statistical decision functions, statistical inference, and statistical estimation. His work on this topic was published in 1763 and plays a very important role in modern probability theory and mathematical statistics. Bayes's other book, "An Introduction to the Doctrine of Chances", was published in 1758, and many of the terms he used are still in use today. His main contribution to statistical inference was the concept of "inverse probability", which he put forward as a universal method of reasoning. Bayes' theorem was originally a theorem in probability theory; it can be expressed as a mathematical formula, the famous Bayes formula. - Excerpted from 360 Baike

Algorithm principle

  • Conditional probability formula

  • Total probability formula

  • Feature conditional independence assumption

1

Conditional probability formula

Conditional probability is the probability that event A occurs given that another event B has already occurred. It is written P(A|B) and read as "the probability of A given B". If there are only two events A and B, then:

P(A|B) = P(AB)/P(B)

P(B|A) = P(AB)/P(A)

and so:

P(A|B) = P(B|A) * P(A) / P(B)
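As a quick sanity check of this rearrangement, here is a tiny Python sketch with made-up numbers (the probabilities below are purely illustrative):

# Made-up probabilities for two events A and B (illustrative only)
p_ab = 0.12   # P(AB), probability that A and B both occur
p_a = 0.30    # P(A)
p_b = 0.40    # P(B)

p_a_given_b = p_ab / p_b   # P(A|B) = P(AB)/P(B)  -> 0.3
p_b_given_a = p_ab / p_a   # P(B|A) = P(AB)/P(A)  -> 0.4

# Rearranged form: P(A|B) = P(B|A) * P(A) / P(B)
assert abs(p_a_given_b - p_b_given_a * p_a / p_b) < 1e-12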

2

Total probability formula

If the events A1, A2, A3, ···, An form a complete event group, that is, they are mutually exclusive and their union is the whole sample space, and each P(Ai) > 0, then for any event B:

P(B) = P(A1B) + P(A2B) + ··· + P(AnB)
     = ∑P(AiB)
     = ∑P(B|Ai) * P(Ai)        (i = 1, 2, ···, n)
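To make the formula concrete, here is a small sketch that computes P(B) for a made-up three-event partition (the numbers are purely illustrative):

# Made-up partition A1, A2, A3 (illustrative only); the P(Ai) sum to 1
p_a = [0.5, 0.3, 0.2]          # P(A1), P(A2), P(A3)
p_b_given_a = [0.1, 0.4, 0.8]  # P(B|A1), P(B|A2), P(B|A3)

# Total probability: P(B) = sum over i of P(B|Ai) * P(Ai)
p_b = sum(pb * pa for pb, pa in zip(p_b_given_a, p_a))
print(p_b)  # 0.5*0.1 + 0.3*0.4 + 0.2*0.8 = 0.33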

3

Bayesian formula

Substituting the total probability formula into the conditional probability formula gives, for event Ak and event B:

P(Ak|B) = [P(Ak) * P(B|Ak)] / ∑[P(B|Ai) * P(Ai)]        (i = 1, 2, ···, n)

For P(Ak|B), the denominator ∑P(B|Ai) * P(Ai) is the same fixed value for every k. Since we only need to compare the sizes of the P(Ak|B), this fixed denominator can be dropped without affecting the result. This gives the following formula:

P(Ak|B) = P(Ak) * P(B|Ak)

Here P(Ak) is the prior probability, P(Ak|B) is the posterior probability, and P(B|Ak) is the likelihood.
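Reusing the same made-up partition, the sketch below computes the full posteriors and also shows that dropping the shared denominator P(B) does not change which Ak comes out largest:

# Made-up priors and likelihoods (illustrative only)
p_a = [0.5, 0.3, 0.2]          # priors P(Ak)
p_b_given_a = [0.1, 0.4, 0.8]  # likelihoods P(B|Ak)

p_b = sum(pb * pa for pb, pa in zip(p_b_given_a, p_a))   # evidence P(B) = 0.33
scores = [pa * pb for pa, pb in zip(p_a, p_b_given_a)]   # P(Ak) * P(B|Ak) = [0.05, 0.12, 0.16]
posteriors = [s / p_b for s in scores]                    # P(Ak|B), sums to 1

# The largest unnormalized score and the largest posterior pick the same Ak
assert scores.index(max(scores)) == posteriors.index(max(posteriors))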

4

Feature conditional independence assumption

In classification problems, we often need to assign a sample to one of several classes. A sample has many attributes, x = (x1, x2, ···, xn), and there are usually several classes, y = (y1, y2, ···, yk). P(y1|x), P(y2|x), ···, P(yk|x) denote the probability that x belongs to each class, and we want to find the class with the largest probability P(yk|x).

According to the formula obtained in the previous step: P(yk|x) = P(yk) * P(x|yk) 

The sample x has n attributes, x=(x1,x2,···,xn), so: P(yk|x) = P(yk) * P(x1,x2,···,xn|yk)

The conditional independence assumption means that, given the class, the attributes do not affect one another, so: P(x1,x2,···,xn|yk) = ∏P(xi|yk). The final formula is: P(yk|x) = P(yk) * ∏P(xi|yk)

Using the formula P(yk|x) = P(yk) * ∏P(xi|yk), we can solve the classification problem: compute this value for every class and pick the class with the largest one, as in the sketch below.
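Here is a minimal sketch of this decision rule in Python. The priors and per-feature likelihoods below are made-up values, assumed to have been estimated from data already; only the argmax logic is the point:

# Made-up estimates for two classes and three binary features (illustrative only)
priors = {'y1': 0.6, 'y2': 0.4}      # P(yk)
likelihoods = {                      # P(xi = 1 | yk) for each of the three features
    'y1': [0.8, 0.1, 0.5],
    'y2': [0.3, 0.7, 0.6],
}

def classify(x):
    # P(yk|x) = P(yk) * product of P(xi|yk); return the class with the largest value
    best_class, best_score = None, -1.0
    for yk, prior in priors.items():
        score = prior
        for p, xi in zip(likelihoods[yk], x):
            score *= p if xi == 1 else (1 - p)
        if score > best_score:
            best_class, best_score = yk, score
    return best_class

print(classify([1, 0, 1]))  # -> 'y1' (0.6*0.8*0.9*0.5 = 0.216 vs 0.4*0.3*0.3*0.6 = 0.0216)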

Laplacian smoothing

The reason for introducing this correction: the formula P(yk|x) = P(yk) * ∏P(xi|yk) is a product of many factors, and if any single factor is 0 the whole product becomes 0, which is clearly unreasonable. To keep any factor from being exactly zero, a constant is added to both the numerator and the denominator:

P(y) = (|Dy| + 1) / (|D| + N)

Parameters: |Dy| is the number of samples in class y, |D| is the total number of samples, and N is the number of classes.

P(xi|Dy) = (|Dy,xi| + 1) / (|Dy| + Ni)

Parameters: |Dy,xi| is the number of samples in class y whose attribute i takes the value xi, |Dy| is the number of samples in class y, and Ni is the number of possible values of attribute i.
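A small sketch of these two smoothed estimates, computed directly from counts (the counts are made up for illustration):

# Made-up counts (illustrative only)
D = 10        # |D|, total number of samples
Dy = 4        # |Dy|, number of samples in class y
N = 2         # number of classes
Dy_xi = 0     # |Dy,xi|, samples in class y whose attribute i equals xi
Ni = 3        # number of possible values of attribute i

p_y = (Dy + 1) / (D + N)                 # (4 + 1) / (10 + 2) ≈ 0.417
p_xi_given_y = (Dy_xi + 1) / (Dy + Ni)   # (0 + 1) / (4 + 3) ≈ 0.143, never exactly zero
print(p_y, p_xi_given_y)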

Text classification

Implementing mail classification by hand


First, tokenize all of the labeled emails, producing a word vector for each email and a vocabulary vector for the whole corpus.

From these email vectors we can estimate the probability that each word wi appears in normal mail, P(wi|Normal), and in spam, P(wi|Spam).

Probability of spam: P(Spam)

Probability of normal mail: P(Normal)

The probability that a given mail is spam:

P(Spam|mail) = P(Spam) * ∏P(wi|Spam)

The probability that a given mail is normal:

P(Normal|mail) = P(Normal) * ∏P(wi|Normal)

Finally, simply compare P(Spam|mail) with P(Normal|mail) and take the larger of the two.
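Putting these steps together, here is a minimal hand-rolled sketch. The tiny training set and the whitespace tokenizer are made-up assumptions, and the sketch works in log space (summing log probabilities instead of multiplying them) to avoid numeric underflow; it also applies the Laplace smoothing from the previous section:

import math
from collections import Counter

# Tiny made-up training set (illustrative only): (text, label)
train = [
    ("win money now", "Spam"),
    ("cheap pills win", "Spam"),
    ("meeting schedule tomorrow", "Normal"),
    ("project meeting notes", "Normal"),
]

vocab = set(w for text, _ in train for w in text.split())
word_counts = {"Spam": Counter(), "Normal": Counter()}
doc_counts = Counter()
for text, label in train:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

def classify(mail):
    best_label, best_score = None, float("-inf")
    for label in ("Spam", "Normal"):
        # log P(label) + sum of log P(wi|label), with Laplace smoothing
        score = math.log(doc_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for w in mail.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("win cheap money"))   # -> Spam
print(classify("meeting tomorrow"))  # -> Normal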

Text classification with sklearn

# Text classification with sklearn
import os
import random
from numpy import arange
from sklearn.pipeline import Pipeline
# TfidfVectorizer: extracts text features (based on term frequency and importance in the corpus)
# HashingVectorizer: feature hashing for text
# CountVectorizer: converts text into a vector of per-word counts

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
import matplotlib.pyplot as plt

# Load the sample set
def get_dataset():
    data = []
    for root, dirs, files in os.walk(r'./mix20_rand700_tokens_cleaned/tokens/neg'):
        for file in files:
            realpath = os.path.join(root, file)
            with open(realpath, errors='ignore') as f:
                data.append((f.read(), 'bad'))
    for root, dirs, files in os.walk(r'./mix20_rand700_tokens_cleaned/tokens/pos'):
        for file in files:
            realpath = os.path.join(root, file)
            with open(realpath, errors='ignore') as f:
                data.append((f.read(), 'good'))
    random.shuffle(data)
    return data
   
# Split the data into training and test sets
def train_and_test_data(data_):
    # train : test = 7 : 3
    filesize = int(0.7 * len(data_))
    # training set
    train_data_ = [each[0] for each in data_[:filesize]]
    train_target_ = [each[1] for each in data_[:filesize]]
    # test set
    test_data_ = [each[0] for each in data_[filesize:]]
    test_target_ = [each[1] for each in data_[filesize:]]
    return train_data_, train_target_, test_data_, test_target_
   
""" 多项式模型: 在多项式模型中, 设某文档d=(t1,t2,…,tk),tk是该文档中出现过的单词,允许重复,则 先验概率P(c)= 类c下单词总数/整个训练样本的单词总数 类条件概率P(tk|c)=(类c下单词tk在各个文档中出现过的次数之和+1)/(类c下单词总数+|V|) V是训练样本的单词表(即抽取单词,单词出现多次,只算一个),|V|则表示训练样本包含多少种单词。 P(tk|c)可以看作是单词tk在证明d属于类c上提供了多大的证据,而P(c)则可以认为是类别c在整体上占多大比例(有多大可能性)。 """
def mnb(train_da, train_tar, test_da, test_tar):
    nbc = Pipeline([
        ('vect', TfidfVectorizer()),
        ('clf', MultinomialNB(alpha=1.0)),
    ])
    nbc.fit(train_da, train_tar)      # train the multinomial naive Bayes classifier
    predict = nbc.predict(test_da)    # predict on the test set
    count = 0                         # number of correct predictions
    for left, right in zip(predict, test_tar):
        if left == right:
            count += 1
    # print("Multinomial model:", count / len(test_tar))
    return count / len(test_tar)

""" 伯努利模型: P(c)= 类c下文件总数/整个训练样本的文件总数 P(tk|c)=(类c下包含单词tk的文件数+1)/(类c下单词总数+2) """
def bnb(train_da, train_tar, test_da, test_tar):
    nbc_1 = Pipeline([
        ('vect', TfidfVectorizer()),
        ('clf', BernoulliNB(alpha=1.0)),
    ])
    nbc_1.fit(train_da, train_tar)    # train the Bernoulli naive Bayes classifier
    predict = nbc_1.predict(test_da)  # predict on the test set
    count = 0                         # number of correct predictions
    for left, right in zip(predict, test_tar):
        if left == right:
            count += 1
    # print("Bernoulli model:", count / len(test_tar))
    return count / len(test_tar)

# Train and evaluate ten times
x = arange(10)
y1 = []
y2 = []
for i in x:
    print(i)
    data = get_dataset()
    train_data, train_target, test_data, test_target = train_and_test_data(data)
    y1.append(mnb(train_data, train_target, test_data, test_target))
    y2.append(bnb(train_data, train_target, test_data, test_target))
print(x)
print(y1)
print(y2)
plt.plot(x, y1, lw=2, label='MultinomialNB')
plt.plot(x, y2, lw=2, label='BernoulliNB')
plt.legend(loc="upper right")
plt.ylim(0, 1)
plt.grid(True)
plt.show()

Comparison of the sklearn results

[Figure: accuracy of MultinomialNB vs. BernoulliNB over the ten runs]

Summary

Scikit-learn, often abbreviated as sklearn, is one of the best-known Python modules in machine learning. Sklearn unifies the different machine learning models behind one consistent pattern: once you have learned the pattern for one model, you can apply it to the other types of models as well.


Origin blog.51cto.com/15127564/2667407