Model Introduction

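$\quad$ A naive Bayes classifier considers, for each feature dimension separately, the conditional probability of that feature given the class, and then combines these probabilities to predict the class of the whole feature vector. The model's basic mathematical assumption is therefore that the features are conditionally independent of one another given the class.
$\quad$ In probabilistic terms, let $x=\langle x_1,x_2,\cdots,x_n\rangle$ be an $n$-dimensional feature vector, let $y\in\{c_1,c_2,\cdots,c_k\}$ range over the $k$ possible classes of $x$, and write $P(y=c_i|x)$ for the probability that $x$ belongs to class $c_i$. By Bayes' theorem:

$P(y|x)=\frac{P(x|y)P(y)}{P(x)}$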

$\quad$ Our goal is to find, among all $y\in\{c_1,c_2,\cdots,c_k\}$, the class for which $P(y|x)$ is largest, that is, $\mathop{\mathrm{argmax}}\limits_{y}P(y|x)$. Because $P(x)$ is the same for every candidate class of a given sample, it can be ignored. Therefore,

$\mathop{\mathrm{argmax}}\limits_{y}P(y|x)=\mathop{\mathrm{argmax}}\limits_{y}P(x|y)P(y)=\mathop{\mathrm{argmax}}\limits_{y}P(x_1,x_2,x_3,\cdots,x_n|y)P(y)$
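As a small numerical check of the step above, the sketch below uses made-up priors and likelihoods for three hypothetical classes. The numbers and the names `prior` and `likelihood` are illustrative assumptions, not values taken from any data set. It shows that dividing by $P(x)$ rescales every class score by the same constant, so the maximizing class does not change.

```python
# Hypothetical prior P(y) and likelihood P(x|y) for one fixed sample x.
prior = {"c1": 0.5, "c2": 0.3, "c3": 0.2}
likelihood = {"c1": 0.02, "c2": 0.10, "c3": 0.05}

# Unnormalized scores P(x|y) P(y).
joint = {c: likelihood[c] * prior[c] for c in prior}

# P(x) is the sum of the joint scores over all classes (law of total probability).
evidence = sum(joint.values())

# Normalized posterior P(y|x).
posterior = {c: joint[c] / evidence for c in joint}

best_unnormalized = max(joint, key=joint.get)
best_posterior = max(posterior, key=posterior.get)
print(best_unnormalized, best_posterior)
```

Both selections agree, which is why the classifier only needs $P(x|y)P(y)$.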

$\quad$ If every feature takes only the values 0 or 1, then without any further assumption, computing $P(x_1,x_2,x_3,\cdots,x_n|y)P(y)$ requires estimating $k\cdot 2^n$ possible parameters, because the chain rule gives a factorization in which each factor conditions on all of the preceding features:

$P(x_1,x_2,\cdots,x_n|y)=P(x_1|y)P(x_2|x_1,y)P(x_3|x_1,x_2,y)\cdots P(x_n|x_1,x_2,\cdots,x_{n-1},y)$

$\quad$ Under the naive Bayes assumption that the features are conditionally independent given the class, however, $P(x_n|x_1,x_2,\cdots,x_{n-1},y)=P(x_n|y)$. If each feature still has only two possible values, only $2kn$ parameters need to be estimated, namely $P(x_1=0|y=c_1), P(x_1=1|y=c_1),\cdots,P(x_n=1|y=c_k)$.
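To make the difference in scale concrete, the following sketch compares the two parameter counts for binary features. The function names are hypothetical, and the values $k=20$, $n=30$ are chosen only for illustration.

```python
def full_joint_param_count(k: int, n: int) -> int:
    """Table entries for P(x_1..x_n | y) with n binary features and k classes."""
    return k * 2 ** n


def naive_bayes_param_count(k: int, n: int) -> int:
    """Entries P(x_j = v | y = c) for v in {0, 1}, j in 1..n, and k classes."""
    return 2 * k * n


k, n = 20, 30
print(full_joint_param_count(k, n))   # 21474836480
print(naive_bayes_param_count(k, n))  # 1200
```

Even with only 30 binary features, the unrestricted model needs billions of entries, while the naive Bayes model needs about a thousand.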

$\quad$ To estimate each of these parameters, we use the following formula and approximate the probabilities by frequency ratios, where $\#(\cdot)$ denotes the number of training samples that satisfy the condition:

$P(x_n=1|y=c_k)=\frac{P(x_n=1,y=c_k)}{P(y=c_k)}\approx\frac{\#(x_n=1,y=c_k)}{\#(y=c_k)}$
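The frequency-ratio estimate can be sketched in a few lines of plain Python. The toy data set and the helper name `estimate_conditionals` are assumptions made for illustration. The sketch applies no smoothing, so it follows the formula above literally and would assign probability 0 to a feature value never observed together with a class.

```python
from collections import Counter, defaultdict

# Toy training set: each row is a binary feature vector with its class label.
X = [(1, 0), (1, 1), (0, 1), (0, 0), (1, 1), (0, 1)]
y = ["spam", "spam", "ham", "ham", "spam", "ham"]


def estimate_conditionals(X, y):
    """Return {(j, c): P(x_j = 1 | y = c)} estimated as #(x_j=1, y=c) / #(y=c)."""
    class_counts = Counter(y)
    ones = defaultdict(int)  # (feature index, class) -> count of x_j == 1
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            ones[(j, label)] += value
    n_features = len(X[0])
    return {
        (j, c): ones[(j, c)] / class_counts[c]
        for c in class_counts
        for j in range(n_features)
    }


params = estimate_conditionals(X, y)
for key in sorted(params):
    print(key, params[key])
```

For binary features, $P(x_j=0|y=c)$ is simply one minus the estimated $P(x_j=1|y=c)$, so only one of the two values needs to be counted.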

Data Description

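Naive Bayes models are widely used in practice, especially in text classification tasks such as categorizing online news and filtering spam email.
We use the classic 20 Newsgroups text collection as our experimental data.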

Reading the 20 Newsgroups text data and inspecting its details

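## Read the 20 Newsgroups text data and inspect its details
# Import the news data fetcher fetch_20newsgroups
from sklearn.datasets import fetch_20newsgroups
# Unlike the datasets bundled with scikit-learn, fetch_20newsgroups downloads the data from the internet on first use
news = fetch_20newsgroups(subset='all')
# Check the size of the data and look at the first sample
print(len(news.data))
print(news.data[0])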

Output:

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)
18846
From: Mamatha Devineni Ratnam [email protected]
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu

I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers’ relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game. PENS RULE!!!

Process finished with exit code 0

Data loading, splitting, prediction, and evaluation

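## Read the 20 Newsgroups text data
# Import the news data fetcher fetch_20newsgroups
from sklearn.datasets import fetch_20newsgroups
# Unlike the datasets bundled with scikit-learn, fetch_20newsgroups downloads the data from the internet on first use
news = fetch_20newsgroups(subset='all')
# Check the size of the data and look at the first sample
# print(len(news.data))
# print(news.data[0])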

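## Split the 20 Newsgroups text data
from sklearn.model_selection import train_test_split
# Hold out 25% of the samples for testing; random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25, random_state=33)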

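# Use a naive Bayes classifier to predict the category of each news text
# Import the text-to-feature-vector converter from sklearn.feature_extraction.text
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
# Learn the vocabulary from the training texts only, then reuse it for the test texts
X_train = vec.fit_transform(X_train)
X_test = vec.transform(X_test)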

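# Import the naive Bayes model from sklearn.naive_bayes
from sklearn.naive_bayes import MultinomialNB
# Initialize the naive Bayes model with its default configuration
mnb = MultinomialNB()
# Estimate the model parameters from the training data
mnb.fit(X_train, y_train)
# Predict the class of each test sample and store the result in y_predict
y_predict = mnb.predict(X_test)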

# Evaluate the performance of the naive Bayes classifier on the news text data
from sklearn.metrics import classification_report
print('The accuracy of Naive Bayes Classifier is', mnb.score(X_test, y_test))
print(classification_report(y_test, y_predict, target_names=news.target_names))

Output:

The accuracy of Naive Bayes Classifier is 0.8397707979626485
                          precision    recall  f1-score   support

             alt.atheism       0.86      0.86      0.86       201
           comp.graphics       0.59      0.86      0.70       250
 comp.os.ms-windows.misc       0.89      0.10      0.17       248
comp.sys.ibm.pc.hardware       0.60      0.88      0.72       240
   comp.sys.mac.hardware       0.93      0.78      0.85       242
          comp.windows.x       0.82      0.84      0.83       263
            misc.forsale       0.91      0.70      0.79       257
               rec.autos       0.89      0.89      0.89       238
         rec.motorcycles       0.98      0.92      0.95       276
      rec.sport.baseball       0.98      0.91      0.95       251
        rec.sport.hockey       0.93      0.99      0.96       233
               sci.crypt       0.86      0.98      0.91       238
         sci.electronics       0.85      0.88      0.86       249
                 sci.med       0.92      0.94      0.93       245
               sci.space       0.89      0.96      0.92       221
  soc.religion.christian       0.78      0.96      0.86       232
      talk.politics.guns       0.88      0.96      0.92       251
   talk.politics.mideast       0.90      0.98      0.94       231
      talk.politics.misc       0.79      0.89      0.84       188
      talk.religion.misc       0.93      0.44      0.60       158

             avg / total       0.86      0.84      0.82      4712


Process finished with exit code 0
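
The `CountVectorizer` step in the script above turns each document into a vector of word counts over a vocabulary learned from the training texts. The sketch below imitates that idea in plain Python on a two-sentence toy corpus. The helpers `tokenize`, `build_vocabulary`, and `to_counts` are hypothetical, and the tokenization rule (lowercase, runs of two or more word characters) is a simplification for illustration, not scikit-learn's actual implementation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def build_vocabulary(documents):
    """Map each distinct token in the training documents to a column index."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def to_counts(document, vocabulary):
    """Count vocabulary words in one document; unseen words are ignored."""
    counts = Counter(tokenize(document))
    return [counts.get(tok, 0) for tok in vocabulary]


train_docs = ["Pens beat the Devils", "the Devils lose the game"]
vocab = build_vocabulary(train_docs)
print(vocab)
print(to_counts("the Pens beat the Islanders", vocab))
```

The word "Islanders" does not occur in the training documents, so it contributes nothing to the test vector. This mirrors why the script calls `fit_transform` on the training texts but only `transform` on the test texts.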

Performance Analysis

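Naive Bayes models are widely used for classifying massive volumes of internet text. Because of the strong conditional independence assumption, the number of parameters that must be estimated drops from exponential to linear in the number of features, which greatly reduces memory consumption and computation time. The same strong assumption is also a limitation: training cannot take the relationships between features into account, so the model performs poorly on classification tasks whose features are strongly correlated.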
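
A tiny example illustrates the limitation. In the sketch below, the label is the XOR of two binary features, so the class depends entirely on the interaction between them. The data set and function names are hypothetical, and the classifier is a minimal frequency-ratio naive Bayes written for this illustration, not scikit-learn's `MultinomialNB`.

```python
from collections import Counter

# XOR data: the label is 1 exactly when the two binary features differ.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [a ^ b for a, b in X]


def fit(X, y):
    """Estimate P(y = c) and P(x_j = 1 | y = c) by frequency ratios."""
    class_counts = Counter(y)
    prior = {c: class_counts[c] / len(y) for c in class_counts}
    p_one = {
        c: [
            sum(row[j] for row, label in zip(X, y) if label == c) / class_counts[c]
            for j in range(len(X[0]))
        ]
        for c in class_counts
    }
    return prior, p_one


def predict(row, prior, p_one):
    """Return the class maximizing P(y) * prod_j P(x_j | y)."""
    def score(c):
        s = prior[c]
        for j, value in enumerate(row):
            s *= p_one[c][j] if value == 1 else 1.0 - p_one[c][j]
        return s
    return max(sorted(prior), key=score)


prior, p_one = fit(X, y)
predictions = [predict(row, prior, p_one) for row in X]
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print(p_one)     # every per-feature conditional equals 0.5
print(accuracy)
```

Each feature on its own carries no information about the label, so the two class scores tie on every input and the classifier does no better than always predicting a single class.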
