贝叶斯分类器

1.基础知识

概率论的基本知识

先验概率:由以往的数据得到的
后验概率:得到信息后再重新加以修正的概率

条件风险公式（期望损失）:

\[R(c_i\mid \textbf x)=\sum_{j=1}^N \lambda_{ij} P(c_j\mid \textbf x)\]

其中，基于后验概率$P(c_{i}\mid \textbf x)$ 可获得将样本x分类为$c_{i}$所产生的期望损失（expected loss）

$\lambda_{ij}$ 是将一个真实标记为$c_{j}$的样本误分类为$c_{i}$所产生的损失

对于每个样本 $\textbf x $ 选择能使后验概率$P(c\mid \textbf x)$最大的类别标记

基于贝叶斯定理，$P(c\mid \textbf x)$ 可以写成：

\[P(c\mid \textbf x)=\frac{P(\textbf x ,c)}{P(\textbf x)}=\frac{P(c)P(\textbf x\mid c)}{P(\textbf x)} \]

先对联合概率分布$P(\textbf x,c)$进行建模，再求后验概率

判别式模型＆生成式模型

判别式模型(discriminative models):

给定Ｘ，可以通过直接建模$P(c\mid \textbf x)$来预测c,简单而又直接的办法．例如：决策树，神经网络，支持向量机都是判别式模型的范畴．

生成式模型(generative models):

先对联合概率分布$P(\textbf x,c)$建模，然后再由此获得$P(c\mid \textbf x)$.典型的就是贝叶斯函数定理

后验概率的最大化

对于类先验概率(prior)，$P(c)$,是样本空间中各类样本所占的比例，根据大数定律，当样本足够充足且独立同分布时，可以用样本出现的频率来拟合概率．

类条件概率$P(\textbf x \mid c)$　可能会出现属性组合爆炸的情况，一般不能使用简单的频率估计．（对于简单的朴素贝叶斯分类器，是可以直接使用频率来表达概率）

注意区分 "未被观测到"和"出现概率为零"

极大似然估计

$D_c$表示训练集$D$中第$c$类样本组成的集合，假设这些样本独立同分布．则参数$\theta_c$对于数据集$D_c$的似然是:

\[P(D_c\mid \theta_{c})=\prod P(\textbf x \mid \theta_{c})\]

解决实际问题时，还需要考虑，连乘操作会导致数值的下溢　可以考虑使用对数似然方程$\frac{\mathrm{d} }{\mathrm{d}\theta}InL(\theta)=0$的方法:\[LL(\theta_{c})=\log P(D_c\mid \theta_{c})\]
\[=\sum_{x\in D_{c}} \log P(\textbf x \mid \theta_{c})\]

朴素贝叶斯分类器

属性条件独立性假设(attribute conditional independence assumption):

属性之间相互独立,基于这个假设,可以重新得到公式

\[P(c\mid \textbf x)=\frac{P(c)P(\textbf x\mid c)}{P(\textbf x)}=\frac{P(c)}{P(\textbf x)}\prod_{i=1}^d P(x_i\mid c)\]

其中d 是属性的数目,$x_i$是$\textbf x$在第i个属性上的取值

朴素分类器的训练过程就是基于训练数据集D来估计类先验概率$P(c)$,并为每个属性估计条件概率$P(x_i\mid c)$

半朴素贝叶斯分类器(semi-naive bayes classifiers)

朴素贝叶斯假设各个属性之间相互独立，但这一点在实际过程中很难实现，尝试对属性条件独立性假设进行一定程度的放松．

假设每个属性在类别之外最多依赖于一个其他属性．

贝叶斯网

借助有向无环图来实现属性之间的依赖关系．$B=\left \langle G,\Theta \right \rangle$

其中Ｇ是一个有向无环图

参数$\Theta$ 定量描述这种依赖关系，$\Theta$ 包含了每个属性的条件概率表 $\theta_{x_{i}\mid \pi_{i}}=P_{B}(x_{i}\mid \pi_{i})$

有向图转换为道德图参考：<机器学习>　周志华　清华大学出版社

最小描述长度情况下的评分函数：

\[s(B\mid D)=f(\theta)\mid B \mid -LL(B\mid D)\]

2.思想脉络

基于生成式模型，利用贝叶斯公式将求解进行转换．

训练数据集Ｄ，来估计类先验概率$P(c)$,并为每个属性估计出条件概率$P(x_{i}\mid c)$

思想脉络＝＝算法推导．主要还是算法的推导过程．

3.算法推导

训练过程

1.类先验概率$P(c)$:

$D_c$表示训练集D中第c类样本组成的集合

\[P(c)=\frac{\mid D_C\mid}{\mid D \mid}\]

2.类条件概率$P(x_i\mid c)$:

2.1对于离散属性而言:$D_{c,x_i}$ 表示$D_c$在i个属性上取值为$x_i$样本集合
则条件概率为:

\[P(x_i\mid c)=\frac{\mid D_{c,x_i}\mid}{\mid D_c\mid}\]

2.2对于连续属性考虑概率密度函数. 假定$p(x_i\mid c)\sim N(\mu_{c,i},\sigma_{c,i}^{2})$
考虑均值和方差,则条件概率为:

\[p(x_t\mid c)=\frac{1}{\sqrt{2\pi}\sigma_{c,i}}exp\left(-\frac{(x_i-\mu_{c,i})^{2}}{2\sigma_{c,i}^{2}} \right )\]

贝叶斯判定准则

基于最小化分类错误率的贝叶斯最优分类器为:\[h(x)^{*}=arg \max_{c\in y}P(c \mid \textbf x)\]

得到的最重要的贝叶斯判定准则:\[h_{nb}(\textbf x)=arg \max_{c\in y}P(c)\prod_{i=1}^{d}P(X_{i}\mid c)\]

上述训练过程需要注意的一点是: "未被观测到"和"出现概率为零"

若某个属性值咋训练集中没有与某个类同时出现过.基于上式进行计算,会让连乘式中出现概率值为零的情况.
防止出现连乘为０的现象．

一般进行拉普拉斯修正

平滑处理:

\[P(c)=\frac{\mid D_c\mid+1}{\mid D \mid+N},\]
\[P(x_i\mid c)=\frac{\mid D_{c,x_i}\mid+1}{\mid D_c\mid+N_i}.\]

实现拉普拉斯修正的朴素贝叶斯分类器的算术推导过程

朴素贝叶斯算法(naive bayes algorithm):

输入:训练数据集 T={$(x_1,y_1),(x_2,y_2),...,(x_N,y_N)$} 其中$x_i=(x_i ^{1},x_i ^{2},...,x_i ^{n}) ^{T}$,$x_i ^{j}$是i个样本的第j个特征,$x_i^{j}\subset\left \{ a_{j1},a_{j2},...,a_{js} \right \}$,$a_{jl}$是第j个特征可能的取的第l个值；
实例 x

输出:实例 x 的分类.

(1) 计算先验概率及条件概率

\[P(Y=c_k)=\frac{ \sum_{i=1}^{N} I(y_i=c_k)} {N},k=1,2,...K\]

\[p(X^{j}=a_{ji}\mid Y=c_k)=\frac{\sum_{i=1}^{N} I(x_i^{j}=a_{ji},y_i=c_k)}{\sum_{i=1}^{N} I(y_i=c_k)},j=1,2,...,n;l=1,2,...,S_j;k=1,2,...,K\]

(2)对于给定的实例$x=(x^{1},x^{2},...,x^{n})^{T}$,计算:
\[P(Y=c_k) \prod_{j=1}^{n} P(X^{j}=x^{j}\mid Y=c_k),k=1,2,...,K\]

(3)确定实例x的分类
\[y=arg max_{c_k}P(Y=c_k)\prod_{j=1}^{n}P(X^{j}=x^{j}\mid Y=c_k)\]

4.编程实现

上面只是算法推导的思路,然后考虑拉普拉斯修正进行编程实现

# input watermelon csv file and process easy data cleaning

import pandas as pd 
def in_put():
    """
    @ return df.data
    """
    with open("/home/dengshuo/GithubCode/ML/CH04/ID3watermelon3.csv") as f:
        df=pd.read_csv(f)
    return df

# too many thing need consider
# i need help

两个解决的方法,一步一步编程法；或者直接调用Sklearn库中的函数来进行
一步一步编程法: 需要强大的编程能力

调用Sklearn : 可能对数据的处理要求较高,对数据量较少的结果可能不是很好

1.按步骤编程进行代码实现:

一个总结,当要去实现一个特定函数功能时候,可先将输入假设为所需要的数据结构和类型,读输入数据的处理放到后面
将每个函数进行结合.

也可以先使用一个函数,然后再去定义(在知道函数功能的前提下)

# date&time : 2018.05.07
# @author   : dengshuo

# input  excel or csv data
# define class 

class LaplacianNB():
    """
    Laplacian naive bayes for binary classification problem.
    """
    def __init__(self):
        """
        no parameter init 
        this mean a pulic init or supet init variables.
        """
    def train(self,X,y):
        """
        Training laplacian naive bayes classifier with training set(X,y)
        Input:X,list of instances
              y,list of labels
        """
        N=len(y)
        self.class=self.count_list(y)
        self.class_num=len(self.class)
        # calc prior probability
        self.class_p={}
        for c,n in self.class.items():
            self.class_p[c]=float(n+1)/(N+self.class_num)
            
        #print(self.class_p)
        
        # calc conditional probability
        
        self.discrete_attr_good_p=[]
        self.diacrete_attr_bad_p=[]
        for i in range(6):
            attr_with_good=[]
            attr_with_bad=[]
            for j in range(N):
                if y[j]==1:    # it can replace with : if y[i]=="是"
                    attr_with_good.append(X[j][i])
                else:
                    attr_with_bad.append(X[j][i])
            unique_with_good=self.count_list(attr_with_good)
            unique_with_bad=self.count_list(attr_with_bad)
            self.discrete_attr_good_p.append(self.discrete_p(unique_with_good,self.class[1]))
            self.discrete_attr_bad_p.append(self.discrete_p(unique_with_bad,self.class[0]))
        
        
        # calc the continuous variable conditional probability
        
        self.good_mus=[]
        self.good_vars=[]
        self.bad_mus=[]
        self.bad_vars=[]
        for i in range(6,8):
            attr_with_good=[]
            attr_with_bad=[]
            for j in range(N):
                if y[j]==1:
                    attr_with_good.append(X[j][i])
                else:
                    attr_with_bad.append(X[j][i])
            
            good_mu,good_var=self.mu_var_of_list(attr_with_good)
            bad_mu,bad_var=self.mu_var_of_list(attr_with_bad)
            self.good_mus.append(good_mu)
            self.good_vars.append(good_var)
            self.bad_mus.append(bad_mu)
            self.bad_vars.append(bad_var)
            
    def predict(self,x):
        """
        x:the testset need calculate
        @return: return the label class 0 or 1
        """
        p_good=self.class_p[1]
        p_bad=self.class_p[0]
        for i in range(6):
            p_good*=self.discrete_attar_with_good_p[i][x[i]]
            p_bad*=self.discrete_attr_with_good_p[i][x[i]]
        for i in range(6,8):
            p_good*=self.continuous_p([x[i]],self.good_mus[i],self.good_vars[i])
            p_bad*=self.continuous_p([x[i]],self.bad_mus[i],self.bad_vars[i])
        if p_good>=p_bad:
            return p_good,p_bad,1
        else:
            return p_good,p_bad,0
            
        
        

    def count_list(self,List):
        """
        List : input data
        @return :  count unique elements in list with dictionary
        """
        unique_dict={}
        for i in set(List):
            unique_dict[i]=List.count(e)
        return unique_dict
            
    def discrete_p(self,d,N_class):
        """
        d:attribute of dictionary
        n_class: the number of label class 
        @return:the probaility of each feature with dictionary
        """
        new_d={}
        for a,n in d.items():
            new_d[a]=float(n+1)/(N_class+len(d))   # n_class: means good or bad feature (amount)=class[1]
        return new_d
    
    def mu_var_of_list(self,l):
        """
        l:list of feature
        @return: 
        """
        mu=sum(l)/float(len(l))
        var=0
        for i in range(len(l)):
            var+=(l[i]-mu)**2
        var=var/float(len(l))
        return mu,var 
    
    def continuous_p(self,x,mu,var):
        """
        x: the input testset 
        mu: the meanvalue
        var: the variance
        @return the testset probaility
        """
        import math
        p=1.0/(math.sqrt(2*math.pi)*math.sqrt(var)*math.exp(-(x-mu)**2/(2*var)))
        return p

import xlrd

# this package often use it can read excel files

if __name__=="__main__":
    lnb=laplacianNB()
    
    # read excel dataset ,of course can use pandas read csv_file
    # pandas can read excel files to.
    workbook=xlrd.open_workbook('the dataset.xlsx')
    sheet=workbook.sheet_by_name("Sheet1")
    X=[]
    for i in range(17):
        x=sheet.col_values(i)
        for j in range(6):
            x[j]=int(x[j])
        x.pop()
        X.append(x)
    y=sheet.row_values(8)
    y=[int(i) for i in y]
    lnb.trian(X,y)
    label=lnb.predict([1,1,1,1,1,1,0.697,0.460])
    print("predict result {}".format(label))

调用Sklearn库来实现

数据的读取,数据的处理,(数据的可视化处理),库函数的调用

最重要,可能也是最难的就是:如何将数据处理成函数可以直接使用的类型(先不管数据的输入类型,定义函数) 数据的预处理

# read csv_file dataset
import pandas as pd 
def in_put():
    """
    @ return df.data
    """
    with open("/home/dengshuo/GithubCode/ML/CH04/ID3watermelon3.csv") as f:
        df=pd.read_csv(f)
    return df

# not fully comprehension how to cleaning df_dataset
# data,expecially cleaning df.data
# import library 
from sklearn.naive_bayes import GaussianNB
# assumed have X(predictor) and y(target) for training data 
clf=GaussianNB()
clf.fit(X,y)

print(clf.predict([[1,1,1,1,1,1,0.697,0.460]]))

# python test console

import pandas as pd
import numpy as np 
with open('/home/dengshuo/GithubCode/ML/CH06/watermelon_3a.csv') as f:
      df=pd.read_csv(f,header=None,)
with open('/home/dengshuo/GithubCode/ML/CH06/watermelon_test.csv') as p:
      df_test=pd.read_csv(p,header=None,)
#df=df.rename(columns={'编号':'numbers'})
df.columns=['id','density','sugar_index','label']
df_test.columns=['id','density','sugar_index','label']
df.set_index(['id'])
# print(df)
X=df[['density','sugar_index']].values
x_test=df_test[['density','sugar_index']].values
# this is a np.array 
y=df[['label']].values
# extract the values

# import scikit learn library
from sklearn.naive_bayes import GaussianNB 
gnb=GaussianNB()
gnb.fit(X,y)

predict=gnb.predict(x_test)
print(predict)

[1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 0]


/home/dengshuo/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py:578: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

结果描述：数据太少,导致正确率不是很高　需要大量数据的拟合．

总结

推导．为什么可以用条件概率来表达后验概率的值？重要公式的推导

对于朴素贝叶斯函数计算的整个流程的掌握

EM算法的学习

贝叶斯网的深度理解