Gaussian Discriminant Analysis (GDA), with Python code

Basic mathematical background

Multivariate normal distribution

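  The multivariate normal distribution extends the normal distribution to multi-dimensional variables. Its parameters are a mean vector $\mu \in \mathbb{R}^{n}$ and a covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$, where $n$ is the length of the variable vector and $\Sigma$ is a symmetric positive definite matrix. The probability density function of the multivariate normal distribution is: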

$$P(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$
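As a quick numerical check of the density above, here is a minimal NumPy sketch. The helper name `mvn_pdf` is chosen for illustration and does not appear in the original post; it assumes that `sigma` is symmetric positive definite.

```python
import numpy as np

def mvn_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x (illustrative helper, not from the original post)."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    n = mu.shape[0]
    diff = x - mu
    # Solve sigma * z = diff instead of forming the explicit inverse
    maha = diff @ np.linalg.solve(sigma, diff)
    norm_const = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * maha) / norm_const

# Standard 2-D normal evaluated at its mean: the density equals 1 / (2 * pi)
p = mvn_pdf([0.0, 0.0], [0.0, 0.0], np.eye(2))
print(p)
```

For high-dimensional data the determinant and the exponential can underflow or overflow, so working with the logarithm of the density is usually safer.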

Covariance matrix

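  The feature vector of a sample usually has several attributes, and we want to analyze the linear relationships between them. Covariance and the correlation coefficient measure the linear relationship between random variables; because the true distribution is unknown, they can only be estimated from samples.
  Suppose the sample set is written as a matrix $X = [x_1, x_2, x_3, \dots, x_m]^T$, where $m$ is the number of samples and $x_i$ is the feature vector of the $i$-th sample. The sample covariance matrix is computed as follows: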

$$\Sigma = \frac{1}{m-1} \sum_{j=1}^{m} (x_j - \mu)(x_j - \mu)^T \tag{1}$$

where $\mu = \frac{1}{m} \sum_{i=1}^{m} x_i$ is the sample mean.
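Formula (1) can be checked against NumPy's built-in estimator. The sketch below is illustrative only: `sample_covariance` is a hypothetical helper name and the four data points are made up.

```python
import numpy as np

def sample_covariance(X):
    """Sample covariance of the rows of X (shape m x n), dividing by m - 1 as in formula (1)."""
    X = np.asarray(X, dtype=float)
    m = X.shape[0]
    mu = X.mean(axis=0)          # sample mean vector
    centered = X - mu            # subtract the mean from every sample
    return centered.T @ centered / (m - 1)

# Made-up data: 4 samples with 2 features each
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 7.0], [4.0, 7.0]])
S = sample_covariance(X)
# np.cov treats rows as variables by default, so pass rowvar=False
print(np.allclose(S, np.cov(X, rowvar=False)))
```

Note that `np.cov` also divides by $m - 1$ by default, which is why the two results agree.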

The GDA modeling process

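  The basic assumption of Gaussian discriminant analysis is that the target value $y$ follows a Bernoulli distribution and that the class-conditional distribution $P(x \mid y)$ is a multivariate normal distribution. Their probability distributions are therefore: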

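$$P(y) = \varphi^{y} (1 - \varphi)^{1 - y} \tag{2}$$

$$P(x \mid y = 0) = \frac{1}{(2\pi)^{n/2} |\Sigma_0|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_0)^T \Sigma_0^{-1} (x-\mu_0)\right) \tag{3}$$

$$P(x \mid y = 1) = \frac{1}{(2\pi)^{n/2} |\Sigma_1|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1)\right) \tag{4}$$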

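The log-likelihood of the data set is then as follows. Here, as in Andrew Ng's course, both classes are taken to share one covariance matrix, $\Sigma_0 = \Sigma_1 = \Sigma$:

$$L(\varphi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{m} P(x_i, y_i; \varphi, \mu_0, \mu_1, \Sigma) \tag{5}$$

$$= \log \prod_{i=1}^{m} P(x_i \mid y_i; \mu_0, \mu_1, \Sigma)\, P(y_i; \varphi) \tag{6}$$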
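To make the maximization step easier to follow, the log-likelihood can be expanded term by term. This is a sketch under the shared-covariance assumption $\Sigma_0 = \Sigma_1 = \Sigma$ implied by the argument list of $L$; the symbol $\mu_{y_i}$ denotes $\mu_0$ or $\mu_1$ depending on the label of sample $i$.

$$L = \sum_{i=1}^{m} \left[ y_i \log \varphi + (1 - y_i) \log(1 - \varphi) - \frac{n}{2} \log(2\pi) - \frac{1}{2} \log|\Sigma| - \frac{1}{2} (x_i - \mu_{y_i})^T \Sigma^{-1} (x_i - \mu_{y_i}) \right]$$

For instance, setting $\frac{\partial L}{\partial \varphi} = \sum_{i=1}^{m} \left( \frac{y_i}{\varphi} - \frac{1 - y_i}{1 - \varphi} \right) = 0$ gives $\varphi = \frac{1}{m} \sum_{i=1}^{m} y_i$, the fraction of positive samples.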

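Maximizing the log-likelihood gives the maximum likelihood estimates of the parameters:

$$\varphi = \frac{1}{m} \sum_{i=1}^{m} I\{y_i = 1\} \tag{7}$$

$$\mu_0 = \frac{\sum_{i=1}^{m} I\{y_i = 0\}\, x_i}{\sum_{i=1}^{m} I\{y_i = 0\}} \tag{8}$$

$$\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{y_i})(x_i - \mu_{y_i})^T \tag{9}$$

The estimate of $\mu_1$ has the same form as (8), with $I\{y_i = 1\}$ in place of $I\{y_i = 0\}$.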
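The estimates above translate almost line by line into NumPy. The following is a minimal sketch for the shared-covariance case, not the author's implementation: the function name `fit_gda_shared` and the four toy points are assumptions made for illustration.

```python
import numpy as np

def fit_gda_shared(X, y):
    """Maximum likelihood estimates for GDA with one shared covariance (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).astype(int)
    m = X.shape[0]
    phi = np.mean(y == 1)                  # estimate (7): fraction of positive samples
    mu0 = X[y == 0].mean(axis=0)           # estimate (8): mean of the negative class
    mu1 = X[y == 1].mean(axis=0)           # same form with I{y_i = 1}
    # Subtract from each sample the mean of its own class
    centered = X - np.where((y == 1)[:, None], mu1, mu0)
    sigma = centered.T @ centered / m      # estimate (9): divides by m, not m - 1
    return phi, mu0, mu1, sigma

# Made-up toy data: two negative and two positive samples
X = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 4.0], [6.0, 4.0]])
y = np.array([0, 0, 1, 1])
phi, mu0, mu1, sigma = fit_gda_shared(X, y)
print(phi, mu0, mu1)
print(sigma)
```

Unlike the sample covariance in formula (1), estimate (9) divides by $m$, because it is the maximum likelihood estimate rather than the unbiased one.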

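If the diagonal entries of a class's covariance matrix are small, the sample points of that class are tightly concentrated; if they are large, the points are more spread out. If the two classes share the same covariance matrix, the decision boundary is a hyperplane; with equal priors ($\varphi = 0.5$) it passes through the midpoint of the two distribution centers $\mu_0$ and $\mu_1$, and when $\Sigma$ is additionally a multiple of the identity matrix it is their perpendicular bisector. Below is an example of a two-dimensional GDA model (from Andrew Ng's machine learning course):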

[Figure: example of a two-dimensional GDA model]
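When both classes share one covariance matrix, the posterior $P(y = 1 \mid x)$ is a logistic function of a linear score $w^T x + b$, with $w = \Sigma^{-1}(\mu_1 - \mu_0)$ and $b = -\frac{1}{2}(\mu_1 + \mu_0)^T w + \log\frac{\varphi}{1 - \varphi}$. The sketch below illustrates this; the helper names `linear_boundary` and `posterior_pos` and the parameter values are made up for the example.

```python
import numpy as np

def linear_boundary(phi, mu0, mu1, sigma):
    """Return (w, b) such that P(y=1|x) = sigmoid(w @ x + b) for shared-covariance GDA."""
    mu0 = np.asarray(mu0, dtype=float)
    mu1 = np.asarray(mu1, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    w = np.linalg.solve(sigma, mu1 - mu0)          # sigma^{-1} (mu1 - mu0)
    b = -0.5 * (mu1 + mu0) @ w + np.log(phi / (1 - phi))
    return w, b

def posterior_pos(x, w, b):
    """Posterior probability of the positive class at point x."""
    return 1.0 / (1.0 + np.exp(-(np.asarray(x, dtype=float) @ w + b)))

# Made-up parameters: equal priors and identity covariance
w, b = linear_boundary(0.5, [0.0, 0.0], [2.0, 2.0], np.eye(2))
# The midpoint of the two means lies on the boundary, so the posterior there is 0.5
print(w, b, posterior_pos([1.0, 1.0], w, b))
```

With unequal priors the term $\log\frac{\varphi}{1 - \varphi}$ shifts the boundary away from the midpoint, toward the mean of the less frequent class.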

Code

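After finishing Andrew Ng's machine learning course, I wrote a Gaussian discriminant algorithm in Python myself. If you find any problems, please leave a comment: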

import math

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

class GDA(object):
    """Gaussian discriminant analysis for binary labels (0/1), one covariance matrix per class."""

    def __init__(self, train_data, train_target):
        self.X = np.asarray(train_data, dtype=float)
        self.Y = np.asarray(train_target)
        self.data_num, self.feature_size = self.X.shape
        # Split the training samples by label
        pos_data = self.X[self.Y == 1]
        neg_data = self.X[self.Y != 1]
        # Class priors P(y=1) and P(y=0)
        self.Ppos = pos_data.shape[0] / self.data_num
        self.Pneg = 1 - self.Ppos
        # Per-class mean vectors
        self.Ave_pos = pos_data.mean(axis=0)
        self.Ave_neg = neg_data.mean(axis=0)
        # Per-class, per-feature variances, each computed from that class's own samples
        # (kept for reference; predict() does not use them)
        self.Var_pos = pos_data.var(axis=0)
        self.Var_neg = neg_data.var(axis=0)
        # Per-class sample covariance matrices (np.cov divides by m - 1)
        self.Cov_pos = np.cov(pos_data.T)
        self.Cov_neg = np.cov(neg_data.T)

    def predict(self, test_data):
        predict_target = []
        # Work in log space: with many features the raw densities easily underflow
        _, logdet_pos = np.linalg.slogdet(self.Cov_pos)
        _, logdet_neg = np.linalg.slogdet(self.Cov_neg)
        Cov_pos_inv = np.linalg.inv(self.Cov_pos)
        Cov_neg_inv = np.linalg.inv(self.Cov_neg)
        # The normalizer depends on the number of features n, not on the number of test samples
        log_norm = self.feature_size * math.log(2 * math.pi)
        for x in test_data:
            d_pos = x - self.Ave_pos
            d_neg = x - self.Ave_neg
            # log P(x|y) + log P(y) for each class
            log_pos = -0.5 * (log_norm + logdet_pos + d_pos @ Cov_pos_inv @ d_pos) + math.log(self.Ppos)
            log_neg = -0.5 * (log_norm + logdet_neg + d_neg @ Cov_neg_inv @ d_neg) + math.log(self.Pneg)
            # By Bayes' rule, P(y=1|x) >= 0.5 is equivalent to log_pos >= log_neg
            predict_target.append(1 if log_pos >= log_neg else 0)
        return predict_target

    def evaluate(self, predict_target, test_target):
        """Return the error rate: the fraction of predictions that differ from the true labels."""
        err = 0
        for predicted, actual in zip(predict_target, test_target):
            if predicted != actual:
                err += 1
        return err/len(predict_target)

if __name__ == "__main__":
    cancer = load_breast_cancer()
    # Hold out 10% of the samples for testing
    train_data, test_data, train_target, test_target = train_test_split(
        cancer.data, cancer.target, test_size=0.1, random_state=0)

    gda = GDA(train_data, train_target)
    predict = gda.predict(test_data)
    print('Accuracy:', 1 - gda.evaluate(predict, test_target))

Reposted from blog.csdn.net/zesenchen/article/details/79631409