Hung-yi Lee Machine Learning, Spring 2020, Homework 2 (hw2), Part 2

Contents

- II. Probabilistic generative model

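Here I want to lay out my own understanding of logistic regression and the probabilistic generative model. In the lecture videos, Hung-yi Lee explains that logistic regression is a discriminative model, and that its counterpart is the generative model. So the comparison below is really about the difference between these two families of models.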

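$\color{green}Generative$ first makes an assumption about how the data are distributed, i.e. which probability distribution they follow (for example a two-dimensional normal distribution or a binomial distribution). Then, using Bayes' theorem, it computes the probability that x comes from class 1 (which yields a formula); once the unknown quantities in that formula are estimated, the corresponding $\color{red}w,b$ follow directly. I recommend reviewing the lecture video: https://www.bilibili.com/video/BV1JE411g7XF?p=11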

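$\color{blue}Discriminative$, by contrast, makes no assumption about the data distribution at the start. We know the model will end up as $\color{red}z=w^Tx + b$; we compute the loss from σ(z) and minimize it by gradient descent, which yields the optimal $\color{red}w^*,b^*$. The initial $\color{red}w,b$ are random, and it is the loss function that drives them closer and closer to $\color{red}w^*,b^*$.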
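The discriminative procedure just described can be sketched on toy data. This is only a minimal illustration, not the homework solution: the two Gaussian blobs, the learning rate `lr`, and the helper names `sigmoid` and `cross_entropy` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): two 2-D Gaussian blobs, label 1 for the first blob.
X = np.vstack([rng.normal(2.0, 1.0, size=(100, 2)),
               rng.normal(-2.0, 1.0, size=(100, 2))])
y = np.concatenate([np.ones(100), np.zeros(100)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(X, y, w, b):
    # Mean cross-entropy between labels y and predictions sigmoid(Xw + b).
    p = np.clip(sigmoid(X @ w + b), 1e-8, 1 - 1e-8)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Random starting point for w, as described in the text.
w = rng.normal(size=2)
b = 0.0
loss_start = cross_entropy(X, y, w, b)

lr = 0.1  # assumed learning rate
for _ in range(200):
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) / len(y)   # gradient of the mean cross-entropy w.r.t. w
    grad_b = np.mean(p - y)           # gradient w.r.t. b
    w -= lr * grad_w
    b -= lr * grad_b

loss_end = cross_entropy(X, y, w, b)
acc = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(loss_start, loss_end, acc)
```

No distribution is assumed anywhere in this loop; the parameters move only because the loss on the observed samples decreases.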

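From this, the strengths and weaknesses of the two are easy to see:
In most cases $\color{blue}Discriminative$ achieves better accuracy than $\color{green}Generative$, because $\color{blue}Discriminative$ finds $\color{red}w^*,b^*$ by gradient descent directly on the observed data without committing to any distribution, so the result tends to stay closer to the real data. $\color{green}Generative$ starts from an assumption, and the assumed distribution may simply be wrong. In addition, it plugs sample estimates into probability formulas; those estimates only approach the true population values as the sample grows very large, so they may not hold for the sample we actually have, since a small sample cannot represent the whole population.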

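However, $\color{green}Generative$ does not perform worse than $\color{blue}Discriminative$ in every situation. For example:
(1) Because $\color{blue}Discriminative$ relies only on the data for gradient descent, its performance depends heavily on the amount of data. When data are scarce, $\color{green}Generative$ leans less on the data and more on its assumed distribution when estimating $\color{red}w^*,b^*$; if that assumption is accurate, its result will be better than that of $\color{blue}Discriminative$.
(2) When the data contain noise, the distributional assumption lets $\color{green}Generative$ discount the noise and reach a more accurate answer.
(3) In speech recognition, the probability that a given sentence is spoken can be computed directly from a formula, without $\color{blue}Discriminative$ parameter estimation, so $\color{green}Generative$ is sometimes the better fit.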

II. Probabilistic generative model

1. Preparing Data

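The training and test sets are processed the same way as for logistic regression. However, because the generative model has a closed-form optimal solution (computed directly from formulas), there is no need for a development set. The normalization function is written out again here as a reminder; see the previous post for details.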

import numpy as np

np.random.seed(0)
X_train_fpath = './data/X_train'
Y_train_fpath = './data/Y_train'
X_test_fpath = './data/X_test'
output_fpath = './output_{}.csv'  # output file template for the predictions

with open(X_train_fpath) as f:
    next(f)
    X_train = np.array([line.strip('\n').split(',')[1:] for line in f], dtype=float)

with open(Y_train_fpath) as f:
    next(f)
    Y_train = np.array([line.strip('\n').split(',')[1] for line in f], dtype=float )

with open(X_test_fpath) as f:
    next(f)
    X_test = np.array([line.strip('\n').split(',')[1:] for line in f], dtype=float)

def _normalize(X, train = True, specified_column = None, X_mean = None, X_std = None):
    '''
    This function normalizes specific columns of X.
    The mean and standard deviation of training data will be reused when processing testing data.

    :param X: data to be processed
    :param train: 'True' when processing training data, 'False' for testing data
    :param specified_column: indexes of the columns that will be normalized.
                              If 'None', all columns will be normalized.
    :param X_mean: mean value of training data,used when train = 'False'
    :param X_std: standard deviation of training data, used when train = 'False'

    :return:
    X: normalized data
    X_mean:computed mean value of training data
    X_std:computed standard deviation of training data
    '''

    if specified_column is None:
        specified_column = np.arange(X.shape[1])
    if train:
        X_mean = np.mean(X[:,specified_column], 0).reshape(1,-1)
        X_std = np.std(X[:,specified_column], 0).reshape(1,-1)

    X[:,specified_column] = (X[:, specified_column] - X_mean) / (X_std + 1e-8)

    return X, X_mean, X_std

# Normalize training and testing data
X_train, X_mean, X_std = _normalize(X_train, train=True)
X_test, _, _= _normalize(X_test, train=False, specified_column=None, X_mean=X_mean, X_std=X_std)
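The point of passing `X_mean` and `X_std` back in for the test set can be seen on a tiny example. The sketch below uses hypothetical toy matrices `train` and `test` in place of the real data and inlines the arithmetic instead of calling `_normalize`.

```python
import numpy as np

# Hypothetical toy matrices standing in for X_train / X_test.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Statistics come from the training data only.
mean = train.mean(axis=0, keepdims=True)
std = train.std(axis=0, keepdims=True)

train_norm = (train - mean) / (std + 1e-8)
test_norm = (test - mean) / (std + 1e-8)   # reuse the training statistics

print(train_norm.mean(axis=0), train_norm.std(axis=0))
print(test_norm)
```

After this step the training columns have mean 0 and standard deviation 1, while the test row is expressed on the same scale, so a test value far from the training mean stays far from 0.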

1. Mean and Covariance

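The probability that a sampled x comes from class 1 is $P(C_{1}|x)$; if $P(C_{1}|x) > 0.5$, x is assigned to class 1. Following the $\color{green}Generative$ approach, we assume the data follow an n-dimensional normal distribution. (For ease of writing, the derivation below uses only two features. The number of features only affects the size of the Σ matrix: with two features the covariance matrix is 2×2, with 510 features it is initialized as 510×510. What matters most is understanding the procedure.)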
[Figure: derivation of $P(C_{1}|x)$ under the normal-distribution assumption]

The final simplified result shows that to obtain $\color{red}w^*,b^*$ we need to compute the mean μ and the covariance Σ. The hand calculation goes as follows:

[Figure: hand calculation of the mean μ and the covariance Σ]
Finally, $Σ$ is obtained from $Σ=\frac{N_{1}Σ^1+N_{2}Σ^2}{N_{1}+N_{2}}$.
$\color{red}\text{Supplementary notes}$

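X_train_0 = np.array([x for x, y in zip(X_train, Y_train) if y == 0]): walks through Y_train and, for every entry equal to 0, keeps the X_train row at the same index; these rows form X_train_0.
More on sequence unpacking is covered in the article linked from the original post.
np.mean(X_train_0, axis = 0): averages each column, giving a vector with one entry per feature (length 510 here). X_train_0 has 43105 rows and X_train_1 has 11151 rows.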
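Both notes can be checked on a small array. This is a sketch with hypothetical toy data (5 samples, 3 features); `X_0_mask` shows boolean-mask indexing, a NumPy alternative that selects the same rows as the `zip` comprehension.

```python
import numpy as np

# Hypothetical toy data: 5 samples, 3 features.
X = np.arange(15, dtype=float).reshape(5, 3)
Y = np.array([0, 1, 0, 0, 1], dtype=float)

# Same construction as in the post: keep the rows whose label is 0.
X_0_zip = np.array([x for x, y in zip(X, Y) if y == 0])
# Boolean-mask indexing selects the same rows without a Python loop.
X_0_mask = X[Y == 0]

# axis=0 averages down each column: one mean per feature.
mean_0 = np.mean(X_0_zip, axis=0)
print(X_0_zip.shape, mean_0.shape)
```

The number of rows of the selected array is the class size, while the length of `mean_0` is the number of features; the two should not be confused.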

$\hat{y}$ takes the value 0 or 1, so the training data are split into two groups, each modeled by its own normal distribution.

# Split the data into two normal distributions: X_train_0 and X_train_1
X_train_0 = np.array([x for x, y in zip(X_train, Y_train) if y == 0])
X_train_1 = np.array([x for x, y in zip(X_train, Y_train) if y == 1])

# Compute the mean vector E(x_i) of each class
mean_0 = np.mean(X_train_0, axis = 0)
mean_1 = np.mean(X_train_1, axis = 0)  

# Initialize the two Σ matrices
data_dim = X_train.shape[1]
cov_0 = np.zeros((data_dim, data_dim))
cov_1 = np.zeros((data_dim, data_dim))

# Compute the two Σ matrices following the calculation in the figure above
for x in X_train_0:
    cov_0 += np.dot(np.transpose([x - mean_0]), [x - mean_0]) / X_train_0.shape[0]
for x in X_train_1:
    cov_1 += np.dot(np.transpose([x - mean_1]), [x - mean_1]) / X_train_1.shape[0]

# Give both normal distributions the same Σ, weighted by the number of rows in each class
cov = (cov_0 * X_train_0.shape[0] + cov_1 * X_train_1.shape[0]) / (X_train_0.shape[0] + X_train_1.shape[0])
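The per-sample loops above can also be written as a single matrix product. The following is a sketch on hypothetical random data; the helper `class_cov` is a name introduced here for illustration and is not part of the homework code.

```python
import numpy as np

rng = np.random.default_rng(0)
X_0 = rng.normal(size=(40, 3))          # hypothetical class-0 samples
X_1 = rng.normal(size=(10, 3)) + 1.0    # hypothetical class-1 samples

def class_cov(X):
    # (1/N) * sum_i (x_i - mu)(x_i - mu)^T, written as one matrix product.
    centered = X - X.mean(axis=0)
    return centered.T @ centered / X.shape[0]

cov_0, cov_1 = class_cov(X_0), class_cov(X_1)
n0, n1 = X_0.shape[0], X_1.shape[0]

# Shared covariance: class covariances weighted by class size.
cov_shared = (n0 * cov_0 + n1 * cov_1) / (n0 + n1)

# Loop version, written as in the post, for comparison.
loop_cov_0 = np.zeros((3, 3))
mu_0 = X_0.mean(axis=0)
for x in X_0:
    loop_cov_0 += np.dot(np.transpose([x - mu_0]), [x - mu_0]) / n0

print(np.allclose(cov_0, loop_cov_0))
```

Both versions divide by N rather than N - 1, which matches `np.cov(X_0, rowvar=False, bias=True)`. With 510 features and tens of thousands of rows, the matrix-product form avoids a long Python loop.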

2. Computing weights and bias

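From the formula derived in the figure above: $z=\color{red}(μ^1-μ^2)^TΣ^{-1}\color{black}x\color{blue}-0.5(μ^1)^TΣ^{-1}(μ^1)+0.5(μ^2)^TΣ^{-1}(μ^2)+\ln\frac{N_{1}}{N_{2}}$, where the red part is $\color{red}w^T$ and the blue part is $\color{blue}b$ (both classes share the same $Σ$, as computed in the previous step). $μ^1,μ^2$ are already known, so we still need $Σ^{-1}$ to obtain $\color{red}w$ and $\color{blue}b$.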
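For reference, here is a sketch of the standard derivation behind this formula. It assumes both classes share one covariance matrix $Σ$ and that the priors are estimated as $P(C_1)=\frac{N_1}{N_1+N_2}$ and $P(C_2)=\frac{N_2}{N_1+N_2}$; the Gaussian normalizing constants then cancel in the ratio.

$$
\begin{aligned}
P(C_1\mid x) &= \frac{P(x\mid C_1)P(C_1)}{P(x\mid C_1)P(C_1)+P(x\mid C_2)P(C_2)} = \sigma(z),
\qquad z=\ln\frac{P(x\mid C_1)P(C_1)}{P(x\mid C_2)P(C_2)} \\
z &= -\tfrac12 (x-\mu^1)^T\Sigma^{-1}(x-\mu^1) + \tfrac12 (x-\mu^2)^T\Sigma^{-1}(x-\mu^2) + \ln\frac{N_1}{N_2} \\
  &= (\mu^1-\mu^2)^T\Sigma^{-1}x - \tfrac12(\mu^1)^T\Sigma^{-1}\mu^1 + \tfrac12(\mu^2)^T\Sigma^{-1}\mu^2 + \ln\frac{N_1}{N_2}
\end{aligned}
$$

The quadratic terms $x^T\Sigma^{-1}x$ cancel between the two classes, which is why the decision function is linear in x.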

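This raises a question: does the matrix have an inverse, and what do we do if it does not? When the determinant of a matrix A is 0 (written |A|=0 or det(A)=0), A is called a singular matrix, and a singular matrix has no inverse. We can still compute its pseudo-inverse, which generalizes the inverse and coincides with it whenever the inverse exists.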
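A two-by-two case makes this concrete. The matrix `A` below is a hypothetical example whose rows are linearly dependent; `np.linalg.pinv` returns its Moore-Penrose pseudo-inverse.

```python
import numpy as np

# Hypothetical singular matrix: the second row is twice the first.
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])

det_A = np.linalg.det(A)      # zero, up to floating-point error
A_pinv = np.linalg.pinv(A)    # Moore-Penrose pseudo-inverse

# Defining property of the pseudo-inverse: A @ A_pinv @ A reproduces A,
# even though A has no ordinary inverse.
reconstructed = A @ A_pinv @ A
print(det_A, np.allclose(reconstructed, A))
```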

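Several methods for inverting a matrix are worth mentioning here:
Method 1: Gaussian elimination
Method 2: LU decomposition
Method 3: SVD (singular value decomposition)
Method 4: QR decomposition
Of these, SVD is the most numerically stable. For a more detailed comparison of the four, see
https://www.zhihu.com/question/345971704/answer/1624930445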

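Because the covariance matrix $(Σ)$ we computed may be nearly singular, inverting it directly with np.linalg.inv() can produce large numerical errors. The SVD gives an efficient and accurate way to obtain the inverse. The full derivation is at the link below.
https://zhuanlan.zhihu.com/p/134512367?utm_source=qq&utm_medium=social&utm_oi=1184094730944806912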

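Here is the result for computing the inverse:
Decomposing cov by SVD gives $\color{red}cov=UΣV^T$, so $\color{red}cov^+=VΣ^+U^T$. (The columns of U are the left singular vectors of cov, the diagonal entries of Σ are the singular values of the original matrix cov, and the columns of V are the right singular vectors of cov; $cov^+$ denotes the pseudo-inverse of cov and $Σ^+$ the pseudo-inverse of Σ. In this formula Σ is the diagonal singular-value matrix from the SVD, not the covariance matrix.)
Apart from the singular values on its diagonal, $Σ$ is zero everywhere, so $Σ^+$ is obtained by taking the reciprocal of each nonzero diagonal entry (when all singular values are nonzero, $ΣΣ^+=E$, the identity matrix).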
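The construction $cov^+=VΣ^+U^T$ can be checked against NumPy's own `np.linalg.pinv`. The sketch below uses a hypothetical 4×4 matrix `M`; the cutoff `tol` for treating a singular value as zero is an assumption chosen for this example, not a value from the homework.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
M = B @ B.T                      # hypothetical symmetric positive semi-definite matrix

u, s, vT = np.linalg.svd(M, full_matrices=False)

# Reciprocal of each singular value; values below the cutoff stay at zero,
# which is what distinguishes a pseudo-inverse from a plain inverse.
tol = 1e-12 * s.max()            # assumed cutoff for "numerically zero"
s_inv = np.zeros_like(s)
s_inv[s > tol] = 1.0 / s[s > tol]

# V diag(s_inv) U^T; broadcasting scales each column of V by one entry of s_inv.
M_pinv = (vT.T * s_inv) @ u.T

print(np.allclose(M_pinv, np.linalg.pinv(M)))
```

When every singular value is above the cutoff, as here, the result is the ordinary inverse and `M @ M_pinv` is the identity.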

Everything is ready! Now compute $w$ and $b$ by plugging into the formula:
$\color{red}w^T=(μ^1-μ^2)^TΣ^{-1}$
$\color{blue}b=-0.5(μ^1)^TΣ^{-1}(μ^1)+0.5(μ^2)^TΣ^{-1}(μ^2)+\ln\frac{N_{1}}{N_{2}}$

u, s, vT = np.linalg.svd(cov, full_matrices=False)
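# s returned here is not the diagonal matrix Σ but a vector of its 510 singular values
# vT.T has shape 510*510, so s broadcasts across its columns without reshaping
# 1 / s takes the reciprocal of each singular value, i.e. inverts the diagonal matrix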
inv = np.matmul(vT.T * 1 / s, u.T)

# Directly compute weights and bias
w = np.dot(inv, mean_0 - mean_1)
b =  (-0.5) * np.dot(mean_0, np.dot(inv, mean_0)) + 0.5 * np.dot(mean_1, np.dot(inv, mean_1))\
    + np.log(float(X_train_0.shape[0]) / X_train_1.shape[0])
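As a sanity check, the linear form `w @ x + b` should equal the log-ratio of the two class posteriors computed directly from the Gaussian densities. The sketch below verifies this on hypothetical 2-D data; `log_gauss_unnorm` is a helper introduced only for this check, and a plain `np.linalg.inv` is used because the toy covariance is well conditioned. As in the post's code, class 0 plays the role of $C_1$, so sigmoid(z) is the probability of class 0, which is why the predictions are later flipped with `1 - _predict(...)`.

```python
import numpy as np

rng = np.random.default_rng(0)
X_0 = rng.normal(loc=[1.0, 0.0], size=(60, 2))    # hypothetical class-0 samples
X_1 = rng.normal(loc=[-1.0, 0.5], size=(40, 2))   # hypothetical class-1 samples
n0, n1 = len(X_0), len(X_1)

mean_0, mean_1 = X_0.mean(axis=0), X_1.mean(axis=0)
c0 = (X_0 - mean_0).T @ (X_0 - mean_0) / n0
c1 = (X_1 - mean_1).T @ (X_1 - mean_1) / n1
cov = (n0 * c0 + n1 * c1) / (n0 + n1)             # shared covariance
inv = np.linalg.inv(cov)                          # fine for this well-conditioned toy case

# Same formulas as in the post.
w = inv @ (mean_0 - mean_1)
b = -0.5 * mean_0 @ inv @ mean_0 + 0.5 * mean_1 @ inv @ mean_1 + np.log(n0 / n1)

def log_gauss_unnorm(x, mu):
    # Gaussian log density up to the constant shared by both classes.
    d = x - mu
    return -0.5 * d @ inv @ d

x = np.array([0.3, -0.2])
z_direct = (log_gauss_unnorm(x, mean_0) + np.log(n0)) - (log_gauss_unnorm(x, mean_1) + np.log(n1))
z_linear = w @ x + b
print(z_direct, z_linear)
```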

3、Compute accuracy on training set

These functions are identical to those in the previous post and are simply copied here; see that post for what each one does.

def _sigmoid(z):
    # Sigmoid function can be used to calculate probability.
    # To avoid overflow, minimum/maximum output value is set.
    return np.clip(1 / (1.0 + np.exp(-z)), 1e-8, 1 - (1e-8))
def _f(X, w, b):
    '''
    This is the logistic regression function, parameterized by w and b

    :param X: input data, shape = [batch_size, data_dimension]
    :param w: weight vector, shape = [data_dimension, ]
    :param b: bias, scalar
    :return: predicted probability of each row of X being positively labeled, shape = [batch_size, ]
    '''
    return _sigmoid(np.matmul(X, w) + b)

def _predict(X, w, b):
    # This function returns a truth value prediction for each row of X
    # by rounding the result of logistic regression function.
    return np.round(_f(X, w, b)).astype(int)  # np.int was removed in NumPy 1.24

def _accuracy(Y_pred, Y_label):
    # This function calculates prediction accuracy
    acc = 1 - np.mean(np.abs(Y_pred - Y_label))
    return acc

# Compute accuracy on training set
Y_train_pred = 1 - _predict(X_train, w, b)
print('Training accuracy: {}'.format(_accuracy(Y_train_pred, Y_train)))
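The `1 - _predict(...)` flip and the accuracy formula can be traced by hand on four samples. The probabilities and labels below are hypothetical; `p_class0` stands for sigmoid(z), which in this post's setup is the probability of class 0.

```python
import numpy as np

# Hypothetical values of sigmoid(z) = P(class 0 | x) for four samples.
p_class0 = np.array([0.9, 0.2, 0.6, 0.1])
labels = np.array([0, 1, 1, 1])

# Rounding gives 1 where class 0 is more likely, so flip to get the label.
pred = 1 - np.round(p_class0).astype(int)

# For 0/1 labels, mean(|pred - label|) is the error rate.
acc = 1 - np.mean(np.abs(pred - labels))
print(pred, acc)
```

Only the third sample is misclassified, so three of the four predictions match the labels.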

4、Predicting testing labels

This step is the same as the prediction part at the end of the previous post (practically copy and paste); see that post for the details.

# Predict testing labels
predictions = 1 - _predict(X_test, w, b)
with open(output_fpath.format('generative'), 'w') as f:
    f.write('id,label\n')
    for i, label in  enumerate(predictions):
        f.write('{},{}\n'.format(i, label))

# Print out the most significant weights
ind = np.argsort(np.abs(w))[::-1]
with open(X_test_fpath) as f:
    content = f.readline().strip('\n').split(',')[1:]  # drop the id column so names line up with w
features = np.array(content)
for i in ind[0:10]:
    print(features[i], w[i])
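The ranking of weights by magnitude works as follows. The weights `w_demo` and the feature names in `names` are made up for illustration.

```python
import numpy as np

# Hypothetical weights and feature names.
w_demo = np.array([0.5, -3.0, 1.2, -0.1])
names = np.array(['age', 'hours', 'gain', 'loss'])

# argsort is ascending, so reverse it to put the largest |w| first.
order = np.argsort(np.abs(w_demo))[::-1]
top2 = [(names[i], w_demo[i]) for i in order[:2]]
print(top2)
```

Sorting by absolute value keeps strongly negative weights at the top, since they influence the prediction as much as strongly positive ones.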

Summary

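In hw2 I learned two classification methods, logistic regression and the probabilistic generative model, along with the differences between them and the concrete workflow of each. More importantly, I got hands-on practice with the various array operations needed to write the code. Along the way I also picked up a basic understanding of PCA and SVD, their applications, and how the matrix inverse is computed in this assignment. I'll write a separate post on SVD some other day.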