Deep Learning--week1~week3

week1

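Consider an image of $64 \times 64$ pixels with three color channels (red, green, blue); it corresponds to three $64 \times 64$ real-valued matrices.

To represent these matrices as a vector, unroll their pixel values into a single vector x, which is the input to the algorithm.

Going from the red channel to the green and then the blue, read the elements row by row into x. Then x is a $(64 \times 64 \times 3) \times 1$ matrix, i.e. a column vector of dimension $64 \times 64 \times 3$.

Let $n_x = 64 \times 64 \times 3 = 12288$ denote the dimension of the feature vector x.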
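
A minimal NumPy sketch of this unrolling, assuming the image is stored as a (64, 64, 3) array with the channel as the last axis. The random image and the variable names are illustrative only:

```python
import numpy as np

# Hypothetical 64x64 RGB image stored as (height, width, channel)
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3))

# Move the channel axis to the front so that all red values come first,
# then green, then blue; each channel is read row by row.
x = image.transpose(2, 0, 1).reshape(-1, 1)

n_x = x.shape[0]
print(x.shape)  # (12288, 1)
```

The explicit `reshape(-1, 1)` makes x a true column vector rather than a one-dimensional array.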

All the training examples are collected as the columns of $X = \begin{bmatrix}\mid & \mid &\mid &&\mid \\ x^{(1)}& x^{(2)}& x^{(3)}& \cdots & x^{(m)}\\ \mid & \mid &\mid &&\mid \end{bmatrix}$ (an $n_x \times m$ matrix)

(Note that this is not $X = \begin{bmatrix} (x^{(1)})^T\\ \vdots \\ (x^{(m)})^T \end{bmatrix}$; the column layout above makes the computations simpler.)

$Y=\begin{bmatrix}y^{(1)} & y^{(2)} & \cdots & y^{(m)}\end{bmatrix}$
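
As a rough illustration of the column layout, the sketch below builds X and Y from a hypothetical batch of images of shape (m, 64, 64, 3). The data are random placeholders, not a real dataset:

```python
import numpy as np

m = 5                                    # number of training examples (illustrative)
rng = np.random.default_rng(1)
images = rng.random((m, 64, 64, 3))      # hypothetical dataset
labels = rng.integers(0, 2, size=m)      # hypothetical 0/1 labels

# Flatten every image channel by channel and place it in one column of X
X = images.transpose(0, 3, 1, 2).reshape(m, -1).T   # shape (n_x, m)
Y = labels.reshape(1, m)                            # shape (1, m)

print(X.shape, Y.shape)  # (12288, 5) (1, 5)
```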

The form $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_{n_x} \\ \end{bmatrix}$ used in the earlier machine learning course is dropped; instead we use $\large b = \theta_0, \; w = \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_{n_x} \\ \end{bmatrix}$ (it will be easier to just keep $b$ and $w$ as separate parameters).

Then the output is $\large \hat{y}^{(i)} = \sigma(w^Tx^{(i)}+b)$, where $\large \sigma(z^{(i)}) = \frac{1}{1+e^{-z^{(i)}}}$ and $z^{(i)} = w^Tx^{(i)}+b$.

$\text{Given \{}(x^{(1)}, y^{(1)}),\dots,(x^{(m)},y^{(m)})\text{\}, want } \hat{y}^{(i)} \approx y^{(i)}$
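
A small sketch of the prediction for a single example, with made-up numbers for w, b and x (only the formula itself comes from the notes):

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

n_x = 4
w = np.zeros((n_x, 1))                        # weights, column vector
b = 0.0                                       # bias
x = np.array([[0.5], [-1.2], [3.0], [0.1]])   # one illustrative example

z = (w.T @ x).item() + b                      # scalar w^T x + b
y_hat = sigmoid(z)
print(y_hat)  # 0.5, because w = 0 and b = 0 give z = 0
```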

week2

Loss Function/Error Function

Loss Function/Error Function: measures how well our algorithm is doing on a single training example
\[ {\cal L}(\hat{y},y) = -y\cdot \log(\hat{y})-(1-y)\cdot \log(1-\hat{y}) \]
Cost Function: the average loss over all $m$ training examples
\[ J(w,b) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)})\right] \]
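
The two formulas can be sketched in NumPy as follows. The function names `loss` and `cost` and the sample values are illustrative assumptions:

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss, computed element-wise for each example."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(Y_hat, Y):
    """Average loss over the m examples (columns)."""
    return float(np.mean(loss(Y_hat, Y)))

Y = np.array([[1, 0, 1]])
Y_hat = np.array([[0.9, 0.2, 0.5]])
print(cost(Y_hat, Y))
```

A confident correct prediction (0.9 for a label of 1) contributes a small loss, while the uncertain prediction 0.5 contributes $\log 2$.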

Gradient Descent

See the notes from the ML course; it is essentially the same.

Vectorization:

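import numpy as np

# Non-vectorized (slow): explicit Python loop over the n_x features.
# w and x each hold n_x values, b is a scalar.
z = 0
for i in range(n_x):
    z += w[i] * x[i]
z += b

# Vectorized (fast): a single call into NumPy
z = np.dot(w, x) + b

Whenever possible, avoid explicit for-loops (Python is an interpreted language, so they are slow); NumPy's built-in functions give a concise and efficient implementation.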
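
To see the difference in practice, here is a rough timing sketch. The vector length and the random data are arbitrary choices, and the measured times depend on the machine:

```python
import time
import numpy as np

n_x = 100_000
rng = np.random.default_rng(0)
w = rng.standard_normal(n_x)
x = rng.standard_normal(n_x)
b = 0.5

# Explicit loop
start = time.perf_counter()
z_loop = 0.0
for i in range(n_x):
    z_loop += w[i] * x[i]
z_loop += b
loop_seconds = time.perf_counter() - start

# Vectorized
start = time.perf_counter()
z_vec = np.dot(w, x) + b
vec_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vec_seconds:.6f}s")
```

Both versions compute the same value up to floating-point rounding; only the running time differs.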

Vectorizing Logistic Regression

$X = \begin{bmatrix} \lvert & \lvert & \cdots & \lvert \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ \lvert & \lvert & \cdots & \lvert \end{bmatrix} \in \mathbb{R}^{n_x \times m}$

$Z = \begin{bmatrix}z^{(1)} & z^{(2)} & \cdots & z^{(m)} \end{bmatrix} = w^TX + \begin{bmatrix}b &b & \cdots & b \end{bmatrix}$

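$z^{(i)}$ is the input to the sigmoid function

$A = \begin{bmatrix}a^{(1)} & a^{(2)} & \cdots & a^{(m)} \end{bmatrix} = \sigma(Z)$

(Here the elements with different superscripts all seem to belong to the same layer, which differs from the ML course. In $a^{[j](i)}$, the square brackets hold the layer index and the parentheses hold the index $i$ of the training example.)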

import numpy as np
Z = np.dot(w.T, X) + b
# Python automatically takes the real number b and expands (broadcasts) it into a 1*m row vector
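
A self-contained sketch of this vectorized forward pass, checked against a per-example loop. The sizes and random values are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_x, m = 3, 4
rng = np.random.default_rng(0)
X = rng.standard_normal((n_x, m))   # one example per column
w = rng.standard_normal((n_x, 1))
b = 0.1                             # scalar, broadcast across the m columns

Z = np.dot(w.T, X) + b              # shape (1, m)
A = sigmoid(Z)                      # shape (1, m)

# Same result as processing the examples one at a time
A_loop = np.array([[sigmoid((w.T @ X[:, [i]]).item() + b) for i in range(m)]])
```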

Gradient Output

${\rm d}z^{(i)} = a^{(i)} - y^{(i)}$

$\begin{align}{\rm d}Z &= \begin{bmatrix}{\rm d}z^{(1)} & {\rm d}z^{(2)} & \cdots & {\rm d}z^{(m)} \end{bmatrix} \\&= A-Y = \begin{bmatrix}a^{(1)} - y^{(1)} & a^{(2)} - y^{(2)} & \cdots & a^{(m)} - y^{(m)} \end{bmatrix} \end{align}$

${\rm d}b = \frac{1}{m}\sum_{i=1}^m {\rm d}z^{(i)} = \frac{1}{m}{\rm np.sum(d}Z{\rm )}$

${\rm d}w = \frac{1}{m}X{\rm d}Z^T$

A single iteration without any for-loop (vectorized):
\[ \begin{align} \downarrow&\begin{cases} Z & = w^TX+b\\ & = {\rm np.dot(}w{\rm .T, }X{\rm)}+b\\ A & = \sigma(Z)\\ {\rm d}Z &= A-Y \\ {\rm d}w &= \frac{1}{m}X{\rm d}Z^T\\ {\rm d}b &= \frac{1}{m}{\rm np.sum(d}Z{\rm )}\\ \end{cases}\\\\ w& := w - \alpha{\rm d}w\\ b &:= b - \alpha{\rm d}b \end{align} \]

To run multiple iterations, the outermost explicit for-loop over the iterations cannot be avoided.
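
Putting the pieces together, a minimal training sketch might look like the following. The function name, the learning rate, the iteration count and the toy data are all assumptions made for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, Y, alpha=0.1, num_iterations=1000):
    """Batch gradient descent; X is (n_x, m), Y is (1, m)."""
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for _ in range(num_iterations):      # the only explicit loop
        Z = np.dot(w.T, X) + b
        A = sigmoid(Z)
        dZ = A - Y
        dw = np.dot(X, dZ.T) / m
        db = np.sum(dZ) / m
        w = w - alpha * dw
        b = b - alpha * db
    return w, b

# Toy data: the label is 1 exactly when the single feature is positive
X = np.array([[-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]])
Y = np.array([[0, 0, 0, 1, 1, 1]])
w, b = train_logistic_regression(X, Y)
predictions = (sigmoid(np.dot(w.T, X) + b) > 0.5).astype(int)
print(predictions)  # [[0 0 0 1 1 1]]
```

Inside the loop there is no iteration over examples or features; each step operates on whole matrices.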

Broadcasting

Use reshape() to make sure a matrix has the shape you intend.

An example that illustrates NumPy's broadcasting mechanism:

>>> import numpy as np
>>> a = np.arange(0,6).reshape(6,1)
>>> a
array([[0],
       [1],
       [2],
       [3],
       [4],
       [5]])
>>> b = np.arange(0,5)
>>> b
array([0, 1, 2, 3, 4])
>>> a * b
array([[ 0,  0,  0,  0,  0],
       [ 0,  1,  2,  3,  4],
       [ 0,  2,  4,  6,  8],
       [ 0,  3,  6,  9, 12],
       [ 0,  4,  8, 12, 16],
       [ 0,  5, 10, 15, 20]])
>>> a + b
array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6],
       [3, 4, 5, 6, 7],
       [4, 5, 6, 7, 8],
       [5, 6, 7, 8, 9]])

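In other words, when a matrix is combined with a number or a vector using +, -, * or /, NumPy expands the number or vector by replicating it until the shapes are compatible.

Note that this can cause strange bugs in places where you expect an exception to be raised but none is:

For example, sometimes I want adding a row vector to a column vector to raise an exception, but NumPy computes a result through broadcasting instead...

NumPy pitfalls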
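
A short sketch of this pitfall and one possible guard. The helper `add_same_shape` is hypothetical, not a NumPy function:

```python
import numpy as np

row = np.arange(3).reshape(1, 3)      # shape (1, 3)
col = np.arange(3).reshape(3, 1)      # shape (3, 1)

result = row + col                    # no exception: broadcast to (3, 3)
print(result.shape)  # (3, 3)

def add_same_shape(u, v):
    """Hypothetical helper: refuse to add arrays whose shapes differ."""
    assert u.shape == v.shape, f"shape mismatch: {u.shape} vs {v.shape}"
    return u + v
```

Calling `add_same_shape(row, col)` raises an AssertionError instead of silently producing a 3 by 3 matrix.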

>>> import numpy as np
>>> a = np.random.randn(5)
>>> a
array([-0.19837642, -0.16758652,  1.57705505,  0.13033745, -0.81073889])
>>> a.shape
(5,)
# This is called a rank 1 array in Python; it is neither a row vector nor a column vector

>>> a.T
array([-0.19837642, -0.16758652,  1.57705505,  0.13033745, -0.81073889])
# which is the same as 'a' itself

>>> np.dot(a,a.T)
3.2288264718632416
# a number, rather than the matrix one might expect (something like array([[55]]))

Avoid "rank 1 arrays" with shapes such as (5,) or (n,); instead, state explicitly that the array is an $m \times n$ matrix:

>>> a = np.random.randn(5,1)
>>> a
array([[ 0.7643396 ],
       [-1.66945103],
       [ 1.66235712],
       [-0.06892102],
       [-1.61347409]])
>>> a.T
array([[ 0.7643396 , -1.66945103,  1.66235712, -0.06892102, -1.61347409]])

Note the difference between array([-0.19837642, -0.16758652, 1.57705505, 0.13033745, -0.81073889]) and array([[ 0.7643396 , -1.66945103, 1.66235712, -0.06892102, -1.61347409]]) (the latter has two pairs of square brackets): the former is a rank 1 array, while the latter is a true $1 \times 5$ matrix (just as in C, a matrix is represented by a two-dimensional array). (I also think "one-dimensional array" is a more accurate name for a rank 1 array.)

You can use an assert statement to check the dimensions of a vector.

When you get a rank 1 array, you can use a.reshape to turn it into an (n,1) array or a (1,n) array.
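
For instance, a small sketch of both habits (the array size 5 is arbitrary):

```python
import numpy as np

a = np.random.randn(5)        # rank 1 array, shape (5,)
col = a.reshape(5, 1)         # column vector, shape (5, 1)
row = a.reshape(1, 5)         # row vector, shape (1, 5)

# Cheap sanity checks that document the intended shapes
assert col.shape == (5, 1)
assert row.shape == (1, 5)

outer = np.dot(col, col.T)    # (5, 5) matrix
inner = np.dot(col.T, col)    # (1, 1) matrix
```

With explicit two-dimensional shapes, the transpose and the two products behave the way matrix algebra suggests.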

Logistic Regression Cost Function
\[ \left. \begin{array}{l} \text{If y=1:}\quad p(y|x)=\hat{y}\\ \text{If y=0:}\quad p(y|x)=1-\hat{y} \end{array} \right\} p(y|x) = \hat{y}^y\cdot (1-\hat{y})^{1-y}\\ \,\\ \begin{align} \therefore {\rm log}(p(y|x)) &= y\cdot log\,\hat{y} + (1-y)\cdot log\, (1-\hat{y}) \\ &= -\mathcal{L}(\hat{y},y) \end{align} \]

所以：

\[ \begin{align} {\rm log }[p(\text{labels in training set})] &= {\rm log } \prod_{i=1}^mp(y^{(i)}|x^{(i)})\\ &=\sum_{i=1}^m {\rm log\,}p(y^{(i)}|x^{(i)})\\ &=\sum_{i=1}^m-\mathcal{L}(\hat{y}^{(i)},y^{(i)})\\ &=-\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)}) \end{align}\\ \text{Cost: }J(w,b) = \frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)}) \]
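This is maximum likelihood estimation: minimizing the cost $J(w,b)$ maximizes the log-likelihood of the training labels (the factor $\frac{1}{m}$ only rescales the objective).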
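
The identity between the log-likelihood and the cost can be checked numerically. The labels and predictions below are made-up values:

```python
import numpy as np

Y = np.array([[1, 0, 1, 1]])
Y_hat = np.array([[0.8, 0.3, 0.6, 0.9]])   # illustrative predictions
m = Y.shape[1]

# p(y|x) = y_hat^y * (1 - y_hat)^(1 - y) for each example
p = Y_hat**Y * (1 - Y_hat)**(1 - Y)
log_likelihood = np.sum(np.log(p))

losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
J = np.mean(losses)

print(np.isclose(log_likelihood, -m * J))  # True
```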

week3

$Z^{[j]} = w^{[j]}A^{[j-1]} + b^{[j]} = w^{[j]}\begin{bmatrix} | & | & | & \\ a^{[j-1](1)} & a^{[j-1](2)} & a^{[j-1](3)} & \cdots \\ | & | & | & \end{bmatrix} + b^{[j]} = \begin{bmatrix} | & | & | & \\ z^{[j](1)} & z^{[j](2)} & z^{[j](3)} & \cdots \\ | & | & | & \end{bmatrix}$

where $i \in \{1,\dots,m\}$ indexes the training examples, $j \in \{1,\dots,n\}$ indexes the layers, and $X = A^{[0]}$
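
A sketch of this layer-by-layer computation that tracks the shapes. The layer sizes, the random weights and the choice of tanh for every layer are assumptions for illustration only:

```python
import numpy as np

layer_sizes = [3, 4, 1]    # n^{[0]} = n_x, n^{[1]}, n^{[2]} (illustrative)
m = 5
rng = np.random.default_rng(0)
A = rng.standard_normal((layer_sizes[0], m))    # A^{[0]} = X

for j in range(1, len(layer_sizes)):
    W = rng.standard_normal((layer_sizes[j], layer_sizes[j - 1])) * 0.01
    b = np.zeros((layer_sizes[j], 1))           # broadcast across the m columns
    Z = np.dot(W, A) + b                        # Z^{[j]}, shape (n^{[j]}, m)
    A = np.tanh(Z)                              # A^{[j]}
    print(j, Z.shape)
```

Each column of Z and A corresponds to one training example, and each row to one unit of the layer.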

Other Activation Function

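① $\tanh(z)$ function:
\[ a= \tanh(z)=\frac{e^z -e^{-z}}{e^z +e^{-z}}\text{ , where } \tanh(z) \in (-1,1),\; \tanh(0)=0 \]
$\tanh(z)$ centers the activations around 0 (the sigmoid function centers them around 0.5).

From here on, the sigmoid function is used only when the output must satisfy $0 \le \hat{y} \le 1$ (i.e. in the output layer of a binary classification problem), because $\tanh$ is almost always strictly better than sigmoid...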

② Rectified Linear Unit (ReLU): $a = \max(0,z)$

When you are not sure what to use for a hidden layer, use the ReLU function.

Disadvantage of ReLU: when $z$ is negative, the output is 0, and so is the slope.

A variant called Leaky ReLU can be used to overcome the disadvantage above.

Leaky ReLU: $a = max(0.01z, z)$

ReLU keeps a constant slope for positive $z$ (for sigmoid and $\tanh(z)$ the slope tends to 0 as $z\rightarrow \infty$, which slows down learning).

It is the most commonly used activation function.

③ Tanh function (hyperbolic function)

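A linear activation function ($g(z)=z$) is used only in the output layer, and only when solving a regression problem. For example, when predicting house prices, y is not restricted to 0 and 1 ($y \in \mathbb{R}$), so the output layer can use $g(z)=z$. Hidden units should not use a linear activation function; they should use tanh/ReLU/Leaky ReLU instead.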
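
The activation functions discussed in this section can be sketched in NumPy as follows. The function names and the sample input are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)

def linear(z):
    # Only sensible in the output layer of a regression model
    return z

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))        # negative inputs become 0
print(leaky_relu(z))  # negative inputs are scaled by 0.01 instead
```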

Derivatives of Activation Functions

Sigmoid:
- $g'(z) = \frac{{\rm d}}{{\rm d}z}g(z) = g(z)(1-g(z))$
$\tanh(z)$:
- $g'(z) = 1-(\tanh(z))^2$
ReLU:
- $g'(z) = \begin{cases}1, & \text{if }z\ge0 \\0, & \text{if }z\lt0 \end{cases}$
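
These derivatives can be sanity-checked against a finite-difference approximation. The helper `numerical_derivative` and the test points are assumptions of this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def sigmoid_prime(z):
    g = sigmoid(z)
    return g * (1 - g)

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2

def relu_prime(z):
    return (z >= 0).astype(float)

def numerical_derivative(g, z, eps=1e-6):
    """Central finite difference, used here only as a sanity check."""
    return (g(z + eps) - g(z - eps)) / (2 * eps)

z = np.array([-1.5, -0.3, 0.7, 2.0])    # avoids z = 0, where ReLU has a kink
print(np.allclose(sigmoid_prime(z), numerical_derivative(sigmoid, z)))  # True
```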

Gradient Descents For Neural Networks

Parameters : $w^{[1]},b^{[1]},w^{[2]},b^{[2]}$

Cost Function : $J(w^{[1]},b^{[1]},w^{[2]},b^{[2]})= \frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})$

Gradient Descent:
\[ \begin{align} &\text{Repeat \{}\\ &\quad \text{compute predictions } (\hat{y}^{(i)}, i = 1,\dots,m) \\ &\quad {\rm d}w^{[1]} = \frac{\partial J}{\partial w^{[1]}}, {\rm d}b^{[1]} = \frac{\partial J}{\partial b^{[1]}},\dots\\ &\quad w^{[1]} = w^{[1]} - \alpha {\rm d}w^{[1]}\\ &\quad b^{[1]} = b^{[1]} - \alpha {\rm d}b^{[1]}\\ &\quad w^{[2]} = w^{[2]} - \alpha {\rm d}w^{[2]}\\ &\quad b^{[2]} = b^{[2]} - \alpha {\rm d}b^{[2]}\\ &\text{\}} \end{align} \]

Forward Propagation :
\[ \begin{align} Z^{[1]} &= w^{[1]}X + b^{[1]}\\ A^{[1]} &= g^{[1]}(Z^{[1]})\\ Z^{[2]} &= w^{[2]}A^{[1]} + b^{[2]}\\ A^{[2]} &= g^{[2]}(Z^{[2]}) = \sigma(Z^{[2]}) \end{align} \]
Backward Propagation :
\[ \begin{align} {\rm d}Z^{[2]} &= A^{[2]} - Y, \quad Y = \begin{bmatrix}y^{(1)} & y^{(2)} & \dots & y^{(m)}\end{bmatrix}\\ {\rm d}w^{[2]} &= \frac{1}{m} {\rm d}Z^{[2]} A^{[1]T}\\ {\rm d}b^{[2]} &= \frac{1}{m}\text{np.sum(d}Z^{[2]}\text{,axis=1,keepdims=True)}\\ {\rm d}Z^{[1]} &= w^{[2]T}{\rm d}Z^{[2]}\; .* \; g^{[1]\prime}(Z^{[1]})\\ {\rm d}w^{[1]} &= \frac{1}{m} {\rm d}Z^{[1]}X^T\\ {\rm d}b^{[1]} &= \frac{1}{m}\text{np.sum(d}Z^{[1]}\text{,axis=1,keepdims=True)}\\ \end{align} \]
Note: axis=1 means summing horizontally (along each row), and keepdims=True prevents NumPy from returning a rank 1 array. Alternatively, you can call reshape explicitly instead of passing keepdims. Here $.*$ denotes the element-wise product.

Further note: when $g^{[1]} = \tanh$, since $A^{[1]} = g^{[1]}(Z^{[1]})$ and $g^{[1]\prime}(z) = 1-a^2$, we have $g^{[1]\prime}(Z^{[1]}) = 1-(A^{[1]})^2$, i.e. ${\rm d}Z^{[1]} = w^{[2]T}{\rm d}Z^{[2]}\; .* \; (1-(A^{[1]})^2)$
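
The forward and backward formulas can be sketched for a network with one tanh hidden layer and a sigmoid output. The dictionary keys, the layer sizes, the learning rate, the small random initial weights and the toy labels are all assumptions of this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    Z1 = np.dot(params["w1"], X) + params["b1"]
    A1 = np.tanh(Z1)
    Z2 = np.dot(params["w2"], A1) + params["b2"]
    A2 = sigmoid(Z2)
    return A1, A2

def backward(X, Y, params, A1, A2):
    m = X.shape[1]
    dZ2 = A2 - Y
    dw2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(params["w2"].T, dZ2) * (1 - A1 ** 2)   # tanh'(Z1) = 1 - A1^2
    dw1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return {"w1": dw1, "b1": db1, "w2": dw2, "b2": db2}

def cost(A2, Y):
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

rng = np.random.default_rng(0)
n_x, n_h, m = 2, 4, 8
X = rng.standard_normal((n_x, m))
Y = (X[0:1, :] > 0).astype(float)        # toy labels: sign of the first feature
params = {
    "w1": rng.standard_normal((n_h, n_x)) * 0.01,
    "b1": np.zeros((n_h, 1)),
    "w2": rng.standard_normal((1, n_h)) * 0.01,
    "b2": np.zeros((1, 1)),
}

alpha = 0.5
initial_cost = cost(forward(X, params)[1], Y)
for _ in range(1000):
    A1, A2 = forward(X, params)
    grads = backward(X, Y, params, A1, A2)
    for name in params:
        params[name] = params[name] - alpha * grads[name]
final_cost = cost(forward(X, params)[1], Y)
print(initial_cost, final_cost)
```

In NumPy the element-wise product written as $.*$ in the formulas is simply `*` between arrays of the same shape.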

Random Initialization

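For a neural network, if you initialize all the weights to zero and then apply gradient descent, it won't work: every hidden unit in a layer computes the same function and receives the same gradient, so the units stay identical no matter how long you train.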
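
A common remedy is to start from small random weights. The sketch below assumes a scale factor of 0.01 and a hypothetical helper `initialize_parameters`; neither is stated in these notes:

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, scale=0.01, seed=0):
    """Small random weights break the symmetry; biases can start at zero."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.standard_normal((n_h, n_x)) * scale,
        "b1": np.zeros((n_h, 1)),
        "w2": rng.standard_normal((n_y, n_h)) * scale,
        "b2": np.zeros((n_y, 1)),
    }

params = initialize_parameters(n_x=3, n_h=4, n_y=1)

# With all-zero weights every hidden unit computes the same activation
X = np.random.default_rng(1).standard_normal((3, 5))
A1_zero = np.tanh(np.dot(np.zeros((4, 3)), X) + np.zeros((4, 1)))
A1_rand = np.tanh(np.dot(params["w1"], X) + params["b1"])

print(np.all(A1_zero == A1_zero[0]))  # True: all rows are identical
```

With random weights the rows of `A1_rand` differ, so the hidden units can learn different features.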