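Preface: to make the text easier to follow, some sentences in this translation may deviate slightly from the original, but the overall content is essentially unchanged. My own ability is limited, so corrections of any mistranslation or misunderstanding are welcome and much appreciated.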

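Softmax classification function

This example covers:
* The softmax function
* The cross-entropy loss function

The previous example showed how to use the logistic function for binary classification.

For multiclass classification, multinomial logistic regression uses a generalization of the logistic function called the softmax function. The sections below introduce the softmax function and its derivative.

First, import the Python libraries we need.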

# Python imports
import numpy as np # Matrix and vector computation package
import matplotlib.pyplot as plt  # Plotting library
from matplotlib.colors import colorConverter, ListedColormap # some plotting functions
from mpl_toolkits.mplot3d import Axes3D  # 3D plots
from matplotlib import cm # Colormaps
# Allow matplotlib to plot inside this notebook
%matplotlib inline

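1. Softmax function

The logistic function introduced in the previous example can only handle binary classification, where $t=1$ or $t=0$. We now introduce its generalization, the softmax function, which outputs a predicted probability for each class in a multiclass problem. The softmax function $\varsigma$ takes a $C$-dimensional vector $\mathbf{z}$ as input and outputs a $C$-dimensional vector $\mathbf{y}$ whose elements are real values between $0$ and $1$. It is a normalized exponential, defined as: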

$y_c = \varsigma(\mathbf{z})_c = \frac{e^{z_c}}{\sum_{d=1}^C e^{z_d}} \quad \text{for} \; c = 1 \cdots C$

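Here the denominator $\sum_{d=1}^C e^{z_d}$ acts as a normalizer that ensures the class probabilities sum to 1: $\sum_{c=1}^C y_c = 1$.
In a neural network the softmax function is typically used as the final output layer, so it can be depicted as a layer with $C$ neurons.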
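
As a quick numerical check of the definition, here is a minimal sketch that evaluates the softmax on a small vector. The helper name `softmax_stable` and the example input are illustrative assumptions, not part of the original notebook. Subtracting the maximum before exponentiating is a common safeguard against overflow; the common factor cancels in the ratio, so the result is the same as the plain definition.

```python
import numpy as np

def softmax_stable(z):
    """Softmax of a 1-D array; shifting by max(z) avoids overflow in exp."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))  # the shift cancels in the ratio below
    return e / np.sum(e)

z = np.array([1.0, 2.0, 3.0])  # hypothetical input vector
y = softmax_stable(z)
print(y)        # each entry lies between 0 and 1
print(y.sum())  # sums to 1 up to floating-point rounding
```

The largest input receives the largest probability, and very large inputs such as `[1000.0, 1000.0]` still give finite outputs because of the shift.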

For a given input $\mathbf{z}$, the probability of each class is:

$\begin{bmatrix} P(t=1 | \mathbf{z}) \\ \vdots \\ P(t=C | \mathbf{z}) \\ \end{bmatrix} = \begin{bmatrix} \varsigma(\mathbf{z})_1 \\ \vdots \\ \varsigma(\mathbf{z})_C \\ \end{bmatrix} = \frac{1}{\sum_{d=1}^C e^{z_d}} \begin{bmatrix} e^{z_1} \\ \vdots \\ e^{z_C} \\ \end{bmatrix}$

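For a given input $\mathbf{z}$, $P(t=c | \mathbf{z})$ is the predicted probability that the sample belongs to class $c$.

Take a two-class problem as an example: for the input $\mathbf{z} = [z_1, z_2]$, the predicted probability of class 1, $P(t=1|\mathbf{z})$, is shown in the figure below. The predicted probability of the other class, $P(t=2|\mathbf{z})$, is its complement: $P(t=2|\mathbf{z}) = 1 - P(t=1|\mathbf{z})$.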

# Define the softmax function
def softmax(z):
    return np.exp(z) / np.sum(np.exp(z))

# Plot the softmax output for 2 dimensions for both classes
# Plot the output as a function of the inputs z_1 and z_2
# Define a vector of input values for which we want to plot the output
nb_of_zs = 200
zs = np.linspace(-10, 10, num=nb_of_zs) # input
zs_1, zs_2 = np.meshgrid(zs, zs) # generate grid
y = np.zeros((nb_of_zs, nb_of_zs, 2)) # initialize output
# Fill the output matrix for each combination of input z's
for i in range(nb_of_zs):
    for j in range(nb_of_zs):
        y[i,j,:] = softmax(np.asarray([zs_1[i,j], zs_2[i,j]]))
# Plot the softmax output surface for class t=1
fig = plt.figure()
# Recent Matplotlib releases no longer accept projection= in fig.gca();
# add_subplot creates the 3D axes in both old and new versions
ax = fig.add_subplot(111, projection='3d')
surf = ax.plot_surface(zs_1, zs_2, y[:,:,0], linewidth=0, cmap=cm.coolwarm)
ax.view_init(elev=30, azim=70)
cbar = fig.colorbar(surf, ax=ax)
ax.set_xlabel('$z_1$', fontsize=15)
ax.set_ylabel('$z_2$', fontsize=15)
ax.set_zlabel('$y_1$', fontsize=15)
# Raw strings keep the LaTeX backslashes from being read as escape sequences
ax.set_title(r'$P(t=1|\mathbf{z})$')
cbar.ax.set_ylabel(r'$P(t=1|\mathbf{z})$', fontsize=15)
plt.grid()
plt.show()

[Figure: surface plot of $P(t=1|\mathbf{z})$ as a function of $z_1$ and $z_2$]
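
The two-class case can also be checked numerically. The sketch below assumes a small helper `logistic` and a made-up input; neither comes from the original notebook. It relies on the identity $e^{z_1}/(e^{z_1}+e^{z_2}) = 1/(1+e^{-(z_1-z_2)})$, so the class-1 probability depends only on the difference $z_1 - z_2$.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D array, shifted by max(z) for numerical safety."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def logistic(x):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([0.5, -1.5])  # hypothetical two-class input
y = softmax(z)

# Class-1 probability equals the logistic function of z_1 - z_2
matches_logistic = np.isclose(y[0], logistic(z[0] - z[1]))
# The two class probabilities are complementary
is_complement = np.isclose(y[1], 1.0 - y[0])
print(matches_logistic, is_complement)
```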

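2. Derivative of the softmax function

To use the softmax function in a neural network, we need its derivative. Define $\Sigma_C = \sum_{d=1}^C e^{z_d}$, so that the predicted probability of class $c$ is $y_c = e^{z_c} / \Sigma_C$ for $c = 1 \cdots C$. The derivative ${\partial y_i}/{\partial z_j}$ of the output $\mathbf{y}$ with respect to the input $\mathbf{z}$ is then: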

$\begin{split} \text{if} \; i = j :& \frac{\partial y_i}{\partial z_i} = \frac{\partial \frac{e^{z_i}}{\Sigma_C}}{\partial z_i} = \frac{e^{z_i}\Sigma_C - e^{z_i}e^{z_i}}{\Sigma_C^2} = \frac{e^{z_i}}{\Sigma_C}\frac{\Sigma_C - e^{z_i}}{\Sigma_C} = \frac{e^{z_i}}{\Sigma_C}(1-\frac{e^{z_i}}{\Sigma_C}) = y_i (1 - y_i)\\ \text{if} \; i \neq j :& \frac{\partial y_i}{\partial z_j} = \frac{\partial \frac{e^{z_i}}{\Sigma_C}}{\partial z_j} = \frac{0 - e^{z_i}e^{z_j}}{\Sigma_C^2} = -\frac{e^{z_i}}{\Sigma_C} \frac{e^{z_j}}{\Sigma_C} = -y_i y_j \end{split}$

Note that for $i = j$ the derivative has the same form, $y_i (1 - y_i)$, as the derivative of the logistic function.
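
The two cases can be collected into a single Jacobian matrix with entries $y_i(\delta_{ij} - y_j)$, that is, $\mathrm{diag}(\mathbf{y}) - \mathbf{y}\mathbf{y}^T$. The sketch below compares that closed form against central finite differences. The function names, the step size `eps`, and the test vector are assumptions made for this illustration.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D array, shifted by max(z) for numerical safety."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def softmax_jacobian(z):
    """Analytic Jacobian: J[i, j] = dy_i/dz_j = y_i * (delta_ij - y_j)."""
    y = softmax(z)
    return np.diag(y) - np.outer(y, y)

def numerical_jacobian(f, z, eps=1e-6):
    """Central finite-difference approximation of the Jacobian of f at z."""
    z = np.asarray(z, dtype=float)
    J = np.zeros((z.size, z.size))
    for j in range(z.size):
        dz = np.zeros_like(z)
        dz[j] = eps
        J[:, j] = (f(z + dz) - f(z - dz)) / (2 * eps)
    return J

z = np.array([0.2, -1.0, 0.7])  # hypothetical input vector
J = softmax_jacobian(z)
J_num = numerical_jacobian(softmax, z)
print(np.max(np.abs(J - J_num)))  # should be close to zero
```

The diagonal of `J` holds the $i = j$ case and the off-diagonal entries hold the $i \neq j$ case. Each column sums to zero because the outputs always sum to 1.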

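3. Cross-entropy loss function for the softmax function

We start from the likelihood function. As with the loss function for logistic regression, the likelihood expresses how likely it is that the model, with a given parameter set $\theta$, predicts the correct class of each input sample. Maximizing the likelihood can be written as: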

$\underset{\theta}{\text{argmax}}\; \mathcal{L}(\theta|t,z) = \underset{\theta}{\text{argmax}} \prod_{i=1}^{n} \mathcal{L}(\theta|t_i,z_i)$

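By the definition of the likelihood, $\mathcal{L}(\theta|t,z)$ can be written as a joint probability: the probability $P(t,z|\theta)$ that the model generates $t$ and $z$ given the parameters $\theta$. Since $P(A,B) = P(A|B)P(B)$, this can be written as: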

$P(\mathbf{t},\mathbf{z}|\theta) = P(\mathbf{t}|\mathbf{z},\theta)P(\mathbf{z}|\theta)$

Since we are not interested in the probability of $\mathbf{z}$, this reduces to
$\mathcal{L}(\theta|\mathbf{t},\mathbf{z}) = P(\mathbf{t}|\mathbf{z},\theta)$, which for a fixed $\theta$ can be written as $P(\mathbf{t}|\mathbf{z})$.

Since each component $t_c$ depends on the whole input $\mathbf{z}$, and exactly one class in $\mathbf{t}$ is active, this probability can be written as:

$P(\mathbf{t}|\mathbf{z}) = \prod_{c=1}^{C} P(t_c|\mathbf{z})^{t_c} = \prod_{c=1}^{C} \varsigma(\mathbf{z})_c^{t_c} = \prod_{c=1}^{C} y_c^{t_c}$

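Translator's note: the original does not spell this out, but I take the author's remark that only one class is active to mean that $\mathbf{t}$ is a one-hot vector: only the element for the true class is 1, and all other elements are 0.

As noted earlier for the logistic function, maximizing the likelihood is equivalent to minimizing the negative log-likelihood: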

$- \log \mathcal{L}(\theta|\mathbf{t},\mathbf{z}) = \xi(\mathbf{t},\mathbf{z}) = - \log \prod_{c=1}^{C} y_c^{t_c} = - \sum_{c=1}^{C} t_c \cdot \log(y_c)$

Translator's note: the cross-entropy formula in the original differs from the one I wrote above. The original reads:
$- \log \mathcal{L}(\theta|\mathbf{t},\mathbf{z}) = \xi(\mathbf{t},\mathbf{z}) = - \log \prod_{i=c}^{C} y_c^{t_c} = - \sum_{i=c}^{C} t_c \cdot \log(y_c)$
I do not understand what starting the index at $i=c$ would mean. I believe the class index $c$ should run from 1 to $C$, as written above; I am not sure whether this is a typo by the author or a misunderstanding on my part.

This is the cross-entropy loss function $\xi$.
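
For a one-hot target, the product $\prod_c y_c^{t_c}$ keeps only the probability of the true class, so the cross-entropy is just the negative log of that one probability. A minimal sketch with made-up values for `y` and `t` (the function name `cross_entropy` is an assumption for this illustration):

```python
import numpy as np

def cross_entropy(t, y):
    """Cross-entropy -sum_c t_c * log(y_c) for one sample."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    return -np.sum(t * np.log(y))

y = np.array([0.1, 0.7, 0.2])  # hypothetical softmax output
t = np.array([0.0, 1.0, 0.0])  # one-hot target: the true class is the second one

likelihood = np.prod(y ** t)   # only y[1] survives, since y_c ** 0 == 1
loss = cross_entropy(t, y)
print(likelihood, loss)        # loss equals -log(likelihood)
```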

For the two-class case, where $t_2 = 1 - t_1$, this reduces to the loss function used earlier for logistic regression:

$\xi(\mathbf{t},\mathbf{y}) = - t_c \log(y_c) - (1-t_c) \log(1-y_c)$
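
The reduction to the two-class form can be confirmed with a short check. The sketch assumes a single predicted probability `y1` for class 1 and tries both possible targets; all names and values are illustrative.

```python
import numpy as np

def cross_entropy(t, y):
    """General cross-entropy -sum_c t_c * log(y_c) for one sample."""
    return -np.sum(np.asarray(t) * np.log(np.asarray(y)))

def logistic_loss(t1, y1):
    """Two-class form -t*log(y) - (1-t)*log(1-y)."""
    return -t1 * np.log(y1) - (1.0 - t1) * np.log(1.0 - y1)

y1 = 0.8  # hypothetical predicted probability of class 1
results = []
for t1 in (0.0, 1.0):
    # Build the two-element vectors using t_2 = 1 - t_1 and y_2 = 1 - y_1
    general = cross_entropy([t1, 1.0 - t1], [y1, 1.0 - y1])
    binary = logistic_loss(t1, y1)
    results.append((general, binary))
    print(t1, general, binary)
```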

For a batch of $n$ samples, the cross-entropy loss is:

$\xi(T,Y) = \sum_{i=1}^n \xi(\mathbf{t}_i,\mathbf{y}_i) = - \sum_{i=1}^n \sum_{c=1}^{C} t_{ic} \cdot \log( y_{ic})$

Here $t_{ic}$ is 1 if sample $i$ belongs to class $c$ and 0 otherwise, and $y_{ic}$ is the model's output probability that input sample $i$ belongs to class $c$.
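
In code, the batch loss is a sum over a matrix with one row per sample. The sketch below assumes row-wise softmax outputs `Y` and one-hot targets `T`; the helper names and the data are made up for illustration.

```python
import numpy as np

def softmax_rows(Z):
    """Apply the softmax to each row of a 2-D array."""
    Z = np.asarray(Z, dtype=float)
    E = np.exp(Z - Z.max(axis=1, keepdims=True))  # shift each row for stability
    return E / E.sum(axis=1, keepdims=True)

def batch_cross_entropy(T, Y):
    """Sum of -t_ic * log(y_ic) over all samples i and classes c."""
    return -np.sum(T * np.log(Y))

Z = np.array([[2.0, 1.0, 0.1],
              [0.5, 2.5, 0.3]])  # hypothetical inputs, one row per sample
T = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])  # one-hot targets
Y = softmax_rows(Z)

total = batch_cross_entropy(T, Y)
# Equivalent view: pick the probability of the true class in each row
per_sample = -np.log(Y[np.arange(len(T)), T.argmax(axis=1)])
print(total, per_sample.sum())
```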

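4. Derivative of the cross-entropy loss function for the softmax function

The derivative ${\partial \xi}/{\partial z_i}$ of the loss with respect to the softmax input $z_i$ is derived as follows:

Translator's note: mind the subscripts. Above, $i$ indexed the sample and $c$ the class. Here both $i$ and $j$ index classes: $i$ is the component of the input $\mathbf{z}$ we differentiate with respect to, and $j$ runs over the output classes.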

$\begin{split} \frac{\partial \xi}{\partial z_i} & = - \sum_{j=1}^C \frac{\partial t_j \log(y_j)}{\partial z_i} = - \sum_{j=1}^C t_j \frac{\partial \log(y_j)}{\partial z_i} = - \sum_{j=1}^C t_j \frac{1}{y_j} \frac{\partial y_j}{\partial z_i} \\ & = - \frac{t_i}{y_i} \frac{\partial y_i}{\partial z_i} - \sum_{j \neq i}^C \frac{t_j}{y_j} \frac{\partial y_j}{\partial z_i} = - \frac{t_i}{y_i} y_i (1-y_i) - \sum_{j \neq i}^C \frac{t_j}{y_j} (-y_j y_i) \\ & = - t_i + t_i y_i + \sum_{j \neq i}^C t_j y_i = - t_i + \sum_{j = 1}^C t_j y_i = -t_i + y_i \sum_{j = 1}^C t_j \\ & = y_i - t_i \end{split}$

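Note that ${\partial y_j}/{\partial z_i}$ was already derived above for $i=j$ and $i \neq j$. The last step uses $\sum_{j=1}^C t_j = 1$, which holds because $\mathbf{t}$ is one-hot.

The result shows that, as in logistic regression, the derivative of the cross-entropy loss has the same form for every class $i \in C$: ${\partial \xi}/{\partial z_i} = y_i - t_i$.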
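
The result $y_i - t_i$ is easy to verify numerically. The sketch below compares it with central finite differences of the loss taken with respect to $\mathbf{z}$. The function names, the step size `eps`, and the example values are assumptions made for this check.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D array, shifted by max(z) for numerical safety."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def loss(t, z):
    """Cross-entropy of softmax(z) against the target t."""
    return -np.sum(t * np.log(softmax(z)))

z = np.array([0.3, -0.8, 1.2])  # hypothetical softmax input
t = np.array([0.0, 0.0, 1.0])   # one-hot target

analytic = softmax(z) - t       # the derived gradient y - t

# Central finite differences, one input component at a time
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.size):
    dz = np.zeros_like(z)
    dz[i] = eps
    numeric[i] = (loss(t, z + dz) - loss(t, z - dz)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be close to zero
```

The gradient components sum to zero, since both $\mathbf{y}$ and $\mathbf{t}$ sum to 1.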

This post at peterroelants.github.io is generated from an IPython notebook file. Link to the full IPython notebook file
