Machine Learning (3): Generalized Linear Models

Copyright notice: https://blog.csdn.net/qq_26386707/article/details/79406364



Chenjing Ding
2018/02/28


| notation | meaning |
|---|---|
| $M$ | dimensionality of the feature space $\phi$ |
| $K$ | the number of classes |
| $N$ | the number of training samples |
| $W$ | the weight matrix |
| $w_k$ | the weight vector of class $C_k$ |
| $w_{kj}^{(\tau+1)}$ | the new weight of the $j$-th entry of $w_k$ |
| $\phi_n$ | $\phi_n = \phi(x_n)$ |
| $\phi_{ni}$ | $\phi_{ni} = \phi_i(x_n)$, the $i$-th entry of $\phi_n$ |
| $y(\phi_n)$ | column vector, the discriminant function vector over all classes |
| $y_k(\phi_n)$ | the $k$-th entry of $y(\phi_n)$, the discriminant function for class $k$ |

1. Linear Model and Generalized Linear Model

1.1 Definition

linear model:

$$y_k(x) = w_k^T x$$

$$Y(x_n) = [y_1(x_n),\, y_2(x_n),\, \dots,\, y_K(x_n)]^T = W^T x_n$$

$$\hat{Y}(X) = [Y(x_1),\, Y(x_2),\, \dots,\, Y(x_N)]^T = X W$$
See Machine Learning (3): Least-Squares Classification for more details.

generalized model:

$$y_k(x) = g(w_k^T x)$$

$g$ is the activation function:

$$g(a) = \frac{1}{1 + e^{-a}} \qquad \text{(logistic regression)}$$

$$g_k(x) = \frac{e^{x_k}}{\sum_{j=1}^K e^{x_j}} \qquad \text{(softmax regression)}$$
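The two activation functions above can be sketched in NumPy (a minimal illustration; the function names are ours):

```python
import numpy as np

def logistic(a):
    """Logistic sigmoid g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(x):
    """Softmax g_k(x) = exp(x_k) / sum_j exp(x_j)."""
    z = x - np.max(x)   # shifting by the max leaves the result unchanged, avoids overflow
    e = np.exp(z)
    return e / e.sum()

print(logistic(0.0))                    # 0.5
print(softmax(np.array([1.0, 1.0])))    # [0.5 0.5]
```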

1.2 Nonlinear Basis Functions

If g is monotonic (which is typically the case), the resulting decision boundaries are still linear in the input space x. To obtain nonlinear boundaries, transform the vector x with D nonlinear basis functions $\phi_i(x)$:

$$y_k(x_n) = g\big(w_k^T \phi(x_n)\big) = g\Big(\sum_{i=0}^{D} w_{ki}\,\phi_i(x_n)\Big), \qquad \phi_0(x_n) = 1$$

Advantages:
Nonlinear decision boundaries become possible, and by choosing the right $\phi_i(x)$, every continuous function can (in principle) be approximated with arbitrary accuracy.
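As a small sketch of such a basis transformation (Gaussian bumps are just one illustrative choice; `gaussian_basis`, its centers, and its width are ours):

```python
import numpy as np

def gaussian_basis(x, centers, s=1.0):
    """Map scalar inputs x to [1, phi_1(x), ..., phi_D(x)] using Gaussian bumps."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    phi = np.exp(-(x - centers) ** 2 / (2 * s ** 2))   # (N, D) basis responses
    return np.hstack([np.ones((len(x), 1)), phi])      # prepend phi_0 = 1 bias column

Phi = gaussian_basis([0.0, 0.5, 1.0], centers=np.array([0.0, 1.0]))
print(Phi.shape)   # (3, 3): bias column plus 2 basis functions
```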

1.3 Why Use a Generalized Linear Model Instead of a Linear Model

  • Can be used to limit the effect of outliers.
    In the linear model, $y_k(x_n)$ can grow arbitrarily large for some $x_n$. As a result, "too correct" points (far from the decision boundary but classified correctly) have a strong influence on the boundary (see ML(3) Least-squares classification).
    By choosing a suitable nonlinear activation function (e.g., the sigmoid in section 2.1), we can limit this influence, because the sigmoid output lies in (0, 1).

  • The choice of a sigmoid leads to a nice probabilistic interpretation (section 2.1).

However, least-squares minimization then in general no longer leads to a closed-form analytical solution, so we need gradient descent to update the weights (section 3).

2. How to obtain a generalized linear model

2.1 Logistic Sigmoid Activation Function

A nice probabilistic interpretation:
Consider 2 classes:

$$P(C_1|x) = \frac{p(x|C_1)P(C_1)}{p(x|C_1)P(C_1) + p(x|C_2)P(C_2)} = \frac{1}{1 + \frac{p(x|C_2)P(C_2)}{p(x|C_1)P(C_1)}} = \frac{1}{1 + e^{-a}}$$

with

$$a = \ln\frac{p(x|C_1)P(C_1)}{p(x|C_2)P(C_2)}, \qquad P(C_1|x) = g(a) = \frac{1}{1 + e^{-a}}$$

Thus, using the function g(a), the model output has a direct probabilistic interpretation.

logistic function:
$g(a)$ is the logistic function, also written $\sigma(a)$. Some properties of this function:

$$\sigma(-a) = 1 - \sigma(a), \qquad \frac{d\sigma}{da} = \sigma\,(1 - \sigma)$$
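Both properties are easy to verify numerically (a quick sanity check of our own, using a finite difference for the derivative):

```python
import numpy as np

def sigma(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

a = 0.7
# Property 1: sigma(-a) = 1 - sigma(a)
assert np.isclose(sigma(-a), 1.0 - sigma(a))

# Property 2: d(sigma)/da = sigma(a) * (1 - sigma(a)), checked by finite difference
h = 1e-6
numeric = (sigma(a + h) - sigma(a - h)) / (2 * h)
assert np.isclose(numeric, sigma(a) * (1.0 - sigma(a)), atol=1e-8)
```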

logistic regression:
In the following, we will consider models of the form

$$p(C_1|\phi) = y_1(\phi) = \sigma(a) = \sigma\big(w_1^T \phi(x)\big), \qquad p(C_2|\phi) = 1 - p(C_1|\phi)$$

This model is called logistic regression.

Why use logistic regression:
Because fewer parameters need to be updated.
Assume an M-dimensional feature space $\phi$. If we instead model $p(x|C_1)$ and $p(x|C_2)$ as Gaussians, we need:

  1. means: $2M$ ($M$ for $p(x|C_1)$, $M$ for $p(x|C_2)$);
  2. covariance: $\frac{M(M+1)}{2}$ (assuming $p(x|C_1)$ and $p(x|C_2)$ share the same covariance matrix);
  3. prior probability: $1$ (since $p(C_2) = 1 - p(C_1)$).

Thus in total, the Gaussian representation requires $2M + \frac{M(M+1)}{2} + 1 = \frac{M(M+5)}{2} + 1$ parameters to update. In logistic regression, we only need to update the $M$ parameters of $p(C_1|\phi)$, because $p(C_2|\phi) = 1 - p(C_1|\phi)$.

2.2 Normalized Exponential

For $t_n \in \{1, 2, \dots, K\}$:

$$P(C_k|x_n) = \frac{p(x_n|C_k)P(C_k)}{\sum_{i=1}^K p(x_n|C_i)P(C_i)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \qquad a_j = \ln p(x_n|C_j)P(C_j) \tag{2.2.1}$$

Formula 2.2.1 is known as the softmax function, or normalized exponential.

$$y(x_n; W) = \begin{bmatrix} P(y{=}1|W, x_n) \\ P(y{=}2|W, x_n) \\ \vdots \\ P(y{=}K|W, x_n) \end{bmatrix} = \frac{1}{\sum_{k=1}^K \exp(w_k^T x_n)} \begin{bmatrix} \exp(w_1^T x_n) \\ \exp(w_2^T x_n) \\ \vdots \\ \exp(w_K^T x_n) \end{bmatrix} \tag{2.2.2}$$
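Formula 2.2.2 amounts to applying a softmax to the activations $W^T x_n$; a minimal sketch (the toy weights and input are ours):

```python
import numpy as np

def class_posteriors(W, x):
    """y(x; W) = softmax(W^T x): one posterior per class (formula 2.2.2)."""
    a = W.T @ x                 # activations a_k = w_k^T x, shape (K,)
    a = a - a.max()             # shift for numerical stability; result unchanged
    e = np.exp(a)
    return e / e.sum()

W = np.eye(2)                   # toy weight matrix: 2 features, 2 classes
y = class_posteriors(W, np.array([2.0, 0.0]))
print(np.round(y, 3))           # [0.881 0.119]
```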

3. Gradient descent to update the weights

3.1 Steps of Gradient Descent

Gradient descent is iterative minimization.
Step 1: start with an initial guess $w_{kj}^{(0)}$ for the parameter values.
Step 2: move towards a (local) minimum by following the negative gradient:

$$w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta\,\frac{\partial E(W)}{\partial w_{kj}}\bigg|_{W^{(\tau)}} \tag{3.1}$$

Formula (3.1) corresponds to a first-order Taylor expansion. If you are interested in why gradient descent corresponds to a first-order Taylor expansion, read on.


Using the first-order Taylor expansion of $E(W)$ at $W^{(\tau)}$, with the update direction $\Delta = \frac{\partial E(W^{(\tau)})}{\partial W^{(\tau)}}$:

$$E\big(W^{(\tau)} - \eta\Delta\big) \approx E\big(W^{(\tau)}\big) + \Delta^T(-\eta\Delta) = E\big(W^{(\tau)}\big) - \eta\,\|\Delta\|^2 < E\big(W^{(\tau)}\big)$$

since $\eta > 0$.

Thus updating $W$ this way decreases $E(W)$ and leads towards a (local) minimum.
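The update rule can be sketched on a toy quadratic error (everything here, including the choice of $\eta$ and the target $w^*$, is illustrative):

```python
import numpy as np

# Minimal gradient-descent loop on the toy error E(w) = ||w - w*||^2,
# illustrating w^(tau+1) = w^(tau) - eta * dE/dw.
w_star = np.array([1.0, -2.0])
grad = lambda w: 2.0 * (w - w_star)   # dE/dw for this quadratic

w = np.zeros(2)
eta = 0.1
for _ in range(200):
    w = w - eta * grad(w)
print(np.round(w, 4))   # converges to w_star = [ 1. -2.]
```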


3.2 Batch Learning

Process the full data set at once to compute the gradient:

$$E(W) = \sum_{n=1}^N E_n(W), \qquad w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta\,\frac{\partial E(W)}{\partial w_{kj}}\bigg|_{W^{(\tau)}}$$

3.3 Stochastic Learning / Sequential Updating

Choose a single training sample $x_n$ to obtain $E_n(W)$:

$$w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta\,\frac{\partial E_n(W)}{\partial w_{kj}}\bigg|_{W^{(\tau)}}$$
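A minimal sketch of sequential updating on a noiseless toy least-squares problem (the data, step size, and epoch count are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([0.5, -1.0, 2.0])
t = X @ w_true                       # noiseless targets

# Sequential updating: one sample n per step,
# w <- w - eta * dEn/dw with En = 0.5 * (t_n - w^T x_n)^2.
w = np.zeros(3)
eta = 0.05
for _ in range(50):                  # 50 passes over the shuffled data
    for n in rng.permutation(len(X)):
        err = X[n] @ w - t[n]
        w -= eta * err * X[n]
print(np.round(w, 3))                # approaches w_true
```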

3.4 Delta Rule / LMS Rule

The Delta/LMS rule is based on the least-squares error.
Error function (least-squares error) of the linear model:

$$E(W) = \frac{1}{2}\sum_{n=1}^N\sum_{k=1}^K \big(t_{kn} - y_k(x_n)\big)^2 = \frac{1}{2}\sum_{n=1}^N\sum_{k=1}^K \Big(t_{kn} - \sum_{j=1}^M w_{kj}\,\phi_j(x_n)\Big)^2$$

$$E_n(W) = \frac{1}{2}\sum_{k=1}^K \big(t_{kn} - y_k(x_n)\big)^2 = \frac{1}{2}\sum_{k=1}^K \Big(t_{kn} - \sum_{j=1}^M w_{kj}\,\phi_j(x_n)\Big)^2$$

$$\frac{\partial E_n(W)}{\partial w_{\hat{k}j}} = \big(y_{\hat{k}}(x_n) - t_{\hat{k}n}\big)\,\phi_j(x_n)$$

$$w_{\hat{k}j}^{(\tau+1)} = w_{\hat{k}j}^{(\tau)} - \eta\,\big(y_{\hat{k}}(x_n) - t_{\hat{k}n}\big)\,\phi_j(x_n) \tag{3.4.1}$$

Cases with a differentiable, nonlinear activation function $g$:

$$E_n(W) = \frac{1}{2}\sum_{k=1}^K \Big(t_{kn} - g\Big(\sum_{j=1}^M w_{kj}\,\phi_j(x_n)\Big)\Big)^2$$

$$\frac{\partial E_n(W)}{\partial w_{\hat{k}j}} = g'(a_{\hat{k}n})\,\big(y_{\hat{k}}(x_n) - t_{\hat{k}n}\big)\,\phi_j(x_n), \qquad a_{\hat{k}n} = \sum_{j=1}^M w_{\hat{k}j}\,\phi_j(x_n) \tag{3.4.2}$$

Both formulas 3.4.1 and 3.4.2 are called the Delta/LMS rule.
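One Delta/LMS update for the nonlinear case (formula 3.4.2), using a sigmoid as the differentiable activation; the function name and toy inputs are ours:

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def delta_rule_step(W, phi_n, t_n, eta=0.1):
    """One step of W <- W - eta * g'(a) * (y - t) * phi_n^T  (formula 3.4.2)."""
    a = W @ phi_n                                 # activations, shape (K,)
    y = sigma(a)                                  # outputs y_k(x_n)
    g_prime = y * (1.0 - y)                       # sigmoid derivative g'(a)
    grad = np.outer(g_prime * (y - t_n), phi_n)   # (K, M) per-sample gradient
    return W - eta * grad

W = delta_rule_step(np.zeros((2, 3)), np.array([1.0, 0.5, -0.5]), np.array([1.0, 0.0]))
print(W.shape)   # (2, 3)
```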

3.5 Logistic Regression

3.5.1 Gradient Descent (1st order)

Let’s consider a data set $(\phi_n, t_n)$ with $n = 1, \dots, N$, where $\phi_n = \phi(x_n)$, $t_n \in \{0, 1\}$, and $t = (t_1, t_2, \dots, t_N)^T$.

With $y_n = p(C_1|\phi_n)$, we can write the likelihood as

$$p(t|w) = \prod_{n=1}^N y_n^{t_n}\,(1 - y_n)^{1 - t_n}$$

Since $y_n$ depends only on the parameter vector $w$, and the second output is simply $y_{C_2}(x_n) = 1 - y_n$, a single weight vector $w$ suffices; we need not use a capital weight matrix $W$.
Define the error function as the negative log-likelihood:

$$E(w) = -\ln p(t|w) = -\sum_{n=1}^N \big[t_n \ln y_n + (1 - t_n)\ln(1 - y_n)\big] \tag{3.5.1}$$

Formula 3.5.1 is the so-called cross-entropy error function.
With $y_n = \sigma(w^T\phi(x_n))$, we have $\frac{\partial y_n}{\partial w} = y_n(1 - y_n)\,\phi(x_n)$, so

$$\frac{\partial E(w)}{\partial w} = -\sum_{n=1}^N \bigg[\frac{t_n}{y_n} - \frac{1 - t_n}{1 - y_n}\bigg]\frac{\partial y_n}{\partial w} = -\sum_{n=1}^N \big[t_n(1 - y_n) - (1 - t_n)\,y_n\big]\,\phi(x_n)$$

$$= \sum_{n=1}^N (y_n - t_n)\,\phi(x_n) \tag{3.5.2}$$

Formula 3.5.2 is the gradient for logistic regression; it has the same form as the delta rule (formula 3.4.1). We can use sequential updating:

$$w^{(\tau+1)} = w^{(\tau)} - \eta\,\frac{\partial E_n(w^{(\tau)})}{\partial w^{(\tau)}} = w^{(\tau)} - \eta\,(y_n - t_n)\,\phi(x_n)$$

Disadvantage:
Gradient descent is relatively slow to converge, so the Newton-Raphson method (2nd order) is introduced.
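The sequential update above can be sketched on toy 1-D data (the data, step size, and epoch count are ours):

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

# Sequential updating for logistic regression: w <- w - eta * (y_n - t_n) * phi_n.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=200)
t = (x > 0).astype(float)                      # true boundary at x = 0
Phi = np.column_stack([np.ones_like(x), x])    # phi_0 = 1 (bias), phi_1 = x

w = np.zeros(2)
eta = 0.1
for _ in range(100):
    for n in rng.permutation(len(x)):
        y_n = sigma(w @ Phi[n])
        w -= eta * (y_n - t[n]) * Phi[n]

accuracy = ((sigma(Phi @ w) > 0.5) == (t > 0.5)).mean()
print(accuracy)   # training accuracy close to 1.0
```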

3.5.2 Newton-Raphson (2nd order)

$$w^{(\tau+1)} = w^{(\tau)} - \eta\,H^{-1}\,\frac{\partial E(w^{(\tau)})}{\partial w^{(\tau)}}$$

$H$ is the Hessian matrix (the matrix of second derivatives). According to 3.5.2:

$$\frac{\partial E(w)}{\partial w} = \sum_{n=1}^N (y_n - t_n)\,\phi(x_n) = \Phi^T(y - t)$$

$$H = \frac{\partial}{\partial w}\,\Phi^T(y - t) = \Phi^T\,\frac{\partial y}{\partial w} = \Phi^T R\,\Phi$$

$$w^{(\tau+1)} = w^{(\tau)} - \eta\,(\Phi^T R\,\Phi)^{-1}\,\Phi^T(y - t)$$

$R$ is an $N \times N$ diagonal matrix with $R_{nn} = y_n(1 - y_n)$, $\Phi$ is the $N \times M$ design matrix whose $n$-th row is $\phi_n^T$, and $y$ and $t$ are column vectors.
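A sketch of this Newton-Raphson update with $\eta = 1$; the toy data, the deliberately flipped labels (to keep the data non-separable), and the tiny ridge term (added only to keep $H$ invertible) are ours:

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def newton_step(w, Phi, t, ridge=1e-8):
    """One step: w <- w - (Phi^T R Phi)^(-1) Phi^T (y - t), with R_nn = y_n (1 - y_n)."""
    y = sigma(Phi @ w)
    R = np.diag(y * (1.0 - y))
    H = Phi.T @ R @ Phi + ridge * np.eye(len(w))   # tiny ridge keeps H invertible
    return w - np.linalg.solve(H, Phi.T @ (y - t))

x = np.linspace(-3, 3, 100)
t = (x > 0).astype(float)
t[48], t[51] = 1.0, 0.0                # flip two labels near 0: data not separable
Phi = np.column_stack([np.ones_like(x), x])

w = np.zeros(2)
for _ in range(10):
    w = newton_step(w, Phi, t)
print(np.round(w, 2))                  # finite weights with a positive slope
```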

Logistic regression only handles 2 classes; for multiple classes, softmax regression is introduced.

3.6 Softmax Regression

For $t_n \in \{1, 2, \dots, K\}$:

$$y(x_n; W) = \begin{bmatrix} y_1(x_n) \\ y_2(x_n) \\ \vdots \\ y_K(x_n) \end{bmatrix} = \frac{1}{\sum_{k=1}^K \exp(w_k^T x_n)} \begin{bmatrix} \exp(w_1^T x_n) \\ \exp(w_2^T x_n) \\ \vdots \\ \exp(w_K^T x_n) \end{bmatrix}$$

cross-entropy error function:

$$E(W) = -\sum_{n=1}^N\sum_{k=1}^K I(t_n = k)\,\ln y_k(x_n; W) = -\sum_{n=1}^N\sum_{k=1}^K I(t_n = k)\,\ln \frac{\exp(w_k^T x_n)}{\sum_{i=1}^K \exp(w_i^T x_n)}$$

To obtain the first-order gradient, write

$$y_{kn} = y_k(W^T x_n) = \frac{\exp(w_k^T x_n)}{\sum_{i=1}^K \exp(w_i^T x_n)}$$

$$\frac{\partial y_k(\hat{x})}{\partial \hat{x}_k} = \frac{e^{\hat{x}_k}\sum_i e^{\hat{x}_i} - e^{\hat{x}_k}\,e^{\hat{x}_k}}{\big(\sum_i e^{\hat{x}_i}\big)^2} = y_k(\hat{x}) - y_k(\hat{x})^2 \tag{3.6.1}$$

$$\frac{\partial y_k(\hat{x})}{\partial \hat{x}_j} = \frac{0 - e^{\hat{x}_j}\,e^{\hat{x}_k}}{\big(\sum_i e^{\hat{x}_i}\big)^2} = -\,y_k(\hat{x})\,y_j(\hat{x}) \qquad (j \neq k) \tag{3.6.2}$$

According to 3.6.1 and 3.6.2:

$$\frac{\partial y_k(W^T x_n)}{\partial(w_k^T x_n)} = y_k(W^T x_n) - y_k(W^T x_n)^2, \qquad \frac{\partial y_k(W^T x_n)}{\partial(w_j^T x_n)} = -\,y_k(W^T x_n)\,y_j(W^T x_n)$$

Let $V = [v_1, v_2, \dots, v_K]$ with

$$v_j = \frac{\partial E_n(W)}{\partial y_j(W^T x_n)} = -\frac{I(t_n = j)}{y_j(W^T x_n)}$$

where $K$ is the number of classes (the derivative vanishes for $j \neq t_n$). Then

$$\frac{\partial E_n(W)}{\partial w_k} = \sum_{j=1}^K \frac{\partial E_n(W)}{\partial y_{jn}}\,\frac{\partial y_{jn}}{\partial(w_k^T x_n)}\,\frac{\partial(w_k^T x_n)}{\partial w_k} = \Big(\sum_{j=1,\,j\neq k}^K v_j\,(-y_{kn}\,y_{jn}) + v_k\,(y_{kn} - y_{kn}^2)\Big)\,x_n = \Big[v_k\,y_{kn} - y_{kn}\sum_{j=1}^K v_j\,y_{jn}\Big]\,x_n$$

$$= y_{kn}\,\big(v_k - V\,y(x_n)\big)\,x_n \tag{3.6.3}$$

Formula 3.6.3 is a matrix computation; it is highly recommended to use this form in your code.
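As a numerical check (our own, not from the post), formula 3.6.3 with $v_j = -I(t_n{=}j)/y_{jn}$ reproduces the well-known closed form $(y_{kn} - I(t_n{=}k))\,x_n$ for the softmax cross-entropy gradient:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(3)
K, D = 4, 5
W = rng.normal(size=(D, K))
x = rng.normal(size=D)
t_n = 2                                    # true class index of this sample

y = softmax(W.T @ x)                       # y_kn, shape (K,)
v = np.zeros(K)
v[t_n] = -1.0 / y[t_n]                     # v_j = dE_n/dy_jn = -I(t_n = j) / y_jn

G = np.outer(y * (v - v @ y), x)           # formula 3.6.3: y_kn (v_k - V y) x_n, (K, D)
G_closed = np.outer(y - np.eye(K)[t_n], x) # closed form (y_kn - I(t_n = k)) x_n
print(np.allclose(G, G_closed))            # True
```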


To help your understanding, some related matrices are listed here; they are very helpful when coding:

$$\phi_n = W^T x_n$$

Softmax Jacobian matrix ($K \times N$):

$$J = \begin{bmatrix} \frac{\partial y_1(\phi_1)}{\partial \phi_1} & \frac{\partial y_1(\phi_2)}{\partial \phi_2} & \cdots & \frac{\partial y_1(\phi_N)}{\partial \phi_N} \\ \frac{\partial y_2(\phi_1)}{\partial \phi_1} & \frac{\partial y_2(\phi_2)}{\partial \phi_2} & \cdots & \frac{\partial y_2(\phi_N)}{\partial \phi_N} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_K(\phi_1)}{\partial \phi_1} & \frac{\partial y_K(\phi_2)}{\partial \phi_2} & \cdots & \frac{\partial y_K(\phi_N)}{\partial \phi_N} \end{bmatrix}$$

gradient matrix:

$$V = \begin{bmatrix} \frac{\partial E(W)}{\partial y_1(\phi_1)} & \frac{\partial E(W)}{\partial y_2(\phi_1)} & \cdots & \frac{\partial E(W)}{\partial y_K(\phi_1)} \\ \frac{\partial E(W)}{\partial y_1(\phi_2)} & \frac{\partial E(W)}{\partial y_2(\phi_2)} & \cdots & \frac{\partial E(W)}{\partial y_K(\phi_2)} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial E(W)}{\partial y_1(\phi_N)} & \frac{\partial E(W)}{\partial y_2(\phi_N)} & \cdots & \frac{\partial E(W)}{\partial y_K(\phi_N)} \end{bmatrix}$$

$$V_n = \Big[\tfrac{\partial E(W)}{\partial y_1(\phi_n)}\ \ \tfrac{\partial E(W)}{\partial y_2(\phi_n)}\ \ \cdots\ \ \tfrac{\partial E(W)}{\partial y_K(\phi_n)}\Big]$$

$$y(x_n) = [y_1(x_n)\ \ y_2(x_n)\ \ \cdots\ \ y_K(x_n)]^T$$

$$G = \frac{\partial E(W)}{\partial W} = \begin{bmatrix} \frac{\partial E_1(W)}{\partial w_1} & \frac{\partial E_2(W)}{\partial w_1} & \cdots & \frac{\partial E_N(W)}{\partial w_1} \\ \frac{\partial E_1(W)}{\partial w_2} & \frac{\partial E_2(W)}{\partial w_2} & \cdots & \frac{\partial E_N(W)}{\partial w_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial E_1(W)}{\partial w_K} & \frac{\partial E_2(W)}{\partial w_K} & \cdots & \frac{\partial E_N(W)}{\partial w_K} \end{bmatrix}$$

$$G_{kn} = y_{kn}\,\big(V_{nk} - V_n\,y(x_n)\big)\,x_n$$
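Summing the per-sample gradients $G_{kn}$ over $n$ gives the full batch gradient; with one-hot targets this collapses to a single matrix product $(Y - T)^T X$. A vectorized sketch (the shapes and random data are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
N, D, K = 6, 3, 4
X = rng.normal(size=(N, D))          # data matrix, one sample per row
W = rng.normal(size=(D, K))          # weight matrix, one column per class
t = rng.integers(0, K, size=N)       # class labels in {0, ..., K-1}

A = X @ W                            # activations, N x K
A -= A.max(axis=1, keepdims=True)    # stabilize exp row-wise
Y = np.exp(A)
Y /= Y.sum(axis=1, keepdims=True)    # softmax outputs, rows sum to 1
T = np.eye(K)[t]                     # one-hot targets, N x K

G = (Y - T).T @ X                    # batch gradient dE/dW, K x D
print(G.shape)                       # (4, 3)
```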

