Machine Learning (3) Linear Discriminant Functions -- Least-squares Classification

Copyright notice: https://blog.csdn.net/qq_26386707/article/details/79404122



Chenjing Ding
2018/02/28


| notation | meaning |
| --- | --- |
| $M$ | the number of mixture components |
| $x_n$ | the n-th input vector |
| $N$ | the number of training input vectors |
| $K$ | the number of classes |
| $w_k$ | the k-th column of the weight matrix |
| $W$ | the weight matrix |
| $X$ | the input matrix |

To be clear, all vectors in this passage are column vectors, so their transposes are row vectors; capital letters denote matrices, while lowercase letters denote vectors.

1. General Classification Problem

1.1 Single-sample input case

Let's consider K linear discriminant models:

$$(1.1.1)\qquad y_k(x) = w_k^T x + w_{k0}, \quad k = 1, \dots, K$$
Both $w_k$ and $x$ are vectors. To absorb the bias $w_{k0}$ into $w_k$, each input is augmented with a leading 1, i.e. $x = (1, x_1, \dots, x_D)^T$. If $W$ is the matrix

$$(1.1.2)\qquad W = [w_1, w_2, \dots, w_K] = \begin{bmatrix} w_{10} & w_{20} & \dots & w_{K0} \\ w_{11} & w_{21} & \dots & w_{K1} \\ \vdots & \vdots & \ddots & \vdots \\ w_{1D} & w_{2D} & \dots & w_{KD} \end{bmatrix}$$

then we obtain $Y(x)$, which is a column vector:

$$(1.1.3)\qquad Y(x) = W^T x = [y_1(x)\ \ y_2(x)\ \ \dots\ \ y_K(x)]^T$$
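As a quick numerical sketch of (1.1.1)-(1.1.3) (the dimensions $D = 2$, $K = 3$ and the random weights are made up for illustration; note the leading 1 appended to $x$ so that $W$ can carry the bias row):

```python
import numpy as np

D, K = 2, 3                       # input dimension and number of classes (illustrative)
rng = np.random.default_rng(0)

W = rng.normal(size=(D + 1, K))   # column k is (w_k0, w_k1, ..., w_kD)^T; bias in the first row
x = np.concatenate(([1.0], rng.normal(size=D)))   # augment x with a leading 1 for the bias

Y = W.T @ x                       # (1.1.3): Y(x) = W^T x, a length-K vector
print(Y)                          # Y[k] = y_k(x) = w_k^T x + w_k0
```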

1.2 Input as a matrix

For the entire data set, $X$ is a matrix whose rows are the augmented input vectors:

$$X = [x_1 \ \ x_2 \ \ \dots \ \ x_N]^T$$

$$(1.2.1)\qquad T = [t_1 \ \ t_2 \ \ \dots \ \ t_N]^T, \qquad \hat{Y}(X) = XW = [Y(x_1) \ \ Y(x_2) \ \ \dots \ \ Y(x_N)]^T$$

where $t_1, t_2, \dots$ are column vectors (the binary target vector of each sample), and $T$ and $\hat{Y}(X)$ are $N \times K$ matrices; $T$ is the target matrix.
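In code, evaluating the whole data set is a single matrix product; a minimal sketch with hypothetical data, again using a column of ones for the bias:

```python
import numpy as np

N, D, K = 5, 2, 3                 # sample count, input dimension, classes (illustrative)
rng = np.random.default_rng(1)

X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])   # row n is the augmented x_n^T
labels = rng.integers(0, K, size=N)
T = np.eye(K)[labels]             # target matrix: row n is the one-hot vector t_n^T

W = rng.normal(size=(D + 1, K))
Y_hat = X @ W                     # Y_hat(X) = XW; row n equals Y(x_n)^T
print(Y_hat.shape)                # (5, 3), i.e. N x K
```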

2. Closed-form solution

We now look for a closed-form solution for $W$ by directly minimizing the sum-of-squares error:

$$E(W) = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\left(y_k(x_n) - t_{nk}\right)^2 = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\left(w_k^T x_n - t_{nk}\right)^2$$
Let's formulate the sum-of-squares error in matrix notation, using the two identities

$$(2.1)\qquad \sum_i \sum_j a_{ij}^2 = \mathrm{Tr}\left(A^T A\right), \qquad \frac{\partial\,\mathrm{Tr}(A)}{\partial A} = I$$
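The first identity in (2.1) is easy to verify numerically; a quick sketch:

```python
import numpy as np

A = np.random.default_rng(2).normal(size=(4, 3))
lhs = (A ** 2).sum()              # sum_i sum_j a_ij^2
rhs = np.trace(A.T @ A)           # Tr(A^T A)
assert np.isclose(lhs, rhs)       # the two sides agree to floating-point precision
```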

$$(2.2)\qquad E(W) = \frac{1}{2}\,\mathrm{Tr}\left((XW - T)^T (XW - T)\right)$$

$$(2.3)\qquad \frac{\partial E(W)}{\partial W} = \frac{1}{2}\,\frac{\partial E(W)}{\partial \left((XW-T)^T(XW-T)\right)}\;\frac{\partial \left((XW-T)^T(XW-T)\right)}{\partial W}$$

$$(2.4)\qquad\qquad\qquad\ = X^T (XW - T) \qquad \text{(using (2.1))}$$

Setting the derivative to zero, and assuming $X^T X$ is invertible (which holds when $X$ has full column rank):

$$\frac{\partial E(W)}{\partial W} = 0 \;\Rightarrow\; W = (X^T X)^{-1} X^T T$$

Thus the closed-form solution for $Y(x_n)$ is:

$$Y(x_n) = W^T x_n = \left( (X^T X)^{-1} X^T T \right)^T x_n$$
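Putting the derivation together, here is a sketch of the closed-form fit on synthetic data; `np.linalg.lstsq` computes the same solution as the explicit $(X^T X)^{-1} X^T T$ but is numerically safer:

```python
import numpy as np

N, D, K = 100, 2, 3
rng = np.random.default_rng(3)
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])   # augmented inputs
T = np.eye(K)[rng.integers(0, K, size=N)]                   # one-hot targets

W = np.linalg.inv(X.T @ X) @ X.T @ T              # closed form; assumes X^T X is invertible
W_lstsq, *_ = np.linalg.lstsq(X, T, rcond=None)   # same solution, numerically safer
assert np.allclose(W, W_lstsq)

x_new = np.array([1.0, 0.5, -0.3])                # an augmented test point
Y = W.T @ x_new                                   # Y(x_n) = W^T x_n
pred = Y.argmax()                                 # assign the class with the largest discriminant
```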

3. Problems

  1. Least-squares is very sensitive to outliers (see the sketch after this list)!
  2. Least-squares corresponds to maximum likelihood under the assumption of a Gaussian conditional distribution. However, our binary target vectors clearly have a non-Gaussian distribution (a 0-1 Bernoulli distribution when K = 2)!

    These problems will be discussed in a later post.
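A small synthetic sketch of problem 1: a handful of extra points that are correctly labelled but far from the rest still shift the least-squares weights (and hence the decision boundary) noticeably:

```python
import numpy as np

rng = np.random.default_rng(4)
X0 = rng.normal(loc=[-2.0, 0.0], size=(50, 2))   # class 0
X1 = rng.normal(loc=[+2.0, 0.0], size=(50, 2))   # class 1

def fit(X_raw, y, K=2):
    """Least-squares fit with one-hot targets; returns the weight matrix W."""
    X = np.hstack([np.ones((len(X_raw), 1)), X_raw])
    T = np.eye(K)[y]
    W, *_ = np.linalg.lstsq(X, T, rcond=None)
    return W

y = np.r_[np.zeros(50, int), np.ones(50, int)]
W_clean = fit(np.vstack([X0, X1]), y)

# a few extra class-1 points, correctly labelled but far from the rest
outliers = rng.normal(loc=[10.0, 8.0], size=(5, 2))
W_out = fit(np.vstack([X0, X1, outliers]), np.r_[y, np.ones(5, int)])

print(W_clean - W_out)   # the weights shift even though no label changed
```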
