Introduction to Machine Learning: Linear Regression

Imagination is an outcome of what you have learned: if you can imagine the world, it means you have learned what the world is about.

Actually, we don't know how we see (at least it's really hard to pin down), so we can't simply write a program that tells a machine how to see.

One of the most important parts of machine learning is introspecting how our brain learns subconsciously. If we can't introspect that process, it is fairly hard to replicate a brain.

Linear Models

Supervised learning of linear models can be divided into two phases:

  • Training:
    1. Read training data points with labels \(\left\{\mathbf{x}_{1:n},y_{1:n}\right\}\), where \(\mathbf{x}_i \in \mathbb{R}^{1 \times d}, \ y_i \in \mathbb{R}^{1 \times c}\);
    2. Estimate the model parameters \(\hat{\theta}\) with some learning algorithm.
      Note: the parameters are the information the model has learned from the data.
  • Prediction:
    1. Read a new, unlabelled data point \(\mathbf{x}_{n+1}\) (typically one the model has never seen before);
    2. Using the learned parameters \(\hat{\theta}\), estimate its unknown label \(\hat{y}_{n+1}\). (A minimal code sketch of this two-phase workflow follows this list.)
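For concreteness, here is a minimal Python/NumPy sketch of the two-phase workflow; the function names and the use of `np.linalg.lstsq` are my own illustration, not part of the original notes.

```python
import numpy as np

def train(X_train, y_train):
    """Training phase: estimate parameters theta_hat from labelled data."""
    # Any learning algorithm fits here; least squares is derived later in the post.
    theta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return theta_hat

def predict(x_new, theta_hat):
    """Prediction phase: estimate the unknown label for an unseen data point."""
    return x_new @ theta_hat
```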

1-D example:
First of all, we create a linear model:
\[ \hat{y}_i = \theta_0 + \theta_1 x_{i} \]
Both \(x\) and \(y\) are scalars in this case.
[Figure omitted: slides by Nando de Freitas, CPSC540, UBC]

Then we take, for example, SSE (Sum of Squared Errors) as our objective / loss / cost / energy / error function [1]:

\[ J(\theta)=\sum_{i=1}^n \left( \hat{y}_i - y_i\right)^2 \]
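As a quick illustration, a toy NumPy version of this 1-D model and its SSE cost might look like the following (the data values and variable names are made up for the example):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])   # scalar inputs x_i
y = np.array([1.1, 2.9, 5.2, 7.1])   # scalar labels y_i
theta0, theta1 = 1.0, 2.0            # candidate parameters

y_hat = theta0 + theta1 * x          # model: y_hat_i = theta0 + theta1 * x_i
sse = np.sum((y_hat - y) ** 2)       # J(theta) = sum_i (y_hat_i - y_i)^2
print(sse)
```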

Linear Prediction Model

In general, each data point \(\mathbf{x}_i\) has \(d\) dimensions, so the corresponding number of parameters is \((d+1)\), the extra one being the bias term \(\theta_0\).

The mathematical form of the linear model is:
\[ \hat{y}_i = \sum_{j=0}^{d} \theta_jx_{ij} \]
where \(x_{i0} = 1\) by convention, so that \(\theta_0\) acts as the bias term.

The matrix form of linear model is:
\[ \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{bmatrix}= \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1d} \\ 1 & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_d \end{bmatrix} \]
Or in a more compact way:
\[ \mathbf{\hat{y}} = \mathbf{X\theta} \]
Note that the matrix form is widely used not only because it is a concise way to represent the model, but also because it translates almost directly into code in MATLAB or Python (NumPy).
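For instance, a minimal NumPy sketch of the matrix form could look like this (the variable names and random data are illustrative only):

```python
import numpy as np

n, d = 5, 3
X_raw = np.random.randn(n, d)               # n data points, d features each
X = np.hstack([np.ones((n, 1)), X_raw])     # prepend the all-ones column for theta_0
theta = np.zeros(d + 1)                     # (d + 1) parameters, including the bias
y_hat = X @ theta                           # all n predictions in one matrix product
```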

Optimization Approach

In order to optimize the model prediction, we need to minimize the quadratic cost:
\[ \begin{align*}\notag J(\mathbf{\theta}) &= \sum_{i=1}^n \left( \hat{y}_i - y_i\right)^2 \\ &= \left( \mathbf{y-X\theta} \right)^T\left( \mathbf{y-X\theta} \right) \end{align*} \]

by setting the derivative w.r.t. the vector \(\mathbf{\theta}\) to zero, since the cost function is strictly convex and the domain of \(\mathbf{\theta}\) is convex [2].

[Figure omitted: slides by Nando de Freitas, CPSC540, UBC]

\[ \begin{align*}\notag \frac{\partial J(\mathbf{\theta})}{\partial \mathbf{\theta}} &= \frac{\partial}{ \partial \mathbf{\theta} } \left( \mathbf{y-X\theta} \right)^T\left( \mathbf{y-X\theta} \right) \\ &=\frac{\partial}{ \partial \mathbf{\theta} } \left( \mathbf{y}^T\mathbf{y} + \mathbf{\theta}^T \mathbf{X}^T\mathbf{X\theta} -2\mathbf{y}^T\mathbf{X\theta} \right) \\ &=\mathbf{0}+2 \left( \mathbf{X}^T\mathbf{X} \right)^T \mathbf{\theta} - 2 \left( \mathbf{y}^T\mathbf{X} \right)^T \\ &=2 \left( \mathbf{X}^T\mathbf{X} \right) \mathbf{\theta} - 2 \left( \mathbf{X}^T\mathbf{y} \right) \\ &\triangleq\mathbf{0} \end{align*} \]
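For reference, the differentiation step above uses two standard matrix-calculus identities (the second one specialized to a symmetric matrix \(\mathbf{A}\), which \(\mathbf{X}^T\mathbf{X}\) is):
\[ \frac{\partial}{\partial \mathbf{\theta}} \left( \mathbf{b}^T \mathbf{\theta} \right) = \mathbf{b}, \qquad \frac{\partial}{\partial \mathbf{\theta}} \left( \mathbf{\theta}^T \mathbf{A} \mathbf{\theta} \right) = \left( \mathbf{A} + \mathbf{A}^T \right) \mathbf{\theta} = 2\mathbf{A}\mathbf{\theta} \ \ \text{for symmetric } \mathbf{A} \]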

So we get \(\mathbf{\hat{\theta}}\) as an analytical solution:
\[ \mathbf{\hat{\theta}} = \left( \mathbf{X}^T\mathbf{X} \right)^{-1} \left( \mathbf{X}^T\mathbf{y} \right) \]
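In code, this closed-form estimate can be sketched as follows. Note that it assumes \(\mathbf{X}^T\mathbf{X}\) is invertible (i.e. \(\mathbf{X}\) has full column rank), and that solving the linear system is usually preferred over forming the inverse explicitly:

```python
import numpy as np

def fit_normal_equations(X, y):
    # Solve (X^T X) theta = X^T y instead of computing the inverse directly,
    # which is numerically more stable.
    return np.linalg.solve(X.T @ X, X.T @ y)

# theta_hat = fit_normal_equations(X, y)
```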

Having gone through this procedure, we can see that learning is just adjusting the model parameters so as to minimize the objective function.
Thus, the prediction function can be rewritten as:
\[ \begin{align*}\notag \mathbf{\hat{y}} &= \mathbf{X\hat{\theta}}\\ &=\mathbf{X}\left( \mathbf{X}^T\mathbf{X} \right)^{-1} \mathbf{X}^T\mathbf{y} \\ &= \mathbf{Hy} \end{align*} \]
where \(\mathbf{H}\) is called the hat matrix because it puts a hat on \(\mathbf{y}\).
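A small sketch of forming \(\mathbf{H}\) explicitly (only sensible for small \(n\), since \(\mathbf{H}\) is \(n \times n\); again this assumes \(\mathbf{X}^T\mathbf{X}\) is invertible):

```python
import numpy as np

def hat_matrix(X):
    # H = X (X^T X)^{-1} X^T, the projection onto the column space of X
    return X @ np.linalg.inv(X.T @ X) @ X.T

# y_hat = hat_matrix(X) @ y   # identical to X @ theta_hat
```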

Multidimensional Label \(\mathbf{y_i}\)

So far we have been assuming \(y_i\) to be a scalar. But what if the model has multiple outputs (say \(c\) of them)? Simply stack \(c\) columns of parameters:
\[ \begin{bmatrix} y_{11} & \cdots & y_{1c} \\ y_{21} & \cdots & y_{2c} \\ \vdots & \ddots & \vdots \\ y_{n1} & \cdots & y_{nc} \end{bmatrix}= \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1d} \\ 1 & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix} \begin{bmatrix} \theta_{01} & \cdots & \theta_{0c}\\ \theta_{11} & \cdots & \theta_{1c}\\ \theta_{21} & \cdots & \theta_{2c}\\ \vdots & \ddots & \vdots \\ \theta_{d1}& \cdots & \theta_{dc} \end{bmatrix} \]
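A possible NumPy sketch of this multi-output case, where the same normal equations solve all \(c\) output columns at once (shapes, names, and random data are purely illustrative):

```python
import numpy as np

n, d, c = 100, 3, 2
X = np.hstack([np.ones((n, 1)), np.random.randn(n, d)])   # n x (d+1) design matrix
Y = np.random.randn(n, c)                                  # n x c labels
Theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)              # (d+1) x c parameters
Y_hat = X @ Theta_hat                                      # n x c predictions
```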




  1. SSE is widely used, but it works poorly under certain circumstances: e.g. if the training data contains noise (outliers), the fitted model will be seriously distorted by those outliers.

  2. See one of several interesting explanations here
