1. Variable definitions
m
: number of training examples
n
: number of features (n = 1 in `Linear Regression')
X
: design matrix
x
: input features
y
: expected output
\(h_\theta(x^{(i)})\) : predicted output of input \(x^{(i)}\)
\[X_{m,2} = \begin{pmatrix} 1 & x^{(1)}\\ 1 & x^{(2)}\\ 1 & x^{(3)}\\ \vdots & \vdots\\ 1 & x^{(m)}\\ \end{pmatrix} ,\]
\[y_{m,1} = \begin{pmatrix} y^{(1)}\\ y^{(2)}\\ y^{(3)}\\ \vdots\\ y^{(m)}\\ \end{pmatrix} ,\]
\[\theta = \begin{pmatrix} \theta_0\\ \theta_1\\ \end{pmatrix} ,\]
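A minimal Octave sketch of this setup (assuming the raw inputs are stored in a column vector x_raw and the outputs in y, both of length m; the name x_raw is hypothetical):

m = length(y); % number of training examples
X = [ones(m, 1) x_raw]; % m x 2 design matrix: a column of ones, then the feature
theta = zeros(2, 1); % initialize theta_0 and theta_1 to zero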
2. Hypothesis
\(h_\theta(x^{(i)}) = \theta_0 + \theta_1x^{(i)}\)
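Vectorized over all m examples (reusing the X and theta sketched above), every prediction is computed at once:

h = X * theta % m x 1 vector whose i-th entry is h_theta(x^(i))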
3. Cost function
\[J(\theta _0,\theta_1) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 ,\]
\[X\theta-y = \begin{pmatrix} 1 & x^{(1)}\\ 1 & x^{(2)}\\ 1 & x^{(3)}\\ \vdots & \vdots\\ 1 & x^{(m)}\\ \end{pmatrix} \begin{pmatrix} \theta_0\\ \theta_1\\ \end{pmatrix} - y = \begin{pmatrix} \theta_0+\theta_1x^{(1)}-y^{(1)}\\ \theta_0+\theta_1x^{(2)}-y^{(2)}\\ \theta_0+\theta_1x^{(3)}-y^{(3)}\\ \vdots\\ \theta_0+\theta_1x^{(m)}-y^{(m)}\\ \end{pmatrix} = \begin{pmatrix} h_\theta(x^{(1)})-y^{(1)}\\ h_\theta(x^{(2)})-y^{(2)}\\ h_\theta(x^{(3)})-y^{(3)}\\ \vdots\\ h_\theta(x^{(m)})-y^{(m)}\\ \end{pmatrix} ,\]
vectorization (Octave)
\[J = \frac{1}{2m} (X\theta - y)^T (X\theta - y) ,\]
J = (1 / (2 * m)) * (X * theta - y)' * (X * theta - y)
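Wrapped as a reusable function (a sketch; the function name computeCost is an assumption, not from the source):

function J = computeCost(X, y, theta)
  % vectorized squared-error cost for linear regression
  m = length(y);
  err = X * theta - y; % m x 1 vector of per-example errors
  J = (1 / (2 * m)) * (err' * err);
end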
4. Goal
find \(\theta_0\), \(\theta_1\) that minimize \(J(\theta_0,\theta_1)\), i.e. minimize the difference between the predicted output \(h_\theta(x)\) and the expected output \(y\)
1. (Batch) gradient descent
Looks at every example in the entire training set on every step, which may be very slow when the training set is big.
Repeat until convergence
\[\begin{cases} \theta_0 := \theta_0 - \frac{\alpha}{m} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\\ \\ \theta_1 := \theta_1 - \frac{\alpha}{m} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)}) x^{(i)} \end{cases} ,\]
(update \(\theta_0\) and \(\theta_1\) simultaneously)
\[S= \begin{pmatrix} h_\theta(x^{(1)})-y^{(1)} & h_\theta(x^{(2)})-y^{(2)} & \dots & h_\theta(x^{(m)})-y^{(m)} \end{pmatrix} \begin{pmatrix} 1 & x^{(1)}\\ 1 & x^{(2)}\\ 1 & x^{(3)}\\ \vdots & \vdots\\ 1 & x^{(m)}\\ \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)}) & \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)}) x^{(i)} \end{pmatrix} ,\]
\[ \begin{pmatrix} \theta_0\\ \theta_1\\ \end{pmatrix} := \begin{pmatrix} \theta_0\\ \theta_1\\ \end{pmatrix} - \frac{\alpha}{m} S^T ,\]
vectorization (Octave)
theta = theta - (alpha / m) * ((X * theta - y)' * X)'
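A full training loop built around this update (a sketch; alpha and num_iters are assumed hyperparameters to be tuned):

alpha = 0.01; % learning rate (assumed value)
num_iters = 1500; % number of iterations (assumed value)
for iter = 1:num_iters
  theta = theta - (alpha / m) * ((X * theta - y)' * X)';
end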
2. Stochastic gradient descent
Looks at one example on every step; make sure to pick the next example at random (e.g. by shuffling the training set first).
repeat until convergence {
  for j = 1 to m {
    \(\theta_i := \theta_i - \alpha (h_\theta(x^{(j)})-y^{(j)}) x_i^{(j)}\) (for \(i = 0, 1\), with \(x_0^{(j)} = 1\))
  }
}
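A minimal Octave sketch of this loop (reusing X, y, theta, and alpha from above; num_epochs is an assumed hyperparameter, and randperm supplies the random order):

num_epochs = 10; % assumed value, tune as needed
for epoch = 1:num_epochs
  for j = randperm(m) % visit the examples in random order
    err_j = X(j, :) * theta - y(j); % scalar error on example j
    theta = theta - alpha * err_j * X(j, :)';
  end
end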
3. Normal equation
\(tr AB = tr BA\)
\(tr ABC = tr CAB = tr BCA\)
\(tr A = tr A^T\)
\(tr a = a\) (a is a real number)
\(\nabla_A tr AB = B^T\)
\(\nabla_A tr ABA^TC = CAB + C^TAB^T\)
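A quick numeric sanity check of the first two identities in Octave (random matrices; trace is built in):

A = rand(3, 4); B = rand(4, 3); C = rand(3, 3);
abs(trace(A * B) - trace(B * A)) < 1e-10 % tr AB = tr BA
abs(trace(A * B * C) - trace(C * A * B)) < 1e-10 % tr ABC = tr CAB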
\[ \nabla_\theta J = \begin{pmatrix} \frac{\partial J}{\partial \theta_0}\\ \frac{\partial J}{\partial \theta_1}\\ ...\\ \frac{\partial J}{\partial \theta_n}\\ \end{pmatrix} \]
let \(f(A) = tr AB\); it takes a matrix as input and outputs a real number
\[ J(\theta) = \frac{1}{2} \sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)}) ^ 2 = \frac{1}{2}(X\theta - y)^T(X\theta - y) \]
(the \(\frac{1}{m}\) factor is dropped here; scaling \(J\) by a constant does not change its minimizer)
find \(\theta\) such that \(\nabla_\theta J(\theta) = 0\), which minimizes \(J(\theta)\)
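Expanding \(J(\theta)\) and differentiating with the identities above fills in the missing step:
\[ \nabla_\theta J(\theta) = \frac{1}{2} \nabla_\theta \left( \theta^TX^TX\theta - \theta^TX^Ty - y^TX\theta + y^Ty \right) = X^TX\theta - X^Ty \]
Setting \(X^TX\theta - X^Ty = 0\) yields the normal equation: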
\(\theta = (X^TX)^{-1}X^Ty\)
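In Octave (a one-line sketch; pinv guards against a singular \(X^TX\)):

theta = pinv(X' * X) * X' * y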