1. Variable definitions
m
: number of training examples
n
: number of features (n = 1 in `Linear Regression')
X
: design matrix
x
: input features
y
: expected output
\(h_\theta(x^{(i)})\) : predicted output of input \(x^{(i)}\)
\[X_{m,2} = \begin{pmatrix} 1 & x^{(1)}\\ 1 & x^{(2)}\\ 1 & x^{(3)}\\ \vdots & \vdots\\ 1 & x^{(m)}\\ \end{pmatrix} ,\]
\[y_{m,1} = \begin{pmatrix} y^{(1)}\\ y^{(2)}\\ y^{(3)}\\ \vdots\\ y^{(m)}\\ \end{pmatrix} ,\]
\[\theta = \begin{pmatrix} \theta_0\\ \theta_1\\ \end{pmatrix} ,\]
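A minimal Octave sketch of this setup (assuming the raw inputs are stored in a column vector x_raw and the outputs in y, both of length m; the name x_raw is hypothetical):

m = length(y); % number of training examples
X = [ones(m, 1) x_raw]; % m x 2 design matrix: a column of ones, then the feature
theta = zeros(2, 1); % initialize theta_0 and theta_1 to zero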
2. Hypothesis
\(h_\theta(x^{(i)}) = \theta_0 + \theta_1x^{(i)}\)
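Vectorized over all m examples (reusing the X and theta sketched above), every prediction is computed at once:

h = X * theta % m x 1 vector whose i-th entry is h_theta(x^(i))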
3. Cost function
\[J(\theta _0,\theta_1) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 ,\]
\[X\theta-y = \begin{pmatrix} 1 & x^{(1)}\\ 1 & x^{(2)}\\ 1 & x^{(3)}\\ \vdots & \vdots\\ 1 & x^{(m)}\\ \end{pmatrix} \begin{pmatrix} \theta_0\\ \theta_1\\ \end{pmatrix} - y = \begin{pmatrix} \theta_0+\theta_1x^{(1)}-y^{(1)}\\ \theta_0+\theta_1x^{(2)}-y^{(2)}\\ \theta_0+\theta_1x^{(3)}-y^{(3)}\\ \vdots\\ \theta_0+\theta_1x^{(m)}-y^{(m)}\\ \end{pmatrix} = \begin{pmatrix} h_\theta(x^{(1)})-y^{(1)}\\ h_\theta(x^{(2)})-y^{(2)}\\ h_\theta(x^{(3)})-y^{(3)}\\ \vdots\\ h_\theta(x^{(m)})-y^{(m)}\\ \end{pmatrix} ,\]
vectorization (Octave)
\[J = \frac{1}{2m} (X\theta - y)^T (X\theta - y) ,\]
J = (1 / (2 * m)) * (X * theta - y)' * (X * theta - y)
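Wrapped as a reusable function (a sketch; the function name computeCost is an assumption, not from the source):

function J = computeCost(X, y, theta)
  % vectorized squared-error cost for linear regression
  m = length(y);
  err = X * theta - y; % m x 1 vector of per-example errors
  J = (1 / (2 * m)) * (err' * err);
end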
4. Goal
find \(\theta_0\), \(\theta_1\) that minimize \(J(\theta_0,\theta_1)\), i.e. minimize the difference between the predicted output \(h_\theta(x)\) and the expected output \(y\)
1. (Batch) gradient descent
Looks at every example in the entire training set on every step, which may be very slow when the training set is big.
Repeat until convergence
\[\begin{cases} \theta_0 := \theta_0 - \frac{\alpha}{m} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\\ \\ \theta_1 := \theta_1 - \frac{\alpha}{m} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)}) x^{(i)} \end{cases} ,\]
(update \(\theta_0\) and \(\theta_1\) simultaneously)
\[S= \begin{pmatrix} h_\theta(x^{(1)})-y^{(1)} & h_\theta(x^{(2)})-y^{(2)} & \dots & h_\theta(x^{(m)})-y^{(m)} \end{pmatrix} \begin{pmatrix} 1 & x^{(1)}\\ 1 & x^{(2)}\\ 1 & x^{(3)}\\ \vdots & \vdots\\ 1 & x^{(m)}\\ \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)}) & \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)}) x^{(i)} \end{pmatrix} ,\]
\[ \begin{pmatrix} \theta_0\\ \theta_1\\ \end{pmatrix} := \begin{pmatrix} \theta_0\\ \theta_1\\ \end{pmatrix} - \frac{\alpha}{m} S^T ,\]
vectorization (Octave)
theta = theta - (alpha / m) * ((X * theta - y)' * X)'
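A full training loop built around this update (a sketch; alpha and num_iters are assumed hyperparameters to be tuned):

alpha = 0.01; % learning rate (assumed value)
num_iters = 1500; % number of iterations (assumed value)
for iter = 1:num_iters
  theta = theta - (alpha / m) * ((X * theta - y)' * X)';
end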
2. Stochastic gradient descent
Looks at one example on every step; make sure to pick the next example at random (e.g. by shuffling the training set first).
repeat until convergence {
  for j = 1 to m {
    \(\theta_i := \theta_i - \alpha (h_\theta(x^{(j)})-y^{(j)}) x_i^{(j)}\) (for \(i = 0, 1\), with \(x_0^{(j)} = 1\))
  }
}
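A minimal Octave sketch of this loop (reusing X, y, theta, and alpha from above; num_epochs is an assumed hyperparameter, and randperm supplies the random order):

num_epochs = 10; % assumed value, tune as needed
for epoch = 1:num_epochs
  for j = randperm(m) % visit the examples in random order
    err_j = X(j, :) * theta - y(j); % scalar error on example j
    theta = theta - alpha * err_j * X(j, :)';
  end
end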
3. Normal equation
\(tr AB = tr BA\)
\(tr ABC = tr CAB = tr BCA\)
\(tr A = tr A^T\)
\(tr a = a\) (a is a real number)
\(\nabla_A tr AB = B^T\)
\(\nabla_A tr ABA^TC = CAB + C^TAB^T\)
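A quick numeric sanity check of the first two identities in Octave (random matrices; trace is built in):

A = rand(3, 4); B = rand(4, 3); C = rand(3, 3);
abs(trace(A * B) - trace(B * A)) < 1e-10 % tr AB = tr BA
abs(trace(A * B * C) - trace(C * A * B)) < 1e-10 % tr ABC = tr CAB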
\[ \nabla_\theta J = \begin{pmatrix} \frac{\partial J}{\partial \theta_0}\\ \frac{\partial J}{\partial \theta_1}\\ ...\\ \frac{\partial J}{\partial \theta_n}\\ \end{pmatrix} \]
let \(f(A) = tr AB\); it takes a matrix as input and outputs a real number
\[ J(\theta) = \frac{1}{2} \sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)}) ^ 2 = \frac{1}{2}(X\theta - y)^T(X\theta - y) \]
(the \(\frac{1}{m}\) factor is dropped here; scaling \(J\) by a constant does not change its minimizer)
find \(\theta\) such that \(\nabla_\theta J(\theta) = 0\), which minimizes \(J(\theta)\)
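Expanding \(J(\theta)\) and differentiating with the identities above fills in the missing step:
\[ \nabla_\theta J(\theta) = \frac{1}{2} \nabla_\theta \left( \theta^TX^TX\theta - \theta^TX^Ty - y^TX\theta + y^Ty \right) = X^TX\theta - X^Ty \]
Setting \(X^TX\theta - X^Ty = 0\) yields the normal equation: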
\(\theta = (X^TX)^{-1}X^Ty\)
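In Octave (a one-line sketch; pinv guards against a singular \(X^TX\)):

theta = pinv(X' * X) * X' * y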