1. Variable definitions
\(m\)
: number of training examples

\(y\)
: vector of labels, one per training example; for logistic regression each \(y^{(i)} \in \{0, 1\}\)

\(X\)
: design matrix; each row of \(X\) is a training example, each column of \(X\) is a feature
\[X = \begin{pmatrix} 1 & x^{(1)}_1 & \dots & x^{(1)}_n \\ 1 & x^{(2)}_1 & \dots & x^{(2)}_n \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x^{(m)}_1 & \dots & x^{(m)}_n \\ \end{pmatrix}\]
\[\theta = \begin{pmatrix} \theta_0 \\ \theta_1 \\ ... \\ \theta_n \\ \end{pmatrix}\]
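In Octave, \(X\) is built by prepending a column of ones (the \(x_0 = 1\) intercept term) to the raw features. A minimal sketch, assuming a hypothetical \(m \times n\) matrix `data` holding one raw example per row:

[m, n] = size(data);      % data is assumed: one example per row, one feature per column
X = [ones(m, 1), data];   % prepend the intercept column; X is m x (n + 1)
theta = zeros(n + 1, 1);  % initial parameter vector matching X's columns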
2. Hypothesis
For a single training example, the feature vector (with the intercept term \(x_0 = 1\)):
\[x= \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \\ \end{pmatrix} \]
\[ h_\theta(x) = g(\theta^T x) = g(x_0\theta_0 + x_1\theta_1 + \dots + x_n\theta_n) = \frac{1}{1 + e^{-\theta^T x}}, \]
where \(g\) is the sigmoid function:
\[ g(z) = \frac{1}{1 + e^{-z}}, \]
% vectorized sigmoid; exp() works elementwise, so z may be a scalar, vector, or matrix
g = 1 ./ (1 + exp(-z));
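A quick check, assuming the line above is saved as the body of a `sigmoid(z)` function (as called in the cost code below): \(g(0) = 0.5\), and large-magnitude inputs saturate toward 0 or 1.

z = [-5; 0; 5];
g = sigmoid(z);   % approximately [0.0067; 0.5000; 0.9933]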
3. Cost function
\[J(\theta) = \frac{1}{m}\sum_{i=1}^m\left[-y^{(i)}\log(h_\theta(x^{(i)})) - (1-y^{(i)})\log(1 - h_\theta(x^{(i)}))\right],\]
vectorized version in Octave:
% y' * log(h) and (1 - y)' * log(1 - h) are already scalars
% (1 x m times m x 1), so no sum() is needed
h = sigmoid(X * theta);
J = -(1 / m) * (y' * log(h) + (1 - y)' * log(1 - h));
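A useful sanity check: with \(\theta\) initialized to all zeros, \(h_\theta(x^{(i)}) = 0.5\) for every example, so the cost evaluates to \(-\log(0.5) \approx 0.693\) regardless of the data.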
4. Goal
find \(\theta\) to minimize \(J(\theta)\); note that \(\theta\) is a vector here
4.1 Gradient descent
\[ \frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_j, \]
repeat until convergence {
\(\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_j\) (simultaneously for every \(j\))
}
vectorization
\[S= \begin{pmatrix} h_\theta(x^{(1)})-y^{(1)} & h_\theta(x^{(2)})-y^{(2)} & \dots & h_\theta(x^{(m)})-y^{(m)} \end{pmatrix} \begin{pmatrix} x^{(1)}_0 & x^{(1)}_1 & \dots & x^{(1)}_n \\ x^{(2)}_0 & x^{(2)}_1 & \dots & x^{(2)}_n \\ \vdots & \vdots & \ddots & \vdots \\ x^{(m)}_0 & x^{(m)}_1 & \dots & x^{(m)}_n \\ \end{pmatrix} \]
\[= \begin{pmatrix} \sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_0 & \sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_1 & ... & \sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_n \end{pmatrix} \]
\[ \theta := \theta - \frac{\alpha}{m} S^T \]
\[h_\theta(X) = g(X\theta) = \frac{1}{1 + e^{-X\theta}}\]
\(X\theta\) is \(m \times 1\) and \(y\) is \(m \times 1\), so
\(\frac{1}{1+e^{-X\theta}} - y\) is \(m \times 1\):
\[ \frac{1}{1 + e^{-X\theta}} - y = \begin{pmatrix} h_\theta(x^{(1)})-y^{(1)} \\ h_\theta(x^{(2)})-y^{(2)} \\ \vdots \\ h_\theta(x^{(m)})-y^{(m)} \end{pmatrix} \]
Since \(S\) is this column vector transposed times \(X\), we have \(S^T = X^T\left(\frac{1}{1 + e^{-X\theta}} - y\right)\), and the update becomes
\[ \theta := \theta - \frac{\alpha}{m} X^T\left(\frac{1}{1 + e^{-X\theta}} - y\right) \]
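Putting the vectorized update into a loop gives a minimal gradient descent sketch in Octave; `alpha` (learning rate) and `num_iters` are assumed hyperparameters chosen by the caller, and `sigmoid` is the function from section 2:

for iter = 1:num_iters
  % X' * (h - y) is (n+1) x m times m x 1, giving the (n+1) x 1 gradient sum
  theta = theta - (alpha / m) * X' * (sigmoid(X * theta) - y);
end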
5. Regularized logistic regression
add a penalty on large parameters to reduce overfitting (note that too large a \(\lambda\) can instead cause underfitting)
Cost function
\[ J(\theta) = \frac{1}{m}\sum_{i=1}^m\left[-y^{(i)}\log(h_\theta(x^{(i)})) - (1-y^{(i)})\log(1 - h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m} \sum_{j=1}^n \theta^2_j, \]
where the penalty sums over \(j = 1, \dots, n\): the intercept \(\theta_0\) is not regularized.
Gradient descent
\[ \frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_0, \]
\[ \frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_j + \frac{\lambda}{m}\theta_j, \quad (j \ge 1) \]
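A sketch of the regularized cost and gradient in Octave, following the two formulas above; in Octave's 1-based indexing, `theta(1)` holds \(\theta_0\), which is excluded from the penalty:

h = sigmoid(X * theta);
reg = (lambda / (2 * m)) * sum(theta(2:end) .^ 2);        % penalty skips theta_0
J = -(1 / m) * (y' * log(h) + (1 - y)' * log(1 - h)) + reg;
grad = (1 / m) * X' * (h - y);                            % unregularized gradient
grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);  % add (lambda/m) * theta_j for j >= 1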