Chapter 1 Linear Regression
1 Linear Regression with One Variable / Univariate Linear Regression
1.1 model
- Hypothesis: $h_\theta(x)=\theta_0+\theta_1x$
- Parameters: $\theta_0,\theta_1$
- Cost Function:
squared error function / squared error cost function (see the sketch below)
$$J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$
- Goal (Objective Function): $\mathop{\text{minimize}}\limits_{\theta_0,\theta_1} J(\theta_0,\theta_1)$
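To make the definition concrete, a minimal NumPy sketch of the squared error cost (the function name and toy data are illustrative, not from the course):

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """Squared error cost J(theta0, theta1) for univariate linear regression."""
    m = len(y)                          # number of training examples
    predictions = theta0 + theta1 * x   # h_theta(x^(i)) for every example
    return np.sum((predictions - y) ** 2) / (2 * m)

# Tiny usage check: the line y = 2x fits this data exactly, so the cost is 0.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(compute_cost(0.0, 2.0, x, y))     # 0.0
```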
1.2 ‘Batch’ Gradient Descent Algorithm
to solve $\mathop{\text{minimize}}\limits_{\theta_0,\theta_1} J(\theta_0,\theta_1)$
1.2.1 algorithm
- repeat until convergence{
  $$\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)\quad\text{(for $j=0$ and $j=1$)}$$
  $\alpha$: learning rate
  }
- Correct (simultaneous update):
  $$\begin{aligned} \text{temp}_0&:=\theta_0-\alpha\frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)\\ \text{temp}_1&:=\theta_1-\alpha\frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)\\ \theta_0&:=\text{temp}_0\\ \theta_1&:=\text{temp}_1 \end{aligned}$$
- Notice: $\theta_0$ and $\theta_1$ need to be updated simultaneously
1.2.2 applied to univariate linear regression
- repeat{
  $$\begin{aligned}\theta_0&:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)\\ \theta_1&:=\theta_1-\alpha\frac{1}{m}\sum_{i=1}^m\left(\left(h_\theta(x^{(i)})-y^{(i)}\right)\cdot x^{(i)}\right) \end{aligned}$$
}
1.2.3 why it is called "batch"
- every step of gradient descent uses all $m$ training examples (a full implementation sketch follows below)
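Putting 1.2.2 and 1.2.3 together, a minimal batch gradient descent sketch for the univariate case (function name, hyperparameter values, and toy data are mine, not from the course):

```python
import numpy as np

def gradient_descent_uni(x, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        error = (theta0 + theta1 * x) - y                 # uses all m examples each step
        temp0 = theta0 - alpha * np.sum(error) / m        # simultaneous update:
        temp1 = theta1 - alpha * np.sum(error * x) / m    # compute both temps first,
        theta0, theta1 = temp0, temp1                     # then assign
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])            # exactly y = 1 + 2x
print(gradient_descent_uni(x, y, alpha=0.05, num_iters=5000))   # ~ (1.0, 2.0)
```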
2 Linear Regression with Multiple Variables / Multivariate Linear Regression
2.1 model
- Hypothesis:
  $x_0=1$
  $$h_\theta(x)=\theta^Tx=\theta_0x_0+\theta_1x_1+\cdots+\theta_nx_n$$
- Parameters:
  ($n+1$)-dimensional vector
  $$\theta=(\theta_0,\theta_1,\cdots,\theta_n)$$
- Cost Function:
  squared error function / squared error cost function
  $$J(\theta)=J(\theta_0,\theta_1,\cdots,\theta_n)=\frac{1}{2m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$
- Goal (Objective Function): $\mathop{\text{minimize}}\limits_{\theta_0,\theta_1,\cdots,\theta_n} J(\theta_0,\theta_1,\cdots,\theta_n)$
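With $x_0=1$, the hypothesis $h_\theta(x)=\theta^Tx$ can be evaluated for every training example at once with a single matrix product. A minimal sketch (variable names and numbers are mine):

```python
import numpy as np

# 3 examples, 2 real features; the first column of ones is x_0 = 1.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0],
              [1.0, 6.0, 7.0]])
theta = np.array([0.5, 1.0, 2.0])    # theta_0, theta_1, theta_2

predictions = X @ theta              # h_theta(x^(i)) = theta^T x^(i) for every i
print(predictions)                   # [ 8.5 14.5 20.5]
```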
2.2 Gradient Descent Algorithm
2.2.1 algorithm
- repeat until convergence{
  $$\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1,\cdots,\theta_n)\quad\text{(simultaneously update $\theta_j$ for $j=0,1,\cdots,n$)}$$
}
2.2.2 applied to multivariate linear regression
- repeat{
  $x_0^{(i)}=1$
  $$\begin{aligned}\theta_0&:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^m\left(\left(h_\theta(x^{(i)})-y^{(i)}\right)\cdot x_0^{(i)}\right)\\ \theta_1&:=\theta_1-\alpha\frac{1}{m}\sum_{i=1}^m\left(\left(h_\theta(x^{(i)})-y^{(i)}\right)\cdot x_1^{(i)}\right)\\ &\ \vdots\\ \theta_n&:=\theta_n-\alpha\frac{1}{m}\sum_{i=1}^m\left(\left(h_\theta(x^{(i)})-y^{(i)}\right)\cdot x_n^{(i)}\right) \end{aligned}$$
}
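These $n+1$ updates can be collapsed into one vectorized step, $\theta:=\theta-\frac{\alpha}{m}X^T(X\theta-y)$. A minimal sketch under that formulation (the function name and demo data are mine, not from the course):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Vectorized batch gradient descent; X already contains the x_0 = 1 column."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        error = X @ theta - y                          # h_theta(x^(i)) - y^(i), all i at once
        theta = theta - (alpha / m) * (X.T @ error)    # simultaneous update of every theta_j
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])                          # y = 0 + 2 * x_1
print(gradient_descent(X, y, alpha=0.1, num_iters=2000))   # ~ [0.0, 2.0]
```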
3 Gradient Descent in Practice
3.1 Feature Scaling
- Goal: make sure features are on a similar scale
- Advantages:
(1) makes gradient descent run much faster
(2) converges in far fewer iterations
- Methods:
(1) divide each feature by its maximum value
(2) mean normalization
3.1.1 mean normalization
- Theory: replace $x_i$ with $x_i-\mu_i$ so that features have approximately zero mean
  (do not apply this to $x_0=1$)
  $$x_i:=\frac{x_i-\mu_i}{s_i}$$
- $\mu_i$: the average value of $x_i$ over the training set
- $s_i$: the range of values of that feature, either
  (1) $\text{maximum value} - \text{minimum value}$, or
  (2) the standard deviation of the feature
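A minimal mean-normalization sketch using the $\text{max}-\text{min}$ range as $s_i$; column 0 is assumed to be the all-ones $x_0$ column and is left untouched (the function name and data are mine):

```python
import numpy as np

def mean_normalize(X):
    """Scale every feature column except x_0 to roughly [-0.5, 0.5] with zero mean."""
    X = X.astype(float)
    mu = X[:, 1:].mean(axis=0)                        # per-feature mean
    s = X[:, 1:].max(axis=0) - X[:, 1:].min(axis=0)   # per-feature range (max - min)
    X[:, 1:] = (X[:, 1:] - mu) / s
    return X, mu, s

X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1416.0, 2.0],
              [1.0,  852.0, 1.0]])
X_scaled, mu, s = mean_normalize(X)
print(X_scaled)    # first column stays 1; the other columns now have zero mean
```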
3.2 Learning Rate
- Debugging: how to make sure gradient descent is working correctly
3.2.1 good method: plot
- plot the value of the cost function $J(\theta)$ against the number of iterations
- Advantages:
(1) shows whether gradient descent is working correctly: $J(\theta)$ should decrease after every iteration
(2) helps judge whether gradient descent has converged
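A sketch of this diagnostic: record $J(\theta)$ at every iteration and plot it (matplotlib is assumed to be available; the helper names and data are mine):

```python
import numpy as np
import matplotlib.pyplot as plt

def gradient_descent_with_history(X, y, alpha, num_iters):
    m = len(y)
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(num_iters):
        error = X @ theta - y
        theta -= (alpha / m) * (X.T @ error)
        history.append(np.sum(error ** 2) / (2 * m))   # J(theta) at this iteration
    return theta, history

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
_, history = gradient_descent_with_history(X, y, alpha=0.1, num_iters=200)
plt.plot(history)                   # should decrease on every iteration if alpha is OK
plt.xlabel("iteration")
plt.ylabel("J(theta)")
plt.show()
```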
3.2.2 other method: automatic convergence test
- declare convergence if $J(\theta)$ decreases by less than some small threshold $\varepsilon$ in one iteration
- Advantage: convergence can be detected automatically
- Disadvantage: choosing a suitable $\varepsilon$ can be difficult
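A sketch of such an automatic convergence test, stopping once the per-iteration decrease in $J(\theta)$ falls below a threshold (the value of `epsilon` and all names here are only examples):

```python
import numpy as np

def run_until_converged(X, y, alpha=0.1, epsilon=1e-3, max_iters=10000):
    m = len(y)
    theta = np.zeros(X.shape[1])
    prev_cost = np.inf
    for i in range(max_iters):
        error = X @ theta - y
        theta -= (alpha / m) * (X.T @ error)
        cost = np.sum((X @ theta - y) ** 2) / (2 * m)
        if prev_cost - cost < epsilon:        # J decreased by less than epsilon
            return theta, i
        prev_cost = cost
    return theta, max_iters

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta, iters_used = run_until_converged(X, y)
print(theta, iters_used)
```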
3.2.3 choosing $\alpha$
- consider values such as $\alpha=0.01, 0.03, 0.1, 0.3, 1, 3, 10, \cdots$ (each roughly $3\times$ the previous one)
3.2.4 problems visible in the plot
- Problem: $J(\theta)$ does not decrease after every iteration
- Cause: $\alpha$ is too large, so $J(\theta)$ may fail to decrease on every iteration; gradient descent may not converge at all (slow convergence is also possible)
- Fix: choose a sufficiently smaller $\alpha$
- Trade-off: if $\alpha$ is too small, gradient descent can be very slow to converge
3.3 Features and Polynomial Regression
- linear regression with a straight line does not fit every dataset
- polynomial regression can be converted into linear regression by treating powers of a feature as new features (e.g. $x_1=x,\ x_2=x^2,\ x_3=x^3$)
- inspect the training data first in order to choose a suitable model
- Notice: feature scaling matters when using polynomial regression, since $x$, $x^2$ and $x^3$ have very different ranges (a sketch follows below)
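A minimal sketch of building polynomial features by hand and then mean-normalizing them (a library such as scikit-learn could do the same; all names and numbers here are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])          # a single raw feature

# Build the design matrix [1, x, x^2, x^3]; the powers become ordinary "linear" features.
X_poly = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])

# Mean-normalize the non-constant columns so that their ranges are comparable.
mu = X_poly[:, 1:].mean(axis=0)
s = X_poly[:, 1:].max(axis=0) - X_poly[:, 1:].min(axis=0)
X_poly[:, 1:] = (X_poly[:, 1:] - mu) / s
print(X_poly)
```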
4 Normal Equation
a method to solve for $\theta$ analytically: a single step gives the optimal value directly
4.1 model
- $m$ examples $(x^{(1)},y^{(1)}),\cdots,(x^{(m)},y^{(m)})$; $n$ features
- ($n+1$)-dimensional vector: $x^{(i)}=\begin{bmatrix} x_0^{(i)}\\ \vdots\\ x_n^{(i)} \end{bmatrix}$
- design matrix ($m\times(n+1)$ matrix): $X=\begin{bmatrix} (x^{(1)})^T\\ \vdots\\ (x^{(m)})^T \end{bmatrix}$
- $y=\begin{bmatrix} y^{(1)}\\ \vdots\\ y^{(m)} \end{bmatrix}$
- $\theta={(X^TX)}^{-1}X^Ty$
4.2 implementation
- Octave
```octave
pinv(X'*X)*X'*y
```
- Python
```python
import numpy as np

def normalEqn(X, y):
    # X.T @ X is equivalent to X.T.dot(X)
    # np.linalg.pinv (pseudo-inverse) would mirror Octave's pinv and also handle a singular X^T X
    theta = np.linalg.inv(X.T @ X) @ X.T @ y
    return theta
```
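A quick usage check with `normalEqn` as defined above (the toy data is illustrative):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # first column is x_0 = 1
y = np.array([2.0, 4.0, 6.0])                         # generated from y = 0 + 2 * x_1
print(normalEqn(X, y))                                # ~ [0.0, 2.0]
```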
- Problem: what if $X^TX$ is non-invertible (singular / degenerate)?
- Causes:
  (1) redundant features (linearly dependent features)
  (2) too many features: delete some features or use regularization
- In practice a non-invertible $X^TX$ rarely occurs.
- pseudo-inverse: pinv() (still yields a usable $\theta$ even when $X^TX$ is singular)
- inverse: inv()
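A small NumPy illustration of the difference (NumPy's `pinv`/`inv` stand in for Octave's here): the second and third feature columns below are linearly dependent, so $X^TX$ is singular.

```python
import numpy as np

# x_2 is just 2 * x_1, so the columns are linearly dependent and X^T X is singular.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0]])
y = np.array([2.0, 4.0, 6.0])

theta = np.linalg.pinv(X.T @ X) @ X.T @ y   # the pseudo-inverse still returns a solution
print(theta)
# np.linalg.inv(X.T @ X) would typically fail here with LinAlgError: Singular matrix.
```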
5 Gradient Descent vs. Normal Equation
| Gradient Descent | Normal Equation |
| --- | --- |
| need to choose $\alpha$ | no need to choose $\alpha$ |
| needs many iterations | no iterations needed |
| works well even when $n$ is large | slow if $n$ is very large, since computing $(X^TX)^{-1}$ is expensive (roughly $O(n^3)$) |
| applies to many kinds of models | only applies to linear regression; not suitable for, e.g., the logistic regression algorithm |
6 References
Andrew Ng (吴恩达), Machine Learning, Coursera
Huang Haiguang (黄海广), machine learning course notes (机器学习笔记)