notes
General procedure for solving linear regression problems
1 Collect data (x1, x2, ..., xn, y): n feature variables xi and one target value y
2 Design the regression hypothesis h = Sigma(theta_i * x_i)
3 Define an error measure using least squares: J(theta) = 1/2 * Sigma (h(x) - y)^2
4 Find a set of theta that minimizes J
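The four steps above can be sketched in Python. This is a minimal illustration with a made-up two-feature dataset; here theta is picked by hand rather than fitted, just to show how h and J are evaluated:

```python
# Step 1: collected data; each entry is (features [x1, x2], target y).
# Made-up numbers chosen to lie exactly on y = 1*x1 + 2*x2.
data = [
    ([1.0, 2.0], 5.0),
    ([2.0, 0.0], 2.0),
    ([0.0, 1.0], 2.0),
]

# Step 2: the regression hypothesis h(x) = Sigma theta_i * x_i
# (an intercept term can be added by prepending x0 = 1 to every x).
def h(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))

# Step 3: least-squares cost J(theta) = 1/2 * Sigma (h(x) - y)^2
def J(theta, data):
    return 0.5 * sum((h(theta, x) - y) ** 2 for x, y in data)

# Step 4 (the goal): find theta minimizing J; for this data
# theta = [1, 2] gives exact predictions, so J = 0.
print(J([1.0, 2.0], data))  # -> 0.0
```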
One of the ways to implement step 4 is gradient descent
1 Initialize theta
2 Adjust theta by subtracting the gradient times the learning rate
Two strategies for adjusting theta based on the samples
1 Compute J from all the samples on every step: batch gradient descent
2 Split the samples into groups and update theta once per group: stochastic gradient descent (with one sample per group this is the classic form)
Method 2 of implementing step 4: solve a system of equations (the normal equations)
Record
1.linear regression
Supervised learning; example: autonomous driving
First a human drives and the AI learns (from the road conditions and the human's steering-wheel movements); then the AI drives
The first supervised learning algorithm
House price prediction
Data: the size of each house and its price (plus the number of bedrooms)
Notation
m: the number of training examples
x: the input features; here x1 is the size of the house and x2 is the number of bedrooms
y: the output variable, i.e. the target variable
(x, y): one training example
(x(i), y(i)): the i-th training example
n: the number of input features
theta: the parameters (coefficients), all real numbers
Design flow of supervised learning
Find a training set
Choose a learning algorithm
The algorithm outputs a hypothesis function h
Use h to predict the output for new inputs
We assume h(x) = theta0*x0 + theta1*x1 + theta2*x2 // where x0 = 1
h(x) = Sigma theta_i * x_i
What we need to do is choose theta so the predictions are as accurate as possible, i.e. make J(theta) = 1/2 * Sigma (h(x(i)) - y(i))^2 as small as possible
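The gradient descent update that follows uses the partial derivative of J. For a single training example, differentiating the squared error by the chain rule gives (a standard derivation, written here in LaTeX):

```latex
\frac{\partial}{\partial \theta_i} J(\theta)
  = \frac{\partial}{\partial \theta_i}\,\frac{1}{2}\bigl(h_\theta(x) - y\bigr)^2
  = \bigl(h_\theta(x) - y\bigr)\,\frac{\partial}{\partial \theta_i} h_\theta(x)
  = \bigl(h_\theta(x) - y\bigr)\, x_i
```

since h_theta(x) = Sigma_j theta_j * x_j, so its partial derivative with respect to theta_i is just x_i.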
The first way to choose theta: a search algorithm
start with some value of the parameter vector theta (it can be all zeros)
then keep changing the parameter vector theta
to reduce J(theta) a little bit
2.gradient descent can implement the above algorithm
batch gradient descent
on every step of gradient descent you look at your entire training set
this is the 3D shape of J(theta), like a hill in some park.
So imagine you are physically standing at the position of that star,
or cross, on the hill. Look all 360 degrees around you and ask:
if I were to take a small step, what would allow me to
go downhill the most?
If you try again from a new starting point,
you may end up at a completely different local optimum;
where gradient descent ends up can sometimes
depend on where you initialize your parameters.
theta_i := theta_i - alpha * (partial derivative of J(theta) with respect to theta_i)
theta_i := theta_i - alpha * (h(x) - y) * x_i
// alpha is a parameter of the algorithm called the learning rate;
// it controls how large a step you take
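The update rule above can be sketched as batch gradient descent on a toy one-parameter problem. This is a minimal illustration, not the lecture's code; the data and learning rate are made up (the points lie exactly on y = 2*x):

```python
# Batch gradient descent for least squares: every update sums the
# gradient over the WHOLE training set.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy (x, y) pairs on y = 2*x
theta = 0.0        # single parameter, initialized to 0
alpha = 0.05       # learning rate, chosen by hand

for _ in range(200):
    # gradient of J(theta) = 1/2 * sum (theta*x - y)^2 over ALL samples
    grad = sum((theta * x - y) * x for x, y in data)
    theta -= alpha * grad

print(round(theta, 4))  # -> 2.0
```

If alpha is too large here the iteration diverges; too small and it converges slowly, which is exactly the "how large a step" tradeoff the note mentions.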
Question: why is the direction of steepest descent given by the partial derivatives? (The negative gradient is the direction of steepest descent.)
stochastic (incremental) gradient descent
Repeat until convergence {
  For j = 1 to m {                       // for each training example j
    for i = 0 to n {                     // update every parameter theta_i
      theta_i := theta_i - alpha * (h(x(j)) - y(j)) * x_i(j)
    }
  }
}
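The loop above can be sketched on the same toy one-parameter, made-up data as before; the point is that theta is updated after each individual example rather than after a full pass:

```python
# Stochastic (incremental) gradient descent: update theta after
# looking at ONE training example at a time, in contrast to batch
# gradient descent, which sums over the whole set per step.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy (x, y) pairs on y = 2*x
theta = 0.0
alpha = 0.05

for _ in range(200):           # repeat until (approximate) convergence
    for x, y in data:          # for each training example j
        theta -= alpha * (theta * x - y) * x   # update using example j only

print(round(theta, 4))  # -> 2.0
```

Because this toy data is noiseless (every example agrees on theta = 2), the iterates settle at the minimum; with noisy data they would instead keep oscillating around it, as the note below says.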
in order to start learning, in order to start modifying the parameters,
you only need to look at your first training example:
you perform an update using the derivative of the error with respect to
just that first training example,
and then you look at your second training example
for large data sets, stochastic gradient descent is often much faster.
The tradeoff is that stochastic gradient descent won't actually
converge to the global minimum exactly; it oscillates around it.
3.the normal equations
Notation
J: J is a function of the parameter vector theta
Define the derivative (gradient) of J: itself a vector in n+1 dimensions, whose i-th component is the partial derivative of J with respect to theta_i
theta := theta - alpha * (derivative of J)
you have a function f
f: R(m*n) -> R, with A belonging to R(m*n)
the derivative of f with respect to A is an m*n matrix whose (i,j) entry is the partial derivative of f with respect to Aij
if A is an n by n matrix,
define the trace of A to be the sum of A's diagonal elements
fact
tr AB=tr BA
tr ABC=tr CAB =tr BCA
if f(A) = tr AB,
the derivative with respect to the matrix A of this function trace AB
is B transposed
tr A = tr (A transposed)
if a is a real number (a 1x1 matrix), tr a = a
derivative with respect to A of tr(A B A(t) C) = CAB + C(t)AB(t) // A(t) is the transpose of A
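The trace facts above are easy to sanity-check numerically. A minimal sketch with hand-picked 2x2 matrices and plain lists of lists (no libraries assumed):

```python
# Numeric check of two trace facts: tr AB = tr BA, and
# d/dA tr(AB) = B transposed (checked entrywise by finite differences).

def matmul(A, B):
    # plain matrix product of lists of lists
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.0, 1.0], [5.0, -2.0]]

# Fact: tr AB = tr BA
print(trace(matmul(A, B)) == trace(matmul(B, A)))  # -> True

# Fact: the gradient of f(A) = tr AB with respect to A is B transposed,
# so entry (0, 1) of the gradient should equal B[1][0].
eps = 1e-6
A2 = [row[:] for row in A]
A2[0][1] += eps
fd = (trace(matmul(A2, B)) - trace(matmul(A, B))) / eps
print(abs(fd - B[1][0]) < 1e-4)  # -> True
```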
X = [x(1)t, x(2)t, ..., x(m)t]t // t denotes transpose; x(i) is the i-th sample
X*theta = [h(x(1)), h(x(2)), ..., h(x(m))]t
Y = [y(1), y(2), ..., y(m)]t
take the inner product of X*theta - Y with itself:
(1/2) * (X*theta - Y)t * (X*theta - Y) = J(theta)
set the derivative of J(theta) with respect to theta to 0 and solve for theta
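Setting the derivative to zero gives the normal equations (X t X) theta = X t Y, i.e. theta = (X t X)^(-1) X t Y. A minimal check on a made-up two-parameter example (intercept plus one feature, points lying exactly on y = 1 + 2*x), solving the resulting 2x2 system directly:

```python
# Normal equations: solve (X^T X) theta = X^T y on tiny data.
# First column of X is the intercept term x0 = 1.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [3.0, 5.0, 7.0]          # lies exactly on y = 1 + 2*x

# Build the 2x2 system (X^T X) theta = X^T y.
XtX = [[sum(X[k][i] * X[k][j] for k in range(3)) for j in range(2)]
       for i in range(2)]
Xty = [sum(X[k][i] * y[k] for k in range(3)) for i in range(2)]

# Solve the 2x2 system by Cramer's rule.
det = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0]
theta0 = (Xty[0] * XtX[1][1] - XtX[0][1] * Xty[1]) / det
theta1 = (XtX[0][0] * Xty[1] - Xty[0] * XtX[1][0]) / det
print(theta0, theta1)  # -> 1.0 2.0
```

Unlike gradient descent, this gives theta in one step with no learning rate, at the cost of solving a linear system (which requires X t X to be invertible).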