Machine Learning, Part 1 (Gradient Descent)

Lecture notes and video

http://cs229.stanford.edu/notes/cs229-notes1.pdf
http://open.163.com/movie/2008/1/M/C/M6SGF6VB4_M6SGHFBMC.html

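1. Preface

I recently stumbled into Python, found that Ctrl+C and Ctrl+V worked wonderfully, and got increasingly cocky. So I boldly picked up a PDF on Python data analysis. Less than a hundred pages in, I was panicking: what on earth is all this? Realizing how shaky my fundamentals were, I searched for a machine learning course and skimmed through it. This post implements a few of the algorithms from the lecture notes. I am not sure everything here is correct, but it seems to work reasonably well.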

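2. Environment setup

Nothing fancy: Python 3.x with the numpy package. I use PyCharm: go to File -> Settings -> Project Interpreter, click the green + on the right, search for numpy, and click Install. If the install log reports errors, read the log and repeat the same steps to install whichever packages are missing.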

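3. The problem

This post uses linear regression as the example: given a number of (x, y) pairs, find b and a in y = b + ax, which determines the line. My grasp of the theory is still shallow, so I only give the formulas used to solve it. All of the code below can go into a single Python file. The code uses matrix and vector products to perform the summations; the difference between numpy's *, dot, and multiply is explained in a note at the end of this section.
The formulas are as follows (copied from the lecture notes). Assume the samples follow this linear model: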

$h(x) = \sum_{i=0}^n\theta_ix_i=\theta^Tx, (x_0=1)$
We want the value of $\theta$ that minimizes the cost function below, where i is the sample index and m is the total number of samples:

$J(\theta) = \frac{1}{2}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$
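To make the cost function concrete, here is a minimal sketch that evaluates J(θ) with plain numpy arrays. The function name `cost` and the variable names are my own choices for illustration; the data is the toy set from the test section of this post.

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2 for an m x n design matrix X."""
    residual = X @ theta - y          # h_theta(x^(i)) - y^(i) for every sample
    return 0.5 * float(residual @ residual)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # first column is x_0 = 1
y = np.array([3.0, 5.0, 7.0])

print(cost(np.array([1.0, 2.0]), X, y))  # 0.0, because y = 1 + 2x fits exactly
print(cost(np.array([2.0, 3.0]), X, y))  # 14.5, the cost at the starting guess [2, 3]
```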
Gradient descent solves this by repeatedly applying the following update, where $\alpha$ is the learning rate:

$\theta_j=\theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)$

For a single training example (x, y), the partial derivative is:

$\frac{\partial}{\partial\theta_j}J(\theta) = (h_\theta(x)-y)x_j$
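One way to gain some confidence in the derivative is to compare it with a finite-difference estimate. The sketch below sums the single-example formula over all samples and checks it against central differences; `analytic_gradient`, `numeric_gradient`, and `eps` are names I made up for this illustration.

```python
import numpy as np

def cost(theta, X, y):
    residual = X @ theta - y
    return 0.5 * float(residual @ residual)

def analytic_gradient(theta, X, y):
    # Sum over all samples of (h_theta(x^(i)) - y^(i)) * x^(i)
    return X.T @ (X @ theta - y)

def numeric_gradient(theta, X, y, eps=1e-6):
    # Central difference in each coordinate of theta
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)
    return grad

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])
theta = np.array([2.0, 3.0])

print(analytic_gradient(theta, X, y))  # residuals are [2, 3, 4], so this is [9, 20]
print(numeric_gradient(theta, X, y))   # should agree up to rounding error
```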

(1) Batch gradient descent

Pseudocode:
Repeat until convergence {

$\theta_j = \theta_j + \alpha\sum_{i=1}^m(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)} \quad (\text{for every } j)$
}
Code:

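import numpy as np

# Hypothesis h(x) = theta^T x, evaluated for every row of example_x at once.
# theta is 1 x n and example_x is m x n, so the result is 1 x m.
def h(theta, example_x):
    return theta * example_x.T

def batch_gradient_descent(x, y, theta0, alpha, iterator):
    example_x = np.matrix(x)                 # m x n, one sample per row
    label_y = np.matrix(y)                   # 1 x m
    theta = np.matrix(theta0, dtype=float)   # 1 x n
    # iterator is the number of passes over the whole data set
    for i in range(iterator):
        error = (label_y - h(theta, example_x))   # 1 x m: y - h(x) for every sample
        sum_gradient = error * example_x          # 1 x n: sum over samples of error * x_j
        theta = theta + alpha * sum_gradient
    return theta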
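The pseudocode says "repeat until convergence", while the function above runs a fixed number of passes. As a rough sketch of the former, the version below stops once the update step becomes tiny. It uses plain numpy arrays instead of np.matrix, and `tol`, `max_iter`, and the function name are assumptions of mine, not something from the lecture notes.

```python
import numpy as np

def batch_gd_until_converged(X, y, theta0, alpha, tol=1e-8, max_iter=10000):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    theta = np.asarray(theta0, dtype=float).copy()
    n_iter = 0
    while n_iter < max_iter:
        # Same update as the pseudocode: theta += alpha * sum_i (y_i - h(x_i)) * x_i
        step = alpha * (X.T @ (y - X @ theta))
        theta = theta + step
        n_iter += 1
        if np.linalg.norm(step) < tol:   # treat a tiny step as "converged"
            break
    return theta, n_iter

theta, n_iter = batch_gd_until_converged(
    [[1, 1], [1, 2], [1, 3]], [3, 5, 7], [2, 3], alpha=0.1)
print(theta, n_iter)  # theta should be very close to [1, 2]
```

The step norm is only one possible stopping rule; watching the change in J(θ) between passes would work as well.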

(2) Stochastic gradient descent

Pseudocode:
Loop {
  for i = 1 to m {

$\theta_j = \theta_j + \alpha(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)} \quad (\text{for every } j)$
  }
}
Code:

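def stochastic_gradient_descent(x, y, theta0, alpha, iterator):
    example_x = np.matrix(x)
    label_y = np.matrix(y)
    theta = np.matrix(theta0, dtype=float)
    m, n = np.shape(example_x)   # m samples, n features (including x_0)
    for i in range(iterator):    # iterator passes over the data
        for j in range(m):       # update theta after every single sample
            # h(...) is a 1 x 1 matrix here, so multiplying by the 1 x n row gives 1 x n
            gradient = (label_y[0, j] - h(theta, example_x[j])) * example_x[j]
            theta = theta + alpha * gradient
    return theta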
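The function above visits the samples in the same order on every pass, exactly as in the pseudocode. A common variation, which is not from the lecture notes, is to shuffle the order on each pass. Here is a minimal sketch with plain numpy arrays; `shuffled_sgd`, `epochs`, and `seed` are hypothetical names.

```python
import numpy as np

def shuffled_sgd(X, y, theta0, alpha, epochs, seed=0):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    theta = np.asarray(theta0, dtype=float).copy()
    rng = np.random.default_rng(seed)        # seeded so runs are repeatable
    for _ in range(epochs):
        for i in rng.permutation(len(y)):    # new random sample order on each pass
            error = y[i] - X[i] @ theta      # scalar: y_i - h_theta(x_i)
            theta = theta + alpha * error * X[i]
    return theta

theta = shuffled_sgd([[1, 1], [1, 2], [1, 3]], [3, 5, 7], [2, 3], alpha=0.1, epochs=500)
print(theta)  # approaches [1, 2] on this toy data
```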

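(3) Locally weighted linear regression

This is linear regression in which every sample carries a weight; I will call the solver weighted gradient descent. I am not sure my version is correct: the lecture notes give no concrete steps for it, and I could not find any elsewhere. It probably should not be compared directly with the two algorithms above either, since the use case is different, but I include it here anyway.
The objective is below. I added the weight w to the update in (2); the steps of the algorithm are otherwise the same as in (2).

$\text{Fit } \theta \text{ to minimize } \sum_{i}w^{(i)}(y^{(i)}-\theta^Tx^{(i)})^2$

$w^{(i)}=\exp\left(-\frac{(x^{(i)}-x)^2}{2\tau^2}\right)$
In the lecture notes, x is the query point at which a prediction is wanted. For now I simply take x to be the column-wise mean of all samples X.
Code: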

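# Weight from the formula above; t is the bandwidth tau.
# The squared difference is taken per feature, so the result is a 1 x n row.
# The parentheses matter: "/ 2 * t * t" would divide by 2 and then multiply by t * t.
def w(xi, ex, t):
    return np.exp(-np.multiply((xi - ex), (xi - ex)) / (2 * t * t))

# locally weighted linear regression algorithm
def locally_gradient_descent(x, y, theta0, alpha, iterator, t):
    example_x = np.matrix(x)
    label_y = np.matrix(y)
    theta = np.matrix(theta0, dtype=float)
    m, n = np.shape(example_x)
    ex = np.mean(example_x, axis=0)   # column-wise mean, used as the reference point x
    for i in range(iterator):
        for j in range(m):
            wj = w(example_x[j], ex, t)   # 1 x n weights for sample j
            # Same per-sample update as in (2), scaled element-wise by the weights
            gradient = np.multiply(wj, (label_y[0, j] - h(theta, example_x[j])) * example_x[j])
            theta = theta + alpha * gradient
    return theta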
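For comparison, here is a sketch of how I understand the method in the lecture notes: the weights depend on a query point, each sample gets one scalar weight, and θ is refitted for every query. This sketch solves the weighted least-squares problem in closed form through the weighted normal equations rather than by gradient descent, which is a different technique from the function above. The name `lwr_predict` and its parameters are my own.

```python
import numpy as np

def lwr_predict(X, y, x_query, tau):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # One scalar weight per sample: w_i = exp(-||x_i - x_query||^2 / (2 tau^2))
    sq_dist = np.sum((X - x_query) ** 2, axis=1)
    weights = np.exp(-sq_dist / (2.0 * tau ** 2))
    # Weighted normal equations: (X^T W X) theta = X^T W y
    XtW = X.T * weights               # scales column i of X.T by w_i
    theta = np.linalg.solve(XtW @ X, XtW @ y)
    return float(x_query @ theta), theta

X = [[1, 1], [1, 2], [1, 3]]
y = [3, 5, 7]
prediction, theta = lwr_predict(X, y, x_query=[1, 2.5], tau=2.0)
print(prediction, theta)  # the data is exactly linear, so this is about 6 and [1, 2]
```

On exactly linear data the weights do not change the fit; they only start to matter when the relationship between x and y is curved.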

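Note: for numpy's matrix type, the * operator is the same as dot: both perform matrix multiplication, so the number of columns on the left must match the number of rows on the right. multiply multiplies element by element, with broadcasting. For example, for the 1 x 2 matrix [1,2] and the 2 x 1 matrix [[1],[2]], both * and numpy.dot give [[5]], while numpy.multiply gives [[1,2],[2,4]]. For plain numpy arrays (ndarray), * is element-wise instead.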
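The note above can be checked directly. This small sketch also shows the ndarray case, where `*` broadcasts element-wise and `@` is the matrix product; the variable names are arbitrary.

```python
import numpy as np

row = np.matrix([1, 2])        # shape (1, 2)
col = np.matrix([[1], [2]])    # shape (2, 1)

print(row * col)               # matrix product: [[5]]
print(np.dot(row, col))        # same as *: [[5]]
print(np.multiply(row, col))   # element-wise with broadcasting: 2 x 2 matrix [[1, 2], [2, 4]]

a = np.array([1, 2])           # plain ndarray, shape (2,)
b = np.array([[1], [2]])       # plain ndarray, shape (2, 1)

print(a * b)                   # element-wise on ndarray: 2 x 2 array [[1, 2], [2, 4]]
print(a @ b)                   # matrix product on ndarray: [5]
```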

4. Test results

The data clearly follows y = 1 + 2x:

x = [[1, 1], [1, 2], [1, 3]]
y = [3, 5, 7]
theta0 = [2, 3]
print(batch_gradient_descent(x=x, y=y, theta0=theta0, alpha=0.1, iterator=50))
print(stochastic_gradient_descent(x=x, y=y, theta0=theta0, alpha=0.1, iterator=50))
print(locally_gradient_descent(x=x, y=y, theta0=theta0, alpha=0.1, iterator=50, t=2))

The results ( $\theta_0$ and $\theta_1$, one row per algorithm) are below. Pretty close to 1 and 2, aren't they?
[[ 1.0748101 1.96709091]]
[[ 1.04498802 1.98500399]]
[[ 0.99706172 2.0013277 ]]
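As a sanity check on these numbers, the same toy data can be fitted with an ordinary least-squares solver. This is only a sketch for comparison and is not part of the three algorithms above; it relies on `numpy.linalg.lstsq`.

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])

# lstsq returns the solution, the residual sum, the rank, and the singular values
theta, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print(theta)  # approximately [1, 2], the values the iterative methods approach
```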