Univariate Linear Regression (with MATLAB Code)

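By the end of this section you should understand the following:

1. What linear regression is

2. What least squares estimation is

3. What a cost function is

4. The purpose of gradient descent

5. The gradient descent formula and the meaning of its parameters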

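Seeing the name "linear regression algorithm", let us first take it literally. "Linear" means a straight line, and "algorithm" means a set of mathematical formulas. The harder part is "regression": here it means predicting a continuous numeric value. So the purpose of linear regression is easy to guess: fit a straight line to the data and use that line to make predictions.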

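Let us start with an example that predicts the value of y from the value of x. For this we need a data set.

We need to build a model, and in linear regression the model is a straight line. With this line, we predict the value of y from the value of x.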

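Linear regression is a supervised learning algorithm: we supply data that already contain the correct answers, and the algorithm produces outputs close to them. "Regression" here means predicting an output value from previous data as accurately as possible (ideally identical to the true value, which is almost never achievable, so we aim to get as close as we can). What comes next is the most important and also the hardest step: building a mathematical model of the problem.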

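We use the following notation to describe this linear regression problem:

m is the number of training examples; in this example it is the number of (x, y) pairs

x is the feature / input variable

y is the target variable / output variable

(x, y) is a training example

( $x^i,y^i$ ) is the i-th training example (the superscript i is an index, not an exponent)

h is the hypothesis, the function we use to make predictions; in this example it is the straight line we are looking for.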

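With the mathematical model in place, how do we solve the prediction problem we started with? We assume a hypothesis h, feed x into h, and hope the result is as close to y as possible.

One possible form is $h_\theta(x)=\theta_0+\theta_1x$ . Because it contains only one input variable, this is called univariate linear regression.

To make the result as accurate as possible, we want $h_\theta(x)$ to be as close to y as possible. So the question at this step is how to choose a measure of the gap between $h_\theta(x)$ and y. In this example we use the mean squared error (there are other ways to measure the gap, and interested readers can explore them); solving for the parameters by minimizing the mean squared error is also known as the least squares method. The mean squared error is: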

$E=\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^i)-y^i)^2$

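To simplify later calculations (readers who work through the derivation themselves will see why: the factor 1/2 cancels the 2 that appears when the square is differentiated), we rescale the error by a constant, which does not change where the minimum lies, and obtain the cost function (please think carefully about what the cost function means; it is essential):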

$J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^i)-y^i)^2$
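As a small illustration of the two formulas above, the following sketch evaluates E and J for one candidate line. The three data points are made up for this illustration (they lie exactly on y = 1 + 2x) and are not the data set of this post; the names `h`, `E` and `J` are likewise just illustrative. Note that MATLAB indexes from 1, so `theta(1)` plays the role of $\theta_0$ and `theta(2)` the role of $\theta_1$.

```matlab
% Toy data lying exactly on y = 1 + 2x (hypothetical, for illustration only)
x = [1; 2; 3];
y = [3; 5; 7];
m = numel(y);                              % number of training examples

% Hypothesis h_theta(x) = theta_0 + theta_1 * x
h = @(theta, x) theta(1) + theta(2) * x;

theta = [0; 0];                            % candidate parameters: a flat line at 0
sq_err = (h(theta, x) - y).^2;             % squared error of each example

E = sum(sq_err) / m;                       % mean squared error
J = sum(sq_err) / (2 * m);                 % cost function, exactly half of E

fprintf('E = %g, J = %g\n', E, J);

% The line that generated the data has zero cost
J_exact = sum((h([1; 2], x) - y).^2) / (2 * m);
fprintf('J at theta = [1; 2]: %g\n', J_exact);
```

Since J is only E scaled by 1/2, any pair of parameters that minimizes one also minimizes the other.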

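To see more directly how $\theta_0,\theta_1$ affect J, we can plot J as a 3-D surface over $\theta_0,\theta_1$ (Part 4 of the code at the end of this post draws this plot):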

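Now let us ask what our problem actually is. The plot shows that different values of $\theta_0,\theta_1$ give different values of J. We want the predictions to be as close to y as possible, which means J should be as small as possible. So the problem is to find the pair $\theta_0,\theta_1$ whose J is the smallest among all the values of J we can reach.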
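For a problem this small, one way to cross-check any answer is MATLAB's backslash operator, which returns a least-squares solution when the system has more equations than unknowns. This is a different technique from the iterative method developed in the rest of this post and is shown only as a sanity check; the toy data and the names `theta_ls` and `J_ls` are assumptions made for the sketch.

```matlab
% Toy data lying exactly on y = 1 + 2x (hypothetical, for illustration only)
x = [1; 2; 3];
y = [3; 5; 7];
m = numel(y);

X = [ones(m, 1), x];          % column of ones multiplies theta_0, x multiplies theta_1

theta_ls = X \ y;             % least-squares solution of X * theta = y
J_ls = sum((X * theta_ls - y).^2) / (2 * m);   % cost at that solution

fprintf('theta_0 = %g, theta_1 = %g, J = %g\n', theta_ls(1), theta_ls(2), J_ls);
```

On this toy data the solution recovers the generating line (intercept 1, slope 2) up to rounding error, and the cost there is essentially zero.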

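Readers who are strong in math may already have thought of a way to solve this; readers who are not need not be discouraged, because the explanation that follows will lead you to the answer. For the situation in the 3-D plot, think of a similar scene from everyday life. The plot looks very much like standing somewhere on a mountain, perhaps at the summit, perhaps halfway up. Now ask yourself which quantity in the plot corresponds to the height of the mountain. On reflection, the height is the value of J (if you have forgotten what J means, go back and read it again; part of the joy of learning lies in revisiting what you already know). What we do now is look around from where we stand and walk downhill in small steps. After every step we look around again to make sure the next step still goes downhill (toward lower ground). We continue this way until we reach a point where, looking around, no direction leads any lower.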

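From the contour plot we can see that there is indeed a point lower than all the others. What does this mean? (I will not give the conclusion here; I hope readers think it through carefully, and keep thinking if it is not clear at first.)

Next we need to turn this walk down the mountain into a mathematical algorithm. In mathematics the idea is called gradient descent (at its core it is an application of partial derivatives; readers who want to dig into the underlying math can review the calculus chapters on derivatives).

The idea of gradient descent: start from a randomly chosen pair of parameters $\theta_0,\theta_1$ and compute the cost function J; then move to the nearby $\theta_0,\theta_1$ that decreases the cost function the most; repeat until a local minimum is reached.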

The update formula, applied for both j = 0 and j = 1, is:

$\theta_j=\theta_j-\alpha \frac{\partial }{\partial \theta_j}J(\theta_0,\theta_1)$
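To make the partial derivative in this update concrete, it can be worked out from the cost function J given earlier, assuming the hypothesis is the straight line $h_\theta(x)=\theta_0+\theta_1x$ . Differentiating each squared term with the chain rule produces a factor 2 that cancels the 1/2 in J:

$\frac{\partial J}{\partial \theta_0}=\frac{1}{2m}\sum_{i=1}^{m}2\,(h_\theta(x^i)-y^i)\cdot 1=\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^i)-y^i)$

$\frac{\partial J}{\partial \theta_1}=\frac{1}{2m}\sum_{i=1}^{m}2\,(h_\theta(x^i)-y^i)\cdot x^i=\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^i)-y^i)\,x^i$

Substituting these into the update formula gives the two updates that the MATLAB code at the end of this post carries out:

$\theta_0=\theta_0-\frac{\alpha}{m}\sum_{i=1}^{m}(h_\theta(x^i)-y^i)$

$\theta_1=\theta_1-\frac{\alpha}{m}\sum_{i=1}^{m}(h_\theta(x^i)-y^i)\,x^i$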

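Here $\alpha$ is the learning rate, which determines how large a step we take in the direction in which the cost function decreases. In gradient descent, $\theta_0,\theta_1$ must be updated simultaneously. Intuitively, when you move from one point to another, both coordinates of your position change together. Mathematically, both partial derivatives must be evaluated at the same current point; if $\theta_0$ were updated first and its new value were then used to compute the derivative for $\theta_1$ , the step would no longer follow the gradient at that point. So whenever we speak of gradient descent, the update of $\theta_0,\theta_1$ is understood to be simultaneous. As for the learning rate $\alpha$ , it sets the size of each downhill step: if it is too large, we may overshoot the minimum and can even fail to converge; if it is too small, the descent becomes very slow.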
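The effect of the learning rate can be seen in a small experiment. The sketch below runs the same simultaneous update with two step sizes on made-up toy data (three points on y = 1 + 2x, not the data set of this post); the values 0.1 and 1 and the names `alphas` and `J_hist` are chosen only for illustration.

```matlab
% Toy data lying exactly on y = 1 + 2x (hypothetical, for illustration only)
x = [1; 2; 3];
y = [3; 5; 7];
m = numel(y);
X = [ones(m, 1), x];                  % column of ones multiplies theta_0

alphas = [0.1, 1];                    % a moderate step and an overly large step
num_iters = 20;
J_hist = zeros(num_iters, numel(alphas));   % cost after each step, one column per alpha

for a = 1:numel(alphas)
    theta = zeros(2, 1);              % start from theta_0 = theta_1 = 0
    for iter = 1:num_iters
        err = X * theta - y;          % residuals at the current theta
        % Both components are updated from the same residuals (simultaneous update)
        theta = theta - alphas(a) / m * (X' * err);
        J_hist(iter, a) = sum((X * theta - y).^2) / (2 * m);
    end
end

fprintf('alpha = %.1f: J goes from %g to %g\n', alphas(1), J_hist(1, 1), J_hist(end, 1));
fprintf('alpha = %.1f: J goes from %g to %g\n', alphas(2), J_hist(1, 2), J_hist(end, 2));
```

On this toy data the cost shrinks steadily with the smaller step, while with the larger step every update overshoots the minimum and the cost grows instead of falling.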

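Finally, consider one question: what happens if we start at the lowest point? At the lowest point there is no lower point to move to. The partial derivatives there are zero, so the update leaves $\theta_0,\theta_1$ unchanged and gradient descent makes no further progress.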
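This can be checked numerically. In the sketch below, the toy data (made up for this illustration) lie exactly on y = 1 + 2x, so intercept 1 and slope 2 are the minimizer; the names `grad` and `theta_next` are illustrative.

```matlab
% Toy data lying exactly on y = 1 + 2x (hypothetical, for illustration only)
x = [1; 2; 3];
y = [3; 5; 7];
m = numel(y);
X = [ones(m, 1), x];                   % column of ones multiplies theta_0

theta = [1; 2];                        % start exactly at the minimum
alpha = 0.1;                           % any learning rate gives the same outcome here

grad = (X' * (X * theta - y)) / m;     % both partial derivatives of J at theta
theta_next = theta - alpha * grad;     % one gradient descent step

disp(grad');                           % both entries are zero
disp(theta_next');                     % theta is unchanged by the step
```

Because every residual is zero at the minimum, both partial derivatives vanish and the step has no effect, whatever the learning rate.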

That is linear regression as covered in this section. For this example, the result is as follows:

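You have now finished this section. Please go back to the top and check the learning goals to see whether you understand each one.

The data and code used in this example (the tool is MATLAB) are as follows:

Data (the first row below lists the x values; the second row, prefixed with y=, lists the y values):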

6.11010000000000 5.52770000000000 8.51860000000000 7.00320000000000 5.85980000000000 8.38290000000000 7.47640000000000 8.57810000000000 6.48620000000000 5.05460000000000 5.71070000000000 14.1640000000000 5.73400000000000 8.40840000000000 5.64070000000000 5.37940000000000 6.36540000000000 5.13010000000000 6.42960000000000 7.07080000000000 6.18910000000000 20.2700000000000 5.49010000000000 6.32610000000000 5.56490000000000 18.9450000000000 12.8280000000000 10.9570000000000 13.1760000000000 22.2030000000000 5.25240000000000 6.58940000000000 9.24820000000000 5.89180000000000 8.21110000000000 7.93340000000000 8.09590000000000 5.60630000000000 12.8360000000000 6.35340000000000 5.40690000000000 6.88250000000000 11.7080000000000 5.77370000000000 7.82470000000000 7.09310000000000 5.07020000000000 5.80140000000000 11.7000000000000 5.54160000000000 7.54020000000000 5.30770000000000 7.42390000000000 7.60310000000000 6.33280000000000 6.35890000000000 6.27420000000000 5.63970000000000 9.31020000000000 9.45360000000000 8.82540000000000 5.17930000000000 21.2790000000000 14.9080000000000 18.9590000000000 7.21820000000000 8.29510000000000 10.2360000000000 5.49940000000000 20.3410000000000 10.1360000000000 7.33450000000000 6.00620000000000 7.22590000000000 5.02690000000000 6.54790000000000 7.53860000000000 5.03650000000000 10.2740000000000 5.10770000000000 5.72920000000000 5.18840000000000 6.35570000000000 9.76870000000000 6.51590000000000 8.51720000000000 9.18020000000000 6.00200000000000 5.52040000000000 5.05940000000000 5.70770000000000 7.63660000000000 5.87070000000000 5.30540000000000 8.29340000000000 13.3940000000000 5.43690000000000

y=17.5920000000000 9.13020000000000 13.6620000000000 11.8540000000000 6.82330000000000 11.8860000000000 4.34830000000000 12 6.59870000000000 3.81660000000000 3.25220000000000 15.5050000000000 3.15510000000000 7.22580000000000 0.716180000000000 3.51290000000000 5.30480000000000 0.560770000000000 3.65180000000000 5.38930000000000 3.13860000000000 21.7670000000000 4.26300000000000 5.18750000000000 3.08250000000000 22.6380000000000 13.5010000000000 7.04670000000000 14.6920000000000 24.1470000000000 -1.22000000000000 5.99660000000000 12.1340000000000 1.84950000000000 6.54260000000000 4.56230000000000 4.11640000000000 3.39280000000000 10.1170000000000 5.49740000000000 0.556570000000000 3.91150000000000 5.38540000000000 2.44060000000000 6.73180000000000 1.04630000000000 5.13370000000000 1.84400000000000 8.00430000000000 1.01790000000000 6.75040000000000 1.83960000000000 4.28850000000000 4.99810000000000 1.42330000000000 -1.42110000000000 2.47560000000000 4.60420000000000 3.96240000000000 5.41410000000000 5.16940000000000 -0.742790000000000 17.9290000000000 12.0540000000000 17.0540000000000 4.88520000000000 5.74420000000000 7.77540000000000 1.01730000000000 20.9920000000000 6.67990000000000 4.02590000000000 1.27840000000000 3.34110000000000 -2.68070000000000 0.296780000000000 3.88450000000000 5.70140000000000 6.75260000000000 2.05760000000000 0.479530000000000 0.204210000000000 0.678610000000000 7.54350000000000 5.34360000000000 4.24150000000000 6.79810000000000 0.926950000000000 0.152000000000000 2.82140000000000 1.84510000000000 4.29590000000000 7.20290000000000 1.98690000000000 0.144540000000000 9.05510000000000 0.617050000000000

The code is as follows:

clear ; close all; clc


%% ======================= Part 2: Plotting =======================
fprintf('Plotting Data ...\n')
data = load('ex1data1.txt');
X = data(:, 1); y = data(:, 2);
m = length(y); % number of training examples

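% Plot Data (inlined here because plotData.m is not included in this post)
figure; % open a new figure window
plot(X, y, 'rx', 'MarkerSize', 10); % training data as red crosses
xlabel('Population in 10,000s'); ylabel('Profit in 10,000s');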

fprintf('Program paused. Press enter to continue.\n');
pause;

%% =================== Part 3: Gradient descent ===================
fprintf('Running Gradient Descent ...\n')

X = [ones(m, 1), data(:,1)]; % Add a column of ones to x
theta = zeros(2, 1); % initialize fitting parameters

% Some gradient descent settings
iterations = 1500;
alpha = 0.01;

% compute and display initial cost
computeCost(X, y, theta)

% run gradient descent
theta = gradientDescent(X, y, theta, alpha, iterations);

% print theta to screen
fprintf('Theta found by gradient descent: ');
fprintf('%f %f \n', theta(1), theta(2));

% Plot the linear fit
hold on; % keep previous plot visible
plot(X(:,2), X*theta, '-')
legend('Training data', 'Linear regression')
hold off % don't overlay any more plots on this figure

% Predict values for population sizes of 35,000 and 70,000
predict1 = [1, 3.5] *theta;
fprintf('For population = 35,000, we predict a profit of %f\n',...
    predict1*10000);
predict2 = [1, 7] * theta;
fprintf('For population = 70,000, we predict a profit of %f\n',...
    predict2*10000);

fprintf('Program paused. Press enter to continue.\n');
pause;

%% ============= Part 4: Visualizing J(theta_0, theta_1) =============
fprintf('Visualizing J(theta_0, theta_1) ...\n')

% Grid over which we will calculate J
theta0_vals = linspace(-10, 10, 100);
theta1_vals = linspace(-1, 4, 100);

% initialize J_vals to a matrix of 0's
J_vals = zeros(length(theta0_vals), length(theta1_vals));

% Fill out J_vals
for i = 1:length(theta0_vals)
    for j = 1:length(theta1_vals)
        t = [theta0_vals(i); theta1_vals(j)];
        J_vals(i,j) = computeCost(X, y, t);
    end
end


% Because of the way meshgrids work in the surf command, we need to 
% transpose J_vals before calling surf, or else the axes will be flipped
J_vals = J_vals';
% Surface plot
figure;
surf(theta0_vals, theta1_vals, J_vals)
xlabel('\theta_0'); ylabel('\theta_1');

% Contour plot
figure;
% Plot J_vals as 20 contours spaced logarithmically between 0.01 and 1000
contour(theta0_vals, theta1_vals, J_vals, logspace(-2, 3, 20))
xlabel('\theta_0'); ylabel('\theta_1');
hold on;
plot(theta(1), theta(2), 'rx', 'MarkerSize', 10, 'LineWidth', 2);


function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by 
%   taking num_iters gradient steps with learning rate alpha

% Initialize some useful values
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters

    % Perform a single gradient step on the parameter vector theta.
    % theta(1) and theta(2) must be updated simultaneously, so the
    % prediction error is computed once from the current theta and
    % reused for both components.
    err = X * theta - y;                                  % m-by-1 residuals
    theta(1) = theta(1) - alpha / m * sum(err);
    theta(2) = theta(2) - alpha / m * sum(err .* X(:,2));

    % Save the cost J in every iteration    
    J_history(iter) = computeCost(X, y, theta);

end
end


function J = computeCost(X, y, theta)
%COMPUTECOST Compute cost for linear regression
%   J = COMPUTECOST(X, y, theta) computes the cost of using theta as the
%   parameter for linear regression to fit the data points in X and y

% Initialize some useful values
m = length(y); % number of training examples

% Cost of a particular choice of theta: J = 1/(2m) * sum((h - y).^2)
% X is m-by-2 (a column of ones, then x); theta is 2-by-1
J = sum((X * theta - y).^2) / (2*m);

end

