Summer Machine Learning Study Group, Model Implementations: Linear Regression

For this exercise we start with a relatively simple case: linear regression with multiple variables.

The data comes from the web: a data set of 28 rows and 13 columns (a row index, 11 features, and the selling price). Since this is a first attempt, I deliberately picked a small one.

A detailed description of the data set is attached at the end of this post. In short, our task is to fit a linear regression model that matches the known data well, and then use it to predict the selling price for new data.
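
Concretely, writing $x_0, \dots, x_{11}$ for the twelve input columns of a sample (the code below feeds the row index in as $x_0$) and $m = 27$ for the number of training samples, the hypothesis and the cost function to be minimized are

$$h_\theta(x) = \sum_{j=0}^{11} \theta_j x_j, \qquad J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2.$$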

Without further ado, here is the code:

//
//  main.cpp
//  linear_regression_house_price
//
//  Created by zsp on 2018/7/14.
//  Copyright © 2018 zsp. All rights reserved.
//

#include <iostream>
#include <fstream>
#include <cmath>
using namespace std;

const int SAMPLE = 27;
const int PARAMETER = 12;

double hypoVal(double para[], double fea[], int count);
double costVal(double para[], double lab[], int amount, double allX[][PARAMETER + 1]);

int main()
{
    //file I/O: read the data set in
    fstream infile;
    infile.open("data.txt");
    if (!infile)
    {
        cout << "can't open file!" << endl;
        return -1;
    }
    //store the data in a two-dimensional array
    double X[SAMPLE + 1][PARAMETER + 1];    //feature matrix: X[i][0] is the row index, X[i][PARAMETER] the price
    for (int i = 0; i < SAMPLE + 1; i++)
    {
        for (int j = 0; j < PARAMETER + 1; j++)
        {
            infile >> X[i][j];
        }
    }
    infile.close();
    //sanity check: print the data to verify it was read
    for (int i = 0; i < SAMPLE + 1; i++)
    {
        for (int j = 0; j < PARAMETER + 1; j++)
            cout << X[i][j] << " ";
        cout << endl;
    }
    //feature scaling would go here (not applied in this version; see the sketch after the listing)
    
    //hold out the last sample: training set X[0]~X[26], test set X[27]
    double y[SAMPLE] = { 0 };      //labels vector
    double theta[PARAMETER] = { 0 };   //parameters vector
    double a = 0.0001;    //set learning rate as 0.0001
    int cnt = 0;        //to count the times of loop
    for (int i = 0; i < SAMPLE; i++)
    {
        y[i] = X[i][PARAMETER];
    }
    double h = hypoVal(theta, X[0], PARAMETER); //hypothesis value (theta is all zeros here, so h == 0)
    double cost = costVal(theta, y, SAMPLE, X); //cost function
    
    //gradient descent to find theta
    double temp[PARAMETER] = { 0 };    //used for simultaneously updating theta parameters
    double der[PARAMETER] = { 0 };  //the derivative term
    double tempCost = 0;       //break the loop when tempCost - cost is small enough
    do {
        tempCost = cost;
        double sum = 0;
        for (int j = 0; j < PARAMETER; j++)
        {
            for (int i = 0; i < SAMPLE; i++)
            {
                sum += (hypoVal(theta, X[i], PARAMETER) - y[i]) * X[i][j];
            }
            der[j] = (1.0 / double(SAMPLE)) * sum;  //average the gradient over the m = SAMPLE training samples
            temp[j] = theta[j] - a * der[j];
            sum = 0;
        }
        cout << "now the theta parameters are: ";
        for (int i = 0; i < PARAMETER; i++)
        {
            theta[i] = temp[i];
            cout << theta[i] << " ";
        }
        cout << endl;
        cost = costVal(theta, y, SAMPLE, X);        //new cost value
        cnt++;
    } while (tempCost - cost > 0.00001);
    
    //evaluate on the held-out test sample
    cout << "gradient descent ran for " << cnt << " iterations" << endl;
    h = hypoVal(theta, X[SAMPLE], PARAMETER);   //X[SAMPLE] is the held-out 28th row
    cout << "prediction for the test sample: " << h << ", true value: " << X[SAMPLE][PARAMETER] << endl;
    return 0;
}

double hypoVal(double para[], double fea[], int count)  //the value of the hypothesis function: the dot product of theta and the feature vector
{
    double hy = 0;
    for (int i = 0; i < count; i++)
    {
        hy += para[i] * fea[i];
    }
    return hy;
}
double costVal(double para[], double lab[], int amount, double allX[][PARAMETER + 1])    //the value of the cost function J(theta)
{
    double sum = 0;
    for (int i = 0; i < amount; i++)
    {
        sum += pow(hypoVal(para, allX[i], PARAMETER) - lab[i], 2);
    }
    double cost = sum / (2.0 * amount);     //J = (1/2m) * sum of squared errors
    cout << "costVal now is : " << cost << endl;
    return cost;
}
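
The feature-scaling step in main was left empty. The feature columns have very different ranges (the site area runs up to 12.8 while the fireplace count is 0 or 1), which is part of why such a tiny learning rate is needed. Below is a minimal mean-normalization sketch reusing the constants above; scaleFeatures is a hypothetical helper, not part of the original program:

void scaleFeatures(double X[][PARAMETER + 1], int rows)  //hypothetical: x := (x - mean) / (max - min), per column
{
    for (int j = 0; j < PARAMETER; j++)     //scale every hypothesis input; leave the label column alone
    {
        double lo = X[0][j], hi = X[0][j], mean = 0;
        for (int i = 0; i < rows; i++)
        {
            if (X[i][j] < lo) lo = X[i][j];
            if (X[i][j] > hi) hi = X[i][j];
            mean += X[i][j];
        }
        mean /= rows;
        if (hi == lo) continue;             //constant column: nothing to scale
        for (int i = 0; i < rows; i++)
            X[i][j] = (X[i][j] - mean) / (hi - lo);
    }
}

Calling scaleFeatures(X, SAMPLE + 1) at the empty comment would normalize all rows at once; strictly, the mean and range should be computed from the 27 training rows only and then applied to the test row as well.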

Debugging notes:

After reading the data in, note that the last row must not be used for training, and that the last column is not a feature but the true label.

At first the learning rate was 0.01; the step size was too large, so the cost function kept growing and gradient descent could not converge to the optimum.
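
This is the expected failure mode of too large a step. Each pass of the loop performs the simultaneous update

$$\theta_j := \theta_j - \alpha \cdot \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)},$$

and when the learning rate $\alpha$ is too large the update overshoots the minimum, so $J(\theta)$ grows from one iteration to the next instead of shrinking.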

With the learning rate set to 0.0001 the program runs correctly and prints the following:

costVal now is : 824.024
now the theta parameters are: 0.00856167 0.0688921 0.0120104 0.0609822 0.0146573 0.0122017 0.0602742 0.0298192 0.29996 0.0196875 0.0104292 0.0032625 
...
...
costVal now is : 5.12645
now the theta parameters are: 1.22471 3.71067 2.97314 0.151653 2.89371 0.524275 -0.456355 0.777262 -0.0155349 0.828632 -0.47861 1.85199 
...
...
costVal now is : 4.53654
now the theta parameters are: 2.27722 3.36399 4.38813 0.114215 4.16421 0.756403 -1.04117 1.08051 -0.0204399 0.823736 0.0122513 2.3339 
costVal now is : 4.53653
gradient descent ran for 95091 iterations
prediction for the test sample: 47.874, true value: 45.8

Analysis:

So after 95091 iterations of gradient descent we have obtained a linear regression model that fits the data well: the cost function has dropped to 4.53653, which is already quite small. The loop condition also tells us that the cost is now decreasing extremely slowly, so we can regard the parameters as close to the optimum.

The final prediction of 47.874 against the true value of 45.8 is a relative error of only about 4.5%, which suggests the model generalizes reasonably well too.

Shortcomings:

Because the data set, and in particular the test set, is so small, these results are not very convincing: the model is quite sensitive to changes in the data, so we cannot be certain of its real performance. One remedy is sketched below.
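
With only 28 rows, true leave-one-out cross-validation would make better use of the data: train 28 times, each time holding out a different row, and average the squared test errors. Here is a sketch reusing hypoVal and the constants above; trainTheta is a hypothetical helper that wraps the gradient-descent loop from main:

double looError(double X[][PARAMETER + 1], int rows)
{
    double totalErr = 0;
    for (int k = 0; k < rows; k++)          //hold out row k as the test sample
    {
        double train[SAMPLE + 1][PARAMETER + 1];
        int n = 0;
        for (int i = 0; i < rows; i++)      //copy every other row into the training matrix
        {
            if (i == k) continue;
            for (int j = 0; j <= PARAMETER; j++)
                train[n][j] = X[i][j];
            n++;
        }
        double theta[PARAMETER] = { 0 };
        trainTheta(train, n, theta);        //hypothetical: runs gradient descent as in main
        double err = hypoVal(theta, X[k], PARAMETER) - X[k][PARAMETER];
        totalErr += err * err;
    }
    return totalErr / rows;                 //mean squared prediction error over all held-out rows
}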

Problems encountered:

When reading from the text file, at first no data could be read in at all. For the concrete solution, see my other post:

https://blog.csdn.net/ezio23/article/details/81068667
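
Without repeating that post here: note that data.txt as distributed (see the appendix below) begins with # comment lines and column labels, and infile >> X[i][j] fails as soon as it hits the first non-numeric character. If that is the cause, one guard (readRows is a hypothetical helper, not necessarily the fix from the linked post) is to filter the lines before parsing:

#include <fstream>
#include <sstream>
#include <string>
#include <vector>
using namespace std;

vector< vector<double> > readRows(const string& path)
{
    ifstream in(path);
    vector< vector<double> > rows;
    string line;
    while (getline(in, line))
    {
        if (line.empty() || line[0] == '#') continue;   //skip the comment header
        istringstream ss(line);
        vector<double> row;
        double v;
        while (ss >> v) row.push_back(v);
        if (row.size() == 13) rows.push_back(row);      //keep only full 13-value data rows
    }
    return rows;
}

The size check also discards the short label lines such as "13 columns" and "Index" that follow the comment block.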

Appendix: description of the data set:

#  data.txt
#
#  Reference:
#
#    S C Narula, J F Wellington,
#    Linear Regression and the Minimum Sum of Relative Errors,
#    Technometrics, Volume 19, 1977, pages 185-190.
#
#    Helmut Spaeth,
#    Mathematical Algorithms for Linear Regression,
#    Academic Press, 1991,
#    ISBN 0-12-656460-4.
#
#  Discussion:
#
#    The selling price of houses is to be represented as a function of
#    other variables.
#
#    There are 28 rows of data.  The data includes:
#
#      I,   the index;
#      A1,  the local selling prices, in hundreds of dollars;
#      A2,  the number of bathrooms;
#      A3,  the area of the site in thousands of square feet;
#      A4,  the size of the living space in thousands of square feet;
#      A5,  the number of garages;
#      A6,  the number of rooms;
#      A7,  the number of bedrooms;
#      A8,  the age in years;
#      A9,  1 = brick, 2 = brick/wood, 3 = aluminum/wood, 4 = wood.
#      A10, 1 = two story, 2 = split level, 3 = ranch
#      A11, number of fire places.
#      B,   the selling price.
#
#    We seek a model of the form:
#
#    B = A1 * X1 + A2 * X2 + A3 * X3 + A4 * X4 + A5 * X5 + A6 * X6 + A7 * X7
#      + A8 * X8 + A9 * X9 + A10 * X10 + A11 * X11
#
13 columns
28 rows
Index
A1, the local selling prices, in hundreds of dollars;
A2, the number of bathrooms;
A3, the area of the site in thousands of square feet;
A4, the size of the living space in thousands of square feet;
A5, the number of garages;
A6, the number of rooms;
A7, the number of bedrooms;
A8, the age in years;
A9, construction type
A10, architecture type
A11, number of fire places.
B, selling price 
 1   4.9176  1.0   3.4720  0.998   1.0   7  4  42  3  1  0  25.9
 2   5.0208  1.0   3.5310  1.500   2.0   7  4  62  1  1  0  29.5
 3   4.5429  1.0   2.2750  1.175   1.0   6  3  40  2  1  0  27.9
 4   4.5573  1.0   4.0500  1.232   1.0   6  3  54  4  1  0  25.9
 5   5.0597  1.0   4.4550  1.121   1.0   6  3  42  3  1  0  29.9
 6   3.8910  1.0   4.4550  0.988   1.0   6  3  56  2  1  0  29.9
 7   5.8980  1.0   5.8500  1.240   1.0   7  3  51  2  1  1  30.9
 8   5.6039  1.0   9.5200  1.501   0.0   6  3  32  1  1  0  28.9
 9  16.4202  2.5   9.8000  3.420   2.0  10  5  42  2  1  1  84.9
10  14.4598  2.5  12.8000  3.000   2.0   9  5  14  4  1  1  82.9
11   5.8282  1.0   6.4350  1.225   2.0   6  3  32  1  1  0  35.9
12   5.3003  1.0   4.9883  1.552   1.0   6  3  30  1  2  0  31.5
13   6.2712  1.0   5.5200  0.975   1.0   5  2  30  1  2  0  31.0
14   5.9592  1.0   6.6660  1.121   2.0   6  3  32  2  1  0  30.9
15   5.0500  1.0   5.0000  1.020   0.0   5  2  46  4  1  1  30.0
16   5.6039  1.0   9.5200  1.501   0.0   6  3  32  1  1  0  28.9
17   8.2464  1.5   5.1500  1.664   2.0   8  4  50  4  1  0  36.9
18   6.6969  1.5   6.9020  1.488   1.5   7  3  22  1  1  1  41.9
19   7.7841  1.5   7.1020  1.376   1.0   6  3  17  2  1  0  40.5
20   9.0384  1.0   7.8000  1.500   1.5   7  3  23  3  3  0  43.9
21   5.9894  1.0   5.5200  1.256   2.0   6  3  40  4  1  1  37.5
22   7.5422  1.5   4.0000  1.690   1.0   6  3  22  1  1  0  37.9
23   8.7951  1.5   9.8900  1.820   2.0   8  4  50  1  1  1  44.5
24   6.0931  1.5   6.7265  1.652   1.0   6  3  44  4  1  0  37.9
25   8.3607  1.5   9.1500  1.777   2.0   8  4  48  1  1  1  38.9
26   8.1400  1.0   8.0000  1.504   2.0   7  3   3  1  3  0  36.9
27   9.1416  1.5   7.3262  1.831   1.5   8  4  31  4  1  0  45.8
28  12.0000  1.5   5.0000  1.200   2.0   6  3  30  3  1  1  41.0

Reposted from blog.csdn.net/ezio23/article/details/81068085