暑期机器学习小组模型实现----线性回归模型(linear regression)

本次实践，从较为简单的多变量线性回归模型入手

数据源于网络，是一个28行，12列的数据集，因为是初次上手，所以找的数据集较少。

数据集的详细内容介绍附在文章末尾，简单地说，

我们的任务是得出一个能较好拟合已知数据的线性回归模型，并对新数据进行房价的预测

话不多说，上代码

//
//  main.cpp
//  linear_regression_house_price
//
//  Created by zsp on 2018/7/14.
//  Copyright © 2018年 zsp. All rights reserved.
//

#include <iostream>
#include <fstream>
#include <cmath>
using namespace std;

const int SAMPLE = 27;
const int PARAMETER = 12;

double hypoVal(double para[], double fea[], int count);
double costVal(double para[], double lab[], int amount, double allX[][PARAMETER + 1]);

int main()
{
    //文本操作，读入数据
    fstream infile;
    infile.open("data.txt");
    if (!infile)
    {
        cout << "can't open file!" << endl;
        return -1;
    }
    //二维数组储存数据
    double X[SAMPLE + 1][PARAMETER + 1];    //features matrix
    for (int i = 0; i < SAMPLE + 1; i++)
    {
        for (int j = 0; j < PARAMETER + 1; j++)
        {
            infile >> X[i][j];
        }
    }
    infile.close();
    //验证数据是否写入
    for (int i = 0; i < SAMPLE + 1; i++)
    {
        for (int j = 0; j < PARAMETER + 1; j++)
            cout << X[i][j] << " ";
        cout << endl;
    }
    //特征缩放
    
    //留一法，得到训练集X[0]~X[26]和测试集X[27]
    double y[SAMPLE] = { 0 };      //labels vector
    double theta[PARAMETER] = { 0 };   //parameters vector
    double a = 0.0001;    //set learning rate as 0.0001
    int cnt = 0;        //to count the times of loop
    for (int i = 0; i < SAMPLE; i++)
    {
        y[i] = X[i][PARAMETER];
    }
    double h =  hypoVal(theta, X[0], PARAMETER); //hypothesis function
    double cost = costVal(theta, y, SAMPLE, X); //cost function
    
    //梯度下降法求theta
    double temp[PARAMETER] = { 0 };    //used for simultaneously updating theta parameters
    double der[PARAMETER] = { 0 };  //the derivative term
    double tempCost = 0;       //break the loop when tempCost - cost is small enough
    do {
        tempCost = cost;
        double sum = 0;
        for (int j = 0; j < PARAMETER; j++)
        {
            for (int i = 0; i < SAMPLE; i++)
            {
                sum += (hypoVal(theta, X[i], PARAMETER) - y[i]) * X[i][j];
            }
            der[j] = (1.0 / double (PARAMETER)) * sum;
            temp[j] = theta[j] - a * der[j];
            sum = 0;
        }
        cout << "now the theta parameters are: ";
        for (int i = 0; i < PARAMETER; i++)
        {
            theta[i] = temp[i];
            cout << theta[i] << " ";
        }
        cout << endl;
        cost = costVal(theta, y, SAMPLE, X);        //new cost value
        cnt++;
    } while (tempCost - cost > 0.00001);
    
    //测试集进行测试
    cout << "共进行" << cnt << "次梯度下降法" << endl;
    h = hypoVal(theta, X[SAMPLE - 1], PARAMETER);
    cout << "对测试数据预测值为：" << h << "，真实值为：" << y[SAMPLE - 1] << endl;
    return 0;
}

double hypoVal(double para[], double fea[], int count)  //the value of hypothesis function
{
    double hy = 0;
    for (int i = 0; i < count; i++)
    {
        hy += para[i] * fea[i];
    }
    return hy;
}
double costVal(double para[], double lab[], int amount, double allX[][PARAMETER + 1])    //the value of cost function
{
    double sum = 0;
    for (int i = 0; i < amount; i++)
    {
        sum += pow((hypoVal(para, allX[i], PARAMETER) - lab[i]), 2);
    }
    cout << "costVal now is : " << (1.0/(2.0 * (double) amount)) * sum << endl;
    return double(1.0/(2.0 * (double) amount)) * sum;
}

调试过程：

读入数据后，要注意最后一组数据不能用于训练，最后一列数不是特征变量，是真实标记。

刚开始，学习率为0.01，结果步长过大，代价函数增大，无法收敛到最优解。

学习率设置为0.0001后，程序正常运行，输出如下：

costVal now is : 824.024
now the theta parameters are: 0.00856167 0.0688921 0.0120104 0.0609822 0.0146573 0.0122017 0.0602742 0.0298192 0.29996 0.0196875 0.0104292 0.0032625 
...
...
costVal now is : 5.12645
now the theta parameters are: 1.22471 3.71067 2.97314 0.151653 2.89371 0.524275 -0.456355 0.777262 -0.0155349 0.828632 -0.47861 1.85199 
...
...
costVal now is : 4.53654
now the theta parameters are: 2.27722 3.36399 4.38813 0.114215 4.16421 0.756403 -1.04117 1.08051 -0.0204399 0.823736 0.0122513 2.3339 
costVal now is : 4.53653
共进行95091次梯度下降法
对测试数据预测值为：47.874，真实值为：45.8

分析：

由此可见，经过95091次梯度下降法后，我们得到了一个对数据拟合较好的线性回归模型，代价函数的值为4.53653

已足够小。并且由循环条件可知，此时代价函数下降速度已十分缓慢，可以认为我们已经逼近了最优解。

最后预测结果与测试集的对比可以看出，模型的泛化性能也较好。

缺点：

由于数据量太小，测试集太小，结果的说服力不强，

模型受数据变化的影响较大，并不能肯定模型的性能。

遇到的问题：

在从文本读入数据时，无法读入数据。具体解决方案参见我的另一篇博客

https://blog.csdn.net/ezio23/article/details/81068667

附数据集说明：

# data.txt
#
# Reference:
#
# S C Narula, J F Wellington,
# Linear Regression and the Minimum Sum of Relative Errors,
# Technometrics, Volume 19, 1977, pages 185-190.
#
# Helmut Spaeth,
# Mathematical Algorithms for Linear Regression,
# Academic Press, 1991,
# ISBN 0-12-656460-4.
#
# Discussion:
#
# The selling price of houses is to be represented as a function of
# other variables.
#
# There are 28 rows of data. The data includes:
#
# I, the index;
# A1, the local selling prices, in hundreds of dollars;
# A2, the number of bathrooms;
# A3, the area of the site in thousands of square feet;
# A4, the size of the living space in thousands of square feet;
# A5, the number of garages;
# A6, the number of rooms;
# A7, the number of bedrooms;
# A8, the age in years;
# A9, 1 = brick, 2 = brick/wood, 3 = aluminum/wood, 4 = wood.
# A10, 1 = two story, 2 = split level, 3 = ranch
# A11, number of fire places.
# B, the selling price.
#
# We seek a model of the form:
#
# B = A1 * X1 + A2 * X2 + A3 * X3 + A4 * X4 + A5 * X5 + A6 * X6 + A7 * X7
# + A8 * X8 + A9 * X9 + A10 * X10 + A11 * X11
#
13 columns
28 rows
Index
A1, the local selling prices, in hundreds of dollars;
A2, the number of bathrooms;
A3, the area of the site in thousands of square feet;
A4, the size of the living space in thousands of square feet;
A5, the number of garages;
A6, the number of rooms;
A7, the number of bedrooms;
A8, the age in years;
A9, construction type
A10, architecture type
A11, number of fire places.
B, selling price
1 4.9176 1.0 3.4720 0.998 1.0 7 4 42 3 1 0 25.9
2 5.0208 1.0 3.5310 1.500 2.0 7 4 62 1 1 0 29.5
3 4.5429 1.0 2.2750 1.175 1.0 6 3 40 2 1 0 27.9
4 4.5573 1.0 4.0500 1.232 1.0 6 3 54 4 1 0 25.9
5 5.0597 1.0 4.4550 1.121 1.0 6 3 42 3 1 0 29.9
6 3.8910 1.0 4.4550 0.988 1.0 6 3 56 2 1 0 29.9
7 5.8980 1.0 5.8500 1.240 1.0 7 3 51 2 1 1 30.9
8 5.6039 1.0 9.5200 1.501 0.0 6 3 32 1 1 0 28.9
9 16.4202 2.5 9.8000 3.420 2.0 10 5 42 2 1 1 84.9
10 14.4598 2.5 12.8000 3.000 2.0 9 5 14 4 1 1 82.9
11 5.8282 1.0 6.4350 1.225 2.0 6 3 32 1 1 0 35.9
12 5.3003 1.0 4.9883 1.552 1.0 6 3 30 1 2 0 31.5
13 6.2712 1.0 5.5200 0.975 1.0 5 2 30 1 2 0 31.0
14 5.9592 1.0 6.6660 1.121 2.0 6 3 32 2 1 0 30.9
15 5.0500 1.0 5.0000 1.020 0.0 5 2 46 4 1 1 30.0
16 5.6039 1.0 9.5200 1.501 0.0 6 3 32 1 1 0 28.9
17 8.2464 1.5 5.1500 1.664 2.0 8 4 50 4 1 0 36.9
18 6.6969 1.5 6.9020 1.488 1.5 7 3 22 1 1 1 41.9
19 7.7841 1.5 7.1020 1.376 1.0 6 3 17 2 1 0 40.5
20 9.0384 1.0 7.8000 1.500 1.5 7 3 23 3 3 0 43.9
21 5.9894 1.0 5.5200 1.256 2.0 6 3 40 4 1 1 37.5
22 7.5422 1.5 4.0000 1.690 1.0 6 3 22 1 1 0 37.9
23 8.7951 1.5 9.8900 1.820 2.0 8 4 50 1 1 1 44.5
24 6.0931 1.5 6.7265 1.652 1.0 6 3 44 4 1 0 37.9
25 8.3607 1.5 9.1500 1.777 2.0 8 4 48 1 1 1 38.9
26 8.1400 1.0 8.0000 1.504 2.0 7 3 3 1 3 0 36.9
27 9.1416 1.5 7.3262 1.831 1.5 8 4 31 4 1 0 45.8
28 12.0000 1.5 5.0000 1.200 2.0 6 3 30 3 1 1 41.0

暑期机器学习小组模型实现----线性回归模型(linear regression)

猜你喜欢