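[2] Machine Learning: Regression

2.1 Linear Regression

2.1.1 Experimental Data

1. Data description

The data come from the book "An Introduction to Statistical Learning with Applications in R" (Springer, 2013) by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. The dataset contains 200 records, each with 4 attributes.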

Data download URL: http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv

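2. Dataset information

The data have 4 columns and 200 rows. Each row is one observation (according to the book, one market in which the product is sold); the first 3 columns are input features and the last column is the output.

Input features:

TV: advertising spend for the product on television (in thousands of dollars; the same applies below)

radio: advertising spend on radio

newspaper: advertising spend in newspapers

Output:

sales: sales volume of the product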

3. Data sample

2.1.2 Experimental Procedure

Start Python.

1. Collect and prepare the data

import pandas as pd
data = pd.read_csv("Advertising.csv")
data.head()  # show the first 5 rows

Check the size of the dataset:

data.shape
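If the downloaded file begins with an unnamed row-index column, `data.shape` will report one more column than the 4 described above. This is an assumption worth checking against your copy. The sketch below uses a small in-memory CSV with made-up rows (not the real data) to show how `index_col=0` handles such a layout:

```python
import io
import pandas as pd

# Made-up rows in the assumed layout; the first header field is empty (a row index).
csv_text = """,TV,radio,newspaper,sales
1,100.0,20.0,30.0,12.0
2,50.0,10.0,5.0,8.0
"""

# Without index_col, the row index is read as an extra data column.
raw = pd.read_csv(io.StringIO(csv_text))
# With index_col=0, the first column becomes the DataFrame index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(raw.shape, list(raw.columns))
print(indexed.shape, list(indexed.columns))
```

If your file has this layout, passing `index_col=0` to `pd.read_csv("Advertising.csv", ...)` leaves exactly the four columns TV, radio, newspaper, and sales.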

2. Analyze the data

import matplotlib.pyplot as plt
import pandas as pd
if __name__ == "__main__":
    path = "Advertising.csv"
    # Read the data with pandas
    data = pd.read_csv(path)
    x = data[['TV','radio','newspaper']]
    y = data['sales']
    plt.figure(figsize=(9,12))
    plt.subplot(311)
    plt.plot(data['TV'],y,'ro')
    plt.title('TV')
    plt.grid()
    plt.subplot(312)
    plt.plot(data['radio'],y,'g^')
    plt.title('radio')
    plt.grid()
    plt.subplot(313)
    plt.plot(data['newspaper'],y,'b*')
    plt.title('newspaper')
    plt.grid()
    plt.tight_layout()
    plt.show()

The resulting plot:

3. Build the feature matrix X and the label vector y with pandas

feature_cols = ['TV','radio','newspaper']
X = data[feature_cols]
print(X.head())
print(type(X))
print(X.shape)
y = data['sales']
print(y.head())

The output is as follows:

4. Build the training and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# By default, 75% of the rows go to the training set and 25% to the test set
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
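The 75%/25% split mentioned in the comment is what `train_test_split` does when no `test_size` or `train_size` is given. A minimal sketch on synthetic arrays (`X_demo` and `y_demo` are hypothetical stand-ins with the same 200-row shape, not the advertising data) shows the default and an explicit ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 200-row table: 200 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.random((200, 3))
y_demo = rng.random(200)

# No test_size given: 25% of the rows are held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)
print(X_tr.shape, X_te.shape)

# The ratio can also be set explicitly, for example an 80/20 split.
X_tr80, X_te20, y_tr80, y_te20 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1)
print(X_tr80.shape, X_te20.shape)
```

Fixing `random_state` makes the shuffle reproducible, which is why repeated runs of the tutorial code give the same coefficients.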

5. Linear regression with sklearn

from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
model = linreg.fit(X_train,y_train)
print(model)
print(linreg.intercept_)
print(linreg.coef_)
print(list(zip(feature_cols, linreg.coef_)))  # pair each feature name with its coefficient

This gives the fitted model: y = 2.8769 + 0.0465*TV + 0.1791*radio + 0.00345*newspaper
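To make the fitted equation concrete, the sketch below plugs one budget into it by hand. The coefficients are copied from the equation above; the budget values are hypothetical and chosen only for illustration:

```python
# Coefficients copied from the fitted equation in the text.
intercept = 2.8769
coef = {"TV": 0.0465, "radio": 0.1791, "newspaper": 0.00345}

# Hypothetical advertising budget, in the same units as the dataset.
budget = {"TV": 100.0, "radio": 20.0, "newspaper": 10.0}

# Linear model: intercept plus the weighted sum of the features.
predicted_sales = intercept + sum(coef[name] * budget[name] for name in coef)
print(round(predicted_sales, 4))
```

For this budget the prediction is about 11.14, and almost all of it comes from the intercept, TV, and radio terms; the newspaper term adds only 0.0345.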

6. Prediction

y_pred = linreg.predict(X_test)
print(y_pred)
print(type(y_pred))

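7. Evaluation metrics for regression

For classification problems the usual evaluation metric is accuracy, but accuracy does not apply to regression. For continuous outputs we use metrics such as:

(1) Mean Absolute Error (MAE)

(2) Mean Squared Error (MSE)

(3) Root Mean Squared Error (RMSE)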
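For reference, the three metrics are commonly defined as follows. The notation is introduced here and is not from the original text: n is the number of test samples, y_i the actual value, and \hat{y}_i the predicted value.

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert, \qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
```

RMSE is expressed in the same units as y, which makes it easier to read than MSE.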

Here we use RMSE:

print(type(y_pred), type(y_test))
print(len(y_pred), len(y_test))
print(y_pred.shape, y_test.shape)
from sklearn import metrics
import numpy as np
# Accumulate the squared errors over the test set
sum_mean = 0
for i in range(len(y_pred)):
    sum_mean += (y_pred[i] - y_test.values[i]) ** 2

print("RMSE by hand:", np.sqrt(sum_mean / len(y_pred)))
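The explicit loop can be cross-checked against `sklearn.metrics`, which the code above imports. The sketch below uses made-up values (not the advertising data) so the numbers are easy to verify by hand:

```python
import numpy as np
from sklearn import metrics

# Made-up values for illustration only.
y_true = np.array([10.0, 12.0, 9.0, 15.0])
y_hat = np.array([11.0, 11.0, 9.0, 13.0])

mae = metrics.mean_absolute_error(y_true, y_hat)
mse = metrics.mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)

# Vectorized equivalent of the explicit loop in the tutorial code.
rmse_by_hand = np.sqrt(np.mean((y_hat - y_true) ** 2))
print(mae, mse, rmse, rmse_by_hand)
```

The errors here are 1, -1, 0, and -2, so MAE is 1.0, MSE is 1.5, and both RMSE computations give the square root of 1.5.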

8. Plotting

import matplotlib.pyplot as plt
plt.figure()
plt.plot(range(len(y_pred)), y_pred, 'b', label="predict")  # blue line: predicted values
plt.plot(range(len(y_pred)), y_test, 'r', label="test")  # red line: actual values
plt.legend(loc="upper right")  # show the legend in the upper-right corner
plt.xlabel("the number of sales")
plt.ylabel("value of sales")
plt.show()

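2.1.3 Analysis of Results

In the fitted model y = 2.8769 + 0.0465*TV + 0.1791*radio + 0.00345*newspaper, the coefficient of newspaper is very small. The sales-versus-newspaper scatter plot also shows no clear linear relationship, so we can try removing this feature and see how the regression predictions change.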

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Refit using only the TV and radio features
feature_cols = ['TV', 'radio']
X = data[feature_cols]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
linreg = LinearRegression()
model = linreg.fit(X_train, y_train)
print(list(zip(feature_cols, linreg.coef_)))
y_pred = linreg.predict(X_test)
# Accumulate the squared errors over the test set
sum_mean = 0
for i in range(len(y_pred)):
    sum_mean += (y_pred[i] - y_test.values[i]) ** 2

print("RMSE by hand:", np.sqrt(sum_mean / len(y_pred)))
plt.figure()
plt.plot(range(len(y_pred)), y_pred, 'b', label="predict")  # blue line: predicted values
plt.plot(range(len(y_pred)), y_test, 'r', label="test")  # red line: actual values
plt.legend(loc="upper right")  # show the legend in the upper-right corner
plt.xlabel("the number of sales")
plt.ylabel("value of sales")
plt.show()

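The resulting RMSE is 1.387.

The plot of predicted against actual values is shown below:

After removing the newspaper feature, the RMSE decreased, which suggests that newspaper may not be a useful feature for predicting sales. We therefore adopt the new model.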
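The comparison between the two feature sets can be wrapped in a small helper so that both runs share the same split and metric. The function `holdout_rmse` and the synthetic DataFrame `demo` below are hypothetical and introduced only for this sketch. The synthetic data stands in for Advertising.csv so the code runs without the file, and its numbers say nothing about the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def holdout_rmse(data, feature_cols, target="sales", random_state=1):
    """Fit LinearRegression on a 75/25 split and return the test-set RMSE."""
    X_train, X_test, y_train, y_test = train_test_split(
        data[feature_cols], data[target], random_state=random_state)
    model = LinearRegression().fit(X_train, y_train)
    residuals = model.predict(X_test) - y_test.values
    return float(np.sqrt(np.mean(residuals ** 2)))

# Synthetic stand-in: "newspaper" is generated independently of "sales".
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "TV": rng.uniform(0, 300, 200),
    "radio": rng.uniform(0, 50, 200),
    "newspaper": rng.uniform(0, 100, 200),
})
demo["sales"] = (3 + 0.05 * demo["TV"] + 0.2 * demo["radio"]
                 + rng.normal(0, 1.5, 200))

for cols in (["TV", "radio", "newspaper"], ["TV", "radio"]):
    print(cols, holdout_rmse(demo, cols))
```

With the real file, the same helper could be called as `holdout_rmse(pd.read_csv("Advertising.csv"), ["TV", "radio"])`. A difference measured on a single split can be small and noisy, so repeating the comparison with several `random_state` values is a reasonable extra check.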

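2.1.4 Notes

Although this model is simple, it covers a fair share of the machine learning workflow, such as splitting the data into a 75% training set and a 25% test set, which is often the first step in exploring machine learning. The fitted linear model assigned a very small weight to newspaper, and we handled it in the simplest way possible, by dropping the feature; even so, the predictions improved.

Machine learning follows the principle of Occam's razor: if a simple model can solve the problem, do not use a complex one, because complex models tend to add uncertainty, cost more, and overfit more easily.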

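2.2 Logistic Regression

2.2.1 Experimental Data

The Iris dataset is perhaps the best-known test dataset in pattern recognition. It contains 3 classes of iris with 50 samples each; one class is linearly separable from the other two, while the other two are not linearly separable from each other.

Because the original dataset contains two errors (samples 35 and 38), we use the corrected data in our experiments.

Data download URL: http://archive.ics.uci.edu/ml/dataset/Iris

2.2.2 Experimental Procedure

(1) Data description

The dataset has 150 rows, one sample per row, and each sample has 5 fields: sepal length (cm), sepal width (cm), petal length (cm), petal width (cm), and class (one of three: Iris Setosa, Iris Versicolor, Iris Virginica).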

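Dataset characteristics	Multivariate	Number of instances	150
Attribute characteristics	Real	Number of attributes	4
Associated tasks	Classification	Missing values	None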
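As a quick sanity check of these figures, scikit-learn bundles a copy of the Iris data that can be inspected without downloading anything. This sketch assumes the bundled copy, which may differ in small details from the iris.data file used in this experiment:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

print(iris.data.shape)           # (number of samples, number of features)
print(iris.feature_names)        # the four measurements, in cm
print(iris.target_names)         # the three class names
print(np.bincount(iris.target))  # number of samples per class
```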

(2) Experiment code

import numpy as np
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

def iris_type(s):
    # Map a class name to an integer label; some NumPy versions pass bytes to converters
    if isinstance(s, bytes):
        s = s.decode()
    it = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
    return it[s]

if __name__ == "__main__":
    path = 'iris.data'  # path to the data file
    # Arguments: path, float dtype, comma delimiter; column 4 is converted by iris_type
    data = np.loadtxt(path, dtype=float, delimiter=',', converters={4: iris_type})
    # Columns 0-3 form x; column 4 becomes y
    x, y = np.split(data, (4,), axis=1)
    # For visualization, use only the first two features
    x = x[:, :2]
    # Logistic regression model
    logreg = LogisticRegression()
    # Fit the regression parameters on the data [x, y]
    logreg.fit(x, y.ravel())
    # Plotting
    # Number of sample points along each axis
    N, M = 500, 500
    # Range of column 0
    x1_min, x1_max = x[:, 0].min(), x[:, 0].max()
    # Range of column 1
    x2_min, x2_max = x[:, 1].min(), x[:, 1].max()
    t1 = np.linspace(x1_min, x1_max, N)
    t2 = np.linspace(x2_min, x2_max, M)
    # Generate the grid of sample points
    x1, x2 = np.meshgrid(t1, t2)
    # Test points, one row per grid point
    x_test = np.stack((x1.ravel(), x2.ravel()), axis=1)
    # Predicted class for each grid point
    y_hat = logreg.predict(x_test)
    # Reshape to match the grid
    y_hat = y_hat.reshape(x1.shape)
    # Show the predicted regions
    plt.pcolormesh(x1, x2, y_hat, cmap=plt.cm.prism)
    # Show the samples
    plt.scatter(x[:, 0], x[:, 1], c=np.squeeze(y), edgecolors='k', cmap=plt.cm.prism)
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.grid()
    plt.show()
    # Predictions on the training set
    y_hat = logreg.predict(x)
    y = y.reshape(-1)
    print(y_hat.shape)
    print(y.shape)
    result = y_hat == y
    print(y_hat)
    print(y)
    print(result)
    c = np.count_nonzero(result)
    print(c)
    print('Accuracy: %.2f%%' % (100 * float(c) / float(len(result))))
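The last lines of the script compute accuracy by comparing predictions with labels element by element and counting the matches. The sketch below isolates that step on toy labels (made up for illustration) and shows that it is the same as taking the mean of the boolean comparison:

```python
import numpy as np

# Toy labels for illustration only.
y = np.array([0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 0, 1, 2, 2, 1])

result = y_hat == y                  # boolean array, True where the prediction is right
correct = np.count_nonzero(result)   # number of correct predictions
accuracy = correct / len(result)     # fraction correct

print(correct, "%.2f%%" % (100 * accuracy))
print(np.mean(result))               # same fraction, computed in one step
```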

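2.2.3 Analysis of Results

(1) Experimental results

(2) Analysis

1. Using only two features, sepal length and sepal width, 115 of the 150 samples are classified correctly, an accuracy of 76.67%.

2. Using more features (all 4), we run the program again: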

import numpy as np
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

def iris_type(s):
    # Map a class name to an integer label; some NumPy versions pass bytes to converters
    if isinstance(s, bytes):
        s = s.decode()
    it = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
    return it[s]

if __name__ == "__main__":
    path = 'iris.data'  # path to the data file
    # Arguments: path, float dtype, comma delimiter; column 4 is converted by iris_type
    data = np.loadtxt(path, dtype=float, delimiter=',', converters={4: iris_type})
    # Columns 0-3 form x; column 4 becomes y
    x, y = np.split(data, (4,), axis=1)
    # All four features are used this time, so x is not sliced
    # Logistic regression model
    logreg = LogisticRegression()
    # Fit the regression parameters on the data [x, y]
    logreg.fit(x, y.ravel())
    # A 4-D decision surface cannot be plotted, so the grid sampling step is omitted.
    # (Predicting on a 100 x 100 x 100 x 100 grid would need several GB of memory,
    # and its result was overwritten before use.)
    # Predictions on the training set
    y_hat = logreg.predict(x)
    y = y.reshape(-1)
    print(y_hat.shape)
    print(y.shape)
    result = y_hat == y
    print(y_hat)
    print(y)
    print(result)
    c = np.count_nonzero(result)
    print(c)
    print('Accuracy: %.2f%%' % (100 * float(c) / float(len(result))))

With all four features, 144 of the 150 training samples are classified correctly, an accuracy of 96.00%.
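Both accuracy figures are measured on the same samples the model was trained on. A held-out estimate, in the spirit of the 75%/25% split used in section 2.1, could be sketched as follows. This assumes scikit-learn's bundled Iris copy in place of the iris.data file, and the `max_iter=1000` and `stratify=y` settings are choices made for this sketch, not part of the original experiment:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris copy stands in for the iris.data file.
X, y = load_iris(return_X_y=True)

# Hold out 25% of the samples; stratify keeps the three classes balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, stratify=y)

# max_iter is raised from the default to give the solver room to converge.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# score() returns the fraction of correctly classified samples.
train_acc = logreg.score(X_train, y_train)
test_acc = logreg.score(X_test, y_test)
print("train accuracy: %.2f%%" % (100 * train_acc))
print("test accuracy:  %.2f%%" % (100 * test_acc))
```

The test-set figure is the one that indicates how the classifier would behave on flowers it has not seen.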
