Machine Learning-Programming Exercise 1: Linear Regression

Data Description

        Data: house area, number of bedrooms, and house price. House area and number of bedrooms are the independent variables; house price is the dependent variable. Data download link

Importing the Python Packages

import numpy as np
import matplotlib
import matplotlib.pyplot as plt


Helper Functions

#Convert the text file of records to NumPy arrays
#filename: path to the data file; k: number of feature columns
def file2matrix(filename,k):
    fr = open(filename)
    lines = fr.readlines()
    fr.close()
    numberOfLines = len(lines)                     #get the number of lines in the file
    returnMat = np.zeros((numberOfLines,k))        #prepare the feature matrix to return
    classLabelVector = np.zeros((numberOfLines,1)) #prepare the label vector to return
    index = 0
    for line in lines:
        line = line.strip()
        listFromLine = line.split(',')
        returnMat[index,:] = listFromLine[0:k]
        classLabelVector[index,:] = float(listFromLine[-1])
        index += 1
    return returnMat,classLabelVector

#Feature normalization: zero mean and unit variance for each column
def featureNormalize(X):
    X_norm = np.array(X)
    mu = np.mean(X, axis = 0)   #per-column mean
    sigma = np.std(X, axis = 0) #per-column standard deviation
    X_norm = (X_norm - np.tile(mu, (X.shape[0],1)))/np.tile(sigma, (X.shape[0],1))
    return X_norm, mu, sigma

# Cost function: J = 1/(2m) * sum of squared errors
def computeCostMulti(X, y, theta):
    m = len(y)
    J = np.sum((y - np.dot(X, theta))**2) / (2 * m)
    return J

#Gradient descent: update theta for num_iters steps and record the cost
def gradientDescentMulti(X, y, theta, alpha, num_iters):
    # Initialize some useful values
    m = len(y) # number of training examples
    J_history = []

    for i in range(num_iters):
        theta -= alpha / m * np.dot(X.T, (np.dot(X, theta) - y))
        J_history.append(computeCostMulti(X, y, theta))
    return theta, J_history
       
# Normal Equation
# theta = (X'X)^(-1)X'y
def normalEqn(X, y):
    from numpy.linalg import inv
    return np.dot(np.dot(inv(np.dot(X.T, X)), X.T), y)


Reading the Data

returnMat, LabelVector = file2matrix('ex1data2.txt',2)

        Here we read the file. First, we open it and count its lines, then allocate the arrays that will hold the feature data and the labels (we use np.array here; np.mat would work much the same way). For each line, strip() removes the specified leading and trailing characters (whitespace and newlines by default), and split(',') slices the string on the given separator (here ','); the first two columns go into the feature matrix and the last column into the label vector. If the labels were letters rather than numbers, we would also need to convert their type. returnMat and LabelVector hold the feature data and the labels, respectively.
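As an aside, NumPy's built-in np.loadtxt can do the same job more concisely. A minimal sketch, assuming the file is purely numeric and comma-separated with the label in the last column:

# Alternative loader: np.loadtxt parses the comma-separated numeric file directly
data = np.loadtxt('ex1data2.txt', delimiter=',')
returnMat2 = data[:, 0:2]                  # first two columns: features
LabelVector2 = data[:, -1].reshape(-1, 1)  # last column: labels, as an m x 1 array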


Feature Normalization

# Part 1: Feature Normalization
X, mu, sigma = featureNormalize(returnMat)


        For each column of returnMat we compute the mean and the standard deviation, using NumPy's own np.mean and np.std (see the official documentation for details). np.tile then copies mu and sigma into matrices the same size as the input, and (returnMat - mean)/std gives the normalized result, where '/' is element-wise division.
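The np.tile calls are not strictly necessary: NumPy broadcasting stretches a length-k row across all m rows automatically. A minimal equivalent sketch:

# Equivalent normalization via broadcasting, without np.tile
def featureNormalizeBroadcast(X):
    mu = np.mean(X, axis=0)    # per-column mean, shape (k,)
    sigma = np.std(X, axis=0)  # per-column standard deviation, shape (k,)
    return (X - mu) / sigma, mu, sigma   # (m,k) - (k,) broadcasts row-wise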

Augmenting the Input Data

# Add intercept term to X
X = np.column_stack((np.ones((X.shape[0],1)), X))


        We insert a column of ones as the first column of the feature matrix, so that it conveniently multiplies the constant (intercept) term of theta later on.
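As a quick illustration (the numbers below are made up, not from the dataset), the leading 1 lets a single dot product reproduce the full hypothesis including the intercept:

# Toy check: [1, x1, x2] . [theta0, theta1, theta2] = theta0 + theta1*x1 + theta2*x2
x_row = np.array([1.0, 2.0, 3.0])        # one augmented example
th = np.array([[10.0], [4.0], [5.0]])    # intercept and two weights
print(np.dot(x_row, th))                 # [33.] = 10 + 4*2 + 5*3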

Gradient Descent

# Part 2: Gradient Descent
# Choose some alpha value
alpha = 0.01
num_iters = 400
# Init Theta and Run Gradient Descent 
theta = np.zeros((X.shape[1],1))
theta, J_history = gradientDescentMulti(X, LabelVector, theta, alpha, num_iters)



The gradient descent update rule is

    theta_j := theta_j - (alpha/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i)

or, in vectorized form, theta := theta - (alpha/m) * X'(X*theta - y).


        From the formula, the partial derivative with respect to the constant term is not multiplied by any x. Transposing X takes care of this: the column of ones means the constant term's errors are multiplied by 1, i.e. left unchanged, and multiplying the transposed X by the error vector also sums over all training examples in one step.

theta -= alpha / m * np.dot(X.T, (np.dot(X, theta) - y))
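To convince ourselves that this one line really performs the per-example summation, here is a sketch on tiny made-up data comparing the vectorized gradient with an explicit loop:

# Sanity check: vectorized gradient equals an explicit sum over examples
Xt = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])  # first column = intercept ones
yt = np.array([[5.0], [7.0], [9.0]])
tt = np.zeros((2, 1))

vectorized = np.dot(Xt.T, np.dot(Xt, tt) - yt)
looped = np.zeros((2, 1))
for i in range(Xt.shape[0]):
    err = np.dot(Xt[i, :], tt) - yt[i]          # scalar error for example i
    looped += (err * Xt[i, :]).reshape(-1, 1)   # weight the error by x^(i)

print(np.allclose(vectorized, looped))  # True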


Plotting the Cost over Iterations

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(range(len(J_history)), J_history)  # cost after each iteration
plt.show()
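The comment above says to "choose some alpha value"; one quick way to do that is to plot the cost history for several candidate rates and keep the largest one that still converges. A sketch with illustrative alpha values:

# Compare convergence for a few illustrative learning rates
fig2, ax2 = plt.subplots()
for alpha_try in (0.3, 0.1, 0.03, 0.01):
    th = np.zeros((X.shape[1], 1))  # fresh theta for each run
    _, hist = gradientDescentMulti(X, LabelVector, th, alpha_try, 50)
    ax2.plot(range(len(hist)), hist, label='alpha = %g' % alpha_try)
ax2.set_xlabel('Iteration')
ax2.set_ylabel('Cost J')
ax2.legend()
plt.show()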


Prediction

# Prediction
# Estimate the price of a 1650 sq-ft, 3 br house
# Recall that the first column of X is all-ones. Thus, it does not need to be normalized.

price = np.dot(np.column_stack((np.array([[1]]), ((np.array([[1650, 3]]) - mu) / sigma))), theta)

print("Predicted price of a 1650 sq-ft, 3 br house is $", price[0][0])


        Result: Predicted price of a 1650 sq-ft, 3 br house is $ 289221.547371

       Note that because the data were normalized at the very beginning, the query point here must be normalized too, using the same mu and sigma.
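To make this step harder to forget, the normalization and intercept handling can be wrapped in a small helper; predictPrice below is a hypothetical convenience function, not part of the original exercise:

# Hypothetical helper: normalize the query with the training mu/sigma,
# prepend the intercept 1, then apply theta
def predictPrice(features, mu, sigma, theta):
    z = (np.asarray(features, dtype=float) - mu) / sigma
    z = np.concatenate(([1.0], z))
    return np.dot(z, theta).item()

print(predictPrice([1650, 3], mu, sigma, theta))  # same value as above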


Normal Equations

# Part 3: Normal Equations
X, y = file2matrix('ex1data2.txt',2)
# Add intercept term to X
X = np.column_stack((np.ones((X.shape[0],1)), X))
theta1 = normalEqn(X, y)


# 预测
# Estimate the price of a 1650 sq-ft, 3 br house
# Recall that the first column of X is all-ones. Thus, it does not need to be normalized.

price1 = np.dot(np.column_stack((np.array([[1]]), np.array([[1650, 3]]))), theta1)

print("Predicted price of a 1650 sq-ft, 3 br house is $", price1[0][0])




In matrix form, the least-squares solution for theta is

    theta = (X'X)^(-1) X'y


Result: Predicted price of a 1650 sq-ft, 3 br house is $ 293081.464335. Because the data were not normalized this time, the query point does not need to be normalized for prediction either.
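One caveat: explicitly inverting X'X can be numerically fragile when features are strongly correlated. np.linalg.lstsq solves the same least-squares problem more robustly; a minimal sketch:

# More stable alternative to the explicit inverse: solve least squares directly
theta_ls, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta_ls, theta1))  # should agree with the normal-equation result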



Reposted from blog.csdn.net/qq_18124075/article/details/80834077