Machine Learning: Univariate Linear Regression in Detail

Importing the Data

CSV (Comma-Separated Values) file
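ex1data1.txt is a headerless CSV: each row is one training example with two comma-separated columns (population and profit), which is why read_csv below is given explicit column names. The layout looks like this (values illustrative):

6.1101,17.592
5.5277,9.1302
...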

import pandas as pd

dataframe = pd.read_csv('ex1data1.txt', names=['population', 'profit'])  # read the data and assign column names

dataframe.head(10)

# check the dimensions (shape is an attribute, not a method)
dataframe.shape
# check detailed information
dataframe.info()

Plotting: we can first display these discrete points on a scatter plot to get an intuitive feel for the data, then choose a suitable model based on how the points are distributed.

import matplotlib.pyplot as plt

x = dataframe['population']
y = dataframe['profit']
plt.title("Matplotlib demo")
plt.xlabel("population")
plt.ylabel("profit")
plt.plot(x, y, "ob")
plt.show()

# use pandas' built-in plotting on the two-column table
dataframe.plot(kind='scatter', x='population', y='profit', figsize=(12,8))
plt.show()

# the pandas default is a bit plain, so redo the plot with matplotlib
plt.title("Matplotlib demo")
plt.xlabel("population")
plt.ylabel("profit")
plt.plot(dataframe['population'], dataframe['profit'], "ob")
plt.show()

Vectorization: split the data into two parts along the column axis, keeping all rows: X is every feature column except the last, and Y is the last column (the label).

# equivalent to the parameter b
# setting x_0 = 1 is the same as adding a bias term b
dataframe.insert(0, 'Ones', 1)

Usage of pandas DataFrame.insert:

DataFrame.insert(loc, column, value, allow_duplicates=False) -> None
Insert column into DataFrame at specified location.

Raises a ValueError if column is already contained in the DataFrame, unless allow_duplicates is set to True.

Parameters

loc : int
    Insertion index. Must satisfy 0 <= loc <= len(columns).

column : str, number, or hashable object
    Label of the inserted column.

value : int, Series, or array-like
    The values to insert.

allow_duplicates : bool, optional
    Whether to allow a duplicate column label (default False).
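A small illustration of these parameters (values are made up for the demo):

import pandas as pd

df = pd.DataFrame({'population': [6.1, 5.5], 'profit': [17.6, 9.1]})
df.insert(0, 'Ones', 1)                          # loc=0: the new column becomes the first column
# df.insert(0, 'Ones', 1)                        # doing this again would raise ValueError
df.insert(0, 'Ones', 1, allow_duplicates=True)   # allowed when allow_duplicates=True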
# extract the features X and the label Y
# loc and iloc in pandas select rows (and columns)
# columns can also be indexed directly as attributes

# set X (training data) and y (target variable)
# the second dimension of shape is the number of columns
cols = dataframe.shape[1]

X = dataframe.iloc[:, 0:cols-1]    # X: all rows, every column except the last
Y = dataframe.iloc[:, cols-1:cols]  # Y: all rows, only the last column

Y.head(5)

Warning:

Note that contrary to usual Python slices, with .loc both the start and the stop are included.

  • loc works on labels in the index (label-based indexing).

  • iloc works on the positions in the index, so it only takes integers (positional indexing, similar to list indexing).
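A quick illustration of the difference, with made-up data:

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s.loc['a':'b']   # label-based slice: both endpoints included -> values 10, 20
s.iloc[0:2]      # position-based slice: stop excluded -> values 10, 20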

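The dimension check and all of the matrix code below assume that X and Y have been converted from DataFrames to numpy matrices and that theta has been initialized; a minimal sketch of that setup, assuming zero-initialized parameters:

import numpy as np

X = np.matrix(X.values)              # (m, 2) design matrix (Ones column + population)
Y = np.matrix(Y.values)              # (m, 1) target vector
theta = np.matrix(np.array([0, 0]))  # (1, 2) parameter row vector, assumed initialized to zeros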
# before doing matrix operations, make sure the dimensions satisfy the rules of matrix algebra
X.shape, theta.shape, Y.shape

The linear regression hypothesis:

$h_{\theta}(x) = \theta^{T}X = \theta_{0}x_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \dots + \theta_{n}x_{n}$

In the univariate case there are only $\theta_{0}$ and $\theta_{1}$.

When $x_{0}$ is 1, this is equivalent to adding the bias term $b$.

The cost function:

$J(\theta) = \frac{1}{2m}\sum\limits_{i=1}^{m}{\left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right)^{2}}$
Replace explicit summation and loops with vectorized operations:
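In matrix form, with $X$ the $m \times (n+1)$ design matrix and $\theta$ a row vector as in the code below, the same cost can be written without any explicit sum:

$J(\theta) = \frac{1}{2m}\left( X\theta^{T} - y \right)^{T}\left( X\theta^{T} - y \right)$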

def computeCost(X, Y, theta):
    # (X * theta.T) is (m, 1): the prediction h_theta(x) for every example
    inner = np.power(((X * theta.T) - Y), 2)
    # when subtracting vectors, make sure the dimensions match
    return np.sum(inner) / (2 * len(X))

computeCost(X, Y, theta)

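Note that for np.matrix objects, * is matrix multiplication while np.multiply is element-wise; this distinction matters in both computeCost above and the gradient code below. A quick check:

A = np.matrix([[1, 2], [3, 4]])
A * A              # matrix product:  [[ 7, 10], [15, 22]]
np.multiply(A, A)  # element-wise:    [[ 1,  4], [ 9, 16]]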
Minimize the cost function with gradient descent; the parameter update rule is:

$\theta_{j} := \theta_{j} - \alpha \frac{\partial}{\partial \theta_{j}} J(\theta)$

For linear regression this derivative works out to $\frac{\partial}{\partial \theta_{j}} J(\theta) = \frac{1}{m}\sum\limits_{i=1}^{m}{\left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_{j}^{(i)}}$, which is exactly what the inner loop in the code below computes.

  • Initialization: number of iterations, learning rate, initial parameters

    alpha = 0.01
    iters = 1000
    
    

    The number of iterations is controlled by the loop:

    def gradientDescent(X, Y, theta, alpha, iters):
        # temp holds the new parameter values so that all of them update simultaneously
        temp = np.matrix(np.zeros(theta.shape))
        # number of parameters
        parameters = int(theta.ravel().shape[1])
        # record the cost after every iteration (initialized to zeros)
        cost = np.zeros(iters)

        for i in range(iters):
            error = (X * theta.T) - Y

            for j in range(parameters):
                # update every parameter simultaneously
                term = np.multiply(error, X[:, j])
                temp[0, j] = theta[0, j] - ((alpha / len(X)) * np.sum(term))

            theta = temp
            cost[i] = computeCost(X, Y, theta)
        return theta, cost
    
    g, cost = gradientDescent(X, Y, theta, alpha, iters)
    g
    
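    The inner loop over the parameters can itself be vectorized into a single matrix product. A minimal sketch of an equivalent update, under the same assumptions as above (X, Y, theta are numpy matrices; the name gradientDescentVectorized is just for illustration):

    def gradientDescentVectorized(X, Y, theta, alpha, iters):
        cost = np.zeros(iters)
        for i in range(iters):
            error = (X * theta.T) - Y                        # (m, 1) residuals
            # one (1, m) x (m, n+1) product replaces the per-parameter loop
            theta = theta - (alpha / len(X)) * (error.T * X)
            cost[i] = computeCost(X, Y, theta)
        return theta, cost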

    matrix.flatten returns a similar output matrix, but always a copy.

    numpy.matrix.flatten

    method

    • matrix.flatten(order='C')

      Return a flattened copy of the matrix. All N elements of the matrix are placed into a single row.

      Parameters: order : {'C', 'F', 'A', 'K'}, optional. 'C' means to flatten in row-major (C-style) order. 'F' means to flatten in column-major (Fortran-style) order. 'A' means to flatten in column-major order if m is Fortran contiguous in memory, row-major order otherwise. 'K' means to flatten m in the order the elements occur in memory. The default is 'C'.

      Returns: y : matrix. A copy of the matrix, flattened to a (1, N) matrix where N is the number of elements in the original matrix.

    >>> m = np.matrix([[1, 2], [3, 4]])
    >>> m.flatten()
    matrix([[1, 2, 3, 4]])
    >>> m.flatten('F')
    matrix([[1, 3, 2, 4]])
    

    matrix.flat
    A flat iterator over the array.

    >>> x = np.arange(1, 7).reshape(2, 3)
    >>> x
    array([[1, 2, 3],
           [4, 5, 6]])
    >>> x.flat[3]
    4
    >>> x.T
    array([[1, 4],
           [2, 5],
           [3, 6]])
    >>> x.T.flat[3]
    5
    >>> type(x.flat)
    <class 'numpy.flatiter'>
    

    numpy.ravel is a related function which returns an ndarray.

    matrix.ravel(order='C')

    Return a flattened matrix.

    Refer to numpy.ravel for more documentation.

    • Parameters

      order : {'C', 'F', 'A', 'K'}, optional. The elements of m are read using this index order. 'C' means to index the elements in C-like order, with the last axis index changing fastest, back to the first axis index changing slowest. 'F' means to index the elements in Fortran-like index order, with the first index changing fastest, and the last index changing slowest. Note that the 'C' and 'F' options take no account of the memory layout of the underlying array, and only refer to the order of axis indexing. 'A' means to read the elements in Fortran-like index order if m is Fortran contiguous in memory, C-like order otherwise. 'K' means to read the elements in the order they occur in memory, except for reversing the data when strides are negative. By default, 'C' index order is used.

    • Returns

      ret : matrix

      Return the matrix flattened to shape (1, N) where N is the number of elements in the original matrix. A copy is made only if necessary.
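    This is why the gradient descent code counts parameters with theta.ravel().shape[1]; a quick check with the (1, 2) theta assumed earlier:

    >>> theta = np.matrix(np.array([0, 0]))
    >>> theta.ravel().shape[1]
    2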

Visualizing the Results

# evaluate the fitted line on a grid of population values
x = np.linspace(dataframe.population.min(), dataframe.population.max(), 100)
f = g[0, 0] + (g[0, 1] * x)

fig, ax = plt.subplots(figsize=(12,8))
ax.plot(x, f, 'r', label='Prediction')
ax.scatter(dataframe.population, dataframe.profit, label='Training Data')
ax.legend(loc=2)
ax.set_xlabel('Population')
ax.set_ylabel('Profit')
ax.set_title('Predicted Profit vs. Population Size')
plt.show()
fig, ax = plt.subplots(figsize=(12,8))
ax.plot(np.arange(iters), cost, 'r')
ax.set_xlabel('Iterations')
ax.set_ylabel('Cost')
ax.set_title('Error vs. Training Epoch')
plt.show()

Reposted from blog.csdn.net/qq_45175218/article/details/104915716