Machine Learning: Univariate Linear Regression in Detail

Importing the Data

CSV (Comma-Separated Values) file
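ex1data1.txt is a headerless CSV: each row is one training example with two comma-separated columns (population and profit), which is why read_csv below is given explicit column names. The layout looks like this (values illustrative):

6.1101,17.592
5.5277,9.1302
...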

import pandas as pd

dataframe = pd.read_csv('ex1data1.txt', names=['population', 'profit'])  # read the data and assign column names

dataframe.head(10)

# check the dimensions (shape is an attribute, not a method)
dataframe.shape
# check detailed information
dataframe.info()

Plotting: we can first display these discrete points on a scatter plot to get an intuitive feel for the data, then choose a suitable model based on how the points are distributed.

import matplotlib.pyplot as plt

x = dataframe['population']
y = dataframe['profit']
plt.title("Matplotlib demo")
plt.xlabel("population")
plt.ylabel("profit")
plt.plot(x, y, "ob")
plt.show()

# use pandas' built-in plotting on the two-column table
dataframe.plot(kind='scatter', x='population', y='profit', figsize=(12,8))
plt.show()

# the pandas default is a bit plain, so redo the plot with matplotlib
plt.title("Matplotlib demo")
plt.xlabel("population")
plt.ylabel("profit")
plt.plot(dataframe['population'], dataframe['profit'], "ob")
plt.show()

Vectorization: split the data into two parts along the column axis, keeping all rows: X is every feature column except the last, and Y is the last column (the label).

# equivalent to the parameter b
# setting x_0 = 1 is the same as adding a bias term b
dataframe.insert(0, 'Ones', 1)

Usage of pandas DataFrame.insert:

DataFrame.insert(loc, column, value, allow_duplicates=False) -> None
Insert column into DataFrame at specified location.

Raises a ValueError if column is already contained in the DataFrame, unless allow_duplicates is set to True.

Parameters

loc : int
    Insertion index. Must satisfy 0 <= loc <= len(columns).

column : str, number, or hashable object
    Label of the inserted column.

value : int, Series, or array-like
    The values to insert.

allow_duplicates : bool, optional
    Whether to allow a duplicate column label (default False).
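A small illustration of these parameters (values are made up for the demo):

import pandas as pd

df = pd.DataFrame({'population': [6.1, 5.5], 'profit': [17.6, 9.1]})
df.insert(0, 'Ones', 1)                          # loc=0: the new column becomes the first column
# df.insert(0, 'Ones', 1)                        # doing this again would raise ValueError
df.insert(0, 'Ones', 1, allow_duplicates=True)   # allowed when allow_duplicates=True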
# extract the features X and the label Y
# loc and iloc in pandas select rows (and columns)
# columns can also be indexed directly as attributes

# set X (training data) and y (target variable)
# the second dimension of shape is the number of columns
cols = dataframe.shape[1]

X = dataframe.iloc[:, 0:cols-1]    # X: all rows, every column except the last
Y = dataframe.iloc[:, cols-1:cols]  # Y: all rows, only the last column

Y.head(5)

Warning:

Note that contrary to usual Python slices, with .loc both the start and the stop are included.

  • loc works on labels in the index (label-based indexing).

  • iloc works on the positions in the index, so it only takes integers (positional indexing, similar to list indexing).
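A quick illustration of the difference, with made-up data:

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s.loc['a':'b']   # label-based slice: both endpoints included -> values 10, 20
s.iloc[0:2]      # position-based slice: stop excluded -> values 10, 20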

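The dimension check and all of the matrix code below assume that X and Y have been converted from DataFrames to numpy matrices and that theta has been initialized; a minimal sketch of that setup, assuming zero-initialized parameters:

import numpy as np

X = np.matrix(X.values)              # (m, 2) design matrix (Ones column + population)
Y = np.matrix(Y.values)              # (m, 1) target vector
theta = np.matrix(np.array([0, 0]))  # (1, 2) parameter row vector, assumed initialized to zeros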
# before doing matrix operations, make sure the dimensions satisfy the rules of matrix algebra
X.shape, theta.shape, Y.shape

The linear regression hypothesis:

$h_{\theta}(x) = \theta^{T}X = \theta_{0}x_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \dots + \theta_{n}x_{n}$

In the univariate case there are only $\theta_{0}$ and $\theta_{1}$.

When $x_{0}$ is 1, this is equivalent to adding the bias term $b$.

The cost function:

$J(\theta) = \frac{1}{2m}\sum\limits_{i=1}^{m}{\left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right)^{2}}$
Replace explicit summation and loops with vectorized operations:
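In matrix form, with $X$ the $m \times (n+1)$ design matrix and $\theta$ a row vector as in the code below, the same cost can be written without any explicit sum:

$J(\theta) = \frac{1}{2m}\left( X\theta^{T} - y \right)^{T}\left( X\theta^{T} - y \right)$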

def computeCost(X, Y, theta):
    # (X * theta.T) is (m, 1): the prediction h_theta(x) for every example
    inner = np.power(((X * theta.T) - Y), 2)
    # when subtracting vectors, make sure the dimensions match
    return np.sum(inner) / (2 * len(X))

computeCost(X, Y, theta)

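Note that for np.matrix objects, * is matrix multiplication while np.multiply is element-wise; this distinction matters in both computeCost above and the gradient code below. A quick check:

A = np.matrix([[1, 2], [3, 4]])
A * A              # matrix product:  [[ 7, 10], [15, 22]]
np.multiply(A, A)  # element-wise:    [[ 1,  4], [ 9, 16]]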
Minimize the cost function with gradient descent; the parameter update rule is:

$\theta_{j} := \theta_{j} - \alpha \frac{\partial}{\partial \theta_{j}} J(\theta)$

For linear regression this derivative works out to $\frac{\partial}{\partial \theta_{j}} J(\theta) = \frac{1}{m}\sum\limits_{i=1}^{m}{\left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_{j}^{(i)}}$, which is exactly what the inner loop in the code below computes.

  • Initialization: number of iterations, learning rate, initial parameters

    alpha = 0.01
    iters = 1000
    
    

    The number of iterations is controlled by the loop:

    def gradientDescent(X, Y, theta, alpha, iters):
        # temp holds the new parameter values so that all of them update simultaneously
        temp = np.matrix(np.zeros(theta.shape))
        # number of parameters
        parameters = int(theta.ravel().shape[1])
        # record the cost after every iteration (initialized to zeros)
        cost = np.zeros(iters)

        for i in range(iters):
            error = (X * theta.T) - Y

            for j in range(parameters):
                # update every parameter simultaneously
                term = np.multiply(error, X[:, j])
                temp[0, j] = theta[0, j] - ((alpha / len(X)) * np.sum(term))

            theta = temp
            cost[i] = computeCost(X, Y, theta)
        return theta, cost
    
    g, cost = gradientDescent(X, Y, theta, alpha, iters)
    g
    
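    The inner loop over the parameters can itself be vectorized into a single matrix product. A minimal sketch of an equivalent update, under the same assumptions as above (X, Y, theta are numpy matrices; the name gradientDescentVectorized is just for illustration):

    def gradientDescentVectorized(X, Y, theta, alpha, iters):
        cost = np.zeros(iters)
        for i in range(iters):
            error = (X * theta.T) - Y                        # (m, 1) residuals
            # one (1, m) x (m, n+1) product replaces the per-parameter loop
            theta = theta - (alpha / len(X)) * (error.T * X)
            cost[i] = computeCost(X, Y, theta)
        return theta, cost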

    matrix.flatten returns a similar output matrix, but always a copy.

    numpy.matrix.flatten

    method

    • matrix.flatten(order='C')

      Return a flattened copy of the matrix. All N elements of the matrix are placed into a single row.

      Parameters: order : {'C', 'F', 'A', 'K'}, optional. 'C' means to flatten in row-major (C-style) order. 'F' means to flatten in column-major (Fortran-style) order. 'A' means to flatten in column-major order if m is Fortran contiguous in memory, row-major order otherwise. 'K' means to flatten m in the order the elements occur in memory. The default is 'C'.

      Returns: y : matrix. A copy of the matrix, flattened to a (1, N) matrix where N is the number of elements in the original matrix.

    >>> m = np.matrix([[1, 2], [3, 4]])
    >>> m.flatten()
    matrix([[1, 2, 3, 4]])
    >>> m.flatten('F')
    matrix([[1, 3, 2, 4]])
    

    matrix.flat
    A flat iterator over the array.

    >>> x = np.arange(1, 7).reshape(2, 3)
    >>> x
    array([[1, 2, 3],
           [4, 5, 6]])
    >>> x.flat[3]
    4
    >>> x.T
    array([[1, 4],
           [2, 5],
           [3, 6]])
    >>> x.T.flat[3]
    5
    >>> type(x.flat)
    <class 'numpy.flatiter'>
    

    numpy.ravel is a related function which returns an ndarray.

    matrix.ravel(order='C')

    Return a flattened matrix.

    Refer to numpy.ravel for more documentation.

    • Parameters

      order : {'C', 'F', 'A', 'K'}, optional. The elements of m are read using this index order. 'C' means to index the elements in C-like order, with the last axis index changing fastest, back to the first axis index changing slowest. 'F' means to index the elements in Fortran-like index order, with the first index changing fastest, and the last index changing slowest. Note that the 'C' and 'F' options take no account of the memory layout of the underlying array, and only refer to the order of axis indexing. 'A' means to read the elements in Fortran-like index order if m is Fortran contiguous in memory, C-like order otherwise. 'K' means to read the elements in the order they occur in memory, except for reversing the data when strides are negative. By default, 'C' index order is used.

    • Returns

      ret : matrix

      Return the matrix flattened to shape (1, N) where N is the number of elements in the original matrix. A copy is made only if necessary.
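    This is why the gradient descent code counts parameters with theta.ravel().shape[1]; a quick check with the (1, 2) theta assumed earlier:

    >>> theta = np.matrix(np.array([0, 0]))
    >>> theta.ravel().shape[1]
    2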

Visualizing the Results

# evaluate the fitted line on a grid of population values
x = np.linspace(dataframe.population.min(), dataframe.population.max(), 100)
f = g[0, 0] + (g[0, 1] * x)

fig, ax = plt.subplots(figsize=(12,8))
ax.plot(x, f, 'r', label='Prediction')
ax.scatter(dataframe.population, dataframe.profit, label='Training Data')
ax.legend(loc=2)
ax.set_xlabel('Population')
ax.set_ylabel('Profit')
ax.set_title('Predicted Profit vs. Population Size')
plt.show()
fig, ax = plt.subplots(figsize=(12,8))
ax.plot(np.arange(iters), cost, 'r')
ax.set_xlabel('Iterations')
ax.set_ylabel('Cost')
ax.set_title('Error vs. Training Epoch')
plt.show()

Reposted from blog.csdn.net/qq_45175218/article/details/104915716