Exercise 1: Linear Regression


Introduction

In this exercise, you will implement linear regression and see how it works on your data.

Before starting the exercise, download the following data files:

  • ex1data1.txt - Univariate linear regression dataset
  • ex1data2.txt - Multivariate linear regression dataset

The exercise contains both required assignments and optional assignments, the latter marked as optional.

The required assignment is to implement univariate linear regression; the optional assignment is to implement multivariate linear regression.

1 Implement a simple example function

In this part of the exercise, you will write code that returns a 5×5 identity matrix (a diagonal matrix with ones on the diagonal). The output should match:

1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1

1.1 Submit a solution

Write the implementation in the code box below. If your output matches the result above, you pass.

### Fill in your code here ###
import numpy as np
print(np.eye(5))

[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]

2 Univariate linear regression

In this part of the exercise, univariate linear regression will be implemented and used to predict the profit of a food truck.

Suppose you run a restaurant chain and are considering opening new locations in different cities. The chain already has food trucks in several cities, and you have population and profit data for each of them.

You now want to use this data to help choose the next city to expand into.

The file ex1data1.txt contains the dataset for the linear regression problem. The first column is the population of a city, and the second column is the profit of the food truck in that city. A negative profit indicates a loss.

2.1 Plotting data

It is often useful to visualize the data before jumping into the exercises. For this dataset, a scatterplot can be used for visualization since it has only two attributes (population, profit).

# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

%matplotlib inline
# Path to the data file
path = '/home/jovyan/work/ex1data1.txt'

# Read the data file
data = pd.read_csv(path, header=None,names=['Population','Profit'])

# Show the first five rows
data.head(5)
Population Profit
0 6.1101 17.5920
1 5.5277 9.1302
2 8.5186 13.6620
3 7.0032 11.8540
4 5.8598 6.8233

Next, write the code that visualizes the data. The resulting plot should match the figure below.

Main points:

  • Draw a scatter plot of the data
  • Show the data points as red dots
  • Set the horizontal and vertical axis labels

[Figure 1-1.png: expected scatter plot of Profit against Population]

### Fill in your code here ###
data.plot(kind='scatter', x='Population',y='Profit',c='red',figsize=(12,8))
plt.show()

[Figure output_7_0.png: scatter plot produced by the code above]

2.2 Gradient Descent

In this section, gradient descent will be used to select the appropriate linear regression parameters θ to fit a given dataset.

2.2.1 Update formula

The goal of linear regression is to minimize the cost function:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^{2}$$

where $m$ is the number of training examples and the hypothesis $h_{\theta}(x)$ is given by the following linear model:

$$h_{\theta}(x) = \theta^{T} x = \theta_0 + \theta_1 x_1$$

To review, the parameters of the model are the $\theta_j$ values; these are adjusted to minimize the cost $J(\theta)$.

One way to do this is the batch gradient descent algorithm, in which the update below is performed on every iteration. With each step of gradient descent, the parameters $\theta_j$ move closer to the optimal values that achieve the lowest cost $J(\theta)$.

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

(simultaneously update $\theta_j$ for all $j$)
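
To make the simultaneous update concrete, here is a minimal sketch of a single batch gradient descent step. The three-point dataset and the learning rate of 0.1 are made up for illustration and are not taken from ex1data1.txt; the point is only that the error is computed once from the old parameters before any component of θ changes.

```python
import numpy as np

# Tiny illustrative dataset (not from ex1data1.txt): y = 2x exactly.
# The first column of ones multiplies the intercept theta_0.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

theta = np.zeros(2)   # start from theta_0 = theta_1 = 0
alpha = 0.1           # illustrative learning rate (an assumption)
m = len(y)

# Compute the error once, using the OLD theta for every component.
error = X @ theta - y            # h_theta(x^(i)) - y^(i) for each example
gradient = (X.T @ error) / m     # one partial derivative per theta_j

# Simultaneous update: both components change in a single assignment.
theta_new = theta - alpha * gradient
print(theta_new)
```

Here the errors are (-2, -4, -6), so the gradient is (-4, -28/3) and the step moves θ from (0, 0) to (0.4, 0.9333...), towards the exact fit (0, 2) in its slope component.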

2.2.2 Implementation

In the previous part of the exercise, we loaded the data into the variable data and named its columns.

Next, we add a column of ones to the data to accommodate the intercept term $\theta_0$. We also initialize the parameters to 0 and set the learning rate $\alpha$ to 0.01.

# Add a column of ones at column index 0.
# The check makes the cell safe to re-run: inserting the column a second time
# raises "ValueError: cannot insert Ones, already exists".
if 'Ones' not in data.columns:
    data.insert(0, 'Ones', 1)

# Number of columns
cols = data.shape[1]

# Initialize X and y and convert them to matrices
X = data.iloc[:,0:cols-1]
y = data.iloc[:,cols-1:cols]
X = np.matrix(X.values)
y = np.matrix(y.values)

# Initialize the learning rate and the number of iterations
alpha = 0.01
iterations = 1500

2.2.3 Computing the cost J(θ)

When performing gradient descent to minimize the cost function $J(\theta)$, it is helpful to monitor convergence by computing the cost.

In this part of the exercise, you need to implement a function computeCost that computes the cost $J(\theta)$, so that you can check the convergence of your gradient descent implementation.

Here X and y are not scalar values but matrices whose rows represent examples in the training set.
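
As a hand-checkable illustration of the shapes involved, the sketch below evaluates the cost on a made-up three-example dataset using plain NumPy arrays. The name cost_sketch and the data are assumptions for illustration only; this is not the exercise's computeCost, just the same formula on numbers small enough to verify by hand.

```python
import numpy as np

def cost_sketch(X, y, theta):
    """Squared-error cost J(theta) = sum((X @ theta - y)^2) / (2m)."""
    residual = X @ theta - y      # shape (m, 1): one residual per training example
    return float(np.sum(residual ** 2) / (2 * len(X)))

# Hypothetical data: 3 examples, a column of ones plus one feature.
X_demo = np.array([[1.0, 1.0],
                   [1.0, 2.0],
                   [1.0, 3.0]])           # shape (3, 2)
y_demo = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1)

theta_zero = np.zeros((2, 1))
theta_exact = np.array([[0.0], [1.0]])    # y = x fits this data exactly

cost_at_zero = cost_sketch(X_demo, y_demo, theta_zero)    # (1 + 4 + 9) / 6
cost_at_exact = cost_sketch(X_demo, y_demo, theta_exact)  # residuals are all zero
print(cost_at_zero, cost_at_exact)
```

With θ = 0 the residuals are simply -y, so the cost is (1 + 4 + 9) / 6; with the exact fit every residual vanishes and the cost is 0.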

Important:
After completing the function, initialize $\theta$ to 0, compute the cost, and print the result.

If the result is 32.07, the calculation passes.

### Fill in your code here ###
# Cost function
def computeCost(X,y,w):
    inner = np.power(((X * w) - y),2)
    return np.sum(inner) / (2 * len(X))

theta = np.matrix(np.zeros((2,1)))

computeCost(X,y,theta)

32.072733877455676

2.2.4 Gradient Descent

Next, we implement gradient descent. The provided code already contains the loop structure; you only need to supply the update of $\theta$ in each iteration.

When implementing the code, make sure you understand what is being optimized and what is being updated.

Remember that the cost $J(\theta)$ is parameterized by the vector $\theta$, not by $X$ and $y$. That is, we minimize $J(\theta)$ by changing the values of the vector $\theta$, not by changing $X$ or $y$.

A good way to verify that gradient descent is working is to look at the value of $J(\theta)$ and check that it decreases with each step. On each iteration, the code calls the computeCost function and records the cost. If gradient descent and the cost are implemented correctly, the value of $J(\theta)$ should never increase and should converge to a stable value by the end of the algorithm.
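
The "never increase" check described above can be automated with a few lines. This is a minimal sketch: the helper name is_non_increasing and both cost histories are invented for illustration and are not results from this exercise.

```python
import numpy as np

def is_non_increasing(cost_history, tol=1e-12):
    """Return True if no entry exceeds its predecessor by more than tol."""
    cost_history = np.asarray(cost_history, dtype=float)
    return bool(np.all(np.diff(cost_history) <= tol))

# Hypothetical cost histories, for illustration only.
healthy = [4.0, 2.0, 1.5, 1.4]
diverging = [4.0, 5.0, 9.0]

print(is_non_increasing(healthy))
print(is_non_increasing(diverging))
```

Applied to the per-iteration cost array returned by a gradient descent routine, a False result usually points to a learning rate that is too large or to a bug in the update.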

Main points:

After implementing gradient descent, use the final parameter values to visualize the linear regression fit. The resulting plot should look similar to the figure below.
[Figure 1-5.png: expected plot of the fitted line over the training data]

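### Fill in your code here ###
def gradientDescent(X, y, theta, alpha, iters):
    temp = np.matrix(np.zeros(theta.shape))   # holds the updated parameters
    parameters = int(theta.ravel().shape[1])  # number of parameters
    cost = np.zeros(iters)                    # cost after each iteration

    for i in range(iters):
        # Residuals under the current theta, computed once so that
        # every theta_j is updated from the same values
        error = (X * theta) - y

        for j in range(parameters):
            term = np.multiply(error, X[:,j])
            temp[j,0] = theta[j,0] - ((alpha / len(X)) * np.sum(term))

        theta = temp
        cost[i] = computeCost(X, y, theta)

    return theta, cost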


# Start training; the call returns the parameters of the trained model
t_final, cost_final = gradientDescent(X, y, theta, alpha, iterations)

# Compute the cost obtained with the final parameters
computeCost(X, y, t_final)
4.483388256587726
### Fill in your code here ###
# Plot the fitted line
x = np.linspace(data.Population.min(), data.Population.max(), 100)
f = t_final[0,0] + (t_final[1,0] * x)

fig, ax = plt.subplots(figsize=(9,6))
ax.plot(x, f, 'b', label='Prediction')
ax.scatter(data.Population, data.Profit, c='red',label='Training Data')
ax.legend(loc=2)
ax.set_xlabel('Population')
ax.set_ylabel('Profit')
ax.set_title('Predicted Profit vs. Population Size')

Text(0.5, 1.0, 'Predicted Profit vs. Population Size')

[Figure output_15_1.png: fitted line plotted over the training data]

2.3 Visualizing the cost function

In order to better understand the iterative calculation of the cost function, the cost value calculated at each step is recorded and plotted.

fig, ax = plt.subplots(figsize=(12,8))
ax.plot(np.arange(iterations), cost_final, 'r')  # cost_final is the cost history returned by gradientDescent
ax.set_xlabel('Iterations')
ax.set_ylabel('Cost')
ax.set_title('Error vs. Training Epoch')
Text(0.5, 1.0, 'Error vs. Training Epoch')

[Figure output_17_1.png: cost plotted against the number of iterations]

Optional exercise


3 Multivariate linear regression

In this part, multiple variables will be used to implement linear regression to predict house prices. Let's say you're currently selling a home and want to know what a good market price is.

One approach is to first gather information on recently sold homes and second to model home prices.

The file ex1data2.txt contains home prices and related information for Portland, Oregon. The first column is the size of the house in square feet, the second column is the number of bedrooms, and the third column is the price of the house.

3.1 Feature Standardization

The following code loads the dataset from the file ex1data2.txt and displays it.

Looking at the data, the house sizes are roughly 1000 times larger than the number of bedrooms. When feature values differ by orders of magnitude, scaling the features first can make gradient descent converge faster.

path = '/home/jovyan/work/ex1data2.txt'
data2 = pd.read_csv(path, header=None, names=['Size', 'Bedrooms', 'Price'])
data2.head()
Size Bedrooms Price
0 2104 3 399900
1 1600 3 329900
2 2400 3 369000
3 1416 2 232000
4 3000 4 539900


In this part of the exercise, your task is to write code that standardizes the data in the dataset.

Main points:

  • Subtract the mean of each feature from the dataset.
  • After subtracting the mean, divide the resulting feature values by their respective "standard deviations".

The standard deviation measures how much the values of a particular feature vary (most data points lie within two standard deviations of the mean); it is an alternative to using the range of values.

When normalizing features, you need to store the values used for normalization: the mean and the standard deviation. After learning the parameters of the model, you will often want to predict the price of a new house. Given a new $x$ value (house size and number of bedrooms), you must first normalize it using the mean and standard deviation previously computed from the training set.
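
A minimal sketch of storing the normalization statistics and reusing them on a new example might look like the following. It uses only the five rows of ex1data2.txt displayed earlier, so the numbers differ from those obtained on the full dataset; it standardizes the two features only, and the new house (1650 square feet, 3 bedrooms) is a hypothetical input chosen for illustration.

```python
import pandas as pd

# The first five rows of ex1data2.txt as displayed in this exercise.
train = pd.DataFrame({
    'Size': [2104, 1600, 2400, 1416, 3000],
    'Bedrooms': [3, 3, 3, 2, 4],
})

# Keep the statistics so the same transformation can be reused later.
mu = train.mean()
sigma = train.std()          # pandas uses the sample standard deviation (ddof=1)

train_norm = (train - mu) / sigma

# A hypothetical new house: normalize it with the TRAINING statistics,
# not with statistics computed from the new data.
new_house = pd.Series({'Size': 1650, 'Bedrooms': 3})
new_house_norm = (new_house - mu) / sigma
print(new_house_norm)
```

Note that pandas' std() divides by m - 1 by default, whereas NumPy's np.std divides by m; either works for scaling as long as the same stored values are used for training data and new examples.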

### Fill in your code here ###
data2 = (data2 - data2.mean()) / data2.std()
data2.head()

Size Bedrooms Price
0 0.130010 -0.223675 0.475747
1 -0.504190 -0.223675 -0.084074
2 0.502476 -0.223675 0.228626
3 -0.735723 -1.537767 -0.867025
4 1.257476 1.090417 1.595389

3.2 Gradient Descent

In the previous exercise, we implemented gradient descent for univariate linear regression. In this part, the only difference is that the data matrix $X$ now contains more than one feature.

The hypothesis function and the batch gradient descent update rule remain unchanged. Your task is to implement the cost function and gradient descent for multivariate linear regression.

Main points:

  • Make sure your code supports data of any size and is vectorized.
  • Once the cost function and gradient descent are implemented, the final cost should be approximately 0.13.
  • Plot the cost curve as required in the univariate linear regression exercise.
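
One way to meet the vectorization requirement is to replace the loop over individual parameters with a single matrix product. The sketch below is an assumption-laden illustration rather than the exercise's reference solution: the function name gradient_descent_vectorized is hypothetical, it uses plain NumPy arrays with the @ operator instead of np.matrix, and it is run on synthetic noise-free data, not on ex1data2.txt.

```python
import numpy as np

def gradient_descent_vectorized(X, y, theta, alpha, iters):
    """Batch gradient descent with no explicit loop over parameters.

    X: (m, n) array including the column of ones, y: (m, 1), theta: (n, 1).
    """
    m = len(X)
    theta = theta.copy()              # leave the caller's array untouched
    cost = np.zeros(iters)
    for i in range(iters):
        error = X @ theta - y         # (m, 1) residuals under the current theta
        theta = theta - (alpha / m) * (X.T @ error)   # update all theta_j at once
        cost[i] = np.sum((X @ theta - y) ** 2) / (2 * m)
    return theta, cost

# Synthetic, roughly standardized data: two features plus the intercept column.
rng = np.random.default_rng(0)
features = rng.standard_normal((50, 2))
X_demo = np.column_stack([np.ones(50), features])
true_theta = np.array([[0.0], [0.9], [-0.1]])   # made-up parameters
y_demo = X_demo @ true_theta          # noise-free targets, so the cost can approach 0

theta_fit, cost_hist = gradient_descent_vectorized(
    X_demo, y_demo, np.zeros((3, 1)), alpha=0.1, iters=500)
print(theta_fit.ravel())
print(cost_hist[-1])
```

Because X.T @ error computes every partial derivative in one product, the same function works unchanged for any number of features.
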
### Fill in your code here ###
data2.insert(0, 'Ones', 1)
cols = data2.shape[1]
X2 = data2.iloc[:,0:cols-1]
y2 = data2.iloc[:,cols-1:cols]

X2 = np.matrix(X2.values)
y2 = np.matrix(y2.values)
theta = np.matrix(np.array([0,0,0]))

w2_final, cost2_final = gradientDescent(X2, y2, theta.T, alpha, iterations)

print('The weight vector:\n',w2_final)
computeCost(X2, y2, w2_final)
 
fig, ax = plt.subplots(figsize=(12,8))
ax.plot(np.arange(iterations), cost2_final, 'r')
ax.set_xlabel('Iterations')
ax.set_ylabel('Cost')
ax.set_title('Error vs. Iterations')

The weight vector:
 [[-1.00309831e-16]
 [ 8.84042349e-01]
 [-5.24551809e-02]]





Text(0.5, 1.0, 'Error vs. Iterations')

[Figure output_23_2.png: cost plotted against the number of iterations for multivariate linear regression]

Origin blog.csdn.net/qq_52187415/article/details/131263557