Univariate linear regression
Problem: ex1data1.txt contains two columns of data. The first column is the population of a city, and the second column is the profit of a snack bar in that city. The goal is to predict a snack bar's profit from the city's population.
Formulas needed
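The formulas themselves appear to have been lost in extraction (they were probably an image). Reconstructed from the code below, these are the standard univariate linear regression formulas: the hypothesis, the cost function, and the gradient descent update rule.

```latex
h_\theta(x) = \theta_0 + \theta_1 x
```

```latex
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta\!\left(x^{(i)}\right) - y^{(i)}\right)^2
```

```latex
\theta_j := \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta\!\left(x^{(i)}\right) - y^{(i)}\right)x_j^{(i)}
```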
Implementation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
numpy handles matrix operations, pandas is used for data manipulation, and matplotlib is mainly used for visualization.
Read the data set
path = "D:\\ML\\ML\\Linear Regression with One Variable\\ex1data1.txt"
data = pd.read_csv(path, header=None, names=['Population', 'Profit'])  # read the file
data.head()  # show the first rows
data.plot(kind='scatter', x='Population', y='Profit', figsize=(12, 8))
plt.show()
Calculate the loss function J(θ)
def computeCost(X, y, theta):
    inner = np.power(((X * theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))
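As a quick sanity check (not part of the original tutorial), the cost can be verified by hand on tiny made-up data: with θ = (0, 0) every prediction is 0, so the cost is the sum of squared targets divided by 2m.

```python
import numpy as np

def computeCost(X, y, theta):
    # Mean squared error divided by 2, as in the course definition of J(theta).
    inner = np.power(((X * theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))

# Toy data: two samples, bias column of ones already inserted.
X = np.matrix([[1.0, 1.0], [1.0, 2.0]])
y = np.matrix([[1.0], [2.0]])
theta = np.matrix([[0.0, 0.0]])

print(computeCost(X, y, theta))  # (1^2 + 2^2) / (2 * 2) = 1.25
```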
Processing the data set
data.insert(0, 'Ones', 1)  # insert a column named 'Ones' (all 1s) at position 0
# Initialize variables X and y.
# data.shape returns (number of rows, number of columns), so shape[1] is the column count.
cols = data.shape[1]
# iloc slicing is [start row:end row, start column:end column];
# ranges are closed on the left and open on the right, starting from 0.
X = data.iloc[:, :cols-1]      # every column except the last
y = data.iloc[:, cols-1:cols]  # the last column only
X = np.matrix(X.values)  # convert array to matrix (note the difference between array and matrix)
y = np.matrix(y.values)
theta = np.matrix(np.array([0, 0]))
Gradient descent algorithm
def gradientDescent(X, y, theta, alpha, iters):
    temp = np.matrix(np.zeros(theta.shape))
    # ravel flattens theta to one dimension so we can count the parameters
    parameters = int(theta.ravel().shape[1])
    cost = np.zeros(iters)
    for i in range(iters):
        error = (X * theta.T) - y
        for j in range(parameters):
            term = np.multiply(error, X[:, j])
            temp[0, j] = theta[0, j] - ((alpha / len(X)) * np.sum(term))
        theta = temp
        cost[i] = computeCost(X, y, theta)
    return theta, cost
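The inner loop over parameters can also be collapsed into a single matrix expression, since the gradient of J(θ) for all parameters at once is (1/m)·errorᵀ·X. This fully vectorized variant (a sketch, not from the original tutorial; tested here on synthetic noiseless data generated from y = 1 + 2x) behaves the same as the loop version:

```python
import numpy as np

def computeCost(X, y, theta):
    inner = np.power(((X * theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))

def gradientDescentVec(X, y, theta, alpha, iters):
    # Update all parameters in one matrix expression per iteration.
    cost = np.zeros(iters)
    for i in range(iters):
        error = (X * theta.T) - y                          # shape (m, 1)
        theta = theta - (alpha / len(X)) * (error.T * X)   # shape (1, n)
        cost[i] = computeCost(X, y, theta)
    return theta, cost

# Synthetic data generated from y = 1 + 2x, so theta should approach [1, 2].
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=50)
X = np.matrix(np.column_stack([np.ones_like(x), x]))
y = np.matrix((1 + 2 * x).reshape(-1, 1))
theta, cost = gradientDescentVec(X, y, np.matrix([[0.0, 0.0]]), 0.01, 5000)
print(theta)  # close to [[1, 2]]
```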
Set the learning rate and number of iterations and execute
alpha = 0.01
iters = 1500
g, cost = gradientDescent(X, y, theta, alpha, iters)
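For this small problem the result of gradient descent can be cross-checked against the closed-form normal equation, θ = (XᵀX)⁻¹Xᵀy. This helper is not part of the original tutorial; the sketch below verifies it on made-up exactly-linear data (y = 3 + 0.5x):

```python
import numpy as np

def normalEquation(X, y):
    # Solve (X^T X) theta = X^T y instead of forming the inverse explicitly.
    return np.linalg.solve(X.T * X, X.T * y)

x = np.array([1.0, 2.0, 3.0, 4.0])
X = np.matrix(np.column_stack([np.ones_like(x), x]))
y = np.matrix((3 + 0.5 * x).reshape(-1, 1))
theta = normalEquation(X, y)
print(theta)  # close to [[3], [0.5]]
```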
Visualization
# Visualization
x = np.linspace(data.Population.min(), data.Population.max(), 100)
f = g[0, 0] + (g[0, 1] * x)
fig, ax = plt.subplots(figsize=(12, 8))  # fitted line
ax.plot(x, f, 'r', label='Prediction')
ax.scatter(data.Population, data.Profit, label='Training Data')
ax.legend(loc=2)
ax.set_xlabel('Population')
ax.set_ylabel('Profit')
ax.set_title('Predicted Profit vs. Population Size')
plt.show()
plt.plot(range(0, 1500), cost)  # cost per iteration
plt.show()
pandas:
pd.read_csv(filename): import data from a CSV file
data.head(n): display the first n rows (5 by default)
data.plot(kind='scatter', x='Population', y='Profit', figsize=(12, 8)): draw a scatter plot
data.shape: get the number of rows and columns of the data as a tuple (an attribute, not a method)
data.shape[1]: get the number of columns
data.iloc[:, :-1]: select part of the data; the slicing is [start row:end row, start column:end column], closed on the left and open on the right
data.insert(0, 'Ones', 1): insert a column named 'Ones' at position 0, filled with the value 1
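The pandas operations above can be sketched on a tiny hand-made DataFrame (the values here are invented, not from ex1data1.txt):

```python
import pandas as pd

df = pd.DataFrame({'Population': [6.1, 5.5, 8.5],
                   'Profit':     [17.6, 9.1, 13.7]})
df.insert(0, 'Ones', 1)   # new first column of ones
print(df.head(2))         # first two rows
print(df.shape)           # (3, 3): rows, columns
X = df.iloc[:, :-1]       # all columns except the last
y = df.iloc[:, -1:]       # the last column only
print(X.shape, y.shape)   # (3, 2) (3, 1)
```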
numpy:
np.power(data, 2): square each element of the matrix
np.sum(data): sum the elements of the matrix
np.matrix(X.values): convert an array into a matrix
data.ravel(): flatten the data to one dimension
np.multiply: element-wise multiplication; * is matrix multiplication (for np.matrix)
x = np.linspace(data.Population.min(), data.Population.max(), 100): generate an evenly spaced sequence of 100 elements; the first two parameters are the start and end of the sequence
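The difference between np.multiply and * on np.matrix objects is the most common source of bugs in this exercise, so here is a small demonstration on made-up 2×2 matrices:

```python
import numpy as np

A = np.matrix([[1, 2], [3, 4]])
B = np.matrix([[10, 20], [30, 40]])

print(np.power(A, 2))        # element-wise square: [[1, 4], [9, 16]]
print(np.sum(A))             # 10
print(np.multiply(A, B))     # element-wise: [[10, 40], [90, 160]]
print(A * B)                 # matrix product: [[70, 100], [150, 220]]
print(np.linspace(0, 1, 5))  # [0.  0.25 0.5  0.75 1. ]
```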