Small text | number of small text data of the public tour
We have finished the theory of linear regression, so this installment is the long-awaited hands-on part! We will implement linear regression with least squares in the statsmodels package, least squares in sklearn, batch gradient descent, stochastic gradient descent, mini-batch gradient descent, and more. First, let us recall a few important formulas:
Loss function:
Least squares solution for the optimal parameters:
Gradient descent update for the optimal parameters:
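In a notation that matches the code later in this post, the three formulas can be written as follows. The symbols are assumptions rather than the author's original notation: $m$ is the number of training samples, $X$ is the feature matrix with a leading column of ones, $\theta$ is the parameter vector, and $\alpha$ is the learning rate.

```latex
\begin{align}
J(\theta) &= \frac{1}{2m}\sum_{i=1}^{m}\left(y^{(i)} - x^{(i)}\theta\right)^{2}
           = \frac{1}{2m}\lVert y - X\theta \rVert^{2} \\
\theta &= (X^{T}X)^{-1}X^{T}y \\
\theta &:= \theta - \frac{\alpha}{m}\,X^{T}(X\theta - y)
\end{align}
```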
All of the implementations below are built on these few formulas!
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import statsmodels.formula.api as smf
What does the data look like? It is a dataset made up of 3 features and one label. The summary statistics also show that the three features are distributed differently, which affects the later modeling in two ways. First, it affects the speed of gradient descent. Second, it affects the size of the weights: TV, for example, has a maximum of 296 and a mean of 147, more than five times that of radio, so the fitted weights are not on a comparable scale and TV's contribution can swamp radio's. In the extreme case, sales would seem to depend only on TV. To avoid this, the data must be preprocessed before building the model. Here we use z-score standardization!
data = pd.read_csv('./Desktop/Advertising.csv',sep = ',')
print(data.describe())
               TV       radio   newspaper       sales
count  200.000000  200.000000  200.000000  200.000000
mean   147.042500   23.264000   30.554000   14.022500
std     85.854236   14.846809   21.778621    5.217457
min      0.700000    0.000000    0.300000    1.600000
25%     74.375000    9.975000   12.750000   10.375000
50%    149.750000   22.900000   25.750000   12.900000
75%    218.825000   36.525000   45.100000   17.400000
max    296.400000   49.600000  114.000000   27.000000
# Split the dataset into a training set and a test set, then preprocess the training set
train,test = train_test_split(data,test_size = 0.2,shuffle = True,random_state = 0)
train.iloc[:,:-1] = (train.iloc[:,:-1]-train.iloc[:,:-1].mean())/train.iloc[:,:-1].std()
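The line above standardizes only the training set. If the test set is to be scored later, it is usually standardized with the training set's mean and standard deviation rather than its own. A minimal sketch with NumPy follows; the helper names `zscore_fit` and `zscore_apply` and the toy numbers are made up for illustration, and `ddof=1` is used to match the default of pandas `.std()`.

```python
import numpy as np

def zscore_fit(train_features):
    """Return per-column mean and sample standard deviation (ddof=1)."""
    return train_features.mean(axis=0), train_features.std(axis=0, ddof=1)

def zscore_apply(features, mean, std):
    """Standardize features with statistics computed elsewhere (the training set)."""
    return (features - mean) / std

# Hypothetical toy data: two columns standing in for TV and radio
train_x = np.array([[100.0, 10.0],
                    [200.0, 20.0],
                    [300.0, 30.0]])
test_x = np.array([[200.0, 40.0]])

mean, std = zscore_fit(train_x)
train_z = zscore_apply(train_x, mean, std)  # each column now has mean 0 and std 1
test_z = zscore_apply(test_x, mean, std)    # test set reuses the training statistics
print(train_z)
print(test_z)
```

Reusing the training statistics keeps the test features on the same scale the model was fitted on.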
Furthermore, once the model is fitted on standardized features, the parameter of each feature reflects the importance (weight) of that feature. We can therefore use the weights to screen for useful features; Lasso regression selects features on this principle by shrinking small weights to zero. Weights can also be placed on the training samples: if some samples should count for more than others, for example those close to the point being predicted, we give them larger weights before fitting, which is locally weighted linear regression.
Optimal solution for the locally weighted linear regression parameters: $\theta = (X^{T}WX)^{-1}X^{T}Wy$, where $W$ is the diagonal weight matrix that assigns a weight to each training sample.
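To make the weighted solution concrete, here is a minimal sketch of locally weighted linear regression in its standard formulation, where W is a diagonal matrix of per-sample weights. The Gaussian kernel, the bandwidth parameter `tau`, the function name `lwlr_predict`, and the toy data are all assumptions for illustration; the post itself does not implement this method.

```python
import numpy as np

def lwlr_predict(x_query, X, y, tau=1.0):
    """Predict at x_query with locally weighted linear regression.

    X is an (m, n) matrix with a leading column of ones; tau is the
    bandwidth of the Gaussian kernel (smaller tau = more local fit).
    """
    # One weight per training sample, based on its distance to the query point
    diff = X - x_query
    w = np.exp(-np.sum(diff ** 2, axis=1) / (2.0 * tau ** 2))
    W = np.diag(w)
    # theta = (X^T W X)^{-1} X^T W y, solved without forming the inverse
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# Noise-free toy data: y = 2 + 3x
x = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 3.0 * x

pred = lwlr_predict(np.array([1.0, 0.5]), X, y, tau=0.2)
print(pred)
```

Because the toy data is exactly linear, any choice of positive weights recovers the same line; on real data the weights make the fit follow local structure.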
We will not expand on that here. First, let us build a least squares linear model with the statsmodels package.
# statsmodels package, least squares
stats_model = smf.ols('sales~ TV + radio + newspaper',data = train).fit()
print(stats_model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.907
Model:                            OLS   Adj. R-squared:                  0.905
Method:                 Least Squares   F-statistic:                     505.4
Date:                Wed, 19 Jun 2019   Prob (F-statistic):           4.23e-80
Time:                        22:41:19   Log-Likelihood:                -297.29
No. Observations:                 160   AIC:                             602.6
Df Residuals:                     156   BIC:                             614.9
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     14.2175      0.124    114.463      0.000      13.972      14.463
TV             3.7877      0.125     30.212      0.000       3.540       4.035
radio          2.8956      0.132     21.994      0.000       2.636       3.156
newspaper     -0.0596      0.132     -0.451      0.653      -0.321       0.202
==============================================================================
Omnibus:                       13.557   Durbin-Watson:                   2.038
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               15.174
Skew:                          -0.754   Prob(JB):                     0.000507
Kurtosis:                       2.990   Cond. No.                         1.42
==============================================================================
As mentioned in the previous installment, the F-test judges the overall quality of the model, the t-test judges whether each parameter is significant, and R-squared measures how much of the dependent variable is explained by the independent variables. From the summary above, stats_model has an R-squared of 0.907, which indicates a good fit, and the model passes the F-test. Every parameter passes the t-test except newspaper (p = 0.653), so we remove the newspaper feature and rebuild the model.
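The statistics in the summary are tied together by simple identities, which allows a quick sanity check. The sketch below, in plain Python, recovers R-squared from the reported F-statistic and degrees of freedom and then computes the adjusted R-squared. The function names `r2_from_f` and `adjusted_r2` are illustrative and are not part of statsmodels.

```python
def r2_from_f(f_stat, df_model, df_resid):
    """Recover R-squared from the overall F-statistic of an OLS fit with intercept.

    Uses F = (R^2 / df_model) / ((1 - R^2) / df_resid).
    """
    ratio = f_stat * df_model / df_resid  # equals R^2 / (1 - R^2)
    return ratio / (1.0 + ratio)

def adjusted_r2(r2, n_obs, df_model):
    """Adjusted R-squared penalizes R-squared for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - df_model - 1)

# Values read from the first summary: F = 505.4, Df Model = 3, Df Residuals = 156
r2 = r2_from_f(505.4, df_model=3, df_resid=156)
adj = adjusted_r2(r2, n_obs=160, df_model=3)
print(round(r2, 3), round(adj, 3))
```

With the values from the first summary, this reproduces the reported R-squared of 0.907 and adjusted R-squared of 0.905 after rounding.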
stats_model1 = smf.ols('sales~ TV + radio',data = train).fit()
print(stats_model1.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.907
Model:                            OLS   Adj. R-squared:                  0.905
Method:                 Least Squares   F-statistic:                     761.9
Date:                Wed, 19 Jun 2019   Prob (F-statistic):           1.50e-81
Time:                        22:41:35   Log-Likelihood:                -297.40
No. Observations:                 160   AIC:                             600.8
Df Residuals:                     157   BIC:                             610.0
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     14.2175      0.124    114.754      0.000      13.973      14.462
TV             3.7820      0.124     30.401      0.000       3.536       4.028
radio          2.8766      0.124     23.123      0.000       2.631       3.122
==============================================================================
Omnibus:                       13.633   Durbin-Watson:                   2.040
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               15.256
Skew:                          -0.756   Prob(JB):                     0.000487
Kurtosis:                       3.000   Cond. No.                         1.05
==============================================================================
stats_model1 passes all of the tests, so the expression can be written as sales = 3.7820*TV + 2.8766*radio + 14.2175. Next, we build a model with LinearRegression from sklearn's linear_model.
# Data preprocessing: keep TV and radio as features, sales as the label
x = data.iloc[:,:-2]
y = data.iloc[:,-1:]
x = (x-x.mean())/x.std()
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,
                                                 shuffle = True,random_state = 0)
# Least squares with sklearn
lr = LinearRegression()
lr.fit(x_train,y_train)
result_lr = lr.predict(x_test)
print('r2_score:{}'.format(r2_score(y_test,result_lr))) # R-squared
print('coef:{}'.format(lr.coef_))
print('intercept:{}'.format(lr.intercept_))
r2_score:0.8604541663186569
coef:[[3.82192087 2.89820718]]
intercept:[14.03854759]
The model built with sklearn is sales = 3.8219*TV + 2.8982*radio + 14.0385.
Next, we solve for the optimal parameters with the least squares formula by writing our own function.
Because this involves matrix operations, we first convert the datasets to matrix format.
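# Handwritten least squares
def ols_linear_model(x_train,x_test,y_train,y_test):
    # Add a bias column x0 = 1 so the intercept becomes the first weight.
    # Note: insert() modifies the DataFrames in place, so x_train and x_test
    # keep the 'b' column afterwards; the gradient descent code below relies on it.
    x_train.insert(0,'b',[1]*len(x_train))
    x_test.insert(0,'b',[1]*len(x_test))
    x_train = np.matrix(x_train)
    y_train = np.matrix(y_train)
    x_test = np.matrix(x_test)
    # The solution needs a matrix inverse, so first check that the matrix is invertible
    if np.linalg.det(x_train.T*x_train) == 0:
        print('Singular matrix, not invertible')
    else:
        # Solve for the optimal parameters
        weights = np.linalg.inv(x_train.T*x_train)*x_train.T*y_train
        # Predict
        y_predict = x_test*weights
        # np.asarray: recent sklearn versions do not accept np.matrix inputs
        print('r2_score:{}'.format(r2_score(y_test,np.asarray(y_predict))))
        print('coef:{}'.format(weights[1:]))
        # Because x0 is 1, the first parameter is the intercept
        print('intercept:{}'.format(weights[0]))
# Result
ols_linear_model(x_train,x_test,y_train,y_test)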
r2_score:0.860454166318657
coef:[[3.82192087]
[2.89820718]]
intercept:[[14.03854759]]
The handwritten least squares model is sales = 3.8219*TV + 2.8982*radio + 14.0385. Writing it by hand exposes the shortcoming of the least squares method: when X^T X is not invertible, that is, when X does not have full column rank, the formula has no solution and the optimal parameters cannot be found this way. Hence the optimization algorithm: gradient descent.
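The rank problem is easy to reproduce. The sketch below uses a made-up design matrix whose third column duplicates the second, so X^T X is singular. As one common workaround (not the approach taken in this post, which moves on to gradient descent), `np.linalg.lstsq` still returns a least squares solution, namely the one with the smallest norm.

```python
import numpy as np

# Hypothetical design matrix: bias column plus two identical feature columns
X = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])  # generated as y = 1 + 2*x

# Only 2 independent columns out of 3, so X^T X cannot be inverted
rank = np.linalg.matrix_rank(X)

# lstsq works through the SVD and picks the minimum-norm solution
theta, residuals, rank_lstsq, singular_values = np.linalg.lstsq(X, y, rcond=None)
y_fit = X @ theta

print('rank:', rank)
print('theta:', theta)
print('fit matches y:', np.allclose(y_fit, y))
```

The weight of 2 is split evenly across the two duplicate columns: infinitely many weight vectors fit equally well, which is exactly why the plain inverse formula breaks down.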
Depending on how much of the dataset is used for each parameter update, gradient descent comes in three variants: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.
- Batch gradient descent: each parameter update uses the entire dataset (all)
- Stochastic gradient descent: each parameter update uses a single sample (one)
- Mini-batch gradient descent: each parameter update uses part of the dataset (one < num < all)
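Since the three variants differ only in how many samples feed each update, they can be sketched as one function with a batch size parameter. This is an illustrative sketch under stated assumptions, not the code used in this post: the name `minibatch_gd`, the noise-free synthetic data, and the learning rates are made up, and batches are drawn without replacement via `Generator.choice`, whereas the post's own functions sample with `np.random.randint`.

```python
import numpy as np

def minibatch_gd(X, y, alpha, batch_size, n_iter, seed=0):
    """Gradient descent for linear regression on random batches.

    batch_size == len(X) gives batch gradient descent,
    batch_size == 1 gives stochastic gradient descent,
    anything in between gives mini-batch gradient descent.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iter):
        idx = rng.choice(m, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        # Average gradient of the squared loss over the selected batch
        gradient = Xb.T @ (Xb @ theta - yb) / batch_size
        theta = theta - alpha * gradient
    return theta

# Noise-free synthetic data with a bias column (hypothetical values)
data_rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), data_rng.normal(size=(100, 2))])
true_theta = np.array([14.0, 3.8, 2.9])
y = X @ true_theta

theta_batch = minibatch_gd(X, y, alpha=0.1, batch_size=100, n_iter=500)
theta_sgd = minibatch_gd(X, y, alpha=0.05, batch_size=1, n_iter=5000)
theta_mini = minibatch_gd(X, y, alpha=0.1, batch_size=20, n_iter=1000)
print(theta_batch, theta_sgd, theta_mini)
```

On noise-free data all three settings approach the same parameters; with noisy data the single-sample and small-batch runs would fluctuate around the solution instead.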
# Batch gradient descent
def gradient_desc(x_train, y_train,x_test,alpha, max_itor):
    # x_train and x_test must already contain the bias column 'b';
    # y_test is read from the enclosing (global) scope
    x_train = np.array(x_train)
    x_test = np.array(x_test)
    y_train = np.array(y_train).flatten()
    theta = np.zeros(x_train.shape[1])
    epsilon = 1e-8
    iter_count = 0
    loss = 10
    # Stop when the loss reaches the threshold or the maximum number of iterations is reached
    while loss > epsilon and iter_count < max_itor:
        iter_count+=1
        # Gradient (uses all of the training data)
        gradient = x_train.T.dot(x_train.dot(theta) - y_train)/ len(y_train)
        theta = theta - alpha * gradient
        # Loss function
        loss = np.sum((y_train - np.dot(x_train, theta))**2) / (2*len(y_train))
    y_predict = x_test.dot(theta)
    print('r2_score:{}'.format(r2_score(y_test,y_predict)))
    print('coef:{}'.format(theta[1:]))
    print('intercept:{}'.format(theta[0]))
# Result
gradient_desc(x_train, y_train,x_test,alpha=0.001, max_itor=10000)
r2_score:0.8604634817515153
coef:[3.82203058 2.8981221 ]
intercept:14.037935836020237
The resulting model is sales = 3.8220*TV + 2.8981*radio + 14.0379. Writing batch gradient descent by hand shows its drawback: every parameter update uses the entire training set, so once the training set is large the computation takes a long time. Hence an optimized algorithm: stochastic gradient descent.
# Stochastic gradient descent
def s_gradient_desc(x_train, y_train,x_test,alpha, max_itor):
    # x_train and x_test must already contain the bias column 'b';
    # y_test is read from the enclosing (global) scope
    x_train = np.array(x_train)
    x_test = np.array(x_test)
    y_train = np.array(y_train).flatten()
    theta = np.zeros(x_train.shape[1])
    epsilon = 1e-8
    iter_count = 0
    loss = 10
    # Stop when the loss reaches the threshold or the maximum number of iterations is reached
    while loss > epsilon and iter_count < max_itor:
        iter_count+=1
        rand_i = np.random.randint(len(x_train))
        # Gradient (uses a single randomly chosen training sample)
        gradient = x_train[rand_i].T.dot(x_train[rand_i].dot(theta) - y_train[rand_i])
        theta = theta - alpha * gradient
        # Loss function (evaluated on the whole training set)
        loss = np.sum((y_train - np.dot(x_train, theta))**2) / (2*len(y_train))
    y_predict = x_test.dot(theta)
    print('r2_score:{}'.format(r2_score(y_test,y_predict)))
    print('coef:{}'.format(theta[1:]))
    print('intercept:{}'.format(theta[0]))
    print('iter_count:{}'.format(iter_count))
# Result
s_gradient_desc(x_train, y_train,x_test,alpha=0.001, max_itor=10000)
r2_score:0.8607601654222723
coef:[3.83573278 2.90238477]
intercept:14.036801544903055
The resulting model is sales = 3.8357*TV + 2.9023*radio + 14.0368. Building stochastic gradient descent shows its drawback: each update uses only a single training sample, so the update direction is noisy and the result may hover around the optimum or, for non-convex losses, settle in a local optimum. Combining the advantages of batch gradient descent and stochastic gradient descent gives mini-batch gradient descent.
# Mini-batch gradient descent
def sb_gradient_desc(x_train, y_train,x_test,alpha,num,max_itor):
    # x_train and x_test must already contain the bias column 'b';
    # y_test is read from the enclosing (global) scope
    x_train = np.array(x_train)
    x_test = np.array(x_test)
    y_train = np.array(y_train).flatten()
    theta = np.zeros(x_train.shape[1])
    epsilon = 1e-8
    iter_count = 0
    loss = 10
    # Stop when the loss reaches the threshold or the maximum number of iterations is reached
    while loss > epsilon and iter_count < max_itor:
        iter_count+=1
        # Draw num random row indices (with replacement)
        rand_i = np.random.randint(0,len(x_train),num)
        # Gradient (uses part of the training data)
        gradient = x_train[rand_i].T.dot(x_train[rand_i].dot(theta) - y_train[rand_i])/num
        theta = theta - alpha * gradient
        # Loss function (evaluated on the whole training set)
        loss = np.sum((y_train - np.dot(x_train, theta))**2) / (2*len(y_train))
    y_predict = x_test.dot(theta)
    print('r2_score:{}'.format(r2_score(y_test,y_predict)))
    print('coef:{}'.format(theta[1:]))
    print('intercept:{}'.format(theta[0]))
    print('iter_count:{}'.format(iter_count))
# Result
sb_gradient_desc(x_train, y_train,x_test,alpha=0.001,num=20,max_itor=10000)
r2_score:0.860623250516056
coef:[3.82871666 2.89894667]
intercept:14.042705519319549
The resulting model is sales = 3.8287*TV + 2.8989*radio + 14.0427.
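In conclusion:
- statsmodels: sales = 3.7820*TV + 2.8766*radio + 14.2175
- sklearn: sales = 3.8219*TV + 2.8982*radio + 14.0385
- Handwritten least squares: sales = 3.8219*TV + 2.8982*radio + 14.0385
- Batch gradient descent: sales = 3.8220*TV + 2.8981*radio + 14.0379
- Stochastic gradient descent: sales = 3.8357*TV + 2.9023*radio + 14.0368
- Mini-batch gradient descent: sales = 3.8287*TV + 2.8989*radio + 14.0427
The statsmodels coefficients differ slightly from the rest because that model was fitted on features standardized with the training set's mean and standard deviation, while the other models use features standardized on the full dataset.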
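All of these expressions take standardized inputs, so a raw advertising budget has to be converted to a z-score before it is plugged in. The sketch below shows this for the sklearn model, using the full-dataset means and standard deviations printed by data.describe() earlier. The function name `predict_sales` is illustrative, and the same constants would not be right for the statsmodels model, which was standardized on the training set only.

```python
# Means and standard deviations of the raw features, from data.describe()
TV_MEAN, TV_STD = 147.0425, 85.854236
RADIO_MEAN, RADIO_STD = 23.264, 14.846809

# Coefficients and intercept printed by the sklearn model
COEF_TV, COEF_RADIO, INTERCEPT = 3.82192087, 2.89820718, 14.03854759

def predict_sales(tv, radio):
    """Predict sales from raw TV and radio budgets with the sklearn model."""
    tv_z = (tv - TV_MEAN) / TV_STD
    radio_z = (radio - RADIO_MEAN) / RADIO_STD
    return COEF_TV * tv_z + COEF_RADIO * radio_z + INTERCEPT

# At the average budgets both z-scores are 0, so the prediction is the intercept
baseline = predict_sales(147.0425, 23.264)
# One standard deviation more TV spending adds the TV coefficient
one_std_more_tv = predict_sales(147.0425 + 85.854236, 23.264)
print(baseline, one_std_more_tv)
```

This also gives the coefficients a direct reading: each one is the change in predicted sales for a one standard deviation change in that feature.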