Machine Learning Made Easy: Linear Regression in Practice


In the previous post we went through the theory behind linear regression, so this installment is the long-awaited hands-on part! Below we implement linear regression in several ways: ordinary least squares with the statsmodels package, least squares with sklearn, batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. First, let's recall a few important formulas:

Loss function: J(θ) = 1/(2m) Σᵢ (x⁽ⁱ⁾θ - y⁽ⁱ⁾)²
Least squares (closed-form) solution for the optimal parameters: θ = (XᵀX)⁻¹Xᵀy
Gradient descent update for the parameters: θ := θ - α∇J(θ), with ∇J(θ) = (1/m) Xᵀ(Xθ - y)

All of the implementations below are built on just these few formulas.
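To make the notation concrete, here is a tiny sketch on synthetic (made-up) data checking that the gradient descent update converges to the same parameters as the closed-form least squares solution:

import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]   # bias column plus two features
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=100)

# closed-form least squares: theta = (X^T X)^{-1} X^T y
theta_ols = np.linalg.inv(X.T @ X) @ X.T @ y

# gradient descent: theta := theta - alpha * (1/m) X^T (X theta - y)
theta_gd = np.zeros(3)
alpha = 0.1
for _ in range(5000):
    gradient = X.T @ (X @ theta_gd - y) / len(y)
    theta_gd = theta_gd - alpha * gradient

print(theta_ols)   # close to [1, 2, -3]
print(theta_gd)    # nearly identical to theta_ols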

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import statsmodels.formula.api as smf

What does the data look like? It is a data set with three features and one label. The summary statistics also show that the three features follow quite different distributions, which affects the later modeling in two ways. First, it slows down gradient descent. Second, it distorts the weights: the TV feature, for example, has a maximum of 296.4 and a mean of about 147, more than five times the scale of radio, which would let TV's influence swamp radio's; taken to the extreme, the model would seem to say that sales depend on TV alone. To avoid this, the data must be preprocessed before the model is built; the preprocessing used here is z-score standardization.

data = pd.read_csv('./Desktop/Advertising.csv',sep = ',')
print(data.describe())
            TV       radio   newspaper       sales
count  200.000000  200.000000  200.000000  200.000000
mean   147.042500   23.264000   30.554000   14.022500
std     85.854236   14.846809   21.778621    5.217457
min      0.700000    0.000000    0.300000    1.600000
25%     74.375000    9.975000   12.750000   10.375000
50%    149.750000   22.900000   25.750000   12.900000
75%    218.825000   36.525000   45.100000   17.400000
max    296.400000   49.600000  114.000000   27.000000

# Split the data into a training set and a test set, and standardize the training set
train,test = train_test_split(data,test_size = 0.2,shuffle = True,random_state = 0)
train.iloc[:,:-1] = (train.iloc[:,:-1]-train.iloc[:,:-1].mean())/train.iloc[:,:-1].std()
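A note in passing: if we later want to evaluate a model on the held-out test set, the test features must be transformed with the training set's statistics, not their own. A minimal sketch, written as a standalone alternative to the scaling line above (mu and sigma are helper names introduced here):

# Alternative scaling: save the training statistics first, then apply the same
# z-score transform to both the training and the test features.
feature_cols = train.columns[:-1]                         # TV, radio, newspaper
mu, sigma = train[feature_cols].mean(), train[feature_cols].std()
train.loc[:, feature_cols] = (train[feature_cols] - mu) / sigma
test.loc[:, feature_cols] = (test[feature_cols] - mu) / sigma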

In addition, when the model is built on the standardized feature values, the parameter fitted for each feature represents that feature's weight, i.e., its importance, as discussed above. We can therefore use the weights to screen for useful features; Lasso regression performs feature selection on exactly this principle (features whose weights shrink toward zero are dropped). Conversely, if we know in advance that some parts of the data matter more than others, we can give them larger weights before fitting; this is the idea behind locally weighted linear regression.

Optimal parameter solution for locally weighted linear regression: θ = (XᵀWX)⁻¹XᵀWy, where W is the weight matrix that assigns each training sample its weight.
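For reference, a minimal sketch of that formula in code; the Gaussian kernel used to build the sample weights and the bandwidth parameter tau are assumptions made only for illustration, and X and y are taken to be numpy arrays (with a bias column if an intercept is wanted):

def lwlr_predict(x_query, X, y, tau=1.0):
    # Locally weighted linear regression for one query point:
    # theta = (X^T W X)^{-1} X^T W y, with W a diagonal matrix of sample weights.
    m = X.shape[0]
    W = np.eye(m)
    for i in range(m):
        diff = X[i] - x_query
        W[i, i] = np.exp(-diff.dot(diff) / (2 * tau ** 2))   # nearer samples get larger weights
    theta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y         # pinv guards against a singular matrix
    return x_query @ theta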

We will not expand on that any further here; first, let's build an ordinary least squares linear model with the statsmodels package.

# statsmodels package, ordinary least squares
stats_model = smf.ols('sales~ TV + radio + newspaper',data = train).fit()
print(stats_model.summary())
OLS Regression Results :                           
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.907
Model:                            OLS   Adj. R-squared:                  0.905
Method:                 Least Squares   F-statistic:                     505.4
Date:                Wed, 19 Jun 2019   Prob (F-statistic):           4.23e-80
Time:                        22:41:19   Log-Likelihood:                -297.29
No. Observations:                 160   AIC:                             602.6
Df Residuals:                     156   BIC:                             614.9
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     14.2175      0.124    114.463      0.000      13.972      14.463
TV             3.7877      0.125     30.212      0.000       3.540       4.035
radio          2.8956      0.132     21.994      0.000       2.636       3.156
newspaper     -0.0596      0.132     -0.451      0.653      -0.321       0.202
==============================================================================
Omnibus:                       13.557   Durbin-Watson:                   2.038
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               15.174
Skew:                          -0.754   Prob(JB):                     0.000507
Kurtosis:                       2.990   Cond. No.                         1.42
==============================================================================

As mentioned in the previous post, the F-test judges whether the model as a whole is useful, the t-test judges whether each parameter is significant, and R-squared measures how much of the variation in the dependent variable the independent variables explain. From the summary above, stats_model has an R-squared of 0.907, indicating a good fit, and the model passes the F-test; however, every parameter except newspaper passes the t-test (its p-value is 0.653), so we drop the newspaper feature and rebuild the model.
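These quantities can also be read off the fitted results object directly instead of from the printed summary; a small sketch using attributes of the statsmodels results object:

print(stats_model.rsquared)    # R-squared of the model
print(stats_model.f_pvalue)    # p-value of the overall F-test
print(stats_model.pvalues)     # per-coefficient p-values for the t-tests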

stats_model1 = smf.ols('sales~ TV + radio',data = train).fit()
print(stats_model1.summary())
OLS Regression Results :                           
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.907
Model:                            OLS   Adj. R-squared:                  0.905
Method:                 Least Squares   F-statistic:                     761.9
Date:                Wed, 19 Jun 2019   Prob (F-statistic):           1.50e-81
Time:                        22:41:35   Log-Likelihood:                -297.40
No. Observations:                 160   AIC:                             600.8
Df Residuals:                     157   BIC:                             610.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     14.2175      0.124    114.754      0.000      13.973      14.462
TV             3.7820      0.124     30.401      0.000       3.536       4.028
radio          2.8766      0.124     23.123      0.000       2.631       3.122
==============================================================================
Omnibus:                       13.633   Durbin-Watson:                   2.040
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               15.256
Skew:                          -0.756   Prob(JB):                     0.000487
Kurtosis:                       3.000   Cond. No.                         1.05
==============================================================================

stats_model1 passes all the tests, so its expression can be written as sales = 3.7820·TV + 2.8766·radio + 14.2175. Next, let's build the model with sklearn's linear_model.

# Data preprocessing: keep only TV and radio, then z-score the features
x = data.iloc[:,:-2]
y = data.iloc[:,-1:]
x = (x-x.mean())/x.std()
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,
                                      shuffle = True,random_state = 0)

# Least squares with sklearn
lr = LinearRegression()
lr.fit(x_train,y_train)
result_lr = lr.predict(x_test)
print('r2_score:{}'.format(r2_score(y_test,result_lr)))  # R-squared
print('coef:{}'.format(lr.coef_))
print('intercept:{}'.format(lr.intercept_))
r2_score:0.8604541663186569
coef:[[3.82192087 2.89820718]]
intercept:[14.03854759]

The model built with sklearn has the expression sales = 3.8219·TV + 2.8982·radio + 14.0385.
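Because the features were z-scored, these coefficients are in standard-deviation units. If you want the equation back in the original TV/radio units, the scaling can be undone; a small sketch, reusing the full-data mean and std applied in the preprocessing step above:

# Convert coefficients for standardized features back to the original units:
# y = b0 + sum(w_j * (x_j - mu_j) / sigma_j)
#   = (b0 - sum(w_j * mu_j / sigma_j)) + sum((w_j / sigma_j) * x_j)
feat_mean = data[['TV', 'radio']].mean()
feat_std = data[['TV', 'radio']].std()
coef_std = lr.coef_.ravel()
coef_orig = coef_std / feat_std.values
intercept_orig = lr.intercept_[0] - (coef_std * feat_mean.values / feat_std.values).sum()
print(coef_orig, intercept_orig)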

Next we solve for the optimal parameters with the closed-form least squares formula, θ = (XᵀX)⁻¹Xᵀy, by writing the least squares method as our own function.

Because matrix operations are involved, the data set is first converted to matrix form.

# Handwritten least squares
def ols_linear_model(x_train,x_test,y_train,y_test):
    # set x0 = 1 so the intercept is estimated together with the other weights
    # (note: insert() modifies x_train/x_test in place, so the bias column is
    # still there when the gradient descent functions below reuse them)
    x_train.insert(0,'b',[1]*len(x_train))
    x_test.insert(0,'b',[1]*len(x_test))
    x_train = np.matrix(x_train)
    y_train = np.matrix(y_train)
    x_test = np.matrix(x_test)

    # matrix inversion is needed below, so first check that X^T X is invertible
    if np.linalg.det(x_train.T*x_train) == 0:
        print('Singular matrix, not invertible')
    else:
        # optimal parameters: theta = (X^T X)^{-1} X^T y
        weights = np.linalg.inv(x_train.T*x_train)*x_train.T*y_train
        # prediction
        y_predict = x_test*weights
        print('r2_score:{}'.format(r2_score(y_test,y_predict)))
        print('coef:{}'.format(weights[1:]))
        # since x0 = 1, the first parameter is the intercept
        print('intercept:{}'.format(weights[0]))

# Results
ols_linear_model(x_train,x_test,y_train,y_test)
r2_score:0.860454166318657
coef:[[3.82192087]
 [2.89820718]]
intercept:[[14.03854759]]

The handwritten least squares model's expression is sales = 3.8219·TV + 2.8982·radio + 14.0385. Writing least squares by hand also exposes its weakness: when XᵀX is not invertible, that is, when X is not of full rank, the closed-form formula has no solution and the optimal parameters cannot be obtained this way.
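As an aside, numpy offers a practical workaround when XᵀX is singular: the pseudo-inverse still returns a (minimum-norm) least squares solution. A one-line sketch, where X and y stand for the training matrix (with the bias column) and the targets:

weights = np.linalg.pinv(X.T @ X) @ X.T @ y   # works even when X^T X is singular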

The more general remedy is an optimization algorithm: gradient descent. Depending on how much of the data each parameter update uses, gradient descent comes in three flavors:

  • Batch gradient descent: each update uses the entire training set (all samples)
  • Stochastic gradient descent: each update uses a single training sample (one)
  • Mini-batch gradient descent: each update uses a subset of the training set (one < num < all)
# Batch gradient descent:
def gradient_desc(x_train, y_train,x_test,alpha, max_itor):
    # x_train/x_test already contain the bias column inserted by ols_linear_model above
    x_train = np.array(x_train)
    x_test = np.array(x_test)
    y_train = np.array(y_train).flatten()
    theta = np.zeros(x_train.shape[1])
    episilon = 1e-8
    iter_count = 0
    loss = 10

    # stop when the loss drops below the threshold or the iteration limit is reached
    while loss > episilon and iter_count < max_itor:
        loss = 0
        iter_count+=1
        # gradient (uses the whole training set)
        gradient = x_train.T.dot(x_train.dot(theta) - y_train)/ len(y_train)
        theta = theta - alpha * gradient
        # loss function
        loss = np.sum((y_train - np.dot(x_train, theta))**2) / (2*len(y_train))

    y_predict = x_test.dot(theta)
    print('r2_score:{}'.format(r2_score(y_test,y_predict)))   # y_test comes from the enclosing scope
    print('coef:{}'.format(theta[1:]))
    print('intercept:{}'.format(theta[0]))

# Results
gradient_desc(x_train, y_train,x_test,alpha=0.001, max_itor=10000)
r2_score:0.8604634817515153
coef:[3.82203058 2.8981221 ]
intercept:14.037935836020237

The resulting model's expression is sales = 3.8220·TV + 2.8981·radio + 14.0379. Writing batch gradient descent by hand makes its drawback clear: every parameter update needs the entire training set, so once the training set is large the computation takes a very long time. This motivates an optimized variant: stochastic gradient descent.

# Stochastic gradient descent:
def s_gradient_desc(x_train, y_train,x_test,alpha, max_itor):
    x_train = np.array(x_train)
    x_test = np.array(x_test)
    y_train = np.array(y_train).flatten()
    theta = np.zeros(x_train.shape[1])
    episilon = 1e-8
    iter_count = 0
    loss = 10

    # stop when the loss drops below the threshold or the iteration limit is reached
    while loss > episilon and iter_count < max_itor:
        loss = 0
        iter_count+=1
        rand_i = np.random.randint(len(x_train))
        # gradient (uses a single randomly chosen training sample)
        gradient = x_train[rand_i].T.dot(x_train[rand_i].dot(theta) - y_train[rand_i])
        theta = theta - alpha * gradient
        # loss function (still evaluated on the whole training set)
        loss = np.sum((y_train - np.dot(x_train, theta))**2) / (2*len(y_train))

    y_predict = x_test.dot(theta)
    print('r2_score:{}'.format(r2_score(y_test,y_predict)))
    print('coef:{}'.format(theta[1:]))
    print('intercept:{}'.format(theta[0]))
    print('iter_count:{}'.format(iter_count))

# Results
s_gradient_desc(x_train, y_train,x_test,alpha=0.001, max_itor=10000)
r2_score:0.8607601654222723
coef:[3.83573278 2.90238477]
intercept:14.036801544903055

The resulting model's expression is sales = 3.8357·TV + 2.9023·radio + 14.0368. Building stochastic gradient descent shows its own drawback: each parameter update uses only a single training sample, so the updates are noisy and the final solution may be less precise. Combining the strengths of batch and stochastic gradient descent gives mini-batch gradient descent.

# Mini-batch gradient descent:
def sb_gradient_desc(x_train, y_train,x_test,alpha,num,max_itor):
    x_train = np.array(x_train)
    x_test = np.array(x_test)
    y_train = np.array(y_train).flatten()
    theta = np.zeros(x_train.shape[1])
    episilon = 1e-8
    iter_count = 0
    loss = 10

    # stop when the loss drops below the threshold or the iteration limit is reached
    while loss > episilon and iter_count < max_itor:
        loss = 0
        iter_count+=1
        rand_i = np.random.randint(0,len(x_train),num)
        # gradient (uses a random mini-batch of num training samples)
        gradient = x_train[rand_i].T.dot(x_train[rand_i].dot(theta) - y_train[rand_i])/num
        theta = theta - alpha * gradient
        # loss function (still evaluated on the whole training set)
        loss = np.sum((y_train - np.dot(x_train, theta))**2) / (2*len(y_train))

    y_predict = x_test.dot(theta)
    print('r2_score:{}'.format(r2_score(y_test,y_predict)))
    print('coef:{}'.format(theta[1:]))
    print('intercept:{}'.format(theta[0]))
    print('iter_count:{}'.format(iter_count))

# Results
sb_gradient_desc(x_train, y_train,x_test,alpha=0.001,num=20,max_itor=10000)
r2_score:0.860623250516056
coef:[3.82871666 2.89894667]
intercept:14.042705519319549

The resulting model's expression is sales = 3.8287·TV + 2.8989·radio + 14.0427.

In conclusion:

  • statsmodels: sales = 3.7820·TV + 2.8766·radio + 14.2175
  • sklearn: sales = 3.8219·TV + 2.8982·radio + 14.0385
  • Handwritten least squares: sales = 3.8219·TV + 2.8982·radio + 14.0385
  • Batch gradient descent: sales = 3.8220·TV + 2.8981·radio + 14.0379
  • Stochastic gradient descent: sales = 3.8357·TV + 2.9023·radio + 14.0368
  • Mini-batch gradient descent: sales = 3.8287·TV + 2.8989·radio + 14.0427
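
As a quick usage check, the fitted sklearn model can be used to predict sales for a new ad budget; the TV and radio figures below are made up for illustration, and the input must be z-scored the same way as the training features:

# Hypothetical campaign: TV = 100, radio = 25 (made-up values)
new = pd.DataFrame({'TV': [100.0], 'radio': [25.0]})
new_std = (new - data[['TV', 'radio']].mean()) / data[['TV', 'radio']].std()
print(lr.predict(new_std))   # predicted sales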

end

Origin blog.csdn.net/d345389812/article/details/93206773