Python univariate linear regression

In the graph above, the x-axis represents the pizza diameter and the y-axis represents the pizza price. You can see that the price of a pizza is positively related to its diameter, which matches everyday experience: naturally, the bigger the pizza, the more expensive it is. Let's use scikit-learn to build the model.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
from sklearn.linear_model import LinearRegression

# Font for displaying non-ASCII labels in the figure
font = FontProperties(fname=r"C:\Windows\Fonts\myinghwl.ttf")

def runplt():
    # Set up the axes for the diameter-vs-price scatter plot
    plt.figure()
    plt.title('LINE1', fontproperties=font)
    plt.xlabel('LINE', fontproperties=font)
    plt.ylabel('PRICES', fontproperties=font)
    plt.axis([0, 25, 0, 25])
    plt.grid(True)
    return plt

plt = runplt()
X = [[6], [8], [10], [14], [18]]    # pizza diameters (inches)
y = [[7], [9], [13], [17.5], [18]]  # pizza prices (dollars)
plt.plot(X, y, '.k')
plt.show()

model = LinearRegression()  # create the estimator
model.fit(X, y)             # fit the model to the training data
# predict() expects a 2D array of samples
print('12" pizza price: $%.2f' % model.predict([[12]])[0][0])

PS: I wanted to display Chinese text in the figure via fontproperties, but it didn't work and I couldn't figure out why, so I just replaced the Chinese labels with English ones.

Univariate linear regression assumes a linear relationship between the explanatory variable and the response variable; the space defined by this linear model is a hyperplane. A hyperplane is a subspace of co-dimension one in n-dimensional Euclidean space, such as a line in a plane or a plane in three-dimensional space; it always has one dimension fewer than the space that contains it. In univariate linear regression, one dimension is the response variable and the other is the explanatory variable, for a total of two dimensions, so the hyperplane has only one dimension: it is a line.

The sklearn.linear_model.LinearRegression class in the above code is an estimator. Estimators predict outcomes based on observations. In scikit-learn, all estimators have fit() and predict() methods.

fit() learns the model parameters from the training data, and predict() uses those learned parameters to predict the response for new values of the explanatory variable.

Because all estimators share these two methods, it is easy to experiment with different models in scikit-learn. The fit() method of the LinearRegression class learns the following univariate linear regression model: y = α + βx

Here y is the predicted value of the response variable, in this case the predicted pizza price, and x is the explanatory variable, in this case the pizza diameter. The intercept α and the coefficient β are what a linear regression model is chiefly concerned with estimating. With this model you can compute the price for any diameter, e.g. about $9.78 for 8 inches and about $21.49 for 20 inches.
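As a quick check, here is a minimal sketch (reusing the model object fitted in the code above) that prints those two predictions:

# Predicted prices for a couple of diameters, using the fitted model
for diameter in (8, 20):
    price = model.predict([[diameter]])[0][0]
    print('%d" pizza predicted price: $%.2f' % (diameter, price))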

The usual method for estimating the parameters of a linear regression model is ordinary least squares, also called linear least squares. First we define a cost function that measures the quality of a fit, and then we find the parameters that minimize it.

The cost function, also called the loss function, defines the error between the model and the observations. The differences between the prices predicted by the model and the prices in the training set are called residuals, or training errors. Later the model will be evaluated on a test set; the differences between the prices predicted by the model and the prices in the test set are called prediction errors, or test errors.

The residuals of the model are the vertical distances between the training sample points and the fitted regression line, as shown in the following figure:

plt = runplt()
X2 = [[0], [10], [14], [25]]
model = LinearRegression()
model.fit(X, y)
y2 = model.predict(X2)
plt.plot(X, y, '.k')    # training points
plt.plot(X2, y2, 'g-')  # fitted regression line
yr = model.predict(X)   # predictions for the training points
for idx, x in enumerate(X):
    # vertical segment from each observation to its prediction (the residual)
    plt.plot([x[0], x[0]], [y[idx][0], yr[idx][0]], 'r-')
plt.show()

We achieve the best fit by minimizing the sum of squared residuals, which means the values predicted by the model are as close as possible to the training data. The function that evaluates the fit of the model in this way is called the residual sum of squares cost function: SS_res = Σ(yᵢ − f(xᵢ))², the sum of the squared differences between every training observation and the model's prediction, as computed below:

import numpy as np

# Average squared residual on the training data (the quantity being minimized,
# up to a constant factor of n)
print('mean squared residual: %.2f' % np.mean((model.predict(X) - y) ** 2))

Least Squares for Solving Univariate Linear Regression

The parameters are obtained by minimizing the cost function. We first compute the coefficient β. From the frequentist point of view, we need the variance of x and the covariance of x and y.

Variance measures how dispersed a sample is. If all of the sample values are equal, the variance is 0. The smaller the variance, the more concentrated the sample; the larger the variance, the more dispersed it is. The variance is computed as: var(x) = Σ(xᵢ − x̄)² / (n − 1)

# Sample mean of the diameters
xbar = (6 + 8 + 10 + 14 + 18) / 5
# Sample variance (divide by n - 1 for the unbiased estimate)
variance = ((6 - xbar) ** 2 + (8 - xbar) ** 2 + (10 - xbar) ** 2 +
            (14 - xbar) ** 2 + (18 - xbar) ** 2) / 4
print(variance)
# NumPy's var() computes the variance directly; ddof=1 applies Bessel's
# correction, giving the unbiased estimate of the sample variance
print(np.var([6, 8, 10, 14, 18], ddof=1))

Covariance measures how two variables move together. If the two variables tend to move in the same direction, that is, when one is above its expected value the other also tends to be above its expected value, the covariance is positive. If they tend to move in opposite directions, one above its expected value while the other is below, the covariance is negative. If the covariance is 0, the two variables are linearly uncorrelated, but that does not mean they are independent; there may still be some other, nonlinear relationship. The covariance is computed as: cov(x, y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

# Sample mean of the prices
ybar = (7 + 9 + 13 + 17.5 + 18) / 5
# Sample covariance of diameter and price (divide by n - 1)
cov = ((6 - xbar) * (7 - ybar) + (8 - xbar) * (9 - ybar) + (10 - xbar) * (13 - ybar) +
       (14 - xbar) * (17.5 - ybar) + (18 - xbar) * (18 - ybar)) / 4
print(cov)

Now that we have the variance and the covariance, we can compute the coefficient β: β = cov(x, y) / var(x) = 22.65 / 23.2 ≈ 0.9763
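Here is a minimal sketch of this calculation with NumPy (the plain lists of diameters and prices are just the training data restated):

import numpy as np

diameters = [6, 8, 10, 14, 18]
prices = [7, 9, 13, 17.5, 18]

# The off-diagonal entry of the covariance matrix is cov(x, y); both use n - 1
beta = np.cov(diameters, prices)[0][1] / np.var(diameters, ddof=1)
print(beta)  # approximately 0.9763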


Once β is known, α can be computed: α = ȳ − β·x̄

Substituting the values computed earlier into the formula gives α:

α = 12.9 − 0.9762931034482758 × 11.2 = 1.9655172413793114
This is how the model parameters are obtained by minimizing the cost function. Substituting a pizza diameter into the equation gives the corresponding price: an 11-inch pizza is about $12.70 and an 18-inch pizza is about $19.54.
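As a small check, here is a sketch that plugs the α and β values just derived into the equation by hand:

# Manual predictions with the least-squares parameters
alpha = 1.9655172413793114
beta = 0.9762931034482758

for diameter in (11, 18):
    price = alpha + beta * diameter
    print('%d" pizza: $%.2f' % (diameter, price))
# prints $12.70 and $19.54, matching the LinearRegression fit above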

Several metrics can be used to evaluate predictive performance; here we use R-squared to evaluate the pizza price predictions. R-squared, also called the coefficient of determination, measures how well the model fits the observed data. There are several ways to compute it. In univariate linear regression, R-squared equals the square of the Pearson product-moment correlation coefficient (Pearson's r).
R-squared computed that way is always a value between 0 and 1. Other ways of computing it, including the one used by scikit-learn, are not based on squaring the Pearson coefficient, so R-squared can be negative when the model fits very poorly. Below, R-squared is computed the way scikit-learn computes it.
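As a quick aside, here is a small sketch (using the training data X, y and the model fitted above) showing that on the training set the square of Pearson's r matches the R-squared reported by LinearRegression:

import numpy as np

# Square of the Pearson correlation between diameter and price (training data)
r = np.corrcoef(np.ravel(X), np.ravel(y))[0][1]
print('Pearson r squared:  %.4f' % r ** 2)

# R-squared of the fitted model evaluated on the same training data
print('training R-squared: %.4f' % model.score(X, y))  # both are about 0.91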

First, compute the total sum of squares for the test set (the test data are listed in the code further below): SS_tot = Σ(yᵢ − ȳ)² = 56.8
Then compute the residual sum of squares, in the same way as before: SS_res = Σ(yᵢ − f(xᵢ))² ≈ 19.2, which gives R² = 1 − SS_res / SS_tot ≈ 1 − 19.2 / 56.8 ≈ 0.6620
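Here is a minimal sketch of that calculation (it uses the same test set that appears in the scikit-learn verification below):

import numpy as np

X_test = [[8], [9], [11], [16], [12]]
y_test = np.array([[11], [8.5], [15], [18], [11]])

ss_tot = np.sum((y_test - y_test.mean()) ** 2)          # total sum of squares, 56.8
ss_res = np.sum((y_test - model.predict(X_test)) ** 2)  # residual sum of squares, about 19.2
print('R squared: %.4f' % (1 - ss_res / ss_tot))        # about 0.6620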
An R-squared of 0.6620 means that more than half of the variance in the test-set prices is explained by the model. Now let's verify it with scikit-learn; the score method of LinearRegression computes R-squared:

# Test set
X_test = [[8], [9], [11], [16], [12]]
y_test = [[11], [8.5], [15], [18], [11]]

model = LinearRegression()
model.fit(X, y)
# score() returns the coefficient of determination, R squared
print(model.score(X_test, y_test))
Reference: https://www.jianshu.com/p/738f6092ef53
