Multiple linear regression analysis

What is multiple linear regression

In regression analysis, a model with two or more independent variables is called multiple regression. In practice, a phenomenon is usually linked to several factors, so predicting or estimating the dependent variable from the best combination of several independent variables is more effective, and more realistic, than using a single independent variable. Multiple linear regression therefore has greater practical significance than linear regression with one variable.

y = b0 + b1*x1 + b2*x2 + ... + bp*xp + e    # the formula
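To make the formula concrete, here is a minimal sketch in NumPy. It reads the formula as an intercept b0 plus a weighted sum of the independent variables x1..xp, and leaves out the error term e because that term is not part of a prediction. The function name `predict` and all the numbers are made up for illustration; they do not come from the dataset.

```python
import numpy as np

def predict(b0, b, x):
    """Evaluate b0 + b1*x1 + ... + bp*xp for one row or many rows of x."""
    x = np.asarray(x, dtype=float)
    b = np.asarray(b, dtype=float)
    return b0 + x @ b

# Hypothetical intercept and coefficients for p = 3 independent variables
b0 = 1.0
b = [2.0, 0.5, -1.0]

# One observation (x1, x2, x3)
single = predict(b0, b, [1.0, 2.0, 3.0])

# Several observations at once, one row per observation
batch = predict(b0, b, [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])

print(single, batch)
```

Writing the weighted sum as a matrix product lets the same function handle a single observation or a whole table of them.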

Today's example

Here is a spreadsheet of advertising data. We want to find out which factor has the most obvious effect on sales: TV, radio, or newspaper. In other words, which factors actually drive sales, and how can sales be increased?


Import the relevant libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')   # use the ggplot plotting style
from sklearn.linear_model import LinearRegression  # linear regression model
from sklearn.model_selection import train_test_split  # splits data into training and test sets
from sklearn.metrics import mean_squared_error  # mean squared error, used to evaluate the model

Open the file

data = pd.read_csv('Advertising.csv')
data.head()  # preview the first rows of the data


First, plot the data to analyze it

plt.scatter(data.TV, data.sales)

plt.scatter(data.radio, data.sales)

plt.scatter(data.newspaper, data.sales)


From the plots, the newspaper points are scattered too widely, which suggests little or no relationship with sales, so newspaper should be removed.
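The visual impression can also be checked with a number. As a rough sketch, the Pearson correlation between each column and sales could be computed as shown here. The data is synthetic (generated with a fixed seed, with sales built from TV and radio only) so that the block runs without the CSV file; with the real data one would pass the actual columns instead. The helper name `corr_with_target` is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for the advertising columns (NOT the real data)
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
newspaper = rng.uniform(0, 100, n)
# By construction, sales depend on TV and radio only, plus noise
sales = 3.0 + 0.045 * tv + 0.19 * radio + rng.normal(0, 1.5, n)

def corr_with_target(feature, target):
    """Pearson correlation coefficient between one feature and the target."""
    return np.corrcoef(feature, target)[0, 1]

correlations = {
    "TV": corr_with_target(tv, sales),
    "radio": corr_with_target(radio, sales),
    "newspaper": corr_with_target(newspaper, sales),
}
for name, r in correlations.items():
    print(f"{name}: {r:.3f}")
```

A correlation near 0 matches a widely scattered plot, while a value closer to 1 matches points that cluster around a rising line.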

Now the code

x = data[['TV','radio','newspaper']]
y = data.sales
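x_train, x_test, y_train, y_test = train_test_split(x, y)  # split into training and test sets
model = LinearRegression()  # create the linear regression model
model.fit(x_train, y_train)  # fit the model on the training data
model.coef_    # slopes (coefficients), three of them
model.intercept_  # intercept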

Output:

array([ 0.04480311,  0.19277245, -0.00301245])
3.0258997429585506
for i in zip(x_train.columns, model.coef_):
    print(i)    # print each feature name with its coefficient
('TV', 0.04480311217789182)
('radio', 0.19277245418149513)
('newspaper', -0.003012450368706149)
mean_squared_error(y_test, model.predict(x_test))  # model quality is measured by the mean of the squared errors
4.330748450267551
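The score above is the mean squared error: the average of the squared differences between the actual and the predicted values. A small sketch with made-up numbers (not from the dataset) shows the calculation in NumPy. The function name `mse` is an assumption for this example; for the same inputs it should agree with `mean_squared_error`.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical actual and predicted values
y_true = [10.0, 12.0, 15.0, 9.0]
y_pred = [11.0, 12.0, 13.0, 10.0]

# Squared errors are 1, 0, 4 and 1, so the mean is 6 / 4
score = mse(y_true, y_pred)
print(score)
```

A lower score means the predictions sit closer to the actual values. Because the errors are squared, a few large misses weigh more than many small ones.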

y = 0.04480311217789182 * x1 + 0.19277245418149513 * x2 - 0.003012450368706149 * x3 + 3.0258997429585506
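As a quick illustration, the fitted equation can be evaluated directly. The coefficients and intercept are copied from the output above, and the mapping x1 = TV, x2 = radio, x3 = newspaper follows the column order used in the fit. The budget values and the function name `predict_sales` are hypothetical.

```python
# Coefficients and intercept copied from the fitted model's output
coef = {
    "TV": 0.04480311217789182,
    "radio": 0.19277245418149513,
    "newspaper": -0.003012450368706149,
}
intercept = 3.0258997429585506

def predict_sales(tv, radio, newspaper):
    """Evaluate the fitted equation for one combination of ad budgets."""
    return (intercept
            + coef["TV"] * tv
            + coef["radio"] * radio
            + coef["newspaper"] * newspaper)

# Hypothetical budgets, in the same units as the dataset columns
predicted = predict_sales(tv=100.0, radio=20.0, newspaper=10.0)

# Change in the prediction when one channel gets 10 more units
tv_gain = predict_sales(110.0, 20.0, 10.0) - predicted
radio_gain = predict_sales(100.0, 30.0, 10.0) - predicted

print(predicted, tv_gain, radio_gain)
```

Each coefficient is the change in predicted sales per extra unit of that channel with the others held fixed, so under this fit an extra unit of radio moves the prediction more than an extra unit of TV.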

The newspaper coefficient is less than 0, which suggests that spending on newspaper advertising does not help sales and, in this fit, even lowers them slightly. To improve the model, remove the newspaper column.

x = data[['TV','radio']]
y = data.sales
x_train,x_test,y_train,y_test = train_test_split(x, y)
model2 = LinearRegression()
model2.fit(x_train,y_train)
model2.coef_
model2.intercept_
mean_squared_error(y_test, model2.predict(x_test))
array([0.04666856, 0.17769367])
3.1183329992288478
2.984535789030915  # smaller than the first model's MSE, which suggests this model is better

y = 0.04666856 * x1 + 0.17769367 * x2 + 3.1183329992288478
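One caveat: `train_test_split` shuffles randomly, so the two models above were scored on different test sets, and the scores will change from run to run. A fairer comparison, sketched here, fixes `random_state` and evaluates both feature sets on the same split. The data is synthetic so that the block runs without the CSV file, and the helper name `holdout_mse` is made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 200

# Synthetic stand-ins for the columns TV, radio, newspaper (NOT the real data)
X = np.column_stack([
    rng.uniform(0, 300, n),   # column 0: TV
    rng.uniform(0, 50, n),    # column 1: radio
    rng.uniform(0, 100, n),   # column 2: newspaper
])
y = 3.0 + 0.045 * X[:, 0] + 0.19 * X[:, 1] + rng.normal(0, 1.5, n)

# Split once, with a fixed seed, so both models see the same rows
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def holdout_mse(columns):
    """Fit on the chosen columns and return the MSE on the shared test set."""
    model = LinearRegression()
    model.fit(X_train[:, columns], y_train)
    return mean_squared_error(y_test, model.predict(X_test[:, columns]))

mse_all = holdout_mse([0, 1, 2])   # TV, radio, newspaper
mse_reduced = holdout_mse([0, 1])  # TV, radio

print(mse_all, mse_reduced)
```

With the split held fixed, any difference between the two scores comes from the choice of features rather than from which rows happened to land in the test set.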

Origin blog.csdn.net/weixin_44510615/article/details/105286185