What is multiple linear regression
In regression analysis, a model with two or more independent variables is called multiple regression. **In practice, a phenomenon is usually linked to several factors, so predicting or estimating the dependent variable from the best combination of several independent variables is more effective, and more realistic, than using a single independent variable.** Multiple linear regression therefore has greater practical significance than simple linear regression.
y = β0 + β1 * x1 + β2 * x2 + ... + βp * xp + e   # the model formula
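As a rough illustration of the formula, the sketch below evaluates it for two rows with NumPy. The intercept, coefficients, and inputs are made-up values chosen so the arithmetic is easy to check by hand; they do not come from any fitted model.

```python
import numpy as np

# Hypothetical intercept and coefficients (illustrative values only)
b0 = 3.0
b = np.array([0.05, 0.2, 0.0])

# Two hypothetical observations, one per row: x1, x2, x3
X = np.array([[100.0, 20.0, 10.0],
              [200.0, 10.0, 50.0]])

# y = b0 + b1*x1 + b2*x2 + b3*x3 for every row (error term e omitted)
y_hat = b0 + X @ b
print(y_hat)  # [12. 15.]
```

The matrix product computes the weighted sum for all rows at once, which is the same calculation a fitted model performs when it predicts.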
Today's example
Here is a spreadsheet of data. We want to find out which factor has the most obvious effect on sales: TV, radio, or newspaper. In other words, which elements actually drive sales, and how can sales be increased?
Import the relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot') # use the ggplot style
from sklearn.linear_model import LinearRegression # linear regression model
from sklearn.model_selection import train_test_split # splits data into training and test sets
from sklearn.metrics import mean_squared_error # mean squared error, used to evaluate the model
Open the file
data = pd.read_csv('Advertising.csv')
data.head() # preview the first rows of data
First, plot the data and analyze it
plt.scatter(data.TV, data.sales)
plt.scatter(data.radio, data.sales)
plt.scatter(data.newspaper, data.sales)
Analysis: the plots show that the newspaper points are spread too widely, which suggests no clear relationship with sales, so newspaper should be removed
Now for the code
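One way to put a number on what the scatter plots show is the Pearson correlation between each column and sales. The sketch below uses small synthetic data (not the Advertising file) so that it runs on its own; the column names `strong` and `weak` are made up for illustration. On the real data, `data.corr()` would give the same kind of figures.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins: 'strong' drives the target, 'weak' is unrelated noise
strong = rng.uniform(0, 300, n)
weak = rng.uniform(0, 100, n)
target = 3.0 + 0.05 * strong + rng.normal(0, 1.0, n)

# Pearson correlation of each feature with the target
r_strong = np.corrcoef(strong, target)[0, 1]
r_weak = np.corrcoef(weak, target)[0, 1]
print(r_strong, r_weak)
```

A correlation near 1 matches a tight, rising band of points, while a value near 0 matches a widely scattered cloud like the one described for newspaper.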
x = data[['TV','radio','newspaper']]
y = data.sales
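x_train,x_test,y_train,y_test = train_test_split(x, y) # split into training and test sets
model = LinearRegression() # create the linear regression model
model.fit(x_train, y_train) # fit the model on the training set
model.coef_ # slopes, one per feature (three here)
model.intercept_ # intercept
Output: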
array([ 0.04480311, 0.19277245, -0.00301245])
3.0258997429585506
for i in zip(x_train.columns, model.coef_):
    print(i) # print each feature name with its coefficient
('TV', 0.04480311217789182)
('radio', 0.19277245418149513)
('newspaper', -0.003012450368706149)
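For intuition about where `coef_` and `intercept_` come from: `LinearRegression` fits by ordinary least squares, which can be sketched with NumPy alone. The data here is synthetic and noise-free, and the "true" parameters are made-up values, so the solver should simply recover them.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(50, 3))

# Made-up "true" parameters used to generate noise-free data
true_coef = np.array([0.05, 0.2, -0.003])
true_intercept = 3.0
y = true_intercept + X @ true_coef

# Prepend a column of ones so the intercept is estimated with the slopes
A = np.column_stack([np.ones(len(X)), X])
params, *_ = np.linalg.lstsq(A, y, rcond=None)

intercept, coef = params[0], params[1:]
print(intercept, coef)
```

With real, noisy data the recovered values are estimates instead of exact matches, which is why the coefficients change a little every time the random train/test split changes.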
mean_squared_error(y_test, model.predict(x_test)) # evaluate the model with the mean squared error on the test set
4.330748450267551
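`mean_squared_error` is the average of the squared differences between the actual and predicted values. A tiny sketch with made-up numbers that can be checked by hand:

```python
import numpy as np

# Made-up actual and predicted values
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_pred = np.array([11.0, 12.0, 13.0, 22.0])

# Mean of the squared differences: (1 + 0 + 4 + 4) / 4
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 2.25
```

Because the differences are squared, the order of the two arguments does not change the result, and large errors are penalized more heavily than small ones.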
y = 0.04480311217789182 * x1 + 0.19277245418149513 * x2 - 0.003012450368706149 * x3 + 3.0258997429585506
We can see that the newspaper coefficient is less than 0 and very close to zero, which suggests that newspaper placement does nothing to raise sales in this model. So how can the model be improved? By removing the newspaper variable
x = data[['TV','radio']]
y = data.sales
x_train,x_test,y_train,y_test = train_test_split(x, y)
model2 = LinearRegression()
model2.fit(x_train,y_train)
model2.coef_
model2.intercept_
mean_squared_error(model2.predict(x_test),y_test)
array([0.04666856, 0.17769367])
3.1183329992288478
2.984535789030915 # smaller than the first model's MSE, which suggests this model is better
y = 0.04666856 * x1 + 0.17769367 * x2 + 3.1183329992288478
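Since `train_test_split` shuffles randomly on each call, the two MSE values in this post come from different splits. A hedged sketch of comparing both feature sets on the same split by fixing `random_state` (42 is an arbitrary choice). The data is a synthetic stand-in whose third column has no effect on `y`, and `score_columns` is a hypothetical helper written for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in for the data: the third column has no effect on y
X = rng.uniform(0, 100, size=(n, 3))
y = 3.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1.5, n)

# Fixing random_state makes the split reproducible, so both models
# are trained and scored on exactly the same rows
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

def score_columns(cols):
    """Fit on the chosen columns and return the test-set MSE."""
    m = LinearRegression().fit(X_train[:, cols], y_train)
    return mean_squared_error(y_test, m.predict(X_test[:, cols]))

mse_all = score_columns([0, 1, 2])
mse_two = score_columns([0, 1])
print(mse_all, mse_two)
```

With a shared split, any difference between the two scores reflects the choice of features and not the luck of the shuffle. Repeating this over several `random_state` values gives a steadier comparison than a single run.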