100 Days of Machine Learning (Day 3): Multiple Linear Regression

1 Analysis

1 Multiple linear regression

  • A multiple linear regression model (a linear equation fitted to observed values) describes the relationship between one outcome and two or more features
  • The implementation process is similar to simple linear regression with a single feature
  • Through analysis, find the features that have the greatest impact on the prediction results, as well as the correlations between the different variables: y = b0 + b1*x1 + b2*x2 + ... + bn*xn
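Numerically the model is just a weighted sum of the features. A small made-up illustration (the coefficient values below are invented, not fitted to any data):

import numpy as np

b = np.array([50.0, 0.8, -1.2])   # hypothetical [b0, b1, b2]
x = np.array([1.0, 100.0, 20.0])  # [1, x1, x2]; the leading 1 multiplies b0
print(b @ x)                      # 50 + 0.8*100 - 1.2*20 = 106.0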

2 Assumptions

  • A successful regression analysis must verify that the following assumptions hold
  1. Linearity: there is a linear relationship between the dependent variable and the independent variables.
  2. Homoskedasticity: the random error term (disturbance term) in the regression function has constant variance conditional on the explanatory variables.
  3. Multivariate normality: the residuals are normally distributed.
  4. No multicollinearity: there is weak or almost no correlation between the independent variables (the dummy variable trap below is one example). A common check is the variance inflation factor (VIF); a sketch follows this list.
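As an aside (an addition, not in the original post), assumption 4 is often checked with the variance inflation factor from statsmodels; a minimal sketch, assuming X is a 2-D NumPy array of numeric predictors:

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_report(X):
    Xc = sm.add_constant(X)  # include an intercept so the VIFs are meaningful
    # One VIF per predictor (skip the constant); VIF > 10 is a common warning sign
    return [variance_inflation_factor(Xc, i) for i in range(1, Xc.shape[1])]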

3 Variable selection

  • Too many variables can make the model inaccurate
  • Some variables have no effect on the outcome but strongly affect other independent variables

1. Forward selection: add features one by one

(1) Choose a significance level, e.g. SL = 0.05 (i.e. a 95% confidence level for a variable's contribution to the result). (2) Build all possible simple regression models and find the variable with the smallest P-value. (3) Add the remaining variable with the smallest P-value to the current model. (4) If that variable's P-value < SL, return to step (3); otherwise stop and keep the model from the previous iteration.

2. Backward elimination: start with all features included, then try deleting each feature in turn and test which deletion most improves the model; delete that feature. Repeat until no deletion improves the model any further (a P-value-based variant is sketched after this list).

3. Stepwise: a combination of the two methods above. Whenever a feature is added, stepwise also tries to delete a feature, repeating until a preset criterion is met. The disadvantage of this method is that the preset criterion is hard to choose, and it easily overfits. See https://onlinecourses.science.psu.edu/stat501/node/329/

4. Bidirectional comparison
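As referenced under backward elimination above, here is a minimal P-value-based sketch using statsmodels (an illustration added here, not code from the original; it assumes X is a numeric NumPy feature matrix without an intercept column and y is the target vector):

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, sl=0.05):
    Xc = sm.add_constant(X)          # prepend an intercept column
    cols = list(range(Xc.shape[1]))
    while True:
        model = sm.OLS(y, Xc[:, cols]).fit()
        worst = int(np.argmax(model.pvalues))
        if model.pvalues[worst] > sl:
            del cols[worst]          # drop the least significant predictor and refit
        else:
            return model, cols       # all remaining P-values are below sl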

4 Dummy variables

  • Dummy variables encode categorical (non-numeric) data as numeric indicator columns so it can enter the model

5 Dummy variable trap

  • If a qualitative factor in the model has m mutually exclusive categories and the model has an intercept term, only m-1 dummy variables can be introduced into the model; otherwise perfect multicollinearity occurs.

Take gender as an example: one dummy variable is enough. When it is 1 it means "male", and when it is 0 it means "not male", i.e. female. If two dummy variables "male" and "female" are both included, there is no semantic problem, but the regression carries one redundant variable, and that redundancy distorts the regression estimates. Generally speaking, the number of dummy variables should be one fewer than the number of categories.
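A small illustration (added here, not from the original) of how pandas drops one category to stay out of the trap:

import pandas as pd

df = pd.DataFrame({'gender': ['male', 'female', 'male']})
# drop_first=True keeps m-1 indicator columns, here a single one
dummies = pd.get_dummies(df['gender'], drop_first=True)
print(dummies)  # one 'male' column: 1 = male, 0 = female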

2 Practical operation

Step 1: Data preprocessing----similar to Day 1: Data preprocessing

But be careful to avoid the dummy variable trap

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 4].values

# One-hot encode the categorical "State" column (index 3).
# OneHotEncoder(categorical_features=...) was removed from scikit-learn;
# ColumnTransformer is the current replacement.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

X = X[:, 1:]  # Avoid the dummy variable trap: drop one dummy column

# Split the dataset (sklearn.cross_validation is now sklearn.model_selection)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

PS: To avoid the dummy variable trap, keep only two (3 - 1) of the three state dummy variables.

Step 2: Fitting Multiple Linear Regression to the Training set----similar to Day 2: Simple linear regression

from sklearn.linear_model import LinearRegression 
regressor = LinearRegression() 
regressor.fit(X_train, Y_train)  # Fit the multiple linear regression model to the training set
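Once fitted, the learned intercept b0 and coefficients b1..bn from the equation in section 1 can be inspected (an optional check, not part of the original steps):

print(regressor.intercept_)  # b0
print(regressor.coef_)       # b1..bn, one per column of X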

Step 3: Predicting the Test set results

y_pred = regressor.predict(X_test)
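The original walkthrough goes straight to plotting; as a quick numeric sanity check (an addition, not in the source), the predictions can be scored with scikit-learn's r2_score:

from sklearn.metrics import r2_score
print(r2_score(Y_test, y_pred))  # coefficient of determination on the test set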

Step 4: Visualization

plt.scatter(np.arange(len(Y_test)), Y_test, color='red', label='y_test')
plt.scatter(np.arange(len(Y_test)), y_pred, color='blue', label='y_pred')
plt.legend(loc=2)
plt.show()

3 Code and data

1 Complete code

2 Required data

References

  1. Variable selection: Backward Elimination, Forward Selection and Stepwise - CSDN Blog
  2. Dummy variables: https://www.moresteam.com/whitepapers/download/dummy-variables.pdf
  3. Dummy variable trap: A Brief Summary of Dummy Variables - Jianshu
  4. Stepwise regression: https://onlinecourses.science.psu.edu/stat501/node/329/
