Machine Learning (4) - Multiple Linear Regression

Multiple linear regression models the relationship between a single dependent variable and multiple independent variables. Here the independent variables are not necessarily numerical before processing. The model is:

y = b0 + b1*x1 + b2*x2 + ... + bn*xn

What we have to do is find the best b0, b1, ..., bn.
Here is data about 50 companies:

[table: spend1, spend2, spend3, state, profit for 50 companies]
spend1, spend2, and spend3 represent the company's spending in three areas, state is the company's address, and profit is the company's income last year. We want to predict performance: use the data in the first four columns to predict profit.
But every term in y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4 must be numerical, while state is categorical. So we split state into three columns, one per address, and convert them into dummy variables: a row's Beijing column is 1 if the company is in Beijing and 0 otherwise, and likewise for Shanghai. The hangzhou column can be dropped entirely, because once the other two columns are known the third is determined.
In the end, the single column containing three addresses becomes two columns containing only 0 or 1.
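The dummy-variable conversion above can be sketched with `pandas.get_dummies`; the city names here are just the illustrative ones from the text, and `drop_first=True` removes the redundant first column:

```python
import pandas as pd

# Hypothetical state column with three cities
df = pd.DataFrame({"state": ["Beijing", "Shanghai", "hangzhou", "Beijing"]})

# drop_first=True keeps only two dummy columns; the dropped one is implied
# by the other two (this avoids the dummy variable trap)
dummies = pd.get_dummies(df["state"], drop_first=True)
print(dummies.shape)
```

One column of three categories becomes two columns of 0/1 values, exactly as described above.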


On top of the data processing from the previous part, we add the split into training set and test set:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

dataset = pd.read_csv('COM.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

# One-hot encode the state column (column 3); recent scikit-learn versions
# use ColumnTransformer instead of OneHotEncoder(categorical_features=[3])
ct = ColumnTransformer([('state', OneHotEncoder(sparse_output=False), [3])],
                       remainder='passthrough')
X = ct.fit_transform(X)

float_formatter = lambda v: "%.2f" % v
np.set_printoptions(formatter={'float_kind': float_formatter})

X = X[:, 1:]  # drop the first dummy column (dummy variable trap)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Fit the regressor:

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()  # regressor object
regressor.fit(X_train, y_train)  # fit on the training set

# Predicting the Test set results
y_pred = regressor.predict(X_test)  # vector of predictions
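To see how good the predictions are, we can compare `y_pred` against `y_test` with the R² score. A minimal self-contained sketch, using synthetic data as a stand-in since the COM.csv dataset is not included here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the company data: a known linear relation plus noise
rng = np.random.default_rng(0)
X = rng.random((50, 4))
y = X @ np.array([3.0, 1.5, 0.2, 0.8]) + 5.0 + rng.normal(0, 0.01, 50)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# R^2 close to 1 means the fitted plane explains most of the variance
r2 = r2_score(y_test, y_pred)
print(round(r2, 3))
```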

This part is the same as simple linear regression; see: simple linear regression


Backward elimination: select the important indicators and the ones that can be eliminated.
First add the independent variable corresponding to b0, that is, a column of all 1s:

X_train = np.append(arr = np.ones((40, 1)).astype(int), values = X_train, axis = 1)  # prepend a column of ones for the intercept b0

Backward elimination is generally divided into five steps:

  1. Choose a significance level, usually 0.05. A column whose p-value is greater than this level is a candidate for deletion; otherwise its impact on the target cannot be ignored.
  2. Fit the model with all columns of the training data.
  3. Pick the column with the largest p-value.
  4. If the p-value from step 3 is greater than the significance level, delete that column.
  5. Refit on the remaining columns and repeat from step 2, until the largest p-value is below the significance level.
import statsmodels.api as sm

Ximo = X_train
regressor_OLS = sm.OLS(endog = y_train, exog = Ximo).fit()
# endog: dependent variable, exog: independent variables
print(regressor_OLS.summary())

[figure: OLS regression summary output]
The first summary shows that X2 has the largest p-value, and it is greater than the significance level, so that column is deleted.
Then repeat the same judgment on the remaining columns. …


Source code and data set download address: Download RAR

If you don't have download points, you can ask me for it for free: leave your email address in the comments, and when I see it I will pack it up and send it~
