First, my own understanding of multiple linear regression.
Equation:
y = p0 + p1*x1 + p2*x2 + p3*x3 + ... + e
y is the true result, p0 is a constant term, and e is the error. p1, p2, p3, and so on are the regression coefficients that sklearn estimates from the training data, and x1, x2, x3, and so on are the features in each feature vector of our training set.
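To make the equation concrete, here is a minimal sketch on made-up data. The values of `p0` and `p` are invented for illustration and have nothing to do with the admission data. It builds y from known coefficients, fits sklearn's `LinearRegression`, and shows that `intercept_` plays the role of p0 while `coef_` holds p1, p2, p3.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up coefficients for illustration only (not from the admission data)
p0 = 0.5
p = np.array([2.0, -1.0, 0.3])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # 200 samples with features x1, x2, x3
e = rng.normal(scale=0.01, size=200)   # small error term
y = p0 + X @ p + e                     # y = p0 + p1*x1 + p2*x2 + p3*x3 + e

model = LinearRegression().fit(X, y)
print(model.intercept_)   # estimate of p0
print(model.coef_)        # estimates of p1, p2, p3

# A prediction is the same formula evaluated with the estimated values
manual = model.intercept_ + X @ model.coef_
```

The estimates should land close to the made-up values, and `manual` matches what `model.predict(X)` returns.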
This time I used the admission probability prediction dataset from Kaggle.
To find it, go to Kaggle and search for "admission":
https://www.kaggle.com/datasets
It looks like this:
Chance of Admit is the label we ultimately want to predict.
The idea is very simple, so on to the code~
(Oh, I ran this in Jupyter.)
One: Data exploration
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    csv_data = pd.read_csv('./data/Admission_Predict.csv')  # read the CSV file

    # Basic overview of the table: number of rows and columns, data type of each
    # column, and completeness. Every column shows 500 non-null rows, so there
    # are no missing values. (info() prints its report itself.)
    csv_data.info()
    print(csv_data.describe())  # count, mean, standard deviation, and other statistics
    print(csv_data.head())      # see what the data looks like

    csv_data.drop('Serial No.', axis=1, inplace=True)  # drop the ID column, which is of little use

    # Normalize the data, simply by dividing each column by its maximum possible value
    csv_data['GRE Score'] = csv_data['GRE Score'] / 340
    csv_data['TOEFL Score'] = csv_data['TOEFL Score'] / 120
    csv_data['University Rating'] = csv_data['University Rating'] / 5
    csv_data['SOP'] = csv_data['SOP'] / 5
    csv_data['LOR '] = csv_data['LOR '] / 5
    csv_data['CGPA'] = csv_data['CGPA'] / 10
Output:
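The per-column divisions above can also be written as one loop over a dictionary of maximum values. This is only a sketch on a tiny made-up table: the rows are invented, and `max_values` is a name I introduce here. The divisors 340, 120 and 10 are the ones used above.

```python
import pandas as pd

# Tiny made-up table that reuses three of the column names from above
df = pd.DataFrame({
    'GRE Score': [330, 320, 310],
    'TOEFL Score': [115, 108, 102],
    'CGPA': [9.0, 8.5, 8.0],
})

# Maximum possible value of each column (same divisors as in the code above)
max_values = {'GRE Score': 340, 'TOEFL Score': 120, 'CGPA': 10}

# Divide every listed column by its maximum so the values fall in [0, 1]
for column, max_value in max_values.items():
    df[column] = df[column] / max_value

print(df)
```

Keeping the divisors in one dictionary makes it harder to mistype a column name on one side of the assignment only.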
Two: Simple visualization
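    import seaborn as sns

    print(csv_data.columns)
    # Scatter plot with a fitted regression line.
    # Recent seaborn versions require x and y to be passed as keyword arguments.
    sns.regplot(x='GRE Score', y='Chance of Admit ', data=csv_data)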
Show all features:
    sns.pairplot(csv_data, diag_kind='kde', plot_kws={'alpha': 0.2})
You can see from the charts that the data does show a bit of a regression-like (roughly linear) trend.
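If you would rather have a number than eyeball the plots, one option is the Pearson correlation between each feature and the label. The sketch below uses synthetic stand-in data, and the column names `feature` and `label` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data: one feature with a roughly linear effect on the label
feature = rng.uniform(0.8, 1.0, size=100)
label = 2.0 * feature - 1.0 + rng.normal(scale=0.02, size=100)
df = pd.DataFrame({'feature': feature, 'label': label})

# Pearson correlation of every column with the label;
# values near 1 or -1 suggest a strong linear relationship
corr = df.corr()['label']
print(corr)
```

On the real table the equivalent call would presumably be `csv_data.corr()['Chance of Admit ']`, assuming the label column keeps the trailing space used in the code above.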
Three: Model building
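    from sklearn import linear_model

    # Feature selection
    features = ['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ', 'CGPA', 'Research']

    # Training set: the first 420 rows; the label is the last column of the table.
    # Note: [:, :-1] drops the last listed feature ('Research'), so six features are used.
    X = csv_data[features].iloc[:420, :-1]
    Y = csv_data.iloc[:420, -1]

    # Test set: the remaining rows
    X_test = csv_data[features].iloc[420:, :-1]
    Y_test = csv_data.iloc[420:, -1]

    regr = linear_model.LinearRegression()  # build the linear regression model
    regr.fit(X, Y)                          # train the model
    print(regr.predict(X_test))             # predictions
    print(list(Y_test))                     # true values
    print(regr.score(X_test, Y_test))       # R^2 score on the test set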
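Result:
Hey, it reached a score of 88% (strictly speaking, score() returns R^2 rather than an accuracy). It works, I'm happy~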
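For a regressor, `score()` returns the coefficient of determination R^2 = 1 - SS_res / SS_tot. As a sanity check, here is a small sketch that computes it by hand and compares it with `score()`. The data is synthetic and the 80/20 split is arbitrary, so the number it prints says nothing about the admission data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the admission data (all values are made up)
X = rng.uniform(size=(100, 2))
y = 0.2 + X @ np.array([0.5, 0.3]) + rng.normal(scale=0.05, size=100)

# Simple ordered split, in the same spirit as the iloc split above
X_train, y_train = X[:80], y[:80]
X_test, y_test = X[80:], y[80:]

regr = LinearRegression().fit(X_train, y_train)
y_pred = regr.predict(X_test)

ss_res = np.sum((y_test - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, regr.score(X_test, y_test))     # the two values agree
```

A value near 1 means the model explains most of the variance in the test labels. A model that always predicts the mean of the test labels gets 0.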
The End~