A super simple multiple linear regression application

First, my own understanding of multiple linear regression:

Equation:

y = p0 + p1*x1 + p2*x2 + p3*x3 + ... + e

Here y is the target value, p0 is the constant term (intercept), and e is the error. p1, p2, p3 and so on are the regression coefficients that sklearn should learn from the training data, and x1, x2, x3, etc. are the features in our training set.
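To make the equation concrete, here is a tiny sanity check with made-up numbers (these coefficients and features are hypothetical, not from the admission dataset):

```python
import numpy as np

p0 = 0.5                          # constant term (intercept)
p = np.array([2.0, -1.0, 0.3])    # regression coefficients p1, p2, p3
x = np.array([1.0, 2.0, 3.0])     # feature vector x1, x2, x3

# y = p0 + p1*x1 + p2*x2 + p3*x3 (ignoring the error term e)
y = p0 + np.dot(p, x)
print(y)  # 0.5 + 2.0 - 2.0 + 0.9 = 1.4
```

Training a linear regression model is exactly the job of finding p0 and the p's that make this y match the data as closely as possible.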

The dataset I used this time is the graduate admission probability dataset from Kaggle; search for "admission" on Kaggle to find it:

https://www.kaggle.com/datasets

It looks like this:

 

Chance of Admit is the label we will ultimately predict.

The idea is very simple; here is the code ~
(Oh, I ran this in Jupyter.)

One: Data exploration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


csv_data = pd.read_csv('./data/Admission_Predict.csv')
# read the csv file
print(csv_data.info())
# basic shape of the table: number of rows, number of columns, dtype of each
# column, completeness. Every column has 500 non-null rows, so there are no
# missing values.
print(csv_data.describe())
# summary statistics: count, mean, standard deviation, etc.
print(csv_data.head())
# a first look at what the data ~
csv_data.drop('Serial No.', axis=1, inplace=True)
# drop the ID-like column, which carries no useful information

# normalize the data, simply by dividing each column by its maximum value...
csv_data['GRE Score'] = csv_data['GRE Score']/340
csv_data['TOEFL Score'] = csv_data['TOEFL Score']/120
csv_data['University Rating'] = csv_data['University Rating']/5
csv_data['SOP'] = csv_data['SOP']/5
csv_data['LOR '] = csv_data['LOR ']/5
csv_data['CGPA'] = csv_data['CGPA']/10
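Dividing by the column maximum works here because all these scores are non-negative, but it only bounds the upper end of the range. A common alternative is sklearn's MinMaxScaler, which maps each column onto [0, 1] exactly. A minimal sketch on a toy frame (hypothetical values standing in for the admission columns):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# toy stand-in for two of the admission columns
df = pd.DataFrame({'GRE Score': [300, 320, 340],
                   'TOEFL Score': [100, 110, 120]})

# MinMaxScaler computes (x - min) / (max - min) per column,
# so every column ends up spanning exactly [0, 1]
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
print(scaled)
```

Either way, scaling puts the features on comparable ranges, which also makes the learned coefficients easier to compare.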

 

Output:

 

Two: Simple visualization

import seaborn as sns


print(csv_data.columns)
sns.regplot(x='GRE Score', y='Chance of Admit ', data=csv_data)
# note the trailing space in 'Chance of Admit ' -- that is the actual column name in this dataset

 

 

Show all features:
sns.pairplot(csv_data, diag_kind='kde', plot_kws={'alpha': 0.2})

 

 

You can see from the charts that the features do look roughly linearly related to the label ~

Three: Model building
from sklearn import linear_model


features = ['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ', 'CGPA', 'Research']
# feature selection
X = csv_data[features].iloc[:420, :-1]
Y = csv_data.iloc[:420, -1]
# training set: first 420 rows (the :-1 also drops the last feature, 'Research')
X_test = csv_data[features].iloc[420:, :-1]
Y_test = csv_data.iloc[420:, -1]
# test set: the remaining rows

regr = linear_model.LinearRegression()
# build the linear regression model
regr.fit(X, Y)
# train the model
print(regr.predict(X_test))  # predictions
print(list(Y_test))  # ground truth
print(regr.score(X_test, Y_test))  # score
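Slicing the first 420 rows as the training set works, but a random split via sklearn's train_test_split is the more usual pattern, and after fitting you can inspect the learned p0 and p's directly. A sketch on synthetic data with known coefficients (all names and numbers here are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# synthetic data with known coefficients: y = 1.0 + 2.0*x1 + 3.0*x2, no noise
rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# random 80/20 split instead of slicing by row position
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

regr = LinearRegression().fit(X_train, y_train)
print(regr.intercept_)             # recovers p0, about 1.0
print(regr.coef_)                  # recovers [p1, p2], about [2.0, 3.0]
print(regr.score(X_test, y_test))  # near-perfect on noiseless data
```

On the admission data the coefficients will not be recovered this cleanly, but printing intercept_ and coef_ is still a quick way to see which features the model weights most.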

Result:

Hey, a score of 0.88! It works, happy ~
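One caveat: for LinearRegression, score() returns the R² coefficient of determination, not a classification accuracy, so "0.88" means the model explains about 88% of the variance in the labels. A quick check with toy numbers (hypothetical, not the admission predictions):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([0.7, 0.8, 0.9, 0.6])
y_pred = np.array([0.72, 0.78, 0.88, 0.65])

# R² = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
manual = 1 - ss_res / ss_tot
print(manual, r2_score(y_true, y_pred))  # the two values match
```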

The End~




Origin www.cnblogs.com/byadmin/p/11613421.html