Advanced Statistical Methods - Linear regression

  1. The advantages of scikit-learn:
    [1] Incredible documentation
    [2] Variety (regression, classification, clustering, support vector machines, dimensionality reduction).
  2. Numerical stability: the basic idea is that training an algorithm involves performing complicated mathematical operations in the background; when the numbers you are dealing with are too small or too big, your code may break.
  3. Standardization: the process of subtracting the mean and dividing by the standard deviation. Standardization is a type of normalization, and sometimes the two terms are used interchangeably. (A short sketch illustrating points 2 and 3 appears right after the first example's output below.)
  4. Below is the Python code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()  #default style

from sklearn.linear_model import LinearRegression

data = pd.read_csv('1.01. Simple linear regression.csv')

x = data['SAT']
y = data['GPA']

x_matrix = x.values.reshape(-1,1)  # reshape the 1-D array into a 2-D array (sklearn expects a 2-D feature matrix)

reg = LinearRegression()  # reg is an instance of the linear regression class
reg.fit(x_matrix,y)  # inputs first, then target

print(reg.score(x_matrix,y))  # R-squared

print(reg.coef_)  # coefficients

print(reg.intercept_)  # intercept

new_data = pd.DataFrame(data=[1740,1760], columns=['SAT'])
print(new_data)
print(reg.predict(new_data))
new_data['Predict_GPA'] = reg.predict(new_data)
print(new_data)

plt.scatter(x,y)
yhat = reg.coef_*x_matrix + reg.intercept_
fig = plt.plot(x,yhat,lw=4, c='orange', label='regression line')
plt.xlabel('SAT')
plt.ylabel('GPA')
plt.show()

Below are the results (shown as images in the original post: the printed output, followed by the SAT/GPA scatter plot with the orange regression line).
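Points 2 and 3 above can be made concrete with a tiny self-contained sketch (the numbers here are made up for illustration and are not part of the course data):

import numpy as np

x = np.array([100000.0, 200000.0, 300000.0])  # a hypothetical feature on a large scale

# Raw values this big overflow an exponential (an operation many algorithms
# perform internally), producing inf and a RuntimeWarning:
print(np.exp(x))

# Standardization: subtract the mean, divide by the standard deviation
x_std = (x - x.mean()) / x.std()
print(x_std)  # approx. [-1.22  0.  1.22]
print(np.exp(x_std))  # finite, well-behaved values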

  5. Below is more Python code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()  #default style

from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import f_regression
from sklearn.preprocessing import StandardScaler  # needed for the scaling step below
from sklearn.model_selection import train_test_split  # needed for the splitting step below

data = pd.read_csv('1.02.Multiple-linear-regression.csv')

x = data[['SAT', 'Rand 1,2,3']]
y = data['GPA']

reg = LinearRegression()
reg.fit(x,y)

print(reg.coef_)

print(reg.intercept_)

print(reg.score(x,y))  #returns the R-Squared of a linear regression

print(x.shape)  #84 is the number of observations, 2 is the number of predictors

r2 = reg.score(x,y)
n = x.shape[0]
p = x.shape[1]
adjusted_r2 = 1 - (1-r2)*(n-1)/(n-p-1)  # formula for adjusted R-squared
print(adjusted_r2)

print(f_regression(x,y))  # returns two arrays: the F-statistics and the corresponding p-values
p_values = f_regression(x,y)[1]
print(p_values)
print(p_values.round(3))

reg_summary = pd.DataFrame(data=x.columns.values, columns=['Features'])
print(reg_summary)
reg_summary['Coefficients'] = reg.coef_
reg_summary['p-values'] = p_values.round(3)
print(reg_summary)

print("*************")
scaler = StandardScaler()  # holds all the standardization info (mean and std of each feature)
print(scaler.fit(x))  # fit() computes the means and stds and returns the scaler itself
x_scaled = scaler.transform(x)
print(x_scaled)

reg = LinearRegression()
reg.fit(x_scaled,y)  # refit the regression on the standardized inputs

print("*************")
reg_sum = pd.DataFrame([['Intercept'],['SAT'],['Rand 1,2,3']], columns=['Features'])
reg_sum['Weights'] = reg.intercept_, reg.coef_[0], reg.coef_[1]
print(reg_sum)

print("*************")
New_data = pd.DataFrame(data=[[1700,2],[1800,1]], columns=['SAT','Rand 1,2,3'])
print(New_data)
print("*************")
print(reg.predict(New_data))  # misleading output: New_data is unscaled, but the model was trained on standardized inputs
print("*************")
new_data_scaled = scaler.transform(New_data)
print(new_data_scaled)
print("*************")
print(reg.predict(new_data_scaled))

a = np.arange(1,101)
print(a)
print("*************")
b = np.arange(501, 601)
print(b)
print("*************")
print(train_test_split(a))
print("*************")
a_train, a_test, b_train, b_test = train_test_split(a,b, test_size=0.2, shuffle=True)  # for each array passed in, the first piece returned is train, the second is test
print(a_train.shape, a_test.shape)
print("*************")
print(a_train)
print("*************")
print(a_test)
print("*************")
print(b_train)
print("*************")
print(b_test)

The results are shown below (as images in the original post).
6. Feature selection simplifies models, improves speed and prevents a series of unwanted issues arising from having too many features.
7. feature_selection.f_regression: f_regression fits a simple linear regression of the dependent variable on each feature, one feature at a time.
8. The p-values it returns are therefore univariate: they come from those simple linear models and do not reflect the interconnection of the features in our multiple linear regression (see the first sketch after this list).
9. P-values are one of the best ways to determine if a variable is redundant, but they provide no information whatsoever about how useful a variable is.
10. Standardization (a form of feature scaling) is the process of transforming the data we are working with into a standard scale.
11. StandardScaler() is a preprocessing module used to standardize (or scale) data.
12. StandardScaler.transform(x) transforms the unscaled inputs using the information contained in the scaler object (feature-wise).
13. Overfitting means our regression has focused on the particular data set so much that it has "missed the point".
14. Underfitting means the model has not captured the underlying logic of the data: it provides an answer, but one with low accuracy and weak predictive power; underfitted models are clumsy. (A quick train/test check for both problems is sketched after this list.)
15. np.arange([start,] stop[, step]) returns evenly spaced values within a given interval. By default the output is an ndarray.
16. train_test_split(x) splits arrays or matrices into random train and test subsets.
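To confirm that f_regression really is univariate (points 7 and 8), one can check that the p-value it reports for 'SAT' does not change when 'Rand 1,2,3' is dropped from the feature matrix. A minimal sketch, assuming x and y from the multiple-regression code above:

from sklearn.feature_selection import f_regression

p_single = f_regression(x[['SAT']], y)[1]  # SAT on its own
p_multi = f_regression(x, y)[1]  # SAT next to 'Rand 1,2,3'
print(p_single[0], p_multi[0])  # identical, since each feature is tested in isolation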
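Points 13, 14 and 16 come together in a quick practical check: fit on the train subset and compare the R-squared on train and test. A minimal sketch, again assuming x and y from the multiple-regression code above (random_state=42 is an arbitrary choice for reproducibility):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(x_train, y_train)

# A large gap between the two scores hints at overfitting;
# two low scores hint at underfitting.
print(model.score(x_train, y_train))
print(model.score(x_test, y_test))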

Reposted from blog.csdn.net/BSCHN123/article/details/103725750