Machine Learning (2) - Linear Regression

Foreword

Returning to regression problems: linear regression is simple in concept and easy to implement, it is the basis of many powerful nonlinear models, its results are highly interpretable, and it embodies many important ideas in machine learning.

In a regression plot, the horizontal axis is the feature and the vertical axis is the output label; in a classification plot, both axes are features and the output is indicated by color.

When each sample has only one feature, the problem is called simple linear regression.

Because y = |x| is not differentiable everywhere, we measure the gap not with the absolute difference but with the squared difference.

The basic idea behind a whole class of machine learning algorithms: cast the goodness of fit as a loss function, then optimize that loss function to obtain the model's parameters.

Almost all parametric learning algorithms follow this routine: linear regression, polynomial regression, logistic regression, SVM, neural networks, and so on.

Here is a typical least-squares problem: minimize the squared error
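
In symbols (reconstructed here from the surrounding text, since the original formula image is not reproduced), with the model ŷ = ax + b the objective is

\min_{a,b} \sum_{i=1}^{m} \left( y^{(i)} - a x^{(i)} - b \right)^2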

The derivation process:
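
The original derivation images are likewise missing; setting the partial derivatives of the objective with respect to a and b to zero gives the closed-form solution that the code below implements:

a = \frac{\sum_{i=1}^{m} \left( x^{(i)} - \bar{x} \right)\left( y^{(i)} - \bar{y} \right)}{\sum_{i=1}^{m} \left( x^{(i)} - \bar{x} \right)^2}, \qquad b = \bar{y} - a\bar{x}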

Writing a simple linear regression class

import numpy as np


class SimpleLinearRegression1:

    def __init__(self):
        """Initialize the Simple Linear Regression model"""
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        """Train the Simple Linear Regression model on the training sets x_train and y_train"""
        assert x_train.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert len(x_train) == len(y_train), \
            "the size of x_train must be equal to the size of y_train"

        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)

        # accumulate the numerator and denominator of a with an explicit loop
        num = 0.0
        d = 0.0
        for x, y in zip(x_train, y_train):
            num += (x - x_mean) * (y - y_mean)
            d += (x - x_mean) ** 2

        self.a_ = num / d
        self.b_ = y_mean - self.a_ * x_mean

        return self

    def predict(self, x_predict):
        """Given a dataset x_predict, return the vector of predicted values"""
        assert x_predict.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert self.a_ is not None and self.b_ is not None, \
            "must fit before predict!"

        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        """Given a single datum x_single, return its predicted value"""
        return self.a_ * x_single + self.b_

    def __repr__(self):
        return "SimpleLinearRegression1()"

Vectorization

Vector operations in numpy are far more efficient than Python for loops. Notice that both the numerator and the denominator of a above have the form Σ wi * vi, which is exactly the dot product of two vectors. The class after this optimization:

class SimpleLinearRegression2:

    def __init__(self):
        """Initialize the Simple Linear Regression model"""
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        """Train the Simple Linear Regression model on the training sets x_train and y_train"""
        assert x_train.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert len(x_train) == len(y_train), \
            "the size of x_train must be equal to the size of y_train"

        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)

        # numerator and denominator of a as dot products instead of a loop
        self.a_ = (x_train - x_mean).dot(y_train - y_mean) / (x_train - x_mean).dot(x_train - x_mean)
        self.b_ = y_mean - self.a_ * x_mean

        return self

    def predict(self, x_predict):
        """Given a dataset x_predict, return the vector of predicted values"""
        assert x_predict.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert self.a_ is not None and self.b_ is not None, \
            "must fit before predict!"

        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        """Given a single datum x_single, return its predicted value"""
        return self.a_ * x_single + self.b_

    def __repr__(self):
        return "SimpleLinearRegression2()"

Error metrics for linear regression

Mean squared error (MSE)

Root mean squared error (RMSE)

Mean absolute error (MAE)
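
The standard definitions (the original formula images are not reproduced here) are:

\text{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - \hat{y}^{(i)} \right)^2, \qquad \text{RMSE} = \sqrt{\text{MSE}}, \qquad \text{MAE} = \frac{1}{m} \sum_{i=1}^{m} \left| y^{(i)} - \hat{y}^{(i)} \right|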

Code

The following code is run in a Jupyter notebook.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

boston=datasets.load_boston() #note: load_boston was removed in scikit-learn 1.2

x=boston.data[:,5] #use only the RM (number of rooms) feature

y=boston.target

plt.scatter(x,y)
plt.show()

The top edge of the scatter plot looks odd: a row of points sits exactly at 50. The target was probably capped, either by the measuring instrument's maximum or because anything above 50 was recorded as 50 when the survey was filled in, so we remove the points with y >= 50.

x=x[y<50]
y=y[y<50]

plt.scatter(x,y)
plt.show()

Now the plot looks right.

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=666)

%run F:\python3玩转机器学习\线性回归\SimpleLinearRegression.py
reg = SimpleLinearRegression2()
reg.fit(x_train,y_train)

reg.a_
reg.b_

plt.scatter(x_train,y_train)
plt.plot(x_train,reg.predict(x_train),color="r")
plt.show()

y_predict=reg.predict(x_test)

mse_test=np.sum((y_predict-y_test)**2)/len(y_test)
mse_test

from math import sqrt
rmse_test=sqrt(mse_test)
rmse_test

mae_test=np.sum(np.absolute(y_predict-y_test))/len(y_test)
mae_test

MSE and MAE in sklearn:

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
#there is no rmse in sklearn here; take the square root of mse yourself

mean_squared_error(y_test,y_predict)

mean_absolute_error(y_test,y_predict)

Notice that MAE comes out a bit smaller than RMSE even though the two have the same dimension (the same units as y). The reason is that squaring amplifies the larger errors inside RMSE, while MAE weights all errors equally. Trying to make RMSE as small as possible is therefore meaningful.
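
A tiny numeric illustration with made-up residuals: for the error vector [1, 1, 10], MAE = 4 while RMSE = sqrt(34) ≈ 5.83, because the single large error dominates after squaring.

import numpy as np

errors = np.array([1.0, 1.0, 10.0])   # hypothetical residuals
mae = np.mean(np.absolute(errors))    # 4.0
rmse = np.sqrt(np.mean(errors ** 2))  # ~5.83
print(mae, rmse)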

R Squared

In classification, accuracy has a fixed scale: the best value is 1 and the worst is 0, so results on different problems are directly comparable. RMSE and MAE cannot offer this, since their scale depends on the units of y. R squared does: its best value is 1, a value of 0 means the model is no better than always predicting the mean of y, and a negative value means it is worse than that baseline.
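
The definition, which the one-liner below computes:

R^2 = 1 - \frac{\sum_{i} \left( \hat{y}^{(i)} - y^{(i)} \right)^2}{\sum_{i} \left( \bar{y} - y^{(i)} \right)^2} = 1 - \frac{\text{MSE}(\hat{y}, y)}{\text{Var}(y)}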

Code:

1-mean_squared_error(y_test,y_predict)/np.var(y_test)

Encapsulated as functions:

import numpy as np
from math import sqrt


def mean_squared_error(y_true, y_predict):
    """Compute the MSE between y_true and y_predict"""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"

    return np.sum((y_true - y_predict)**2) / len(y_true)


def root_mean_squared_error(y_true, y_predict):
    """Compute the RMSE between y_true and y_predict"""

    return sqrt(mean_squared_error(y_true, y_predict))


def mean_absolute_error(y_true, y_predict):
    """Compute the MAE between y_true and y_predict"""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"

    return np.sum(np.absolute(y_true - y_predict)) / len(y_true)


def r2_score(y_true, y_predict):
    """Compute the R Square between y_true and y_predict"""

    return 1 - mean_squared_error(y_true, y_predict)/np.var(y_true)
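
Called on the notebook variables from above (assuming y_test and y_predict are still defined, and noting that these definitions shadow the sklearn functions of the same name), they reproduce the earlier numbers:

print(mean_squared_error(y_test, y_predict))
print(root_mean_squared_error(y_test, y_predict))
print(mean_absolute_error(y_test, y_predict))
print(r2_score(y_test, y_predict))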

Multiple linear regression and the normal equation

Multiple linear regression:
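
The model and its closed-form solution, reconstructed here from the fit_normal code below since the original formula images are missing: prepend a column of ones to X to form X_b; then

\hat{y} = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n = X_b \cdot \theta, \qquad \theta = \left( X_b^T X_b \right)^{-1} X_b^T y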

The linear regression class:

import numpy as np
from sklearn.metrics import r2_score


class LinearRegression:

    def __init__(self):
        """Initialize the Linear Regression model"""
        self.coef_ = None  # coefficients
        self.intercept_ = None  # intercept
        self._theta = None

    def fit_normal(self, X_train, y_train):
        """Train the Linear Regression model on X_train, y_train via the normal equation"""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"

        # prepend a column of ones, then solve the normal equation
        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        self._theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)

        self.intercept_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

    def predict(self, X_predict):
        """Given a dataset X_predict, return the vector of predicted values"""
        assert self.intercept_ is not None and self.coef_ is not None, \
            "must fit before predict!"
        assert X_predict.shape[1] == len(self.coef_), \
            "the feature number of X_predict must be equal to X_train"

        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return X_b.dot(self._theta)

    def score(self, X_test, y_test):
        """Return the R squared of the current model on the test sets X_test and y_test"""

        y_predict = self.predict(X_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "LinearRegression()"

%run f:\python3玩转机器学习\线性回归\LinearRegression.py

boston=datasets.load_boston()
X=boston.data #all features
y=boston.target
X=X[y<50]
y=y[y<50]

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=666)

reg=LinearRegression()
reg.fit_normal(X_train,y_train)

reg.coef_

reg.intercept_

reg.score(X_test,y_test)

scikit-learn Linear Regression

from sklearn.linear_model import LinearRegression
lin_reg=LinearRegression()

lin_reg.fit(X_train,y_train)

lin_reg.intercept_
lin_reg.coef_

lin_reg.score(X_test,y_test)

KNN Regressor

from sklearn.neighbors import KNeighborsRegressor

knn_reg=KNeighborsRegressor()
knn_reg.fit(X_train,y_train)
knn_reg.score(X_test,y_test)

#grid search for the best hyperparameters
param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)],
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)]
    }
]

knn_reg = KNeighborsRegressor()

from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(knn_reg, param_grid)

grid_search.fit(X_train, y_train)

grid_search.best_params_
grid_search.best_score_ #not comparable to the scores above: it is computed with cross-validation, a different criterion

grid_search.best_estimator_.score(X_test, y_test) #this uses the same R squared criterion as before

Linear regression interpretability

boston=datasets.load_boston()
X=boston.data #all features
y=boston.target
X=X[y<50]
y=y[y<50]

from sklearn.linear_model import LinearRegression
lin_reg=LinearRegression()
lin_reg.fit(X,y)

lin_reg.coef_

np.argsort(lin_reg.coef_)

boston.feature_names[np.argsort(lin_reg.coef_)]

np.argsort returns the indices that would sort the coefficients from smallest to largest, so indexing feature_names with it lists the features from the most negative to the most positive coefficient.

Looking up the feature descriptions in boston.DESCR, house price is positively correlated with features such as RM (number of rooms) and CHAS (whether the tract bounds the Charles River), and negatively correlated with features such as NOX (nitric oxide concentration).
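
A more readable view (a small sketch using the variables above): pair each feature name with its coefficient and print them in sorted order.

for name, coef in sorted(zip(boston.feature_names, lin_reg.coef_), key=lambda pair: pair[1]):
    print(name, coef)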

Linear regression summary

Origin www.cnblogs.com/mcq1999/p/Machine_Learning_3.html