Machine Learning Ten-Day Learning Plan - 2

Linear Regression

Principle
Given a data set {(x1, y1), (x2, y2), ..., (xn, yn)}, where xi = (xi1, xi2, ..., xid) and y ∈ R; here n is the number of samples and d is the dimension of each input xi.
We may describe the relationship between x and y by a linear function:
f(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \cdots + \theta_{d}x_{d}

The goal is to determine a set of values of θ so that f(x) approximates y as closely as possible.
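
As a minimal sketch (not from the original post), the same model can be written in vectorized numpy form, where theta0 is the intercept and theta holds the d coefficients:

import numpy as np

def linear_model(X, theta0, theta):
    # X: (n, d) array of inputs; theta: (d,) coefficient vector; theta0: scalar intercept
    return theta0 + X.dot(theta)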

Derivation
Mean squared error is the loss function used in linear regression, and it can be derived from maximum likelihood estimation.
The idea is to assume that the target variable satisfies

y^{i} = \theta^{T} x^{i} + \eta

where η is random noise assumed to follow a Gaussian distribution. This assumption gives the conditional distribution of y given x and θ, from which the likelihood function of θ can be written down. Choosing θ to maximize the likelihood turns out to be equivalent to minimizing

l(\theta) = \frac{1}{2} \sum_{i=1}^{n} (y^{i} - \theta^{T}x^{i})^{2}

The detailed derivation is omitted here.
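
A brief sketch of the omitted step (under the usual assumption that the noise terms are i.i.d. Gaussian with variance σ²): the conditional density of each target is

p(y^{i} \mid x^{i}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{i} - \theta^{T}x^{i})^{2}}{2\sigma^{2}}\right)

so the log-likelihood over the n samples is

\log L(\theta) = n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^{2}} \sum_{i=1}^{n} (y^{i} - \theta^{T}x^{i})^{2}

and maximizing it over θ is exactly the same as minimizing l(θ).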

Dividing this quantity by the number of samples n gives the mean squared error, so using it as the cost function when fitting the model is statistically reasonable.
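
In symbols, the resulting cost function (the average of the squared errors) is

MSE(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y^{i} - \theta^{T}x^{i})^{2}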

Optimization
Gradient descent
Starting from an initial parameter θ, the goal is to minimize the cost J(θ) (the l(θ) defined above) through repeated iterations, which requires the partial derivatives of J(θ) with respect to θ.
In this method, each update moves the parameters using all of the data points at once, which is why it is called the batch gradient descent method.
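
Concretely, with the cost J(θ) above, the partial derivative and the resulting batch update take the standard form (the original figure is unavailable, so this is a reconstruction of what it presumably showed):

\frac{\partial J(\theta)}{\partial \theta_{j}} = -\sum_{i=1}^{n} (y^{i} - f_{\theta}(x^{i}))\, x_{j}^{i}

\theta_{j} := \theta_{j} + \alpha \sum_{i=1}^{n} (y^{i} - f_{\theta}(x^{i}))\, x_{j}^{i}

where α is the learning rate and the sum runs over all n samples.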
Correspondingly, if each update uses only a single sample, the parameters move one data point at a time:

\theta = \theta + \alpha(y^{i} - f_{\theta}(x^{i}))\, x^{i}

This is the stochastic gradient descent method.
When the amount of data is large, stochastic gradient descent is often preferable to batch gradient descent.

Code:

# Generate the training data set
import numpy as np
np.random.seed(1234)
x = np.random.rand(500, 3)                    # 500 samples with 3 features in [0, 1)
y = x.dot(np.array([4.2, 5.7, 10.8]))         # noiseless targets, true coefficients [4.2, 5.7, 10.8]
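
As a quick sanity check (not part of the original post), numpy's built-in least-squares solver can be applied to the same noiseless data and should recover coefficients essentially equal to [4.2, 5.7, 10.8]:

# Hypothetical sanity check: closed-form least-squares fit on the data above
w_exact, *_ = np.linalg.lstsq(x, y, rcond=None)
print(w_exact)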

Batch gradient descent:

# Batch gradient descent to solve for the parameters
class LR_GD():
    def __init__(self):
        self.w = None
    def fit(self, X, y, alpha=0.002, loss=1e-10):
        y = y.reshape((-1, 1))
        [m, d] = np.shape(X)
        self.w = np.zeros((d, 1))                                 # initialize parameters to zero
        theta = self.w
        diff = np.dot(X, theta) - y                               # residuals for the current theta
        tol = (1/m)*np.dot(np.transpose(diff), diff)[0][0]        # mean squared residual
        while tol > loss:
            gradient = np.dot(np.transpose(X), diff)              # gradient of the squared-error loss
            theta = theta - gradient*alpha                        # batch update using all samples
            diff = np.dot(X, theta) - y
            tol = (1/m)*np.dot(np.transpose(diff), diff)[0][0]
            #print(tol)
        self.w = theta
        return theta
    def predict(self, X, weight):
        y_pred = np.dot(X, weight)
        return y_pred
# Test
if __name__ == "__main__":
    lr_gd = LR_GD()
    weight = lr_gd.fit(x, y)
    print("Estimated parameters: %s" % (weight))
    x_test = np.array([2, 4, 5]).reshape(1, -1)
    print("Prediction: %s" % (lr_gd.predict(x_test, weight)))

Output:

Estimated parameters: [[ 4.20000516]
 [ 5.70002061]
 [10.7999735 ]]
Prediction: [[85.19996027]]

Stochastic gradient descent method:

# Stochastic gradient descent solution (a single pass over the data)
class LR_stoGD():
    def __init__(self):
        self.w = None
    def fit(self, X, y, alpha=0.002, loss=1e-10):
        y = y.reshape((-1, 1))
        [m, d] = np.shape(X)
        self.w = np.zeros((1, d))
        theta = self.w
        for i in range(m):
            # residual of the single sample i under the current theta
            diff = sum(np.dot(X[i, :], np.transpose(theta))) - y[i, :]
            gradient = X[i, :] * diff          # gradient contribution of this sample only
            theta = theta - alpha * gradient   # update using one data point at a time
            #print(theta)
        self.w = theta
        return theta
if __name__ == "__main__":
    lr_stogd = LR_stoGD()
    w = lr_stogd.fit(x, y)
    print('Estimated parameters: %s' % (w))

Output:

Estimated parameters: [[3.58871918 4.00836808 4.17746749]]

Because the stochastic method above makes only a single pass over the 500 samples, the estimated parameter values are still quite inaccurate.
Updating with only one sample point at a time turns gradient descent into the stochastic gradient descent algorithm; when there are billions of samples and thousands of features, the computational cost of batch gradient descent is too high, and stochastic gradient descent is the more suitable choice.
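
As a rough illustration (not part of the original post), repeating the same one-sample update for a number of extra passes (epochs) over the data brings the estimate much closer to the true coefficients:

# Hypothetical extension: continue the one-sample updates for extra epochs
lr_stogd = LR_stoGD()
w = lr_stogd.fit(x, y)                     # the single pass shown above
for epoch in range(200):                   # additional passes over the same data
    for i in range(x.shape[0]):
        diff = x[i, :].dot(w.T) - y[i]     # residual of sample i
        w = w - 0.002 * x[i, :] * diff     # same update rule as in fit()
print('Estimated parameters after extra epochs: %s' % (w))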

Origin blog.csdn.net/weixin_43959248/article/details/105045522