Linear Algebra, Matrix Computation, Automatic Differentiation

Learning record: 

For this topic, I find the blog below more detailed:

[Hands-on Deep Learning | week1b] 05 Linear Algebra, 06 Matrix Calculation, 07 Automatic Differentiation (Big Stomach Sheep's Blog, CSDN): https://blog.csdn.net/davidyang_980/article/details/122042158

"Hands-On Deep Learning"

Hands-on Deep Learning (d2l.ai) https://zh-v2.d2l.ai/d2l-zh-pytorch.pdf

1. Linear Algebra

Linear model

The linearity assumption means that the target can be expressed as a weighted sum of the features, as in the following formula:

$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$$

Here $w_{\mathrm{area}}$ and $w_{\mathrm{age}}$ are called weights; they determine the influence of each feature on the predicted value. $b$ is called the bias, offset, or intercept.

Strictly speaking, the formula above is an affine transformation of the input features: a linear transformation of the features via a weighted sum, followed by a translation via the bias term. In a linear model, the predicted output is determined by this affine transformation, which in turn is determined by the chosen weights and bias.
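For concreteness, here is a tiny sketch of the price formula in plain Python; the weight, bias, and feature values are purely hypothetical:

```python
# Hypothetical weights and bias for the price formula above.
w_area, w_age, b = 120.0, -5.0, 1000.0

# One hypothetical house: 80 square meters, 10 years old.
area, age = 80.0, 10.0

# Affine transformation: weighted sum of features plus the offset b.
price = w_area * area + w_age * age + b
print(price)  # 10550.0
```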

The matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ conveniently refers to all $n$ samples of the dataset: each row of $\mathbf{X}$ is a sample, and each column is a feature. For a feature set $\mathbf{X}$, the predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$ can be expressed via matrix-vector multiplication as $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b$.
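As a minimal sketch in PyTorch (shapes and values are illustrative), the prediction for all $n$ samples is a single matrix-vector product plus a broadcast bias:

```python
import torch

n, d = 4, 2                 # n samples, d features
X = torch.randn(n, d)       # each row is one sample, each column one feature
w = torch.randn(d)          # one weight per feature
b = torch.tensor(1.0)       # scalar bias, broadcast over all samples

y_hat = torch.mv(X, w) + b  # y_hat = Xw + b
print(y_hat.shape)          # torch.Size([4])
```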

Loss function

The loss function quantifies the difference between the target's actual value and the predicted value. Usually we choose a non-negative number as the loss: the smaller the value, the smaller the loss, and a perfect prediction has loss 0. The most commonly used loss function in regression problems is the squared error. When the prediction for sample $i$ is $\hat{y}^{(i)}$ and its corresponding true label is $y^{(i)}$, the squared error is defined as:

$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$

When training the model, we want to find a set of parameters $(\mathbf{w}^*, b^*)$ that minimizes the total loss over all training samples:

$$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w},\, b}\; L(\mathbf{w}, b), \quad \text{where } L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^{n} l^{(i)}(\mathbf{w}, b)$$
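A minimal sketch of the squared loss in PyTorch; the sample values are made up, and averaging over a batch gives the total loss being minimized:

```python
import torch

def squared_loss(y_hat, y):
    """Squared error with the 1/2 factor, which cancels when differentiating."""
    return (y_hat - y) ** 2 / 2

y_hat = torch.tensor([2.5, 0.0])      # hypothetical predictions
y = torch.tensor([3.0, -0.5])         # hypothetical true labels
print(squared_loss(y_hat, y))         # per-sample losses: tensor([0.1250, 0.1250])
print(squared_loss(y_hat, y).mean())  # average loss over the batch
```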

Analytical solution

The solution of linear regression can be expressed by a simple closed-form formula, called the analytical solution. First, the bias $b$ is subsumed into the parameter $\mathbf{w}$ by appending a column of all ones to the design matrix $\mathbf{X}$. The prediction problem is then to minimize $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$. There is only one critical point on the loss surface, and it corresponds to the minimum of the loss over the entire domain. Setting the derivative of the loss with respect to $\mathbf{w}$ to zero yields the analytical solution:

$$\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
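A sketch of the analytical solution on synthetic data, assuming the appended-column trick above; `torch.linalg.lstsq` solves the least-squares problem directly, which is numerically preferable to forming the inverse explicitly:

```python
import torch

n, d = 100, 3
X = torch.randn(n, d)
X = torch.cat([X, torch.ones(n, 1)], dim=1)   # append a column of ones to absorb b
w_true = torch.tensor([2.0, -3.4, 1.7, 4.2])  # last entry plays the role of the bias
y = X @ w_true                                # noise-free synthetic targets

# Minimize ||y - Xw||^2; equivalent to w* = (X^T X)^{-1} X^T y.
w_star = torch.linalg.lstsq(X, y.unsqueeze(-1)).solution.squeeze()
print(w_star)  # recovers w_true up to floating-point error
```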

Stochastic gradient descent

Stochastic gradient descent can optimize almost all deep learning models. It reduces the error by repeatedly updating the parameters in the direction that decreases the loss function. The algorithm proceeds as follows: (1) initialize the model parameters, e.g. randomly; (2) repeatedly sample a random minibatch $\mathcal{B}$ from the dataset and update the parameters in the direction of the negative gradient, iterating this step continuously. For the squared loss and an affine transformation, the update can be written explicitly as:

$$\mathbf{w} \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right), \qquad b \leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)$$

where $\eta$ is the learning rate and $|\mathcal{B}|$ is the batch size.
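Below is a minimal minibatch SGD sketch for linear regression on synthetic data; the learning rate, batch size, and epoch count are illustrative choices, and autograd supplies the gradient rather than the hand-written formula above:

```python
import torch

torch.manual_seed(0)
n, d, lr, batch_size = 1000, 2, 0.03, 10
w_true, b_true = torch.tensor([2.0, -3.4]), 4.2
X = torch.randn(n, d)
y = X @ w_true + b_true + 0.01 * torch.randn(n)  # synthetic data with small noise

# (1) Initialize the model parameters.
w = torch.zeros(d, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

# (2) Repeatedly update parameters in the direction of the negative gradient.
for epoch in range(3):
    perm = torch.randperm(n)               # reshuffle once per epoch
    for i in range(0, n, batch_size):
        idx = perm[i:i + batch_size]       # random minibatch B
        loss = ((X[idx] @ w + b - y[idx]) ** 2 / 2).mean()
        loss.backward()                    # autograd computes dloss/dw, dloss/db
        with torch.no_grad():
            w -= lr * w.grad               # step against the gradient
            b -= lr * b.grad
            w.grad.zero_()                 # reset gradients for the next step
            b.grad.zero_()

print(w.detach(), b.detach())  # should approach w_true and b_true
```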
