Basic optimization algorithm
gradient descent
- Our model has no explicit (closed-form) solution. (In practice it is rare for a problem to fit a linear model exactly, and most models have no explicit solution.)
- Gradient: The direction in which the value of a function increases the fastest.
- Negative gradient: The direction in which the value of a function decreases fastest.
- Learning rate η: how far to move along that direction at each step. (η is the Greek letter eta.)
- (-η * gradient), where the gradient is the derivative of the loss at w0, is a step in the direction in which the function decreases fastest. The next position is therefore w1 = w0 + (-η * gradient).
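The update rule can be tried out on a one-dimensional function. The following is a minimal sketch, not code from the original notes: the function f(w) = (w - 3)^2, the starting point, the learning rate and the helper name gradient_descent are all assumptions chosen for illustration.

```python
def gradient_descent(grad, w0, eta, steps):
    """Repeat w <- w + (-eta * grad(w)) and return every position visited."""
    path = [w0]
    for _ in range(steps):
        w = path[-1]
        path.append(w + (-eta * grad(w)))
    return path


# Hypothetical example: f(w) = (w - 3)^2, so its derivative is 2 * (w - 3).
path = gradient_descent(lambda w: 2 * (w - 3), w0=0.0, eta=0.1, steps=50)

w1 = path[1]        # position after the first step
w_final = path[-1]  # position after 50 steps, close to the minimum at w = 3
```

With a learning rate that is too large the same loop can overshoot and diverge, which is why η has to be chosen with some care.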
1.1 Mini-batch stochastic gradient descent
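- Each gradient computation requires differentiating the entire loss function, which is the average loss over all samples. In other words, computing the gradient once requires a pass over the whole dataset, which is very expensive.
- Mini-batch stochastic gradient descent instead estimates the gradient from a small random batch of samples at each step.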
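To make the cost difference concrete, here is a small sketch that compares a full-dataset gradient with a mini-batch estimate. It is an illustration under assumed settings (a one-parameter model y = w * x, 1000 samples, a batch of 10, and the hypothetical helper mean_gradient), not the notes' own code.

```python
import random


def mean_gradient(w, xs, ys):
    """Gradient with respect to w of the mean of 0.5 * (w * x - y)^2."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


random.seed(0)
true_w = 2.0                                    # assumed ground-truth weight
xs = [random.gauss(0, 1) for _ in range(1000)]
ys = [true_w * x for x in xs]

w = 0.0
full_grad = mean_gradient(w, xs, ys)            # touches all 1000 samples

batch = random.sample(range(len(xs)), 10)       # 10 random sample indices
batch_grad = mean_gradient(w, [xs[i] for i in batch], [ys[i] for i in batch])
```

The mini-batch value is noisier than the full gradient, but it costs one hundredth of the work here and points in the same general direction.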
1.2 Summary
Linear regression implementation
1. Process data
- If you do not have the d2l package, open cmd as an administrator and run the following to download and install it:
pip install -U d2l -i https://mirrors.aliyun.com/pypi/simple/
- If it reports ModuleNotFoundError: No module named 'torchvision', run the following directly in Jupyter Notebook:
pip install torchvision -i https://mirrors.aliyun.com/pypi/simple/
This code defines a function called synthetic_data, which generates synthetic data.
The function accepts three parameters:
- w: a one-dimensional tensor (vector) holding the weights of the model.
- b: a scalar, the bias term of the model.
- num_examples: an integer, the number of data samples to generate.
The main steps of the function are as follows:
- Use torch.normal(0, 1, (num_examples, len(w))) to generate a random tensor X of shape (num_examples, len(w)) whose entries follow a standard normal distribution (mean 0, standard deviation 1).
- Use the matrix multiplication torch.matmul(X, w) to multiply X by the weights w, then add the bias term b to get the noise-free value y.
- Use torch.normal(0, 0.01, y.shape) to generate random noise with the same shape as y, drawn from a normal distribution with mean 0 and standard deviation 0.01 (not a standard normal), and add it to y to simulate the noise in real data.
- Finally, use y.reshape((-1, 1)) to convert y into a 2D tensor with a single column. The -1 means that the size of that dimension is inferred from the sizes of the other dimensions. Put simply, reshape((-1, 1)) turns the array into a 2D array with exactly one column and as many rows as needed.
The function returns the generated synthetic data X and the reshaped labels y.
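The steps described above can be sketched as follows. The function body follows the description; the values of true_w and true_b and the sample count of 1000 are assumptions for illustration, since the notes do not state them.

```python
import torch


def synthetic_data(w, b, num_examples):
    """Generate y = Xw + b + noise."""
    X = torch.normal(0, 1, (num_examples, len(w)))  # standard normal features
    y = torch.matmul(X, w) + b                      # noise-free labels
    y += torch.normal(0, 0.01, y.shape)             # Gaussian noise, std 0.01
    return X, y.reshape((-1, 1))                    # labels as a single column


# Assumed example values, not taken from the text above.
true_w = torch.tensor([2.0, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)
```

Each row of features is one sample with two features, and labels holds the matching scalar label in one column.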
This code is an example of drawing a scatter plot with the d2l library.
d2l.set_figsize() sets the size of the figure; a (width, height) pair can be passed in.
d2l.plt.scatter(features[:,1].detach().numpy(), labels.detach().numpy(), 1)
This line of code draws the scatter plot, where features[:,1] takes the second column of the feature matrix as the x coordinates, labels supplies the y coordinates, and 1 is the marker size of each point (the s argument of scatter, measured in points squared rather than as a radius).
The detach() method returns a tensor that is detached from the computational graph and does not require gradients. It shares memory with the original tensor rather than copying it; it is used here because numpy() cannot be called on a tensor that requires gradients.
The numpy() method converts a tensor to a NumPy array for use in plotting functions.
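The behaviour of detach() and numpy() can be checked without plotting anything. This is a small sketch with a hypothetical three-element tensor t; it only illustrates the two calls and is not part of the original plotting code.

```python
import torch

t = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

try:
    t.numpy()                  # not allowed while t requires gradients
    numpy_failed = False
except RuntimeError:
    numpy_failed = True

d = t.detach()                 # same data, but no gradient tracking
arr = d.numpy()                # NumPy array backed by that same memory

shares_memory = d.data_ptr() == t.data_ptr()
```

Because the detached tensor shares storage with the original, in-place changes to arr would also show up in t, so detach() should not be read as making a safe copy.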
1.3 Generate a mini-batch of size batch_size
This code defines a function called data_iter, a generator that yields mini-batches of training data.
The function accepts three parameters: batch_size, features and labels. Here batch_size is the number of samples in each batch, features is the feature matrix, and labels is the label vector.
First, the function computes the total number of samples, num_examples, and creates a list indices containing the index of every sample. It then shuffles the indices with random.shuffle().
Next, the function uses a loop to produce the batches. In each iteration it takes batch_size indices from the shuffled list and converts them into a tensor batch_indices. It then uses the yield statement to return the rows of the feature matrix and label vector that belong to the current batch.
Because it uses yield, the function is a generator: it produces batches one at a time as the loop asks for them, instead of building every batch up front. The full features and labels tensors are still held in memory, so the saving is limited to the per-batch copies.
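A sketch of such a generator is shown below. It follows the description above; the handling of a smaller final batch and the toy tensors used to exercise it are assumptions added for illustration.

```python
import random

import torch


def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)                  # visit the samples in random order
    for i in range(0, num_examples, batch_size):
        # The final batch is smaller when batch_size does not divide num_examples.
        batch_indices = torch.tensor(indices[i:min(i + batch_size, num_examples)])
        yield features[batch_indices], labels[batch_indices]


# Hypothetical toy data: 10 samples with 2 features each.
features = torch.arange(20, dtype=torch.float32).reshape(10, 2)
labels = torch.arange(10, dtype=torch.float32).reshape(10, 1)
batch_shapes = [(tuple(X.shape), tuple(y.shape))
                for X, y in data_iter(4, features, labels)]
```

Indexing a tensor with batch_indices copies only the selected rows, so each yielded pair is an independent mini-batch.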
2. Processing models
This code defines two PyTorch tensors, w and b, which are the parameters of the linear regression model.
w = torch.normal(0, 0.01, size=(2,1), requires_grad=True)
This line of code creates a tensor w of shape (2,1) whose elements are randomly sampled from a normal distribution with mean 0 and standard deviation 0.01. requires_grad=True tells PyTorch to track operations on this tensor so that its gradient is computed during backpropagation and can be used to update it.
b = torch.zeros(1, requires_grad=True)
This line of code creates a tensor b of shape (1,) whose single element is initialized to 0. requires_grad=True has the same meaning as for w.
Here w holds the weights and b the bias term of the linear regression model. An optimization algorithm such as stochastic gradient descent repeatedly updates w and b, bringing the model's predictions closer and closer to the true values.
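To see what requires_grad=True enables, the sketch below initializes the two parameters as described and runs one backward pass. The input X is a hypothetical single sample chosen so that the resulting gradients are easy to read.

```python
import torch

w = torch.normal(0, 0.01, size=(2, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)

# Hypothetical input: one sample with two features.
X = torch.tensor([[1.0, 2.0]])
y_hat = torch.matmul(X, w) + b   # prediction of shape (1, 1)
y_hat.sum().backward()           # fills w.grad and b.grad
```

After backward(), w.grad has the same shape as w and b.grad the same shape as b, which is what the update step relies on.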
3. Model evaluation
This code defines a function called sgd, which performs mini-batch stochastic gradient descent.
The function accepts three parameters: params, lr and batch_size. Here params is a list of the model parameters, lr is the learning rate, and batch_size is the number of samples in each batch.
The function runs inside the torch.no_grad() context manager, so that the parameter updates themselves are not recorded in the computational graph.
Inside it, the function loops over each parameter in params. For each parameter, it divides the gradient param.grad of the current batch by the batch size batch_size, multiplies the result by the learning rate lr, and subtracts this value from the parameter in place. Finally, it calls param.grad.zero_() to reset that parameter's gradient, because PyTorch accumulates gradients across backward() calls instead of overwriting them.
In summary, this code implements a simple mini-batch stochastic gradient descent update for training the model.
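The update described above can be sketched as follows. The function body follows the description; the one-parameter check underneath uses made-up numbers so the result can be verified by hand.

```python
import torch


def sgd(params, lr, batch_size):
    """Mini-batch stochastic gradient descent update."""
    with torch.no_grad():                      # do not track the update itself
        for param in params:
            param -= lr * param.grad / batch_size
            param.grad.zero_()                 # gradients accumulate otherwise


# Hypothetical check: the summed loss is 2w + 4w, so its gradient is 6.
w = torch.tensor([1.0], requires_grad=True)
(w * torch.tensor([2.0, 4.0])).sum().backward()
sgd([w], lr=0.1, batch_size=2)                 # w becomes 1 - 0.1 * 6 / 2
```

Dividing by batch_size turns the gradient of the summed batch loss into the gradient of the mean loss, so the effective step size does not depend on how many samples are in a batch.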
4. Training process
The obscured line of code is: print(f'epoch{epoch+1}, loss{float(train_l.mean()):f}')
This code is the complete procedure for training the model: forward propagation, loss calculation, backpropagation and parameter update.
First, a for loop iterates over multiple epochs. In each epoch, the data_iter() function yields mini-batches X, y, where batch_size is the number of samples in each batch, features is the feature matrix, and labels is the label vector.
For each mini-batch X, net(X,w,b) runs forward propagation to get the predicted values, and loss(net(X,w,b),y) computes the loss between the predictions and the true labels y.
Next, l.sum().backward() backpropagates the summed batch loss to compute the gradient of each parameter. The mini-batch stochastic gradient descent step sgd([w,b], lr, batch_size) then updates the parameters w and b.
After each epoch, the with torch.no_grad(): context manager turns off gradient tracking, because evaluating the model does not need gradients. Inside it, train_l = loss(net(features, w, b), labels) computes the loss of the current model on the whole training set (the same features and labels used for training, not a separate test set), and the current epoch and mean loss are printed.
In short, this code implements a standard training loop that improves the model by repeatedly applying the optimization step.
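Putting the pieces together, a self-contained sketch of the loop might look like this. The definitions of net and loss (a linear model and a per-sample squared loss), the true parameters, and the hyperparameters lr, num_epochs and batch_size are all assumptions, since the notes refer to them without showing their values.

```python
import random

import torch


def synthetic_data(w, b, num_examples):
    X = torch.normal(0, 1, (num_examples, len(w)))
    y = torch.matmul(X, w) + b
    y += torch.normal(0, 0.01, y.shape)
    return X, y.reshape((-1, 1))


def data_iter(batch_size, features, labels):
    indices = list(range(len(features)))
    random.shuffle(indices)
    for i in range(0, len(indices), batch_size):
        batch_indices = torch.tensor(indices[i:i + batch_size])
        yield features[batch_indices], labels[batch_indices]


def net(X, w, b):                  # assumed: linear model Xw + b
    return torch.matmul(X, w) + b


def loss(y_hat, y):                # assumed: squared loss for each sample
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2


def sgd(params, lr, batch_size):
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            param.grad.zero_()


torch.manual_seed(0)
random.seed(0)
true_w, true_b = torch.tensor([2.0, -3.4]), 4.2       # assumed values
features, labels = synthetic_data(true_w, true_b, 1000)

w = torch.normal(0, 0.01, size=(2, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr, num_epochs, batch_size = 0.03, 3, 10              # assumed values

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y)      # per-sample loss on the mini-batch
        l.sum().backward()             # gradients of the summed batch loss
        sgd([w, b], lr, batch_size)    # update w and b, then zero the gradients
    with torch.no_grad():
        train_l = loss(net(features, w, b), labels)
        print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')
```

Because the data are synthetic, the learned w and b can be compared directly with true_w and true_b to confirm that training has converged.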