Hands-on Deep Learning Notes (6) - Multilayer Perceptron and Implementation

A multilayer perceptron (MLP) is also called an artificial neural network (ANN): in addition to the input and output layers, it can have multiple hidden layers in between.

1.1 Hidden layer

We can overcome the limitations of linear models by adding one or more hidden layers to the network, allowing it to handle more general kinds of functional relationships. The simplest way to do this is to stack many fully connected layers on top of each other: each layer feeds into the layer above it until the final output is generated.
[Figure: a single-hidden-layer multilayer perceptron with 5 hidden units]
This single-hidden-layer multilayer perceptron has 5 hidden units. The input layer does not involve any computation, so producing an output with this network requires only the computations of the hidden layer and the output layer; the number of layers in this multilayer perceptron is therefore 2. Note that both layers are fully connected: every input affects every neuron in the hidden layer, and every neuron in the hidden layer in turn affects every neuron in the output layer.
To realize the potential of the multilayer architecture, we need one more key ingredient: a nonlinear activation function applied to each hidden unit after the affine transformation. The outputs of the activation function are called activations. With an activation function in place, the multilayer perceptron can in general no longer collapse into a linear model.
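
To make this concrete, here is a minimal sketch (with made-up shapes, not taken from the text above) of a single-hidden-layer forward pass in PyTorch, where a nonlinearity is applied after the hidden layer's affine transformation:

import torch

X = torch.randn(2, 4)                        # a mini-batch of 2 examples with 4 features
W1, b1 = torch.randn(4, 5), torch.zeros(5)   # hidden layer: 5 hidden units
W2, b2 = torch.randn(5, 3), torch.zeros(3)   # output layer: 3 outputs

H = torch.relu(X @ W1 + b1)  # hidden layer: affine transformation followed by a nonlinearity
O = H @ W2 + b2              # output layer: affine transformation only

Without the nonlinearity, H would itself be affine and O would collapse into a single affine function of X, i.e. an ordinary linear model.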

1.2 Activation function

An activation function decides whether a neuron should be activated by computing a weighted sum and adding a bias; it transforms the input signal into an output via a differentiable operation, and most activation functions are nonlinear. Since activation functions are a foundation of deep learning, some common activation functions are briefly introduced below.

1.2.1 ReLU function

The most popular activation function is the rectified linear unit (ReLU), because it is simple to implement and performs well across a wide range of prediction tasks. ReLU provides a very simple nonlinear transformation. Given an element x, the ReLU function is defined as the maximum of that element and 0:
ReLU(x) = max(x, 0)
In plain terms, the ReLU function keeps only positive elements and discards all negative elements by setting the corresponding activations to 0. To get an intuitive feel for it, we can plot the function. As the figure shows, the activation function is piecewise linear.

%matplotlib inline
import torch
from d2l import torch as d2l

x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
y = torch.relu(x)
d2l.plot(x.detach(), y.detach(), 'x', 'relu(x)', figsize=(5, 2.5))

[Plot of the ReLU function]
When the input is negative, the derivative of the ReLU function is 0; when the input is positive, the derivative is 1. Note that the ReLU function is not differentiable when the input is exactly 0. Below we plot the derivative of the ReLU function.

y.backward(torch.ones_like(x), retain_graph=True)
d2l.plot(x.detach(), x.grad, 'x', 'grad of relu', figsize=(5, 2.5))

[Plot of the derivative of the ReLU function]
The reason for using ReLU is that its derivative is particularly well behaved: it either vanishes or simply lets the argument pass through. This makes optimization better behaved, and ReLU alleviates the vanishing-gradient problem that plagued earlier neural networks.
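
As a side note, ReLU is simple enough to write from scratch; a minimal sketch equivalent to torch.relu (for illustration only, not from the original text):

def relu(X):
    # Element-wise max(X, 0): keep positive entries, set the rest to 0
    return torch.max(X, torch.zeros_like(X))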

1.2.2 sigmoid function

For an input in the domain ℝ, the sigmoid function transforms the input into an output on the interval (0, 1). For this reason, the sigmoid is often called a squashing function: it squashes any input in the range (-inf, inf) to a value in the interval (0, 1):
sigmoid(x) = 1 / (1 + exp(-x))
In the earliest neural networks, scientists were interested in modeling biological neurons that either "fire" or "do not fire". The pioneers of this field, going all the way back to McCulloch and Pitts, the inventors of the artificial neuron, therefore focused on threshold units. A threshold unit takes the value 0 when its input is below some threshold and the value 1 when its input exceeds the threshold.
When attention turned to gradient-based learning, the sigmoid function was a natural choice because it is a smooth, differentiable approximation of a threshold unit.

y = torch.sigmoid(x)
d2l.plot(x.detach(), y.detach(), 'x', 'sigmoid(x)', figsize=(5, 2.5))

[Plot of the sigmoid function]
The derivative of the sigmoid function is given by:

d/dx sigmoid(x) = exp(-x) / (1 + exp(-x))^2 = sigmoid(x)(1 - sigmoid(x))

The graph of the derivative of the sigmoid function is shown below. Note that when the input is 0, the derivative of the sigmoid function reaches its maximum value of 0.25; as the input moves farther from 0 in either direction, the derivative approaches 0.

# Clear the previous gradients
x.grad.data.zero_()
y.backward(torch.ones_like(x), retain_graph=True)
d2l.plot(x.detach(), x.grad, 'x', 'grad of sigmoid', figsize=(5, 2.5))

[Plot of the derivative of the sigmoid function]
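
As a quick sanity check (not part of the original text), the closed-form derivative sigmoid(x)(1 - sigmoid(x)) can be compared against the gradient computed by autograd:

x_check = torch.linspace(-5.0, 5.0, steps=11, requires_grad=True)
torch.sigmoid(x_check).sum().backward()
with torch.no_grad():
    closed_form = torch.sigmoid(x_check) * (1 - torch.sigmoid(x_check))
print(torch.allclose(x_check.grad, closed_form))  # expected: True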

1.2.3 tanh function

Similar to the sigmoid function, the tanh (hyperbolic tangent) function can also compress its input to the interval (-1, 1). The formula for the tanh function is as follows:
tanh(x) = (1 - exp(-2x)) / (1 + exp(-2x))
Below we plot the tanh function. Note that the tanh function is approximately linear when the input is near 0. Its shape is similar to that of the sigmoid function, except that the tanh function is symmetric about the origin of the coordinate system.

y = torch.tanh(x)
d2l.plot(x.detach(), y.detach(), 'x', 'tanh(x)', figsize=(5, 2.5))

[Plot of the tanh function]
The derivative of the tanh function is:
d/dx tanh(x) = 1 - tanh^2(x)
As the input approaches 0, the derivative of the tanh function approaches its maximum value of 1. Similar to what we saw for the sigmoid function, the farther the input is from 0 in either direction, the closer the derivative gets to 0.

# Clear the previous gradients
x.grad.data.zero_()
y.backward(torch.ones_like(x), retain_graph=True)
d2l.plot(x.detach(), x.grad, 'x', 'grad of tanh', figsize=(5, 2.5))

[Plot of the derivative of the tanh function]
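
Analogously (again as a supplementary check, not from the original text), the identity 1 - tanh(x)^2 can be verified against autograd:

x_check = torch.linspace(-5.0, 5.0, steps=11, requires_grad=True)
torch.tanh(x_check).sum().backward()
with torch.no_grad():
    closed_form = 1 - torch.tanh(x_check) ** 2
print(torch.allclose(x_check.grad, closed_form))  # expected: True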

1.3 Concise implementation

import torch
from torch import nn
from d2l import torch as d2l

Compared with the concise implementation of softmax regression, the only difference is that we now add two fully connected layers (previously we added only one). The first is the hidden layer, which contains 256 hidden units and uses the ReLU activation function; the second is the output layer.

net = nn.Sequential(nn.Flatten(),
                    nn.Linear(784, 256),
                    nn.ReLU(),
                    nn.Linear(256, 10))

def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.01)

net.apply(init_weights);

batch_size, lr, num_epochs = 256, 0.1, 10
loss = nn.CrossEntropyLoss()
trainer = torch.optim.SGD(net.parameters(), lr=lr)

train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)

[Plot of the training results]

For the same classification problem, the implementation of the multilayer perceptron is essentially the same as that of softmax regression; the only difference is that the multilayer perceptron adds a hidden layer with an activation function.
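
To inspect the trained model, one option is to visualize a few test-set predictions. This assumes the d2l helper predict_ch3, defined in the earlier softmax-regression chapter, is available:

# Assumes d2l.predict_ch3 from the softmax-regression chapter; shows a few test images with predicted labels
d2l.predict_ch3(net, test_iter)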

Source: blog.csdn.net/qq_52118067/article/details/122910683