Write ChatGPT yourself: neurons and loss functions of neural networks

ChatGPT is built on a so-called large model. There are two keywords here: "large" and "model". Let's first look at what a "model" is. The model is really just the neural network of deep learning, and it is composed of many basic units called "neurons". A neuron is a basic computing unit that performs two operations. First, a weight matrix W multiplies the input vector X, giving the vector W*X, and another vector b is added to it; the result W*X + b is still a vector. These steps together are called the linear operation. Then this vector is fed into a function f, and the final output is the vector f(W*X + b); this step is called the nonlinear operation. The basic process is as follows:
[Figure: a neuron computes the linear step W*X + b and then applies the activation function f]
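Before going further, here is a minimal sketch of one such unit in PyTorch. The sizes are made up for illustration (a 4-dimensional input and a 3-dimensional output); the point is only to show the linear step W*X + b followed by the activation f:

import torch

# hypothetical sizes: the input X has 4 components, the unit produces 3 outputs
X = torch.randn(4)      # input vector X
W = torch.randn(3, 4)   # weight matrix W
b = torch.randn(3)      # bias vector b

# linear operation: W*X + b
linear_out = W @ X + b
# nonlinear operation: apply the activation function f, here sigmoid
out = torch.sigmoid(linear_out)
print(out)              # f(W*X + b), a 3-dimensional vector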
ChatGPT has 175 billion parameters, meaning it is a super-large network built by interconnecting a huge number of computing units like the one above, whose W and b entries together add up to 175 billion values. A key step in the process above is the function f, which is called the activation function. Its purpose is to apply a nonlinear transformation to the result of the preceding linear operation. Four activation functions are in common use. The first is called sigmoid; its expression is 1 / (1 + e^(-x)). Let's look at its graph:

import torch
import matplotlib.pyplot as plt

# sample points for x: [-5.0, -4.9, -4.8, ..., 4.9]
x = torch.arange(-5., 5., 0.1)
print(f"x:{x}")
# apply the activation function
y = torch.sigmoid(x)
print(f"y:{y}")
# plot the curve from the sampled points
plt.plot(x.numpy(), y.numpy())
plt.show()

After the above code is executed, the output graph is as follows:
[Figure: the sigmoid curve, rising smoothly from 0 to 1 as x goes from -5 to 5]
Its output lies between 0 and 1, so if we want the network to predict a probability, we can use this function at the end of the network. It has a problem, though: where the output is close to 0 or 1 (that is, where |x| is large), the derivative is very close to 0, which causes the so-called "vanishing gradient" problem when training the network.
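A quick way to see the vanishing gradient (a small sketch, not part of the original illustration) is to ask autograd for the derivative of sigmoid at x = 0 and at x = 5:

import torch

x = torch.tensor([0.0, 5.0], requires_grad=True)
y = torch.sigmoid(x)
# sum() lets us call backward() once and read the gradient at both points
y.sum().backward()
print(x.grad)   # roughly tensor([0.2500, 0.0066]); at x = 5 the gradient has almost vanished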

The second activation function is called tanh(x); its expression is (e^x - e^(-x)) / (e^x + e^(-x)). We plot it with the following code:

import torch
import matplotlib.pyplot as plt

# sample x over [-5.0, 4.9] and apply tanh
x = torch.arange(-5., 5., 0.1)
y = torch.tanh(x)
plt.plot(x.numpy(), y.numpy())
plt.show()

After the above code is run, the result is as follows:
[Figure: the tanh curve, rising from -1 to 1 and centered at the origin]
The third is called ReLU, currently the most important and most widely used activation function. Its formula is f(x) = max(0, x). It looks simple, but it is quite effective in practice. Let's look at its graph:

import torch
import matplotlib.pyplot as plt

relu = torch.nn.ReLU()
# sample x over [-5.0, 4.9] and apply ReLU
x = torch.arange(-5., 5., 0.1)
y = relu(x)

plt.plot(x.numpy(), y.numpy())
plt.show()

The result after running the above code is as follows:
[Figure: the ReLU curve, flat at 0 for x < 0 and equal to x for x >= 0]
Its logic is very simple: every value less than 0 becomes 0, and every value greater than 0 is kept unchanged. It has one problem: in the region where x < 0 the function is constantly 0, so its derivative there is 0, which can hurt the training of the network. It therefore has a variant called leaky ReLU, with the form f(x) = max(x, a*x), where a is a small positive slope; in the parametric version, PReLU, the parameter a is learned during training. Let's look at its graph:

import torch
import matplotlib.pyplot as plt

# PReLU with a single learnable slope parameter a
prelu = torch.nn.PReLU(num_parameters=1)
x = torch.arange(-5., 5., 0.1)
y = prelu(x)
# detach() is needed because y carries gradient information from the learnable parameter
plt.plot(x.numpy(), y.detach().numpy())
plt.show()

[Figure: the PReLU curve, a line with a small positive slope for x < 0 and slope 1 for x >= 0]
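For comparison, PyTorch also provides a leaky ReLU with a fixed slope through torch.nn.functional.leaky_relu. The sketch below uses an assumed slope of 0.01; unlike PReLU, this value is not learned during training:

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

x = torch.arange(-5., 5., 0.1)
# fixed negative slope of 0.01 (an assumed value); PReLU learns this slope instead
y = F.leaky_relu(x, negative_slope=0.01)
plt.plot(x.numpy(), y.numpy())
plt.show()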
The last commonly used activation function is called softmax. It turns a set of scores into a probability distribution over the given options. For example, if we want to judge whether the animal in a picture is a cat or a dog, this function gives two outputs, one for the probability of "cat" and one for the probability of "dog". Its expression is softmax(x_i) = e^(x_i) / (e^(x_1) + e^(x_2) + ... + e^(x_k)). Let's look at the relevant code:

import torch
import torch.nn as nn

# small helper that prints a tensor's type, shape and values
def describeTensor(t):
    print(f"Type: {t.type()}")
    print(f"shape/size: {t.shape}")
    print(f"values: {t}")

softmax = nn.Softmax(dim=1)
x_input = torch.randn(1, 3)
# the components of y_output sum to 1
y_output = softmax(x_input)
describeTensor(x_input)
describeTensor(y_output)
# sum up the components of the output
print(torch.sum(y_output, dim=1))

After the above code is executed, the result is as follows:

Type: torch.FloatTensor
shape/size: torch.Size([1, 3])
values: tensor([[ 0.7110,  0.0178, -0.8281]])
Type: torch.FloatTensor
shape/size: torch.Size([1, 3])
values: tensor([[0.5832, 0.2916, 0.1251]])
tensor([1.])
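To connect this output to the formula above, we can recompute the softmax by hand. A small sketch, reusing the x_input values printed above:

import torch

x_input = torch.tensor([[0.7110, 0.0178, -0.8281]])
# e^(x_i) divided by the sum of e^(x_k) over all components, exactly the formula above
manual = torch.exp(x_input) / torch.exp(x_input).sum(dim=1, keepdim=True)
print(manual)   # tensor([[0.5832, 0.2916, 0.1251]]), matching nn.Softmax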

Another important concept in deep learning is the loss function. It is simply a mathematical way to describe how good or bad an output is. Suppose we have a network that decides whether the input picture shows a cat or a dog. The network outputs two values, one for the probability of "dog" and one for the probability of "cat". If the network is good at the task, then when given a picture of a dog, the probability it assigns to "dog" should be as large as possible and the probability it assigns to "cat" as small as possible. The loss function expresses exactly this requirement as a mathematical function.

In "supervised learning", each training input comes with a corresponding answer. For example, when we train the network to recognize cat and dog pictures, every picture carries a label: 1.0 if it is a dog, 0 if it is a cat. We use y to denote this label and ŷ to denote the value the network predicts for the picture. There are several formulas for measuring how far the network's output is from the label. The first is called mean squared error (MSE), and its formula is as follows:
MSE = (1/n) * [(y_1 - ŷ_1)^2 + (y_2 - ŷ_2)^2 + ... + (y_n - ŷ_n)^2]
The PyTorch framework provides this function, which we can call directly. The code is as follows:

import torch
import torch.nn as nn 
mse_loss = nn.MSELoss()
outputs = torch.Tensor([1,2])
targets = torch.Tensor([3,4])
#[(3-1)^2 + (4-2)^2] / 2
loss = mse_loss(outputs, targets)
print(loss)

The output of the above code is 4.0, matching the hand calculation in the comment.

The second loss function is called cross entropy, and its formula is:
CE = -Σ_i y_i * log(ŷ_i)   (summed over all categories i)
This formula is typically used to decide which category the input belongs to, and it builds on the softmax function described earlier. Suppose there are four kinds of objects the network must distinguish in the input picture: cat, dog, cow, and sheep. We represent these four categories with one-hot vectors: a cat corresponds to [1, 0, 0, 0], a dog to [0, 1, 0, 0], and so on.

When we feed a cat picture into the network, the network uses softmax to output a probability for each of the four categories, for example [0.775, 0.116, 0.039, 0.070]. Plugging into the formula above, i runs from 0 to 3, with y_0 = 1, y_1 = 0, y_2 = 0, y_3 = 0 and ŷ_0 = 0.775, ŷ_1 = 0.116, ŷ_2 = 0.039, ŷ_3 = 0.070. When we adjust the network's internal parameters, we want this value to be as small as possible; the effect of this adjustment is that, after receiving a cat picture, the 0th component of the network's output becomes as large as possible.
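Plugging these numbers into the formula directly (a sketch of the arithmetic, not part of the original text): only the term with y_0 = 1 survives, so the loss is -log(0.775):

import math

y = [1, 0, 0, 0]                       # one-hot label for "cat"
y_hat = [0.775, 0.116, 0.039, 0.070]   # softmax output from the example above
# cross entropy: -sum of y_i * log(y_hat_i); only the component with y_i = 1 contributes
loss = -sum(yi * math.log(pi) for yi, pi in zip(y, y_hat))
print(loss)   # about 0.2549

If the network had put only 0.1 on the cat component instead, the loss would rise to -log(0.1) ≈ 2.30, which is why minimizing it pushes the cat component up.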

Let's see how to call the above loss function in PyTorch:

import torch
import torch.nn as nn

ce_loss = nn.CrossEntropyLoss()

outputs = torch.randn(3,5)
print(outputs)
'''
CrossEntropyLoss applies softmax to each row of outputs internally, normalizing its components.
A target of 1 corresponds to the one-hot vector [0, 1, 0, 0, 0],
a target of 0 corresponds to [1, 0, 0, 0, 0],
and a target of 4 corresponds to [0, 0, 0, 0, 1].
Cross entropy is computed between each of these vectors and the corresponding row of outputs,
and the three results are averaged into the final value.
'''
targets = torch.tensor([1, 0, 4], dtype = torch.int64)
loss = ce_loss(outputs, targets)
print(loss)

After the above code runs, it prints a single loss value. Since outputs is initialized randomly, the printed result differs from run to run.
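To make the comment in the code concrete, here is a sketch that reproduces CrossEntropyLoss by hand: apply log_softmax to each row of outputs, pick the component named by the target index, negate it, and average over the rows:

import torch
import torch.nn.functional as F

outputs = torch.randn(3, 5)
targets = torch.tensor([1, 0, 4], dtype=torch.int64)

# log_softmax normalizes each row, then we take the -log-probability of the target class
log_probs = F.log_softmax(outputs, dim=1)
manual = -log_probs[torch.arange(3), targets].mean()
print(manual)
print(torch.nn.CrossEntropyLoss()(outputs, targets))   # the two values agree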

Finally, there is a variant of the above called binary cross-entropy loss. It is used when there are only two categories, so each element of targets is either 0 or 1, and each element of the predictions must lie between 0 and 1. Let's look at the code:

import torch
import torch.nn as nn

bce_loss = nn.BCELoss()
sigmoid = nn.Sigmoid()
# squash the values into the range (0, 1)
probabilities = sigmoid(torch.randn(4, 1))
# view(4,1) reshapes a 1-D tensor of 4 components into a 2-D tensor of 4 rows with one element each
targets = torch.tensor([1, 0, 1, 0], dtype=torch.float32).view(4, 1)
loss = bce_loss(probabilities, targets)
print(probabilities)
print(loss)

The output after running the above code is:

tensor([[0.6935],
        [0.8990],
        [0.6251],
        [0.3131]])
tensor(0.8760)
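As a cross-check (a small sketch using the probabilities printed above), binary cross-entropy for each sample is -[y*log(p) + (1-y)*log(1-p)], averaged over the four samples:

import torch

p = torch.tensor([[0.6935], [0.8990], [0.6251], [0.3131]])
t = torch.tensor([[1.], [0.], [1.], [0.]])
# average of -[y*log(p) + (1-y)*log(1-p)] over the four samples
manual = -(t * torch.log(p) + (1 - t) * torch.log(1 - p)).mean()
print(manual)   # roughly tensor(0.8760), matching nn.BCELoss above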

For more content, please search for Coding Disney on Bilibili.

Origin: blog.csdn.net/tyler_download/article/details/129720872