Common terms in deep learning: what exactly do SOTA, benchmark, baseline, end-to-end model, and transfer learning refer to?

SOTA

SOTA is the abbreviation of "state of the art". It refers to the best-performing model in a given field, which in practice usually means the models with the highest scores on standard benchmark datasets.
SOTA model: not one specific model, but whichever model is currently the best/most advanced on the research task in question.
SOTA result: the result/performance achieved by the best model on that research task.

Trick

A trick here means a skill or knack: a small technique or method that improves the performance of a machine learning or deep learning model during training (for example, data augmentation in deep learning). The flavor of the word can be felt in its everyday English senses: trick, prank, gimmick, deceit.

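Instruction fine-tuning

Instruction fine-tuning takes an existing pre-trained model and trains it further on additional instructions or labeled datasets to improve its performance. The related prompt-based methods P-tuning, prompt tuning, and prefix-tuning instead learn a small set of continuous prompt vectors while keeping most of the pre-trained weights frozen.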

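Incremental fine-tuning

Incremental fine-tuning adds a small number of extra trainable layers or parameters to the neural network, as in LoRA and adapters, while the original pre-trained weights are typically kept frozen.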
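To make the "extra parameters" idea concrete, here is a minimal sketch of a LoRA-style forward pass in plain Python. It assumes the usual low-rank formulation, where the frozen weight W is supplemented by a product of two small matrices B and A, with B initialized to zero. The helper names (`matvec`, `lora_forward`) and all numbers are made up for illustration and are not taken from any library.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """Return W x + scale * B (A x). W stays frozen; only A and B would be trained."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, low_rank)]


# Frozen "pre-trained" weight: 3 outputs x 4 inputs = 12 parameters.
W = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8],
     [0.9, 1.0, 1.1, 1.2]]
# Rank-1 adapter: A is 1 x 4 and B is 3 x 1, so only 7 trainable parameters.
A = [[0.01, -0.02, 0.03, 0.04]]
B = [[0.0], [0.0], [0.0]]  # zero init: the adapted model starts out identical to the base model

x = [1.0, 2.0, 3.0, 4.0]
y_base = matvec(W, x)
y_lora = lora_forward(W, A, B, x)

trainable = sum(len(row) for row in A) + sum(len(row) for row in B)
frozen = sum(len(row) for row in W)
print(y_lora == y_base, trainable, frozen)
```

Because B starts at zero, the added branch contributes nothing at first, and training only has to move the 7 adapter parameters rather than the 12 frozen ones. In a real model the gap is far larger.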

AIGC (AI-generated content)

AIGC refers to content produced with generative AI (GAI). Generative AI can imitate the way humans create and produce large amounts of content in a short time, including text, images, audio, video, and code.
The most representative examples are two models that were popular in early 2023:
ChatGPT: a language model that can quickly understand and answer human questions
DALL-E 2: a model that creates high-quality images from text descriptions

AGI (artificial general intelligence)

AGI is the abbreviation of artificial general intelligence. Compared with existing AI systems, AGI would have a higher level of intelligence and apply to a wider range of situations and tasks. AGI would not only learn and think like a human being, but also make plans of action autonomously and be creative and innovative.

Non-end-to-end model

A non-end-to-end system has the form input -> model A -> output A -> model B -> output B -> ... -> output. Unlike an end-to-end model, it works as a pipeline. For example, a typical natural language processing (NLP) problem is split into several steps such as word segmentation, part-of-speech tagging, syntactic analysis, and semantic analysis. Each step is an independent task whose result feeds the next step, so an error in one step affects the result of the whole system.
In object detection, R-CNN requires three modules to be trained separately: ① CNN feature extraction, ② SVM classification, ③ bounding-box regression.
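The pipeline structure can be sketched with three toy stages. This is only an illustration: the stage functions, the tiny lexicon, and the sentences are invented here, and real segmenters and taggers are far more elaborate. The point is that each stage is built independently and a mistake upstream changes the final output.

```python
def segment(text):
    """Stage 1: naive word segmentation on whitespace."""
    return text.split()


def tag(tokens):
    """Stage 2: toy part-of-speech tagging with a tiny hand-written lexicon."""
    lexicon = {"cats": "NOUN", "chase": "VERB", "mice": "NOUN"}
    return [(token, lexicon.get(token.lower(), "UNK")) for token in tokens]


def extract_subject(tagged):
    """Stage 3: take the first noun as the subject, or None if there is none."""
    for token, pos in tagged:
        if pos == "NOUN":
            return token
    return None


def pipeline(text):
    """input -> segment -> tag -> extract_subject -> output"""
    return extract_subject(tag(segment(text)))


print(pipeline("Cats chase mice"))   # Cats
print(pipeline("Catschase mice"))    # mice: a segmentation error upstream changes the answer
```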

End-to-end model

The input goes in at one end and a prediction comes out at the other. The prediction is compared with the ground truth to obtain an error, the error is backpropagated to every layer of the network, and the weights and parameters are adjusted until the model converges or reaches the expected performance. All intermediate operations are contained inside the neural network and are no longer divided into separate modules. From raw data input to result output, the neural network in the middle is a self-contained whole (it can also be treated as a black box). That is an end-to-end model.
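The training loop described above can be sketched in a few lines. As a stand-in for a neural network, this example uses a two-parameter linear model fitted by gradient descent on toy data generated from y = 2x + 1. The data, learning rate, and step count are assumptions chosen for illustration. The structure (predict, compare with the ground truth, propagate the error to the parameters, update) is the same one a deep network follows.

```python
# Toy data generated from y = 2x + 1 (an assumption made for this illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0          # parameters of the "network"
learning_rate = 0.05

for step in range(2000):
    grad_w = 0.0
    grad_b = 0.0
    for x, y in zip(xs, ys):
        prediction = w * x + b              # forward pass: input -> output
        error = prediction - y              # compare with the ground truth
        grad_w += 2 * error * x / len(xs)   # gradient of the mean squared error w.r.t. w
        grad_b += 2 * error / len(xs)       # gradient of the mean squared error w.r.t. b
    w -= learning_rate * grad_w             # adjust the parameters
    b -= learning_rate * grad_b

print(round(w, 3), round(b, 3))  # close to 2 and 1
```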

Benchmark, Baseline

Both benchmark and baseline refer to a basic object of comparison. The motivation of a paper typically comes from wanting to surpass an existing baseline/benchmark, and your experimental results are judged against it to show whether there is any improvement. The difference is that a baseline is a reference method (or set of methods), whereas a benchmark leans toward the best reported numbers on quantifiable metrics such as precision and recall, usually on a standard dataset. For example, suppose BERT is the current SOTA on an NLP task and you have an idea that can surpass it. In the experimental section of your paper, the baseline to compare against is BERT, and the benchmark to compare against is BERT's specific metric values.
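As a small worked example of comparing a new method against a baseline on the same metrics, the sketch below computes precision and recall for two sets of predictions. The labels and predictions are invented for illustration, and `precision_recall` is a hypothetical helper, not a library function.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up ground truth and predictions on the same evaluation set.
y_true        = [1, 1, 1, 0, 0, 1, 0, 1]
baseline_pred = [1, 0, 1, 1, 0, 0, 0, 1]   # the method to beat
proposed_pred = [1, 1, 1, 0, 0, 0, 0, 1]   # the new idea

print("baseline:", precision_recall(y_true, baseline_pred))  # (0.75, 0.6)
print("proposed:", precision_recall(y_true, proposed_pred))  # (1.0, 0.8)
```

Both methods are scored on identical data with identical metrics, which is what makes the comparison meaningful.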

Backbone (backbone network)

The backbone is one part of a network, most often the part that extracts features. Its job is to extract information from the image for the subsequent layers to use. Networks such as ResNet and VGG are commonly used as backbones, because they have proven to be very strong feature extractors on classification and other problems.
When using one of these networks as a backbone, we load the officially released pre-trained parameters directly and attach our own network after it. The two parts are then trained together. Since the loaded backbone can already extract features, training only fine-tunes it to better suit our own task.

Mask

A mask is an operation used in deep learning. It is equivalent to laying a cover over the original tensor to block out or select specific elements. In image processing, a selected image, shape, or object is used to cover all or part of the image being processed, in order to control the region or the course of processing.

In digital image processing, a mask is a two-dimensional matrix, and sometimes a multi-valued image is used instead. Image masks are mainly used for:
① Extracting a region of interest (ROI). Multiply a pre-made ROI mask with the image to be processed to obtain the ROI image: pixel values inside the ROI remain unchanged, while values outside the ROI become 0.
② Shielding. Use a mask to shield certain regions of the image so that they do not take part in processing or in the computation of processing parameters, or so that only the masked regions are processed or counted.
③ Structural feature extraction. Use similarity measures or image matching to detect and extract structures in the image that resemble the mask.
④ Producing images with special shapes. Cover all or part of the image with a selected image, shape, or object so that only the desired shape remains. The specific image or object used as the cover is called a mask or stencil.
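Uses ① and ② can be sketched with nested lists standing in for a small grayscale image. The pixel values and the mask below are invented for illustration. Real code would normally use a tensor or array library, but the element-wise logic is the same.

```python
# A made-up 3 x 4 grayscale "image" and a binary ROI mask of the same shape.
image = [[10, 20, 30, 40],
         [50, 60, 70, 80],
         [90, 100, 110, 120]]
roi_mask = [[0, 0, 0, 0],
            [0, 1, 1, 0],
            [0, 1, 1, 0]]


def apply_mask(image, mask):
    """Element-wise product: keep pixels where mask is 1, set the rest to 0."""
    return [[pixel * m for pixel, m in zip(image_row, mask_row)]
            for image_row, mask_row in zip(image, mask)]


def masked_mean(image, mask):
    """Average only over pixels selected by the mask (shielded pixels are not counted)."""
    selected = [pixel
                for image_row, mask_row in zip(image, mask)
                for pixel, m in zip(image_row, mask_row) if m == 1]
    return sum(selected) / len(selected)


roi = apply_mask(image, roi_mask)
print(roi)                           # [[0, 0, 0, 0], [0, 60, 70, 0], [0, 100, 110, 0]]
print(masked_mean(image, roi_mask))  # 85.0
```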

Padding

Padding literally means filling. In CSS, padding is a shorthand property that defines the space between an element's border and its content, that is, the top, bottom, left, and right inner margins. Padding in convolution can be understood by analogy: extra values (usually zeros) are added around the border of the input before the convolution is applied, which controls the spatial size of the output.
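For the convolutional sense of the word, a short sketch of zero padding follows, together with the standard output-size formula (n + 2p - k) // s + 1 for input size n, padding p, kernel size k, and stride s. The function names are chosen for this example only.

```python
def zero_pad(matrix, pad):
    """Surround a 2D list with `pad` rings of zeros."""
    width = len(matrix[0]) + 2 * pad
    border = [[0] * width for _ in range(pad)]
    middle = [[0] * pad + list(row) + [0] * pad for row in matrix]
    return border + middle + [list(row) for row in border]


def conv_output_size(n, kernel, pad, stride=1):
    """Spatial size of a convolution output along one dimension."""
    return (n + 2 * pad - kernel) // stride + 1


print(zero_pad([[1, 2], [3, 4]], 1))
# [[0, 0, 0, 0], [0, 1, 2, 0], [0, 3, 4, 0], [0, 0, 0, 0]]

print(conv_output_size(5, kernel=3, pad=0))  # 3: the output shrinks without padding
print(conv_output_size(5, kernel=3, pad=1))  # 5: one ring of zeros keeps the size
```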

Serial, concurrent, parallel

Serial: I start cooking at 12:00 noon. Halfway through, my supervisor calls, but I have to finish cooking before I can answer the phone. This means I support neither concurrency nor parallelism. Serial execution has a single execution unit that runs one task to completion before starting the next.

Concurrent: I start cooking at 12:00 noon. Halfway through, my girlfriend calls. I pick up the phone, stop cooking, and resume cooking after the call. This shows that I support concurrency. From a macro point of view I did two things over the same period, but in fact I still handled them one at a time: both tasks were in progress, yet at any given moment one was suspended while I focused on the other. At no instant was I actually doing more than one thing.

Parallel: I start cooking at 12:00 noon. Halfway through, my supervisor calls (to ask how the paper is going), and I answer the phone while chopping vegetables. This shows that I support parallelism as well as concurrency. Talking and chopping happen at the same instant, which requires two processing units (CPUs). (In parallel execution there are multiple task execution units, so multiple tasks can run at the same time.)
Parallelism means being able to process multiple tasks at literally the same moment: while one CPU executes one thread, another CPU executes another thread. The two threads do not compete for the same CPU and can run simultaneously.
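The difference between serial and concurrent execution on a single execution unit can be sketched with Python generators, where each `yield` marks a point at which a task can be paused. The task names and the round-robin scheduler are invented for illustration. True parallelism needs more than one execution unit (for example, several processes on several CPU cores) and is not shown here.

```python
def task(name, steps):
    """A task split into steps; each yield is a point where it can be suspended."""
    for i in range(1, steps + 1):
        yield f"{name} step {i}"


def run_serial(tasks):
    """Finish each task completely before starting the next one."""
    log = []
    for t in tasks:
        log.extend(t)
    return log


def run_concurrent(tasks):
    """One worker alternates between unfinished tasks (round-robin)."""
    log = []
    pending = list(tasks)
    while pending:
        current = pending.pop(0)
        try:
            log.append(next(current))
            pending.append(current)   # suspend it and come back later
        except StopIteration:
            pass                      # this task is finished
    return log


serial_log = run_serial([task("cook", 2), task("phone", 2)])
concurrent_log = run_concurrent([task("cook", 2), task("phone", 2)])
print(serial_log)      # ['cook step 1', 'cook step 2', 'phone step 1', 'phone step 2']
print(concurrent_log)  # ['cook step 1', 'phone step 1', 'cook step 2', 'phone step 2']
```

In both runs only one step executes at any instant. The concurrent run simply interleaves the two tasks, as in the phone call that interrupts the cooking.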

Fine-tuning process

Fine-tuning means initializing a model with the parameters of an already trained model (its "knowledge") and then continuing to train it for your own task ("reprocessing"). In plain terms, you modify an existing model slightly and then train it a little more. It is mainly used when the number of training samples is insufficient.
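A toy sketch of why starting from trained parameters helps when data and training budget are small. A two-parameter linear model stands in for a network. The "source" task (y = 2x + 1), the "target" task (y = 2x + 1.5, with only two samples), the learning rate, and the step counts are all assumptions made up for this illustration.

```python
def train(w, b, data, steps, learning_rate=0.05):
    """Plain gradient descent on mean squared error for the model y = w * x + b."""
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w, b = w - learning_rate * grad_w, b - learning_rate * grad_b
    return w, b


def mse(w, b, data):
    """Mean squared error of the model on a dataset."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


source_data = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)]  # plentiful "pre-training" data
target_data = [(0.0, 1.5), (2.0, 5.5)]                            # only two samples for the new task

pretrained = train(0.0, 0.0, source_data, steps=2000)   # learn the source task first
fine_tuned = train(*pretrained, target_data, steps=5)   # a small amount of further training
from_scratch = train(0.0, 0.0, target_data, steps=5)    # same budget, random-free zero start

print(mse(*fine_tuned, target_data) < mse(*from_scratch, target_data))  # True
```

With the same five update steps, the model initialized from the source task ends up much closer to the target than the one trained from zero, because it only has to correct a small difference between the tasks.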

Thread, process

A process contains one or more threads.
Data is difficult to share between different processes.
Data is easy to share between threads of the same process.
Processes consume more computer resources than threads.
Processes do not affect each other, but if one thread crashes, the entire process it belongs to crashes.
Processes can be spread across multiple machines, which suits multi-core and distributed settings.
The memory addresses that a process may use can be restricted.
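The point that threads of one process share data easily can be sketched with the standard `threading` module. Four threads update the same dictionary, and a lock keeps the updates from interfering with each other. The counter, thread count, and iteration count are arbitrary choices for this example. The process side (separate memory per process) is not demonstrated here.

```python
import threading

counter = {"value": 0}      # lives in the process's memory, visible to all of its threads
lock = threading.Lock()


def add_many(n):
    """Increment the shared counter n times, one locked update at a time."""
    for _ in range(n):
        with lock:
            counter["value"] += 1


threads = [threading.Thread(target=add_many, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter["value"])  # 4000
```

No copying or message passing is needed for the threads to see each other's updates. The price of that convenience is the need for synchronization such as the lock used above.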

Supervised learning

Supervised learning trains a model on a sufficiently large labeled dataset, in which every sample carries a manually annotated label. Intuitively, a "teacher" guides the direction in which the model should learn or adjust during training.

Unsupervised learning

Unsupervised learning trains a model directly on data that has no manually annotated labels. Intuitively, the "student" has no teacher and must summarize knowledge through continuous exploration, trying to discover the inherent patterns or features of the data and, in effect, label the training data itself.

Semi-supervised learning

Semi-supervised learning trains a model when only a small amount of labeled data is available but a large amount of unlabeled data can be obtained. The learner does not rely on external interaction and automatically exploits the unlabeled samples to improve its performance. It is a learning method that combines supervised and unsupervised learning.

Transfer learning

Generally speaking, transfer learning uses existing knowledge to learn new knowledge: the model developed for task A is taken as the starting point and reused when developing the model for task B.
The core is to find the similarity between existing and new knowledge, much like drawing inferences about other cases from one instance. Because learning the target domain from scratch is too costly, we use existing related knowledge to help learn the new knowledge as quickly as possible.
For example, if you already know how to play Chinese chess, you can learn international chess by analogy; if you already know how to write Java programs, you can learn C# by analogy; if you have learned English, you can learn French by analogy; and so on.

Meta-learning

Meta-learning means "learning to learn". The idea is to learn the learning process itself rather than to train a model from scratch each time. This lets the model use past experience to pick up new tasks quickly, as humans do, instead of starting training from zero whenever it is handed a new dataset.
Meta-learning aims to give the model the ability to learn how to learn and how to tune itself, so that it can quickly learn new tasks on top of the knowledge it already has. In conventional machine learning, the hyperparameters are tuned first and the deep model is then trained directly on the specific task. In meta-learning, good hyperparameters are first learned from other tasks, and the model is then trained on the specific task.

CNN, RNN, LSTM

CNN (Convolutional Neural Network): mainly used for visual tasks on images and video.
RNN (Recurrent Neural Network): mainly used to process sequence data such as text and speech.
LSTM (Long Short-Term Memory): the best-known and most successful extension of the recurrent neural network.

Reinforcement Learning (RL)

In reinforcement learning, the learner receives rewards from the external environment and uses them to correct the direction of its learning, thereby acquiring the ability to adapt.

Generative Adversarial Networks (GAN)

A GAN contains two models: a generative model and a discriminative model. The task of the discriminative model is to judge whether a given instance looks like real data or is fake (real instances come from the dataset, fake instances come from the generative model).

We can think of the generator as a counterfeiter trying to produce fake money, and the discriminator as the police, trying to let legitimate money through while catching the counterfeit. To succeed in this game, the counterfeiter must learn to make money that is indistinguishable from the real thing, which means the generator network must learn to create samples from the same distribution as the training data. The discriminator's job is to tell whether its input is real or fake. Both keep learning and improving until the generator's output is so good that the discriminator can no longer tell whether it is real data.
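The two competing objectives can be written down as losses on the discriminator's outputs. The sketch below assumes the standard binary cross-entropy form for the discriminator and the commonly used "non-saturating" form for the generator. The probabilities fed in are invented to represent an early and a late stage of training; no actual networks are trained.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Average of -[log D(x) + log(1 - D(G(z)))] over paired scores."""
    total = 0.0
    for p_real, p_fake in zip(d_real, d_fake):
        total += -(math.log(p_real) + math.log(1.0 - p_fake))
    return total / len(d_real)


def generator_loss(d_fake):
    """Average of -log D(G(z)): low when the discriminator is fooled."""
    return sum(-math.log(p) for p in d_fake) / len(d_fake)


# Hypothetical discriminator outputs: the probability that a sample is real.
early_real, early_fake = [0.9, 0.8], [0.1, 0.2]   # the police easily spot the fakes
late_real, late_fake = [0.5, 0.5], [0.5, 0.5]     # fakes are indistinguishable from real data

print(discriminator_loss(early_real, early_fake), generator_loss(early_fake))
print(discriminator_loss(late_real, late_fake), generator_loss(late_fake))
```

When the discriminator outputs 0.5 for everything, its loss equals 2 * log(2), which corresponds to the end state described above where it can no longer separate real from generated data.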

Reinforcement Learning from Human Feedback (RLHF)

RLHF builds a dataset of human feedback, trains a reward model on it that imitates human preferences when scoring outputs, and uses those scores to further train the language model. It is the core technique that has made large language models in the post-GPT-3 era converse more and more like humans.

Generalization

Generalization ability is easy to understand: it is the model's performance on the test set, that is, on data the model has not seen before. In other words, it is the model's ability to draw inferences about new cases from the ones it has learned. This presumes that the test data are i.i.d. (independent and identically distributed) and come from the same distribution as the training data.
For example, given a picture the model has never seen, if the picture comes from the same distribution as the training set and satisfies the i.i.d. assumption, a model that generalizes well predicts it correctly. The higher the model's accuracy on new data in the test set, the better its generalization ability.

Regularization

Regularization is a modification of a learning algorithm aimed at reducing generalization error rather than training error. Strategies for regularization include:
Constraints and penalties are designed to encode specific types of prior knowledge.
Simple models are preferred.
Other forms of regularization, such as ensemble methods, combine multiple hypotheses to explain the training data.
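The "constraints and penalties" and "simple models are preferred" strategies can be illustrated with an L2 penalty, one common choice of regularizer. In the made-up data below the two features are identical, so several weight vectors fit the training data perfectly. Adding the penalty breaks the tie in favor of the smallest weights. The penalty strength `lam` and all values are assumptions for this sketch.

```python
def predict(weights, features):
    """Linear model: weighted sum of the features."""
    return sum(w * f for w, f in zip(weights, features))


def mse(weights, samples):
    """Training error: mean squared error over (features, target) pairs."""
    return sum((predict(weights, f) - y) ** 2 for f, y in samples) / len(samples)


def l2_penalty(weights):
    """Sum of squared weights: large weights are penalized."""
    return sum(w * w for w in weights)


def regularized_loss(weights, samples, lam):
    """Training error plus lam times the L2 penalty."""
    return mse(weights, samples) + lam * l2_penalty(weights)


# Both features are always equal, and the target is twice their value.
samples = [((1.0, 1.0), 2.0), ((2.0, 2.0), 4.0), ((3.0, 3.0), 6.0)]
candidates = [(10.0, -8.0), (2.0, 0.0), (1.0, 1.0)]

for weights in candidates:
    print(weights, mse(weights, samples), regularized_loss(weights, samples, lam=0.1))

best = min(candidates, key=lambda weights: regularized_loss(weights, samples, lam=0.1))
print(best)  # (1.0, 1.0)
```

All three candidates have zero training error, so the training data alone cannot choose between them. The penalty encodes the prior preference for small, simple solutions.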

Throughput

Throughput is the amount of data successfully transmitted per unit time (measured in bits, bytes, packets, etc.) by a network, device, port, virtual circuit, or other facility. The word also appears in related terms such as network throughput, system throughput, port throughput, and logistics throughput. In every case it denotes a quantity per unit time.

Large model

A large model generally refers to a model with more than 100 million parameters, but the bar keeps rising, and there are now models with more than one trillion parameters. A large language model (LLM) is a large model for language.

175B, 60B, 540B

These figures generally refer to the number of parameters. B stands for billion, so 175B means 175 billion parameters, which is the parameter scale of GPT-3.
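To get a feel for these numbers, the sketch below converts such labels into parameter counts and estimates the memory needed just to store the weights. The 2 bytes per parameter figure assumes 16-bit storage and is an assumption of this example, as are the helper names. Actual memory use during training or inference is higher.

```python
def parse_billions(label):
    """Convert a label such as '175B' into an integer parameter count."""
    if not label.upper().endswith("B"):
        raise ValueError(f"expected a label ending in 'B', got {label!r}")
    return int(float(label[:-1]) * 10**9)


def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed to store the weights alone, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 10**9


for label in ("60B", "175B", "540B"):
    num_params = parse_billions(label)
    print(label, num_params, weight_memory_gb(num_params, 2), "GB at 2 bytes per parameter")
```

For 175B this gives 350 GB for the weights alone at 2 bytes per parameter.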

Emergence

Studies have found that once model size reaches a certain threshold, accuracy on tasks such as multi-step arithmetic, college-level exams, and word-meaning interpretation improves markedly. This phenomenon is called emergence.

Chain of thought

Chain of thought (Chain-of-Thought, CoT) has the large language model (LLM) break a question into multiple steps, analyze them one by one, and arrive at the correct answer gradually. The motivation is that for complex problems, the probability that an LLM gives a wrong answer when asked to respond directly is relatively high. Chain of thought is usually applied through prompting, and it can also be regarded as a kind of instruction fine-tuning when the model is trained on step-by-step examples.
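At the prompt level, the difference between asking directly and asking for a chain of thought can be as small as one added sentence. The sketch below only builds the two prompt strings. The question, the function names, and the Q/A layout are illustrative assumptions, and no model is called.

```python
def direct_prompt(question):
    """Ask for the answer straight away."""
    return f"Q: {question}\nA:"


def chain_of_thought_prompt(question):
    """Ask the model to reason through intermediate steps before answering."""
    return f"Q: {question}\nA: Let's think step by step."


question = "A shop sells pens at 3 yuan each. How much do 4 pens and one 5 yuan notebook cost?"
print(direct_prompt(question))
print(chain_of_thought_prompt(question))
```

With the second prompt, a model is nudged to write out intermediate steps (4 pens cost 12 yuan, plus 5 yuan for the notebook) before stating the total of 17 yuan.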


import torch
import torchvision
from torch.utils import data
from torchvision import transforms
from d2l import torch as d2l
from torch import nn
# %matplotlib inline  # Jupyter notebook magic; uncomment only when running inside a notebook
d2l.use_svg_display()

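import torch
import torch.nn as nn

# Labels of each sample in the batch at one time step; class index 0 is the padding label
targets = torch.tensor([1, 2, 0])
# Predicted scores (logits) at one time step, shape (batch, num_classes)
preds = torch.tensor([[1.4, 0.5, 1.1], [0.7, 0.4, 0.2], [2.4, 0.2, 1.4]])

criterion1 = nn.CrossEntropyLoss()                 # does not mask padding
criterion2 = nn.CrossEntropyLoss(ignore_index=0)   # masks padding: targets equal to 0 are ignored

loss1 = criterion1(preds, targets)
# returns tensor(1.1362), the mean over all 3 samples

loss2 = criterion2(preds, targets)
# returns tensor(1.5088); here the loss only considers the 2 non-padding samples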


Origin blog.csdn.net/yanhaohui/article/details/130382102