Summary of Deep Learning Details

computer vision

Object detection, semantic segmentation, object classification

Natural Language Processing (NLP)

data structure

  • Data structure (the n-dimensional array / tensor)
  • Accessing elements (indexing and slicing), as in the sketch below
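
The original figures are omitted here; as a stand-in, a minimal sketch of tensor creation and element access, assuming PyTorch as the framework:

```python
import torch

# Create an n-dimensional array (tensor): the basic data structure.
x = torch.arange(12).reshape(3, 4)   # shape (3, 4)

# Access elements by indexing and slicing.
last_row = x[-1]        # the last row
sub = x[1:3, 1:]        # rows 1-2, columns 1 to the end
x[1, 2] = 9             # write a single element
x[0:2, :] = 12          # assign a value to a whole region
```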

linear regression

Linear regression can be viewed as a single-layer neural network; it has an explicit (closed-form) solution.
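
A hedged sketch of that explicit solution, solving the normal equations (XᵀX)w = Xᵀy on synthetic data (all names and values are illustrative):

```python
import torch

# Synthetic data: y = X @ true_w + true_b + noise
true_w, true_b = torch.tensor([2.0, -3.4]), 4.2
X = torch.randn(1000, 2)
y = X @ true_w + true_b + 0.01 * torch.randn(1000)

# Append a column of ones so the bias is folded into the weights,
# then solve the normal equations (X^T X) w = X^T y.
Xb = torch.cat([X, torch.ones(len(X), 1)], dim=1)
w_hat = torch.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(w_hat)   # approximately [2.0, -3.4, 4.2]
```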

optimization

Gradient descent (in practice, minibatch stochastic gradient descent); hyperparameters: learning rate and batch size.
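
A minimal sketch of one minibatch-SGD update with the two hyperparameters above; the data here is random stand-in data:

```python
import torch

lr, batch_size = 0.03, 32          # the two key hyperparameters
w = torch.zeros(2, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

def sgd_step(params, lr, batch_size):
    """Move each parameter against its gradient, averaged over the batch."""
    with torch.no_grad():
        for p in params:
            p -= lr * p.grad / batch_size
            p.grad.zero_()

# One illustrative update on a random minibatch (X, y stand in for real data).
X, y = torch.randn(batch_size, 2), torch.randn(batch_size)
loss = ((X @ w + b - y) ** 2 / 2).sum()
loss.backward()
sgd_step([w, b], lr, batch_size)
```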

classification and regression

Single-layer perceptron, multilayer perceptron (MLP)

  • The multilayer perceptron uses hidden layers and activation functions to obtain a nonlinear model. Commonly used activation functions are Sigmoid, Tanh, and ReLU (see the sketch below).
  • Softmax handles multi-class classification problems.
  • Hyperparameters of the multilayer perceptron: the number of hidden layers and the size of each hidden layer.
  • Validation data and test data must not be mixed together; k-fold cross-validation can be used to make better use of limited data.
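
A minimal PyTorch sketch of a multilayer perceptron with one hidden layer and a ReLU activation (the layer sizes are illustrative hyperparameters); softmax is folded into the cross-entropy loss:

```python
import torch
from torch import nn

# One hidden layer of size 256 (a hyperparameter).
net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256),   # input -> hidden layer
    nn.ReLU(),             # nonlinearity
    nn.Linear(256, 10),    # hidden -> 10 class scores (logits)
)

# CrossEntropyLoss applies softmax internally for multi-class problems.
loss = nn.CrossEntropyLoss()
X = torch.randn(8, 1, 28, 28)          # a dummy batch of images
y = torch.randint(0, 10, (8,))         # dummy labels
l = loss(net(X), y)
```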

overfitting and underfitting


Model capacity: the ability to fit various functions


Controlling model complexity: the number of parameters and the range of values the parameters may take

Weight Decay and Dropout

  • Weight decay controls model complexity by penalizing large weights (an L2 penalty), so the weight values are kept from growing beyond a reasonable range
  • Dropout randomly sets some of a layer's outputs to 0 to control model complexity; the dropout probability is the hyperparameter that controls how strong this regularization is (see the sketch below)
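
A short sketch of both techniques in PyTorch, assuming weight decay is applied through the optimizer's weight_decay argument and dropout through nn.Dropout (the probability and decay values are illustrative):

```python
import torch
from torch import nn

net = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Dropout(p=0.5),              # randomly zero 50% of the outputs during training
    nn.Linear(256, 10),
)

# weight_decay adds an L2 penalty on the weights to the update rule.
optimizer = torch.optim.SGD(net.parameters(), lr=0.1, weight_decay=1e-3)

net.train()   # dropout is active in training mode
net.eval()    # ... and disabled at evaluation time
```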

numerical stability

  • Exploding gradients
  • Vanishing gradients
  • Maintaining training stability: turn repeated multiplication into addition (e.g. residual connections), normalization (gradient normalization, gradient clipping), and sensible weight initialization and activation functions (see the sketch below)
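
A hedged sketch of two of the tools listed above, Xavier weight initialization and gradient-norm clipping, on a toy network:

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Reasonable weight initialization: Xavier/Glorot keeps activations and
# gradients on a similar scale across layers.
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
net.apply(init_weights)

# Gradient clipping: rescale the gradient if its norm exceeds a threshold.
loss = net(torch.randn(8, 784)).sum()
loss.backward()
torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=1.0)
```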

convolutional layer

Each output channel can learn to recognize a particular pattern, and the patterns from multiple input channels are fused (combined) to produce it (see the sketch below).

pooling layer

The number of output channels equals the number of input channels; pooling alleviates the convolutional layer's sensitivity to position.
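
A small sketch of a convolutional layer with several output channels followed by a pooling layer; note the pooling layer keeps the channel count and only shrinks height and width (shapes are illustrative):

```python
import torch
from torch import nn

x = torch.randn(1, 3, 32, 32)                 # a batch of one RGB image

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)  # no learned parameters

y = conv(x)    # (1, 16, 32, 32): 16 output channels, each detecting a pattern
z = pool(y)    # (1, 16, 16, 16): channels unchanged, spatial size halved
```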

Regularization

Regularization means keeping the weight values from becoming too large, which helps avoid overfitting to some extent.

Batch Normalization

  • Batch normalization fixes the mean and variance within each mini-batch and then learns an appropriate shift and scale
  • It can speed up convergence (a larger learning rate can be used), but it generally does not change the final model accuracy (see the sketch below)
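
A minimal sketch of batch normalization placed between a convolution and its activation; the learned shift and scale are the parameters of nn.BatchNorm2d:

```python
import torch
from torch import nn

block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),   # normalize per channel over the mini-batch, then scale/shift
    nn.ReLU(),
)

x = torch.randn(8, 3, 32, 32)
y = block(x)              # training mode uses the mini-batch statistics
block.eval()              # eval mode uses the running mean/variance instead
```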

loss function

  1. L2 loss: l(y, y') = (y - y')^2 / 2
  2. L1 loss: l(y, y') = |y - y'|
  3. Huber's robust loss: l(y, y') = (y - y')^2 / 2 when |y - y'| <= 1, otherwise |y - y'| - 1/2 (quadratic near zero, linear for large errors)
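
In PyTorch these roughly correspond to nn.MSELoss, nn.L1Loss and nn.HuberLoss (the last needs a reasonably recent PyTorch, and MSELoss omits the 1/2 factor); a small comparison sketch:

```python
import torch
from torch import nn

y_hat = torch.tensor([0.5, 2.0, -1.0])
y = torch.tensor([0.0, 0.0, 0.0])

l2 = nn.MSELoss()(y_hat, y)                 # squared error (without the 1/2 factor)
l1 = nn.L1Loss()(y_hat, y)                  # absolute error
huber = nn.HuberLoss(delta=1.0)(y_hat, y)   # quadratic near 0, linear for large errors
print(l2, l1, huber)
```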

Residual network ResNet

Residual blocks make deep networks easier to train. Late in training, the gradients reaching the lower layers can become very small, so those layers learn very slowly; the skip connection in a residual block gives the gradient a direct path back to the lower layers, so they can keep being updated. A sketch of a basic residual block follows.
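
A hedged sketch of a basic residual block in this spirit; the optional 1×1 convolution is only there to match shapes when the channel count or stride changes:

```python
import torch
from torch import nn
from torch.nn import functional as F

class Residual(nn.Module):
    """y = ReLU(f(x) + x): the skip connection gives gradients a direct path."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1, stride=stride)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # 1x1 convolution to match shapes when needed, otherwise identity.
        if in_channels != out_channels or stride != 1:
            self.shortcut = nn.Conv2d(in_channels, out_channels, 1, stride=stride)
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(y + self.shortcut(x))

x = torch.randn(1, 64, 56, 56)
print(Residual(64, 128, stride=2)(x).shape)   # torch.Size([1, 128, 28, 28])
```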

image augmentation

  • Augmentation is done online: images are randomly transformed on the fly during training, and the augmented images are not saved as new data
  • Data augmentation gains diversity by deforming the data, which improves the model's generalization. Common image augmentations include flipping, cropping, and color changes (see the sketch below)
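
A minimal torchvision sketch of such on-the-fly random augmentation (flip, crop, color change); the transform is applied every time an image is loaded, so no augmented copies are stored:

```python
from torchvision import transforms

train_augs = transforms.Compose([
    transforms.RandomHorizontalFlip(),                       # random flip
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),     # random crop + resize
    transforms.ColorJitter(brightness=0.4, contrast=0.4,
                           saturation=0.4, hue=0.1),         # random color change
    transforms.ToTensor(),
])
# Typically passed to a dataset, e.g.
# torchvision.datasets.CIFAR10(root, train=True, transform=train_augs)
```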

finetune

Fine-tuning is a form of transfer learning: a model is first trained on a larger dataset, and its architecture and parameters are then reused directly on a small dataset (the two datasets should share some similarity), while the final fully connected layer (for the classification or regression task) is randomly initialized and trained from scratch, as sketched below.
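
A hedged sketch of this recipe with a torchvision ResNet-18 pretrained on ImageNet (assuming a recent torchvision with the weights= API); the target dataset's class count of 2 is illustrative:

```python
import torch
from torch import nn
from torchvision import models

# Load a model pretrained on the large source dataset (ImageNet).
net = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Replace the final fc layer for the small target dataset and initialize it randomly.
net.fc = nn.Linear(net.fc.in_features, 2)
nn.init.xavier_uniform_(net.fc.weight)

# Common trick: a small learning rate for the pretrained body,
# a larger one for the newly initialized output layer.
body_params = [p for name, p in net.named_parameters() if not name.startswith("fc")]
optimizer = torch.optim.SGD([
    {"params": body_params, "lr": 1e-4},
    {"params": net.fc.parameters(), "lr": 1e-3},
], momentum=0.9)
```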

Anchor box

  • Bounding box (BoundingBox)
  • Anchor-based detection first generates a large number of anchor boxes and assigns them labels; each anchor box is then used as a training sample. During prediction, NMS (non-maximum suppression) removes redundant predictions. Both steps rely on measuring box overlap, as sketched below.
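
Both assigning labels to anchor boxes and suppressing redundant predictions rely on the intersection over union (IoU) between boxes; a minimal sketch, with boxes in (x1, y1, x2, y2) format:

```python
import torch

def box_iou(boxes1, boxes2):
    """Pairwise IoU between two sets of boxes given as (x1, y1, x2, y2)."""
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    lt = torch.max(boxes1[:, None, :2], boxes2[:, :2])   # upper-left of intersection
    rb = torch.min(boxes1[:, None, 2:], boxes2[:, 2:])   # lower-right of intersection
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, :, 0] * wh[:, :, 1]
    return inter / (area1[:, None] + area2 - inter)

anchors = torch.tensor([[0., 0., 2., 2.], [1., 1., 3., 3.]])
ground_truth = torch.tensor([[0., 0., 2., 2.]])
print(box_iou(anchors, ground_truth))   # tensor([[1.0000], [0.1429]])
```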

Object detection models

  • R-CNN (region-based convolutional neural network)
  • Mask R-CNN: if pixel-level labels are available, an FCN is used to exploit this information
  • Faster R-CNN
    High accuracy, but processing is very slow; not as fast as YOLO (you only look once)
  • SSD (single-shot multibox detection)
    No longer actively maintained or developed, and now rarely used
  • YOLO
    YOLO divides the image evenly into S×S anchor boxes (cells), and each one predicts B bounding boxes

semantic segmentation

Classification at the pixel level: each pixel is assigned either to the background or to one of the other classes.

transposed convolution

It increases the height and width of the input. It can be loosely understood as the reverse of a convolution: where a (strided) convolution shrinks the spatial size, a transposed convolution enlarges it, as sketched below.
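
A small sketch contrasting a strided convolution, which shrinks the spatial size, with a transposed convolution, which enlarges it back (shapes are illustrative):

```python
import torch
from torch import nn

x = torch.randn(1, 16, 14, 14)

down = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)             # 14x14 -> 7x7
up = nn.ConvTranspose2d(16, 16, kernel_size=3, stride=2, padding=1,
                        output_padding=1)                                 # 7x7 -> 14x14

print(down(x).shape)       # torch.Size([1, 16, 7, 7])
print(up(down(x)).shape)   # torch.Size([1, 16, 14, 14])
```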


  • FCN (fully convolutional network)
    The number of output channels of the FCN equals the number of classes
  • Style transfer
    Extract a style representation and a content representation separately; optimize with three losses: style, content, and noise (total variation)

hardware upgrade


sequence model

Markov models or autoregressive models can be used for prediction.

  • Text preprocessing
    Converts the words (tokens) in a sentence into data that the model can process
  • Language modeling
    Estimates the joint probability of a text sequence, often with statistical n-gram methods
  • RNN (recurrent neural network)
    Stores information about the sequence seen so far in its hidden state. Predicting the next token can be treated as a classification problem (a probability distribution over the next token index), so model quality can be measured by the average cross-entropy (whose exponential is the perplexity)
  • Gradient clipping
    Gradient clipping is often used when training RNNs to effectively prevent exploding gradients (see the sketch below)
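
A tiny sketch of those last two points: measuring a toy RNN language model by its average cross-entropy (and perplexity) and clipping the gradient norm; the vocabulary size and shapes are illustrative:

```python
import torch
from torch import nn

rnn = nn.RNN(input_size=28, hidden_size=64, batch_first=True)
head = nn.Linear(64, 28)                     # scores over the next token index

x = torch.randn(4, 10, 28)                   # (batch, time steps, features)
targets = torch.randint(0, 28, (4, 10))      # dummy next-token labels

out, _ = rnn(x)
logits = head(out)                           # (4, 10, 28)
ce = nn.CrossEntropyLoss()(logits.reshape(-1, 28), targets.reshape(-1))
perplexity = torch.exp(ce)                   # exp(average cross-entropy)

ce.backward()
torch.nn.utils.clip_grad_norm_(list(rnn.parameters()) + list(head.parameters()), 1.0)
```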

Gated Recurrent Unit (GRU)

A GRU can control which information is important and which is not: a mechanism for paying attention to new information (the update gate) and a mechanism for forgetting (the reset gate).

LSTM (long short-term memory network)

  • Forget gate: shrinks the values (of the memory cell) towards 0
  • Input gate: decides whether to take the input data into account (rather than ignore it)
  • Output gate: decides whether to use the hidden state

Deep Recurrent Neural Networks

Deep recurrent neural networks use multiple hidden layers to obtain more nonlinearity.

bidirectional recurrent neural network


  • Bidirectional recurrent neural networks use a hidden layer that is updated in the reverse direction to exploit temporal information from both directions
  • They are usually used for feature extraction and for filling in blanks in a sequence, not for predicting the future

encoder-decoder

The encoder processes the input into an intermediate state (a representation), and the decoder generates the output from that state.

Seq2seq

Seq2seq uses an RNN encoder and an RNN decoder (for example, for machine translation): the encoder reads the source sequence, its final hidden state initializes the decoder, and the decoder then generates the target sequence token by token.

ELMo pre-trained model

Forward and backward prediction; handles polysemy (word representations depend on context)

  • GPT: a unidirectional language model
  • BERT: a bidirectional language model

beam search

Beam search keeps the k best candidates at each search step (see the sketch below)

  • When k=1, it is a greedy search
  • When k=n, it is an exhaustive search
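
A hedged, generic sketch of beam search, where next_log_probs is a stand-in for whatever model scores the next token:

```python
import math

def beam_search(next_log_probs, start_token, end_token, k=3, max_len=10):
    """Keep the k best partial sequences at every step."""
    beams = [([start_token], 0.0)]                      # (sequence, total log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:                    # finished sequences are kept as-is
                candidates.append((seq, score))
                continue
            for token, logp in next_log_probs(seq).items():
                candidates.append((seq + [token], score + logp))
        # Keep only the k highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

# Toy "model": always prefers token "a", may stop with "<eos>".
toy = lambda seq: {"a": math.log(0.6), "b": math.log(0.3), "<eos>": math.log(0.1)}
print(beam_search(toy, "<bos>", "<eos>", k=2, max_len=3)[0])
```

With k=1 this reduces to greedy search, and with k equal to the vocabulary size it approaches exhaustive search, matching the two bullets above.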

attention mechanism


We can view the attention mechanism as follows: the constituent elements of the Source are treated as a series of <Key, Value> pairs. Given a Query element from the Target, we compute the similarity (or correlation) between the Query and each Key to obtain a weight coefficient for that Key's Value, and then take the weighted sum of the Values to get the final attention output. In essence, the attention mechanism is a weighted sum over the Values of the Source elements, with the Query and Keys used to compute the weights.
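
A minimal sketch of this Query/Key/Value computation as (scaled) dot-product attention; the shapes are illustrative:

```python
import math
import torch

def attention(queries, keys, values):
    """Weight each Value by the similarity between the Query and its Key."""
    d = queries.shape[-1]
    scores = queries @ keys.transpose(-2, -1) / math.sqrt(d)   # Query-Key similarity
    weights = torch.softmax(scores, dim=-1)                    # weight coefficients
    return weights @ values                                    # weighted sum of Values

Q = torch.randn(2, 4, 8)    # (batch, num queries, d)
K = torch.randn(2, 6, 8)    # (batch, num key-value pairs, d)
V = torch.randn(2, 6, 16)
print(attention(Q, K, V).shape)   # torch.Size([2, 4, 16])
```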

seq2seq with attention

The decoder attends over all of the encoder's hidden states at each step, instead of relying only on the encoder's final state.

self-attention mechanism

Self-attention models are good at handling very long texts, but the computational cost is extremely high (very large models may need thousands of GPUs training in parallel); the longer the text, the more resources are consumed, and the cost grows quadratically with the sequence length.


optimization


Origin blog.csdn.net/weixin_45277161/article/details/129328691