computer vision
Object detection, semantic segmentation, object classification
Natural Language Processing (NLP)
data structure
- data structure
- access element
linear regression
It can be seen as a single-layer neural network with an explicit solution
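As a minimal sketch of the explicit solution, assuming a single input feature plus a bias term, the least-squares weight and bias can be computed in closed form. The function name `fit_linear_1d` and the sample data are illustrative and do not come from these notes.

```python
def fit_linear_1d(xs, ys):
    """Closed-form least-squares fit of y = w * x + b for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


# Illustrative data generated from y = 2x + 1.
w_fit, b_fit = fit_linear_1d([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

No iteration is needed here, which is what distinguishes linear regression from the deeper models in the rest of these notes.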
optimization
Gradient descent, hyperparameters: learning rate, batch size
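The role of the two hyperparameters can be seen in a small mini-batch gradient descent loop. This is only a sketch under assumptions: the model is a one-feature linear model, the loss is half the mean squared error, and the name `minibatch_sgd` and the data are made up for illustration.

```python
import random


def minibatch_sgd(xs, ys, lr, batch_size, epochs, seed=0):
    """Fit y = w * x + b with mini-batch gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = 0.0
            grad_b = 0.0
            for i in batch:
                err = w * xs[i] + b - ys[i]
                grad_w += err * xs[i]
                grad_b += err
            # Step against the gradient averaged over the mini-batch.
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
    return w, b


# Illustrative noise-free data from y = 2x + 1.
xs = [i / 10 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
w_sgd, b_sgd = minibatch_sgd(xs, ys, lr=0.1, batch_size=4, epochs=200)
```

A larger learning rate takes bigger steps and can diverge; a smaller batch size gives noisier gradient estimates but more updates per epoch.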
classification regression
single layer perceptron, multilayer perceptron
- The multilayer perceptron uses hidden layers and activation functions to obtain nonlinear models. Commonly used activation functions are Sigmoid, Tanh, and ReLU
- Softmax handles multi-class classification problems
- Hyperparameters of the multilayer perceptron: number of hidden layers, size of each hidden layer
- Validation data and test data must not be mixed together; k-fold cross-validation is an option when data is scarce
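The softmax step mentioned in the list can be sketched in a few lines. Subtracting the maximum logit before exponentiating is a common numerical-stability trick and is an addition here, not something stated in the notes.

```python
import math


def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(logits)  # shift so the largest exponent is exp(0) = 1
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
```

The class with the largest score gets the largest probability, and the shift by `m` keeps the function usable for very large logits.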
overfitting underfitting
Model capacity: the ability to fit various functions
Controlling model complexity: the number of parameters, the range of values the parameters can take
Weight Decay and Dropout
- Weight decay reduces model complexity by constraining the weights so that their norm does not exceed a certain value; in practice this is usually implemented as an L2 penalty on the weights
- Dropout randomly sets some outputs to 0 to control model complexity; the dropout probability is a hyperparameter that controls the strength of this effect
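Both techniques can be sketched with plain lists. The version below assumes the L2-penalty form of weight decay and the "inverted" form of dropout, where kept activations are rescaled by 1 / (1 - p) so their expected value is unchanged. The function names are illustrative.

```python
import random


def sgd_step_with_weight_decay(weights, grads, lr, weight_decay):
    """One SGD step where the L2 penalty adds weight_decay * w to each gradient."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]


def dropout(activations, drop_prob, rng):
    """Zero each activation with probability drop_prob and rescale the rest."""
    if not 0.0 <= drop_prob <= 1.0:
        raise ValueError("drop_prob must be in [0, 1]")
    if drop_prob == 1.0:
        return [0.0 for _ in activations]
    keep_scale = 1.0 / (1.0 - drop_prob)
    return [0.0 if rng.random() < drop_prob else a * keep_scale
            for a in activations]


rng = random.Random(0)
activations = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(activations, drop_prob=0.5, rng=rng)
# With zero gradients, the step only shrinks the weights toward 0.
decayed = sgd_step_with_weight_decay([1.0, -2.0], [0.0, 0.0],
                                     lr=0.1, weight_decay=0.5)
```

Dropout is applied only during training; at prediction time the activations are passed through unchanged.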
numerical stability
- gradient explosion
- vanishing gradient
- Keeping training stable: turn multiplication into addition (as in ResNet and LSTM), normalization (gradient normalization, gradient clipping), and sensible weight initialization and activation functions
convolutional layer
Each output channel can recognize a specific pattern, and multiple channels can be fused
pooling layer
The number of output channels equals the number of input channels; pooling reduces the sensitivity of the convolutional layer to position
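A small sketch of max pooling, assuming stride 1 and no padding, shows both properties: each channel is pooled on its own, so the channel count is preserved, and a pattern that shifts by one pixel inside the window gives the same output. The names `max_pool2d` and `pool_channels` are illustrative.

```python
def max_pool2d(X, pool_h, pool_w):
    """Max pooling over a 2D list with stride 1 and no padding."""
    out_h = len(X) - pool_h + 1
    out_w = len(X[0]) - pool_w + 1
    return [
        [max(X[i + a][j + b] for a in range(pool_h) for b in range(pool_w))
         for j in range(out_w)]
        for i in range(out_h)
    ]


def pool_channels(channels, pool_h, pool_w):
    """Pool every channel independently; the number of channels is unchanged."""
    return [max_pool2d(c, pool_h, pool_w) for c in channels]


X = [[0, 1, 2],
     [3, 4, 5],
     [6, 7, 8]]
pooled = max_pool2d(X, 2, 2)
pooled_channels = pool_channels([X, X], 2, 2)
```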
Regularization
Regularization keeps the values of the weights from becoming too large, which helps reduce overfitting
Batch Normalization
- Batch normalization fixes the mean and variance within each mini-batch, then learns an appropriate scale and offset
- It speeds up convergence but generally does not change the final model accuracy; it allows a larger learning rate, which is where the speedup comes from
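The training-time computation can be sketched for a single feature over one mini-batch. The parameter names `gamma` (scale) and `beta` (offset) and the small constant `eps` follow common convention and are assumptions here; the running statistics used at prediction time are left out.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift it."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    # eps avoids division by zero when the batch variance is tiny.
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=1.0, beta=0.0)
shifted = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=5.0)
```

With `gamma=1` and `beta=0` the output has mean close to 0 and variance close to 1; the learned `gamma` and `beta` let the network move away from that if it helps.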
loss function
- L2 Loss
- L1 Loss
- Huber's robust loss
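For reference, the three losses can be written per sample as below. This sketch assumes the convention where the L2 loss carries a factor of 1/2, and a Huber threshold `delta` that defaults to 1; other texts scale these differently.

```python
def l2_loss(y, y_hat):
    """Squared error with the conventional factor of 1/2."""
    return 0.5 * (y - y_hat) ** 2


def l1_loss(y, y_hat):
    """Absolute error."""
    return abs(y - y_hat)


def huber_loss(y, y_hat, delta=1.0):
    """Quadratic near zero (like L2), linear for large errors (like L1)."""
    d = abs(y - y_hat)
    if d <= delta:
        return 0.5 * d ** 2
    return delta * (d - 0.5 * delta)
```

The Huber loss keeps the smooth gradient of L2 for small errors while limiting the influence of outliers the way L1 does.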
Residual network ResNet
The residual block makes deep networks easier to train. It helps avoid the situation where, late in training, gradients become too small and training becomes very slow. Because the block adds its input to its output, gradients can still flow back through the skip connection even when the gradient through the residual branch is small.
image augmentation
- Augmented images are generated online during training by applying random transformations; the augmented images are not saved
- Data augmentation increases diversity by transforming the data, so that the model generalizes better. Common image augmentations include flipping, cropping, and color changes
finetune
Fine-tuning is a form of transfer learning: train a model on a larger dataset, then reuse its architecture and parameters on a smaller dataset (the two datasets should be somewhat similar). The final fc layer (for the classification or regression task) is randomly initialized and trained from scratch.
Anchor box
- Bounding box
- Anchor-based detection first generates a large number of anchor boxes and assigns labels to them; each anchor box is used as a training sample. During prediction, NMS (non-maximum suppression) removes redundant predictions
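The NMS step can be sketched as a greedy loop over boxes sorted by score. The box format `(x1, y1, x2, y2)`, the function names, and the threshold value are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, iou_threshold=0.5)
```

Here the second box overlaps the first with IoU of about 0.68, so it is suppressed, while the third box is far away and survives.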
Target Detection
- R-CNN (region-based convolutional neural network)
- Mask R-CNN: if pixel-level labels are available, uses an FCN to exploit this information
- Faster R-CNN: has high precision, but processing is very slow, not as fast as YOLO
- SSD (single-shot multibox detection): is no longer maintained and developed, and is rarely used
- YOLO (you only look once): divides the image evenly into S x S anchor boxes, and each anchor box predicts B bounding boxes
semantic segmentation
Classification at the pixel level: each pixel is assigned to the background or to one of the other classes
transposed convolution
Transposed convolution can increase the height and width of the input. It can loosely be understood as a convolution run in reverse: it undoes the change in shape made by a convolution with the same hyperparameters, but it does not recover the original values.
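A basic transposed convolution, assuming stride 1, no padding, and a single channel, can be sketched as follows: every input element scales the whole kernel, and the scaled copies are accumulated into the output at the matching offset. The function name is illustrative.

```python
def trans_conv2d(X, K):
    """Transposed convolution of 2D lists with stride 1 and no padding."""
    in_h, in_w = len(X), len(X[0])
    k_h, k_w = len(K), len(K[0])
    # The output grows to (in_h + k_h - 1) x (in_w + k_w - 1).
    Y = [[0] * (in_w + k_w - 1) for _ in range(in_h + k_h - 1)]
    for i in range(in_h):
        for j in range(in_w):
            for a in range(k_h):
                for b in range(k_w):
                    Y[i + a][j + b] += X[i][j] * K[a][b]
    return Y


X = [[0, 1], [2, 3]]
K = [[0, 1], [2, 3]]
Y = trans_conv2d(X, K)
```

A 2 x 2 input with a 2 x 2 kernel gives a 3 x 3 output, the reverse of the shape change a regular convolution would make.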
The number of output channels of the fully convolutional network (FCN) = the number of classes
Style transfer
Find a style tensor and a content tensor separately; three losses: style, content, and noise (total variation)
hardware upgrade
sequence model
Prediction can use a Markov model or an autoregressive model
- Text preprocessing: convert the words in a sentence into data that can be processed
- Language modeling: estimates the joint probability of a text sequence, often with n-grams using statistical methods
- RNN (recurrent neural network): stores the information of a sequence over time. The quality of a model that predicts a sequence can be treated as a classification problem, namely predicting the probability of the next token index, and measured by the average cross-entropy
- Gradient clipping: often used in RNNs to effectively prevent gradient explosion
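Gradient clipping by norm can be sketched as below: if the overall gradient norm exceeds a threshold, every component is scaled down by the same factor so the direction is preserved. The threshold name `theta` and the flat list of gradients are assumptions for illustration.

```python
import math


def clip_gradients(grads, theta):
    """Rescale grads so that their L2 norm is at most theta."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > theta:
        return [g * theta / norm for g in grads]
    return list(grads)


clipped = clip_gradients([3.0, 4.0], theta=1.0)    # norm 5 is scaled to 1
untouched = clip_gradients([0.3, 0.4], theta=1.0)  # norm 0.5 is left alone
```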
Gated Recurrent Unit (GRU)
Controls which information is important and which is not: a mechanism for paying attention (update gate) and a mechanism for forgetting (reset gate)
LSTM long short-term memory network
- Forget gate: shrinks values toward 0
- Input gate: decides whether to ignore the input data
- Output gate: decides whether to use the hidden state
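How the three gates interact can be sketched with a scalar LSTM step. This is a simplification under assumptions: real LSTMs use vectors and weight matrices, and the parameter layout (a dict of `(w_x, w_h, b)` triples keyed by gate name) is made up for this example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def lstm_step(x, h_prev, c_prev, params):
    """One scalar LSTM step; params maps a gate name to (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))      # how much new input to let in
    f = sigmoid(pre("forget"))     # how much of the old cell state to keep
    o = sigmoid(pre("output"))     # how much of the cell state to expose
    c_tilde = math.tanh(pre("candidate"))
    c = f * c_prev + i * c_tilde
    h = o * math.tanh(c)
    return h, c


# Gates pushed to their extremes through the bias terms.
remember = {"input": (0.0, 0.0, -100.0), "forget": (0.0, 0.0, 100.0),
            "output": (0.0, 0.0, 100.0), "candidate": (0.0, 0.0, 0.0)}
forget = dict(remember, forget=(0.0, 0.0, -100.0))
h_keep, c_keep = lstm_step(1.0, 0.0, 0.7, remember)
h_drop, c_drop = lstm_step(1.0, 0.0, 0.7, forget)
```

With the forget gate near 1 and the input gate near 0, the cell state is carried over unchanged; with the forget gate near 0, it is wiped.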
Deep Recurrent Neural Networks
Deep Recurrent Neural Networks use multiple hidden layers for more nonlinearity
bidirectional recurrent neural network
- Bidirectional recurrent neural networks use information from both directions in time by adding a hidden layer that is updated in reverse order
- It is usually used to extract features and fill in the blanks of the sequence, not to predict the future
encoder-decoder
Seq2seq
ELMo pre-trained model
Predicts in both the forward and backward directions; handles polysemy
- GPT: unidirectional language model
- BERT: bidirectional language model
beam search
Beam search keeps the k best candidates at each step
- When k=1, it is a greedy search
- When k=n, it is an exhaustive search
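A compact sketch of beam search over a toy model illustrates the difference between k=1 and a wider beam. The probability table, the two-token vocabulary, and the function names are invented for this example; a real decoder would query a trained model instead.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token
# (None means the start of the sequence).
TABLE = {
    None: {"A": 0.6, "B": 0.4},
    "A": {"A": 0.55, "B": 0.45},
    "B": {"A": 0.9, "B": 0.1},
}


def toy_model(prefix):
    """Return log-probabilities of the next token given the prefix."""
    key = prefix[-1] if prefix else None
    return {tok: math.log(p) for tok, p in TABLE[key].items()}


def beam_search(next_log_probs, k, steps):
    """Keep the k highest-scoring sequences after every step."""
    beams = [((), 0.0)]  # (sequence, total log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + (tok,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]
    return beams


greedy = beam_search(toy_model, k=1, steps=2)
wider = beam_search(toy_model, k=2, steps=2)
```

Greedy search commits to "A" first and ends with probability 0.33, while the beam of width 2 also keeps "B" alive and finds the sequence with probability 0.36.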
attention mechanism
One way to look at the attention mechanism (see the figure above): imagine that the elements of the Source consist of a series of <Key, Value> pairs. Given a Query element from the Target, compute the similarity or correlation between the Query and each Key to obtain the weight coefficient of the Value corresponding to that Key, then take the weighted sum of the Values to obtain the final attention value. In essence, the attention mechanism is a weighted sum over the Values of the elements in the Source, where the Query and the Keys are used to compute the weight coefficient of each Value.
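The Query, Key, and Value description can be sketched directly. The similarity function is an assumption here: this version uses the scaled dot product followed by softmax, which is one common choice among several, and the function names are illustrative.

```python
import math


def softmax(scores):
    """Normalize scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Weighted sum of values, weighted by query-key similarity."""
    d = len(query)
    # Similarity between the query and each key (scaled dot product).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values))
              for j in range(dim)]
    return output, weights


# Two identical keys get equal weight, so the output is the mean of the values.
out, weights = attention([1.0, 0.0],
                         [[1.0, 0.0], [1.0, 0.0]],
                         [[1.0, 2.0], [3.0, 4.0]])
# A key aligned with the query receives the larger weight.
_, skewed = attention([1.0, 0.0],
                      [[5.0, 0.0], [0.0, 5.0]],
                      [[1.0, 2.0], [3.0, 4.0]])
```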
seq2seq using attention wit
self-attention mechanism
Self-attention models are good at processing very long texts, but the computational cost is extremely high and can require thousands of GPUs computing in parallel. The longer the text, the more resources are consumed: the cost grows with the square of the text length.