Review: self-supervised learning for text
1. Self-supervised learning for speech
A speech version of BERT can work better on speech tasks. Without self-supervised pre-training, a model may require tens of thousands of hours of labeled data to reach comparable performance.
SUPERB (Speech processing Universal PERformance Benchmark)
- YouTube course: MpsVE60iRLM
- Tool: s3prl
2. Self-supervised Learning for Image
3. Generative Approaches
Speech
Applied to speech:
- BERT series: mask parts of the sound signal and train the model to restore them.
- GPT series: given text, predict the next token; given a sound signal, predict an upcoming frame. For text the unit is one word, but for sound the prediction target must be more than three frames ahead, since adjacent acoustic frames are nearly identical and trivial to predict.
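As a toy illustration of why the speech target sits several frames ahead, here is a minimal sketch (my own construction, not code from the lecture) that builds (input frame, future frame) training pairs with a configurable shift:

```python
import numpy as np

def future_frame_pairs(frames: np.ndarray, shift: int = 3):
    """Pair each acoustic frame with the frame `shift` steps ahead.

    frames: (T, D) array of feature frames.
    Returns (inputs, targets), each of shape (T - shift, D).
    """
    if shift < 1:
        raise ValueError("shift must be >= 1")
    return frames[:-shift], frames[shift:]

# 5 frames of 2-dimensional features
frames = np.arange(10, dtype=float).reshape(5, 2)
x, y = future_frame_pairs(frames, shift=3)
# x pairs frame 0 with frame 3, and frame 1 with frame 4
```

A larger shift makes the pretext task harder, because the model can no longer rely on neighboring frames being almost identical.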
Image
Applied to an image, which is first flattened into a one-dimensional vector:
- BERT series: mask some pixels and train the model to restore them.
- GPT series: given a prefix of pixels, predict the next pixel. The learned representation is then used for downstream tasks such as classification.
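The pixel-sequence idea can be sketched as follows (a hypothetical toy of my own, not any paper's actual pipeline): flatten the image row by row, then form (context, next pixel) pairs for autoregressive training:

```python
import numpy as np

def next_pixel_pairs(image: np.ndarray):
    """Flatten an image row-major and build autoregressive training pairs."""
    seq = image.reshape(-1)                        # 1-D pixel sequence
    contexts = [seq[:i] for i in range(1, len(seq))]
    targets = seq[1:]                              # pixel following each context
    return contexts, targets

img = np.array([[1, 2],
                [3, 4]])
ctx, tgt = next_pixel_pairs(img)
# the longest context is [1, 2, 3], whose target is pixel 4
```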
4. Predictive Approach
Compared with text, speech and images contain many low-level details, so generating them is often harder.
- Image: predict whether (and by how much) the image has been rotated.
- Image: content prediction, e.g. given two patches from an image, predict in which direction the second patch lies relative to the first.
These objectives are self-supervised learning without generation.
Another option is to let the machine predict cluster assignments, using the cluster IDs as pseudo-labels.
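The rotation pretext task above can be sketched like this (my own toy version; the label is which multiple of 90 degrees was applied, so no pixels ever need to be generated):

```python
import numpy as np

def make_rotation_example(image: np.ndarray, rng: np.random.Generator):
    """Rotate the image by k * 90 degrees; k in {0, 1, 2, 3} is the label."""
    k = int(rng.integers(0, 4))
    return np.rot90(image, k), k

rng = np.random.default_rng(0)
img = np.arange(9).reshape(3, 3)
rotated, label = make_rotation_example(img, rng)
# a classifier would be trained to recover `label` from `rotated`
```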
5. Contrastive Learning
Ideally, vectors of the same class should be as close as possible, and vectors of different classes as far apart as possible.
But the categories are unknown, so how can this be done?
SimCLR
Two augmented views of the same image form a positive pair; an augmented view of a different image forms a negative pair.
data augmentation:
- random crop
- color distortions
- gaussian blur
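SimCLR's training signal can be sketched with a small NT-Xent (normalized temperature-scaled cross-entropy) loss in NumPy; the function name and temperature value here are my own choices, not the paper's code:

```python
import numpy as np

def nt_xent(z1: np.ndarray, z2: np.ndarray, tau: float = 0.5) -> float:
    """NT-Xent loss for two batches of views; z1[i], z2[i] are a positive pair."""
    z = np.concatenate([z1, z2])                      # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarities
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # positive index
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return float(-log_prob.mean())

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))
loss_matched = nt_xent(a, a + 0.01 * rng.normal(size=a.shape))  # true views
loss_random = nt_xent(a, rng.normal(size=a.shape))              # unrelated "views"
```

For this toy data, matched views give the lower loss; the gradient of this loss is what pulls positive pairs together and pushes negative pairs apart.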
There is also a speech counterpart, Speech SimCLR.
Speech versions of contrastive learning:
- CPC
- Wav2Vec
In downstream tasks you can use the encoder alone, or the encoder together with the predictor.
Mask tokens and let a BERT-style model learn to fill in the blanks:
wav2vec 2.0 trains the encoder and a BERT-style context network together. Parts of the input are masked, and the output vector at each masked position is used to identify the token of the masked input: the closer the output is to the correct (quantized) token, the better, and the farther it is from the distractor tokens, the better.
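A toy version of that masked contrastive objective (my own simplification, not the real wav2vec 2.0 code): score the context output at a masked position against the true quantized target and some distractors; training would then apply cross-entropy with the true target as the label:

```python
import numpy as np

def masked_contrastive_logits(c, q_true, q_distractors, tau=0.1):
    """Cosine-similarity logits of context output c against candidate tokens.

    Index 0 of the result corresponds to the true target.
    """
    cands = np.vstack([q_true, q_distractors])
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    c = c / np.linalg.norm(c)
    return cands @ c / tau

c = np.array([1.0, 0.0])                   # context output at a masked position
q_true = np.array([0.9, 0.1])              # quantized token of the masked input
distractors = np.array([[0.0, 1.0],
                        [-1.0, 0.2]])
logits = masked_contrastive_logits(c, q_true, distractors)
```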
BERT itself can be seen as a kind of contrastive learning: it also pulls the output as close as possible to the correct answer.
Classification task: a higher score for the correct class is better.
Contrastive learning task: a smaller distance to the correct answer is better.
If the classification task has many categories, it is impossible to enumerate all other combinations as negative pairs. Instead you can learn in a contrastive way: make the inner product between the last-layer output and the correct embedding as large as possible, and the inner products with a few randomly selected incorrect embeddings as small as possible.
Since exhaustively enumerating all negative samples is infeasible, we simply want each embedding to best represent its own token. In this sense BERT can be regarded as an instance of contrastive learning.
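The negative-sampling idea can be sketched as follows (a toy of my own in the style of word2vec's negative sampling, not BERT's actual loss): push the product with the correct embedding up, and the products with a few sampled wrong embeddings down.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(h, e_pos, e_negs):
    """h: last-layer output; e_pos: correct embedding; e_negs: (K, D) negatives."""
    pos_term = -np.log(sigmoid(h @ e_pos))            # correct product pushed up
    neg_term = -np.log(sigmoid(-(e_negs @ h))).sum()  # sampled products pushed down
    return float(pos_term + neg_term)

h = np.array([1.0, 0.5])
good = np.array([1.0, 0.4])                 # aligned with h
bad = np.array([[-1.0, 0.0], [0.0, -1.0]])
loss_aligned = neg_sampling_loss(h, good, bad)
loss_swapped = neg_sampling_loss(h, bad[0], np.vstack([good, bad[1]]))
```

Only a handful of negatives is scored per step, so no softmax over the full set of categories is ever computed.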
Difficulty: choosing negative samples
- Negative samples should be hard enough, but not too hard (e.g. two cats of the same breed: pushing them apart also pulls the representation of cats apart).
The next section describes how to avoid needing negative samples at all.
6. Bootstrapping Approaches
If there are no negative samples, then simply requiring that any two pictures map to very close vectors leads to collapse (every input gets the same vector), which is not the result we want.
Bootstrapping methods use only positive samples: one branch is followed by an extra predictor, and the embeddings produced by the two branches are trained to be as close as possible, but only the predictor branch is updated. After training, its parameters are synchronized to the other branch:
- The structures of the two sides differ slightly (one side has the extra predictor).
- Only the encoder on one side is trained; the trained parameters are then copied to the other side.
Because the two branches differ and only one receives updates, they cannot cheat by collapsing together. The structure can also be understood through the lens of knowledge distillation.
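In BYOL the synchronization is an exponential moving average rather than a one-time copy; here is a schematic toy (linear "networks", all names mine) of that update:

```python
import numpy as np

def ema_update(target_w: np.ndarray, online_w: np.ndarray, m: float = 0.99):
    """Move the target branch's weights slightly toward the online branch's."""
    return m * target_w + (1.0 - m) * online_w

online_w = np.array([1.0, 2.0])   # branch trained by gradient descent
target_w = np.array([0.0, 0.0])   # branch that only receives copies
for _ in range(3):                # pretend three training steps happened
    target_w = ema_update(target_w, online_w)
# target_w drifts toward online_w without ever receiving gradients itself
```

The target branch lags behind the online branch, which is one mechanism that keeps the two sides from trivially agreeing.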
Related algorithms:
- Image
- BYOL
- SimSiam
- Speech
- Data2vec
7. Simply Extra Regularization
Given a batch, require that the variance of each embedding dimension is greater than a threshold:
The most important term is the variance; a covariance regularizer can be added on top of it (as in VICReg). The speech-direction counterpart is similar to DeLoRes.
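The variance requirement can be sketched as a hinge penalty per embedding dimension (a VICReg-flavoured toy of my own; the threshold value is an assumption):

```python
import numpy as np

def variance_penalty(z: np.ndarray, threshold: float = 1.0) -> float:
    """Penalize dimensions whose std over the batch falls below the threshold."""
    std = z.std(axis=0)                       # per-dimension spread over the batch
    return float(np.maximum(0.0, threshold - std).mean())

collapsed = np.ones((8, 4))                   # every embedding identical
spread = np.random.default_rng(0).normal(size=(8, 4)) * 2.0
# a collapsed batch is penalized heavily; a well-spread batch much less
```

Penalizing low variance directly discourages the collapse failure mode, without any negative samples.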