Article directory
- Review: Self-supervised Learning for Text
- Self-supervised Learning for Speech
- Self-supervised Learning for Image
- 1. Generative Approaches
- 2. Predictive Approach (analysis of the shortcomings of generative approaches: speech and images contain many details and are hard to generate directly)
- 3. Contrastive Learning (self-supervised learning without generating anything)
- 4. Bootstrapping Approaches (the next two tricks avoid choosing negative examples)
- 5. Simply Extra Regularization
- Concluding Remarks (and many, many more approaches...)
Review: Self-supervised Learning for Text
Self-supervised Learning for Speech
A small amount of labeled data is used to train a downstream model (e.g., a simple linear model); if necessary, the entire model can also be fine-tuned, though this is not required.
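A minimal sketch of this protocol, assuming a generic pre-trained encoder (all names and sizes here are illustrative, not from the lecture): freeze the encoder, train only a linear classifier on its features, and optionally unfreeze everything later to fine-tune.

```python
import torch
import torch.nn as nn

# Hypothetical pre-trained self-supervised encoder (frozen by default).
encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))
for p in encoder.parameters():
    p.requires_grad = False  # linear probing: the encoder stays fixed

# Downstream model: a simple linear classifier on top of the features.
probe = nn.Linear(256, 10)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

features = torch.randn(32, 80)          # stand-in for a labeled batch
labels = torch.randint(0, 10, (32,))
logits = probe(encoder(features))
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()

# Optional: fine-tune the whole model by re-enabling encoder gradients
# and adding encoder.parameters() to the optimizer.
```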
A speech version of BERT
Self-supervised Learning for Image
Self-supervised learning can surpass supervised learning here, showing great potential.
1. Generative Approaches
On speech
BERT cannot simply be copied over: there are qualitative differences between speech and text, so some of the design must be adapted to the characteristics of speech.
For example, adjacent acoustic feature vectors are often very similar in content. If you mask only a single vector, the machine learns nothing: it can simply interpolate between the vectors on either side, the prediction will already be close, and the self-supervision learns nothing but interpolation.
Therefore, for speech it is necessary to mask a long span of features rather than a single one at a time, forcing the machine to solve a genuinely hard problem. A masking sketch follows below.
A different attempt for speech: mask certain dimensions of every vector; this variant lets the machine learn speaker information (or is it semantic information?).
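A minimal sketch of the two masking variants just described (span length, mask value, and feature sizes are illustrative assumptions): mask a contiguous run of frames so interpolation is no longer enough, or mask the same subset of dimensions in every frame.

```python
import torch

def mask_span(feats: torch.Tensor, span: int = 10) -> torch.Tensor:
    """Zero out a contiguous run of frames; feats has shape (T, D)."""
    T, _ = feats.shape
    start = torch.randint(0, T - span + 1, (1,)).item()
    masked = feats.clone()
    masked[start:start + span] = 0.0  # a single masked frame could just be
    return masked                     # interpolated from its neighbours

def mask_dims(feats: torch.Tensor, n_dims: int = 20) -> torch.Tensor:
    """Zero out the same feature dimensions in every frame."""
    _, D = feats.shape
    dims = torch.randperm(D)[:n_dims]
    masked = feats.clone()
    masked[:, dims] = 0.0
    return masked

x = torch.randn(100, 80)  # 100 frames of 80-dim acoustic features
print(mask_span(x).shape, mask_dims(x).shape)
```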
The GPT series applied to speech.
The difference: predict a vector far enough into the future, because adjacent frames are too easy to predict. A sketch follows below.
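A minimal sketch of the idea, assuming a simple recurrent context network and an L1 regression loss (both illustrative, not the lecture's exact model): predict the frame k steps ahead rather than the adjacent one.

```python
import torch
import torch.nn as nn

k = 12  # predict k frames ahead; adjacent frames would be too easy
rnn = nn.GRU(input_size=80, hidden_size=256, batch_first=True)
head = nn.Linear(256, 80)  # maps the context to a predicted future frame

feats = torch.randn(8, 100, 80)             # (batch, time, dim)
context, _ = rnn(feats[:, :-k])             # context up to each time t
pred = head(context)                        # prediction for time t + k
target = feats[:, k:]                       # the actual future frames
loss = nn.functional.l1_loss(pred, target)
loss.backward()
```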
On images
2. Predictive Approach (analysis of the shortcomings of generative approaches: speech and images contain many details and are hard to generate directly)
There are many simple tasks you can set the machine to solve, and in solving them it learns something useful.
Question: what kind of small task unleashes the machine's potential? There is no particularly good answer yet; you need a deep understanding of the characteristics of sound and images in order to design a better mini-game for the machine to play.
A more general approach: simplify what has to be generated into something easier, then predict that.
For example, use clustering to first turn the complex vectors into discrete tokens, then predict those tokens, which is easier (see the sketch below).
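A minimal sketch of the cluster-then-predict recipe (k-means and the linear head are illustrative stand-ins): turn continuous vectors into discrete token IDs by clustering, then train a model on the easier task of predicting those IDs.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# Step 1: simplify -- cluster complex continuous vectors into discrete tokens.
feats = np.random.randn(1000, 80).astype(np.float32)
tokens = KMeans(n_clusters=100, n_init=10).fit_predict(feats)

# Step 2: predict the cluster IDs -- a classification task, which is much
# easier than generating the raw vectors directly.
model = nn.Linear(80, 100)
logits = model(torch.from_numpy(feats))
loss = nn.functional.cross_entropy(logits, torch.from_numpy(tokens).long())
loss.backward()
```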
3. Contrastive Learning (self-supervised learning without generating anything)
How do we know which pairs are positive and which are negative? (Data augmentation)
The question then becomes how to do data augmentation. If it is too easy, the machine learns nothing; if it is too hard, training fails. How do you control the strength of the augmentation? See the original SimCLR paper: it tries various combinations of augmentations and reports which work best. The paper finds random cropping to be the most effective. A loss sketch follows below.
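A minimal NT-Xent loss sketch in the spirit of SimCLR (the temperature and batch layout follow common convention, not the paper verbatim): the two augmented views of each sample attract each other, and every other sample in the batch serves as a negative.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """z1, z2: (N, D) projections of two augmented views of the same batch."""
    N = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)          # (2N, D)
    sim = z @ z.t() / tau                                # scaled cosine similarity
    sim = sim.masked_fill(torch.eye(2 * N, dtype=torch.bool), -1e9)  # no self-pairs
    # The positive for sample i is its other view at index (i + N) mod 2N.
    targets = torch.cat([torch.arange(N, 2 * N), torch.arange(N)])
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(32, 128, requires_grad=True),
               torch.randn(32, 128, requires_grad=True))
loss.backward()
```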
SimCLR on speech: Speech SimCLR
Another line: MoCo (which adds a memory bank and a momentum encoder), training tricks that make contrastive training succeed; see the paper for details (sketched below).
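A minimal sketch of the two ingredients named above (queue size, momentum rate, and encoder shapes are illustrative): a momentum (EMA) copy of the encoder produces the keys, and a queue of past keys acts as a large pool of negatives.

```python
import torch
import torch.nn as nn

encoder_q = nn.Linear(80, 128)                  # query encoder, trained as usual
encoder_k = nn.Linear(80, 128)                  # key encoder, a momentum copy
encoder_k.load_state_dict(encoder_q.state_dict())
queue = torch.randn(4096, 128)                  # memory bank of past keys

@torch.no_grad()
def momentum_update(m: float = 0.999):
    """The key encoder slowly trails the query encoder (EMA of its weights)."""
    for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
        pk.mul_(m).add_(pq, alpha=1.0 - m)

x = torch.randn(32, 80)
q = encoder_q(x)
with torch.no_grad():
    k = encoder_k(x)                            # keys carry no gradient
momentum_update()
queue = torch.cat([k, queue])[:4096]            # enqueue new keys, drop the oldest
```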
Contrastive learning on speech
The output tokens are discrete. Why? 1. So that BERT can be applied on top; 2. Discretization strips away other noise. A quantization sketch follows below.
wav2vec 2.0: train everything jointly (continuous vs. discrete representations?)
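A minimal sketch of the discretization step (the codebook size is an illustrative assumption): replace each continuous vector with the index of its nearest codebook entry, yielding discrete tokens that a BERT-style model can consume.

```python
import torch

codebook = torch.randn(320, 128)      # learnable code vectors (size illustrative)

def quantize(z: torch.Tensor) -> torch.Tensor:
    """Map each continuous vector in z, shape (T, 128), to its nearest code's index."""
    dists = torch.cdist(z, codebook)  # (T, 320) pairwise distances
    return dists.argmin(dim=1)        # one discrete token per frame

tokens = quantize(torch.randn(100, 128))  # now usable as BERT-style input
print(tokens[:10])
```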
Another angle of understanding: classification vs. contrastive learning. This explains why contrastive learning is feasible: it is also doing classification, except that its negatives are only a sample, not all classes, whereas the negatives in classification are everything else. It also explains why MoCo keeps a memory bank: to store more negative examples.
When computing resources are insufficient, contrastive learning is clearly the better choice.
If classification had to cover, say, 100k token classes, the resources would not suffice, especially in the early years.
Root cause: effectively infinite negative examples? (If they cannot all be stored, do a clustering first.)
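A rough back-of-the-envelope illustration of this resource argument (all numbers are made up for scale): a full softmax needs a logit for every class, while a contrastive objective only scores the positive plus a handful of sampled negatives.

```python
# Full classification: an output weight and a logit for every class, every step.
vocab = 100_000                      # e.g. 100k token classes
dim = 1024
full_softmax_params = vocab * dim    # ~100M parameters in the output layer alone

# Contrastive: only score the positive plus K sampled negatives.
K = 255
contrastive_scores_per_example = K + 1   # 256 similarities instead of 100,000

print(full_softmax_params, contrastive_scores_per_example)
```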
BERT applied to speech
There is another problem: how do you choose the negative examples?
For example, with a cat and the sky as a negative pair, the model may only learn to distinguish "color" information,
so the negatives must be "hard enough", e.g., a cat vs. a dog or a tiger.
One question: what if both pictures are cats? Without labels we cannot know that both are cats; if we treat the two cat pictures as negative examples, aren't we pushing apart things that should be regarded as the same?
4. Bootstrapping Approaches (the next two tricks avoid choosing negative examples)
What sorcery is this?
To avoid collapse, one branch must have a predictor and the other branch must be a copy of the encoder (with no gradient flowing into it).
Key point: the left and right architectures are different (asymmetric).
Another point of view for understanding it: bootstrapping. A sketch follows below.
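A minimal BYOL-style sketch of this asymmetric setup (network sizes and the EMA rate are illustrative): the online branch carries an extra predictor, the target branch is a gradient-free copy that slowly follows it, and no negative examples are used at all.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

online = nn.Sequential(nn.Linear(80, 128), nn.ReLU(), nn.Linear(128, 64))
predictor = nn.Linear(64, 64)               # only the online branch has this
target = copy.deepcopy(online)              # the "copy" branch
for p in target.parameters():
    p.requires_grad = False                 # no gradients flow into the target

v1, v2 = torch.randn(32, 80), torch.randn(32, 80)   # two views of one batch
p1 = F.normalize(predictor(online(v1)), dim=1)
with torch.no_grad():
    z2 = F.normalize(target(v2), dim=1)
loss = 2 - 2 * (p1 * z2).sum(dim=1).mean()  # cosine-style loss, no negatives
loss.backward()

with torch.no_grad():                       # target slowly follows the online net
    for po, pt in zip(online.parameters(), target.parameters()):
        pt.mul_(0.99).add_(po, alpha=0.01)
```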
5. Simply Extra Regularization
Most important: the variance term (see the sketch below).
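A minimal sketch of a variance regularizer in the VICReg style (the floor of 1 follows that paper's convention; everything else is illustrative): keep the standard deviation of every embedding dimension above a threshold so the outputs cannot all collapse to a constant.

```python
import torch
import torch.nn.functional as F

def variance_loss(z: torch.Tensor, floor: float = 1.0) -> torch.Tensor:
    """Penalize embedding dimensions whose std across the batch falls below floor."""
    std = torch.sqrt(z.var(dim=0) + 1e-4)   # per-dimension std over the batch
    return F.relu(floor - std).mean()       # hinge: only low-variance dims pay

z = torch.randn(256, 128, requires_grad=True)   # a batch of embeddings
loss = variance_loss(z)
loss.backward()
```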