[(Strongly recommended) Li Hongyi 2021/2022 Spring Machine Learning Course] 2022 - Magical Self-Supervised Learning Models for Speech and Image


pdf | Video

Review: Self-supervised Learning for Text


Self-supervised Learning for Speech

A small amount of labeled data is used to train the downstream model (e.g., a simple linear model); if necessary, the entire model can also be fine-tuned, but this is not required.
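Below is a minimal sketch of this recipe, assuming a PyTorch-style setup; `ssl_encoder` is a hypothetical stand-in for a real pretrained upstream model, and the shapes and class count are made up for illustration.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained self-supervised upstream model (hypothetical).
ssl_encoder = nn.GRU(input_size=80, hidden_size=256, batch_first=True)
for p in ssl_encoder.parameters():
    p.requires_grad = False          # freeze the upstream model

# The downstream model is just a linear classifier trained on a small labeled set.
downstream = nn.Linear(256, 10)      # e.g., 10 target classes
optimizer = torch.optim.Adam(downstream.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(features, labels):
    # features: (batch, time, 80) acoustic features; labels: (batch,)
    with torch.no_grad():                       # upstream stays frozen
        hidden, _ = ssl_encoder(features)       # (batch, time, 256)
    logits = downstream(hidden.mean(dim=1))     # average-pool over time, then classify
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# If fine-tuning the whole model is desired (optional), unfreeze ssl_encoder and
# add its parameters to the optimizer instead of wrapping it in torch.no_grad().
loss = train_step(torch.randn(8, 100, 80), torch.randint(0, 10, (8,)))
```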


The speech version of BERT


Self-supervised Learning for Image


Self-supervised learning already surpasses supervised learning on some image benchmarks; it has great potential.

1. Generative Approaches

On speech


The text recipe cannot be copied directly: there are qualitative differences between speech and text, so the design has to account for the characteristics of speech.
For example, adjacent acoustic feature vectors are often very similar in content. If only a single vector is masked, the machine learns almost nothing, because it can simply interpolate between the vectors on either side and its prediction will be roughly right; the self-supervised model then only learns interpolation.

Therefore, for speech it is necessary to mask a long span of features rather than a single feature at a time, forcing the machine to solve a harder problem.

For speech you can also try something different: mask certain dimensions of the feature vectors. This variant reportedly makes the machine learn speaker information (rather than semantic content?).
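A minimal sketch of both masking strategies described above, under my own assumptions about feature shapes and span length (an illustration, not the exact recipe from any particular paper):

```python
import torch

def mask_time_span(x, span_len=8):
    """Zero out a contiguous span of frames, so interpolation alone cannot solve the task.
    x: (batch, time, dim) acoustic features."""
    x = x.clone()
    batch, time, _ = x.shape
    for b in range(batch):
        start = torch.randint(0, time - span_len, (1,)).item()
        x[b, start:start + span_len, :] = 0.0
    return x

def mask_channels(x, num_dims=10):
    """Zero out randomly chosen feature dimensions across all frames
    (the variant said to push the model toward speaker information)."""
    x = x.clone()
    dims = torch.randperm(x.shape[-1])[:num_dims]
    x[:, :, dims] = 0.0
    return x

features = torch.randn(4, 100, 80)     # 4 utterances, 100 frames, 80-dim features
masked = mask_time_span(features)      # reconstruction target: the original features
masked_dims = mask_channels(features)
```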

The GPT idea can also be applied to speech.
The difference: predict a frame far enough into the future, because the immediately adjacent frame is too easy to predict.
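A minimal sketch of this idea, in the spirit of autoregressive future-frame prediction; the architecture, feature dimension, and the shift k=3 are assumptions made for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuturePredictor(nn.Module):
    """Autoregressive model that predicts the frame k steps ahead;
    k should be well above 1, because the adjacent frame is too easy to predict."""
    def __init__(self, dim=80, hidden=256, k=3):
        super().__init__()
        self.k = k
        self.rnn = nn.GRU(dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, dim)

    def forward(self, x):
        # x: (batch, time, dim) acoustic features
        h, _ = self.rnn(x)
        pred = self.out(h)
        # The hidden state at time t must predict the frame at time t + k.
        return F.l1_loss(pred[:, :-self.k, :], x[:, self.k:, :])

loss = FuturePredictor(k=3)(torch.randn(2, 100, 80))
```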

On images


2. Predictive Approaches (why generative approaches fall short: speech and images contain many details, so they are hard to generate directly)


There are many variants of this idea: give the machine a simple surrogate task to solve, and it learns something useful along the way.

Question: what kind of small task can unleash the potential of the machine? There is no particularly good answer yet; you need a good understanding of the characteristics of sound and images in order to design a better mini-game for the machine to play.

The following is a more general approach: simplify the generation target into something simpler, and then predict that.
For example, use clustering to first turn the complex vectors into discrete tokens, and then predict those symbols, which is easier.
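A minimal sketch of the "cluster first, then predict tokens" idea, assuming scikit-learn's KMeans as the clustering step and random vectors as stand-in features:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for continuous feature vectors extracted from speech or images.
features = np.random.randn(10000, 80)

# Step 1: clustering turns every complex vector into a discrete token (its cluster id).
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)
tokens = kmeans.labels_            # shape (10000,), values in [0, 100)

# Step 2: the self-supervised model now only has to classify each masked position
# into one of 100 tokens (a cross-entropy task) instead of regressing the full
# 80-dimensional vector, which is a much easier prediction target.
print(tokens[:10])
```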

3. Contrastive Learning (self-supervised learning without generating anything)


How do we know which pairs are positive and which are negative? (Data Augmentation)


The question then becomes how to do data augmentation. If it is too easy, the machine cannot learn anything; if it is too hard, the task becomes unsolvable. How do we control the strength of the augmentation? See the original SimCLR paper, which tries various combinations of augmentations and reports which work best; according to the paper, random cropping is the most effective.
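A minimal sketch of a SimCLR-style contrastive objective (an NT-Xent-like loss written from scratch for illustration, not the authors' implementation): two augmented views of the same example are positives, and everything else in the batch serves as negatives:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: (batch, dim) embeddings of two augmented views of the same examples."""
    batch = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2B, dim)
    sim = z @ z.t() / temperature                        # pairwise cosine similarities
    sim.fill_diagonal_(float('-inf'))                    # a view is not its own negative
    # For row i, the positive is the other augmented view of the same example.
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)])
    return F.cross_entropy(sim, targets)

z1 = torch.randn(32, 128)   # view 1 (e.g., one random crop)
z2 = torch.randn(32, 128)   # view 2 (another random crop of the same images)
loss = nt_xent(z1, z2)
```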

SimCLR on Speech: Speech SimCLR

Another line of work: MoCo (which adds a memory bank and a momentum encoder), training tricks that make contrastive training succeed; see the paper for details.
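A minimal sketch of the momentum-encoder update (the real MoCo also maintains a queue/memory bank of negative keys, which is omitted here; the network and momentum value are illustrative assumptions):

```python
import copy
import torch
import torch.nn as nn

encoder_q = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 128))  # query encoder (trained by gradients)
encoder_k = copy.deepcopy(encoder_q)          # key (momentum) encoder starts as a copy
for p in encoder_k.parameters():
    p.requires_grad = False                   # never updated by backprop

@torch.no_grad()
def momentum_update(m=0.999):
    # The key encoder slowly tracks the query encoder (an exponential moving average).
    for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
        pk.data.mul_(m).add_(pq.data, alpha=1 - m)

momentum_update()   # called once per training step, after updating encoder_q
```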

Contrastive Learning on Speech


The output tokens are discrete. Why? 1. So that BERT can be used on top of them; 2. To strip away other nuisance information (noise).


wav2vec 2.0: train the whole pipeline together (continuous vs. discrete representations?)

Another angle of understanding: Classification vs. Contrastive. This explains why contrastive learning is feasible even though it is, in effect, also doing classification: the negatives in contrastive learning are only a sample rather than all possible classes, whereas in classification the negatives are all the other classes. It also explains why MoCo keeps a memory bank: to store more negative examples.

When computing resources are insufficient, contrastive learning is clearly the better choice.
If classification had on the order of 100,000 tokens (classes), the resources would not be enough, especially in the early years.
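A toy comparison of the two views above (all sizes are made up for illustration): a full classification head must score all 100,000 classes at every step, whereas a contrastive loss only scores the positive against a handful of sampled negatives:

```python
import torch
import torch.nn.functional as F

dim, vocab, batch = 128, 100_000, 64
h = torch.randn(batch, dim)                       # encoder outputs for a batch
targets = torch.randint(0, vocab, (batch,))       # the "correct" token for each position

# Classification view: a softmax over ALL 100k classes at every training step.
full_head = torch.randn(vocab, dim)               # 12.8M-parameter output layer
loss_cls = F.cross_entropy(h @ full_head.t(), targets)

# Contrastive view: compare the positive against only a few sampled negatives,
# so the cost no longer grows with the number of classes.
num_neg = 10
neg = full_head[torch.randint(0, vocab, (batch, num_neg))]   # (batch, num_neg, dim)
pos = full_head[targets].unsqueeze(1)                        # (batch, 1, dim)
logits = torch.bmm(torch.cat([pos, neg], dim=1), h.unsqueeze(2)).squeeze(2)
loss_con = F.cross_entropy(logits, torch.zeros(batch, dtype=torch.long))
print(loss_cls.item(), loss_con.item())
```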

The root cause: there are effectively infinitely many negative examples? (They cannot all be stored or computed, so clustering is done first.)
Applying BERT to speech


There is another problem: how to choose Negative Examples?

For example, if the negative pair is a cat and the sky, the model may only learn to distinguish "colour" information,
so the negatives must be "hard enough", for example cats versus dogs or tigers.
One question: what if both pictures are cats? We do not know in advance that they are both cats; if the two cat pictures are treated as negative examples, shouldn't they actually be treated as the same kind of thing?


4. Bootstrapping Approaches (the next two approaches avoid having to choose negative examples)

What sorcery is this?

There must be a predictor, and the branch on the right must be a copy (not trained by gradient), to avoid collapse.

Key point: left and right architectures are different
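A minimal sketch of this asymmetric setup in the spirit of BYOL/SimSiam, written as my own simplification (network sizes and the cosine-based loss are assumptions): one branch has the extra predictor, the other is a frozen copy that receives no gradient:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder   = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 128))
predictor = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))  # extra head on the left branch only
target    = copy.deepcopy(encoder)        # the "copy" on the right-hand branch
for p in target.parameters():
    p.requires_grad = False               # updated only by copying/EMA, never by gradients

def bootstrap_loss(view1, view2):
    online = predictor(encoder(view1))    # left branch: encoder + predictor
    with torch.no_grad():
        tgt = target(view2)               # right branch: frozen copy (stop-gradient)
    # Pull the prediction toward the target embedding; no negative examples needed.
    return 2 - 2 * F.cosine_similarity(online, tgt, dim=-1).mean()

loss = bootstrap_loss(torch.randn(32, 784), torch.randn(32, 784))
```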

Another way to understand it: Bootstrapping

5. Simply Extra Regularization

Most importantly: Variance

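A minimal sketch of the variance idea (in the spirit of VICReg's variance term; the threshold and hinge form are illustrative assumptions): explicitly keep the standard deviation of each embedding dimension above a target so the representations cannot collapse to a constant:

```python
import torch
import torch.nn.functional as F

def variance_loss(z, target_std=1.0, eps=1e-4):
    """z: (batch, dim) embeddings. Penalize dimensions whose std falls below target_std,
    which prevents all embeddings from collapsing to the same constant vector."""
    std = torch.sqrt(z.var(dim=0) + eps)
    return F.relu(target_std - std).mean()   # hinge: only low-variance dimensions are penalized

z = torch.randn(256, 128)
print(variance_loss(z))                      # close to 0 for well-spread embeddings
```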

Concluding Remarks (and there are many, many more methods...)

