Machine Learning: Self-supervised Learning for Speech and Image


Review: Self-supervised Learning for Text


1. Self-supervised Learning for Speech

Using a speech version of BERT works much better on speech tasks. Without self-supervised pre-training, other models may require tens of thousands of hours of data to reach comparable performance.

SUPERB (Speech processing Universal PERformance Benchmark)



2. Self-supervised Learning for Image



3. Generative Approaches


Speech

Applied to speech:

  • BERT series: mask parts of the sound signal and train the model to restore them (see the first part of the sketch after this list).
  • GPT series: just as a language model predicts the next token given the preceding text, a speech model predicts the upcoming signal given the preceding frames. For text the prediction unit is a token, but for speech the target should be more than three frames ahead, because adjacent frames are nearly identical (see the second part of the sketch below).
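Below is a minimal sketch of both objectives, assuming 80-dimensional log-mel frames, a small GRU encoder, and an L1 reconstruction loss; real systems in these families (e.g. Mockingjay, APC) use larger encoders and more careful masking, so this only illustrates the losses.

```python
# Minimal sketch (not the original post's code) of the two generative
# objectives on acoustic features. Assumed setup: 80-dim log-mel frames,
# a small GRU encoder, L1 reconstruction loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, D = 4, 200, 80                       # batch, frames, feature dim
x = torch.randn(B, T, D)                   # stand-in for log-mel features

encoder = nn.GRU(D, 256, batch_first=True)
head = nn.Linear(256, D)                   # maps a hidden state back to a frame

# BERT-style: mask ~15% of the frames and reconstruct them.
# (A real masked model would use a bidirectional/Transformer encoder.)
mask = torch.rand(B, T) < 0.15
x_masked = x.clone()
x_masked[mask] = 0.0                       # zero out the masked frames
h, _ = encoder(x_masked)
bert_loss = F.l1_loss(head(h)[mask], x[mask])   # score only masked positions

# GPT-style: from past frames, predict a frame k steps ahead
# (k > 3 here, because adjacent frames are nearly identical).
k = 4
h, _ = encoder(x[:, :-k])                  # unidirectional GRU = causal
gpt_loss = F.l1_loss(head(h), x[:, k:])    # target is the frame k steps later

print(bert_loss.item(), gpt_loss.item())
```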

Image

Apply the same idea to an image: flatten it into a one-dimensional sequence of pixels, then either mask some pixels and predict them, or, given a prefix of pixels, predict the next pixel. The learned representation is then used for downstream tasks such as classification.
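A minimal sketch of the next-pixel objective, assuming tiny 8×8 grayscale images quantised to 16 intensity levels and a GRU standing in for the Transformer used by the actual Image GPT:

```python
# Minimal sketch of next-pixel prediction on a flattened image.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, H, W, LEVELS = 8, 8, 8, 16
imgs = torch.randint(0, LEVELS, (B, H, W))        # fake quantised images
seq = imgs.reshape(B, H * W)                      # flatten to a 1-D sequence

embed = nn.Embedding(LEVELS, 64)
rnn = nn.GRU(64, 128, batch_first=True)           # causal by construction
to_logits = nn.Linear(128, LEVELS)

inp, target = seq[:, :-1], seq[:, 1:]             # predict the next pixel
h, _ = rnn(embed(inp))
logits = to_logits(h)                             # (B, H*W-1, LEVELS)
loss = F.cross_entropy(logits.reshape(-1, LEVELS), target.reshape(-1))
print(loss.item())
```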

4. Predictive Approach

Compared with text, speech and images contain far more low-level detail, so generating them directly is often much harder.

Image - predict whether the image has been rotated

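A minimal sketch of the rotation-prediction pretext task, assuming 32×32 RGB inputs, a toy CNN, and four rotation classes (0°, 90°, 180°, 270°):

```python
# Minimal sketch of the rotation-prediction pretext task.
import torch
import torch.nn as nn
import torch.nn.functional as F

imgs = torch.randn(16, 3, 32, 32)                        # unlabeled images
k = torch.randint(0, 4, (imgs.size(0),))                 # rotation class 0-3
rotated = torch.stack([torch.rot90(img, int(r), dims=(1, 2))
                       for img, r in zip(imgs, k)])      # apply the rotation

net = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 4),                                    # 4 rotation classes
)
loss = F.cross_entropy(net(rotated), k)                  # predict the rotation
print(loss.item())
```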

Image - content prediction

Given two patches cut from the same image, predict in which direction the second patch lies relative to the first.
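A minimal sketch of relative-position prediction, assuming a 96×96 image cut into a 3×3 grid of 32×32 patches, with the centre patch paired against one of its eight neighbours:

```python
# Minimal sketch of relative patch-position prediction.
import torch
import torch.nn as nn
import torch.nn.functional as F

img = torch.randn(3, 96, 96)
patches = [img[:, r*32:(r+1)*32, c*32:(c+1)*32]
           for r in range(3) for c in range(3)]            # 9 patches
centre = patches[4]
direction = torch.randint(0, 8, (1,))                      # which neighbour
neighbour_idx = [0, 1, 2, 3, 5, 6, 7, 8][int(direction)]
pair = torch.cat([centre, patches[neighbour_idx]], dim=0)  # 6-channel input

net = nn.Sequential(
    nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 8),                                      # 8 directions
)
loss = F.cross_entropy(net(pair.unsqueeze(0)), direction)
print(loss.item())
```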

Self-supervised learning without generation: first cluster the data, then let the machine predict which cluster each example belongs to.
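A minimal sketch of the cluster-prediction idea (as used in methods like DeepCluster and HuBERT), assuming cluster ids have already been produced offline; a toy nearest-centroid assignment stands in for k-means here:

```python
# Minimal sketch of "predict the cluster id" as the pretext task.
import torch
import torch.nn as nn
import torch.nn.functional as F

feats = torch.randn(256, 128)                        # offline features
K = 10
centroids = feats[torch.randperm(256)[:K]]           # toy "k-means" centroids
cluster_id = torch.cdist(feats, centroids).argmin(dim=1)   # pseudo-labels

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, K))
loss = F.cross_entropy(model(feats), cluster_id)     # predict the cluster
print(loss.item())
```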


5. Contrastive Learning

The goal is to pull vectors of the same class as close together as possible and push vectors of different classes as far apart as possible. But if we do not know the classes, how can we do this?

SimCLR

Two views produced by data augmentation of the same image form a positive pair; a view produced from a different image forms a negative pair (a loss sketch follows the list below).
Data augmentation includes:

  • random crop
  • color distortions
  • Gaussian blur
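A minimal sketch of the SimCLR loss (NT-Xent), assuming the two augmented views have already been encoded into embeddings z1 and z2; the real SimCLR also uses a projection head on top of the encoder together with the augmentation pipeline above:

```python
# Minimal sketch of the NT-Xent (SimCLR) contrastive loss.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: (N, D) embeddings of two views of the same N images."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2N, D)
    sim = z @ z.t() / temperature                             # cosine sims
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float('-inf'))                # drop self-sim
    # the positive for row i is its other view: i+n (first half) or i-n
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
print(nt_xent(z1, z2).item())
```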

Speech SimCLR

A speech version of SimCLR also exists (Speech SimCLR).

Speech versions of contrastive learning:

  • CPC
  • Wav2Vec

In downstream tasks you can use the encoder alone, or use the encoder together with the predictor (a sketch of the frozen-encoder setup follows).
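A minimal sketch of this reuse pattern, assuming an arbitrary frozen self-supervised encoder (a stand-in GRU here) and a lightweight classifier trained on top, in the spirit of the SUPERB setup:

```python
# Minimal sketch: frozen pretrained encoder + lightweight downstream head.
import torch
import torch.nn as nn
import torch.nn.functional as F

pretrained_encoder = nn.GRU(80, 256, batch_first=True)   # stand-in encoder
for p in pretrained_encoder.parameters():
    p.requires_grad = False                              # keep it frozen

classifier = nn.Linear(256, 10)                          # e.g. 10 keywords

x = torch.randn(4, 200, 80)                              # a batch of utterances
labels = torch.randint(0, 10, (4,))
with torch.no_grad():
    reps, _ = pretrained_encoder(x)                      # frozen features
logits = classifier(reps.mean(dim=1))                    # mean-pool over time
loss = F.cross_entropy(logits, labels)
print(loss.item())
```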

Mask some tokens and let BERT learn to fill in the blanks:


wav2vec 2.0 trains the feature encoder and the BERT-style (Transformer) encoder together. Part of the input is masked, and the output vector at each masked position is used to predict the quantized token of the masked input: the closer the output is to the correct token, the better, and the farther it is from the other candidate tokens, the better.
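A minimal sketch of such a masked contrastive objective, assuming the Transformer outputs at masked positions, their quantised targets, and some sampled negatives are already given; the real wav2vec 2.0 also learns the quantiser jointly and adds a diversity loss:

```python
# Minimal sketch of a wav2vec 2.0-style masked contrastive loss.
import torch
import torch.nn.functional as F

M, D, NEG = 20, 256, 10                          # masked positions, dim, negatives
context = torch.randn(M, D)                      # outputs at masked positions
positives = torch.randn(M, D)                    # quantised target per position
negatives = torch.randn(M, NEG, D)               # sampled from other positions

candidates = torch.cat([positives.unsqueeze(1), negatives], dim=1)    # (M, 1+NEG, D)
sims = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1)  # (M, 1+NEG)
targets = torch.zeros(M, dtype=torch.long)       # index 0 is the true token
loss = F.cross_entropy(sims / 0.1, targets)      # small temperature
print(loss.item())
```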

BERT itself can actually be seen as a kind of contrastive learning: it also pushes the output vector as close as possible to the correct answer.


Classification task: the score of the correct class should be as high as possible.
Contrastive learning task: the distance between a positive pair should be as small as possible.

If the classification task has a huge number of classes, it is impossible to enumerate all the other classes for every example. Instead, you can learn in a contrastive way: make the last-layer output as close as possible to the embedding of the correct class, and as far as possible from a few randomly sampled incorrect embeddings (see the sketch below).
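A minimal sketch of this sampled (negative-sampling style) approximation, assuming a large embedding table `emb`, the last-layer output vector, and a handful of randomly drawn wrong classes; the names here are illustrative only:

```python
# Minimal sketch: score the correct class against a few sampled negatives
# instead of running a softmax over the whole class/vocabulary set.
import torch
import torch.nn.functional as F

V, D, NEG = 50_000, 256, 5                       # vocab size, dim, negatives
emb = torch.randn(V, D)                          # class / token embeddings
output = torch.randn(D)                          # last-layer output vector
correct = 123                                    # ground-truth class id

neg_ids = torch.randint(0, V, (NEG,))            # randomly sampled negatives
pos_score = emb[correct] @ output                # want this large
neg_scores = emb[neg_ids] @ output               # want these small
logits = torch.cat([pos_score.reshape(1), neg_scores]).unsqueeze(0)
loss = F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))
print(loss.item())
```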

It is difficult to exhaustively enumerate all the negative examples; we only need the embedding to represent its input well. In this sense, BERT can be regarded as an application of the contrastive-learning idea.

Difficulty

Choosing negative samples:

  • The negative samples should be hard enough, but not too hard (for example, two cats of the same breed: treating them as negatives also pulls the "cat" vectors apart).

The next family of methods avoids having to select negative samples at all.

6. Bootstrapping Approaches

If there are no negative samples and the only requirement is that two pictures map to two very close vectors, the model can cheat by outputting the same vector for every input, which is not the result we want.

With only positive samples, one branch is followed by a predictor, and the embeddings produced by the two branches are trained to be as close as possible; however, only the branch with the predictor is updated. After training, its parameters are copied to the other branch:

  • The structures of the two branches are slightly different.
  • Only the encoder on one branch is trained; the trained parameters are then copied to the other branch.

Because the two branches are different, they cannot cheat together by collapsing to a trivial solution; the structure can also be understood as a form of knowledge distillation (a minimal sketch follows):
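A minimal sketch of the two-branch idea with toy MLP encoders; note that BYOL actually updates the second branch as an exponential moving average rather than a hard copy, and SimSiam drops the copy entirely and relies on stop-gradient:

```python
# Minimal sketch of bootstrapping with a predictor and a frozen target branch.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder_online = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64))
predictor = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
encoder_target = copy.deepcopy(encoder_online)            # the other branch

view1 = torch.randn(32, 128)                              # two augmented views
view2 = torch.randn(32, 128)                              # of the same inputs

p = predictor(encoder_online(view1))                      # trained branch
with torch.no_grad():
    z = encoder_target(view2)                             # no gradient here
loss = -F.cosine_similarity(p, z, dim=-1).mean()          # pull the views together
print(loss.item())

# after (a phase of) training, sync the target branch with the trained one
encoder_target.load_state_dict(encoder_online.state_dict())
```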

Related algorithms:

  • Image
    • BYOL
    • SimSiam
  • Speech
    • Data2vec

7. Simply Extra Regularization

Given a batch of embeddings, the variance of each dimension across the batch is required to be greater than a threshold.

The most important term is the variance term; a covariance term can also be added on top of it. A similar approach in the speech domain is DeLoRes (see the sketch below).
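A minimal sketch of the variance and covariance terms, assuming VICReg-style definitions; the loss weights and the invariance term that pulls the two views together are omitted here:

```python
# Minimal sketch of variance and covariance regularisation on a batch.
import torch

z = torch.randn(256, 64)                           # a batch of embeddings

# variance term: every dimension should vary across the batch (threshold 1.0)
std = torch.sqrt(z.var(dim=0) + 1e-4)
variance_loss = torch.relu(1.0 - std).mean()       # penalise std below 1

# covariance term: decorrelate different dimensions of the embedding
zc = z - z.mean(dim=0)
cov = (zc.t() @ zc) / (z.size(0) - 1)
off_diag = cov - torch.diag(torch.diag(cov))
covariance_loss = (off_diag ** 2).sum() / z.size(1)

print(variance_loss.item(), covariance_loss.item())
```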

Summary



Source: blog.csdn.net/uncle_ll/article/details/131798275