[OpenMMLab AI Practical Camp Phase II Notes] Day 4: Deep learning pre-training and MMPretrain

Introduction to MMPreTrain algorithm library

1. Algorithm library and task composition

MMPretrain is a newly upgraded open-source pre-training framework (it originates from MMClassification and MMSelfSup), which aims to provide a variety of powerful pre-trained backbone networks and to support different pre-training strategies.
Main functions:
(1) Rich models: backbone networks, self-supervised learning algorithms, and multimodal learning algorithms
(2) Dataset support: common datasets such as ImageNet and COCO
(3) Training tricks and strategies: optimizers, learning-rate schedules, and data augmentation strategies
(4) Ease of use: a large number of preset configuration files, pre-trained models, model training tools, parameter/FLOPs counting tools, CAM interpretability analysis, various visualization tools, deployment tools, etc.

2. Framework overview

2.1 OpenMMLab software stack

(figure: the OpenMMLab software stack)

2.2 MMPreTrain installation steps

Basic installation :

conda create -n open-mmlab python=3.8 pytorch==1.10.1 torchvision==0.11.2 cudatoolkit=11.3 -c pytorch -y  # create the environment
conda activate open-mmlab  # activate the environment
pip install openmim  # install the mim tool
git clone https://github.com/open-mmlab/mmpretrain.git  # download mmpretrain
cd mmpretrain  # enter the directory
mim install -e .  # install in editable mode

Multimodal dependencies (install these when using the multimodal features):

mim install -e ".[multimodal]"

2.3 Important concept: the configuration file

You can adjust the following aspects in a configuration file to realize your own ideas. Training a deep learning model involves:
(1) Model structure: how many layers the model has, how many channels each layer has, etc.
(2) Data: dataset splits, data file paths, batch size, data augmentation strategies, etc.
(3) Training optimization: the gradient-descent algorithm, learning-rate parameters, total number of training epochs, learning-rate schedule, etc.
(4) Runtime: GPUs, distributed environment configuration, etc.
(5) Auxiliary functions: printing logs, saving weight files periodically, etc.
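The aspects above map onto fields of an MMPretrain-style config file, which in the OpenMMLab convention is an ordinary Python file built from nested dicts. The sketch below is illustrative only: the field names follow the common OpenMMLab layout, but the specific values are hypothetical and this is not a complete working config.

```python
# A minimal, hypothetical MMPretrain-style config sketch (plain Python dicts).
# (1) Model structure
model = dict(
    type='ImageClassifier',
    backbone=dict(type='ResNet', depth=50),
    head=dict(type='LinearClsHead', num_classes=1000),
)

# (2) Data: dataset location, batch size, etc.
train_dataloader = dict(
    batch_size=32,
    dataset=dict(type='ImageNet', data_root='data/imagenet'),
)

# (3) Training optimization: optimizer and schedule
optim_wrapper = dict(optimizer=dict(type='SGD', lr=0.1, momentum=0.9))
train_cfg = dict(by_epoch=True, max_epochs=100)

# (5) Auxiliary functions: logging / checkpoint intervals
default_hooks = dict(
    logger=dict(type='LoggerHook', interval=100),
    checkpoint=dict(type='CheckpointHook', interval=1),
)
```

In a real run, MMPretrain's config system loads such a file and instantiates each `dict(type=...)` entry through its registry.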

2.4 Code framework

(figure: the MMPretrain code framework)

2.5 Data flow

(figure: data flow in MMPretrain)

2.6 How configuration files work

(figure: how configuration files work)

Classic backbone networks

1.ResNet

Counter-intuitive problem: if the extra convolutional layers degenerate into identity mappings, a deep network is equivalent to a shallow one, so a deep network should achieve at least the same classification accuracy as a shallow one. In practice, however, the deeper plain network performs worse.
Conjecture: although the deep network has the capacity to achieve higher accuracy, conventional optimization algorithms struggle to find this better model. If the newly added convolutional layers only need to fit an approximate identity mapping, optimization becomes much easier, and the deeper network can be at least as good as the shallower one.
The basic idea of residual learning:
(figure: the basic idea of residual learning)
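The core idea can be sketched in a few lines of NumPy: a residual block computes y = F(x) + x, so if the learned branch F collapses to zero, the block reduces to the identity map. This is a toy sketch only; real ResNet blocks use convolution, batch normalization, and ReLU, not a single linear map.

```python
import numpy as np

def residual_block(x, weight):
    """Toy residual unit: y = F(x) + x, with F a single linear map + ReLU.
    (A stand-in for the real conv/BN/ReLU branch.)"""
    fx = np.maximum(weight @ x, 0.0)  # the learned branch F(x)
    return fx + x                     # skip connection adds the input back

x = np.array([1.0, 2.0, 3.0])
# If the learned branch is (near) zero, the block degenerates to identity,
# so stacking more such blocks cannot make the network worse:
y = residual_block(x, np.zeros((3, 3)))
```

This is exactly the conjecture above: fitting F ≈ 0 (an identity block overall) is far easier for the optimizer than fitting an identity map with unconstrained layers.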
ResNet's achievements and influence:
ResNet is one of the most influential and widely used model structures in deep learning and won the CVPR 2016 Best Paper Award. The residual structure remains ubiquitous today: it appears in the various vision Transformers and in convolutional networks such as ConvNeXt in computer vision, as well as in GPT and other large language models.

2.Vision Transformer

2.1 Attention mechanism

(figures: the attention mechanism)
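The attention mechanism in the figures computes Attention(Q, K, V) = softmax(QKᵀ / √d_k) V: each query forms a weighted average over the values, with weights given by query-key similarity. A minimal NumPy sketch with random toy matrices (no learned projections):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # query-key similarities
    weights = softmax(scores, axis=-1)   # each query's weights sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out, w = scaled_dot_product_attention(Q, K, V)  # out has shape (4, 8)
```

The √d_k scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with tiny gradients.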

2.2 Multi-head attention mechanism

(figure: the multi-head attention mechanism)
Benefit: different heads can attend to and extract different features independently, which improves the performance of the network.
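A toy NumPy sketch of this idea: split the embedding dimension into several heads, run attention independently in each subspace, and concatenate the results. Real implementations also apply learned Q/K/V and output projections, which are omitted here.

```python
import numpy as np

def multi_head_attention(X, num_heads):
    """Toy multi-head self-attention without learned projections:
    split the embedding into num_heads chunks, attend per head, concatenate."""
    n, d = X.shape
    assert d % num_heads == 0
    d_h = d // num_heads
    heads = []
    for h in range(num_heads):
        Q = K = V = X[:, h * d_h:(h + 1) * d_h]  # each head sees its own subspace
        scores = Q @ K.T / np.sqrt(d_h)
        scores -= scores.max(axis=-1, keepdims=True)
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)       # per-head softmax
        heads.append(w @ V)
    return np.concatenate(heads, axis=-1)        # output keeps shape (n, d)

X = np.random.default_rng(1).standard_normal((5, 16))
Y = multi_head_attention(X, num_heads=4)  # shape (5, 16)
```

Because each head works in a lower-dimensional subspace, the heads can specialize in different similarity patterns at roughly the same total cost as one full-width head.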

2.3 Vision Transformer

(figure: the Vision Transformer architecture)

Self-supervised learning

Data exists in massive quantities, but labeled data is scarce. To exploit this massive data without relying on annotations, self-supervised learning lets the network learn feature representations from the data itself.
Common types:
1. Pretext (proxy) tasks, such as image colorization
2. Contrastive learning
3. Masked image modeling

1. SimCLR

The basic assumption of contrastive learning: if the model can extract the essence of an image's content well, then no matter what data augmentation the image undergoes, the extracted features should be very similar.
(figure: the SimCLR framework)
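This assumption can be sketched numerically: embed two augmented views of each image, compute all pairwise similarities, and penalize the model unless each view is most similar to its own counterpart. The sketch below is a toy NumPy version of an NT-Xent/InfoNCE-style objective on simulated features, not the actual SimCLR model or loss implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# Toy embeddings: the two augmented views of each image should map to
# nearby points on the unit sphere (noise stands in for augmentation).
base = rng.standard_normal((4, 16))
view1 = normalize(base + 0.05 * rng.standard_normal((4, 16)))
view2 = normalize(base + 0.05 * rng.standard_normal((4, 16)))

sim = view1 @ view2.T          # cosine similarities between all view pairs
tau = 0.5                      # temperature
logits = sim / tau

# InfoNCE-style loss: each view1[i] should match view2[i] (the diagonal);
# all other pairs act as negatives to push apart.
loss = -np.mean(np.diag(logits) - np.log(np.exp(logits).sum(axis=1)))
```

Minimizing this loss pulls the two views of the same image together and pushes views of different images apart, which is exactly the invariance the basic assumption describes.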

2.MAE

Basic assumption: only by understanding the content of an image and grasping its contextual information can the model recover the randomly masked regions of the image.
(figure: the MAE framework)
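The masking step at the heart of MAE is easy to sketch: randomly hide a large fraction (typically 75%) of the patch tokens and feed only the visible ones to the encoder; the decoder must reconstruct the masked patches. The NumPy sketch below covers the masking logic only; the patch count follows the common 224×224 image / 16×16 patch setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 224x224 image split into 16x16 patches gives 14*14 = 196 patch tokens.
num_patches, mask_ratio = 196, 0.75   # MAE masks ~75% of patches
num_masked = int(num_patches * mask_ratio)

perm = rng.permutation(num_patches)   # random shuffle of patch indices
masked_idx = perm[:num_masked]        # 147 patches hidden from the encoder
visible_idx = perm[num_masked:]       # 49 patches the encoder actually sees

# The encoder processes only visible_idx; the decoder must reconstruct the
# patches at masked_idx, which forces the model to use image context.
```

Because the encoder sees only ~25% of the tokens, pre-training is also substantially cheaper than running a ViT on the full image.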

Multimodal algorithms

1.CLIP(ICML 2021)

(figure: the CLIP framework)
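CLIP trains an image encoder and a text encoder so that matching image-text pairs have high similarity in a shared embedding space; at inference, zero-shot classification picks the caption most similar to the image. A toy NumPy sketch with simulated embeddings (the encoders themselves are stand-ins, and the 1/0.07 temperature mirrors CLIP's initialization):

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# Simulated paired embeddings in a shared space: img[i] and txt[i]
# describe the same concept, so they land near each other.
shared = rng.standard_normal((3, 32))
img = normalize(shared + 0.1 * rng.standard_normal((3, 32)))
txt = normalize(shared + 0.1 * rng.standard_normal((3, 32)))

logit_scale = 1.0 / 0.07                # CLIP-style temperature scaling
logits = logit_scale * img @ txt.T      # similarity of every image to every text

# Zero-shot matching: each image should pick its own caption (the diagonal).
pred = logits.argmax(axis=1)
```

During training, a symmetric cross-entropy over the rows and columns of this logits matrix pulls the diagonal (matching pairs) up and the off-diagonal (mismatched pairs) down.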

2.BLIP

(figure: the BLIP framework)
ITC (Image-Text Contrastive loss): aligns image and text features, making matched pairs distinguishable from unmatched ones
ITM (Image-Text Matching loss): predicts whether a given image-text pair actually matches
LM (Language Modeling loss): generates text conditioned on the image
BLIP combines these three losses to build a new vision-language pre-training framework that can handle various downstream tasks, such as image retrieval, image captioning, and visual question answering.

Other multimodal algorithms: BLIP-2, Flamingo, Kosmos-1, LLaVA


Source: blog.csdn.net/qq_41776136/article/details/131056850