VALSE 2023 Content Summary (updating)


This post contains selected highlights; please leave a comment to request the full PPT.
The update will be completed within a week, so stay tuned

The 2023 Vision And Learning SEminar (VALSE) was held at the Wuxi Taihu International Expo Center from June 10 to 12, hosted by the Chinese Association for Artificial Intelligence and the China Society of Image and Graphics, and organized by Jiangnan University and the Management Committee of the Wuxi National High-tech Industrial Development Zone. The program featured 3 conference keynote reports, 4 conference invited reports, 12 Annual Progress Report (APR) talks, 4 tutorials, and 20 workshops. In addition, 186 papers from top conferences and journals were presented in poster and exchange sessions.


VALSE 2023 opened at the Taihu International Expo Center in Wuxi on June 9, 2023.

Conference Invited Reports & Annual Progress Reports (APR)

1. Feature encoding and digital retina

Speaker: Gao Wen, Peking University
The talk moved from the cognitive and psychological basis of vision to feature coding methods and the digital retina, then to model compression, on-device deployment, and large models, and finally to the Pengcheng-Dasheng vision model.


2. Thoughts and open problems on next-generation deep learning

Jiao Licheng, Xidian University
This report focuses on the basic theory of deep learning. It first reviews the origin and development of deep learning, then re-examines and re-thinks the field, pointing to the basic theory that needs a breakthrough. It then discusses the representation, learning, and optimization theory of deep learning from three perspectives: brain-inspired, physics-inspired, and evolution-inspired. Finally, it offers some thoughts on the next generation of deep learning.

The slides covered optimization theory, representation theory and learning theory, and interdisciplinary influences from other disciplines.

Origins and inspirations:

These include electromagnetism, statistical thermodynamics, optics, energy-based models, and quantum intelligence. For the complete PPT, please send a private message.


There was also a review of meta-learning and neural architecture search (NAS), followed by concluding thoughts.


3. Computer Vision – From Isolated to Systematic Approaches

Chen Xilin, Institute of Computing Technology, Chinese Academy of Sciences

In AI, the dominant research paradigm has long been single-point research centered on isolated algorithms. Meanwhile, problems such as uneven sample distributions and task diversity are pervasive in the real world and are essentially insurmountable under the isolated paradigm. It is therefore necessary to explore and integrate multimodal information from a systems perspective and to build a learning system that moves from passive sensing to active exploration. This report introduces some of our recent thinking and attempts in this direction and explores a path of continuous accumulation and learning from a systems perspective.

History of CV Development

Several trends in computer vision:
The logic behind models:
1. What is a model? (not merely algorithm complexity)
Model = algorithm complexity × training data; training-data scale runs into the curse of dimensionality.
2. Model maturity M = computing power / model complexity. Stages of maturity:
- Thinking level, e.g. neural networks before 2000: very preliminary results that few people recognized.
- Research level, e.g. neural networks around 2010: an important tool in academia.
- Industry level, e.g. today's large models.
- Individual-user level.

Are large models the hope or the end?
Inspiration from the IBM System/360: architecture matters. Large models point toward an AI architecture in which the large model becomes a component (used directly), serving as:
1. the interface between basic AI capabilities;
2. the structural support for AGI;
3. a research field beyond traditional AI topics;
4. a step beyond single intelligence toward a comprehensive intelligent agent.

Take home messages:

4. NeRF-based 3D vision annual progress report

Liu Yebin, Tsinghua University

Neural Radiance Fields (NeRF) are a 3D representation based on implicit fields and volume rendering, and have attracted extensive attention for their end-to-end differentiability and high-quality novel-view synthesis. Since NeRF was proposed, researchers have made many improvements to the implicit field itself and to the volume rendering process, achieving faster inference and training, decoupling of geometry and appearance, material and lighting editing, and modeling of dynamic, static, and multi-scale scenes under sparse viewpoints. Meanwhile, by combining NeRF with other representations and with generative models, applications in 3D vision keep emerging. This report reviews the important research results on neural radiance fields over the past year, covering improvements to the representation itself and representative applications, and focuses on two open challenges: high-quality 3D reconstruction and rendering, and extending NeRF to efficient four-dimensional representations of spatio-temporally dynamic scenes. First, the rationale:
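For reference, the core of the method (from the original NeRF paper, not these slides) is the differentiable volume rendering integral along a camera ray $\mathbf{r}(t)=\mathbf{o}+t\mathbf{d}$:

$$\hat C(\mathbf{r}) \;=\; \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t),\mathbf{d})\,dt,\qquad T(t) \;=\; \exp\!\Big(-\!\int_{t_n}^{t}\sigma(\mathbf{r}(s))\,ds\Big)$$

Every term is differentiable with respect to the implicit field, which is exactly the end-to-end differentiability highlighted above.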

The talk then covered why NeRF matters, the major research directions, four common types of scene modeling, and several kinds of scene modeling for embodied applications.


5. Frontier progress in diffusion probabilistic models

Zhu Jun, Tsinghua University

AIGC is developing rapidly, and the diffusion probabilistic model is one of its key technologies, having made remarkable progress in text-to-image and 3D generation. This report introduces several advances in diffusion probabilistic models, including their basic theory and efficient algorithms, large-scale multimodal diffusion models, and 3D generation. First, the principles:

The talk compared two kinds of differential equations, ODEs and SDEs:

An ODE (ordinary differential equation) describes how a deterministic variable evolves over time. It has the form dy/dt = f(y, t), where y is a deterministic variable and f gives its rate of change. The solution of an ODE is a definite function and is unique for a given initial value.

An SDE (stochastic differential equation) describes how a random variable evolves over time. It has the form dX_t = μ·dt + σ·dW_t, where μ and σ are drift and diffusion coefficients and W_t is a stochastic process (usually Brownian motion). The solution of an SDE is itself a stochastic process: the randomness of W_t carries into the solution, so even with the same initial value and parameters, individual sample paths differ. ODEs are better suited to describing deterministic systems, while SDEs describe behavior in stochastic systems.
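A minimal numerical sketch of the contrast (illustrative toy code, not from the talk): the ODE is integrated with the deterministic Euler method and always yields the same trajectory, while the SDE is simulated with Euler-Maruyama and yields a different sample path on every run.

```python
import numpy as np

def euler_ode(f, y0, t0=0.0, t1=1.0, n_steps=1000):
    """Deterministic Euler method for dy/dt = f(y, t): the solution is unique."""
    y, t = y0, t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        y, t = y + f(y, t) * dt, t + dt
    return y

def euler_maruyama(mu, sigma, x0, rng, t0=0.0, t1=1.0, n_steps=1000):
    """Euler-Maruyama for dX_t = mu*dt + sigma*dW_t: each run is one sample path."""
    x, t = x0, t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        x += mu * dt + sigma * rng.normal(0.0, np.sqrt(dt))  # Brownian increment ~ N(0, dt)
        t += dt
    return x

rng = np.random.default_rng(0)
print(euler_ode(lambda y, t: -y, 1.0))                          # always ~exp(-1) = 0.3679
print([euler_maruyama(-0.5, 0.3, 1.0, rng) for _ in range(3)])  # three different paths
```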

This was followed by the team's recent progress on diffusion models (THU TSAIL Group):

Basic theory and algorithms
1. Score estimation for energy-based LVMs (ICML 2021)
2. High-order denoising score matching (ICML 2022)
3. Analytic-DPM: optimal variance estimation (ICLR 2022 Outstanding Paper; ICML 2022)
4. DPM-Solver: the fastest inference algorithm (NeurIPS 2022 Oral)
5. U-ViT backbone: more scalable (CVPR 2023)

Novel designs of diffusion models for various tasks
1. Energy-guided DPM for image-to-image translation (NeurIPS 2022)
2. Equivariant energy-guided DPM for molecular design (ICLR 2023)
3. Generative behavior modeling for offline RL (ICLR 2023)
4. UniDiffuser for multimodal inference (ICML 2023)
5. ProlificDreamer for text-to-3D content (arXiv:2305.16213, 2023)
6. ControlVideo for one-shot text-to-video editing (arXiv:2305.17098, 2023)

The last three works above were highlighted:

1. UniDiffuser: multimodal inference

2. ProlificDreamer: high-quality text-to-3D (building on DreamFusion)

1. DreamCLIP: a single scene, optimized by direct gradient descent.
2. DreamFusion: fits a single scene to the pretrained distribution via score distillation sampling (SDS).
3. ProlificDreamer: fits a distribution over scenes (a set of scenes) to the pretrained distribution via variational score distillation (VSD).
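For context, the key gradients can be written as follows (paraphrased from the DreamFusion and ProlificDreamer papers, not these slides; $\boldsymbol\epsilon_\phi$ is the pretrained diffusion model and $\boldsymbol\epsilon_{\text{lora}}$ a fine-tuned estimate of the current scene distribution's score):

$$\nabla_\theta\mathcal{L}_{\mathrm{SDS}}=\mathbb{E}_{t,\boldsymbol\epsilon}\Big[w(t)\big(\boldsymbol\epsilon_\phi(\mathbf{x}_t;y,t)-\boldsymbol\epsilon\big)\tfrac{\partial\mathbf{x}}{\partial\theta}\Big],\qquad \nabla_\theta\mathcal{L}_{\mathrm{VSD}}=\mathbb{E}_{t,\boldsymbol\epsilon}\Big[w(t)\big(\boldsymbol\epsilon_\phi(\mathbf{x}_t;y,t)-\boldsymbol\epsilon_{\text{lora}}(\mathbf{x}_t;y,t,c)\big)\tfrac{\partial\mathbf{x}}{\partial\theta}\Big]$$

VSD's learned score term replaces SDS's raw noise, which is what lets it fit a distribution of scenes rather than a single mode.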

3. ControlVideo: One-Shot Text-to-Video Editing

Final summary:


10. Visual Self-Supervised Learning

Han Hu, Microsoft Research Asia


Over the past year or so, the mainstream paradigm of visual self-supervised learning has shifted from contrastive learning to generative approaches. Generative methods represented by BEiT, MAE, and SwinV2 (SimMIM) have achieved excellent performance under the pretrain-then-finetune paradigm and, more importantly, have been shown to scale better with data and model size than previous methods; they also integrate well with multimodal methods. This APR surveys the major advances in visual self-supervised learning over the past year, including research on the pretraining methods themselves and on their properties.


Annual progress in self-supervised learning (2022-2023):

Trend 1: improvements to masked image modeling
Trend 2: masked image modeling proves friendlier to large models
Trend 3: masked image modeling training for small models
Trend 4: mining the good properties of masked image modeling
Trend 5: extension to other modalities
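Since most of these trends revolve around masked image modeling, here is a minimal sketch of its core recipe (MAE/SimMIM-style pixel regression; the shapes and the 0.75 ratio are illustrative assumptions, not any specific paper's code): mask random patches and compute the reconstruction loss only on the masked ones.

```python
import torch

def random_patch_mask(batch_size, num_patches, mask_ratio=0.75):
    """Per-sample boolean mask over patch tokens; True marks a masked patch."""
    scores = torch.rand(batch_size, num_patches)
    num_masked = int(num_patches * mask_ratio)
    idx = scores.argsort(dim=1)[:, :num_masked]        # random subset per sample
    mask = torch.zeros(batch_size, num_patches, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask

def mim_loss(pred_patches, target_patches, mask):
    """Reconstruction loss computed only on masked patches, as in MAE/SimMIM."""
    per_patch = (pred_patches - target_patches).pow(2).mean(dim=-1)  # (B, N)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)
```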


Extension to other modalities:

Summary:

11. Remote sensing object detection

Northwestern Polytechnical University

This report first summarizes and analyzes the challenges faced by object detection in remote sensing imagery, and then reviews the main progress of recent years, including oriented object detection, weakly supervised object detection, few-shot object detection, fine-grained model recognition, and weak and small object detection.

Several challenges :
Several algorithms for oriented object detection:
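As background on what "oriented" means here: boxes are typically parameterized as (cx, cy, w, h, θ) rather than axis-aligned. A simple illustrative way to compute the IoU between two such boxes (my own sketch using shapely, not any speaker's code):

```python
import math
from shapely.geometry import Polygon  # pip install shapely

def obb_to_polygon(cx, cy, w, h, theta):
    """Corner polygon of an oriented box (cx, cy, w, h, angle in radians)."""
    c, s = math.cos(theta), math.sin(theta)
    pts = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return Polygon([(cx + x * c - y * s, cy + x * s + y * c) for x, y in pts])

def rotated_iou(box1, box2):
    p1, p2 = obb_to_polygon(*box1), obb_to_polygon(*box2)
    inter = p1.intersection(p2).area
    return inter / (p1.area + p2.area - inter)

# Same box rotated 45 degrees: the overlap (and hence IoU) drops well below 1.
print(rotated_iou((0, 0, 4, 2, 0.0), (0, 0, 4, 2, math.pi / 4)))
```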

Weakly supervised object detection:

Fine-grained recognition:

Efficient object detection:

Weak and small object detection:


This post contains selected highlights; please leave a comment to request the full PPT. The remaining sections will be updated in the next few days.

Tutorial1: From Transformer to GPT

Tutorial2: Diffusion Model

Workshop 1: Challenges and opportunities of large models for CV/PR

Workshop 4: Multimodal Cognitive Computing

Workshop 6: ChatGPT and Computer Vision

Workshop 7: Robot Embodied Intelligence

Workshop 10: Object Detection and Segmentation

Workshop 12: Multimodal Large Models and Prompt Learning

1. Zero-shot visual learning with pre-trained models and language augmentation

Zuo Wangmeng, Harbin Institute of Technology

In recent years, with the emergence of multimodal pretrained models such as CLIP and Stable Diffusion, making full use of pretrained large models for fine-tuning and prompt learning in downstream tasks has become a hot issue and an important trend in computer vision and multimodal learning. Addressing these challenges, this report covers three aspects: (1) taking 3D point cloud classification as an example, how to extend image-language pretrained models to other visual modalities such as 3D point clouds; (2) taking object detection as an example, how to achieve zero-shot learning for more complex vision tasks based on multimodal pretrained generative models; (3) taking multi-label classification as an example, how to use language data as visual supervision to further enhance zero-shot visual learning. Through this analysis and introduction, the hope is that more kinds of pretrained models (e.g. CLIP, Stable Diffusion) can be applied more widely to various visual modalities (e.g. images, point clouds) and complex visual tasks (e.g. classification, detection, segmentation, generation), promoting research on, and practical application of, multimodal pretrained models in downstream tasks.

Contents:
1. Linguistically enhanced zero-shot multi-label image classification

text as image

2. Extension of zero-shot classification to other modalities

CLIP2Point: CLIP for 3D point cloud classification

3. Extension to other zero-shot tasks

ImaginaryNet: Image Synthesis for Object Detection
Ref-D: zero-shot referring image segmentation

Common large models: BERT / T5, ChatGPT, GPT-4, LLaMA
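As background on the zero-shot setting used throughout this talk, here is a minimal CLIP zero-shot classification sketch using the Hugging Face transformers API (the model name, image path, and label prompts are illustrative, not from the talk):

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any test image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
# Image-text similarities become class probabilities -- no task-specific training.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```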


TaI prompting, a text-as-image image-text pretraining framework:


Point clouds and depth maps: a projection approach that uses large models indirectly:

Then ImageBind and ImaginaryNet were introduced:
"lmaginaryNet: Learning Object Detectors without Real images and Annotations" Chinese Academy of Sciences Automation Research, Arxiv 2022: A new unsupervised learning method based on a generative model for training object detection models to solve the lack of real images Object detection without any real images or annotations: Using a generative model-based unsupervised training method to build some virtual image supervision through pixel-level generation, we can explore whether it is possible to achieve efficient without any real images or annotations. object detection.

Two main parts: a pixel-generation network produces virtual images containing virtual objects, whose coordinates and regions are randomly generated without real labels; and an object detector trained on these virtual images. The generated images and their object positions serve as training data for the detection network, and the detector produces bounding boxes while being trained without real supervision.

Innovation: no real images or annotations are required. ImaginaryNet's generative model produces virtual images and objects, which are treated as the supervisory signal to train the object detector without real data; the generator can keep producing more virtual images and objects for more efficient training.

Experiments on benchmark datasets show that, even without large numbers of real images and ground truth, this method can perform object detection with good performance.
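A rough sketch of the training loop as described above (all class and method names are hypothetical placeholders, not the paper's API):

```python
import random
from typing import Protocol

class Generator(Protocol):
    def sample(self, prompt: str): ...          # synthesizes one virtual image

class Detector(Protocol):
    def training_step(self, image, class_label: str) -> float: ...
    def update(self, loss: float) -> None: ...

def train_without_real_images(generator: Generator, detector: Detector,
                              classes: list[str], steps: int) -> None:
    """Train a detector purely on synthesized images; the class used in the
    generation prompt is itself the (image-level) supervisory signal."""
    for _ in range(steps):
        cls = random.choice(classes)
        image = generator.sample(prompt=f"a photo of a {cls}")  # virtual image
        loss = detector.training_step(image, class_label=cls)
        detector.update(loss)
```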


The general segmentation model SAM, and a SAM-assisted Referring Diffusional Segmentor, are used for interactive image segmentation. The segmentation process is split into a generative stage and a discriminative stage. In the generative stage, the algorithm combines user-provided guidance (keywords, pointers, contours, etc.) with image features and uses a random-walk model to produce initial segmentation results. In the discriminative stage, a deep discriminative network learns from and refines the random-walk results to improve segmentation accuracy. In the implementation, discriminative training techniques are combined with adversarial training to make the segmentation more robust and accurate.

2. Knowledge-enhanced multimodal pretraining and prompt learning

Yu Zhou, Hangzhou Dianzi University

Multimodal models pretrained on massive data have achieved amazing results on various multimodal tasks. How to introduce "knowledge guidance" on top of the existing "data-driven" framework, so as to further improve models' expressiveness and generalization, is a hot and difficult research topic. This report introduces two multimodal learning frameworks that incorporate knowledge:

1. ROSITA: uses the intra-modal and cross-modal correlation knowledge implicit in multimodal data to achieve internally knowledge-enhanced multimodal pretraining;

2. Prophet: uses a small multimodal model to prompt a large language model (GPT-3), achieving externally knowledge-enhanced multimodal prompt learning (see the sketch below).
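As a rough illustration of the second framework, the prompting pattern might look like the following (an assumed template; the paper's actual prompt format differs in detail). The small VQA model supplies answer candidates with confidences, which are embedded as heuristics in the LLM prompt:

```python
def build_prophet_style_prompt(question: str, caption: str,
                               candidates: list[tuple[str, float]]) -> str:
    """Illustrative prompt: a small VQA model's top answers (with confidences)
    are embedded as heuristics for the large language model."""
    cand_str = ", ".join(f"{answer} ({conf:.2f})" for answer, conf in candidates)
    return (
        f"Context: {caption}\n"
        f"Question: {question}\n"
        f"Candidates: {cand_str}\n"
        f"Answer:"
    )

print(build_prophet_style_prompt(
    "What sport can you use this object for?",
    "a man holding a frisbee in a park",
    [("frisbee", 0.62), ("catch", 0.21), ("golf", 0.05)],
))
```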

Problem setting and existing solutions:

ROSITA framework :

Prophet framework :

3. Multimodal interweaving: new methods and architectures for multimodal fusion

Wang Yunhe, Huawei

  1. Fusion of different modalities: multimodal interweaving Transformer
    • Multimodal Token Fusion for Vision Transformers, CVPR 2022

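A hedged sketch of the token-fusion idea in that paper: score each token's informativeness and substitute uninformative tokens of one modality with projected tokens from the other (names and details are illustrative, not the official implementation):

```python
import torch
import torch.nn as nn

class MutualTokenFusion(nn.Module):
    """Illustrative token fusion: low-scoring tokens of modality A are replaced
    by projected tokens from modality B (a simplified reading of the paper)."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)      # per-token informativeness score
        self.project = nn.Linear(dim, dim)  # cross-modal projection

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor,
                threshold: float = 0.5) -> torch.Tensor:
        keep = torch.sigmoid(self.score(tokens_a))   # (B, N, 1)
        substitute = self.project(tokens_b)          # tokens from modality B
        return torch.where(keep > threshold, tokens_a, substitute)
```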

  2. GPT4Image: an image recognition method that uses a large model as a teaching assistant
    • GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception Tasks


  3. Single-modality design: the minimalist backbone network VanillaNet
    • VanillaNet: the Power of Minimalism in Deep Learning, arXiv:2305.12972

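In the spirit of VanillaNet's minimalism, a stage reduces to a single branch-free convolution block with no residual connections and no attention; a simplified sketch (my assumption of the structure, not the official code):

```python
import torch.nn as nn

def vanilla_stage(in_ch: int, out_ch: int) -> nn.Sequential:
    """A branch-free stage: one conv, normalization, activation, and pooling --
    no shortcuts, no self-attention (illustrative simplification)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )
```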

4. LLM-Based Multimodal Prompt Learning: Framework, Prompt Data and Evaluation Criteria

Shao Jing, SenseTime

With the rapid development of large language models, text has become a natural medium for connecting information across modalities. Through interactive learning between text and other modalities, models can acquire more generalizable abilities to perceive, understand, and generate in rich cross-modal applications. Because of the huge gaps between cross-modal data and tasks, association, alignment, and effective fusion remain major challenges. Building on recent progress in large models, this report starts from supporting tasks such as understanding, cognition, and generation for images, videos, and 3D data, and introduces the data and framework construction and the evaluation-criteria design for cross-modal prompt learning, discussing the status quo and challenges of multimodal large models through application cases.

The talk mainly introduced the Dolphin model:


5. Modal Gap and Learning with Interactive Prompts

Zhu Linchao, Zhejiang University

With the growth of multimodal data, multimodal analysis has become a research hotspot. Transfer and alignment techniques can align information across modalities and perform multimodal transfer to improve task performance. This report introduces common modality-gap problems in the multimodal field and methods for reducing the gap, including prompt-based transfer, multi-task learning, and zero-shot learning, and explores applications of multimodal analysis through experiments and case studies.

  1. Vision-language pretraining for fine-grained semantic alignment


2. Automatic language description of visual scenes


Dataset and results :

  3. Efficient multimodal fusion

A mutual-query mechanism is proposed (see the sketch below):
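A minimal sketch of what a mutual-query fusion could look like, with each modality cross-attending to the other (an illustrative reading of the slide, not the speaker's exact design):

```python
import torch
import torch.nn as nn

class MutualQuery(nn.Module):
    """Mutual-query fusion sketch: modality A queries B and B queries A via
    cross-attention, with residual connections (illustrative)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.a_from_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b_from_a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, a: torch.Tensor, b: torch.Tensor):
        a2, _ = self.a_from_b(query=a, key=b, value=b)  # A queries B
        b2, _ = self.b_from_a(query=b, key=a, value=a)  # B queries A
        return a + a2, b + b2
```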

Ablation experiments and results:
Summary:

Modal alignment:
Alignment information between modalities can be used for self-supervised training.
Reducing the modality gap can significantly improve transfer performance; non-parametric methods can reduce the gap and improve the alignment between visual and textual knowledge (see the sketch below).

Modal fusion:
Prompt-based modal fusion.
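As an example of the kind of non-parametric gap-reduction trick referred to above (an illustrative sketch, not the speaker's method): center each modality's embedding cloud on a common origin, then renormalize.

```python
import torch

def center_modalities(image_emb: torch.Tensor, text_emb: torch.Tensor):
    """Non-parametric gap reduction (illustrative): remove each modality's mean
    direction so both embedding clouds share a common center, then renormalize."""
    img = image_emb - image_emb.mean(dim=0, keepdim=True)
    txt = text_emb - text_emb.mean(dim=0, keepdim=True)
    img = torch.nn.functional.normalize(img, dim=-1)
    txt = torch.nn.functional.normalize(txt, dim=-1)
    return img, txt
```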

Workshop 13: Frontiers of 3D Vision Technology

Workshop 14: Visual Content Generation

Workshop 15: Self-Supervised Visual Representation Learning

Workshop 17: Visual Knowledge and Multiple Knowledge Representation

Workshop 19: Excellent Student Forum


