Daily Academic Express 6.3

CV - Computer Vision | ML - Machine Learning | RL - Reinforcement Learning | NLP - Natural Language Processing

Subjects: cs.CV

1.Reconstructing the Mind's Eye: fMRI-to-Image with Contrastive Learning and Diffusion Priors

Title: Reconstructing the Mind's Eye: fMRI-to-Image with Contrastive Learning and Diffusion Priors

Authors: Paul S. Scotti, Atmadeep Banerjee, Jimmie Goode, Stepan Shabalin, Alex Nguyen, Ethan Cohen, Aidan J. Dempster, et al.

Article link: https://arxiv.org/abs/2305.18274

Project code: https://medarc-ai.github.io/mindeye-website/

Summary:

        We introduce MindEye, a novel fMRI-to-image approach that retrieves and reconstructs viewed images from brain activity. Our model contains two parallel submodules dedicated to retrieval (using contrastive learning) and reconstruction (using diffusion priors). MindEye can map fMRI brain activity to any high-dimensional multimodal latent space, such as the CLIP image space, enabling image reconstruction with a generative model that accepts embeddings from that latent space. We comprehensively compare our method with existing approaches, using both qualitative side-by-side comparisons and quantitative evaluation, and show that MindEye achieves state-of-the-art performance in both reconstruction and retrieval tasks. In particular, MindEye retrieves the correct original image even among highly similar candidates, suggesting that its brain embeddings preserve fine-grained, image-specific information. This allows us to accurately retrieve images even from large databases such as LAION-5B. We demonstrate through ablations that MindEye's performance improvements over previous methods stem from its specialized submodules for retrieval and reconstruction, improved training techniques, and training models with orders of magnitude more parameters. Furthermore, we show that MindEye can better preserve low-level image features in its reconstructions by using img2img together with the output of a separate autoencoder. All code is available on GitHub.
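        The core retrieval idea, mapping fMRI voxels into CLIP image space with a contrastive objective, can be illustrated with a short PyTorch sketch. This is not the authors' released code (which is on GitHub); the encoder layout, voxel count, and symmetric InfoNCE loss below are illustrative assumptions only.

```python
# Minimal sketch of a retrieval-style submodule: map flattened fMRI voxels into
# CLIP image space and train with a symmetric contrastive (CLIP-style) loss.
# Layer sizes and names here are illustrative, not MindEye's actual settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FMRIEncoder(nn.Module):
    def __init__(self, num_voxels: int, clip_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_voxels, 4096), nn.GELU(),
            nn.Linear(4096, clip_dim),
        )

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.mlp(voxels), dim=-1)  # unit-norm brain embeddings

def contrastive_loss(brain_emb, clip_img_emb, temperature=0.07):
    # Symmetric InfoNCE: matching (fMRI, image) pairs lie on the diagonal.
    logits = brain_emb @ clip_img_emb.T / temperature
    targets = torch.arange(len(brain_emb), device=brain_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Usage: voxels (B, num_voxels) from one fMRI scan per viewed image;
# clip_img_emb (B, clip_dim) from a frozen CLIP image encoder, unit-normalized.
encoder = FMRIEncoder(num_voxels=15000)
voxels = torch.randn(8, 15000)
clip_img_emb = F.normalize(torch.randn(8, 768), dim=-1)
loss = contrastive_loss(encoder(voxels), clip_img_emb)
```

        At retrieval time, the same trained encoder embeds a test scan and nearest-neighbor search over CLIP image embeddings (e.g., of LAION-5B) returns candidate images.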

2.RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths

Title: RAPHAEL: Text-to-Image Generation via a Large Mixture of Diffusion Paths

Authors: Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, Ping Luo

Article link: https://arxiv.org/abs/2305.18295

Project code: https://raphael-painter.github.io

Summary:

        Text-to-image generation has recently achieved remarkable success. We introduce RAPHAEL, a text-conditional image diffusion model that generates highly artistic images which accurately portray text prompts containing multiple nouns, adjectives, and verbs. This is achieved by stacking tens of mixture-of-experts (MoE) layers, i.e., space-MoE and time-MoE layers, yielding billions of diffusion paths (routes) from the network input to the output. Each path intuitively acts as a "painter" that depicts a particular textual concept onto a specified image region at a given diffusion timestep. Comprehensive experiments show that RAPHAEL outperforms recent cutting-edge models such as Stable Diffusion, ERNIE-ViLG 2.0, DeepFloyd, and DALL-E 2 in terms of both image quality and aesthetic appeal. First, RAPHAEL excels at switching images across diverse styles, such as Japanese comics, realism, cyberpunk, and ink illustration. Second, a single model with 3 billion parameters, trained on 1,000 A100 GPUs for two months, achieves a state-of-the-art zero-shot FID score of 6.61 on the COCO dataset. Furthermore, RAPHAEL significantly outperforms its counterparts on the ViLG-300 benchmark in human evaluation. We believe RAPHAEL has the potential to push the frontiers of image generation research in both academia and industry, paving the way for future breakthroughs in this rapidly evolving field. More details can be found on the project webpage linked above.
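        To make the space-MoE idea concrete, here is a simplified PyTorch sketch of a layer that routes each text token to one expert and applies that expert only to the image positions the token attends to. The gating, masking, and shapes are assumptions for illustration, not RAPHAEL's actual architecture.

```python
# Illustrative "space-MoE" style layer: each text token is routed to one expert
# FFN, and that expert updates only the image-feature positions that the token
# attends to. Routing, thresholding, and dimensions are simplified guesses.
import torch
import torch.nn as nn

class SpaceMoE(nn.Module):
    def __init__(self, dim: int, text_dim: int, num_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(text_dim, num_experts)   # route each token to an expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
            for _ in range(num_experts)
        ])

    def forward(self, img_feats, text_feats, attn_maps, thresh=0.2):
        # img_feats: (B, N, dim) flattened image tokens
        # text_feats: (B, T, text_dim) text token embeddings
        # attn_maps: (B, T, N) cross-attention weights from text tokens to image positions
        out = img_feats.clone()
        expert_idx = self.gate(text_feats).argmax(dim=-1)    # (B, T) hard routing
        region = (attn_maps > thresh).float().unsqueeze(-1)  # (B, T, N, 1) binary region mask
        for t in range(text_feats.shape[1]):
            for e, expert in enumerate(self.experts):
                sel = expert_idx[:, t] == e                  # samples whose token t uses expert e
                if sel.any():
                    upd = expert(img_feats[sel])             # expert "paints" its concept
                    out[sel] = out[sel] + region[sel, t] * upd
        return out

# Usage with dummy shapes: 16x16 latent grid, 8 text tokens.
layer = SpaceMoE(dim=64, text_dim=32)
y = layer(torch.randn(2, 256, 64), torch.randn(2, 8, 32), torch.rand(2, 8, 256))
```

        Stacking many such layers across diffusion timesteps is what multiplies the number of distinct expert "paths" from input to output; a time-MoE layer would route on the timestep instead of on text tokens.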

3.Generating Images with Multimodal Language Models 

Title: Generating Images with Multimodal Language Models

Authors: Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov

Article link: https://arxiv.org/abs/2305.17216

Project code: http://jykoh.com/gill

Summary:

        We propose a method to fuse a frozen text-only large language model (LLM) with pretrained image encoder and decoder models by mapping between their embedding spaces. Our model exhibits a broad set of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue. Our method is the first capable of conditioning on arbitrarily interleaved image and text inputs to generate coherent image (and text) outputs. To achieve strong performance on image generation, we propose an efficient mapping network that grounds the LLM in off-the-shelf text-to-image generation models. This mapping network translates the hidden representations of text into the embedding space of the visual model, allowing us to exploit the LLM's strong text representations for visual output. Our method outperforms baseline generative models on tasks with longer and more complex language. In addition to generating novel images, our model can also retrieve images from a pre-specified dataset, and it decides at inference time whether to retrieve or generate. This is done via a learned decision module conditioned on the LLM's hidden representations. Compared with previous multimodal language models, our model exhibits a wider range of capabilities: it can process image-and-text inputs and produce retrieved images, generated images, and generated text, outperforming non-LLM-based generative models on several text-to-image tasks that measure context dependence.
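        The two components described above, a mapping network from LLM hidden states into a text-to-image model's embedding space and a decision module that chooses between retrieval and generation, could look roughly like the following PyTorch sketch. The class names (GenerationMapper, RetrieveOrGenerate), dimensions, and query-token pooling are hypothetical stand-ins, not the paper's exact implementation.

```python
# Sketch of (1) a mapping network that projects frozen-LLM hidden states into a
# text-to-image model's conditioning space, and (2) a decision head that picks
# "retrieve" vs. "generate". Names and sizes are hypothetical.
import torch
import torch.nn as nn

class GenerationMapper(nn.Module):
    """Pool a sequence of LLM hidden states into k conditioning embeddings."""
    def __init__(self, llm_dim=4096, t2i_dim=768, num_query_tokens=4, nhead=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_query_tokens, llm_dim))
        self.attn = nn.MultiheadAttention(llm_dim, nhead, batch_first=True)
        self.proj = nn.Linear(llm_dim, t2i_dim)

    def forward(self, llm_hidden):                 # (B, L, llm_dim) from the frozen LLM
        q = self.queries.unsqueeze(0).expand(llm_hidden.size(0), -1, -1)
        pooled, _ = self.attn(q, llm_hidden, llm_hidden)
        return self.proj(pooled)                   # (B, k, t2i_dim) for the image decoder

class RetrieveOrGenerate(nn.Module):
    """Binary decision head: retrieve from a fixed dataset or synthesize a new image."""
    def __init__(self, llm_dim=4096):
        super().__init__()
        self.head = nn.Linear(llm_dim, 2)

    def forward(self, llm_hidden):
        return self.head(llm_hidden.mean(dim=1))   # logits over {retrieve, generate}

# Usage with dummy hidden states standing in for the frozen LLM's outputs.
hidden = torch.randn(2, 32, 4096)
cond = GenerationMapper()(hidden)                  # condition a text-to-image decoder
choice = RetrieveOrGenerate()(hidden).argmax(-1)   # 0 = retrieve, 1 = generate
```

        Only these small modules would need training; the LLM, image encoder, and image decoder stay frozen, which is what keeps the fusion lightweight.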

More AI information: Princess AiCharm


Source: blog.csdn.net/muye_IT/article/details/131030976