Weekly Report on Large Model Papers | Frontier Research Trends from Tsinghua University, Meta AI, Microsoft, KAUST, and Other Institutions


AI TIME welcomes every AI enthusiast to join us!

Large models, also known as foundation models, extract knowledge from billion-scale corpora or image collections and learn billions of parameters. Their emergence has ushered in a new era of AI research: results have improved markedly, in many fields surpassing the gains achieved by designing algorithms tailored to specific research problems.

This week, we selected 10 outstanding papers in the field of large models, from Tsinghua University, Meta AI, Microsoft, KAUST, and other institutions.

For ease of reading, only each paper's title, authors, and AI-generated summary are listed. If you are interested, click the "paper details page" to read the original text. New papers can also be browsed by logging into the mini-program.

1. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Authors: Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny

Link: https://www.aminer.cn/pub/6442336c4c80727584270e42/

AI Survey (Large Model Driven): This paper investigates the powerful multimodal generation capabilities of MiniGPT-4, including generating websites directly from handwritten drafts, recognizing humorous elements in images, and extracting recipe information from food photos. Experiments show that training on raw image-text pairs alone can produce unnatural language output, including repetitive and fragmented sentences; a further fine-tuning stage on a curated dataset is therefore critical for generation reliability and overall usability. Notably, the model is also highly computationally efficient, using only about 5 million aligned image-text pairs. The code, pretrained model, and collected dataset are now available.

2. Safety Assessment of Chinese Large Language Models

Authors: Hao Sun, Zhexin Zhang, Jiawen Deng, Jiale Cheng, Minlie Huang

Link: https://www.aminer.cn/pub/6441ff2eed329dcc6bb74b74/

AI Survey (Large Model Driven): This paper introduces a safety assessment benchmark for Chinese large language models, covering 8 typical safety scenarios and 6 more challenging instruction-attack types. The benchmark follows a simple process: provide test prompts and assess the safety of the responses the model generates. In the evaluation, the authors leverage the strong evaluation capabilities of language models and develop them into safety evaluators. They further find that instruction attacks are more likely to expose safety issues across all LLMs. To promote the development of safe, responsible, and ethical AI, the authors publicly release SafetyPrompts, including 100k augmented prompts and responses.
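The evaluation process described above can be sketched as a simple loop, with a stubbed model under test and a stubbed LLM-based safety evaluator (both are toy stand-ins, not the paper's actual implementation):

```python
# Minimal sketch of the benchmark loop: feed each test prompt to the model
# under test, then ask an LLM-based evaluator to label the response as safe.
# Both functions here are hypothetical stubs standing in for real models.

def model_under_test(prompt: str) -> str:
    # Stub for the Chinese LLM being assessed.
    return f"Response to: {prompt}"

def llm_safety_evaluator(prompt: str, response: str) -> bool:
    # Stub for the LLM-based safety evaluator; flags a toy keyword
    # instead of actually prompting a language model.
    return "unsafe" not in response.lower()

def run_benchmark(prompts):
    results = {}
    for p in prompts:
        resp = model_under_test(p)
        results[p] = llm_safety_evaluator(p, resp)
    safe_rate = sum(results.values()) / len(results)
    return results, safe_rate

results, safe_rate = run_benchmark(["scenario: insult", "scenario: privacy leak"])
```

In the real benchmark, both the test prompts and the evaluator's judgment would come from carefully designed prompt templates rather than keyword checks.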

3. Tool Learning with Foundation Models 

Authors: Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, etc.

Link: https://www.aminer.cn/pub/643e0ad50746dc40e341a274/

AI Survey (Large Model Driven): This paper systematically investigates tool learning, proposes a general tool-learning framework, and discusses the directions and challenges of existing tool-learning research. From the perspective of understanding user instructions, models should learn to decompose a complex task into multiple subtasks, dynamically adjust their plan, and conquer each subtask by selecting appropriate tools. The paper also discusses how to train models for better tool use and how to facilitate tool learning more broadly. Overall, the authors hope this paper inspires future research on tool learning.
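The decompose-then-select loop described above can be sketched as follows; the tool names and the trivial selection rule are purely illustrative (a learned controller would make these choices in practice):

```python
# Hedged sketch of the tool-learning loop: given a task already split into
# subtasks, pick a tool for each subtask, execute it, and record a trace.

TOOLS = {
    "calculator": lambda q: str(eval(q, {"__builtins__": {}})),  # arithmetic only
    "search": lambda q: f"search results for '{q}'",             # stubbed search
}

def select_tool(subtask: str) -> str:
    # Toy heuristic standing in for a learned tool-selection model.
    return "calculator" if any(ch.isdigit() for ch in subtask) else "search"

def solve(subtasks):
    trace = []
    for sub in subtasks:
        tool = select_tool(sub)
        trace.append((sub, tool, TOOLS[tool](sub)))
    return trace

trace = solve(["2+3", "capital of France"])
```

A real system would also replan when a tool call fails, which the framework in the paper discusses under dynamic plan adjustment.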

4. Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models

Authors: Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Jianfeng Gao

Link: https://www.aminer.cn/pub/6440ad89ed329dcc6b838a0f/

AI Survey (Large Model Driven): This paper introduces Chameleon, a plug-and-play compositional reasoning framework that addresses inherent limitations of large language models. Chameleon synthesizes programs that compose various tools, including LLMs, off-the-shelf vision models, web search engines, Python functions, and rule-based modules. Using an LLM as a natural-language planner, Chameleon infers and executes the appropriate sequence of tools to generate the final response. Chameleon's adaptability and effectiveness are demonstrated on two tasks: ScienceQA and TabMWP.
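The plan-then-execute composition can be sketched as below; the planner is stubbed with a fixed module sequence, and the module names are hypothetical, not taken from the paper's code:

```python
# Illustrative sketch of Chameleon-style composition: a planner (stubbed
# here) emits a sequence of module names, which are executed in order,
# each module consuming the previous module's output.

MODULES = {
    "table_reader": lambda x: x + " | table: [rows]",
    "program_generator": lambda x: x + " | program: result=42",
    "answer_generator": lambda x: "42",
}

def plan(query: str):
    # A real planner would be an LLM prompted with tool descriptions.
    return ["table_reader", "program_generator", "answer_generator"]

def execute(query: str) -> str:
    state = query
    for name in plan(query):
        state = MODULES[name](state)
    return state

answer = execute("What is 6*7 given the table?")
```

The key design choice is that modules are plug-and-play: adding a new tool only requires registering it and describing it to the planner.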

5. Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

Authors: Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, Karsten Kreis

Link: https://www.aminer.cn/pub/643f5c4336af860e941aca50/

AI Survey (Large Model Driven): This paper applies latent diffusion models (LDMs) to high-resolution video generation. Following the LDM paradigm, the authors first pretrain an image generator, then turn it into a video generator by introducing a temporal dimension into the latent-space diffusion model and fine-tuning on encoded image sequences, i.e., videos. The approach is validated on several real-world applications, including simulation of in-the-wild driving data and creative text-to-video content creation. Exploiting this design, the authors show that the approach transfers to differently fine-tuned text-to-image LDMs, opening up a future direction for content creation.
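The key reshaping idea can be sketched with numpy: pretrained image-LDM layers treat a video of t frames as a batch of b*t independent images, while an inserted temporal layer exposes the frame axis and mixes across it. The "temporal layer" here is a toy moving average, standing in for the learned temporal attention/convolution layers:

```python
import numpy as np

def spatial_layer(x):
    # Stands in for the frozen image-LDM layers: operates on shape
    # (b*t, c, h, w), treating every frame independently.
    return x * 1.0

def temporal_layer(x, b, t):
    # Reshape to expose the frame axis, mix neighbouring frames,
    # then flatten back so the next spatial layer is unaffected.
    z = x.reshape(b, t, *x.shape[1:])          # (b, t, c, h, w)
    z = (z + np.roll(z, 1, axis=1)) / 2.0      # toy temporal mixing
    return z.reshape(b * t, *x.shape[1:])

b, t, c, h, w = 2, 4, 3, 8, 8
latents = np.random.randn(b * t, c, h, w)
out = temporal_layer(spatial_layer(latents), b, t)
```

Because the temporal layer restores the (b*t, c, h, w) layout, it can be interleaved between pretrained spatial layers without changing their interfaces.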

6. Learning to Compress Prompts with Gist Tokens

Authors: Jesse Mu, Xiang Lisa Li, Noah Goodman

Link: https://www.aminer.cn/pub/643e0ad60746dc40e341a410/

AI Survey (Large Model Driven): Prompts occupy significant space in a language model's input context window, and repeatedly re-encoding the same prompt is computationally inefficient. The authors propose gisting, which trains a language model to compress prompts into smaller sets of "gist" tokens that can be cached and reused for computational efficiency.
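One way gisting can be realized is through an attention mask: tokens after the gist positions are blocked from attending to the raw prompt, so the prompt's information must flow through the gist tokens. A sketch of such a mask construction (a generic illustration of the idea, not the paper's exact code):

```python
import numpy as np

# Positions are laid out as [prompt ... | gist ... | completion ...].
# mask[i, j] = True means position i may attend to position j (causal).

def gist_mask(n_prompt: int, n_gist: int, n_completion: int) -> np.ndarray:
    n = n_prompt + n_gist + n_completion
    mask = np.tril(np.ones((n, n), dtype=bool))   # standard causal mask
    # Completion tokens may not see the raw prompt, only the gist tokens.
    mask[n_prompt + n_gist:, :n_prompt] = False
    return mask

m = gist_mask(n_prompt=3, n_gist=2, n_completion=2)
```

After training with such a mask, the gist tokens' activations can be cached once per prompt and reused across many completions.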

7. Generative Disco: Text-to-Video Generation for Music Visualization

Authors: Vivian Liu, Tao Long, Nathan Raw, Lydia Chilton

Link: https://www.aminer.cn/pub/643f5c3d36af860e941a8ee5/

AI Survey (Large Model Driven): This paper presents a generative AI system for music visualization that combines large language models and text-to-image models. Users select intervals of music to visualize and parameterize the visualization by defining start and end prompts. A study with professionals showed that the system is enjoyable, easy to explore, and highly expressive. The authors' findings suggest that generative AI can enrich creative authoring workflows.
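The start/end-prompt parameterization can be sketched as interpolation between two prompt embeddings, one blended embedding per generated frame; the random vectors below are stand-ins for a real text encoder's output:

```python
import numpy as np

# Sketch: a music interval is visualized by linearly interpolating from
# the start prompt's embedding to the end prompt's embedding, producing
# one conditioning vector per frame of the interval.

def interval_embeddings(start_emb, end_emb, n_frames: int):
    ts = np.linspace(0.0, 1.0, n_frames)
    return np.stack([(1 - t) * start_emb + t * end_emb for t in ts])

start = np.random.randn(16)   # stand-in for encode("neon city at night")
end = np.random.randn(16)     # stand-in for encode("sunrise over water")
frames = interval_embeddings(start, end, n_frames=5)
```

Each interpolated embedding would then condition a text-to-image model, so the visuals drift smoothly from the start prompt toward the end prompt over the interval.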

8. Hyperbolic Image-Text Representations

Authors: Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, Justin Johnson, Ramakrishna Vedantam

Link: https://www.aminer.cn/pub/643f5c4336af860e941ad641/

AI Survey (Large Model Driven): This paper presents MERU, a contrastive model that yields hyperbolic representations of images and text. Hyperbolic space has geometric properties well suited to embedding tree-like data, allowing MERU to better capture the underlying hierarchy of image-text data.
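A minimal sketch of the underlying geometry, in the Lorentz model of hyperbolic space with curvature -1: Euclidean encoder outputs are lifted onto the hyperboloid with the exponential map at the origin, and similarity is measured by geodesic distance. This is generic hyperbolic geometry, not MERU's actual implementation:

```python
import numpy as np

def exp_map_origin(v):
    # Lift a tangent vector v (spatial coordinates only) at the origin
    # (1, 0, ..., 0) onto the hyperboloid <x, x>_L = -1.
    norm = np.linalg.norm(v)
    if norm == 0:
        return np.concatenate([[1.0], np.zeros_like(v)])
    return np.concatenate([[np.cosh(norm)], np.sinh(norm) * v / norm])

def lorentz_distance(x, y):
    # Geodesic distance: arccosh of the negated Lorentzian inner product.
    inner = -x[0] * y[0] + np.dot(x[1:], y[1:])
    return np.arccosh(np.clip(-inner, 1.0, None))

a = exp_map_origin(np.array([0.3, 0.0]))
b = exp_map_origin(np.array([0.0, 0.4]))
d = lorentz_distance(a, b)
```

Points near the origin behave almost Euclidean while volume grows exponentially with radius, which is what makes this space natural for tree-like (hierarchical) data.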

9. Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation

Authors: Jie An, Songyang Zhang, Harry Yang, Sonal Gupta, Jia-Bin Huang, Jiebo Luo, Xi Yin

Link: https://www.aminer.cn/pub/643e0ad60746dc40e341a425/

AI Survey (Large Model Driven): This paper proposes Latent-Shift, an efficient text-to-video generation method based on a pretrained text-to-image model, consisting of an autoencoder and a U-Net diffusion model. Learning a video diffusion model in latent space is far more efficient than in pixel space, where the pipeline is complex and computationally expensive: a low-resolution video is generated first and then passed through a sequence of frame-interpolation and super-resolution models. To extend image generation to video generation, prior work adds extra modules such as 1D temporal convolutions and temporal attention layers; Latent-Shift instead uses a parameter-free temporal shift.
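The temporal-shift idea can be sketched in a few lines: within a latent tensor, one slice of channels is shifted one frame forward in time and another slice one frame backward, so subsequent spatial layers see information from neighbouring frames at zero parameter cost. The fold factor below is illustrative:

```python
import numpy as np

# Sketch of a parameter-free temporal shift on latents of shape
# (t, c, h, w): shift c//fold channels forward one frame, the next
# c//fold channels backward one frame, and leave the rest unchanged.

def temporal_shift(x: np.ndarray, fold: int = 4) -> np.ndarray:
    t, c, h, w = x.shape
    k = c // fold
    out = np.zeros_like(x)
    out[1:, :k] = x[:-1, :k]             # first k channels: shift forward
    out[:-1, k:2 * k] = x[1:, k:2 * k]   # next k channels: shift backward
    out[:, 2 * k:] = x[:, 2 * k:]        # remaining channels: unchanged
    return out

x = np.random.randn(4, 8, 2, 2)
y = temporal_shift(x)
```

Because the operation has no parameters, the pretrained text-to-image U-Net's weights can be reused directly while still exchanging information across frames.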

10. Visual Instruction Tuning

Authors: Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee

Link: https://www.aminer.cn/pub/643e0ad60746dc40e341a515/

AI Survey (Large Model Driven): This paper presents the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. Instruction tuning on this generated data yields LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. Early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and achieves an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset.

Reminder

Click "Read the original text" to view the paper details page!


Remember to follow us! New knowledge every day!

 About AI TIME 

AI TIME was founded in 2019. It aims to promote the spirit of scientific inquiry and debate, inviting people from all walks of life to explore fundamental questions in AI theory, algorithms, and applications, to encourage the collision of ideas, and to connect AI scholars, industry experts, and enthusiasts worldwide. Through debate, it explores the tension between artificial intelligence and humanity's future, and the future of the AI field itself.

To date, AI TIME has invited more than 1,000 speakers from home and abroad, held more than 550 events, and reached more than 6 million viewers.





Origin blog.csdn.net/AITIME_HY/article/details/130437290