Real-time tracking of scientific research trends | New papers selected on September 21 from Meta AI, Microsoft, Tsinghua University, and other institutions

As a researcher, you need to search and skim a large volume of academic literature every day to keep up with the latest scientific progress and research results.

However, traditional retrieval and reading workflows can no longer keep pace with researchers' needs.

AMiner AI is a literature tool that integrates retrieval, reading, and knowledge Q&A. It helps you retrieve and read papers more efficiently, stay on top of the latest research trends in your field, and makes research work easier.

If you want to discuss a particular paper in depth, paste the paper's link into your browser, or go directly to the AMiner AI page: https://www.aminer.cn/chat/g/explain

List of selected new papers on September 21, 2023:

1.End-to-End Speech Recognition Contextualization with Large Language Models

This paper introduces a method for contextualizing speech recognition models with Large Language Models (LLMs). Speech recognition is cast as a mixed-modality language-modeling task on top of a pretrained LLM: the decoder receives audio features, plus optional text tokens as context, and completes the transcription. During training, the system therefore automatically learns how to exploit unstructured contextual information. Empirical results show that providing additional text context significantly improves performance, reducing WER by 6%. Moreover, the method improves WER by 7.5% overall, and by 17% on rare words, compared to a baseline contextualized RNN-T system trained on a speech corpus more than 25 times larger. Overall, the authors demonstrate that adding an adapter with a small number of trainable parameters unlocks contextualized speech recognition for pretrained LLMs while retaining the same text-only input capability.

https://www.aminer.cn/pub/650ba7c03fda6d7f06e6115e/?f=cs
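The core idea above — a decoder prefix made of optional text-context tokens followed by projected audio features — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the shapes, the name `assemble_llm_input`, and the assumption that audio features have already been projected to the LLM's embedding dimension (e.g., by the adapter) are mine.

```python
import numpy as np

def assemble_llm_input(audio_feats, context_ids, token_embedding):
    """Mixed-modality prefix sketch: [context token embeddings ; audio features].

    audio_feats:     (T_audio, d) speech features already projected to the
                     LLM embedding size (assumed done by a small adapter).
    context_ids:     optional list of token ids giving textual context.
    token_embedding: (vocab, d) embedding table of the pretrained LLM.
    """
    parts = []
    if context_ids:  # context is optional, so the model also learns without it
        parts.append(token_embedding[np.array(context_ids)])
    parts.append(audio_feats)
    # The frozen LLM decoder would then autoregressively emit the
    # transcription conditioned on this prefix.
    return np.concatenate(parts, axis=0)
```

With context the prefix is longer; without it, the model falls back to audio-only conditioning, matching the paper's claim that text-only behavior is preserved.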

2.The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute

The article addresses a concrete problem: how to conduct language-modeling research when compute is limited. It describes an experimental protocol that compares models at equivalent compute, measured in accelerator hours, which avoids constraining key hyperparameters such as total parameter count or floating-point operations. The article also provides two baseline models, and experiments show that the improved LSTM model scales more favorably. The authors hope this work lays a foundation for meaningful and reproducible language-modeling research.

https://www.aminer.cn/pub/650ba7c03fda6d7f06e612a0/?f=cs
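The accelerator-hour protocol above can be made concrete with a toy calculation: each model trains on however many tokens it can process within the same wall-clock budget on a reference accelerator. This is my own illustrative sketch of the idea, not code from the paper, and the throughput numbers are invented.

```python
def tokens_for_budget(tokens_per_second, accelerator_hours):
    """Tokens a model may train on within a fixed accelerator-hour budget."""
    return int(tokens_per_second * accelerator_hours * 3600)

# Two hypothetical models with different throughput get different token
# counts, but identical wall-clock compute on the reference accelerator,
# so the comparison does not penalize slower-but-stronger architectures
# by fixing parameter counts or FLOPs instead.
fast_model_tokens = tokens_for_budget(40_000, accelerator_hours=6)
slow_model_tokens = tokens_for_budget(25_000, accelerator_hours=6)
```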

3.A Large-scale Dataset for Audio-Language Representation Learning

The paper describes current problems in audio-language representation learning: audio-text datasets are small in scale, simplistic in content, and laborious to collect. To address this, the research team proposes an automatic audio-caption generation pipeline built on public tools and APIs, and constructs a large-scale, high-quality audio-language dataset (Auto-ACD) containing more than 1.9 million audio-text pairs. To demonstrate the dataset's effectiveness, popular models are trained on it and show performance gains on various downstream tasks (e.g., audio-language retrieval, audio captioning, environment classification). In addition, the team builds a novel test set and provides a benchmark for audio-text tasks.

https://www.aminer.cn/pub/650ba7c03fda6d7f06e613ef/?f=cs

4.Controllable Dynamic Appearance for Neural 3D Portraits

The article addresses the problem of controlling dynamic appearance in Neural Radiance Fields (NeRF). Recent advances in NeRF make it possible to reconstruct and re-animate dynamic portrait scenes with control over head pose, facial expression, and viewing direction. However, training such a model assumes photometric consistency over deformed regions (e.g., the face): the face must remain uniformly lit as head pose and expression change. Such consistency is hard to maintain even between frames of a studio-captured video, so artifacts readily appear when reanimating dynamic portraits. To address this, the authors propose CoDyNeRF, a system that creates fully controllable 3D portraits under real-world capture conditions. CoDyNeRF learns to approximate illumination-dependent effects in a canonical space through a dynamic appearance model conditioned on predicted surface normals and on the deformations induced by facial expression and head pose. Because the rigid and non-rigid deformations caused by changing head pose and expression make direct normal prediction difficult, surface normals are predicted with the help of a 3D shape model that serves as a coarse prior on human head surface normals. Training only on a short smartphone-captured video of the subject, the authors demonstrate the effectiveness of their method for free-viewpoint synthesis of portrait scenes with explicit head-pose and expression control and realistic lighting effects.

https://www.aminer.cn/pub/650ba7c03fda6d7f06e611c9/?f=cs

5.LMDX: Language Model-based Document Information Extraction and Localization

This paper points out obstacles to applying language models to information extraction from semi-structured documents. These include the absence of layout encoding in LLMs, which is critical for high-quality extraction, and the lack of a grounding mechanism to ensure that answers are not hallucinated. Because of these problems, LLMs have not yet been successfully applied to semi-structured document information extraction tasks.

https://www.aminer.cn/pub/650ba7c03fda6d7f06e61185/?f=cs

6.DreamLLM: Synergistic Multimodal Comprehension and Creation

The paper introduces DreamLLM, a learning framework that achieves, for the first time, multimodal large language models (MLLMs) empowered by the frequently overlooked synergy between multimodal comprehension and creation. DreamLLM operates on two fundamental principles. First, it performs generative modeling of both language and image posteriors by sampling directly in the raw multimodal space. This circumvents the limitations and information loss of external feature extractors such as CLIP and allows a more thorough understanding of multimodal data. Second, DreamLLM fosters the generation of raw interleaved documents, modeling both text and image content along with unstructured layout. This enables DreamLLM to effectively learn all conditional, marginal, and joint multimodal distributions. As a result, DreamLLM is the first MLLM capable of generating free-form interleaved content. Comprehensive experiments show that DreamLLM performs strongly as a zero-shot multimodal generalist, benefiting from the enhanced learning synergy.

https://www.aminer.cn/pub/650ba7c03fda6d7f06e613ee/?f=cs

7.FreeU: Free Lunch in Diffusion U-Net

In the traditional U-Net structure, the backbone primarily performs denoising, while skip connections mainly inject high-frequency features, which can cause the network to neglect the backbone's semantic information. The authors propose a simple yet effective method, "FreeU", to improve the quality of generative models without additional training or fine-tuning. By appropriately re-weighting the contributions of U-Net's skip-connection and backbone feature maps, the strengths of both components of the U-Net structure can be fully exploited. The method achieves satisfying results on image and video generation tasks and can be easily integrated into existing diffusion models, improving generation quality by modifying only two scaling factors.

https://www.aminer.cn/pub/650ba7c03fda6d7f06e613ec/?f=cs
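The re-weighting described above can be sketched as follows: amplify backbone features by one factor to strengthen semantics, and attenuate the low-frequency part of the skip features by another so that their high-frequency detail dominates. This is a simplified illustration, not the paper's implementation; the function name, the NumPy setting, the circular low-frequency mask, and the specific factor values are my assumptions.

```python
import numpy as np

def freeu_merge(backbone, skip, b=1.2, s=0.9):
    """Illustrative FreeU-style re-weighting before decoder concatenation.

    backbone, skip: feature maps of shape (C, H, W).
    b: amplification factor for backbone features (semantics).
    s: attenuation factor for low-frequency skip features (details).
    """
    # Amplify the backbone feature map to strengthen its denoising role.
    backbone = backbone * b
    # Attenuate low-frequency components of the skip connection in the
    # Fourier domain, leaving high-frequency detail relatively stronger.
    freq = np.fft.fftshift(np.fft.fft2(skip, axes=(-2, -1)), axes=(-2, -1))
    _, H, W = skip.shape
    mask = np.ones((H, W))
    r = min(H, W) // 4                     # low-frequency radius (assumed)
    cy, cx = H // 2, W // 2
    yy, xx = np.ogrid[:H, :W]
    mask[(yy - cy) ** 2 + (xx - cx) ** 2 <= r * r] = s
    skip = np.fft.ifft2(np.fft.ifftshift(freq * mask, axes=(-2, -1)),
                        axes=(-2, -1)).real
    # A standard U-Net decoder concatenates the two along channels.
    return np.concatenate([backbone, skip], axis=0)
```

Only `b` and `s` change the behavior, which mirrors the paper's point that integration into an existing diffusion model amounts to tuning two scaling factors.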

8.Kosmos-2.5: A Multimodal Literate Model

The paper introduces Kosmos-2.5, a multimodal literate model for machine reading of text-intensive images. The model excels at two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block is assigned its spatial coordinates within the image, and (2) producing structured text output in Markdown format that captures styles and structure. These unified multimodal literate capabilities are achieved through a shared Transformer architecture, task-specific prompts, and flexible text representations. Kosmos-2.5 is evaluated on end-to-end document-level text recognition and image-to-Markdown text generation. Moreover, the model can be readily adapted, via supervised fine-tuning with different prompts, to any text-intensive image understanding task, making it a versatile tool for practical applications involving text-rich images. This work also paves the way for the future scaling of multimodal large language models.

https://www.aminer.cn/pub/650ba7c03fda6d7f06e6139a/?f=cs

9.Chain-of-Verification Reduces Hallucination in Large Language Models

The article addresses a problem that large language models cannot ignore: the generation of plausible but factually incorrect information, so-called "hallucination". The study explores the ability of language models to correct their mistakes by deliberating on their own responses. The authors develop a method called Chain-of-Verification (CoVe), in which the model (i) drafts an initial response, then (ii) plans verification questions to fact-check its draft, (iii) answers those questions independently so that the answers are not biased by other responses, and finally (iv) generates its verified final response. Experiments show that CoVe reduces hallucinations across a variety of tasks, including list-based questions from Wikidata, closed-book MultiSpanQA, and long-form text generation.

https://www.aminer.cn/pub/650ba7c03fda6d7f06e613ea/?f=cs
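The four CoVe steps above form a simple prompting loop around any text-in/text-out model. The sketch below is my own minimal rendering of that loop, not the paper's code: `llm` stands for any callable mapping a prompt string to a response string, and the prompt wording is illustrative.

```python
def chain_of_verification(question, llm):
    """Sketch of the CoVe loop; `llm` is any callable: prompt str -> text str."""
    # (i) Draft an initial response.
    draft = llm(f"Answer the question:\n{question}")
    # (ii) Plan verification questions that probe the draft's factual claims.
    plan = llm(f"Question: {question}\nDraft answer: {draft}\n"
               "List verification questions, one per line:")
    checks = [q for q in plan.splitlines() if q.strip()]
    # (iii) Answer each verification question independently, without showing
    # the draft, so the answers do not inherit its biases.
    answers = [(q, llm(f"Answer concisely: {q}")) for q in checks]
    facts = "\n".join(f"Q: {q}\nA: {a}" for q, a in answers)
    # (iv) Produce a final response revised against the checked facts.
    return llm(f"Question: {question}\nDraft: {draft}\n"
               f"Checked facts:\n{facts}\n"
               "Write a final, corrected answer:")
```

The key design point is step (iii): each verification question is answered in a fresh prompt that omits the draft, which is what lets the model catch its own errors rather than restate them.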


END

We have added a "Daily Selected New Papers" topic on the homepage of the AMiner website. Click "Subscribe" and "Add to Knowledge Base" to receive all the paper information!


View all featured new papers: https://www.aminer.cn

Origin blog.csdn.net/AI_Conf/article/details/133176934