Real-time tracking of research trends | New papers selected by Liu Ting and others on August 16, with a ChatPaper review

As a researcher, you need to search and browse a large amount of academic literature every day to keep up with the latest scientific and technological progress and research results. Traditional retrieval and reading methods, however, can no longer keep pace with these needs.

ChatPaper is a literature knowledge tool that integrates retrieval, reading, and knowledge Q&A. It helps you retrieve and read papers more efficiently, stay on top of the latest research trends in your field, and makes research work more comfortable.


Combined with the frontier-news subscription feature, it selects the most popular new arXiv papers of the day and compiles them into a review, so everyone can catch up on cutting-edge developments more quickly.

If you want to have an in-depth conversation about a particular paper, copy the paper's link into your browser or go directly to the ChatPaper page: https://www.aminer.cn/chat/g/explain

List of selected new papers on August 16, 2023:

1.The Five-Dollar Model: Generating Game Maps and Sprites from Sentence Embeddings

The paper introduces the "Five-Dollar Model", a lightweight text-to-image generation architecture that generates low-dimensional images from encoded text prompts. Even with small model and dataset sizes, the generated images preserve the semantic meaning of the prompts. The model is applied to three small datasets: pixel-art video game maps, video game character sprites, and scaled-down emoji images, with novel augmentation strategies employed to improve performance on these limited datasets. To evaluate the model, the researchers used the CLIP ViT-B/32 model to measure the cosine similarity of text-image pairs.

https://www.aminer.cn/pub/64d30f353fda6d7f06f6c9e6/
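As a minimal sketch of the CLIP-based evaluation described above, the snippet below computes the cosine similarity between a text embedding and an image embedding. The toy 4-d vectors stand in for real CLIP ViT-B/32 outputs (512-d in practice); all names and values here are illustrative, not the paper's code.

```python
import numpy as np

def clip_score(text_emb: np.ndarray, image_emb: np.ndarray) -> float:
    """Cosine similarity between a text embedding and an image embedding,
    the metric used for CLIP-based text-image evaluation."""
    t = text_emb / np.linalg.norm(text_emb)
    i = image_emb / np.linalg.norm(image_emb)
    return float(np.dot(t, i))

# Toy embeddings standing in for CLIP ViT-B/32 outputs.
text = np.array([1.0, 0.0, 1.0, 0.0])
image = np.array([1.0, 0.0, 1.0, 0.0])
print(round(clip_score(text, image), 4))  # identical directions → 1.0
```

A higher score means the generated image is semantically closer to its prompt; averaging the score over a test set gives a single quality number for the model.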

2.Helping Hands: An Object-Aware Ego-Centric Video Recognition Model

To enhance object awareness, the model predicts hand positions, object positions, and objects' semantic labels by using paired captions during training. At inference time, it requires only RGB frames as input and is able to track and locate objects, although it is not explicitly trained to do so. The learned object awareness is demonstrated by evaluating the model on zero-shot tests and by using the learned representations as input for long-term video understanding tasks. Furthermore, by using noisy image-level detections as pseudo-labels during training, the model learns through video consistency to produce better bounding boxes and to describe these objects specifically in the associated text. In summary, this model can serve as a drop-in replacement for egocentric video models, improving performance through visual-text alignment.

https://www.aminer.cn/pub/64dc49933fda6d7f06389f78/

3.Link-Context Learning for Multimodal LLMs

The paper points out that current multimodal large language models (MLLMs) and large language models (LLMs) are still unable to recognize unseen images or understand new concepts, despite being trained on large-scale datasets. The researchers propose Link-Context Learning (LCL), which emphasizes enhancing the learning ability of MLLMs through "causal-relationship reasoning". By providing demonstrations with causal links, LCL guides models not only to identify analogical relationships but also to understand potential causal connections between data points, thereby more effectively recognizing unseen images and understanding new concepts. To evaluate this new approach, the researchers introduce the ISEKAI dataset, which contains only unseen image-label pairs generated for link-context learning. Extensive experiments show that, compared with ordinary MLLMs, the LCL-MLLM exhibits strong link-context learning of new concepts.

https://www.aminer.cn/pub/64dc49933fda6d7f06389f5c/

4.Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval

In text-video retrieval, current methods face a key issue when adapting pre-trained text-image foundation models such as CLIP to the video domain: how to effectively use CLIP's image encoder to capture the rich semantic information in videos. To address this, existing methods adopt complex cross-modal modeling techniques to fuse text information into video frame representations. In large-scale retrieval systems, however, this causes serious efficiency problems, because the video representations must be recomputed online for each text query. In this paper, the authors abandon the problematic cross-modal fusion process and instead learn semantically enhanced representations purely from video, so that the video representations can be computed offline and reused across different text queries. Specifically, they first introduce a spatial-temporal "prompt cube" into the CLIP image encoder and iteratively switch it within the encoder layers, effectively incorporating global video semantics into the frame representations. They then propose an auxiliary video captioning objective for training the frame representations, which provides fine-grained guidance in the semantic space and facilitates detailed video semantic learning. With a simple temporal fusion strategy (average pooling) over the enhanced frame representations, the method achieves state-of-the-art performance on three benchmark datasets (MSR-VTT, MSVD, and LSMDC).

https://www.aminer.cn/pub/64dc49903fda6d7f06389c6e/
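The offline average-pooling fusion described above can be sketched as follows, with toy NumPy vectors standing in for CLIP frame and text embeddings (all shapes, names, and values are illustrative, not the paper's implementation):

```python
import numpy as np

def video_embedding(frame_embs: np.ndarray) -> np.ndarray:
    """Temporal fusion by average pooling: (num_frames, dim) -> (dim,).
    The pooled, normalized vector is computed once offline and then
    reused for every incoming text query."""
    v = frame_embs.mean(axis=0)
    return v / np.linalg.norm(v)

def retrieve(text_emb: np.ndarray, video_embs: np.ndarray) -> np.ndarray:
    """Rank precomputed video embeddings by cosine similarity to one query."""
    t = text_emb / np.linalg.norm(text_emb)
    scores = video_embs @ t          # one matrix-vector product per query
    return np.argsort(-scores)       # indices of videos, best first

frames = np.array([[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]])  # toy frame features
v = video_embedding(frames)  # unit-norm video vector, computed offline
```

Because no text enters `video_embedding`, the expensive per-video work happens once; each query then costs only a dot product against the stored vectors, which is exactly the efficiency argument the paper makes.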

5.Through the Lens of Core Competency: Survey on Evaluation of Large Language Models

The article describes the problems facing large language model (LLM) evaluation. First, traditional natural language processing (NLP) tasks have become inadequate because of LLMs' excellent performance on them. Second, existing evaluation tasks struggle to keep up with the wide range of LLM applications in real-life scenarios. To address these issues, researchers have proposed various benchmarks to evaluate LLMs more thoroughly. To shed light on the many tasks used for LLM assessment in academia and industry, the authors survey a number of papers on LLM evaluation. They summarize four core capabilities of LLMs: reasoning, knowledge, reliability, and safety. For each capability, they present its definition, corresponding benchmarks, and metrics. Under this capability structure, similar tasks are merged to reflect the corresponding capability, and new tasks can easily be added to the system. Finally, they offer recommendations for future directions in LLM evaluation.

https://www.aminer.cn/pub/64dc49933fda6d7f06389f68/

6.Text Injection for Capitalization and Turn-Taking Prediction in Speech Models

The research explores the use of text injection in automatic speech recognition (ASR). Text injection, which uses unpaired text-only data to supplement paired audio-text data, has shown promising reductions in word error rate. This study also investigates the effect of text injection on auxiliary tasks, i.e., non-ASR tasks typically performed by E2E models. Joint End-to-End and Internal Language Model Training (JEIT) is used as the text-injection algorithm to train an ASR model that performs two auxiliary tasks. The first is capitalization, i.e., restoring proper casing to the recognized text. The second is turn-taking prediction, which attempts to determine whether a user has finished their conversational turn in a digital-assistant interaction. The results show that this text-injection method improves capitalization performance on long-tail data and improves the recall of turn-taking detection.

https://www.aminer.cn/pub/64dc49903fda6d7f06389b6c/

7.RAVEN: In-Context Learning with Retrieval Augmented Encoder-Decoder Language Models

The paper discusses the in-context learning capabilities of retrieval-augmented encoder-decoder language models and points out the limitations of the current state-of-the-art ATLAS model in in-context learning, which stem mainly from the mismatch between pre-training and testing and from the limited context length. To address these problems, the authors propose RAVEN, a model that combines retrieval-augmented masked language modeling and prefix language modeling. They further propose Fusion-in-Context Learning to enhance the model's few-shot performance, enabling it to exploit more in-context examples without additional training or model modifications. Extensive experiments show that RAVEN significantly outperforms ATLAS and, in some cases, achieves results comparable to state-of-the-art language models with far fewer parameters. This work highlights the potential of retrieval-augmented encoder-decoder language models for in-context learning and encourages further research in this direction.

https://www.aminer.cn/pub/64dc49933fda6d7f06389f7c/

8.REFORMS: Reporting Standards for Machine Learning Based Science

This article points out that machine learning is increasingly used in scientific research, but its application is accompanied by failures of validity, reproducibility, and generalization. These failures can impede scientific progress, lead to false consensus around invalid claims, and undermine the credibility of machine-learning-based science. The paper also observes that these failures occur in similar ways across different disciplines. Based on this observation, the authors aim to provide clear reporting standards for machine-learning-based science. Through an extensive review of past literature, they propose the REFORMS (Reporting Standards For Machine Learning Based Science) checklist, which comprises 32 questions and a set of guidelines. REFORMS was developed based on the consensus of 19 researchers in computer science, data science, mathematics, the social sciences, and the biomedical sciences. It can serve as a reference for researchers when designing and conducting studies, for reviewers when evaluating papers, and for journals when enforcing transparency and reproducibility standards.

https://www.aminer.cn/pub/64dc49933fda6d7f06389f1b/

9.Backward Reasoning in Large Language Models for Verification

The paper introduces a method that uses backward reasoning for verification in large language models. The authors propose a new approach to verifying the correctness of candidate answers: a token in the question is masked, a candidate answer is supplied, and the language model is asked to predict the masked token. They further propose combining forward and backward reasoning to estimate the probability of each candidate answer. Experimental results show that this method achieves state-of-the-art performance on various reasoning tasks.

https://www.aminer.cn/pub/64dc49903fda6d7f06389ce0/
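One way to read the forward-backward combination described above is as a joint score over both directions. The sketch below is a minimal illustration only: the probabilities, the geometric-mean combination, and the `alpha` weight are assumptions made for demonstration, not the paper's exact formulation.

```python
import math

def combined_score(p_forward: float, p_backward: float, alpha: float = 0.5) -> float:
    """Weighted geometric mean (computed in log space) of the forward
    probability of a candidate answer and the backward probability of
    recovering the masked question token given that answer.
    `alpha` is a hypothetical mixing weight."""
    return math.exp(alpha * math.log(p_forward) + (1 - alpha) * math.log(p_backward))

# Hypothetical per-candidate (forward, backward) probabilities.
candidates = {"42": (0.6, 0.7), "41": (0.3, 0.1)}
best = max(candidates, key=lambda a: combined_score(*candidates[a]))
print(best)  # → 42
```

The idea is that a wrong answer may look plausible in the forward direction but fails to reconstruct the masked question token, so combining both directions filters it out.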

10.Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model

The authors propose Dancing Avatar, a method that automatically generates each video frame with a pre-trained T2I diffusion model while maintaining contextual relevance. It addresses the problems of keeping the figure and clothing consistent across different poses and keeping the background continuous throughout various human movements. To ensure consistent character appearance across the video, the authors design an intra-frame alignment module that combines text-guided character knowledge with the pre-trained T2I diffusion model. To maintain background continuity, they propose a background alignment process that combines insights from segmentation and image-inpainting techniques. Furthermore, they propose an inter-frame alignment module, inspired by autoregressive processes, to enhance temporal consistency between adjacent frames. Compared to existing state-of-the-art methods, Dancing Avatar offers significant advantages in character and background fidelity as well as temporal coherence.

https://www.aminer.cn/pub/64dc49903fda6d7f06389cd7/

11.A Survey on Model Compression for Large Language Models

The sheer size and computational requirements of large language models (LLMs) pose challenges for practical deployment, especially in resource-limited environments. As these challenges grow more pressing, model compression has emerged as an important research area for alleviating these limitations. This article provides a comprehensive survey of model compression techniques tailored specifically for LLMs. To address the urgent need for efficient deployment, it explores various methodologies, including quantization, pruning, and knowledge distillation. Among these techniques, the survey highlights the latest advances and innovative methods that have played an important role in advancing LLM research. It also discusses the importance of benchmarking strategies and evaluation metrics for assessing the effectiveness of compressed LLMs. By offering insights into recent developments and practical applications, this survey is a valuable resource for researchers and practitioners alike. As LLMs continue to evolve, it aims to promote greater efficiency and real-world applicability, laying the foundation for future progress.

https://www.aminer.cn/pub/64dc49903fda6d7f06389c5f/
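As one concrete example of the compression families such a survey covers, here is a minimal sketch of symmetric per-tensor int8 weight quantization (the function names and toy weights are illustrative, not from the paper):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: map float weights into
    [-127, 127] with a single scale factor, cutting storage to 1 byte
    per weight. Returns the quantized tensor and the scale needed to
    dequantize."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 tensor."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.0, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
err = np.max(np.abs(dequantize(q, s) - w))  # rounding error, at most scale/2
```

Real LLM quantization schemes (e.g., per-channel scales, GPTQ-style calibration) refine this basic recipe, but the storage-versus-accuracy trade-off is the same.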


Origin blog.csdn.net/AI_Conf/article/details/132358310