Real-time tracking of research trends | New papers selected by Zhu Songchun, Yu Yong, Juergen Gall, and others on August 23, with ChatPaper reviews

As a researcher, you need to search and browse a large volume of academic literature every day to keep up with the latest scientific and technological advances and research results.

However, traditional ways of retrieving and reading papers no longer meet researchers' needs.

ChatPaper is a literature tool that integrates retrieval, reading, and knowledge Q&A. It helps you search and read papers more efficiently, stay on top of the latest research trends in your field, and make research work easier.

Combined with the cutting-edge news subscription feature, it selects the most popular new arXiv papers of the day and compiles them into a review, letting everyone catch up on frontier news more quickly.

If you want an in-depth conversation about a particular paper, paste the paper's link into your browser or go straight to the ChatPaper page: https://www.aminer.cn/chat/g/explain

List of selected new papers on August 23, 2023:

1. UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding (Read the original text)

This paper introduces a general large-scale multimodal model called UniDoc that can perform text detection, recognition, localization and understanding simultaneously. In the era of large language models (LLMs), the field of multimodal understanding has made tremendous progress. However, existing advanced algorithms have limitations in effectively utilizing the huge representational power and rich world knowledge possessed by large pre-trained models, and the beneficial connections between tasks in text-rich scenarios have not been fully explored. UniDoc is a novel multimodal model equipped with text detection and recognition capabilities, which existing methods lack. Furthermore, UniDoc exploits beneficial interactions between tasks to improve the performance of each task. To implement UniDoc, we perform unified multimodal instruction tuning on a large instruction following dataset. Quantitative and qualitative experimental results show that UniDoc achieves state-of-the-art results on multiple challenging benchmarks. To the best of our knowledge, this is the first large-scale multimodal model capable of simultaneous text detection, recognition, localization and understanding.
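
To make "unified multimodal instruction tuning" concrete, here is a minimal sketch of what instruction samples spanning detection, recognition, and understanding might look like for one document image. The field names, task phrasings, and prompt template are illustrative assumptions, not UniDoc's actual data schema.

```python
# Hypothetical instruction-following samples covering the three task types
# the summary names; not UniDoc's real training data or schema.
unidoc_style_samples = [
    {
        "image": "receipt_001.png",  # hypothetical image path
        "instruction": "Detect all text regions in the image.",
        "response": "[[12, 40, 180, 62], [12, 70, 220, 95]]",  # x1,y1,x2,y2 boxes
        "task": "detection",
    },
    {
        "image": "receipt_001.png",
        "instruction": "Recognize the text inside the box [12, 40, 180, 62].",
        "response": "TOTAL: $23.50",
        "task": "recognition",
    },
    {
        "image": "receipt_001.png",
        "instruction": "What is the total amount on this receipt?",
        "response": "The total is $23.50.",
        "task": "understanding",
    },
]

def to_prompt(sample):
    """Flatten one sample into a single instruction-following string of the
    kind a multimodal LLM would be tuned on (image tokens elided)."""
    return f"<image>\nUSER: {sample['instruction']}\nASSISTANT: {sample['response']}"

for s in unidoc_style_samples:
    print(to_prompt(s), end="\n\n")
```

Tuning one model on all three task formats at once is what lets the tasks share representations, which is the "beneficial interaction" the summary refers to.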

https://www.aminer.cn/pub/64e5849c3fda6d7f063af4d2/

2. A Survey on Large Language Model based Autonomous Agents (Read the original text)

This paper is a survey of research on autonomous agents based on large language models. Previous research often trained agents in isolated environments with limited knowledge, which is far from the human learning process and makes it difficult for agents to achieve human-like decision-making. In recent years, large language models (LLMs) have shown great potential for human-level intelligence by acquiring vast amounts of web knowledge, triggering a surge of research on LLM-based autonomous agents. To fully exploit the potential of LLMs, researchers have designed various agent architectures for different applications. This paper systematically reviews these studies as a whole. Specifically, it focuses on the construction of LLM-based agents and proposes a unified framework that covers most previous work. It also surveys applications of LLM-based agents in the social sciences, natural sciences, and engineering, and discusses common strategies for evaluating them. Based on prior research, the paper also identifies several challenges and future directions in this field.
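
As a rough illustration of the kind of agent architecture such a unified framework describes, here is a minimal skeleton of the profile/memory/plan/act loop that most LLM-agent designs in this literature share. `call_llm` is a placeholder for any chat-completion API, and the module split follows the common decomposition rather than any single paper's code.

```python
# A minimal LLM-agent skeleton, assuming a generic chat-completion backend.
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    # Placeholder: wire this to an actual LLM endpoint.
    return f"(LLM response to: {prompt[:40]}...)"

@dataclass
class Agent:
    profile: str                                # role description conditioning the LLM
    memory: list = field(default_factory=list)  # past observation/action records

    def plan(self, observation: str) -> str:
        context = "\n".join(self.memory[-5:])   # short-term memory window
        prompt = (f"{self.profile}\nRecent memory:\n{context}\n"
                  f"Observation: {observation}\nNext action:")
        return call_llm(prompt)

    def act(self, observation: str) -> str:
        action = self.plan(observation)
        self.memory.append(f"obs={observation} -> act={action}")
        return action

agent = Agent(profile="You are a web research assistant.")
print(agent.act("User asked for the latest LLM-agent surveys."))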

https://www.aminer.cn/pub/64e5849c3fda6d7f063af42e/

3. ProAgent: Building Proactive Cooperative AI with Large Language Models (Read the original text)

This paper introduces a new framework called ProAgent, which leverages large language models to make agents more forward-looking and proactive when cooperating with humans or other agents. Traditional cooperative-agent approaches rely mainly on learning methods whose policy generalization depends heavily on past interactions with specific teammates, limiting an agent's ability to readjust its strategy when facing new teammates. ProAgent can anticipate teammates' future decisions and formulate enhanced plans for itself, exhibiting strong cooperative reasoning and dynamically adapting to improve collaboration with teammates. Furthermore, the ProAgent framework is highly modular and interpretable and can be seamlessly integrated into various coordination scenarios. Experimental results show that ProAgent outperforms five methods based on self-play and population-based training in the Overcooked-AI environment. When cooperating with models of human behavior, its performance improves by more than 10% on average, surpassing COLE, the current state-of-the-art method. These gains hold consistently across diverse scenarios involving interactions with AI agents and human partners of different characteristics. These findings motivate future research on human-agent collaboration.
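
The anticipate-then-plan pattern the summary describes can be sketched in a few lines: first query an LLM for the teammate's likely next action, then condition the agent's own plan on that prediction. This is a toy illustration of the general idea under those assumptions, with `call_llm` as a stand-in for a real model call, not ProAgent's actual implementation.

```python
# Toy anticipate-then-plan loop for a cooperative cooking environment.
def call_llm(prompt: str) -> str:
    return "(LLM output)"  # stand-in for a real chat-completion call

def proactive_step(state: str, teammate_history: list) -> str:
    # Step 1: infer the teammate's intention from observed behavior.
    belief = call_llm(
        f"Kitchen state: {state}\n"
        f"Teammate's recent actions: {teammate_history}\n"
        "What is the teammate most likely to do next?"
    )
    # Step 2: plan our own action so that it complements that prediction.
    return call_llm(
        f"Kitchen state: {state}\n"
        f"Predicted teammate action: {belief}\n"
        "Choose our next action so the team finishes the dish fastest."
    )

print(proactive_step("onion soup needs 2 more onions",
                     ["picked up onion", "moved toward pot"]))
```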

https://www.aminer.cn/pub/64e5849c3fda6d7f063af3cd/

4. Refashioning Emotion Recognition Modelling: The Advent of Generalised Large Models (Read the original text)

This paper examines the refashioning underway in emotion recognition modeling: the advent of general-purpose large models. Since its inception, emotion recognition, or affective computing, has been an active research topic owing to its wide range of applications. Over the past few decades, emotion recognition models have gradually migrated from shallow statistical models to deep neural-network-based models, which significantly improve performance and consistently achieve the best results on different benchmarks. Deep models have therefore come to be regarded as the default choice for emotion recognition in recent years. However, the emergence of large language models (LLMs) such as ChatGPT has been remarkable for their unprecedented capabilities in zero-/few-shot learning, in-context learning, chain-of-thought reasoning, and more. In this paper, the authors comprehensively investigate the performance of LLMs on emotion recognition, covering in-context learning, few-shot learning, accuracy, generalization, and explainability. They also offer insights and raise further potential challenges, hoping to stimulate broader discussion of emotion recognition in this new era.
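
For readers unfamiliar with the setups being evaluated, here is what zero-shot and few-shot prompting for emotion recognition typically look like. The label set and utterances below are invented for illustration and are not the paper's data.

```python
# Illustrative zero-shot and few-shot prompts for LLM emotion recognition;
# labels and utterances are made up, not drawn from the paper's benchmarks.
zero_shot = (
    "Classify the emotion of the utterance as one of "
    "{joy, sadness, anger, fear, neutral}.\n"
    'Utterance: "I can\'t believe they cancelled my flight again."\n'
    "Emotion:"
)

few_shot = (
    'Utterance: "We finally won the grant!" Emotion: joy\n'
    'Utterance: "Leave me alone." Emotion: anger\n'
    'Utterance: "I can\'t believe they cancelled my flight again."\n'
    "Emotion:"
)

# Either string would be sent to an LLM completion endpoint; the model's
# single-word continuation is taken as the predicted label.
print(zero_shot)
```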

https://www.aminer.cn/pub/64e5849c3fda6d7f063af4c4/

5. Furnishing Sound Event Detection with Language Model Abilities (Read the original text)

This paper explores the cross-modal capabilities of language models (LMs) and applies them to sound event detection (SED). The authors propose an elegant method that aligns audio features with text features to perform sound event classification and temporal localization. The framework consists of an acoustic encoder, a contrastive module that aligns the corresponding representations of text and audio, and a decoupled language decoder that directly leverages the LM's semantic capabilities to generate temporal and event sequences from audio features. Compared with traditional methods, which require complex processing and exploit only limited audio features, this model is more concise and comprehensive. The authors study different decoupled modules to demonstrate their effectiveness for timestamp capture and event classification. Evaluation results show that the proposed method achieves accurate sequence generation for sound event detection.
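
The three-part design the summary describes can be sketched schematically in PyTorch: an acoustic encoder, a contrastive audio-text alignment head, and a decoder that emits event-and-timestamp tokens. Module choices, shapes, and hyperparameters here are assumptions for illustration, not the paper's actual architecture.

```python
# Schematic encoder / alignment / decoder sketch for LM-based SED.
import torch
import torch.nn as nn

class SketchSED(nn.Module):
    def __init__(self, n_mels=64, d_model=256, vocab_size=1000):
        super().__init__()
        # Acoustic encoder: mel-spectrogram frames -> latent audio tokens.
        self.encoder = nn.GRU(n_mels, d_model, batch_first=True)
        # Projection heads for contrastive audio-text alignment.
        self.audio_proj = nn.Linear(d_model, d_model)
        self.text_proj = nn.Linear(d_model, d_model)
        # Decoupled decoder: generates event/timestamp token sequences
        # conditioned on the audio features.
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, mels, text_emb, prev_tokens):
        audio, _ = self.encoder(mels)            # (B, T, d) audio features
        a = self.audio_proj(audio.mean(dim=1))   # pooled audio vector
        t = self.text_proj(text_emb)             # (B, d) text vector
        align_logits = a @ t.T                   # contrastive alignment logits
        dec = self.decoder(self.token_emb(prev_tokens), audio)
        return align_logits, self.lm_head(dec)   # next-token logits

model = SketchSED()
out = model(torch.randn(2, 100, 64), torch.randn(2, 256),
            torch.randint(0, 1000, (2, 8)))
print(out[0].shape, out[1].shape)  # (2, 2) and (2, 8, 1000)
```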

https://www.aminer.cn/pub/64e5849c3fda6d7f063af492/

6. LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models (Read the original text)

This paper introduces a legal-reasoning benchmark called LegalBench, built collaboratively by experts across multiple fields to study the reasoning capabilities of large language models (LLMs) in the legal domain. LegalBench contains 162 tasks covering six distinct types of legal reasoning. The tasks were designed by legal professionals so that each measures a legal reasoning skill that is either practically useful or of interest to lawyers. To foster interdisciplinary discussion about LLMs in law, the paper also shows how popular legal frameworks map onto LegalBench tasks, providing a common vocabulary for lawyers and LLM developers. Finally, the paper presents an empirical evaluation of 20 open-source and commercial LLMs and illustrates the kinds of research exploration LegalBench can facilitate.
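
A benchmark of this shape implies an evaluation loop like the following: run each task's prompts through a model and score exact-match accuracy per task. The task name and items below are made up for illustration; the real task definitions live in the LegalBench repository.

```python
# Hedged sketch of a per-task exact-match evaluation loop; task data is
# invented, and `model` is a stand-in for a real LLM call.
tasks = {
    "hearsay_classification": [   # hypothetical task name and items
        {"prompt": "Is the statement hearsay? ...", "answer": "Yes"},
        {"prompt": "Is the statement hearsay? ...", "answer": "No"},
    ],
}

def model(prompt: str) -> str:
    return "Yes"  # stand-in for a real LLM call

def evaluate(tasks, model):
    scores = {}
    for name, items in tasks.items():
        correct = sum(model(x["prompt"]).strip() == x["answer"] for x in items)
        scores[name] = correct / len(items)
    return scores

print(evaluate(tasks, model))  # {'hearsay_classification': 0.5}
```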

https://www.aminer.cn/pub/64e5849c3fda6d7f063af44d/

7. A Survey on Self-Supervised Representation Learning (Read the original text)

This paper is a survey on self-supervised representation learning. In modern machine learning, learning meaningful representations is at the core of many tasks. In recent years, many methods have been introduced for learning image representations without supervision. These representations can then be used for downstream tasks such as classification or object detection, and their quality approaches that of supervised learning without requiring labeled images. This survey provides a comprehensive review of these methods in a unified notation, points out their similarities and differences, and proposes a taxonomy that connects them to each other. Furthermore, the survey summarizes the latest experimental results reported in the literature in the form of a meta-study. It is intended as a starting point for researchers and practitioners who wish to delve deeper into the field of representation learning.
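
Many of the surveyed methods are contrastive; as one concrete instance, here is a minimal InfoNCE/NT-Xent loss over two augmented views of a batch, SimCLR-style. This follows the standard formulation rather than any specific repository.

```python
# Minimal symmetric InfoNCE loss over two augmented views of a batch.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.5):
    """z1, z2: (B, d) embeddings of two augmentations of the same images."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature      # (B, B) cosine-similarity matrix
    targets = torch.arange(z1.size(0))    # positives lie on the diagonal
    # Symmetrized cross-entropy: each view must pick out its counterpart.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```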

https://www.aminer.cn/pub/64e5849c3fda6d7f063af446/

8. How Much Temporal Long-Term Context is Needed for Action Segmentation? (Read the original text)

This paper investigates how much long-term temporal context is needed for video action segmentation. Although Transformers can model a video's long temporal context, this becomes computationally infeasible for long videos. Recent temporal action segmentation methods therefore combine temporal convolutional networks with self-attention over local temporal windows. Although these methods achieve good results, their performance is limited by their inability to capture the full context of a video. In this work, the authors ask how much long-term context is needed for temporal action segmentation and introduce a Transformer-based model that uses sparse attention to capture the full context of a video. They compare this model with state-of-the-art methods on three datasets for temporal action segmentation: 50Salads, Breakfast, and Assembly101. Experimental results show that modeling the full context of a video is necessary to obtain the best temporal action segmentation performance.
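
To see why sparse attention makes full-video context affordable, consider a mask in which each frame attends to a local window plus a strided set of global frames: the mask then has O(T·(w + T/s)) allowed pairs instead of T². The window and stride values below are arbitrary, and the paper's exact sparsity pattern differs.

```python
# Toy local-window + strided-global sparse attention mask for T frames.
import torch

def sparse_attention_mask(T, window=8, stride=16):
    i = torch.arange(T).unsqueeze(1)   # query frame index
    j = torch.arange(T).unsqueeze(0)   # key frame index
    local = (i - j).abs() <= window    # local temporal window
    strided = (j % stride) == 0        # periodically sampled global frames
    return local | strided             # True = attention allowed

mask = sparse_attention_mask(T=64)
print(mask.float().mean().item())      # fraction of allowed pairs, well below 1.0
```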

https://www.aminer.cn/pub/64e5849c3fda6d7f063af3e0/

9. Federated Learning in Big Model Era: Domain-Specific Multimodal Large Models (Read the original text)

This paper discusses the development of federated learning in the era of large models and proposes a domain-specific federated learning framework for multimodal large models. The framework allows multiple enterprises to jointly train large vertical-domain models on private data in order to deliver intelligent services. The authors discuss in depth the strategic shifts in the foundations and goals of federated learning in the large-model era, as well as the new challenges it faces, including heterogeneous data, model aggregation, performance-cost trade-offs, data privacy, and incentive mechanisms. Through a case study, the paper also describes how leading enterprises use multimodal data and expert knowledge to achieve distributed deployment and effective coordination for urban safety operations management, along with data-quality improvement and technological innovation built on large-model capabilities. Preliminary experimental results show that enterprises can enhance and accumulate intelligent capabilities through multimodal federated learning, jointly build a smart-city model, and provide high-quality smart services covering energy-infrastructure safety, residential-community security, and urban operations management. The resulting federated-learning cooperation ecosystem is expected to further integrate industry, academia, and research resources, realize large models in multiple vertical domains, and promote large-scale industrial application and cutting-edge research on artificial intelligence and multimodal federated learning.
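
The "model aggregation" challenge named above is classically addressed by some variant of federated averaging. As a generic reference point rather than this paper's system, here is a minimal FedAvg step over per-client weight dictionaries.

```python
# Minimal FedAvg: weighted average of client state_dicts by dataset size.
import torch

def fedavg(client_states, client_sizes):
    """client_states: list of state_dicts; client_sizes: local dataset sizes."""
    total = sum(client_sizes)
    avg = {}
    for key in client_states[0]:
        avg[key] = sum(sd[key] * (n / total)
                       for sd, n in zip(client_states, client_sizes))
    return avg

# Two toy "clients" with one-parameter models and different data sizes.
c1 = {"w": torch.tensor([1.0])}
c2 = {"w": torch.tensor([3.0])}
print(fedavg([c1, c2], client_sizes=[100, 300]))  # {'w': tensor([2.5000])}
```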

https://www.aminer.cn/pub/64e5846c3fda6d7f063ac938/

10. ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data (Read the original text)

This paper studies how vision-language models (VLMs) perform on real-world multimodal data with high pairwise complexity, such as medical data. Unlike the online image-caption datasets on which previous VLMs were mainly trained, in this kind of real-world data each image (such as an X-ray) is usually paired with text (such as a physician's report) that describes multiple distinct attributes of fine-grained regions of the image, resulting in high pairing complexity. Whether VLMs trained on such datasets can capture fine-grained relationships between image regions and text attributes had not been evaluated. The paper makes two main contributions: first, a systematic evaluation confirming that the ability of standard VLMs to learn region-attribute relationships decreases as the pairwise complexity of the training dataset increases; second, ViLLA, a method proposed to address this problem. ViLLA captures fine-grained region-attribute relationships in complex datasets through two components: (a) a lightweight, self-supervised mapping model that decomposes image-text samples into region-attribute pairs, and (b) a contrastive VLM that learns representations from the generated region-attribute pairs. Experiments across four domains (synthetic, product, medical, and natural images) show that ViLLA outperforms comparable VLMs on fine-grained reasoning tasks such as zero-shot object detection and retrieval.
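
A toy version of the decomposition step (a) helps fix the idea: score every (region, attribute) pair by embedding similarity and keep the best region per attribute, turning one coarse image-report pair into fine-grained training pairs. The random embeddings below stand in for real region and text encoders; this is not ViLLA's actual mapping model.

```python
# Toy region-attribute matching by cosine similarity of embeddings.
import torch
import torch.nn.functional as F

def match_regions_to_attributes(region_emb, attr_emb):
    """region_emb: (R, d) image-region features; attr_emb: (A, d) text
    attribute features. Returns the best-matching region per attribute."""
    sim = F.normalize(attr_emb, dim=1) @ F.normalize(region_emb, dim=1).T
    best = sim.argmax(dim=1)                # (A,) region index per attribute
    return [(int(r), a) for a, r in enumerate(best)]

pairs = match_regions_to_attributes(torch.randn(5, 64), torch.randn(3, 64))
print(pairs)  # e.g. [(2, 0), (0, 1), (4, 2)] -> (region, attribute) pairs
```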

https://www.aminer.cn/pub/64e5846c3fda6d7f063ac920/

11. ReLLa: Retrieval-enhanced Large Language Models for Lifelong Sequential Behavior Comprehension in Recommendation (Read the original text)

This paper studies how to use large language models (LLMs) for lifelong sequential behavior comprehension in order to improve recommendation performance. The authors first clarify and define the lifelong sequential behavior comprehension problem for LLMs in recommendation: LLMs fail to extract useful information from long textual contexts of user behavior sequences, even when those contexts are far shorter than the LLM's context limit. To address this and improve LLM performance on recommendation tasks, the authors propose a framework called ReLLa for both zero-shot and few-shot recommendation settings. For zero-shot recommendation, they perform semantic user behavior retrieval (SUBR) to improve the data quality of test samples, which greatly reduces the difficulty LLMs have in extracting the necessary knowledge from user behavior sequences. For few-shot recommendation, they further design retrieval-enhanced instruction tuning (ReiT), which uses SUBR as a data augmentation technique for training samples. Specifically, they construct a mixed training dataset containing both the original data samples and their retrieval-enhanced versions. Extensive experiments on a real-world public dataset (MovieLens-1M) demonstrate ReLLa's superiority over existing baselines and its ability to comprehend lifelong sequential behaviors.
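
The core of semantic user behavior retrieval, as the summary describes it, is to fill the prompt with the k historical behaviors most semantically similar to the target item rather than simply the most recent ones. Here is a minimal sketch under that reading, with random embeddings standing in for a real text encoder.

```python
# Toy semantic retrieval of the top-k most relevant user behaviors.
import torch
import torch.nn.functional as F

def retrieve_relevant_behaviors(history_emb, target_emb, history, k=3):
    """history_emb: (N, d) behavior embeddings; target_emb: (d,) target-item
    embedding. Returns the k behaviors most similar to the target."""
    sims = F.normalize(history_emb, dim=1) @ F.normalize(target_emb, dim=0)
    top = sims.topk(min(k, len(history))).indices
    return [history[i] for i in top.tolist()]

history = ["watched Heat", "watched Toy Story", "watched Se7en",
           "watched Frozen", "watched Zodiac"]
emb = torch.randn(5, 32)       # stand-in behavior embeddings
target = torch.randn(32)       # stand-in target-movie embedding
print(retrieve_relevant_behaviors(emb, target, history))
```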

https://www.aminer.cn/pub/64e5846c3fda6d7f063ac8e0/

12. Self-Deception: Reverse Penetrating the Semantic Firewall of Large Language Models (Read the original text)

This paper studies the semantic-firewall problem of large language models (LLMs) and proposes an attack called "self-deception" that can bypass an LLM's semantic firewall. The authors present an automatic "jailbreak" method that induces the LLM itself to generate prompts capable of bypassing the firewall. They conducted experiments in six languages, targeting the three most common violation types: violence, hate, and pornography. Experimental results show that the attack is highly effective. The authors argue that steering AI behavior through carefully designed prompts will become an important research direction in the future.

https://www.aminer.cn/pub/64e5849c3fda6d7f063af489/

Source: blog.csdn.net/AI_Conf/article/details/132470729