Real-time tracking of scientific research developments | JARVIS-1, an open-world multi-task agent with a memory-augmented multimodal language model - selected new papers for November 13

As a researcher, you need to search and read a large volume of academic literature every day to keep up with the latest scientific and technological progress and research results.

However, traditional retrieval and reading methods can no longer keep pace with researchers' needs.

AMiner AI is a literature knowledge tool that integrates retrieval, reading, and knowledge Q&A. It helps you retrieve and read papers more efficiently, keeps you on top of the latest research trends in your field, and makes research work more comfortable.


If you want to have an in-depth conversation about a particular paper, you can paste the paper's link into your browser or go directly to the AMiner AI page: https://www.aminer.cn/chat/g/explain

List of selected new papers on November 13, 2023:

1.Instant3D: Fast Text-to-3D with Sparse-View Generation and Large Reconstruction Model

This paper introduces a new method called Instant3D for generating 3D assets quickly and with high quality. Unlike existing methods, Instant3D adopts a two-stage approach: it first uses a fine-tuned 2D text-to-image diffusion model to generate a sparse set of four structured, consistent views in a single pass, and then uses a transformer-based sparse-view reconstructor that directly regresses a NeRF from the generated images. Through extensive experiments, the authors demonstrate that their method can generate high-quality, diverse, and Janus-free 3D assets within 20 seconds, two orders of magnitude faster than previous optimization-based methods, which take 1 to 10 hours.

https://www.aminer.cn/pub/65518a95939a5f4082a65ebe/?f=cs

2.Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization

This paper studies a principled fine-tuning paradigm for downstream task adaptation: orthogonal fine-tuning (OFT). Although OFT shows good generalization ability, it still uses a considerable number of trainable parameters due to the high dimensionality of the orthogonal matrices. To address this issue, the authors examine OFT from an information-transmission perspective and identify several key desiderata for better parameter efficiency. Inspired by how the Cooley-Tukey fast Fourier transform algorithm achieves efficient information transmission, the authors propose an efficient orthogonal parameterization with a butterfly structure. Applying this parameterization to OFT yields a novel parameter-efficient fine-tuning method called Orthogonal Butterfly (BOFT). Subsuming OFT as a special case, BOFT provides a generalized orthogonal fine-tuning framework. Finally, the authors conduct an extensive empirical study of adapting large vision Transformers, large language models, and text-to-image diffusion models to a variety of downstream vision and language tasks.
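
To make the butterfly idea concrete, here is a minimal NumPy sketch (our own illustration, not the authors' implementation): an n x n orthogonal matrix is assembled as a product of log2(n) sparse butterfly factors, each composed of 2x2 rotations, so only (n/2)·log2(n) angles need to be trained rather than the n(n-1)/2 parameters of a dense orthogonal matrix.

```python
import numpy as np

def butterfly_factor(n, stride, angles):
    """One sparse butterfly factor: 2x2 rotations on index pairs (i, i + stride)."""
    B = np.eye(n)
    k = 0
    for start in range(0, n, 2 * stride):
        for i in range(start, start + stride):
            j = i + stride
            c, s = np.cos(angles[k]), np.sin(angles[k])
            B[i, i], B[i, j] = c, -s
            B[j, i], B[j, j] = s, c
            k += 1
    return B

def butterfly_orthogonal(n, rng):
    """Product of log2(n) butterfly factors -> a dense n x n orthogonal matrix."""
    Q = np.eye(n)
    stride = 1
    while stride < n:
        angles = rng.uniform(-np.pi, np.pi, size=n // 2)  # n/2 trainable angles per factor
        Q = butterfly_factor(n, stride, angles) @ Q
        stride *= 2
    return Q

rng = np.random.default_rng(0)
Q = butterfly_orthogonal(8, rng)
assert np.allclose(Q @ Q.T, np.eye(8), atol=1e-10)  # the product stays orthogonal
```

In OFT-style fine-tuning, such an orthogonal matrix multiplies a frozen pretrained weight matrix; the butterfly parameterization keeps that multiplication expressive while shrinking the number of trainable parameters.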

https://www.aminer.cn/pub/65518ab0939a5f4082a66b9e/?f=cs

3.Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs

This paper introduces a new framework called Lumos for training language agents. The framework adopts a unified data format and a modular architecture based on open-source large language models (LLMs). Lumos consists of three distinct modules: planning, grounding, and execution. The planning module decomposes a task into a series of high-level, tool-agnostic sub-goals, which the grounding module then makes concrete as low-level actions. These actions are carried out by the execution module using a range of off-the-shelf tools and APIs. To train these modules effectively, high-quality sub-goal and action annotations are collected and used to fine-tune open-source LLMs for a variety of tasks, including complex question answering, web tasks, and mathematical problems.

Leveraging this unified data and modular design, Lumos not only achieves performance comparable or superior to current state-of-the-art agents, but also exhibits several key advantages: (1) Lumos outperforms GPT-4/3.5-based agents on complex question answering and web tasks, while performing comparably to significantly larger LLM agents on mathematical tasks; (2) Lumos outperforms open-source agents trained with conventional methods and chain-of-thought training; (3) Lumos generalizes effectively to unseen interactive tasks, outperforming larger LLM-based agents and even specialized agents.
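
The planning-grounding-execution split can be pictured with a short sketch (hypothetical Python; `planning_llm`, `grounding_llm`, and the tool names are placeholders, not the paper's actual interfaces):

```python
def run_lumos_style_agent(task, planning_llm, grounding_llm, tools):
    """Planning -> Grounding -> Execution, one sub-goal at a time (illustrative only)."""
    # Planning: decompose the task into high-level, tool-agnostic sub-goals.
    subgoals = planning_llm(f"Decompose into sub-goals: {task}")
    results = []
    for goal in subgoals:
        # Grounding: turn a sub-goal into concrete low-level actions,
        # conditioned on what has been observed so far.
        actions = grounding_llm(f"Sub-goal: {goal}\nPrior results: {results}")
        for tool_name, argument in actions:      # e.g. ("search", "capital of France")
            # Execution: call an off-the-shelf tool or API.
            results.append(tools[tool_name](argument))
    return results[-1] if results else None
```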

https://www.aminer.cn/pub/65518952939a5f4082a5d9c9/?f=cs

4.Hiformer: Heterogeneous Feature Interactions Learning with Transformers for Recommender Systems

This paper introduces the Hiformer model for learning heterogeneous feature interactions in recommender systems. Feature interactions are key to building recommender systems, but in web-scale applications, learning them is very challenging because the input feature space is sparse and enormous; at the same time, manually crafting effective feature interactions is infeasible because of the exponential size of the solution space. The authors propose to use a Transformer-based architecture with attention layers to capture feature interactions automatically. Although the Transformer architecture has achieved great success in fields such as natural language processing and computer vision, it has not been widely adopted for feature interaction modeling in industry, and the authors aim to bridge this gap. They identify two key challenges in applying the vanilla Transformer architecture to large-scale recommender systems: (1) the vanilla self-attention layer cannot capture heterogeneous feature interactions; (2) the serving latency of the Transformer architecture may be too high for deployment in recommender systems. The authors first propose a heterogeneous self-attention layer, a simple yet effective modification of the self-attention layer in the Transformer that accounts for the heterogeneity of feature interactions. They then introduce Hiformer (Heterogeneous Interaction Transformer) to further improve model expressiveness. With low-rank approximation and model pruning, Hiformer achieves fast inference for online deployment. Extensive offline experimental results confirm the effectiveness and efficiency of the Hiformer model. The authors have successfully deployed Hiformer in the ranking model of a real-world large-scale application, Google Play, with a significant improvement of 2.66% in key engagement metrics.
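
One reading of "heterogeneous self-attention" is that the projection weights depend on which feature slot is involved, rather than being shared across all tokens as in the vanilla Transformer. The PyTorch sketch below is our own minimal illustration under that assumption, not Google's implementation:

```python
import torch
import torch.nn as nn

class HeterogeneousSelfAttention(nn.Module):
    """Per-feature-slot Q/K/V projections, so the attention logit for a pair of
    features depends on *which* features interact, not only on their embeddings."""

    def __init__(self, num_features, dim):
        super().__init__()
        scale = dim ** -0.5
        self.q = nn.Parameter(torch.randn(num_features, dim, dim) * scale)
        self.k = nn.Parameter(torch.randn(num_features, dim, dim) * scale)
        self.v = nn.Parameter(torch.randn(num_features, dim, dim) * scale)

    def forward(self, x):                        # x: (batch, num_features, dim)
        q = torch.einsum("bfd,fde->bfe", x, self.q)   # slot-specific projections
        k = torch.einsum("bfd,fde->bfe", x, self.k)
        v = torch.einsum("bfd,fde->bfe", x, self.v)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v                          # (batch, num_features, dim)
```

The paper additionally applies low-rank approximation and pruning to keep serving latency low; the dense per-slot projections above would be the naive, slower starting point.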

https://www.aminer.cn/pub/655189d8939a5f4082a60f25/?f=cs

5.FMViT: A multiple-frequency mixing Vision Transformer

This paper introduces an efficient hybrid vision Transformer architecture called FMViT. Because of the quadratic time and memory complexity of self-attention, existing vision Transformers (ViTs) face challenges compared with traditional convolutional neural networks (CNNs) in practical industrial deployment scenarios (such as TensorRT and CoreML). Although there have been attempts to address this with CNN-Transformer hybrid architectures, their overall performance has not met expectations. To address these issues, the authors propose an efficient hybrid ViT architecture named FMViT. The method enhances the model's expressive power by blending high-frequency and low-frequency features, allowing it to capture both local and global information effectively. In addition, they introduce deployment-friendly mechanisms such as convolutional multi-group reparameterization (gMLP), lightweight multi-head self-attention (RLMHSA), and convolutional fusion blocks (CFB) to further improve performance and reduce computational overhead. Experiments demonstrate that FMViT outperforms existing CNN, ViT, and CNN-Transformer hybrid architectures in the latency/accuracy trade-off across a variety of vision tasks. On the TensorRT platform, FMViT achieves 2.5% higher top-1 accuracy on the ImageNet dataset than ResNet101 (83.3% vs. 80.8%) while maintaining similar inference latency. In addition, FMViT's performance is comparable to EfficientNet-B5, with 43% faster inference. On CoreML, FMViT's top-1 accuracy on ImageNet is 2.6% higher than MobileOne at comparable inference latency (78.5% vs. 75.9%). The code is available at https://github.com/tany0699/FMViT.
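
As rough intuition for the frequency-mixing idea (a toy sketch under our own assumptions; FMViT's actual gMLP, RLMHSA, and CFB blocks are more elaborate), a low-frequency branch can be obtained by pooling and upsampling, leaving a high-frequency residual, and the two branches are then fused:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrequencyMixingBlock(nn.Module):
    """Toy high-/low-frequency mixing: global structure from a smoothed branch,
    local detail from the residual, fused by a 1x1 convolution."""

    def __init__(self, dim):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depthwise, local detail
        self.fuse = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, x):                        # x: (batch, dim, H, W)
        # Low-frequency branch: smooth and downsample, then restore resolution.
        low = F.interpolate(F.avg_pool2d(x, 4), size=x.shape[-2:],
                            mode="bilinear", align_corners=False)
        high = self.local(x - low)               # high-frequency residual
        return self.fuse(torch.cat([low, high], dim=1))
```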

https://www.aminer.cn/pub/65518961939a5f4082a5dfd7/?f=cs

6.JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models

This paper introduces an open-world multi-task agent called JARVIS-1, which uses a memory-augmented multimodal language model to achieve human-like planning and control. In the open world, handling multimodal observations (visual observations and human instructions) is a critical milestone toward more capable general-purpose agents. Existing methods can handle open-world tasks up to a certain length, but they still struggle when the number of tasks is potentially infinite and when the ability to complete tasks cannot gradually improve over game time. JARVIS-1 is an open-world agent that can perceive multimodal input (visual observations and textual instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, JARVIS-1 is built on a pre-trained multimodal language model that maps visual observations and textual instructions to plans. The plans are ultimately dispatched to goal-conditioned controllers. The authors equip JARVIS-1 with a multimodal memory that draws on both pre-trained knowledge and actual in-game survival experience for planning. In experiments, JARVIS-1 performs almost perfectly on over 200 different tasks from the Minecraft Universe benchmark, ranging from entry level to intermediate level. On the long-horizon Diamond Pickaxe task, JARVIS-1 achieves a completion rate of 12.5%, more than five times the previous record. Furthermore, the authors show that, thanks to the multimodal memory, JARVIS-1 is able to self-improve following a life-long learning paradigm, pointing toward broader intelligence and increased autonomy. The project page is at https://craftjarvis-jarvis1.github.io.
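
A heavily simplified view of the memory-augmented plan-act loop is sketched below; the interfaces (`memory.retrieve`, `planner`, `controller.execute`) are hypothetical stand-ins, not the JARVIS-1 API:

```python
def jarvis_style_step(task, observation, memory, planner, controller, env, retries=3):
    """One plan-act cycle with a multimodal memory (illustrative only)."""
    # Retrieve past (situation, plan, outcome) entries similar to the current state.
    experiences = memory.retrieve(task, observation, top_k=5)
    # The multimodal LM maps (visual observation, instruction, experience) -> plan.
    plan = planner(observation, task, experiences)
    for goal in plan:                            # e.g. "craft wooden pickaxe"
        observation, success = controller.execute(goal, env)  # goal-conditioned control
        if not success and retries > 0:          # replan on failure instead of aborting
            return jarvis_style_step(task, observation, memory, planner,
                                     controller, env, retries - 1)
    memory.store(task, plan, observation)        # stored experience drives self-improvement
    return observation
```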

https://www.aminer.cn/pub/65518a1f939a5f4082a62ced/?f=cs

7.PolyMaX: General Dense Prediction with Mask Transformer

This paper introduces PolyMaX, a general-purpose dense prediction method based on mask transformers. Dense prediction tasks, such as semantic segmentation, depth estimation, and surface normal prediction, can be easily formulated as per-pixel classification (discrete outputs) or regression (continuous outputs). The per-pixel prediction paradigm has been popular thanks to the prevalence of fully convolutional networks. However, on the recent frontier of segmentation tasks, with the emergence of transformer architectures, particularly mask transformers, the community has witnessed a paradigm shift from per-pixel prediction to cluster prediction, directly predicting labels for masks rather than pixels. Nonetheless, methods based on the per-pixel paradigm still dominate dense prediction tasks that require continuous outputs, such as depth estimation and surface normal prediction. Inspired by the success of DORN and AdaBins in depth estimation, achieved by discretizing the continuous output space, the authors propose to generalize the cluster-prediction approach to general dense prediction tasks. This allows them to unify dense prediction tasks with the mask transformer framework. Remarkably, the PolyMaX model demonstrates state-of-the-art performance on three benchmarks of the NYUD-v2 dataset. The authors hope this simple yet effective design will inspire more research on exploiting mask transformers for more dense prediction tasks. Code and models will be made publicly available.
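
The DORN/AdaBins-style trick that lets a continuous task fit the cluster paradigm can be shown in a few lines. This is a simplified sketch of the general recipe, not the PolyMaX architecture: K masks softly assign each pixel to K depth bins, and the final depth is the probability-weighted mean of the bin centers.

```python
import torch

def depth_from_cluster_masks(mask_logits, bin_centers):
    """mask_logits: (batch, K, H, W) per-cluster mask logits from a mask transformer
    bin_centers:  (batch, K)       a depth value predicted for each cluster
    returns:      (batch, H, W)    dense depth map
    """
    probs = torch.softmax(mask_logits, dim=1)    # soft pixel-to-cluster assignment
    return torch.einsum("bkhw,bk->bhw", probs, bin_centers)

# Toy usage with random tensors:
logits = torch.randn(2, 16, 32, 32)              # 16 clusters
centers = torch.rand(2, 16) * 10.0               # depths in [0, 10) metres
depth = depth_from_cluster_masks(logits, centers)
```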

https://www.aminer.cn/pub/6551898a939a5f4082a5f1a7/?f=cs

8.Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

This paper introduces a multimodal autoregressive model called Mirasol3B for handling both time-aligned and non-time-aligned modalities. A major challenge in multimodal learning is the need to combine heterogeneous modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher rates than text and are roughly aligned in time; they are usually not synchronized with text, which serves as global context, such as a title or description. Furthermore, video and audio inputs are much larger in volume and naturally grow as videos get longer, which naturally requires more computation to be allocated to these modalities and makes long-range dependency modeling harder.

The authors decouple multimodal modeling into separate, focused autoregressive models that process inputs according to the characteristics of each modality. They propose a multimodal model called Mirasol3B, which consists of an autoregressive component for the time-synchronized modalities (audio and video) and an autoregressive component for the context modalities, which are not time-aligned but still sequential. To handle the long sequences of video and audio input, the authors further partition the video and audio sequences into consecutive segments and model their representations autoregressively. To this end, they propose a Combiner mechanism that jointly models the audio and video information within a time segment. The Combiner learns audio and video features from raw spatio-temporal signals and then learns to fuse them into a compact yet expressive representation per segment. The method achieves state-of-the-art performance on multiple multimodal benchmarks, outperforming much larger models. It effectively addresses the high computational demands of media inputs by learning compact representations, controlling the sequence length of the audio-video feature representations, and modeling their temporal dependencies.
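
The Combiner can be pictured as learned latent queries that pool a segment's audio and video tokens into a few fused tokens. The sketch below is our own cross-attention-style approximation of that idea, not the released model:

```python
import torch
import torch.nn as nn

class SegmentCombiner(nn.Module):
    """Fuse one time segment's audio and video tokens into a small, fixed number
    of latent tokens (illustrative; dim must be divisible by num_heads)."""

    def __init__(self, dim, num_latents=8, num_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens, audio_tokens):
        # video_tokens: (batch, Tv, dim), audio_tokens: (batch, Ta, dim)
        joint = torch.cat([video_tokens, audio_tokens], dim=1)
        queries = self.latents.expand(joint.shape[0], -1, -1)
        fused, _ = self.attn(queries, joint, joint)   # (batch, num_latents, dim)
        return fused                                  # compact per-segment representation

# Per segment t: z_t = combiner(video_seg_t, audio_seg_t); a causal Transformer
# then models p(z_t | z_<t), keeping sequence length bounded as the video grows.
```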

https://www.aminer.cn/pub/6551895f939a5f4082a5debc/?f=cs

9.FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores

This paper introduces FlashFFTConv, an optimization of the fast Fourier transform (FFT) convolution for long-sequence tasks. Convolution models with long filters have demonstrated state-of-the-art reasoning ability on many long-sequence tasks, but they lag behind the best-optimized Transformers in wall-clock runtime. A major bottleneck is the FFT itself, which lets long convolutions run in O(N log N) time in the sequence length N but makes poor use of the hardware. To address this, the authors propose FlashFFTConv, which computes the FFT via a matrix decomposition that uses matrix-multiply units and enables kernel fusion for long sequences, reducing I/O. The authors also present two sparse convolution algorithms: 1) partial convolution and 2) frequency-sparse convolution, which can be implemented simply by skipping blocks in the matrix decomposition, saving further memory and compute. Experimental results show that FlashFFTConv speeds up exact FFT convolutions by up to 7.93x, allows Hyena-GPT-s to achieve 2.3 points better perplexity on the PILE under the same compute budget, and lets M2-BERT-base reach a GLUE score 3.3 points higher. Additionally, FlashFFTConv achieves 96.1% accuracy on the Path-512 task, a high-resolution vision task on which no previous model had surpassed 50%. Meanwhile, partial convolution enables long-sequence models to process the longest human genes (2.3M base pairs), and frequency-sparse convolution speeds up pre-trained models while maintaining or improving model quality.
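
For reference, the baseline algorithm being optimized is the standard FFT convolution, shown below in PyTorch. This is the textbook O(N log N) method, not the fused, tensor-core-oriented kernel the paper contributes:

```python
import torch

def fft_conv(u, k):
    """Long convolution via real FFTs: zero-pad to 2N so the circular
    convolution becomes a linear one, multiply in frequency, transform back."""
    seqlen = u.shape[-1]
    fft_size = 2 * seqlen
    u_f = torch.fft.rfft(u, n=fft_size)          # (..., fft_size // 2 + 1), complex
    k_f = torch.fft.rfft(k, n=fft_size)
    y = torch.fft.irfft(u_f * k_f, n=fft_size)   # pointwise product = convolution
    return y[..., :seqlen]                       # keep the causal part

u = torch.randn(4, 512, 1024)                    # (batch, channels, sequence)
k = torch.randn(512, 1024)                       # one sequence-length filter per channel
y = fft_conv(u, k)                               # same shape as u
```

FlashFFTConv keeps this same mathematics but computes the FFT as a series of matrix multiplications that map onto tensor cores, fusing the surrounding steps to cut I/O.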

https://www.aminer.cn/pub/655189e5939a5f4082a613e4/?f=cs

10.ADaPT: As-Needed Decomposition and Planning with Language Models

This paper introduces ADaPT, a method for overcoming execution failures in complex tasks. The method uses large language models (LLMs) for interactive decision-making tasks and dynamically decomposes and plans complex sub-tasks to adapt both to the environment and to the LLM's capabilities. ADaPT addresses the shortcomings of existing methods on complex tasks by recursively decomposing sub-tasks as needed, adapting to task complexity and LLM capability. Experimental results show that ADaPT's success rate on tasks such as ALFWorld, WebShop, and TextCraft is significantly higher than existing baselines, by up to 28.3%. Through in-depth analysis, the paper illustrates the importance of multi-level decomposition and demonstrates that ADaPT dynamically adjusts to the capabilities of the executor LLM as well as to task complexity.
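
A minimal sketch of the as-needed recursion (hypothetical `executor_llm` and `planner_llm` callables, not the paper's exact prompts):

```python
def adapt(task, env, executor_llm, planner_llm, max_depth=3):
    """Try the whole task first; decompose and recurse only if execution fails."""
    if executor_llm(task, env):                  # the executor attempts the task directly
        return True
    if max_depth == 0:                           # give up rather than decompose forever
        return False
    # Decompose only as needed, so easy tasks never pay the planning cost.
    subtasks = planner_llm(f"Decompose into simpler sub-tasks: {task}")
    return all(adapt(sub, env, executor_llm, planner_llm, max_depth - 1)
               for sub in subtasks)
```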

https://www.aminer.cn/pub/6551898c939a5f4082a5f245/?f=cs

11.Prompt Engineering a Prompt Engineer

This paper studies prompt engineering, an important task for optimizing the performance of large language models (LLMs). The authors propose a new meta-prompt framework, named PE2, to guide LLMs in automatic prompt engineering more effectively. The framework includes key components such as a step-by-step reasoning template and context specification to improve performance. Furthermore, inspired by common optimization concepts such as batch size, step size, and momentum, the authors introduce verbalized counterparts of these notions into the meta-prompt and study their effects. PE2 outperforms previous automatic prompt engineering baselines across multiple benchmarks, demonstrating its versatility. Additionally, PE2 can meaningfully edit erroneous or incomplete prompts and exhibits non-trivial counterfactual reasoning ability.
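
A skeleton of such a meta-prompt might look like the following (a paraphrased illustration, not the PE2 template verbatim):

```python
META_PROMPT = """\
You are optimizing a prompt for a language model.

Current prompt:
{prompt}

Here is a batch of {batch_size} examples where the model failed:
{failed_examples}

Reason step by step: (1) for each example, explain why the current prompt
led to the error; (2) propose an edit changing at most {step_size} words,
consistent with the direction of your recent edits: {momentum};
(3) output the full revised prompt.
"""

def refine_prompt(prompt, failures, edit_history, llm, batch_size=4, step_size=10):
    """One optimization step; batch size, step size, and momentum appear here as
    verbalized counterparts of the usual optimizer hyperparameters."""
    filled = META_PROMPT.format(prompt=prompt, batch_size=batch_size,
                                failed_examples=failures[:batch_size],
                                step_size=step_size, momentum=edit_history[-3:])
    return llm(filled)
```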

https://www.aminer.cn/pub/65518957939a5f4082a5dbca/?f=cs

12.FinGPT: Large Generative Models for a Small Language

This paper introduces FinGPT: Large Generative Models for a Small Language. Large language models (LLMs) excel in natural language processing and many other tasks, but most open models offer very limited support for smaller languages, and LLM work tends to focus on languages for which nearly unlimited pre-training data is available. In this paper, the authors examine the challenges of creating an LLM for Finnish, a language spoken by less than 0.1% of the world's population. They assemble a Finnish corpus combining web crawls, news, social media, and e-books, and pre-train models in two ways: 1) training seven monolingual models from scratch (186M to 13B parameters), called FinGPT; 2) continuing the pre-training of the multilingual BLOOM model on a mixture of its original training data and Finnish, yielding a 176-billion-parameter model called BLUUMI. For evaluation, the authors introduce FIN-bench, a Finnish-language version of BIG-bench tasks, and they also assess other model qualities such as toxicity and bias. Their models and tools are publicly available at https://turkunlp.org/gpt3-finnish.

https://www.aminer.cn/pub/65518945939a5f4082a5d446/?f=cs

13.Language Models can be Logical Solvers

This paper explores the application of language models to logical reasoning. Logical reasoning is a fundamental aspect of human intelligence and a key component of problem solving and decision making. Recent advances have enabled large language models (LLMs) to exhibit reasoning capabilities, but complex logical reasoning remains a challenge. The current state-of-the-art approach uses LLMs to parse natural-language logic problems into symbolic representations, and then employs an external logic solver that takes the symbolic representation as input and outputs the answer. Despite its impressive performance, this approach fails whenever there is a parsing error, in which case the external logic solver cannot execute and the logical question goes unanswered.

In this paper, the authors introduce a new language model called LoGiPT, which directly emulates the reasoning process of a logic solver and avoids parsing errors by learning to adhere strictly to solver syntax and semantics. LoGiPT is fine-tuned on a newly constructed instruction-tuning dataset that reveals and refines the hidden reasoning process of a deductive solver. Experimental results show that LoGiPT outperforms existing solver-augmented language models and few-shot prompting methods built on competitive LLMs such as ChatGPT and GPT-4 on two public deductive reasoning datasets.
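
For intuition, a single instruction-tuning instance for a solver-emulating model might look roughly like the following (a hypothetical illustration; the paper's actual data uses the solver's own syntax and differs in detail):

```python
example = {
    "instruction": "Act as a deductive logic solver. Translate the premises into "
                   "symbolic form, then show every derivation step.",
    "input": "Premises: All birds can fly. Tweety is a bird. "
             "Question: Can Tweety fly?",
    "output": (
        "Facts: bird(tweety).\n"
        "Rules: can_fly(X) :- bird(X).\n"
        "Step 1: bird(tweety) unifies with the rule body, binding X = tweety.\n"
        "Step 2: derive can_fly(tweety).\n"
        "Answer: Yes."
    ),
}
```

Training on such traces teaches the model to produce solver-style derivations directly, so a malformed parse no longer causes a hard failure in an external tool.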

https://www.aminer.cn/pub/65518a79939a5f4082a653a8/?f=cs


END


We have added a "Daily Selected New Papers" topic on the AMiner homepage. Click "Subscribe" and "Add to Knowledge Base" to receive all the paper updates!


Origin blog.csdn.net/AI_Conf/article/details/134396907