Real-time tracking of scientific research trends | July 18 selected new papers from Microsoft, Tsinghua, and other institutions

As a researcher, you need to search and browse a large amount of academic literature every day to keep up with the latest scientific and technological progress and research results. However, traditional retrieval and reading methods can no longer keep pace with these needs.

ChatPaper is a literature knowledge tool that integrates retrieval, reading, and knowledge question-and-answer. It helps you search and read papers more efficiently, stay on top of the latest research trends in your field, and makes scientific research easier.

Combined with the frontier-trends subscription feature, ChatPaper selects the day's popular new arXiv papers and compiles them into paper digests, so that everyone can grasp cutting-edge trends more quickly.
If you want an in-depth dialogue about a particular paper, you can paste the paper's link into your browser or go directly to the ChatPaper page: https://www.aminer.cn/chat/g/

List of Featured New Papers for July 18, 2023:

1. TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT paper details page

Link: https://www.aminer.cn/pub/64b60eaa3fda6d7f06eaed33/?f=cs

ChatPaper review: The paper discusses the time and effort required to analyze and manipulate tables in real-world databases. Advances in large language models (LLMs) have made it possible to interact with tables using natural-language input, bringing this capability closer to reality. The authors propose TableGPT, a unified fine-tuned framework that enables LLMs to understand and manipulate tables using external functional commands. TableGPT introduces the ability to interact with tables seamlessly, supporting question answering, data manipulation (such as insert, delete, query, and modify operations), data visualization, analysis-report generation, and automated prediction. TableGPT aims to provide convenience and ease of use, enabling users to easily exploit tabular data. At its core is the new concept of a global table representation, which enables LLMs to gain a deep understanding of the entire table beyond its meta-information. By jointly training the LLM on tabular and textual modalities, TableGPT achieves a deep understanding of tabular data and can perform complex operations through chains of commands. Importantly, TableGPT has the advantage of being a self-contained system rather than relying on external API interfaces. Furthermore, it supports efficient data flow, query rejection (where appropriate), and private deployment, enabling faster domain-specific fine-tuning and ensuring data privacy, which enhances the framework's adaptability to specific use cases.
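
To make the "chain of commands" idea concrete, here is a minimal sketch (not the authors' code) of how a table-aware LLM's structured output might be executed against a DataFrame. The command names and the dict-based format are hypothetical illustrations only.

```python
# Hypothetical command chain emitted by an LLM for: "add Shenzhen with sales 80,
# then show cities with sales above 90" -- executed against a pandas DataFrame.
import pandas as pd

df = pd.DataFrame({"city": ["Beijing", "Shanghai"], "sales": [120, 95]})

command_chain = [
    {"op": "insert", "row": {"city": "Shenzhen", "sales": 80}},
    {"op": "query", "condition": "sales > 90"},
]

def execute(df: pd.DataFrame, commands: list) -> pd.DataFrame:
    for cmd in commands:
        if cmd["op"] == "insert":
            df = pd.concat([df, pd.DataFrame([cmd["row"]])], ignore_index=True)
        elif cmd["op"] == "query":
            df = df.query(cmd["condition"])
    return df

print(execute(df, command_chain))
```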

2. INVE: Interactive Neural Video Editing paper details page

Link: https://www.aminer.cn/pub/64b60e7d3fda6d7f06ea80e3/?f=cs

ChatPaper review: The paper describes two major problems with existing video editing solutions: they are slow and they support only a limited range of editing use cases. To address these challenges, the researchers employ an efficient network architecture with hash-grid-based encoding, which greatly increases processing speed. In addition, they learn bidirectional functions between the image atlas and the frames and introduce vectorized editing, enabling a wider range of edits directly on either the atlas or the frames. Compared with existing solutions, INVE shortens learning and inference time and supports a greater variety of video editing operations. Comprehensive quantitative and qualitative analyses demonstrate INVE's advantages and improved performance over existing solutions for interactive video editing.
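
A minimal sketch, assuming PyTorch, of the bidirectional frame-atlas mapping idea: one network maps a frame point (x, y, t) into atlas space, another maps an atlas point back into a given frame, so an edit made in one frame can be propagated to others. Layer sizes and training details are illustrative, not INVE's actual design (which additionally relies on hash-grid encodings for speed).

```python
import torch
import torch.nn as nn

class CoordMLP(nn.Module):
    """Tiny coordinate MLP standing in for the learned mapping networks."""
    def __init__(self, in_dim: int, out_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

frame_to_atlas = CoordMLP(in_dim=3, out_dim=2)   # (x, y, t) -> (u, v)
atlas_to_frame = CoordMLP(in_dim=3, out_dim=2)   # (u, v, t') -> (x', y')

# A pixel edited in frame t is mapped into the atlas; the same atlas location
# can then be mapped back into any other frame t'.
p = torch.tensor([[0.3, 0.7, 0.1]])              # normalized (x, y, t)
uv = frame_to_atlas(p)
p_other = atlas_to_frame(torch.cat([uv, torch.tensor([[0.5]])], dim=1))
print(uv.shape, p_other.shape)
```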

3. Language Conditioned Traffic Generation paper details page

Link: https://www.aminer.cn/pub/64b60eaa3fda6d7f06eaea41/?f=cs

ChatPaper review: The paper highlights the importance of simulation in autonomous-driving development and one of its main current challenges: the lack of realistic, scalable, and interesting content. It introduces a new method, LCTGen, which uses language as a source of supervision for dynamic traffic scene generation. LCTGen combines a large language model with a Transformer-based decoder architecture that selects likely locations from a map dataset and generates both an initial traffic distribution and the behavior of each vehicle. In experiments, LCTGen achieves higher realism and fidelity than prior work in both unconditional and conditional traffic scene generation.
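
A minimal sketch of the two-stage idea: a language model turns the text prompt into a compact structured description of the scene, and a generator then turns that description into concrete agent placements on a map. The field names, the stubbed LLM stage, and the random placement are illustrative only, not LCTGen's actual representation.

```python
import random

def describe_scene_with_llm(prompt: str) -> dict:
    # Placeholder for the LLM stage; a real system would parse the prompt.
    return {"num_vehicles": 4, "behavior": "slow down near the intersection"}

def generate_traffic(spec: dict, map_extent=(0.0, 100.0)) -> list:
    # Stand-in for the Transformer decoder: place each vehicle on the map
    # and attach the requested behavior.
    random.seed(0)
    return [
        {"x": random.uniform(*map_extent),
         "y": random.uniform(*map_extent),
         "behavior": spec["behavior"]}
        for _ in range(spec["num_vehicles"])
    ]

spec = describe_scene_with_llm("four cars slowing down before a busy intersection")
print(generate_traffic(spec))
```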

4. CoTracker: It is Better to Track Together paper details page

Link: https://www.aminer.cn/pub/64b60e7d3fda6d7f06ea80be/?f=cs

ChatPaper review: The paper points out that traditional video motion-prediction methods either estimate the instantaneous motion of all points in a given frame via optical flow, or track the motion of each point in the video independently. The latter holds even for powerful deep-learning methods that can track points through occlusion. Tracking points independently ignores the potentially strong correlations between them, for example when they belong to the same object, which can hurt performance. The paper therefore proposes CoTracker, an architecture that jointly tracks multiple points throughout a video. It combines ideas from optical flow and tracking to build a new, flexible, and powerful model based on a Transformer network, which models the temporal correlation of different points through dedicated attention layers. The Transformer iteratively updates the estimates of multiple trajectories. It can be applied to very long videos in a sliding-window fashion, with an unrolled training loop designed for this case. It can track one or more points at the same time and supports adding new points to track at any time. The result is a flexible and powerful tracking algorithm that achieves superior performance on nearly all benchmarks, addressing the problem of multi-point tracking in videos.
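
A minimal sketch, assuming PyTorch, of the core idea of joint tracking: trajectory tokens for many points over a temporal window attend to one another, so correlated points (for example on the same object) can share information while their position estimates are refined iteratively. The shapes and update rule are illustrative, not CoTracker's actual design.

```python
import torch
import torch.nn as nn

num_points, window, dim = 8, 16, 64
track_tokens = torch.randn(num_points * window, 1, dim)   # (seq, batch, dim)

attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4)
to_delta = nn.Linear(dim, 2)          # predict a 2D position update per token

positions = torch.zeros(num_points, window, 2)
for _ in range(3):                    # iterative refinement over the window
    # Joint attention across all points and all frames in the window.
    track_tokens, _ = attn(track_tokens, track_tokens, track_tokens)
    delta = to_delta(track_tokens).view(num_points, window, 2)
    positions = positions + delta     # update every trajectory estimate

print(positions.shape)                # (num_points, window, 2)
```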

5. Diffusion Models Beat GANs on Image Classification paper details page

Link: https://www.aminer.cn/pub/64b60eaf3fda6d7f06eaf562/?f=cs

ChatPaper review: The paper presents a unified representation-learning result: diffusion models outperform generative adversarial networks (GANs) on image classification tasks. Diffusion models are a state-of-the-art method for image generation, denoising, inpainting, super-resolution, manipulation, and more, producing images with high fidelity, diversity, and novelty by training a U-Net to predict and remove noise from images. The authors find that the intermediate feature maps of the U-Net can serve as embeddings of discriminative information and can be used for classification. They explore the best ways to extract and use these embeddings for classification and show promising results on the ImageNet classification task. They also study diffusion models in transfer-learning settings on several fine-grained image classification datasets, comparing these embeddings with those produced by other architectures and pre-training methods.
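
A minimal sketch, assuming PyTorch, of the probing idea described above: grab an intermediate feature map from a frozen diffusion U-Net with a forward hook, pool it into an embedding, and train a simple classifier on top. The tiny stub network, layer choice, and dimensions are placeholders, not the paper's actual backbone or the best settings it reports.

```python
import torch
import torch.nn as nn

class TinyUNetStub(nn.Module):
    """Stand-in for a pretrained diffusion U-Net; only its mid-level features matter here."""
    def __init__(self):
        super().__init__()
        self.down = nn.Conv2d(3, 64, 3, stride=2, padding=1)
        self.mid = nn.Conv2d(64, 128, 3, padding=1)

    def forward(self, x, t):
        return self.mid(torch.relu(self.down(x)))

unet = TinyUNetStub().eval()
features = {}
# Hook a mid-level block and global-average-pool its output into an embedding.
unet.mid.register_forward_hook(
    lambda m, i, o: features.update(emb=o.mean(dim=(2, 3)))
)

classifier = nn.Linear(128, 10)          # linear probe on the pooled features
images = torch.randn(4, 3, 64, 64)
t = torch.full((4,), 100)                # a fixed diffusion timestep

with torch.no_grad():                    # the diffusion backbone stays frozen
    unet(images, t)
logits = classifier(features["emb"])
print(logits.shape)                      # (4, 10)
```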

6. Retentive Network: A Successor to Transformer for Large Language Models paper details page

Link: https://www.aminer.cn/pub/64b60eaa3fda6d7f06eaecfd/?f=cs

ChatPaper review: The paper proposes RetNet, a network architecture for building large language models that simultaneously achieves training parallelism, low-cost inference, and strong performance. It first theoretically derives the connection between recurrence and attention, and then proposes a retention mechanism for sequence modeling that supports three computation paradigms: parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation enables training parallelism, while the recurrent representation enables low-cost inference, improving decoding throughput, latency, and GPU memory usage without sacrificing performance. The chunkwise recurrent representation enables efficient modeling of long sequences with linear complexity: each chunk is encoded in parallel while the chunks are summarized recurrently. Experimental results on language modeling show that RetNet achieves promising results in terms of scalability, parallel training, low-cost deployment, and efficient inference.
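
A minimal sketch, assuming PyTorch, of the recurrent form of retention that makes constant-cost-per-token inference possible: a decayed state S accumulates key-value outer products, and each output is the query applied to S. The head dimension and decay value are illustrative; multi-head structure, scaling, and gating are omitted.

```python
import torch

d = 8                      # head dimension (toy value)
gamma = 0.9                # per-head decay factor
S = torch.zeros(d, d)      # recurrent retention state

def retention_step(q, k, v, S, gamma):
    # S_n = gamma * S_{n-1} + k_n^T v_n ;  o_n = q_n S_n
    S = gamma * S + torch.outer(k, v)
    return q @ S, S

outputs = []
for _ in range(5):                         # a toy sequence of 5 tokens
    q, k, v = (torch.randn(d) for _ in range(3))
    o, S = retention_step(q, k, v, S, gamma)
    outputs.append(o)

print(torch.stack(outputs).shape)          # (5, d)
```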

7. Planting a SEED of Vision in Large Language Model paper details page

Link: https://www.aminer.cn/pub/64b60eaa3fda6d7f06eaeaa5/?f=cs

ChatPaper review: This study addresses the problem of using image tokenizers in large language models. Research on image tokenizers had previously stalled: frameworks employing quantized visual tokens lost prominence due to poor performance and convergence in multimodal comprehension (compared to BLIP-2, etc.) or generation (compared to Stable Diffusion, etc.). Despite these limitations, the authors remain confident in their ability to naturally unify visual and textual representations and facilitate scalable multimodal training of LLMs. The study identifies two key principles for SEED's architecture and training that effectively ease subsequent alignment with LLMs. First, image tokens should be independent of 2D physical patch positions and should instead be produced with a 1D causal dependency, exhibiting an intrinsic interdependence consistent with the left-to-right autoregressive prediction mechanism in LLMs. Second, image tokens should capture high-level semantics consistent with the degree of semantic abstraction in words, and should be optimized for both discriminativeness and reconstruction during the tokenizer training phase. As a result, off-the-shelf LLMs can incorporate SEED for image-to-text and text-to-image generation with efficient LoRA tuning. Comprehensive multimodal pre-training and instruction tuning may yield better results and are left for future research. This version of SEED was trained in 5.7 days using only 64 V100 GPUs and 5 million publicly available image-text pairs. The preliminary study highlights the great potential of discrete visual tokens in versatile multimodal LLMs and the importance of proper image tokenizers in broader research.
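
A minimal sketch of the interface such a tokenizer implies: an image becomes a short 1D sequence of discrete token ids that can simply be concatenated with text tokens and predicted left-to-right by the LLM. The vocabulary sizes, sequence length, and the random stand-in tokenizer are made-up placeholders, not SEED's learned encoder and quantizer.

```python
import torch

text_vocab, visual_vocab = 32000, 8192
num_visual_tokens = 32                     # a short 1D causal sequence per image

def tokenize_image(image: torch.Tensor) -> torch.Tensor:
    # Placeholder: SEED would produce these ids with a causal visual encoder
    # plus quantizer, not by random sampling as done here.
    return torch.randint(0, visual_vocab, (num_visual_tokens,)) + text_vocab

image_ids = tokenize_image(torch.randn(3, 224, 224))
text_ids = torch.tensor([101, 202, 303])       # made-up text token ids
sequence = torch.cat([text_ids, image_ids])    # one left-to-right stream for the LLM
print(sequence.shape)
```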

8. AlpaGasus: Training A Better Alpaca with Fewer Data paper details page

Link: https://www.aminer.cn/pub/64b60eaf3fda6d7f06eaf561/?f=cs

ChatPaper review: The paper notes that the instruction-tuning datasets commonly used for instruction-following large language models contain many low-quality instances with wrong or irrelevant responses, which mislead and harm instruction fine-tuning. The paper proposes a simple yet effective data-selection strategy that uses a powerful language model such as ChatGPT to automatically identify and remove low-quality data. On this basis, the paper introduces AlpaGasus, which is fine-tuned on only 9k high-quality examples filtered from the 52k Alpaca data. AlpaGasus significantly outperforms the original Alpaca model on multiple test sets, and its 13B version achieves >90% of its teacher LLM's (i.e., Text-Davinci-003) performance on the test tasks. It also trains 5.7x faster, reducing the training time of the 7B version from 80 minutes (Alpaca) to 14 minutes. Overall, AlpaGasus demonstrates a new data-centric instruction fine-tuning paradigm that can be applied generally to instruction-tuning data, enabling faster training and better instruction-following models.
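
A minimal sketch of the data-selection idea: score each instruction-response pair with a strong LLM acting as a judge and keep only high-scoring examples before fine-tuning. The prompt wording, threshold, and the `score_with_llm` callable are hypothetical; the paper's exact rating prompt and the 9k/52k split are not reproduced here.

```python
from typing import Callable

def filter_instruction_data(dataset, score_with_llm: Callable[[str], float],
                            threshold: float = 4.5):
    kept = []
    for example in dataset:
        prompt = (
            "Rate the quality of this instruction-response pair from 1 to 5.\n"
            f"Instruction: {example['instruction']}\n"
            f"Response: {example['output']}"
        )
        # Keep only examples the judge LLM rates at or above the threshold.
        if score_with_llm(prompt) >= threshold:
            kept.append(example)
    return kept

# Usage with a dummy scorer (a real one would call ChatGPT or another judge model):
data = [{"instruction": "Translate 'hello' to French.", "output": "Bonjour."}]
print(len(filter_instruction_data(data, score_with_llm=lambda p: 5.0)))
```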

9. BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs paper details page

Link: https://www.aminer.cn/pub/64b60eaa3fda6d7f06eaecd4/?f=cs

ChatPaper review: The paper describes a limitation of current multimodal language models (LMs): they build only a coarse-grained mapping to multimodal input and lack the ability to ground their outputs in specific parts of it. To improve user experience and broaden the application scenarios of multimodal LMs, this work proposes BuboGPT, a multimodal LM with visual grounding that performs cross-modal interaction between vision, audio, and language, providing a fine-grained understanding of visual objects and the other given modalities. By pinpointing the exact location of an object in an image while generating a response or describing the object, BuboGPT enables precise visual grounding. The contributions include: 1) an off-the-shelf visual grounding module based on SAM, which extracts entities from a sentence and finds the corresponding masks in the image; 2) a two-stage training scheme and an instruction dataset that endow the model with joint understanding of text, image, and audio. Experiments show that BuboGPT has excellent multimodal understanding and visual grounding capabilities when interacting with humans, and it performs consistently well whether or not the provided modality combinations are aligned.
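
A minimal sketch of the grounding step described above: noun phrases extracted from the model's response are matched against candidate region masks (for example, from SAM) by comparing text and region embeddings. The embedding functions below are random stubs used only to make the sketch runnable; they are not BuboGPT's modules.

```python
import numpy as np

def embed_text(phrase: str) -> np.ndarray:
    # Stub text encoder: deterministic random vector per phrase.
    rng = np.random.default_rng(abs(hash(phrase)) % (2**32))
    return rng.normal(size=32)

def embed_region(mask_id: int) -> np.ndarray:
    # Stub region encoder: deterministic random vector per candidate mask.
    rng = np.random.default_rng(mask_id)
    return rng.normal(size=32)

def ground(phrases, mask_ids):
    # Assign each phrase to the candidate mask with the most similar embedding.
    links = {}
    for phrase in phrases:
        t = embed_text(phrase)
        scores = [float(t @ embed_region(m)) for m in mask_ids]
        links[phrase] = mask_ids[int(np.argmax(scores))]
    return links

print(ground(["a dog", "a red ball"], mask_ids=[0, 1, 2]))
```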


How to use ChatPaper?

Using ChatPaper is very simple: open the AMiner homepage and enter the ChatPaper page from the navigation bar at the top of the page or from the lower-right corner.
On the ChatPaper page, you can choose to have a dialogue based on a single document or a dialogue based on the entire library (personal library), and you can choose to upload a local PDF or directly search for documents on AMiner.


Source: blog.csdn.net/AI_Conf/article/details/131807885