This weekly issue provides an overview of the latest news in audio and video technology.
News contributions: [email protected].
The field of AI is changing by the day. RLHF is gradually coming to be seen as outdated technology, but the new route remains unclear: should we adopt human-free feedback, or keep improving the RLHF mechanism?
After AlphaFold, Google DeepMind has again stunned the field with AlphaMissense, an AI model that successfully predicted the effects of 71 million "missense mutations" and is expected to help crack hard problems in human genetics.
Google DeepMind has proposed a new optimization framework OPRO, which can guide large language models to gradually improve solutions and achieve various optimization tasks only through natural language descriptions.
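For intuition, here is a toy sketch of the OPRO loop in plain Python. The helper names are my own, and the `propose_candidate` stub merely perturbs the best solution seen so far; real OPRO instead builds a natural-language meta-prompt from the scored history and asks an LLM to propose the next candidate.

```python
import random

def propose_candidate(history, rng):
    # Stand-in for the LLM call: perturb the best-scoring solution so far.
    # Real OPRO would format `history` into a meta-prompt and query a model.
    best_x, _ = max(history, key=lambda p: p[1])
    return best_x + rng.uniform(-1, 1)

def opro(score, steps=200, seed=0):
    """Iteratively propose solutions, score them, and feed results back."""
    rng = random.Random(seed)
    x0 = rng.uniform(-10, 10)
    history = [(x0, score(x0))]
    for _ in range(steps):
        x = propose_candidate(history, rng)
        history.append((x, score(x)))  # scored attempts become the next "prompt"
    return max(history, key=lambda p: p[1])

# Maximize -(x - 3)^2; the loop should home in near x = 3.
x, val = opro(lambda x: -(x - 3) ** 2)
print(round(x, 2))
```

The key idea the sketch preserves is that the optimizer sees only (solution, score) pairs, not gradients, which is what lets OPRO drive optimization purely through descriptions and feedback.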
On the battlefield of multimodal large models, word is already getting out: according to foreign media reports, OpenAI's new multimodal model Gobi appears to be in preparation, and a showdown between Google and OpenAI seems imminent.
Following a wave of large language models that are multimodal only on the input side, a Chinese team at the National University of Singapore recently open-sourced a "grand unification" multimodal model that supports any-modality input and any-modality output, which has taken the AI community by storm.
This paper evaluates the performance of 31 large language models (LLMs) on interpreting radiology reports, i.e., deriving the diagnostic impression from radiology findings. It is among the most comprehensive global evaluations to date of LLMs for natural language processing (NLP) in radiology, filling a knowledge gap by benchmarking mainstream LLMs developed both overseas and in China on this critical radiology NLP task.
Currently, large language models (LLMs) have shown excellent capabilities in handling various downstream tasks in the NLP field. In particular, pioneering models such as GPT-4 and ChatGPT have been trained on large amounts of text data, giving them powerful text understanding and generation capabilities, the ability to generate coherent and context-sensitive responses, and high versatility across NLP tasks.
Northeastern University releases STTracker: a spatiotemporal tracker for 3D single-object tracking
Instead of taking two frames of point clouds as input, this paper feeds in multiple frames to encode the target's spatiotemporal information, implicitly learning its motion, establishing correlations across frames, and efficiently tracking the target in the current frame. Moreover, rather than fusing raw point features directly, the point cloud features are first cropped into multiple patches, a sparse attention mechanism then encodes patch-level similarities, and finally the multi-frame features are fused. Extensive experiments show that the method achieves competitive results on challenging large-scale benchmarks (62.6% on KITTI, 49.66% on NuScenes).
Learning LVI-SAM from coarse to fine: analysis of the essence of the original paper
This article is the third part of the LVI-SAM learning series. Reading the original paper before diving into the LVI-SAM source code helps you keep your bearings when the code gets difficult, avoid detours, and analyze the source more efficiently.
BIT open-sources TDLE: 2D LiDAR exploration with region division for hierarchical planning
Exploration systems are critical to increasing robot autonomy. Because the future planning space is unpredictable, existing methods either adopt inefficient greedy strategies or require large amounts of resources to obtain a global solution. This work addresses the challenge of obtaining global exploration routes with minimal computational resources. A hierarchical planning framework dynamically divides the planning space into sub-regions and orders them to provide global guidance for the exploration problem. Specific exploration targets are then selected using metrics consistent with the sub-region ordering, accounting for the estimated spatial structure and extending the planning space into unknown regions. Extensive simulations and field tests demonstrate the effectiveness of the approach compared with existing 2D LiDAR-based methods.
Zhejiang University's Gao Fei team releases Robo-Centric ESDF: a fast and accurate whole-body collision evaluation tool for planning with arbitrarily shaped robots
Hardware competition intensifies, the content field heats up, and the 3D track is making waves again
The AI boom at the start of the year brought the first wave of excitement, drawing intense industry attention to 3D content creation; in June, Apple launched Vision Pro, proclaiming the arrival of the "spatial computing era". Since 3D content is one of the keys to that era, the 3D content market is stirring once again.
Quest 3, which can make use of mesh data and depth data, will greatly improve the scanning experience, enabling lifelike virtual objects with a real sense of depth as well as realistic interactions with them.
According to public filings with the U.S. Federal Communications Commission, a new smart-glasses device registered under Luxottica Group with the product name Ray-Ban Stories has passed FCC certification. This means the second-generation Ray-Ban Stories, a collaboration between Meta and Ray-Ban parent company Luxottica Group, is expected to be officially unveiled at the Connect conference on September 27.
Meta AR/VR patent shares use of wrist-mounted wearable device to detect gestures
Meta believes that scrolling lists and browsing content in XR with gestures, rather than a controller, will improve the mobile user experience, and the team has therefore filed a patent titled "Scrolling and navigation in virtual reality." Besides recognizing gestures via the headset's hand tracking, Meta says gestures can also be detected by wrist-mounted wearable devices.
Intel releases new chip, 288-core Xeon is on the way
In the early morning of September 20th, Beijing time, Intel held a grand "Intel Innovation" event in San Francisco. At the beginning of the meeting, Intel CEO Pat Gelsinger first said that AI represents the arrival of a new era and creates huge opportunities. Today, chips form a $574 billion industry and drive an estimated $8 trillion global technology economy.
Chips are moving to the atomic level
The world can't stop talking about chips, but the real excitement is about their ingredients: the atomic-scale transistors that, when carved, layered, and latticed into semiconductor nano-universes, give microchips their unfathomable virtuosity. By comparison, the chips themselves are just visible little chunks carved out of silicon.
Jim Keller’s new thoughts on chips
Keller, a former "chip guru" at tech giants like Intel and Tesla, is using his years of experience to develop processors made up of a grid of cores called Tensix cores. These devices include network communications hardware that "talks" to other processors directly over the network rather than through DRAM.
Accelerating diffusion-based text-to-audio generation with consistency distillation
Diffusion models underpin most text-to-audio (TTA) generation. However, they are slow at inference because of iterative queries to the underlying denoising network, making them unsuitable for latency- or compute-constrained scenarios. This work adapts the recently proposed consistency-distillation framework to train a TTA model that needs only a single query to the neural network.
https://arxiv.org/pdf/2309.10740v1.pdf
Sound source localization is all about cross-modal alignment
Humans can easily perceive the direction of a sound source in a visual scene, a capability known as sound source localization. Current learning-based research approaches the problem mainly from a localization perspective. However, existing techniques and benchmarks overlook a more important aspect: cross-modal semantic understanding, which is crucial for genuine sound source localization and for handling audiovisual events with mismatched semantics, such as silent objects or off-screen sounds. To address this, the paper proposes cross-modal alignment as a joint task alongside sound source localization, so as to better learn the interaction between the audio and visual modalities.
https://arxiv.org/pdf/2309.10724v1.pdf
Analysis of Audition RMS calculation principle
Decibel (deci-Bel, dB) is a common concept in audio. We often hear people say a sound is so many dB, yet the value sometimes appears positive and sometimes negative. A previous article discussed the differences between several kinds of dB: positive dB readings come from a sound level meter, while negative dB values are what audio software such as Audition displays. So what exactly is the dB shown by software like Audition, and how is it calculated? This article walks through this simple question.
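As a minimal sketch of what such software computes (my own illustration, not Audition's exact code): the RMS of the samples is referenced to digital full scale, so the result in dBFS is zero at full scale and negative for everything quieter.

```python
import math

def rms_dbfs(samples, full_scale=1.0):
    """RMS level in dBFS: 0 dBFS = digital full scale, so real signals read negative."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")  # digital silence has no finite dB value
    return 20 * math.log10(rms / full_scale)

# A full-scale sine wave has RMS 1/sqrt(2), i.e. about -3.01 dBFS.
sine = [math.sin(2 * math.pi * k / 1000) for k in range(1000)]
print(round(rms_dbfs(sine), 2))  # → -3.01
```

This is why the software's meter is negative: it measures against the loudest value the digital format can represent, unlike a sound level meter, which measures against the threshold of hearing.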
L2HC, the world's first unified-architecture, full-bit-rate wireless audio codec standard, was officially released today. It supports a maximum transmission bit rate of 1920 Kbps, exceeding Apple's AAC, Sony's LDAC, the Qualcomm-led aptX Lossless, and other standards. Huawei's FreeBuds Pro 3 is reportedly the first product to support the L2HC intelligent lossless audio codec standard, delivering the world's first 1.5 Mbps lossless sound-quality experience and supporting 64-1920 Kbps, 96 kHz/24-bit audio.
With the spread of the Internet and terminal devices, livestreaming has become increasingly common in daily life, and more and more people interact with streamers as a form of entertainment. However, frequent stuttering on some platforms and monotonous gifting effects greatly degrade the viewing experience. LiveVideoStack invited Jiang Min from Tencent Cloud to introduce how Tencent Cloud applies cloud rendering to livestreaming scenarios to deliver a better experience.
Unity cloud native distributed runtime
The advent of the metaverse era places many demands on real-time 3D engines. As the most widely used real-time 3D content creation engine in the games industry, Unity has proposed the Unity cloud-native distributed runtime to address these new challenges. LiveVideoStack 2023 Shanghai invited Shu Runxuan, a solution engineer from Unity China, to share practical cases, the problems faced, and how the solution addresses them, along with Unity's current thinking on related solutions.
Shallow compression, also known as mezzanine compression, is a level of video compression that effectively reduces video bandwidth while preserving overall video quality, with compression ratios typically between 2:1 and 8:1. At these ratios, both 4K and 8K programs can be transmitted over a 10G interface, greatly reducing the cost of network equipment. LiveVideoStackCon 2023 Shanghai invited Yang Haitao to introduce the practices and explorations of the AVS standards group and hardware manufacturers such as Shanghai HiSilicon in shallow compression of lossless-quality-grade video.
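A quick back-of-envelope check of that bandwidth claim (my own numbers, not from the talk): an 8K 60 fps 10-bit 4:2:2 signal runs at roughly 40 Gbps uncompressed, so a mid-range mezzanine ratio of 4:1 brings it just under a 10G interface.

```python
def uncompressed_gbps(width, height, fps, bit_depth, samples_per_pixel):
    """Raw video bit rate in Gbps (blanking and protocol overhead ignored)."""
    return width * height * fps * bit_depth * samples_per_pixel / 1e9

# 8K (7680x4320), 60 fps, 10-bit, 4:2:2 chroma (2 samples per pixel on average)
raw = uncompressed_gbps(7680, 4320, 60, 10, 2)
print(round(raw, 2))       # → 39.81  (Gbps, uncompressed)
print(round(raw / 4, 2))   # → 9.95   (Gbps at 4:1, fits a 10G interface)
```

4K at the same frame rate and bit depth is a quarter of this, so it fits a 10G link even at the mildest 2:1 ratio.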
Caton Media Xstream: Redefining Live Content Delivery Services
As the public Internet grows more complex, the best-effort delivery model can no longer satisfy the growing number of real-time content delivery services that require QoS guarantees. Traditional alternatives such as dedicated lines and satellites suffer from high deployment costs and long lead times, and cannot respond quickly to varied needs. LiveVideoStackCon invited Wei Ling from Caton Technology to introduce the Caton Media Xstream platform solution.
Pan-entertainment's overseas expansion is increasingly becoming a booming, fast-growing golden track.
A new era of audio and video: How does AIGC subvert tradition?
Over the past three years, we have witnessed disruptive changes in the way humans live and work. From short videos and interactive live broadcasts to online education and cloud meetings, audio and video technology has not only penetrated into every corner, but has also profoundly affected the way all walks of life operate.
VR entertainment and hardware developer Via Technology Global Co., Ltd. (hereinafter "Via") has developed an offline VR basketball-shooting arcade machine. The team hopes to replace the traditional shooting machine with VR technology, letting players enjoy the fun of shooting hoops without a physical basketball.
If you want to attend an audio and video technology conference, now is the time: tickets for the LiveVideoStackCon 2023 Shenzhen conference are on sale at a limited-time 10% discount, with further discounts for group registration. Register now, and we'll see you in Shenzhen.
●Time: November 24-25, 2023
●Location: Shenzhen Sentosa Hotel (Jade Branch)
●How to obtain tickets: Scan the QR code on the poster above, or consult: 13520771810 (same number on WeChat) for details.
●Official link: https://sz2023.livevideostack.com/topics