Audio and Video Technology Development Weekly | 312

This weekly issue provides an overview of the latest news in audio and video technology.

News contributions: [email protected].


Why is RLHF key to LLM training? An AI expert surveys five replacement approaches and explains in detail the upgrade to Llama 2's feedback mechanism

The field of AI changes by the day. RLHF is gradually becoming an outdated technique, yet the new route is still unclear: should we adopt human-free feedback, or keep improving the RLHF mechanism?

Inspired by ChatGPT, Google DeepMind predicts 71 million genetic mutations! AI deciphers the human genetic code in Science

Following AlphaFold, Google DeepMind has released another stunning AI model, AlphaMissense, which successfully predicted 71 million "missense mutations" and is expected to help crack hard problems in human genetics.


"Take a deep breath" to make large models perform better! Google DeepMind uses large language models to generate prompts, or does AI understand AI better?

Google DeepMind has proposed a new optimization framework, OPRO, which guides large language models to progressively improve their solutions and accomplish a variety of optimization tasks using nothing but natural language descriptions.
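
To make the idea concrete, here is a minimal sketch of what an OPRO-style loop could look like. The `llm` and `evaluate` functions are hypothetical stand-ins, and the meta-prompt wording and scoring loop are illustrative assumptions, not DeepMind's implementation.

```python
# Hypothetical OPRO-style optimization loop (illustrative sketch only).
# `llm` and `evaluate` are assumed stand-ins for a real model and task.

def llm(prompt: str) -> str:
    """Stand-in for a large-language-model completion call."""
    raise NotImplementedError

def evaluate(solution: str) -> float:
    """Task-specific score for a candidate solution (higher is better)."""
    raise NotImplementedError

def opro_loop(task_description: str, steps: int = 10):
    history = []  # (solution, score) pairs shown to the optimizer LLM
    for _ in range(steps):
        # Build a natural-language meta-prompt from past attempts,
        # sorted so the best solutions appear last.
        trajectory = "\n".join(f"solution: {s}  score: {v:.2f}"
                               for s, v in sorted(history, key=lambda p: p[1]))
        meta_prompt = (
            f"{task_description}\n"
            f"Here are previous solutions and their scores:\n{trajectory}\n"
            "Propose a new solution that scores higher."
        )
        candidate = llm(meta_prompt)
        history.append((candidate, evaluate(candidate)))
    return max(history, key=lambda p: p[1])
```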


Is GPT-5 coming? OpenAI is reported to have accelerated training of its multi-modal large model Gobi, aiming to beat Google Gemini in one fell swoop!

On the battlefield of multi-modal large models, the wind is already stirring. According to foreign media reports, OpenAI's new multi-modal model Gobi is in preparation, and a showdown between Google and OpenAI appears imminent.

Crack every modality and get infinitely close to AGI! A Chinese team in Singapore open-sources an all-purpose "grand unification" multi-modal large model

Following a string of input-side multi-modal large language models, a Chinese team at the National University of Singapore recently open-sourced a "grand unification" multi-modal large model that supports input and output in any modality, and it has quickly become popular in the AI community.

What are the potential applications of LLMs in the radiological sciences? Dozens of research institutions jointly tested 31 large models

This paper evaluates the performance of 31 large language models (LLMs) on interpreting radiology reports and deriving diagnostic information (the impression) from radiology findings. It is one of the most comprehensive global evaluations to date of LLMs for natural language processing (NLP) in the radiological sciences, and it fills a knowledge gap by benchmarking mainstream LLMs developed overseas and in China on this critical radiology NLP task.

Better than GPT-4: a 2-billion-parameter model solves arithmetic problems with nearly 100% accuracy

Currently, large language models (LLMs) have shown excellent capabilities across downstream NLP tasks. In particular, pioneering models such as GPT-4 and ChatGPT have been trained on massive amounts of text data, giving them powerful text understanding and generation abilities, the capacity to produce coherent, context-appropriate responses, and high versatility across NLP tasks.


Northeastern University releases STTracker: a spatiotemporal tracker for 3D single-object tracking

Instead of taking only two frames of point clouds as input, this paper feeds in multiple frames to encode the target's spatiotemporal information, implicitly learning its motion, establishing correlations across frames, and tracking the target efficiently in the current frame. In addition, rather than fusing point features directly, the point cloud features are first cropped into multiple patches, a sparse attention mechanism then encodes patch-level similarities, and finally the multi-frame features are fused. Extensive experiments show that the method achieves competitive results on challenging large-scale benchmarks (62.6% on KITTI, 49.66% on NuScenes).
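
As a rough illustration of the patch-fusion idea described above, here is a minimal PyTorch sketch. It is not the STTracker code: the `PatchFusion` module, its tensor shapes, and the use of dense multi-head attention (standing in for the paper's sparse attention) are all assumptions made for exposition.

```python
import torch
import torch.nn as nn

# Schematic of the described pipeline (illustrative, not the STTracker
# code): treat per-frame point features as patches, relate patches
# across frames with attention, then fuse the multi-frame features.

class PatchFusion(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        # Dense attention stands in for the paper's sparse variant.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, frame_feats):              # (batch, frames, patches, dim)
        b, f, p, d = frame_feats.shape
        tokens = frame_feats.reshape(b, f * p, d)  # patch tokens from all frames
        fused, _ = self.attn(tokens, tokens, tokens)  # patch-level similarities
        # Pool frames back into one feature map for the current frame.
        return self.proj(fused.reshape(b, f, p, d).mean(dim=1))
```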

Learning LVI-SAM from coarse to fine: analysis of the essence of the original paper

This article is the third in the LVI-SAM learning series. Reading the original paper before digging into the LVI-SAM source code helps you keep your thinking clear when the analysis gets difficult, avoid detours, and analyze the source more efficiently.

BIT open-sources TDLE: 2D LiDAR exploration using region division for hierarchical planning

Exploration systems are critical to increasing robot autonomy. Because the future planning space is unpredictable, existing methods either adopt inefficient greedy strategies or require substantial resources to obtain a global solution. This work tackles the challenge of obtaining global exploration routes with minimal computational resources. A hierarchical planning framework dynamically divides the planning space into sub-regions and orders them to provide global guidance for the exploration problem. Specific exploration targets are then selected using metrics consistent with the sub-region order, which takes the estimated spatial structure into account and extends the planning space into unknown regions. Extensive simulations and field tests demonstrate the effectiveness of the approach compared with existing 2D LiDAR-based methods.
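
The hierarchical idea lends itself to a compact sketch: order sub-regions globally first, then pick a concrete target consistent with that order. The following schematic is illustrative only, not TDLE's implementation; `region_order`, `next_target`, the brute-force ordering, and plain Euclidean distances are all invented for this example.

```python
import math
from itertools import permutations

def region_order(robot, region_centers):
    # Global layer: cheapest visiting order over sub-region centers
    # (brute force for illustration; TDLE uses its own ordering metric).
    def tour_cost(order):
        pts = [robot] + [region_centers[i] for i in order]
        return sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))
    return min(permutations(range(len(region_centers))), key=tour_cost)

def next_target(robot, region_centers, frontiers_by_region):
    # Local layer: within the first sub-region (in global order) that
    # still has frontiers, go to the nearest unexplored frontier.
    for r in region_order(robot, region_centers):
        if frontiers_by_region.get(r):
            return min(frontiers_by_region[r],
                       key=lambda f: math.dist(robot, f))
    return None  # exploration finished
```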

Zhejiang University's Gao Fei team releases a fast and accurate whole-body collision evaluation tool for arbitrary-shape robot planning

Gao Fei's team at Zhejiang University has released Robo-centric ESDF: a fast and accurate whole-body collision evaluation tool for planning with arbitrarily shaped robots.


Hardware competition keeps intensifying, the content field is heating up, and the 3D track is making waves again

The AI explosion at the start of the year brought the first wave of excitement and drew the industry's attention to 3D content creation; then in June, Apple launched Vision Pro and proclaimed the arrival of the "spatial computing era", in which 3D is one of the keys, and the 3D content market is once again in ferment.

Quest 3 online documentation reveals that it will offer a better MR 3D spatial interaction experience

Quest 3, which can make use of mesh data and depth data, will greatly enhance the scanning experience, enabling convincingly three-dimensional virtual objects and realistic interactions with them.

Meta's second-generation smart glasses Ray-Ban Stories pass FCC certification, expected to be released on September 27

According to public filings with the U.S. Federal Communications Commission, a new smart-glasses device registered under Luxottica Group with the product name Ray-Ban Stories has passed FCC certification. This means the second-generation Ray-Ban Stories, a collaboration between Meta and Ray-Ban parent company Luxottica Group, is expected to be officially unveiled at the Connect conference on September 27.

Meta AR/VR patent shares use of wrist-mounted wearable device to detect gestures

Meta believes that using gestures, rather than a controller, to scroll through lists and browse content in XR will improve the mobile user experience. The team has therefore filed a patent titled "Scrolling and navigation in virtual reality". Besides recognizing gestures through the headset's hand tracking, Meta says gestures can also be detected by wrist-mounted wearable devices.


Intel releases new chip, 288-core Xeon is on the way

In the early morning of September 20, Beijing time, Intel held a grand "Intel Innovation" event in San Francisco. Opening the event, Intel CEO Pat Gelsinger said that AI heralds the arrival of a new era and creates enormous opportunities: today, chips form a $574 billion industry and drive an estimated $8 trillion global technology economy.

Chips are moving to the atomic level

The world can't stop talking about chips, but the excitement is really about the ingredients: the atom-scale transistors that, when carved, layered, and latticed into semiconductor nano-universes, give microchips their unfathomable virtuosity. By comparison, the chips themselves are just plainly visible little slabs carved out of silicon.

Jim Keller’s new thoughts on chips

Keller, a former "chip guru" at tech giants like Intel and Tesla, is using his years of experience to develop processors made up of a grid of cores called Tensix cores. These devices include network communications hardware that "talks" to other processors directly over the network rather than through DRAM.


Accelerating diffusion-based text-to-audio generation using consistency distillation

Diffusion models underpin most text-to-audio (TTA) generation. However, they are slow at inference because they iteratively query the underlying denoising network, which makes them unsuitable for scenarios with inference-time or computational constraints. This work adapts the recently proposed consistency framework to train a TTA model that requires only a single neural network query.

https://arxiv.org/pdf/2309.10740v1.pdf
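
The speedup comes from replacing many sequential denoising queries with one. The sketch below contrasts the two schemes; `eps_model` and `consistency_model` are hypothetical networks, and the update rule is deliberately simplified rather than the paper's exact sampler.

```python
import torch

# Illustrative contrast between iterative diffusion sampling and a
# single-query consistency model (hypothetical networks, simplified math).

def diffusion_sample(eps_model, x, alphas):
    # Many sequential network queries: one per denoising step
    # (simplified x0-prediction update, not a faithful sampler).
    for t in reversed(range(len(alphas))):
        eps = eps_model(x, t)
        x = (x - (1 - alphas[t]).sqrt() * eps) / alphas[t].sqrt()
    return x

def consistency_sample(consistency_model, noise):
    # A distilled consistency model maps noise to the sample directly,
    # so inference costs a single forward pass.
    return consistency_model(noise, t=0)
```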

Sound source localization is all about cross-modal alignment

Humans can easily perceive the direction of a sound source in a visual scene, an ability known as sound source localization. Current learning-based research mainly explores the problem from the localization perspective. However, existing techniques and benchmarks neglect a more important aspect: cross-modal semantic understanding, which is crucial for genuine sound source localization. Such understanding matters for audiovisual events with mismatched semantics, for example silent objects or off-screen sounds. To address this, the paper proposes a cross-modal alignment task trained jointly with sound source localization, so as to better learn the interaction between the audio and visual modalities.

https://arxiv.org/pdf/2309.10724v1.pdf
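
For intuition, cross-modal alignment is often formulated as a symmetric contrastive objective over paired audio and visual embeddings. The snippet below shows that generic InfoNCE-style formulation; it is illustrative only and not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

# Generic audio-visual contrastive alignment (InfoNCE-style); a common
# formulation for cross-modal semantic alignment, not necessarily the
# objective used in this particular paper.

def alignment_loss(audio_emb, visual_emb, temperature=0.07):
    a = F.normalize(audio_emb, dim=-1)   # (batch, dim)
    v = F.normalize(visual_emb, dim=-1)  # (batch, dim)
    logits = a @ v.t() / temperature     # pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Matched audio-visual pairs sit on the diagonal; pull them together
    # and push mismatched pairs apart, in both directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```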

Analysis of the principle behind Audition's RMS calculation

The decibel (deci-Bel, dB) is a common concept in audio. People often say a sound is so many dB, yet the value is sometimes positive and sometimes negative. A previous article discussed the differences between these kinds of dB: positive dB is what a sound level meter measures, while negative dB is what audio software such as Audition displays. So what exactly is the dB shown by software like Audition, and how is it calculated? This article walks through this simple question.
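
As a preview of the calculation: audio editors typically report RMS level in dBFS, where digital full scale is 0 dB, so real signals come out negative. A minimal sketch, assuming samples normalized to [-1, 1]:

```python
import numpy as np

# RMS level in dBFS, as displayed by editors such as Audition
# (minimal sketch; assumes samples normalized to [-1, 1]).

def rms_dbfs(samples: np.ndarray) -> float:
    rms = np.sqrt(np.mean(np.square(samples, dtype=np.float64)))
    return 20 * np.log10(max(rms, 1e-12))  # floor avoids log10(0)

# Example: a full-scale sine wave has RMS 1/sqrt(2), i.e. about -3 dBFS.
t = np.linspace(0, 1, 48000, endpoint=False)
print(round(rms_dbfs(np.sin(2 * np.pi * 440 * t)), 2))  # ~ -3.01
```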

China's standard takes the lead in breaking through the limits of wireless audio transmission: L2HC, the world's first unified-architecture, full-bit-rate wireless audio codec standard, is released

L2HC, the world's first unified-architecture, full-bit-rate wireless audio codec standard, was officially released today. It supports a maximum transmission bit rate of 1920 Kbps, exceeding Apple's AAC, Sony's LDAC, the Qualcomm-led aptX Lossless, and other standards. According to reports, the Huawei FreeBuds Pro 3 are the first product to support the L2HC intelligent lossless audio codec standard, delivering the world's first 1.5 Mbps lossless sound-quality experience and supporting 64-1920 Kbps, 96 kHz/24-bit audio.
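
A quick back-of-envelope check of those figures (assuming a stereo stream, since the channel count is not stated in the report): raw 96 kHz/24-bit PCM runs at 4608 Kbps, so carrying it losslessly within 1920 Kbps implies roughly 2.4:1 compression.

```python
# Back-of-envelope check of the headline numbers (assumes a stereo
# stream; channel count is not stated in the report).
sample_rate = 96_000  # Hz
bit_depth = 24        # bits per sample
channels = 2
raw_kbps = sample_rate * bit_depth * channels / 1000
print(raw_kbps)         # 4608.0 Kbps of raw PCM
print(raw_kbps / 1920)  # ~2.4:1 lossless compression needed
```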


Exploring technology that combines real-time cloud rendering with live-streaming application scenarios

With the broad development of the Internet and terminal devices, live streaming has become increasingly common in daily life, and more and more people interact with streamers during broadcasts as a form of entertainment. However, frequent stuttering on some platforms and monotonous gift effects greatly diminish the viewing experience. LiveVideoStack invited Jiang Min from Tencent Cloud to introduce how Tencent Cloud applies cloud rendering in live-streaming scenarios to deliver a better experience.

Unity cloud native distributed runtime

The advent of the metaverse era places many demands on real-time 3D engines. As the most widely used real-time 3D content creation engine in the game industry, Unity has proposed the Unity cloud-native distributed runtime to meet these new challenges. LiveVideoStack 2023 Shanghai invited Shu Runxuan, a solution engineer at Unity China, to share practical cases, the problems encountered and how they were solved, and Unity's current thinking on other approaches.

Overview of the AVS Perceptual Lossless Compression Standard - Visually Lossless Quality Level Video Shallow Compression

Shallow compression, also known as mezzanine compression, is a level of video compression that effectively reduces video bandwidth while maintaining overall video quality, with compression ratios typically between 2:1 and 8:1. At these ratios, both 4K and 8K programs can be carried over a 10G interface, greatly reducing the cost of network equipment. LiveVideoStackCon 2023 Shanghai invited Yang Haitao to introduce the practices and explorations of the AVS standards group and hardware manufacturers such as Shanghai HiSilicon in shallow compression of visually lossless quality video.
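
The 10G claim is easy to sanity-check with rough arithmetic, assuming 4:2:2 10-bit sampling at 60 fps (actual program parameters may differ): raw 4K is just under 10 Gbps and raw 8K is near 40 Gbps, so ratios within the stated 2:1 to 8:1 range let both fit a 10G interface.

```python
# Rough bandwidth arithmetic behind the 10G-interface claim, assuming
# 4:2:2 10-bit sampling at 60 fps (actual program parameters may differ).
def raw_gbps(width, height, fps=60, bits_per_pixel=20):  # 4:2:2 10-bit
    return width * height * fps * bits_per_pixel / 1e9

print(raw_gbps(3840, 2160))  # ~9.95 Gbps: 4K needs ~2:1 for comfortable headroom
print(raw_gbps(7680, 4320))  # ~39.8 Gbps: 8K needs ~4:1 or more
```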

Caton Media Xstream: Redefining Live Content Delivery Services

As the public Internet grows more complex, the basic best-effort delivery model can no longer satisfy the growing number of real-time content delivery services that require QoS guarantees. Traditional solutions such as dedicated lines and satellite links suffer from high deployment costs and long deployment cycles and cannot respond quickly to diverse needs. LiveVideoStackCon invited Wei Ling from Caton Technology to introduce the Caton Media Xstream platform solution.


Pan-entertainment’s journey to overseas markets: The road is long and difficult, but technology leads the way

Pan-entertainment's overseas expansion has increasingly become a booming, fast-advancing golden track.

A new era of audio and video: How does AIGC subvert tradition?

Over the past three years, we have witnessed disruptive changes in the way humans live and work. From short videos and interactive live broadcasts to online education and cloud meetings, audio and video technology has not only penetrated into every corner, but has also profoundly affected the way all walks of life operate.

Officially authorized by the NBA, can Via's VR basketball-shooting arcade ignite the offline entertainment market?

VR entertainment and hardware developer Via Technology Global Co., Ltd. (hereinafter "Via") has developed an offline VR basketball-shooting arcade machine. The team hopes to replace the traditional shooting machine with VR technology, letting players enjoy the fun of shooting hoops without a physical basketball.


The "Lecturer Team" has recruited more than half of its members, and treasured lecturers are waiting for you to pick!

If you want to attend an audio and video technology conference, now is the time: tickets for the LiveVideoStackCon 2023 Shenzhen Station conference are on sale at 10% off for a limited time, with further discounts for group registration. Sign up now and see you in Shenzhen.
●Time: November 24-25, 2023
●Location: Shenzhen Sentosa Hotel (Jade Branch)
●How to obtain tickets: Scan the QR code on the poster above, or consult: 13520771810 (same number on WeChat) for details.
●Official link: https://sz2023.livevideostack.com/topics


Click " Read the original text " 

Jump to the official website of LiveVideoStackCon 2023 Shenzhen Station for more information
