Audio and Video Technology Development Weekly | 313

This weekly issue provides an overview of the latest news in audio and video technology.

News contributions: [email protected].

UC Berkeley brain-computer interface breakthrough: songs can be reconstructed from brain waves, good news for people with speech impairments

In another step forward for brain-computer interfaces, songs can now be reconstructed from recordings of brain activity. This is a major breakthrough beyond text decoding.

Video of Tesla's "Optimus" robot goes viral! Powered by an end-to-end AI brain, it takes on difficult yoga poses

The latest video of Tesla's humanoid robot "Optimus" has been released. Backed by end-to-end neural networks, it can accurately sort objects and keep its balance. Many netizens exclaimed that it will change humanity.


Enable formal logic, deactivate four arithmetic operations, MAmmoTH makes LLM a mathematical generalist

A new dataset that combines two methods, chain-of-thought and program-of-thought reasoning, lets the mathematical reasoning of open-source LLMs catch up with closed-source large models such as GPT-4.


Midjourney founder: Pictures are just the first step, AI will completely change learning, creativity and organization

Midjourney is a remarkable company in which 11 people are changing the world and building great products. It is destined to become one of the memorable stories of the early pre-AGI years. Midjourney is currently the hottest image generation engine. Despite fierce competition from OpenAI's DALL·E 2 and the open-source model Stable Diffusion, it still holds a clear lead in generating images across many styles. Zhang Peng, founder of Geek Park, spoke with David Holz, founder of Midjourney.

ChatGPT is far ahead with 1.5 billion monthly active users! 50 companies have been competing for 6 months, and 80% of them started from scratch

A generative AI showdown: 50 companies competed, and ChatGPT came out far ahead, with monthly active users reaching 1.5 billion. A foreign website recently reviewed nearly a year of generative AI data and reached that conclusion.


The most significant ChatGPT update is here: multimodal support is coming online, with voice and image capabilities

ChatGPT has received an important update. Both the GPT-4 and GPT-3.5 models can now analyze and discuss images.


Don’t be afraid of text in images, TextDiffuser provides higher quality text rendering

The field of Text-to-Image has made tremendous progress, especially in the era of AIGC (Artificial Intelligence Generated Content). With the rise of the DALL-E model, more and more Text-to-Image models have emerged in the academic community, such as Imagen, Stable Diffusion, and ControlNet. However, despite this rapid development, existing models still struggle to reliably generate images that contain text.

Topping 13 low-light enhancement benchmarks! Tsinghua University, ETH, and others open-source Retinexformer: detail in both bright and dark regions | ICCV 2023

Researchers from Tsinghua University, University of Würzburg, and ETH Zurich recently published a new paper at ICCV 2023 that formulates a simple but principled one-stage Retinex-based framework (ORF). The resulting Retinexformer architecture comprehensively surpasses state-of-the-art low-light enhancement models and addresses problems such as overexposure, artifacts, and low light in an end-to-end, single-stage manner.

Lidar or visual perception: which will come out on top?

One topic that cannot be avoided in autonomous driving is which is better, lidar or cameras. The question has been debated endlessly in the industry. The two camps hold different views, and each can give many reasons why its approach should be used instead of the other. To understand why the controversy exists, we must first understand the principles behind the two technical routes and the advantages and disadvantages of each.

University of Toronto and others released: Probabilistic object-aware variational SLAM in a semi-static environment

Simultaneous localization and mapping (SLAM) is critical for long-term robotic missions in slowly changing scenes. Failure to detect scene changes can lead to inaccurate maps and ultimately the loss of the robot. Traditional SLAM algorithms assume that the scene is static. Recent research considers dynamic scenes, but requires scene changes to be observed in consecutive frames. Semi-static scenes, where objects appear, disappear, or move slowly over time, are often overlooked, yet they are critical to long-term operations. We propose an object-aware factor graph SLAM framework for tracking and reconstructing semi-static object-level changes. By fusing object-level information, our approach can robustly handle semi-static scenes and maintain accurate maps over long periods of time. Experimental results demonstrate the effectiveness and superiority of our proposed framework in handling slowly changing scenarios. Our work contributes to the advancement of SLAM technology in real-world scenarios with diverse and dynamic environments.

Varjo’s first consumer VR headset announces permanent price cut; leaked Xbox files show XR attracts attention

Recently, dynamic holographic technology service provider Envisics announced the completion of Series C financing, thanks to new investment from M&G Investments and follow-up investment from Van Tuyl Companies.

Hardware competition keeps intensifying, the content field heats up, and the 3D sector is making waves again

Two years ago, many people in the industry saw the 3D content market as a sector with imagined potential in the hundreds of billions. Financing rounds at the 100-million-yuan level were a microcosm of the capital attention that 3D content creation attracted. Now, with the pandemic over and the market in a cooling-off period, the financing environment has changed drastically. In this environment, what many 3D content start-ups lack is not technology but a market. Take 3D reconstruction as an example. According to VR Gyro, the domestic market for 3D reconstruction applications has barely opened up, and most start-ups focus on overseas demand, mainly in North America.

Meta AR patent introduces waveguide configuration to reduce rainbow artifacts

Most users and eyewear manufacturers want AR glasses that are shaped like sunglasses. Although this sounds simple enough, one problem has always troubled researchers: stray light. The more open the AR glasses are, the more light from additional directions and light sources can enter the system. Because of its diffractive structure, the eye-tracking combiner mounted on AR glasses may diffract visible light from the real world, producing rainbow artifacts in the see-through view, especially when users look at bright light sources from specific angles. These artifacts may degrade the image quality of the see-through view.

Microsoft AR/VR patent explores microlens array for wide-range chief ray angle control

Microsoft believes that Micro LED has the characteristics of small size, light weight, high brightness, and high packaging density, and may be particularly suitable for head-mounted displays that require high resolution, small size, and light weight. In a patent application titled "Microlenses providing wide range chief ray angle manipulation for a panel display", Microsoft introduced a microlens that provides wide range chief ray angle manipulation for a panel display, and a display configured with the microlens array system.

Crushing H100, Nvidia’s next-generation GPU is revealed! The first 3nm multi-chip module design, unveiled in 2024

H100 is in short supply, and the next generation of more powerful GPUs is already on the way. According to reports, Nvidia's new generation chip B100 will use TSMC's 3nm process and multi-chip design, and is expected to be launched in 2024.

The biggest challenge of inference chips

Led by Transformers and other large language models (LLMs), software algorithms have made rapid progress, while the processing hardware responsible for executing them has been left behind. Even the most advanced algorithmic processors lack the performance needed to process the latest ChatGPT query within one or two seconds. To make up for this shortfall, leading semiconductor companies build systems out of large numbers of the best available processors.

After 1,568 days of "breaking out", Huawei carves out a "Chinese chip"

The mobile phone chip in Huawei's Mate 60 Pro caused a stir. CCTV singled out Huawei by name, saying that the new Mate 60 series phones use "Chinese chips" and that more than 10,000 of their parts and components are domestically produced.

Research on distribution alignment in multi-type speaker recognition

Multi-type speaker recognition is becoming an increasingly popular application because it better reflects the complexity of the real world. However, a major challenge is the significant variation in the distribution of speaker vectors across different types of data. Although distribution alignment is a common approach to this challenge, previous research has mainly focused on aligning source and target domains, and its performance on multi-type data is unclear. This paper conducts a comprehensive study of mainstream distribution alignment methods in the multi-type setting, where multiple distributions must be aligned. We conduct qualitative and quantitative analyses of various methods. Our experiments on the CN-Celeb dataset show that within-between distribution alignment (WBDA) performs relatively well. However, we also found that none of the investigated methods consistently improved performance across all test cases. This suggests that simply aligning the distribution of speaker vectors may not fully address the challenges posed by multi-type speaker recognition. Further investigation is needed to develop a more comprehensive solution.
https://arxiv.org/pdf/2309.14158v1.pdf

Expressive voice-driven facial animation synthesis with controllable emotion

Highly realistic facial animation generation is in high demand but remains a challenging task. Existing voice-driven facial animation methods can produce satisfactory mouth movements and lip synchronization, but they still fall short in expressive range and emotional control. This paper proposes a new deep learning-based method for generating expressive facial animations from speech that can exhibit a broad spectrum of facial expressions with controllable emotion types and intensities. It introduces an emotion control module that learns the relationship between emotion changes (e.g., type and intensity) and the corresponding facial expression parameters, making emotionally controllable facial animation possible, with the target expression continuously adjustable as needed.

Diagnose voice conditions with call audio

Yuya Hosoda, assistant professor at the Center for IT-Based Education (CITE) at Toyohashi University of Technology, developed a method to estimate the pitch of human vocal fold vibrations from call audio.

Laser-based system enables non-contact medical ultrasound imaging

Researchers at MIT Lincoln Laboratory and their collaborators at the Center for Ultrasound Research and Translation (CURT) at Massachusetts General Hospital (MGH) have developed a new medical imaging device: Noncontact Laser Ultrasound (NCLUS). This laser-based ultrasound system provides images of features inside the body, such as organs, fat, muscles, tendons and blood vessels. The system can also measure bone strength and potentially track disease stages over time.

NSDI '23 | Bolt: Sub-RTT congestion control for ultra-low latency

Data center networks are moving to line rates of 200Gbps and above to meet the performance requirements of applications such as NVMe and distributed ML. As the bandwidth-delay product (BDP) increases, more and more transfers fit within a few BDPs. These transfers are not only more sensitive to congestion but also more challenging for congestion control (CC), because they leave CC little time to make correct decisions. CC is therefore under greater pressure than ever to achieve minimal queuing and high link utilization, with no room for imperfect control decisions. The paper finds that for CC to make fast and accurate decisions, it is crucial to use accurate congestion signals and to minimize control loop delay. The paper addresses these problems by designing Bolt, which attempts to push congestion control to its theoretical limit by leveraging the power of a programmable data plane.
https://www.usenix.org/conference/nsdi23/presentation/arslan
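
To make the bandwidth-delay product argument concrete, the short sketch below computes the BDP for a few line rates and expresses a transfer size in units of BDP. The 10-microsecond round-trip time and the 1 MB transfer size are assumed values chosen for illustration, not figures from the Bolt paper, and the function names are hypothetical.

```python
# Illustrative sketch only: the RTT and transfer size are assumed values,
# not measurements or parameters taken from the Bolt paper.


def bdp_bytes(line_rate_gbps: float, rtt_us: float) -> float:
    """Bandwidth-delay product in bytes for a rate in Gbps and an RTT in microseconds."""
    # Gbps * microseconds = 1e9 bit/s * 1e-6 s = 1e3 bits.
    bits_in_flight = line_rate_gbps * rtt_us * 1000
    return bits_in_flight / 8


def transfer_size_in_bdps(transfer_bytes: float, line_rate_gbps: float, rtt_us: float) -> float:
    """How many BDPs a transfer spans, i.e. roughly how many RTTs it lasts at line rate."""
    return transfer_bytes / bdp_bytes(line_rate_gbps, rtt_us)


if __name__ == "__main__":
    assumed_rtt_us = 10.0           # assumed data center round-trip time
    assumed_transfer = 1_000_000    # assumed 1 MB transfer
    for rate_gbps in (100, 200, 400):
        bdp = bdp_bytes(rate_gbps, assumed_rtt_us)
        span = transfer_size_in_bdps(assumed_transfer, rate_gbps, assumed_rtt_us)
        print(f"{rate_gbps} Gbps: BDP = {bdp:.0f} bytes, 1 MB transfer spans {span:.1f} BDPs")
```

Under these assumptions, doubling the line rate halves the number of round trips a transfer of fixed size lasts, which is why a congestion controller that needs a full RTT to react has fewer chances to correct a wrong decision.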

Real-time audio and video technology in practice: celebrity watch-along live streams

The celebrity watch-along live streaming service that iQIYI launched in recent years lets celebrities interact with audiences up close and in real time around films, TV dramas and variety shows, and it has gradually attracted users' attention. On the technical side, iQIYI worked closely with third-party audio and video service providers to keep costs low while maximizing results. LiveVideoStackCon 2023 Shanghai Station invited Shi Xingdong from iQIYI to share the overall technical architecture of iQIYI's celebrity watch-along live streaming business, along with optimization considerations for drama copyright management, reuse of existing infrastructure, and high-availability guarantees.

"Creating multiple windows" and "decompressing the flat" - the next generation of streaming media is multi-view and panoramic video

With all the major players on the rise, the streaming media industry has been caught in increasingly intense competition. How to grow the user base by better meeting user needs is an important question for major companies. Tiledmedia believes that, with the rise of concepts such as the metaverse, "creating multiple windows" and "decompressing the flat" are the key answers. LiveVideoStackCon 2023 Shanghai Station invited Ma Gaoyang from Tiledmedia to explain what "creating multiple windows" and "decompressing the flat" mean and to present some practical technical examples.

Generative Image Dynamics

This paper proposes a method for modeling an image-space prior on scene dynamics. The prior is learned from a collection of motion trajectories extracted from real video sequences containing natural oscillatory motion. Given an image, the trained model uses a frequency-coordinated diffusion sampling process to predict a long-term motion representation for each pixel in the frequency domain, called a neural stochastic motion texture. Together with an image-based rendering module, these trajectories can be used in many downstream applications, such as converting static images into seamlessly looping dynamic videos, or allowing users to interact with objects in real pictures.

https://generative-dynamics.github.io/static/pdfs/GenerativeImageDynamics.pdf

Conversation with Wan Jiaan's Huang Cuiping: Audio and video technology is entering the fourth stage, and the popularity of IoT is unstoppable

Ten years ago, the Internet of Things entered a stage of rapid development, and breakthroughs in 5G technology gave further momentum to the general trend toward the "Internet of Everything". According to an IoT Analytics forecast, the number of global IoT connections in 2025 will be 121% higher than in 2021, a CAGR of 22% from 2021 to 2025. Today, as the "Internet of People" becomes saturated, more and more attention is turning to IoT technology.
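
The two forecast figures quoted above, 121% total growth and a 22% CAGR over the four years from 2021 to 2025, can be checked against each other with the standard compound-growth formula. The sketch below is only an arithmetic illustration; the function names are hypothetical.

```python
def total_growth(cagr: float, years: int) -> float:
    """Total fractional growth after compounding `cagr` for `years` years."""
    return (1 + cagr) ** years - 1


def cagr_from_total_growth(growth: float, years: int) -> float:
    """Compound annual growth rate implied by a total fractional growth over `years` years."""
    return (1 + growth) ** (1 / years) - 1


if __name__ == "__main__":
    years = 2025 - 2021  # four compounding periods
    print(f"22% CAGR over {years} years -> {total_growth(0.22, years):.1%} total growth")
    print(f"121% total growth over {years} years -> {cagr_from_total_growth(1.21, years):.1%} CAGR")
```

A 22% CAGR compounds to roughly 121.5% over four years, and 121% total growth corresponds to a CAGR of about 21.9%, so the two numbers in the forecast agree after rounding.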

Smart Outbound Calls: Leading the Future of Credit Services

In the telecommunications industry, telephone services are used in all walks of life, and every company that does marketing or customer service needs them. However, as the telecom industry continues to develop, various pain points keep emerging. In response to these problems, Qingdao Dongting Intelligent Technology Co., Ltd. developed the intelligent outbound call robot GO, which provides new solutions for a range of scenarios in the credit field.

More than half of the "Lecturer Team" has been recruited, and outstanding lecturers are waiting for you!

If you want to attend an audio and video technology conference, now is the time: tickets for LiveVideoStackCon 2023 Shenzhen Station are on sale at a 10% discount for a limited time, with further discounts for groups. Register now, and see you in Shenzhen.
●Time: November 24-25, 2023
●Location: Shenzhen Sentosa Hotel (Jade Branch)
●How to obtain tickets: Scan the QR code on the poster above, or consult: 13520771810 (same number on WeChat) for details.
●Official link: https://sz2023.livevideostack.com/topics

