Audio and Video Technology Development Weekly | 305

A weekly roundup of noteworthy developments in the field of audio and video technology.

News contribution: [email protected].


A legend returns to academia: Kaiming He announces he is joining MIT

"As a FAIR Research Scientist, I will join the EECS faculty at the Massachusetts Institute of Technology (MIT) Department of Electrical Engineering and Computer Science in 2024."

Kaiming He, a renowned AI researcher and the creator of ResNet, recently announced on his personal website that he is returning to academia.


Meta's newly open-sourced AudioCraft takes off: generating music automatically from text

On August 3, the social and technology giant Meta (the parent company of Facebook, Instagram, and others) announced the open-source text-to-music framework AudioCraft. AudioCraft comprises three models: MusicGen, AudioGen, and EnCodec. From text alone it can generate background audio such as birdsong, car horns, and footsteps, as well as more complex music, making it suitable for business scenarios such as game development, social networking, and video dubbing.
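For readers who want to try it, here is a minimal text-to-music sketch using the audiocraft Python package. The API names follow the August 2023 release; the checkpoint name, parameters, and helper should be checked against the current AudioCraft README.

```python
# Minimal MusicGen sketch; API names per the August 2023 audiocraft release.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-small')  # smallest checkpoint
model.set_generation_params(duration=8)                     # 8 seconds of audio
wavs = model.generate(['gentle lo-fi beat with birdsong'])  # one prompt -> one waveform
for i, wav in enumerate(wavs):
    # Writes 'clip_0.wav', loudness-normalized.
    audio_write(f'clip_{i}', wav.cpu(), model.sample_rate, strategy='loudness')
```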

From "generative AI" to "productivity", Amazon cloud technology draws the focus

Drawing on customer insight and years of accumulated technology, Amazon Cloud Technology has integrated a large number of AI capabilities into easy-to-use products, aiming to deliver technological progress to every industry in the simplest way possible. At this technology event, it launched seven new generative AI features in one go.

Human-created data is too expensive! Developers quietly turn to AI-generated synthetic data to train models

Developers are now quietly using AI-generated data to train AI models, for a simple reason: human-created data has become too expensive.

In the past, most AI models were trained on human-created data. Now, a growing number of companies, including OpenAI, Microsoft, and startups such as Cohere, are either using this AI-generated "synthetic data" or working out how to use it.

Report: See the Trends by the Numbers, See the Future - Discover New Opportunities in the Content Industry

The coverage of China's content-application ecosystem has grown steadily, with structural shifts showing growth in video formats alongside gains in both scale and stickiness. Consumption of in-depth information and content has increased, which in turn drives brand awareness and conversion at the enterprise level. Meanwhile, AIGC is reshaping productivity: with diverse players entering the content industry and platforms providing deeper enablement, content assets have become one of the core assets of the enterprise, and content operations are now a necessity.


How do you design an AI chip? Practical lessons from Meta

Machine learning (ML) has become ubiquitous in online services. These models have grown substantially in size and complexity in recent years, improving the accuracy and efficacy of predictions. At the same time, however, this growth poses significant challenges for the hardware platforms used to train and serve these models at scale. Total cost of ownership (TCO) is one of the main constraints on bringing models into production in the data center, and power is a significant part of the TCO of these platforms. As a result, performance per TCO dollar (and performance per watt) has become an important metric for all hardware platforms targeting machine learning.
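As a back-of-the-envelope illustration of those two metrics (all numbers below are invented for the example, not Meta's figures):

```python
# Hypothetical accelerator: performance per TCO dollar and per watt.
perf_tflops = 400.0              # sustained training throughput (made up)
capex_usd = 10_000.0             # purchase price (made up)
power_kw, usd_per_kwh = 0.5, 0.10
hours = 3 * 365 * 24             # 3-year depreciation window
tco_usd = capex_usd + power_kw * hours * usd_per_kwh  # ownership = capex + energy
print(f"perf/TCO  = {perf_tflops / tco_usd:.4f} TFLOPS per dollar")
print(f"perf/watt = {perf_tflops / (power_kw * 1000):.2f} TFLOPS per watt")
```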

An MCU market where three giants stand side by side

A microcontroller unit (MCU) is a type of microcomputer chip that integrates a central processing unit, memory, input/output interfaces, timers, and other functions. Since its inception in the 1970s, the MCU has achieved great success across many fields and plays a vital role in today's digital age. Remarkably, these small microcontrollers account for more than 80% of the processor market! With semiconductor companies around the world participating in MCU R&D and production, the market presents a diversified and fiercely competitive landscape.


As autonomous-driving companies shed their dependence on high-precision maps, what is the significance of SLAM algorithms in the driving process?

This article is compiled from several highly upvoted Zhihu answers; we hope it helps readers interested in applications of SLAM algorithms in autonomous driving.

Can real-time semantic RGB-D SLAM be implemented on embedded systems in dynamic environments?

Most existing visual SLAM methods rely heavily on the static-world assumption and easily fail in dynamic environments. This paper presents a real-time semantic RGB-D SLAM system for dynamic environments that can detect both known and unknown moving objects. To reduce computational cost, it performs semantic segmentation only on keyframes to remove known dynamic objects, and maintains a static map for robust camera tracking. It further proposes an efficient geometric module that detects unknown moving objects by clustering the depth image into regions and identifying dynamic regions by their reprojection errors (sketched below).
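The geometric idea can be pictured with a small sketch: under the estimated camera motion, points in static regions reproject close to their matched observations, while points on moving objects do not. The helper below is our simplified illustration of that check; the function name, inputs, and pixel threshold are assumptions, not the paper's code.

```python
import numpy as np

def flag_dynamic_regions(pts3d_prev, pts2d_curr, K, R, t, region_ids, thresh_px=3.0):
    """pts3d_prev: (N, 3) points back-projected from the previous keyframe's depth.
    pts2d_curr: (N, 2) matched pixel observations in the current frame.
    K: (3, 3) intrinsics; R, t: estimated camera motion; region_ids: (N,) depth-cluster label."""
    proj = (K @ (R @ pts3d_prev.T + t.reshape(3, 1))).T  # project assuming a static scene
    proj = proj[:, :2] / proj[:, 2:3]                    # perspective divide
    err = np.linalg.norm(proj - pts2d_curr, axis=1)      # per-point reprojection error (px)
    # A whole depth cluster is flagged dynamic when its mean error is large:
    # static regions fit the camera motion, moving ones systematically do not.
    return {int(r) for r in np.unique(region_ids)
            if err[region_ids == r].mean() > thresh_px}
```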

A whole lot of thoughts on dynamic visual SLAM

Visual SLAM in dynamic environments has always been a focus and a difficulty of research, yet papers on dynamic SLAM have recently grown scarce. I feel the main reason is that the dynamic-SLAM framework has solidified, making major innovation difficult. Existing templates basically use object detection or semantic segmentation networks to eliminate dynamic feature points, then apply geometric-consistency checks for further verification (a minimal sketch of the masking stage follows). The author has also been thinking about breakthroughs recently, and plans to analyze the current mainstream solutions in depth in search of inspiration.
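As referenced above, the template's first stage can be sketched as dropping feature points that land on movable classes; the class list and function below are illustrative assumptions, and real systems then verify the surviving points with geometric consistency.

```python
import numpy as np

# Assumed set of movable semantic classes; real systems tune this list.
MOVABLE = {"person", "car", "bicycle"}

def filter_keypoints(keypoints, seg_mask, id_to_name):
    """keypoints: iterable of (u, v) pixel coords; seg_mask: (H, W) class-id map;
    id_to_name: mapping from class id to class name."""
    keep = []
    for (u, v) in keypoints:
        cls = id_to_name.get(int(seg_mask[int(v), int(u)]), "static")
        if cls not in MOVABLE:
            keep.append((u, v))  # keep only points assumed static
    return keep
```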


F-LIC: Learning-based image compression with FPGA-based fine-grained pipelines

Recently, learned image compression (LIC) has shown remarkable capability in both compression ratio and reconstructed image quality. Using a variational autoencoder framework, LIC can surpass the intra prediction of VVC, the latest traditional coding standard. To speed up encoding, most LIC frameworks use floating-point operations on GPUs. However, if encoding and decoding are performed on different platforms, mismatched floating-point results across hardware will cause decoding errors. An LIC using fixed-point arithmetic is therefore highly desirable.

This paper presents an FPGA design of an 8-bit fixed-point quantized LIC. Unlike existing FPGA accelerators, it proposes a fine-grained pipeline structure to achieve higher DSP efficiency. In addition, cascaded-DSP and zero-skipping unrolling functions are developed to improve hardware performance.
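Why fixed point removes the cross-platform mismatch can be seen in a toy example: integer multiply-and-shift is bit-exact on any hardware, while floating-point results may differ between platforms. The Q2.6 format and bit widths below are arbitrary choices for illustration, not the paper's quantization scheme.

```python
def to_q8(x, frac_bits=6):
    """Quantize a float to signed 8-bit fixed point (Q2.6 here)."""
    q = int(round(x * (1 << frac_bits)))
    return max(-128, min(127, q))  # saturate to the int8 range

def q8_mul(a, b, frac_bits=6):
    """Multiply two Q2.6 values and rescale the product back to Q2.6."""
    return (a * b) >> frac_bits  # deterministic integer shift, identical everywhere

w, x = to_q8(0.37), to_q8(-1.25)
print(q8_mul(w, x) / (1 << 6))  # ~= 0.37 * -1.25 = -0.4625 (prints -0.46875)
```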

CVPR 2023 | B-spline texture coefficient estimation for screen content image super-resolution

With the rapid development of multimedia applications, screen content images (SCIs) appear frequently in daily life. However, resolution mismatches often arise between display devices and SCIs, and SCIs have thin, sharp edges that differ greatly from natural images, while most super-resolution methods target natural images. This paper therefore proposes a super-resolution method for SCIs: a B-spline texture coefficient estimator (BTC) that uses an implicit neural representation (INR) to predict the coefficients, knots, and dilation parameters of B-splines from the low-resolution (LR) image. The coordinates of query points are then projected into the 2D B-spline representation space and fed to an MLP. By exploiting the positivity constraint and compact support of B-spline basis functions, distortion caused by undershoot/overshoot at SCI discontinuities is reduced.
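To make the B-spline representation concrete, here is a minimal 1D sketch of evaluating a cubic B-spline expansion at a continuous query coordinate. In BTC the coefficients, knots, and dilation come from the INR and the evaluation is 2D; everything below is an illustrative simplification, not the paper's code.

```python
import numpy as np

def cubic_bspline(x):
    """Cubic B-spline kernel: non-negative, compactly supported on |x| < 2."""
    ax = np.abs(x)
    return np.where(ax < 1, 2/3 - ax**2 + ax**3 / 2,
           np.where(ax < 2, (2 - ax)**3 / 6, 0.0))

def eval_spline(coeffs, query):
    """Evaluate sum_k c_k * B(query - k) with unit-spaced knots at 0..len(coeffs)-1."""
    knots = np.arange(len(coeffs))
    return float(np.sum(coeffs * cubic_bspline(query - knots)))

# Smooth value between the two inner knots; no over/undershoot beyond the coefficients.
print(eval_spline(np.array([0.0, 1.0, 1.0, 0.0]), 1.5))
```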


Zoom officially supports AV1!

In Zoom's July 28 (local time) update, the enhancements section of the official release notes states that "in order to provide higher quality video without increasing bandwidth usage, Zoom is introducing a new video codec to free accounts." Zoom on Windows, macOS, Linux, Android, and iOS now supports AV1, the "next-generation" codec.

https://support.zoom.us/hc/en-us/articles/17763841523213-Release-notes-for-July-24-2023   

The BILIVVC encoder made its debut in the MSU international video encoder competition and achieved strong results

BILIVVC took third place under the YUV-SSIM metric in both the 1 fps and 5 fps categories, ranking among the best of the many participating encoders.

Built on an H.266 core, the BILIVVC encoder implements most of the coding tools supported by the VVC standard and heavily optimizes them; compared with the reference implementation, each tool runs more efficiently in BILIVVC.

Codec revolution based on AI and NPU - collaborative innovation of VPU and NPU

In this rapidly changing digital-media era, codec technology plays a vital role in video and audio processing. The rise of AI has brought codecs unprecedented opportunities and challenges, while the development and collaborative innovation of VPUs and NPUs enable codecs to better adapt to complex scenarios and demands and reach a higher level of image and sound processing capability.

LiveVideoStackCon 2022 Beijing invited Kong Dehui, Director of Multimedia Technology at Center Microelectronics, to discuss the impact of AI and NPUs on codecs from multiple perspectives, including algorithm optimization, performance, and energy efficiency, offering an in-depth look at the key factors and potential opportunities of the AI- and NPU-based codec revolution and further promoting innovation in the digital-media field.


Streaming Media East 2023 | About VVC

VVC (Versatile Video Coding) is a hybrid video codec built on HEVC. By improving existing techniques and adding a series of tools unavailable in HEVC and earlier codecs, it improves on HEVC by roughly 30% to over 40% in subjective and objective quality evaluations. VVC targets scenarios such as 8K, 360°, and HDR, which is why it is named the "versatile" video codec.

Application of VVC in cloud and browser playback

Versatile Video Coding (VVC) is the latest international video coding standard jointly developed by ITU-T and ISO/IEC. VVC has a broad feature set and can be applied in many fields; compared with its predecessor, High Efficiency Video Coding (HEVC), it can reduce the bit rate by about 50% while maintaining the same subjective video quality. Since standardization was completed in July 2020, many activities have begun to bring VVC into practical use.

This paper shows how to implement a practical streaming workflow using VVC. It demonstrates how the Fraunhofer VVenC encoder can be integrated into Bitmovin's cloud-based encoding solution, details how VVC affects practical decisions such as choosing the best bitrate ladder, and compares cost and performance with other encoders. Finally, it shows how the Fraunhofer VVdeC decoder, combined with WebAssembly, makes real-time VVC playback in the browser possible.

https://dl.acm.org/doi/10.1145/3510450.3517305


Apple's new spatial audio patent | A spatial audio navigation system for wearable-device users

The U.S. Patent and Trademark Office has officially granted Apple a patent for spatial audio navigation, to be used in future AirPods, smart glasses, and a more lightweight Vision Pro. The system plays directional audio through binaural audio devices, giving users navigational cues that help them find their way through malls, other venues, or city parks. It can also provide audio navigation to a vehicle's driver.

Interspeech 2023 | A phoneme-to-word transcoder based on joint speech representation learning for cross-lingual speech recognition

Cross-lingual speech recognition aims to use pronunciation information from high-resource languages to improve speech recognition in low-resource languages. There are more than 7,000 languages in the world, most of which lack sufficient annotated data; cross-lingual speech recognition is an effective answer to this low-resource challenge. Recent studies show that, with unsupervised pre-training, a general-purpose speech representation model can be built by large-scale training on labeled and unlabeled data in available languages and then transferred to the target low-resource language through fine-tuning, achieving remarkable results.

Academic Newsletter | CN-Celeb-AV: a multi-scene audio-visual multimodal dataset released

Recently, the speech and language technology teams of Tsinghua University and Beijing University of Posts and Telecommunications released CN-Celeb-AV, a multi-scene audio-visual multimodal dataset of Chinese celebrities, for researchers in audio-visual person recognition (AVPR). The dataset contains more than 419,000 video clips of 1,136 Chinese celebrities covering 11 different scenarios, and provides two standard evaluation sets, complete and incomplete.

Application and challenges of call noise-reduction algorithms on mobile phones and IoT devices

As electronic products are upgraded, users' requirements for call quality keep rising, and noise-reduction algorithms play a key role in that quality. Growing compute resources allow deep-learning models to run on portable low-power chips, and falling device costs have let IoT devices adopt bone-conduction sensors. How can deep learning be combined with traditional algorithms? How can the bone-conduction sensor be fully exploited? How do objective test results translate into real user experience? These are new challenges for call algorithms in the new era. LiveVideoStackCon 2022 Beijing invited Wang Linzhang to share the application and challenges of call noise-reduction algorithms on mobile phones and IoT devices.


The 15th XR video mode: the 3.5D rectangular video mode

This year (2023), with the release of Apple Vision Pro, video see-through (VST) has a benchmark product. Based on the value of VST itself, I predict that three new fused-reality video modes will appear on the market: the 3.5D rectangular video mode, the see-through 3D panoramic video mode, and the VR/MR virtual-real splicing mode.

Summary of Apple Vision Pro Chinese development tutorials (Part 3)

This article contains seven video tutorials, including "Quick Look for spatial computing," "Take SwiftUI to the next dimension," and "Safari for spatial computing."

Microsoft AR/VR patent describes Micro-LED display devices with improved display and backplane substrates

Thanks to its advantages in resolution, size, efficiency, and burn-in resistance, Micro-LED is becoming an important focus area for AR/VR headset manufacturers. Microsoft has taken note as well, filing a patent titled "Micro-LED display."


Research on the semiconductor process-control equipment industry: localization rate below 5%, large room for domestic substitution

Semiconductor process-control equipment mainly covers front-end inspection for wafer manufacturing and mid-end inspection for advanced packaging. Traditional integrated-circuit processing is divided into front-end and back-end stages. As the integrated-circuit industry has developed, back-end packaging has advanced to wafer-level packaging, from which advanced packaging technology has been derived.


The latest interview with OpenAI's chief scientist: two suggestions for model entrepreneurs, safety and alignment, and is the Transformer good enough?

OpenAI Chief Scientist Ilya Sutskever recently had a short conversation with his friend Sven Strohband. The interview mainly covered the following topics: faith in deep learning, what AGI might look like, whether the Transformer is good enough, startling emergent abilities, safety and alignment, and two suggestions for model entrepreneurs.

"Livestreaming + X" - a new trend in the live-broadcast industry

Humanity's constant pursuit of sensation and experience has driven the rapid development of audio and video technology, and audio-video services are in unprecedented demand across industries. Today, livestreaming is a term familiar to everyone, and its business, ecosystem, and key supporting technologies keep evolving and iterating with great vitality. LiveVideoStackCon 2023 Shanghai invited Huawei Cloud's Lu Zhenyu to share how the live-broadcast industry can make "an old tree grow new shoots."

A dialogue with CloudWalk's Jiang Xun: large models are not a competition between companies or countries, but may be the key to a community with a shared future for mankind

Today, the positioning of the human-machine collaborative operating system has carried over into the era of large models. Jiang Xun said that although the company does not explicitly call it the "top strategy," its importance is very high: "We don't use the word 'most'; in terms of priority, it is indeed a very important, high-priority strategy."

Looking at the development of domestic large models, most companies are still chasing hot trends without substantial progress. Is CloudWalk also chasing the trend? Jiang Xun said no: the company is still building its human-machine collaborative operating system. With GPT technology, the system's "IQ" will keep rising, raising its ceiling while reducing costs, so it can serve customers better and its competitive advantage will grow.


LiveVideoStackCon 2023 Shenzhen has started

The theme of the LiveVideoStackCon 2023 Shenzhen audio and video technology conference is "Immersion · New Vision." After nearly a decade of rapid growth, the multimedia ecosystem is moving toward refinement and optimization, paying more attention to details and costs, while involution and overseas expansion have become pressure outlets. On the one hand, with competition in existing markets and businesses still fierce, enterprises are focusing on cutting costs, pursuing higher profits, and providing users with better services and experiences. On the other hand, as new technologies and scenarios keep emerging, exploring and using them to create more business, product, and commercial value remains a continuing goal for enterprises.

For this Shenzhen edition, we plan to invite dozens of audio and video experts from home and abroad to gather and share their professional insights.


Click " Read the original text " 

Jump to the official website of LiveVideoStackCon 2023 Shenzhen Station for more information
