Audio and Video Technology Development Weekly | 298

A weekly overview of practical developments in the field of audio and video technology.

News contribution: [email protected].


AI art at Beijing's 798 Art Zone: looking ahead to the future of artificial intelligence and the environment

This article puts forward an intriguing hypothesis: through collaboration and practice between artificial intelligence and artists, a narrative around the Earth is generated, opening up the imagination of ecology in the AI era. In this collaboration, how can we reimagine the environment we share? How can we gain a new understanding of our living environment, and even of ourselves? How do we explore the common ground of existence between the two: fungi, geology, the atmosphere, the sky, the ocean? Thinking at the blurred boundary of this continuously evolving shared life, and constantly raising new perspectives and questions, is exactly the new aesthetic form and imaginative space created between Gaia and the cyborg.

AlphaDev breaks through a decade-old algorithm bottleneck and lands in Nature; GPT-4 follows with the same idea in two steps

This Jingwei Venture Capital article notes that AlphaDev, an artificial intelligence project from Google's DeepMind team, recently discovered a new data sorting method that speeds up the sorting algorithm by about 70%, and also found a way to make a hashing algorithm about 30% faster. For the first time in over a decade, the C++ sorting library has changed, and the study was published in Nature.
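For context, the routines AlphaDev optimized are tiny fixed-length sorting kernels inside the C++ standard library. Below is a purely illustrative Python sketch of a 3-element sorting network; it is not AlphaDev's generated assembly, and the function names are made up for the example.

```python
def compare_exchange(a, i, j):
    """Compare-and-swap: afterwards a[i] <= a[j]."""
    if a[j] < a[i]:
        a[i], a[j] = a[j], a[i]

def sort3(a):
    """Sort a 3-element list with a fixed sequence of compare-exchanges,
    the kind of small fixed-length routine AlphaDev's search targeted."""
    compare_exchange(a, 0, 1)
    compare_exchange(a, 1, 2)
    compare_exchange(a, 0, 1)
    return a

print(sort3([3, 1, 2]))  # [1, 2, 3]
```

AlphaDev's contribution was finding shorter instruction sequences for routines of this shape at the assembly level; the sketch only shows the structure being optimized.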

Enthusiastic netizens who couldn't sit still then tried to prove GPT-4's strength by prompting it. Within a day, guided through a dialogue, GPT-4 arrived at almost the same idea as AlphaDev in just two steps, leaving netizens sighing that everyone still underestimates GPT-4.

LeCun's world model is here! Meta releases the first "human-like" model, which completes half-finished images after understanding the world; self-supervised learning lives up to expectations

Meta has released a "human-like" artificial intelligence model, I-JEPA, which can analyze and complete missing parts of images more accurately than existing models.

Even today's most advanced AI systems have been unable to overcome some key limitations. To break through these constraints, Meta's chief AI scientist Yann LeCun has proposed a new architecture.

His vision is to create a machine that can learn an internal model of how the world works, so it can learn more quickly, plan for complex tasks, and respond to new and unfamiliar situations at any time.

The Image Joint Embedding Predictive Architecture (I-JEPA) that Meta launched today is the first AI model based on a key part of LeCun's world-model vision.

I-JEPA learns by creating an internal model of the external world. In the process of completing images, it compares abstract representations of the images, rather than comparing the pixels themselves. I-JEPA has shown strong performance on multiple computer vision tasks and is much more computationally efficient than other widely used CV models.
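As a very rough illustration of the joint-embedding idea (this is not Meta's I-JEPA code; the random-projection "networks", dimensions, and masking split below are stand-ins), the training signal compares predicted and target patch representations rather than pixels:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64

# Stand-in "networks": fixed random projections instead of the ViT encoder
# and transformer predictor used by the real model.
W_enc = rng.standard_normal((256, DIM))
W_pred = rng.standard_normal((DIM, DIM))

def encode(patches):
    """Map flattened pixel patches to abstract representations."""
    return patches @ W_enc

def predict(context_repr, n_targets):
    """Predict one representation per hidden (masked) patch from the visible context."""
    summary = context_repr.mean(axis=0)
    return np.tile(summary @ W_pred, (n_targets, 1))

# A toy "image": 16 flattened patches of 256 pixels each; the last 4 are masked out.
patches = rng.standard_normal((16, 256))
context, targets = patches[:12], patches[12:]

pred_repr = predict(encode(context), n_targets=len(targets))
target_repr = encode(targets)

# The loss lives in representation space, not pixel space.
loss = np.mean((pred_repr - target_repr) ** 2)
print("embedding-space loss:", loss)
```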

Major ChatGPT update! Prices slashed, new API function calling, and context length quadrupled

OpenAI has released a major update to the GPT series of models; the core of the update is the API's new function calling capability.

In this update, OpenAI focuses on function calling: developers no longer need to pick functions manually; they only need to describe the available functions to the model, which decides which one to use. The plugin mechanism works in the same way.

These models have been fine-tuned to detect when a function needs to be called, and can also generate a JSON response that conforms to the function's signature. In other words, function calls enable developers to more reliably retrieve structured data from models.
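A minimal sketch of that flow with the openai Python package of that period (pre-1.0); the weather function and its JSON schema are made-up examples, and the snippet assumes an API key is available in the environment:

```python
import json
import openai  # pre-1.0 openai-python; reads OPENAI_API_KEY from the environment

# Describe the available functions to the model (this function is hypothetical).
functions = [
    {
        "name": "get_current_weather",
        "description": "Get the current weather in a given city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. Shanghai"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    }
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "What's the weather in Shanghai?"}],
    functions=functions,
    function_call="auto",  # let the model decide whether a function is needed
)

message = response["choices"][0]["message"]
if message.get("function_call"):
    # The model returns the function name plus JSON arguments matching the schema.
    args = json.loads(message["function_call"]["arguments"])
    print(message["function_call"]["name"], args)
```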

The Transformer's sixth anniversary: the paper did not even get a NeurIPS Oral, yet its 8 authors have founded several AI unicorns

As this article recounts, some of the Transformer's authors joined OpenAI, some founded startups, and some stayed at Google AI; together they ushered in today's era of AI development.

Nvidia RTX 4060 graphics card launching this month

According to Nvidia's official website, the RTX 4060 will launch on June 29. The RTX 4060, announced last month alongside the RTX 4060 Ti, was originally scheduled to launch in July. The card's Chinese retail price starts at 2,399 yuan.

With the frame generation technology exclusive to the RTX 40 series enabled, the RTX 4060 delivers 1.7 times the performance of the RTX 3060; with frame generation disabled, it delivers 1.2 times the performance of the RTX 3060.

Bilibili's large-scale AI model inference practice

This article describes how Bilibili, through its self-developed InferX inference framework combined with Triton-based model serving, has significantly improved computing-resource efficiency, reduced resource costs, ensured service response time and stability, lowered the cost of developing and deploying AI services, and supported the rollout of various types of business.
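Bilibili's InferX framework is internal, but the Triton side of such a deployment can be exercised with NVIDIA's standard Python client. A hedged sketch, where the server address, model name, and tensor names are placeholders:

```python
import numpy as np
import tritonclient.http as httpclient  # NVIDIA Triton Inference Server HTTP client

# Connect to a running Triton server (address, model and tensor names are assumed).
client = httpclient.InferenceServerClient(url="localhost:8000")

batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("INPUT__0", list(batch.shape), "FP32")
inp.set_data_from_numpy(batch)

result = client.infer(model_name="example_model", inputs=[inp])
scores = result.as_numpy("OUTPUT__0")
print(scores.shape)
```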


Generate music directly from text: Meta's newly open-sourced model MusicGen takes off!

On June 13, Meta (the parent company of Facebook, Instagram, and other platforms) announced that it has open-sourced a new language model, MusicGen, which lets users generate music directly from text.

Besides generating music from text, MusicGen also lets users upload a reference track to make generation more accurate. For example, describe "an upbeat electronic dance track with syncopated drums, airy pads, and intense crescendos", upload a similar song such as "I Can't Stop", and click Generate.

After trying MusicGen out, it proves easy to use and powerful while consuming very few resources. The generated music generally matches the text prompt, with clear sound quality, stable audio, and punchy dynamics; in short, the treble is sweet, the mids are accurate, and the bass is solid. It is well suited to background music in styles such as rock, dance, classical, pop, and retro.
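For readers who want to try it, a sketch along the lines of the audiocraft README; the checkpoint name, prompt, and output file are illustrative, and the exact API may differ between releases:

```python
# pip install audiocraft  (Meta's open-source package that ships MusicGen)
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("melody")   # checkpoint name used at the initial release
model.set_generation_params(duration=10)    # seconds of audio to generate

descriptions = ["an upbeat electronic dance track with syncopated drums and airy pads"]
wav = model.generate(descriptions)          # tensor of shape [batch, channels, samples]

# Write the first clip to disk with loudness normalization.
audio_write("musicgen_demo", wav[0].cpu(), model.sample_rate, strategy="loudness")
```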

ChatGPT as a bond analyst! Fintech giant releases BondGPT, serving a $10 trillion market!

Recently, LTX, a subsidiary of global fintech leader Broadridge (NYSE: BR), announced that it has built BondGPT on top of GPT-4. Aimed at the bond market, BondGPT helps customers answer various bond-related questions and enhances liquidity flows and price discovery in the US$10.3 trillion U.S. corporate bond market. BondGPT has already been put into use.

Reportedly, to improve the accuracy of ChatGPT's output and meet the needs of financial business scenarios, LTX feeds real-time bond data from its Liquidity Cloud into the GPT-4 large language model, helping financial institutions, hedge funds, and others simplify the bond investment process and offering portfolio suggestions.

Example questions include: Which automotive bonds with yields between 5% and 8% mature after 2030? Which telecom bonds have yielded the most over the past 30 days? Which retail companies' bonds have had the highest yields over the past 5 years? I have US$1 million to invest for 5 years; what are my options among high-yield bonds?


A 153-billion-transistor chip unveiled: AMD officially challenges Nvidia

At AMD's launch event, the biggest focus was undoubtedly the company's Instinct MI300 series, because in an AI era dominated by NVIDIA GPUs, everyone hopes this series of chips from AMD will become the strongest competitor to the trillion-dollar chip giant. Judging from the specifications Lisa Su presented, the new MI300-series chips are extremely competitive.

"Artificial intelligence is a decisive technology shaping next-generation computing and AMD's largest strategic growth opportunity." Lisa Su emphasized.


Create high-quality computer vision applications using Superb AI's suite and the NVIDIA TAO toolkit

This post demonstrates how to use Superb AI Suite to prepare high-quality computer vision datasets that are compatible with the TAO Toolkit. It walks through downloading datasets, creating new projects in Suite, uploading data to projects through the Suite SDK, using Superb AI's Auto-Label capability to quickly label datasets, exporting the labeled datasets, and configuring the TAO Toolkit to consume the data.

https://developer.nvidia.com/blog/create-high-quality-computer-vision-applications-with-superb-ai-suite-and-nvidia-tao-toolkit/


Text2NeRF: Text-Driven Neural Radiance Field-Based 3D Scene Generation

This paper presents Text2NeRF, a text-driven 3D scene generation framework that combines NeRF with a pre-trained text-to-image diffusion model. Specifically, the main contributions of the paper are:

A text-driven framework for photorealistic 3D scene generation is proposed, combining diffusion models with NeRF representations to support zero-shot generation of a wide variety of indoor/outdoor scenes from natural language prompts;

The PIU strategy is introduced to progressively generate view-consistent new content for the 3D scene, and a support set is built to provide multi-view constraints for the NeRF model during the view-by-view updates;

A depth loss is adopted for depth-aware NeRF optimization, and a two-stage depth alignment strategy is introduced to remove the bias of estimated depths across different views (a generic form of such an objective is sketched below).
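A minimal sketch of what such a depth-aware objective can look like, under assumed names and an assumed weight; this is a generic illustration, not the paper's exact formulation:

```python
import numpy as np

def depth_aware_loss(pred_rgb, gt_rgb, pred_depth, est_depth, lam=0.1):
    """Photometric term on rendered colors plus a term tying rendered depth
    to (aligned) monocular depth estimates; lam is an assumed weight."""
    color_term = np.mean((pred_rgb - gt_rgb) ** 2)
    depth_term = np.mean((pred_depth - est_depth) ** 2)
    return color_term + lam * depth_term

# Toy batch of 1024 rays with per-ray color and depth.
rng = np.random.default_rng(0)
pred_rgb, gt_rgb = rng.random((1024, 3)), rng.random((1024, 3))
pred_depth, est_depth = rng.random(1024), rng.random(1024)
print(depth_aware_loss(pred_rgb, gt_rgb, pred_depth, est_depth))
```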

Relationship between images and matrices

As this article explains, a digital image is made up of many pixels, just as the body is made up of cells. When we adjust visual elements in software such as Photoshop, we are essentially adjusting pixels: every step ultimately affects all pixels or a specific region of pixels. So adjusting an image does not mean changing some parameter of the image as a whole; it means adjusting the parameters of individual pixels.

If you want to explore the mystery behind images, you will find that video, images, pixels, resolution, fps, and the other elements closely tied to image formation all come down to linear algebra. That's right, the very linear algebra you may be studying right now! This article will help you understand several image-related concepts from that angle.
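A tiny numpy sketch of the idea: a grayscale image is just a matrix, and an "adjustment" is an operation applied to every pixel value or to a sub-matrix (the pixel values below are arbitrary):

```python
import numpy as np

# A grayscale image is a matrix: one intensity value per pixel.
image = np.array([[ 10,  50,  90],
                  [ 30, 120, 200],
                  [ 60, 180, 250]], dtype=np.uint8)

# Adjusting the whole image = applying an operation to every pixel value.
brighter = np.clip(image.astype(np.int16) + 40, 0, 255).astype(np.uint8)

# Adjusting a region = operating on a sub-matrix (here the top-left 2x2 block).
region = image.copy()
region[:2, :2] = np.clip(region[:2, :2].astype(np.int16) + 40, 0, 255)

print(brighter)
print(region)
```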


What is the relationship between video encoding format and encapsulation format? What are the common encoding formats for cameras?

After reading this article, you will be able to answer two major questions: 1. What is the relationship between the video encoding format and the encapsulation (container) format? 2. What are the common encoding formats in the camera field?
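A quick way to feel the difference is with ffmpeg: changing only the encapsulation (container) is a stream copy, while changing the encoding requires a re-encode. A sketch via Python's subprocess; the file names are placeholders and ffmpeg is assumed to be installed:

```python
import subprocess

# 1) Change only the encapsulation format: MP4 -> MKV, streams copied bit-for-bit.
#    The video stays H.264 and the audio stays AAC; only the container changes.
subprocess.run(["ffmpeg", "-i", "input.mp4", "-c", "copy", "output.mkv"], check=True)

# 2) Change the encoding format: re-encode the video stream to H.265/HEVC.
#    This is a different (and far slower) operation than the remux above.
subprocess.run(
    ["ffmpeg", "-i", "input.mp4", "-c:v", "libx265", "-c:a", "copy", "output_hevc.mp4"],
    check=True,
)
```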

Visual Captioning: Enhancing Videoconferencing with Dynamic Visuals Using Large Language Models

Recent advances in video conferencing have dramatically improved remote video communications with features such as live captioning and noise cancellation. However, in various situations, dynamic visual augmentation helps to better convey complex and nuanced information. For example, when discussing what to order at a Japanese restaurant, your friends can share visuals that help you order sukiyaki with more confidence. Or when talking about your recent family trip to San Francisco, you might want to show a photo from your personal photo album.

In "Visual Captioning: Verbal Communication Through Immediate Visual Augmentation," presented at ACM CHI 2023, a system for synchronizing video communication with real-time visual augmentation using linguistic cues is introduced. A large language model is fine-tuned to proactively suggest relevant visuals in open-vocabulary conversations using a dataset curated for this purpose. Open-sourced Visual Captions as part of the ARChat project, which aims to quickly prototype enhanced communications through real-time transcription.

https://ai.googleblog.com/2023/06/visual-captions-using-large-language.html

The new Mac Studio and Mac Pro can connect up to eight external 4K displays

In a new support document, Apple detailed the external display support of the new Mac Studio and Mac Pro: with the M2 Ultra, both Macs can connect up to eight external 4K displays at 60Hz.

The new Mac Studio has one HDMI 2.1 port and the new Mac Pro has two; these ports can drive an 8K display at 60Hz or a 4K display at 240Hz. The M2 Ultra chip supports up to six external Pro Display XDRs.

Mac Studio with the M1 Ultra chip can connect up to 5 external displays.


Lyra, a machine learning-based speech codec

Lyra is a machine-learning-based speech codec that improves performance by introducing prediction variance regularization to reduce sensitivity to outliers. It uses an autoregressive WaveNet-style model for the generation process and further improves quality through input noise suppression. Experiments show that Lyra's quality matches or exceeds that of traditional codecs running at twice the bitrate, making it suitable for low-bitrate video calls and consumer devices.

Real-Time Audio at Meta Scale

This article presents approaches to tackling the toughest audio challenges at Meta scale and digs into audio reliability to make sure it actually works, before closing with a look ahead at one of the most exciting areas in RTC: large group calls in the Metaverse.

Before tackling big immersive calls, the fundamentals have to be right. Excessive delay reduces interactivity and forces participants to repeatedly confirm what was said, which does not feel like a natural conversation. Many calls are made over low-bandwidth connections, and even the best WiFi networks can get congested, so robustness to packet loss is another important factor. To avoid background noise and echoes of one's own voice, full-duplex, high-quality acoustic echo cancellation and non-stationary noise suppression are required. Delivering full-band stereo audio brings users one step closer to a high-quality experience. The next step is to enable immersive audio experiences such as spatial audio, which is the key to creating the magic of immersion.

Audio Format--MP3 Format Introduction

This article introduces a variety of audio/video file and encoding formats, including but not limited to MP4, AVI, MKV, H.264, AAC, and MP3. A deeper understanding of these common formats helps readers see how they are used in video transmission and storage and handle real application scenarios and problems. The same knowledge also helps readers grasp the basics of audio and video development and improve audio and video quality for users.

Application of deep learning in sound source localization

This paper points out that, in general, sound source localization (SSL) is reduced to estimating the direction of arrival (DoA) of the source, i.e. it focuses on estimating azimuth and elevation angles rather than the distance to the microphone array. SSL has many practical applications, such as sound source separation, automatic speech recognition (ASR), speech enhancement, and room acoustic analysis.
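As a point of reference for the classical signal-processing side (not one of the deep-learning methods the survey covers), here is a minimal GCC-PHAT sketch that estimates the time difference of arrival between two microphones and converts it to an azimuth; the microphone spacing, sample rate, and toy delay are assumed values:

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref` with GCC-PHAT."""
    n = len(sig) + len(ref)
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)   # phase transform weighting
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

# Toy setup: two mics 8 cm apart, 16 kHz audio, speed of sound 343 m/s (all assumed).
fs, d, c = 16000, 0.08, 343.0
rng = np.random.default_rng(0)
x = rng.standard_normal(fs)          # 1 second of a broadband source
mic1, mic2 = x, np.roll(x, 2)        # mic2 receives the signal 2 samples later

tau = gcc_phat(mic2, mic1, fs, max_tau=d / c)
azimuth = np.degrees(np.arcsin(np.clip(tau * c / d, -1.0, 1.0)))
print(f"estimated delay {tau * 1e6:.1f} us, azimuth {azimuth:.1f} deg")
```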


WebRTC support has been merged into OBS (discuss-webrtc)

https://groups.google.com/g/discuss-webrtc/c/tNPuUiT2bTs/m/bLth7DlsAAAJ


Parallel Cloud—Opening the Passage to the Metaverse

This article argues that the metaverse is a virtual world parallel to the real one and a new generation of the Internet, and that XR, with its truly three-dimensional, interactive, and immersive qualities, is the ultimate digital media form for building it. To break the tight coupling between XR terminal devices and XR content and enable online access on any platform and any terminal, Cloud XR is the necessary path for opening the channel to the Metaverse. Parallel Cloud is an internationally leading advocate and technology pioneer of the Cloud XR concept, committed to providing industry partners and developers with low-code, out-of-the-box, efficiently deployable Cloud XR PaaS platform products. It has attracted nearly a thousand enterprise users and tens of thousands of independent developers worldwide, and its products and solutions are in mature use across education and training, digital twins, medical rehabilitation, virtual livestreaming, digital humans, cloud events, cloud gaming, and other scenarios.

How the IoT is changing the game for the Sustainability Metaverse

One perspective in this article is that virtual universes, arguably the most profound fruit of digital transformation available today, rely on data in a very fundamental way. Of all the technological advances of the past 30 years, it is the data-centric technologies that made the Metaverse possible, and despite the breadth and depth that implies, the Internet of Things (IoT) turns out to be not only the most disruptive of those advances but also the most critical one for enabling the Metaverse.

https://techcommunity.microsoft.com/t5/green-tech-blog/how-iot-is-a-game-changer-for-the-sustainability-metaverse/ba-p/3291430


AI Chip Industry Special Report: Entrepreneurial Fission of Domestic AI Chips

The report argues that unlocking the value of computing power promotes a country's overall economic development: for every 1-point increase in the computing power index, the digital economy and GDP grow by 3.5‰ and 1.8‰ respectively, so the higher a country's computing power index, the stronger the economic stimulus. In industry, AI applications have generated substantial demand, and the most direct track is the digital transformation of enterprises. According to IDC, worldwide enterprise technology investment in the artificial intelligence (AI) market grew from US$61.24 billion in 2019 to US$92.40 billion in 2021, is expected to rise 26.6% year over year to US$117.00 billion in 2022, and is expected to exceed US$200 billion by 2025, a growth rate higher than that of overall enterprise digital transformation (DX) spending.


The State of Media Technology Funding in 2023

The author's point of view is that the formula for corporate funding and investing often goes like this: Identify a problem and solve it with a profitable solution, a great team, and great growth potential. But in the media tech space, most companies are fighting something else: moving customers to a SaaS model and winning over media companies that want to build everything in-house. Given these inherent challenges, where does investing look like a good idea for companies in the industry?

While the billions of dollars poured into content development over the past few years have gotten attention, behind the scenes a range of video technology vendors and engineers are building the infrastructure to deliver that content. This infrastructure is used not only by old-school media companies, but also by disruptors like Netflix and next-gen media companies.

According to S&P Global Market Intelligence, venture capital (VC) financing in technology, media and telecom increased from 41% of total funding round value in 2019 to 45% in 2022. The entertainment industry is considered more resilient than others during economic downturns. So, what does it take to get funded? Business fundamentals, growth rate, technology, and a sound business model.

https://www.streamingmedia.com/Articles/Editorial/Featured-Articles/The-State-of-Media-Technology-Financing-2023-158121.aspx


LiveVideoStackCon 2023 Shanghai has entered its full-price registration period


Click "Read the original text" to check out more exciting topics from LiveVideoStackCon 2023 Shanghai.


Source: blog.csdn.net/vn9PLgZvnPs1522s82g/article/details/131278146