A Dialogue with DingTalk Audio and Video Expert Feng Jinwei: Large Models Are Not Everything


Curated by: LiveVideoStack

In the audio and video field, ICASSP is a bellwether conference for the industry and a feast for practitioners in speech who want to study next-generation technologies. Recently, major companies at home and abroad have announced their accepted papers one after another, and two papers from the DingTalk Hummingbird Audio Lab were accepted to ICASSP 2023.

Among them, the lab proposed new research on using a single AI model to simultaneously eliminate three kinds of interfering sounds (echo, noise and reverberation), which saves computation and bandwidth, lowers delay, and produces better sound quality. We were very interested in what problems this research solves, what results it achieves, and in which scenarios it can be used.

In addition, with the advent of AIGC technology, every industry faces a revolution in human-computer interaction. Another focus of our attention is the impact of large models on the audio and video field. With these two questions in mind, we recently spoke with Dr. Feng Jinwei, head of the DingTalk Hummingbird Audio Lab and an expert in the domestic audio and video field.

During the conversation with Feng Jinwei, we learned that his team has not only applied its self-developed AI model to noise reduction, echo cancellation and de-reverberation, but has also shipped it in DingTalk conference software, Rooms, and the DingTalk conference all-in-one machines F1/F2. In the course of opening up its technology, the team also created an original microphone array technology to solve the problem of participants being hard to hear when they sit too far away in offline meetings, which has attracted wide attention in the industry.

Speaking of AIGC and the coming technological revolution, Feng Jinwei said its impact on the industry is not yet that great. First, large models temporarily lack an effective solution for real-time audio and video; second, unlike AIGC's inference and generation, at the level of acoustics and underlying algorithms the team pays more attention to using AI to "restore reality": restoring the captured audio and video as closely as possible to the sense of presence of an offline meeting, for example by solving the three major problems of audio processing, namely echo cancellation, noise reduction and de-reverberation.

In Feng Jinwei's view, the Hummingbird Lab's positioning is to combine large-model work with application scenarios, such as intelligent meeting summaries, while remaining a team strong in engineering that can carry technology all the way from acoustic principles and signal processing through to complete software and hardware products. These techniques are related to AI, but not to large models. "The development and maturity of AI technology represented by deep learning provides a new direction for breakthroughs in key audio and video technologies. For problems that traditional techniques cannot solve, integrating AI can reduce the difficulty, AI noise reduction being one example," Feng Jinwei said. This is also the direction DingTalk conferencing is exploring in its underlying technology.

This article is compiled from the dialogue between LiveVideoStack and Feng Jinwei; the following has been edited and condensed.

1. What new surprises will AI bring to the audio and video industry?

  1. LiveVideoStack: What do you make of currently booming concepts and technologies such as ChatGPT, large models, and AIGC?

Feng Jinwei: First of all, we must affirm the value AIGC brings. Unlike the passing fad of the Metaverse, AIGC can bring real value to many industries, for example helping copywriters work more efficiently. AI now shows preliminary artificial general intelligence (AGI), which is qualitatively different from before.

Although humanity currently has only preliminary general intelligence, technology does not develop linearly but in leaps. People often talk about miracle years and singularities; Einstein's 1905 was such a miracle year. From a God's-eye view, the past two years may turn out to be in the middle of a technological singularity.

Returning to the audio and video industry, I don't think AIGC has had that big an impact on it yet.

First, there is certainly some impact. DingTalk is exploring application scenarios such as meeting summaries. Once a large model is embedded in audio and video, the most direct change is the extraction and summarization of meeting content.

Second, large models temporarily lack an effective solution for the real-time requirements of audio and video. The acceptable delay for audio and video applications is tens of milliseconds, at most 200 milliseconds, which is very demanding. Audio algorithms process frame by frame (for example, with a frame length of 10 milliseconds), that is, streaming processing: each frame of data must be processed within 10 milliseconds and handed to the next module in the audio and video pipeline before the following 10 milliseconds of data are processed, over and over again. Current AIGC clearly does not have such streaming processing capability.
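To make the real-time constraint concrete, below is a minimal sketch of frame-by-frame streaming processing, assuming a 10 millisecond frame at a 16 kHz sample rate; the names and the pass-through `process_frame` are illustrative placeholders, not DingTalk's actual pipeline.

```python
import numpy as np

SAMPLE_RATE = 16_000                          # samples per second (assumed)
FRAME_MS = 10                                 # frame length in milliseconds
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000    # 160 samples per frame

def process_frame(frame: np.ndarray) -> np.ndarray:
    """Placeholder for one 3A stage; a real stage must return within 10 ms."""
    return frame  # pass-through for illustration

def stream(audio: np.ndarray):
    """Cut audio into consecutive 10 ms frames and process them one by one."""
    for start in range(0, len(audio) - FRAME_LEN + 1, FRAME_LEN):
        yield process_frame(audio[start:start + FRAME_LEN])

# One second of audio is handled as 100 back-to-back frames; each frame must
# be finished before the next one arrives, which is the constraint current
# large models cannot yet meet.
out = np.concatenate(list(stream(np.zeros(SAMPLE_RATE))))
```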

Third, at the level of acoustics and underlying algorithms, large models are currently good at retrieval, reasoning and generation, whereas audio and video applications pay more attention to using AI to "restore reality", giving online meetings the same "presence" as offline ones. For example, we use an AI model to solve the three major problems of audio processing (echo cancellation, noise reduction and de-reverberation), that is, an intelligent 3A algorithm.

We will keep watching; perhaps at some point the technological singularity will bring unexpected applications.

  2. LiveVideoStack: You just mentioned the 3A algorithm, and one of the papers accepted by this top conference is on the same topic. Based on your research, what is the biggest difference between traditional algorithms and AI algorithms? Has this technology made it into your products?

Feng Jinwei: One of our papers this time is "Deep Narrowband Network for Joint Elimination of Echo, Noise and Reverberation in Real-time Full-band Voice Communication". Behind it is "one model, multi-task" research. The study verifies that AI can handle these three kinds of interfering sounds simultaneously in one model, which is consistent with the consensus in the AI field that multi-task learning can learn general representations and improve generalization.

Most previous techniques handle the three kinds of interference, namely echo, noise and reverberation, separately. Separate modules easily reduce robustness while their computation and algorithmic delay accumulate, and they make global optimization of the audio link impossible.
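As a hedged illustration of this architectural point, the sketch below contrasts a cascaded 3A chain with a single joint model. `aec`, `anr`, `derev` and `joint_model` are hypothetical callables, not DingTalk APIs; the point is only that a cascade accumulates per-stage delay and compute and propagates early errors, while a joint model processes each frame in one pass.

```python
def cascaded_3a(frame, aec, anr, derev):
    # Each real stage adds its own algorithmic delay and compute, and a
    # mistake in an early stage propagates to the later ones.
    return derev(anr(aec(frame)))

def joint_3a(frame, joint_model):
    # One "one model, multi-task" pass, so quality, delay and compute can
    # be optimized globally rather than per module.
    return joint_model(frame)

# With identity stages the two agree; real stages differ in cost and error.
ident = lambda x: x
assert cascaded_3a([0.0], ident, ident, ident) == joint_3a([0.0], ident)
```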

Our experiments show that on three public test sets, compared with state-of-the-art models dedicated to the subtasks, our model improves performance by 57% in the far-end single-talk and double-talk scenarios, and voice quality after denoising and de-reverberation improves by 5% and 8% respectively. Some of the research results have already been applied in our products.

I think the difference between traditional and AI algorithms lies in the path of data modeling. One is relatively simple modeling based on mathematical analytical expressions, such as a Gaussian distribution; in well-suited scenarios such as stationary noise, traditional algorithms still perform acceptably. The other is data-driven modeling: the powerful modeling capacity of deep learning lets AI algorithms handle tasks in more complex scenarios, especially when training data is abundant. This is why current AI algorithms achieve a qualitative improvement, for example in removing non-stationary noise or cancelling echo under delay jitter. Traditional methods generally require little computation and are well explainable, so I see the two approaches as complementary.
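As a concrete example of the "analytical expression" path, here is a minimal spectral-subtraction denoiser, a classical textbook method that assumes roughly stationary noise and a noise-only lead-in; it is not DingTalk's algorithm, and windowing and overlap-add are omitted for brevity.

```python
import numpy as np

def spectral_subtraction(x, frame_len=320, noise_frames=10):
    """Denoise a mono signal by subtracting an average noise magnitude spectrum."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    spec = np.fft.rfft(frames, axis=1)
    noise_mag = np.abs(spec[:noise_frames]).mean(axis=0)  # stationary-noise estimate
    mag = np.maximum(np.abs(spec) - noise_mag, 0.0)       # subtract, floor at zero
    clean = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame_len, axis=1)
    return clean.ravel()

# Toy usage: 10 noise-only frames followed by a tone buried in white noise.
rng = np.random.default_rng(0)
noise_only = 0.3 * rng.standard_normal(3200)
t = np.arange(16_000) / 16_000
noisy = np.concatenate([noise_only,
                        np.sin(2 * np.pi * 440 * t) + 0.3 * rng.standard_normal(t.size)])
denoised = spectral_subtraction(noisy)
```

Such a method works acceptably on stationary noise but breaks down on keyboard clicks or a colleague's speech, which is exactly where data-driven models pull ahead.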

We are currently extending this technology, for example eliminating the background sound of colleagues talking in the workplace, one of the pain points of today's conference software, and putting all the algorithms into the same framework to save computation, reduce latency, and so on.

  3. LiveVideoStack: So how do you see the relationship between these two kinds of AI, and what is your next investment plan?

Feng Jinwei: I don't think these two kinds of AI are mutually exclusive. One is intelligent exploration at the application layer; the other supports the underlying technology in professional scenarios.

At present, the AI application in audio and video that I consider most important, and one we have shipped this year, is the meeting summary. The DingTalk slash "/" invitation test at the end of May also includes this capability. It can generate a verbatim transcript and automatically produce summaries and to-dos by chapter and topic. For a long meeting of two or three hours, you can read the smart minutes in three minutes.

These AI scene capabilities, such as transcription, summarization, and the audio 3A technologies (de-reverberation, noise reduction, echo cancellation), are not mutually exclusive. 3A technology gives strong underlying support to the scenarios above: the cleaner the sound, the more accurate the content recognition. So even with AIGC here, these underlying technologies still need continuous optimization and sustained investment.

In addition, AI has many potential applications in audio, such as no-reference sound quality assessment, personalized speech enhancement (Personalized SE), NetEQ, LPC, audio super-resolution, and more. AI can also solve problems that traditional methods cannot, for example the echo produced when network delay fluctuates or the device moves. We hope this series of work truly helps users communicate without barriers, which in our view is where technology is most valuable.

2. A distinctive technical route and an open strategy

  1. LiveVideoStack: Your technical route sounds different from other companies'. How do you think about the role of technology in a commercial company, and do you have any examples from the past two years?

Feng Jinwei: The work of the Hummingbird Audio Lab is product-oriented. New technology R&D aims at landing in products, and new projects are established to solve user problems.

As both a commercial company and a technology company, our products must be competitive before commercialization has a basis, so most of our working time is focused on products. The other part of our work is to polish technology in depth and develop technologies that are "half a step ahead of the market", rather than doing pure basic theoretical research, which is the positioning of university laboratories or government research institutes.

There has been a lot of practical progress since the lab was founded. For example, we have introduced AI across the entire audio chain, so the AI model is used not only for noise reduction, echo cancellation and de-reverberation, but also for packet loss concealment, audio super-resolution and codecs.

At present, DingTalk's self-developed AI noise reduction algorithm has shipped and will be rolled out across the various product forms of DingTalk conferencing. This makes it the industry's first conference platform to implement full-band voice AI noise reduction. So far, among conferencing software at home and abroad, only DingTalk and Google Meet use full-band voice communication, and Google Meet has not yet done AI noise reduction.

The technical characteristics of DingTalk's self-developed noise reduction algorithm are heavy noise reduction with little computation, yet high voice fidelity. Damaging speech while reducing noise is a problem in almost all AI noise reduction technologies on the market today, so we designed the algorithm to protect the speech components as much as possible.

In addition, we developed an innovative meeting transfer technology to make meeting in a conference room more convenient. Imagine joining a meeting on your phone; when you arrive at the meeting room, you do not need to enter a lengthy meeting code on the room equipment. Just tap a button on your phone and the meeting transfers automatically to the room equipment.

Third, while polishing the product experience, our team also proposed an original microphone array technology. After we published a series of papers at INTERSPEECH and ICASSP, many papers followed our research. This technology has shipped in our hardware conference all-in-one machine F2 and is also open to ecosystem partners. After rigorous testing, an internationally well-known brand decided to cooperate with us, and their products will reach the market soon.
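The array design itself is described only in the team's papers; as a point of reference, the classical starting point for far-field pickup is the delay-and-sum beamformer sketched below, which aligns each microphone's signal toward the talker and averages, reinforcing on-axis speech and attenuating off-axis noise. The geometry and names here are illustrative assumptions, not DingTalk's design.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # meters per second in air

def delay_and_sum(mics, positions, direction, fs):
    """Classic delay-and-sum beamformer (whole-sample delays, toy version).

    mics:      (n_mics, n_samples) array of synchronized recordings
    positions: (n_mics, 3) microphone coordinates in meters
    direction: unit vector pointing from the array toward the talker
    fs:        sample rate in Hz
    """
    out = np.zeros(mics.shape[1])
    for sig, pos in zip(mics, positions):
        # A mic nearer the talker hears the wavefront earlier; delaying its
        # signal by (pos . direction) / c re-aligns it with the array center.
        delay = float(pos @ direction) / SPEED_OF_SOUND
        out += np.roll(sig, int(round(delay * fs)))  # wrap-around ignored in this toy
    return out / len(mics)
```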

  2. LiveVideoStack: In the process of using technology to support products, for example shipping AI noise reduction this time, do you have any painful memories?

Feng Jinwei: Yes, in fact, there are both successful experiences and painful memories.

As a successful experience: our all-in-one video conferencing machine F1 went from zero to full market launch within 6 months, and its market share is now about one third, which is a great success. This was the result of seamless cooperation among the technology, product and business teams.

We all know that research does not guarantee results; it carries great uncertainty. The AI noise reduction work actually had some twists and turns. In the early stage the effect was not obvious, and everyone wondered whether the direction was wrong. The team did not give up, kept improving the data and the network architecture, and finally got a satisfying result. We compared it with competing products at home and abroad, and the noise reduction effect places us in the industry's first tier.

Of course, there are also regrets. Sometimes, after a period of research effort, we achieve results, but for various reasons they are never productized. That makes us feel sorry, because we hope our technology can benefit more users.

  3. LiveVideoStack: After these technologies landed successfully, which industry partners have they been opened to, and how have they evaluated them? Is there anything that stands out in your memory?

Feng Jinwei: They are open to many ecosystem partners; for example, Logitech, Intel and Lenovo are using our algorithms and modules.

In particular, last August we opened a complete set of algorithms and engineering solutions to Insta360, and they were very satisfied with our long-distance sound pickup, intelligent noise reduction, and sound source localization technologies. We hope that by opening up algorithm capabilities and technical modules, more partners in the industry chain can quickly reuse them to upgrade their devices intelligently.

By the way, we also provide ecosystem partners with a complete set of services, including on-site support and the industry's advanced certification evaluations, to ensure our partners' products perform as designed. This is also what sets our ecosystem cooperation apart.

After learning about our technology and service model, some ecosystem partners decisively chose to cooperate with DingTalk, and some customers came because of our reputation and trust us unconditionally on technology. These cases left a deep impression on me.

  4. LiveVideoStack: A final question: no matter how powerful the technology, you cannot build a car behind closed doors. As a technologist, how do you see the relationship among technology, products and business?

Feng Jinwei: In my view, technology is a necessary condition for business success, but not a sufficient one. History offers many cases.

First, a technology company's technology must be advanced to win the market, because a feature of many technology industries is that the winner takes all. The chip industry is a good example: there are only one or two leaders, and the technology evolves by the day. So those of us doing technology R&D often feel a sense of crisis.

Second, technology also faces the problem of focus, because technical resources are always limited and demands can never all be met. From my point of view, focusing on the technical product experience matters most. But this focus does not mean technologists building a car behind closed doors; it must be combined with the strategy of DingTalk and the DingTalk audio and video business unit and with customers' real needs and pain points, understanding which threads are the main lines and which touch the essence. Those are the things to pursue in depth.

Finally, Feng Jinwei also shared with us "A Brief History of Semiconductors", which he had read recently. Perhaps inspired by the history of technological development in the book, he described to us his envisioned blueprint for the audio and video industry.

