How to "Play" Online Karaoke: An Interview with Cheng Le, Audio and Video Architect at Tear Song

Editor's note: The online karaoke business has been developing for ten years, and Cheng Le has worked in audio and video for just as long, or even longer. Why choose online karaoke? How do you get through the "long season" and arrive at new scenery? How do you carve out a share of a market as tough to crack as online karaoke? Behind this series of questions there is one simple answer: interest. The following is Cheng Le's account.

01  Interest determines everything

Around middle school, I developed a strong interest in audio and video. Radios, tape players, CD, VCD, and DVD players, MP3 players, and then the MP4 players, tablets, and cameras that became popular when I was in college: I bought nearly all of them with money I had saved up, and I often clashed with my parents over it.

[Image: Cheng Le]

In college I had plenty of free time. It was just before smartphones took off, and MP4 players were popular. In those years I kept selling my old device to buy a new one, always moving to the latest hardware: from players that supported only 480p Xvid video, to 720p RMVB, to 1080p H.264, and even to 4K at the very end of the product category's life (still before 2010). I was also very active on the iMP3 forum (now closed), discussing the decoding performance of new chip solutions and the pros and cons of various encoding formats.

[Image: In November 2018, iMP3 officially announced its shutdown]

I also took part in manufacturers' review promotions: write a review, and you could buy the device at half price. Under the forum's influence I got into headphones as a hobby as well. Which headphones deliver deep bass, a good mid-range, and sweet highs, how much better lossless APE and FLAC sound than WMA and MP3, and so on: I dabbled in all of it.

My second job after graduation was building TV boxes. Although the final result was not very good, I had very few worries back then, and life was happy and simple. What I thought about every day was how to improve the compatibility of local playback, how to handle container formats such as MP4, FLV, MKV, and TS, how to implement Blu-ray navigation, how to parse and render ASS and PGS subtitles, how to adapt to each hardware decoder, how to improve the stability of network playback, and so on.

It felt like being a hardcore gamer who gets to develop a new game exactly the way he wants, which made for a very satisfying work experience.

Later, when mobile live streaming and short video began to take off, I used the audio and video experience I had accumulated to build mobile live-streaming and short-video SDKs; many customers are probably still using them today. After that, I moved on to Tear Song's real-time voice-chat karaoke scenario, which also meant moving from the vendor side (Party B) to the client side (Party A).

In short, I have worked continuously since graduation, and I have been lucky to stay in a field that interests me. Looking back, interest is both necessary and effective for getting over the entry threshold of audio and video, so I hope that students who want to enter this industry can develop an interest in it.

02  Fighting our way through the karaoke track

Next, let's talk about the business. I joined Framefun in 2019 and took over the company's audio and video technology and the overall karaoke experience. Our main focus is on the client side. Most server-side audio and video capabilities come from third-party services, so we build less of that ourselves.

So, compared with the major players, where does our technical advantage lie?

First of all, we have a highly effective business team that can quickly try out all kinds of imaginative new gameplay through trial and error, and that often brings users something new. Tear Song has also explored karaoke gameplay extensively. For example, the original two-player pick-up singing mode and the later multi-player pick-up singing mode both aim to make socializing easier.

Tear Song's various gameplay modes largely match what young people are interested in

Anyone familiar with Tear Song can see that it is built on real-time karaoke gameplay. Karaoke is a social icebreaker: people who share a love of singing find it easier to start conversations and build social relationships, and once those relationships exist, users become stickier.

The second is the karaoke experience itself. Compared with other voice-chat apps that focus mainly on the business side, we have a dedicated audio and video team that can meet many needs in-house, so we are not helpless without third parties. In addition, we have gradually built a set of subjective and objective evaluation mechanisms that lets us push third parties to optimize where it matters, and then flexibly pick the best supplier for each capability.

Over the past few years, RTC vendors invested little in pan-entertainment karaoke scenarios. Our capture and rendering solution has advantages in low-latency in-ear monitoring, in-ear monitoring compatibility, and vocal-accompaniment alignment. We have also done our own optimization of AEC and singing scoring, but as each vendor puts in more resources, the gap will narrow. Third-party AEC, for instance, has generally improved a great deal in the past two years. In such cases we make the final call based on the overall subjective and objective evaluation results.

Hitting pitfalls and growing, one step at a time

The main technical difficulty is the full-stack demand that a small team faces. My own background is in audio and video engineering, but a large part of the singing experience also rests on hard-core algorithms such as sound effects, singing scoring, and echo cancellation.

For singing scoring, it was hard to find a suitable third-party technical service in early 2019. I brought in a part-timer to help build a set of algorithms, but the accuracy always had fairly serious problems. I then spent more than half a month working through four or five papers and made a substantial optimization of the scoring algorithm. Accuracy improved significantly, and it now roughly meets our needs for entertainment singing.
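The interview does not describe the scoring algorithm itself. As a rough illustration only, a common baseline for karaoke scoring compares the singer's detected pitch, frame by frame, with the reference melody and counts the frames that land close enough, ignoring octave differences. The function names, the frame format, and the half-semitone tolerance below are all assumptions made for this sketch, not Tear Song's method.

```python
import math


def hz_to_midi(freq_hz):
    """Convert a frequency in Hz to a (fractional) MIDI note number."""
    return 69 + 12 * math.log2(freq_hz / 440.0)


def score_frames(sung_hz, ref_midi, tolerance=0.5):
    """Return a 0-100 score: the share of reference frames the singer hit.

    sung_hz:  detected pitch per frame in Hz, or None when unvoiced (assumed
              to come from a separate pitch tracker, not shown here).
    ref_midi: expected MIDI note per frame, or None where no note is expected.
    tolerance: allowed deviation in semitones (an assumed value).
    """
    hits = 0
    total = 0
    for freq, ref in zip(sung_hz, ref_midi):
        if ref is None:
            continue  # nothing to sing in this frame
        total += 1
        if freq is None or freq <= 0:
            continue  # silence where a note was expected counts as a miss
        # Fold the difference into one octave so that singing an octave
        # lower or higher than the reference is not penalized.
        diff = (hz_to_midi(freq) - ref) % 12
        diff = min(diff, 12 - diff)
        if diff <= tolerance:
            hits += 1
    return 100.0 * hits / total if total else 0.0
```

For example, `score_frames([440.0, 220.0, 466.16, None], [69, 69, 69, 69])` counts the first two frames as hits (A4 and the same note an octave down) and the last two as misses. A production scorer would also have to deal with pitch-tracking errors, timing offsets, and rhythm, which this sketch leaves out.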

We also tried our hand at echo cancellation. At that time online education was booming, RTC vendors were focused mainly on conferencing and education, and the needs of entertainment karaoke products like ours received relatively little attention. The most obvious problem with echo cancellation back then was that in double-talk scenarios the voice was suppressed very heavily: vocals sounded badly muffled and some syllables were even lost.

This kind of problem is not particularly serious in meetings, where it is enough to hear clearly what the other party is saying. In karaoke, however, this damage to the voice makes for a very poor listening experience, so vocal detail has to be preserved as much as possible. At that time we extracted the AEC algorithm from WebRTC, disabled the nonlinear processing stage while the user was singing so that only linear processing ran, and relied on precise mixing of the accompaniment to cover the residual echo.
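To make the "linear processing only" idea concrete, here is a minimal sketch of a linear echo canceller built on a normalized LMS (NLMS) adaptive filter, with no nonlinear residual suppression applied afterwards. This is a textbook technique swapped in for illustration; it is not WebRTC's AEC code, and the function names, tap count, and step size are assumptions.

```python
def simulate_echo(ref, impulse_response):
    """Convolve the reference signal with a short room response (test helper)."""
    out = []
    for n in range(len(ref)):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if n - k >= 0:
                acc += h * ref[n - k]
        out.append(acc)
    return out


def nlms_echo_cancel(mic, ref, taps=16, mu=0.5, eps=1e-8):
    """Subtract a linearly estimated echo of `ref` from `mic`.

    mic: microphone samples (voice plus echo of the accompaniment).
    ref: the accompaniment samples that were played out.
    Only the linear adaptive filter runs; there is no nonlinear stage, so
    whatever the filter cannot model stays in the output as residual echo
    instead of being removed at the expense of the voice.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    output = []
    for d, x in zip(mic, ref):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate
        energy = sum(h * h for h in history) + eps
        step = mu * error / energy
        weights = [w + step * h for w, h in zip(weights, history)]
        output.append(error)
    return output
```

With a microphone signal that contains only echo, the output decays toward silence as the filter converges. A real system would additionally need delay estimation between playout and capture and some protection against the filter diverging while the user sings, both of which are omitted here.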

Compared with what RTC vendors generally delivered at the time, this solution gave a much better experience in most cases. Of course, once the online education boom ended, vendors began to pay attention to the pan-entertainment market, and the experience in this area has taken a qualitative leap. We have now also purchased a third-party AI echo cancellation algorithm.

At present, our audio and video work is relatively independent of the business. Most optimization iterations are not closely tied to business features; they are delivered independently and then ship with the business side's regular release train. Some of this work addresses experience problems raised by the product team, and some comes from our own analysis of user feedback and statistics. There is also business-related development, such as certain singing modes and scenarios that require singing scoring. For these we evaluate the requirements together with the business side and fold them into the business project schedule.

Publicity and technology are both areas we are working to improve as we grow. What matters most for retaining users is the strength of the product, and technology serves that strength too. We still have a long way to go here.

The unavoidable topic: cutting costs and improving efficiency

When it comes to reducing costs and improving efficiency, a small team mainly has to play to its strengths and find partners to cover the areas where it is weak or that it cannot afford to pursue. For example, in real-time karaoke, capture and rendering and device-model adaptation have a large impact on user experience and are things we can do ourselves, so we have been building them up since 2019. For RTC transmission optimization, AEC processing, and server deployment, we instead set up a lab evaluation system and select the best service provider for our needs.

In terms of cost, the biggest expense in real-time karaoke is the RTC service. We currently combine the RTC services of several vendors with our own capture and rendering, which minimizes the cost of switching; several RTC services run in production at the same time with a consistent experience. This gives us a stronger position and the initiative when negotiating prices. In addition, client-side caching of CDN resources and on-demand use of RTC resources also cut part of the cost.
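One way to picture this arrangement is an in-house capture pipeline that talks to every RTC vendor through the same small interface, with each room deterministically assigned to a vendor by weight. The sketch below is only an assumption about how such a layer could look: `RtcTransport`, `FakeVendor`, `pick_vendor`, and `KaraokeSession` are hypothetical names, and no real vendor SDK is called.

```python
import hashlib
from abc import ABC, abstractmethod


class RtcTransport(ABC):
    """Hypothetical minimal interface that each vendor adapter implements."""

    @abstractmethod
    def join(self, room_id):
        ...

    @abstractmethod
    def send_audio(self, pcm):
        ...


class FakeVendor(RtcTransport):
    """Stand-in adapter; a real one would wrap a vendor SDK."""

    def __init__(self, name):
        self.name = name
        self.room = None
        self.sent = []

    def join(self, room_id):
        self.room = room_id

    def send_audio(self, pcm):
        self.sent.append(pcm)


def pick_vendor(room_id, weights):
    """Map a room to a vendor name, stable across processes and restarts.

    weights: dict of vendor name -> non-negative integer traffic share.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one vendor needs a positive weight")
    digest = hashlib.sha256(room_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % total
    for name in sorted(weights):
        if bucket < weights[name]:
            return name
        bucket -= weights[name]
    raise AssertionError("unreachable: bucket is always below total")


class KaraokeSession:
    """Own capture/render code; only the transport behind it varies."""

    def __init__(self, room_id, transports, weights):
        self.vendor = pick_vendor(room_id, weights)
        self.transport = transports[self.vendor]
        self.transport.join(room_id)

    def on_captured_frame(self, pcm):
        # Frames come from the in-house capture pipeline (not shown).
        self.transport.send_audio(pcm)
```

Because everyone in a room must use the same vendor, the choice is keyed on the room ID rather than made per user. Shifting traffic between vendors then becomes a change to the weights instead of a client rewrite, which is the kind of low switching cost described above.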

03  Be in the present, seize the future

Singing is human nature, and so is socializing. The younger generation in particular has more individual needs for self-expression and social identity, and invests more sense of belonging and energy in forming groups in virtual communities. The karaoke-based social vertical still has a great deal of room to be tapped.

I think karaoke will also develop toward making it easier to build up social relationships. For example, more accurate matching and recommendation strategies would let like-minded users find each other and build relationships efficiently. AI-based automatic adaptation and composition would let talented non-professional users produce their own signature works, and show off their talent, efficiently and at low cost. In the music field, AI-based separation of vocals and accompaniment is already relatively mature, and its results are basically good enough for practical use.

Another example is AI-based echo cancellation and noise reduction, which can reach a level that traditional algorithms cannot. In our own field of karaoke social networking, automatically classifying and recommending songs and users' performances with AI would be a meaningful direction. I also hope future technology can solve the latency problem of real-time chorus, so that users far apart can easily sing in harmony.

Finally, a word on what I will be presenting at this LiveVideoStackCon. I will mainly share the pitfalls I have run into in audio and video over the past few years, mostly on Android and iOS. I will also cover some technical points specific to karaoke scenarios, how to turn audio and video optimization into metrics that management recognizes, and the common stuttering problems in voice-chat scenarios.


*Article source:

Douban "Spider-Man: Across the Universe"

Origin blog.csdn.net/vn9PLgZvnPs1522s82g/article/details/131255322