How to do a good in-game real-time voice experience

Welcome to the cloud + community to get more Tencent's massive technical practice dry goods~.

Author: Zhang Xiaoyu, Senior Architect of Tencent Cloud Game Industry

Posted in Cloud+Community by Tencent Game Cloud

In-game voice communication needs

As early as 2015, iiMedia Research has statistics related to mobile game social interaction: nearly 40% of players choose a game because of social factors; 15.6% of players choose to leave a game because of poor sociality. Therefore, how to improve the social attributes of games has become an important part of the game planning of major game manufacturers. From squad battles, team dungeons, to game rankings, PVP team competition, guilds, clans and other gameplay, all are an effective way to enhance the social nature of the game, thereby improving game stickiness, player activity, and retention rate. With the sharp increase in the performance of mobile devices, mobile games have also developed from casual games with simple scenes to serious games such as competitive games and large-scale MMOs that pursue more operation and game experience. Communication between players in this type of game is essential. Fewer functional requirements. However, compared with end games, the typing system in mobile games is more inconvenient, and text messages cannot meet the real-time needs of communication; in addition, due to the high requirements for mobile phone network and performance due to heavy games, unstable networks and limited Computing resources are a problem that every mobile terminal cannot completely solve at present, and out-of-game voice communication software such as mobile QQ cannot satisfy the situation of resource constraints without affecting the experience of players in the game. Therefore, the integration of mobile games Lightweight voice chat capabilities have become an inevitable choice for mobile game manufacturers.

Challenges to Voice Capability in Mobile Games

Real-time voice communication technology is no longer a new topic in the industry. From traditional VOIP call center manufacturers to voice service providers in the field of cloud communication, there are a large number of ready-made SDKs for various APPs to integrate their real-time voice capabilities. However, for the real-time communication requirements in mobile games, it is not only possible to simply implement real-time calls, but more challenges are as follows:

1. Impact on the core gameplay of the game

In recent years, the performance of mobile devices has been greatly enhanced, but compared with players' pursuit of game experience, the current performance of consumer-grade smart devices can only meet the basic needs of high-quality games, and there is not too much surplus. If the real-time voice capability occupies too much CPU and memory of the device, it will lead to a decline in the experience of the game itself. Although social needs are an important direction for players to choose a game, the game experience is the fundamental factor that determines whether a game can survive. Since 2015, MOBA and FPS games that require extremely high network latency have also been moved to the mobile terminal. Faced with the extremely unstable network environment and expensive 4G traffic costs on the mobile terminal, game manufacturers have optimized game packets in order to , Reduce network traffic, reduce network delay, try hard, if the network quality of the game will be affected overnight before liberation due to the addition of voice capabilities, it will obviously be worth the loss. In addition, whether the addition of voice capabilities will greatly increase the size of the first game download package is also the top priority of major game publishers.

2. Sound processing in mobile gaming environment

Compared with client game scenarios, the biggest advantage of mobile games is that we can play games on the move. However, in terms of real-time voice, the convenience of this “movement” also introduces more advantages for clear voice transmission. Problems: The noisy background sound on the subway or on the road affects the normal voice quality; the distance between the mobile phone and the mouth fluctuates, and the sound is loud and quiet; the background sound of the multi-person talking and the overlapping of the background sound of the game cause the energy to be too large and cause the popping sound; A large number of echoes caused by the player's mobile phone sound. These are complex sound processing issues that do not occur frequently in the relatively simple environment and headset scenarios of end-game games.

3. Capability coverage of multi-voice scenarios in mobile games

At present, among the more serious mobile games, competitive games such as MOBA, chicken-eating and MMO games occupy the mainstream. The real-time combat attributes of such games urgently require the introduction of voice capabilities in mobile games, and the voice in casual chess and card games has gradually become an enhanced player. important means of communication. In new and popular social games, voice has become one of the basic capabilities of the game. However, various games have different requirements for voice capabilities: competitive games require players on the same team in the game to connect to the microphone to minimize the impact on mobile device performance and network while ensuring basic communication needs; MMO games There are many players, such as the team voice in the PVP scene, the team voice in the PVE scene, and the gameplay similar to the in-game anchor channel has recently been paid attention by many MMO games; social games such as Werewolf are more concerned about the real-time voice quality, smooth Caton-free communication is a necessary condition for the long-term operation of games; casual games also use real-time voice and voice messages as auxiliary social means to increase player activity. In addition, there are also many players who do not have the ability to listen to voice in the mobile environment, so the voice message-to-text function provided on WeChat is also one of the essential capabilities of mobile games.

4. Mobile game voice globalization capability

In 2017, the scale of domestic mobile game users has approached 600 million, and the growth rate has slowed down year by year. The demographic dividend has disappeared, and the competition has become increasingly fierce, which also forces major manufacturers to move towards the route of high-quality games. In contrast, domestic games have a lot of room for development overseas, and more high-quality mobile games are quickly promoted and released globally after being recognized by domestic players. Do foreign players have the same voice requirements as domestic players? In the research report of Greg Wadley et al. (Voice in Virtual Worlds: The Design, Use, and Influence of Voice Chat in Online Play), a clear answer is given: Globally, in-game voice can be greatly to enhance the player's gaming experience. The globalization of games poses new challenges to in-game voice—how to provide a smooth and clear voice experience for players across regions—unstable networks, long-distance transmission, and the deployment and operation of voice service infrastructure around the world are all A headache for game makers.

Common technical solutions for real-time voice

In many studies related to speech, there have been a lot of methods for speech processing in different environments, and some network problems of streaming media protocols and how to solve them in the Internet environment have also been suggested by predecessors. This section introduces some general solutions such as voice preprocessing and streaming media protocols for the challenges faced by real-time voice in the above mobile game environment.

Speech preprocessing routine

Figure 1 General flow of speech preprocessing

1. Voice noise reduction

In the complex environment of the mobile device, a large amount of noise will be received while receiving the voice signal, so the voice noise reduction technology is a necessary means to improve the sound quality and increase the accuracy of voice recognition. Noise reduction technology is generally divided into noise reduction methods under single-microphone system and multi-microphone system. Among them, multi-microphone system has high requirements on the direction and distance between multiple microphones. Commonly used mobile devices do not have such perfect multi-microphone. channel design, so filtering noise reduction or noise thresholding methods under a single microphone are more commonly used in mobile device noise processing. Noise processing is relatively simple on mobile devices. First, many high-end smart devices have built-in dedicated noise reduction chips for the operating system to call, and mobile operating systems also have built-in many efficient noise reduction algorithms for developers to call, such as NoiseSuppressor in Android.

2. Voice Activation Detection (VAD, Voice Activiy Dection)

The purpose of voice activation detection is to determine whether a piece of sound is background noise or speech, and this technology is often used as the basis for various speech follow-up processing and speech recognition technologies. In the ubiquitous noisy environment of mobile games, accurate recognition of speech signals is particularly critical. On the one hand, through the recognition of the voice signal, the part of the voice without voice can be removed, the size of the voice transmission file can be reduced, and the CPU memory consumption of other voice processing methods can also be reduced; on the other hand, the accurate extraction of voice signals can also be effective. Improve the accuracy of speech recognition to text.

VAD processing process

Figure 2 VAD processing process

Due to the characteristics of speech itself and the difference between it and its relative background sound: high energy and discontinuity, the analysis of short-term energy combined with short-term zero-crossing rate in the time domain analysis method of sound can effectively distinguish whether the sound is speech or noise; Speech signals in the domain can also be identified by features such as cepstrum and spectral entropy; with the development of machine learning, hidden Markov models, decision tree models and even the latest deep neural networks are also applied in the VAD field to improve noise. The accuracy of VAD in the environment. Compared with several methods, the time domain analysis has the lowest hardware requirements, and the frequency domain analysis is the fastest, while the model method is relatively complex, but has an advantage in accuracy. Up to now, three types of VAD analysis methods on the market have been widely used in different demand scenarios.

3. Echo Cancellation

Echoes, as a third type of annoying voice problem in addition to noise and background sounds, are also widespread in mobile gaming scenarios. The process of noise processing can be simply understood as, from all the audio collected from the near-end, the audio signal from the far-end is eliminated through the adaptive filter, and then output to the opposite end, which completes the purpose of echo cancellation. It is to continuously reduce the error between the filter weights and the echo path channel weights.

Echo Cancellation - Adaptive Filter

Figure 3 Echo Cancellation - Adaptive Filter

Adaptive filtering algorithms are generally divided into two categories: least mean squares (LMS) and recursive least squares (RLS) algorithms. The advantage of RLS is that the filtering effect is better, but LMS has the advantages of simple structure and low computational complexity, so More research also focuses on LMS, such as normalized least mean square (NLMS), proportional normalized least mean square (PNLMS), etc. to optimize and improve the LMS algorithm in terms of filtering performance and convergence speed. At the same time, the AcousticEchoCanceler interface is also provided in Android, which can directly perform echo cancellation on the sound. High-end smartphones will also have built-in professional sound processing chips to better eliminate echoes at the hardware level.

4. Multi-channel sound aliasing

In the game, the team voice will have multiple players talking at the same time, and while the players are listening to the voice, the background sound of the game cannot be removed, so how to make the multi-channel voice clear and not cause popping noise is a high-quality mix in this scene. sound criteria. The simplest mixing method is simple time-domain audio superposition. When the intensity exceeds the maximum value, the peak is clipped to the maximum value to avoid the popping sound, but the artificial peak clipping method will introduce additional noise while destroying the audio signal; another method is The multi-channel sound is linearly superimposed and averaged. The essence of this algorithm is to reduce the volume of the multi-channel audio. However, when the number of sound channels is too many or too few, this method will cause the volume of each channel to fluctuate, which will affect the experience; therefore, in practice In the usage scenario, a better way is to give corresponding weights during mixing according to the importance of each voice, so as to ensure the recognizability of each audio after mixing.

5. Voice automatic gain

Automatic gain control technology is widely used in data communication, voice processing, etc. The signal amplitude of communication signals often fluctuates greatly. The automatic gain technology can smooth the signal amplitude and improve the communication quality. In mobile game scenarios, the distance between the mobile phone and the person's mouth may change drastically depending on the different external environments of the player during the game. Therefore, smoothing the volume of each person's voice and the volume of a person's voice at different speaking times will affect the quality of voice calls. It matters a lot. In the scenario of multi-person real-time voice communication, the automatic gain can be completed after VAD processing, and the threshold value can be set according to the needs of multi-channel voice aliasing. Nice smooth volume effect.

Peak AGC System Structure

Figure 4 Peak AGC system structure

6. Streaming media transmission protocol

Common streaming media protocols include RTP, RTMP, HLS, HTTP-FLV, etc. Among them, RTMP, HLS, HTTP-FLV are more commonly used in major live broadcast platforms due to their higher support in domestic and foreign CDN platforms. RTMP is Adobe's patented protocol, with low protocol delay and high popularity in China, it is the most commonly used streaming protocol; HLS is an HTTP-based streaming media transmission protocol proposed by Apple, supports H5, and can be played at will on multiple platforms and multiple browsers. But the delay is high; HTTP-FLV uses HTTP to transmit FLV format files, avoiding Adobe protocol kidnapping, and the delay is slightly better than RTMP. In the fields of video surveillance and high-definition video conferencing with high real-time requirements, RTP is a more commonly used protocol. Compared with the first three TCP-based streaming media transmission protocols, the biggest difference is that RTP is a UDP-based protocol. Since UDP itself is out of order, RTP is often shared with the RTCP protocol and is used to control the packet order of the RTP protocol. At the same time, the data packet congestion control protocol (Datagram Congestion Control Protocol, DCCP) can also be used to control congestion, delay and data quality from eg. For example, the Webex video conferencing system with the highest market share in the field of network video conferencing adopts the SRTP (Secure RTP) protocol as its streaming media transmission protocol. At present, most of the interactive voice and video products of major cloud manufacturers also use RTP-like protocol—a private protocol optimized based on UDP protocol—as their real-time streaming media transmission protocol to reduce latency and bandwidth consumption.

7. Streaming media packet loss processing

In the unstable network environment where mobile devices are located, and the network quality problems between players across regions brought about by the global release of the game, voice communication needs to be considered in the case of network jitter and packet loss, so that the normal operation between players can still be guaranteed. communication. Common streaming media packet loss processing solutions are: ARQ, FEC and cross-transmission. In the case of serious network packet loss, it is impossible to generate clear voice information according to the existing packet processing. The automatic repeat request (ARQ, Automatic Repeat-reQuest) method of lost packets can retransmit large-scale lost packets, increasing the Delay but can guarantee the validity of the data. Forward Error Correction (FEC) can ensure the validity of data, reduce the frequency of retransmissions, and reduce delays through data redundancy rather than data retransmission in the case of a small amount of packet loss. The core of the algorithm can be analogized to a binary linear equation, which requires two equations to find the solution of the two elements. In FEC, three quadratic linear equations are transmitted, so in the case of losing one equation, we can still find the solution, and achieve the purpose of tolerating partial packet loss without affecting the validity of information.

FEC: In the case of packet 3 loss, the original data can still be recovered by relying on redundant packets

Figure 5 FEC: In the case of packet 3 loss, the original data can still be recovered by relying on redundant packets

In addition, conventional data packet transmission is sent in strict order. For example, if a sequence of data is divided into 123, 456, and 789 packets to be sent, once a packet is lost, a whole piece of information will be blank. , but in the streaming media environment, in most cases, the lost content of a single frame can be completely compensated by the content of the previous and subsequent frames and has little impact. Therefore, if the cross-transmission mode of 147, 258, and 369 is used to encapsulate the packet, a single frame The loss of packets will not have a great impact on the overall experience. In addition, in the case of cross-regional transmission during the service period when the network bandwidth is allowed but the packet loss is uncontrollable, the single-frame multi-packet encapsulation transmission mode can also effectively reduce the delay impact caused by frequent retransmissions caused by packet loss.

To sum up, it can be seen that only from the aspects of voice processing, protocol selection and dealing with network jitter, professional voice processing R&D personnel are required to carry out a number of optimization measures. However, this is only a theoretical discussion. In the actual game scene, it is more important to achieve all the above voice optimizations without affecting the core gameplay of the game, namely: 1) It is necessary to 2) Keep a very low crash rate in the face of many Android devices on the market and adapt to the latest devices at any time; 3) Support common game engines such as Unity, Cocos. In addition, in order to ensure the stability of the global voice service, the R&D cost of the architecture design, the cost of multi-region dedicated line connection, and the cost of post-operation and maintenance are all obstacles for game manufacturers to build their own in-game voice services.

Tencent Cloud Game Voice GVoice - Born for Games

As Tencent, which has been deeply involved in the game industry, as early as the initial stage of heavy mobile games (early 2015), the Tencent Interactive Entertainment team discovered the problem of text communication in mobile games, and officially formed a team consisting of QT, YY and other technologies. The game voice R&D team led by the backbone started the development process of game voice GVoice. Although Tencent already has very mature voice solutions such as QQ voice, WeChat voice messages and voice-to-text, it still needs to do a lot of optimization and improvement for game voice demand scenarios. With the rise of high-quality games such as Naruto and King of Glory, GVoice, which supports these game voice services, is also becoming more mature, and innovatively added the anchor channel gameplay to Yulongzaitian Mobile Games, which improves the playability of the game. This gameplay has also been adopted by other MMO games. Then in 2016, GVoice was opened to Tencent Game Cloud after a large number of Tencent games were polished and commercialized. Many partners only need to simply integrate into their games to use Tencent's mature game voice capabilities in the game.

image

As the only game voice product on the market for games, it includes real-time team voice, team voice, offline voice message, voice-to-text and anchor platform capabilities at the functional level, covering all voice demand scenarios of mobile games; at the technical level , GVoice widely uses voice processing methods such as noise filtering, echo cancellation, and automatic voice gain as described above, and continuously optimizes to improve voice quality; while ensuring sound quality, through the self-developed streaming media transmission protocol based on UDP protocol, The voice communication bandwidth is compressed to 3-5kbps, and the delay and bandwidth are greatly reduced compared with the standard protocol; after continuous optimization of the integrated SDK, the CPU occupancy rate is only 2%, and the package size after integration is only 2MB. The ultra-low crash rate tested by the market ensures that the integrated GVoice service will not have a negative impact on the core gameplay of the game; the carefully crafted and productized GVoice can also support common game engines such as Unity3D, Cocos2D, Unreal, etc., with only 5 lines of code. Complete the access of all voice functions; covering more than 20 service nodes on five continents, combined with the CDN deployed globally by Tencent Cloud, it can achieve one-time access and global services for game voice.

At present, the game voice GVoice has been widely used in the game industry, such as Tencent's most popular King of Glory, Changyou's Tianlong Babu, Sanqi's Eternal Era, Hero Entertainment's National Shootout, NetMarble's Paradise II-Rebirth in South Korea, NetDragon Globally released Blade of Souls, etc., all use GVoice to help their in-game social interaction, increase player activity, and enable the game to operate for a longer period of time.

At present, the game voice has been launched on Tencent Cloud's official website. Click the website below to learn more about the product information.

Related Reading

3D positional voice, leading the game experience of eating chicken to upgrade Walk vr - unity voice chat room development based on deep learning voice enhancement - minimalist source code


This article has been published by the author with the authorization of the cloud + community, please indicate the source of the article when reprinting

Guess you like

Origin http://43.154.161.224:23101/article/api/json?id=324413782&siteId=291194637