Measuring and Applying the "Extreme Value" in RTC Experience Optimization

As demand for online interaction grows, co-hosted live streaming and voice/video chat are used ever more widely. We keep talking about "pursuing the ultimate user experience", but experience is an abstract concept that is hard to quantify and measure. How do we derive a scenario's optimization "extreme value" from user behavior? How do we build a unified quality metric system on that extreme value to guide business optimization? And how do we transfer the experience gained serving Douyin to meet the needs of toB customers? LiveVideoStackCon 2022 Beijing invited Yang Zhichao, head of the Volcano Engine RTC team, to share how Volcano Engine RTC understands experience in real-time communication scenarios and puts that understanding into practice.

By Yang Zhichao

Edited by LiveVideoStack

Hello everyone. The topic I am sharing today is measuring and applying the "extreme value" in RTC experience optimization. The key words are "extreme value": in RTC it is often hard to define the ceiling of experience, so how can we define that ceiling through data and use it to guide the team toward extreme-experience optimization?

I am Yang Zhichao, head of the Volcano Engine RTC experience team.


This talk covers the following topics:

1. The RTC metric system and its goals;

2. Exploring the "extreme value" of direct indicators, taking room join as an example; the "extreme value" of indirect indicators, taking freezes as an example; and the "anomaly feature library" indicators: the three types of indicators call for different analysis approaches in daily use;

3. Capability migration and best practices for experience optimization: real cases showing how these three types of indicators improve user experience.

-01-

The RTC Metric System and Its Goals


Why invest so much effort in metrics? Many teams working on RTC do not: their indicators are simply computed and reported. In my first two years at ByteDance I managed RTC suppliers, who provided us with a great many indicators, but along the way several incidents made me realize how important metrics really are. In one major incident, Douyin co-hosting failed for more than 40 minutes: users could live-stream but could not join rooms to connect mics, yet the join success indicator still read 99%+. We questioned this, and the supplier's explanation was that a bypass server had failed, which only exposed the problem with how the indicator was defined.


In daily work, even when the business itself was not fluctuating, indicators would sometimes drift by 20%-30%. Intuitively that suggests a problem somewhere in the service, but after investigation the usual response was that the alarm threshold had not been reached, so no attention was needed for now. When the threshold finally was reached, the supplier's alarm proved useless and the cause could not be found. The suggestion then was to observe for another two days, after which the indicator either returned to normal on its own or stayed elevated, and the supplier would ask whether user-experience feedback had changed. If it had not, the threshold was simply raised and observation continued.


After several such incidents, we started building our own metrics. After two years of refinement, the indicator system has become, as we say internally, "stable, accurate, and ruthless".

  • Stable — fluctuations are very small: the variance of core indicators over the preceding 14 days is <0.003. We use a 14-day window because the overall domestic network fluctuates relatively little over that span. Indicators including room join, 80 ms audio freeze, and 500 ms video freeze basically fluctuate only at the fourth decimal place (the 0.01% level).

  • Accurate — every fluctuation has a reason. When a problem occurs, we must at least find the general direction to investigate. After extensive attribution analysis we set a goal of an attribution ratio > 95%: only at that level can the overall trend of an indicator be read meaningfully; fluctuation without attributed causes is meaningless. For core indicators, the attribution ratio is 98.23% for room join, 94.92% for 80 ms audio freezes (relatively low because minor freezes do not all need attribution), and 97.36% for 500 ms video freezes.

  • Ruthless — every cause must have consequences, and every alarm must be traced to a root cause. We handled 41 alarms in the past month, and over the past year the exact cause was identified for 92.7% of them.


Making indicators "accurate" imposes three requirements:

  • A clear goal. This is critical. For RTC the goal has two opposing sides: one is commercialization, which wants indicators to be as stable as possible with little fluctuation; the other is sensitivity, which wants indicators to fluctuate as early warning whenever there is a problem, surfacing each case for analysis. Other solutions on the market treat these as two separate sets of indicators, but after discussion between product and R&D we decided to build a single set, folding what would have gone into the second set into the study of this one: study it thoroughly, show the business side more detail, and earn more of their trust.

  • Attitude. Every question raised about the data should get an answer, ideally in the next version. Volcano Engine itself relies heavily on A/B experiments, and everyone takes data requirements very seriously, which helps a great deal.

  • Down to earth. I summarize what makes indicators accurate in three phrases: the minimum behavior granularity, alignment with API behavior, and the minimum threshold of perceived experience.


In a conference scene, the path from joining the room to speaking, muting yourself, unmuting yourself, and speaking again involves two first-frame transmissions, and each first-frame transmission is one sample. This matters greatly when defining indicators: find the smallest behavior granularity and define it as the smallest metric granularity.


This is a typical calculation model for a success-rate indicator: from process start (A) to success, failure, or timeout (B), plus follow-up events after success (C). Many implementations, however, count only the B events, or only the A/B1 events. If the full set of A, B, and C events is not considered, the indicator is likely to fluctuate with the volume of log reporting. By fully aligning all A, B, and C events with the user's API calls, we avoid the situation where users persistently fail to join the room while the indicator still reads 99%.
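As a minimal sketch of this idea (the data model below is illustrative, not the real reporting schema), a join success rate anchored to the API call itself might look like:

```python
from dataclasses import dataclass

@dataclass
class JoinAttempt:
    """One joinRoom() API call (an A event) and what it reported back."""
    got_outcome: bool   # a success/failure/timeout event (B) was reported
    succeeded: bool     # the reported outcome was success

def join_success_rate(attempts):
    """Denominator = every API call (A events), not only attempts that
    reported an outcome. Attempts that never report back count as
    failures, so lost logs drag the metric down instead of hiding."""
    if not attempts:
        return None
    succeeded = sum(1 for a in attempts if a.got_outcome and a.succeeded)
    return succeeded / len(attempts)
```

The point of the sketch is the denominator: dividing successes by reported outcomes instead of by API calls is exactly what lets the indicator stay at 99% while users cannot join.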


RTC indicators involve many thresholds. The simplest is the audio freeze threshold: the industry generally uses 80 ms, 100 ms, or 200 ms, so how did we pick ours? RTC mainly uses the Opus codec, whose packet-loss concealment compensates for missing frames. By the time it has concealed 4 frames, it encodes the signal as unvoiced sound, harmonics, and noise, and the harmonic attenuation coefficient has dropped to 0.48; once the attenuation coefficient falls below 0.5, the audio is effectively unintelligible. So concealment covers at most 4 frames, i.e. 80 ms. A syllable of normal speech also lasts about 80 ms: when someone says "he/和", if the "h" sound is lost, the listener hears "e/额". Beyond 80 ms the sound has shifted and is no longer the original sound, which is unacceptable. That is the principle we used to set the minimum threshold when defining the indicator.
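A minimal sketch of a freeze-rate metric built on that 80 ms floor (the input, a list of concealment-gap lengths in milliseconds, is an assumed simplification of the real telemetry):

```python
OPUS_FRAME_MS = 20                          # one Opus frame
FREEZE_THRESHOLD_MS = 4 * OPUS_FRAME_MS     # 80 ms: concealment limit

def audio_freeze_rate(gap_ms_list, total_ms):
    """Fraction of playback time lost to gaps of 80 ms or more.
    Shorter gaps are assumed to be concealed inaudibly by the codec
    and are not counted toward the freeze rate."""
    frozen = sum(g for g in gap_ms_list if g >= FREEZE_THRESHOLD_MS)
    return frozen / total_ms
```

The threshold is derived, not tuned: 4 concealable frames × 20 ms per frame = 80 ms.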


Once the metric system was built, how did we polish it? The process took roughly two years. The principle: when an indicator is very stable it follows a normal distribution, so 99.7% of values fall within 3 standard deviations of the mean. In other words, once a value moves outside the 3-standard-deviation band, there is a 99.7% chance that a real problem has occurred.


On May 19 last year, our 5 s join success rate dropped to 99.3%. On a long time axis that looks unremarkable, but examined on its own: the mean over the previous 14 days was 99.342% with a standard deviation of 0.013%, putting the monitoring threshold at 99.303%. An actual value of 99.3% therefore meant a greater than 99.7% probability of a real online problem. Attribution at the time also showed ICE connection-node acquisition failures rising. Following that lead, we found that a server release had shipped a very low-probability bug that could reject connections, affecting the join success rate by less than 0.01%. Precisely because the indicator was this accurate, we caught the problem promptly instead of letting it sit live, liable to blow up at any time.
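The monitoring rule above reduces to a one-liner; with the talk's numbers, 99.342 − 3 × 0.013 = 99.303, matching the stated threshold. A sketch (the choice of the population standard deviation estimator is my assumption):

```python
import statistics

def three_sigma_floor(history):
    """Alarm floor for a success-rate metric: the trailing-window
    mean minus three standard deviations."""
    return statistics.mean(history) - 3 * statistics.pstdev(history)

def is_anomalous(value, history):
    """Flag a value that falls below the 3-sigma floor."""
    return value < three_sigma_floor(history)
```

In practice `history` would be the previous 14 days of the indicator, recomputed as the window slides.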


The metric system falls into three categories: direct indicators, indirect indicators, and anomaly feature library indicators.

Direct indicators include room join, first frame, crashes, and so on; indirect indicators include freezes, latency, CPU, memory, and so on. The difference is 0-versus-1 against degree: for a single voice user, a room join either succeeds or fails, while an indirect indicator such as a freeze can be 50 ms, 100 ms, etc. — a matter of degree.

Anomaly feature library indicators cover the silence, noise, echo, and similar problems that QoS indicators cannot.

Below, these three types of indicators are introduced in detail with examples.

-02-

Exploring the "Extreme Value" of Direct Indicators — Taking Room Join as an Example


Join failure is the biggest experience loss a user can suffer. When we first integrated with Douyin, their request was that any failure to join a room was unacceptable. Our conclusion at the time was that this requirement was itself unacceptable, even impossible, and we confidently went head-to-head ("PK") with the business side.

The result of the PK: Douyin showed us the co-hosting flow. After user A invites user B, the business side forwards the invitation to B, and A and B join the room together to start connecting mics. Douyin then raised two questions. First, if failures are due to bad networks, why did all the preceding business requests succeed while the RTC join kept failing? Second, why would a user whose live stream is already freezing start co-hosting at all; shouldn't they fix the stream problem first?

We found the business side's reading entirely reasonable, so we set ourselves the goal of a 100% room-join success rate.


After analysis, we concluded that breakdown and attribution must be done in a down-to-earth way. A typical example: "an ICE connection failure does not mean the network is bad". When investigating cases earlier, most connection failures had been attributed to poor networks, but the business side rejected that result; they believed the failures were caused by internal bugs.

We later clarified a concept internally: when analyzing indicators, breakdown and attribution are two entirely different things. "ICE connection failed" only says which step failed; it is not the reason for the failure. The first task is to clarify the breakdown of steps. Concrete attribution must then meet the standard of "technically approved and understood by the business": only when both sides reach consensus can everyone commit 100% to the work.

For the network question the business side raised, our follow-up solution was to send a standard HTTP request at room-join time. The standard is fully aligned with the business side, covering 30+ domain names worldwide, with at least 3 domains chosen per request. When these requests fail, we conclude that the join failure was caused by the user's network, and the business side has accepted this attribution. Correspondingly there is a second attribution: HTTP succeeds but ICE fails to connect. Both technology and business recognize this as a bug on our side that needs optimization, because an ICE connection should be easier to establish than HTTP.

There are further attributions: failures caused by clients that support only IPv6 (most common in Europe); crashes during join caused by bugs in the business side's calls; poor domain-name quality; and so on.
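The attribution order described above can be sketched roughly as follows (the field names and exact precedence are illustrative assumptions, not the real schema):

```python
def attribute_join_failure(crashed, ipv6_only, http_probe_ok, ice_ok):
    """Classify one failed join attempt into the attributions above."""
    if crashed:
        return "crash during join (business-call bug)"
    if ipv6_only:
        return "client supports IPv6 only"
    if not http_probe_ok:
        return "user network failure (HTTP probes failed)"
    if not ice_ok:
        return "our bug: HTTP ok but ICE failed"
    return "unattributed"
```

The key design point survives any reordering: a failure is blamed on the user's network only when the aligned HTTP probes also failed, and "HTTP ok, ICE failed" is always owned by the RTC side.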


After doing the above work, we were able to answer two questions:

① For Douyin, how much optimization headroom remains before 100%?

② For any other business, how much headroom remains before it reaches Douyin's level? Our earlier answer was that a join success rate optimized to 99% was Douyin-level. But the business side would argue that their users are high-end users, so their join rate should be above 99%. How do we answer that?

For each attribution, we calculate the number of theoretically optimizable cases; comparing the resulting indicators with Douyin's yields both the gap between that business and Douyin and, at the same time, the optimization ceiling.

-03-

Exploring the "Extreme Value" of Indirect Indicators — Taking Freezes as an Example


Indirect indicators mean different standards for a single user in different scenarios. For example, anchor PK and voice chat rooms tolerate freezes completely differently. PK is mainly about building popularity: when a freeze hits, the anchor will likely end the session and pick someone else to PK with. A voice chat room is about accumulation: the anchor can chat with the audience all day and will not switch rooms easily. Subdivide further within the voice chat room and you find that the anchor and the guests tolerate freezes differently too: to protect the room's popularity the anchor will not leave over a freeze, but a guest may simply switch to another chat room.

In this regard, we set a goal to find a way to measure user discomfort that can adapt to as many scenarios as possible.


The picture above shows, from the user's point of view, how reactions vary with the cumulative duration and severity of freezes:

  • At a slight freeze of around 100 ms, the user cannot feel it;

  • At larger freezes of 500-700 ms, users put up with it;

  • When a freeze lasts about a second, users complain but keep watching;

  • When a freeze exceeds 5 s, most users exit and re-join, or simply leave;

  • Under multiple continuous freezes, users try to file feedback;

  • When repeated retries achieve nothing, users conclude the service is broken and stop using it.

From the business side's perspective the chain looks like this: feedback arrives one report after another, and retention and average co-hosting duration fall. Average mic-connection duration can serve as an important QoE indicator linking to QoS: when one mic session goes badly, the user reconnects, so over a given period the number of sessions rises, the average session duration naturally falls, and the business indicator goes down.


Given that reaction chain, we made a bold assumption. First, what is QoE here? The only user action we can reliably capture and anchor on ourselves is "exit".

As freeze duration grows, the exit ratio behaves as in the figure below. At first users do not perceive the freeze and the exit ratio is very low. As the freeze duration passes a certain point, most users begin to exit and the slope climbs to an extreme value. Past that point, a residual group turns out to be insensitive to freezes: they must stay on the mic for some fixed reason — attending the morning meeting, say, where the content matters less than attendance — so no matter how severe the freeze, they will not leave. We found that every scenario degrades to such a plateau at some fixed percentage, so we finally chose the sensitive point of user behavior as the threshold for whether a freeze affects user experience.
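Locating that sensitive point can be sketched as finding the maximum slope of the exit-ratio curve (the sample curve in the test is invented for illustration, not Douyin data):

```python
def sensitive_point(freeze_ms, exit_ratio):
    """Given an exit-ratio curve sampled at increasing freeze durations,
    return the duration at which the slope (extra exits per ms of
    freezing) peaks -- the behavioral 'sensitive point'."""
    best_i, best_slope = 1, float("-inf")
    for i in range(1, len(freeze_ms)):
        slope = (exit_ratio[i] - exit_ratio[i - 1]) / (freeze_ms[i] - freeze_ms[i - 1])
        if slope > best_slope:
            best_i, best_slope = i, slope
    return freeze_ms[best_i]
```

On a curve whose steepest jump sits between 1 s and 2 s, the function lands on 2000 ms; the flat anchor curve described in the talk would yield no pronounced peak at all.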


That is the overall hypothesis; the figure shows the actual data for Douyin.

1v1 PK is very sensitive to freezes: users exit and reconnect en masse at around 2 s. The anchor curve has essentially no shape at all — anchors tolerate freezes extremely well — and guests fall between 1v1 PK and the anchor. In the middle sits the Douyin IM call, where the caller and callee curves are identical, meaning the two sides' scenarios are identical.

The most interesting property of this graph is that it does not change over time. Whether you take the first week of each month, the first half of the year, or the second half, the curves verified over different periods almost perfectly overlap; the shape depends entirely on the user-experience scenario and the user group of the corresponding app.


On this basis we compute the sensitive point from each curve, then analyze and break down the causes of freezes that exceed the sensitive point, obtaining the two graphs above and using them to guide online optimization. This directs effort toward the results most likely to move QoE.

-04-

The "Anomaly Feature Library" Indicators


While supporting various businesses — internal innovation projects, confidential projects, and external toB business alike — we found the business side is often uninterested in what we build with QoS indicators; there is a big gap between the two. C-end users are easy to measure with QoS indicators, but B-end business parties are swayed by the key person: the boss. We often receive feedback like "it froze again, go find the cause", and the business side concludes our RTC is not up to standard. In another common situation, users file feedback themselves, or the app has its own feedback channel; when we report good QoS indicators to the business side, they just ask: with this much feedback, you're telling me quality is fine?

At first we thought: QoS indicators are how the industry measures system quality, how can they fail to persuade the business side? Analysis showed that this class of case has three characteristics:

① They are hard to measure with indicators, because each problem is random;

② Either side may be at fault: perhaps the business side made the wrong call, or changed logic around a call, causing the problem;

③ The volume is small enough not to move online indicators, yet it damages user experience. For an individual user, a severe freeze that drives him to file feedback recurs easily in his own scenario, which is a very serious problem.

To solve this class of problem we analyzed many similar cases and reached a conclusion: "users find the problem before the boss does". Every problem a boss discovered had already appeared in earlier user feedback; but the feedback volume is small, fluctuates heavily, and is hard to mine for useful information, so it had received little attention. So, built on feedback, we created a set of anomaly feature library indicators, hoping it could approximate the user's own evaluation system.


The foundation for this set of indicators is a trait of the Douyin Group: it cares about user feedback, to the point that the screens in our company cafeteria broadcast user feedback unedited.


Before building it, we unified and categorized all real-time-communication-related feedback across the Douyin Group.

Second, the Douyin Group receives as many as 30,000+ feedback reports online every day, each typed out word by word by a user. Analysis along our subdivided dimensions proved very stable, which helps later analysis greatly.

We also designed a process: when a single user files feedback, a team analyzes it and lands it in the anomaly feature library. Through two sets of system calls, the intelligent service assistant proactively pushes the finding to the SA, who can then communicate with the user directly and turn the information into service tailored to the customer's characteristics.

The key here is two points:

  • The first point is defining the concepts. "Problem" includes silent problems. "Feedback": the feedback corresponding to silence includes "I can't hear the other party" and "the other party can't hear me"; they look interchangeable, but the optimization directions for these two bugs are completely different. "Phenomenon attribution": for example, audio_level == 127 is the most important phenomenon attribution of the silence problem, meaning the capture volume is zero. Beneath phenomenon attribution sits the fourth concept, "technical attribution": for the silence problem, audio_level == 127 alone has 30+ technical attributions behind it.

  • The second point is the attribution standard. We had previously abstracted feedback content into standard rules for everyone to use, but the question is whether a rule is genuinely universal or merely correlated with one batch of feedback. For this we set three mandatory standards:

    • The target users' feedback rate must be more than 1x higher than the overall feedback rate. The picture on the right shows the actual figures: for most attributions the corresponding feedback rate is more than 2x the overall rate, the highest being 8x, which means the feature correlates most clearly with silence feedback — wherever the feature appears, silence problems follow.

    • Migrated to a scenario with recording, manual verification must exceed 80%. The underlying instrumentation is shared, so once a rule is found in one scenario it is hard to verify that its attribution really is silence. In a meeting scenario, for instance, verification usually requires watching the video, but meeting video cannot be retained; so we migrated the meeting rules to the co-hosting scenario, which has video, and verified rule accuracy manually. Every rule must exceed 80% accuracy during validation.

    • Overall coverage must exceed 95%. The first two criteria establish that attribution rules correlate strongly with silence feedback; this one fits the rules into QoS indicators that correlate strongly with the QoE feedback action.
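As a toy sketch, the three admission standards reduce to a simple gate (the names, and reading "more than 1x higher" as target > 2 × overall, are my assumptions):

```python
def rule_admissible(target_feedback_rate, overall_feedback_rate,
                    manual_verify_accuracy, coverage):
    """The three mandatory standards for admitting an attribution
    rule into the anomaly feature library."""
    return (target_feedback_rate > 2 * overall_feedback_rate  # "1x higher"
            and manual_verify_accuracy > 0.80                 # recorded-scene check
            and coverage > 0.95)                              # overall coverage
```

A rule that correlates strongly with feedback but verifies poorly on recordings, or covers too little of the feedback, is rejected rather than shipped.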


The picture shows problems found and fixed through feedback. For example, "this combination is always silent" reflected a product-compatibility problem introduced during code iteration: silence feedback suddenly increased, so the bug was investigated and corrected before the business side ever reported it. There are also silence problems caused by the business side's calls or by the customer's device, typically the microphone being occupied by another app. The picture on the right covers howling feedback: when many users join a meeting from laptops without headphones, they howl against the meeting-room equipment; the system now proactively prompts "join the meeting in audio-off mode?", which resolves the problem.

-05-

Capability Migration and Best Practices for Experience Optimization


Here is a silence case; situations like this surface on Douyin all the time. Let's walk through it as a review:

  • August 12, 11:40 — proactive inspection found Douyin's silence feedback rising.

  • August 12, 11:45 — from the technical attribution dashboard, the causes were pinned down to "abnormal capture frame rate" and "abnormal first-frame sending"; the former was still climbing while the latter had begun to fall.

  • August 12, 22:31 — it was confirmed that "abnormal first-frame sending" had already been dealt with earlier, leaving mainly "abnormal capture frame rate". (Without detailed analysis, this is where we would have reached the same conclusion as the old supplier: "problem solved, observe for another two days".)

  • August 16, 00:15 — it was confirmed to be a business-call problem; RTC itself had no real defect.

  • August 26, 17:24 — the business side reproduced and clarified the problem: their multi-threaded calls set mute on the customer when a phone call was answered.

  • September 14 — after the bug was fixed, the feedback rate of "the other party can't hear me" dropped, the rate of "I can't hear the other party" showed no significant change, and a new attribution emerged: capture startup failure.

  • October 12 — during rule review, the rule was revised to "the same user connects to the mic several times in a row; the first time is normal, with no sound afterwards", because "normal the first time" shows the customer's device is fine, and "no sound afterwards" shows some link in the chain has failed.

  • November 20 — the rule was finalized into the feature library, which revealed that an internal innovation business had the same problem; it has since been optimized and launched.

Overall, we are steadily improving user experience through these methods. That's all for this talk — thank you!



LiveVideoStackCon 2023 Shanghai: Call for Speakers

LiveVideoStackCon is everyone's stage. If you lead a team or company, have years of practice in a particular field or technology, and are keen on technical exchange, you are welcome to apply to speak at LiveVideoStackCon. Please submit your proposed talk to: [email protected].


Origin blog.csdn.net/vn9PLgZvnPs1522s82g/article/details/130633266