Tencent Games’ advanced practice of zero-intrusion observability based on DeepFlow

Authors: Feng Yafei, SRE engineer on the technical operations team of Tencent Games' publishing line; Chen Zixin, Blue Whale monitoring product operations

Tencent is not only committed to developing popular self-developed games, but also cooperates with well-known game developers around the world to bring these games to the market so that more players can enjoy the fun of games. These partners come from all over the world and use a variety of technology stacks, which poses complex challenges for game stability maintenance. This article aims to explore how Tencent Interactive Entertainment uses DeepFlow's eBPF technology to achieve non-intrusive observability. This strategy not only ensures a smooth user experience during the game's progressive release process, but also speeds up the diagnosis and resolution of problems, effectively preventing potential performance issues.

01 Game background introduction

"A Game" is a game developed by an overseas studio and published by Tencent. It is built on a technology stack that includes Scala, ZIO, Istio, and CockroachDB, which adds complexity to launching and operating the game. As the launch date approached, modifying the code to improve the application's observability became impractical, so the team urgently needed an observability solution that required no changes to the game's business code.

Against this backdrop, the team found that eBPF could provide the non-intrusive solution it needed. eBPF does not depend on a specific technology stack. It can automatically generate service call graphs, compute request, error, and duration (RED) metrics, record call details, and build distributed traces. On this basis, the game publishing team worked with the Blue Whale monitoring team to explore how to deploy eBPF quickly and give the game business out-of-the-box application observability with no changes on the application side, thereby ensuring a smooth launch and efficient operations.

Tencent Games

02 Ensure user experience during progressive release

The official launch of "A Game" followed a carefully designed progressive release strategy. This strategy gradually increases the share of users who receive the new version, generally in four stages: 10%, 50%, 80%, and 100%. A progressive release effectively controls risk, ensures that new versions launch smoothly, and ultimately gives all users a more stable and better gaming experience.

In a progressive release, the RED metrics (request rate, error rate, request duration) provide a real-time, intuitive window through which the operations team can promptly observe the new version's impact on service performance. With these metrics the team can make data-driven decisions, such as whether to keep expanding the user base, whether performance optimization is needed, or whether to roll back to the previous version. The ultimate goal of this fine-grained control is to give users the best gaming experience while protecting the stability and continuity of the business.

Today, after each new version of "A Game" is released, comparing the RED metrics of the old-version and new-version pods is part of the release checklist. If the difference between the two is not significant, the update has not introduced performance degradation, severe errors, load imbalance, or similar problems. Combined with other business-specific metrics, this lets the team decide with confidence to proceed with the full server update.

RED indicator
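As a rough illustration of the release check described above, the sketch below aggregates call records into RED metrics for the old and new pods and compares them. The `Call` fields, the `release_gate` helper, and both thresholds are assumptions made for this example; they are not part of DeepFlow or of the team's actual tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Call:
    """One request record; the field names are hypothetical."""
    duration_ms: float
    is_error: bool


@dataclass
class Red:
    rate: float         # requests per second
    error_ratio: float  # fraction of requests that failed
    avg_ms: float       # mean request duration in milliseconds


def red(calls: List[Call], window_s: float) -> Red:
    """Aggregate the calls seen in one observation window into RED metrics."""
    n = len(calls)
    if n == 0:
        return Red(0.0, 0.0, 0.0)
    errors = sum(1 for c in calls if c.is_error)
    total_ms = sum(c.duration_ms for c in calls)
    return Red(n / window_s, errors / n, total_ms / n)


def release_gate(old: Red, new: Red,
                 max_error_increase: float = 0.01,
                 max_latency_ratio: float = 1.2) -> bool:
    """Return True when the new version is not noticeably worse than the old.

    Both thresholds are illustrative assumptions, not values from the article.
    """
    error_ok = new.error_ratio - old.error_ratio <= max_error_increase
    latency_ok = old.avg_ms == 0 or new.avg_ms <= old.avg_ms * max_latency_ratio
    return error_ok and latency_ok
```

The request rate is computed but deliberately left out of the gate: during a 10% or 50% stage the two groups of pods receive different shares of traffic, so only the error ratio and the duration are directly comparable.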

If an anomaly is found during the progressive release, the call details make it possible to quickly identify the requested API, its parameters, the response time, the returned status code, and other information, which helps operations staff analyze, troubleshoot, and resolve the problem quickly.

Call details

At the same time, you can use the fully automatic distributed tracing capability to discover potential performance bottlenecks to help perform performance tuning and continuously improve the response speed and stability of the system.

Distributed tracing

03 In practice: eliminating the hidden risk of CPU surges in new versions

In several online updates after "A Game" launched (concentrated between 5:00 and 6:00 in the morning), we found that every time a new configuration table was released, CPU usage across the whole server cluster soared. It looked as if the clients had launched a DDoS attack on the servers.

CPU surge

Judging from the operating characteristics of "A Game", the server's CPU does not need to process too much calculation logic, and the main consumption should only increase with the number of client requests or online players. As can be seen from the curve of the number of online players in the figure below, there was no obvious sudden increase during the launch period, so it can be considered that the business was not affected.

Number of online players

On the other hand, because this suspected DDoS traffic consists of requests sent by legitimate clients, the security team's DDoS protection system treats it as normal traffic and does not act on it. In the past, investigations into this kind of infrastructure CPU anomaly usually ended here. From time to time, colleagues tried to analyze the application logs from the upgrade window, but the log volume was too large and they made no clear progress, so we could only assume that the program really did process more business logic, and therefore consume more CPU, at startup.

Comparison chart of analysis ideas

After eBPF observability was launched, this type of problem suddenly became simple and straightforward:

Step 0 - Continuous monitoring: During the upgrade, we continuously monitored the cluster's QPS trend using the Ingress gateway Request Rate metric that eBPF collects automatically, and found that QPS surged nearly tenfold. This essentially confirmed that the CPU increase observed earlier was caused by the QPS surge.

Request Rate
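A surge like the one in Step 0 can also be flagged automatically rather than spotted on a chart. The following is a minimal sketch that assumes the request rate has already been exported as a list of evenly spaced QPS samples; `find_surge`, the window size, and the factor of 5 are illustrative choices, not something the article or DeepFlow prescribes.

```python
from statistics import median
from typing import List, Optional


def find_surge(qps: List[float], baseline_points: int = 5,
               factor: float = 5.0) -> Optional[int]:
    """Return the index of the first sample that is at least `factor` times
    the median of the preceding `baseline_points` samples, or None."""
    for i in range(baseline_points, len(qps)):
        baseline = median(qps[i - baseline_points:i])
        if baseline > 0 and qps[i] >= factor * baseline:
            return i
    return None
```

Using the median of a trailing window keeps a single noisy sample from shifting the baseline, at the cost of needing a few samples before the first check.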

Step 1 - Drill-down analysis: Next, we analyzed the call details that eBPF collects automatically. The URI proportion chart showed that nearly 70% of the requests came from a particular client SDK.

URI proportion chart
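The proportion behind such a chart is a simple count over the call details. The sketch below assumes the request URIs have been exported as plain strings; the `uri_share` helper and the URIs in the usage note are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def uri_share(uris: Iterable[str], top: int = 3) -> List[Tuple[str, float]]:
    """Return the `top` most frequent URIs with their share of all requests."""
    counts = Counter(uris)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(uri, n / total) for uri, n in counts.most_common(top)]
```

For instance, a sample with 7 calls to `/sdk.Config/Get`, 2 to `/game.Match/Join`, and 1 to `/game.Chat/Send` would rank `/sdk.Config/Get` first with a share of 0.7.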

Step 2 - Locate the root cause: From here, the analysis can continue with either eBPF's detailed Request Log or the business logs. Either way, the volume of logs to examine has been greatly reduced. We searched the business logs for the entries matching the URI and found that the version carried by the client SDK in its gRPC requests did not match the server's version. As a result, the server always rejected the requests, and after each rejection the client retried frequently, which in effect produced a DDoS attack on the server. Once R&D confirmed the problem, it was fixed immediately. Continued monitoring after the fix went live confirmed that the cluster QPS no longer surged after a configuration table update and that CPU usage was back to normal, which eliminated a major hidden risk.

In addition, the Blue Whale team and the DeepFlow community jointly added support for custom extraction of gRPC header fields in the new version. We now extract the metadata that carries the client version number into the call logs, which should help catch similar problems earlier in the future.

metadata extraction
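Once the client version is available in the call logs, a mismatch like the one in Step 2 shows up as one version with an unusually high rejection ratio. The sketch below is only an illustration: it assumes each exported record is a `(client_version, grpc_status)` pair, which is a hypothetical shape rather than DeepFlow's actual log schema. The only fixed fact it relies on is that gRPC status code 0 means OK.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def rejection_ratio_by_version(
        calls: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Map each client version to the fraction of its calls that did not
    return gRPC status 0 (OK)."""
    total: Dict[str, int] = defaultdict(int)
    rejected: Dict[str, int] = defaultdict(int)
    for version, status in calls:
        total[version] += 1
        if status != 0:  # any non-zero gRPC status counts as a rejection here
            rejected[version] += 1
    return {v: rejected[v] / total[v] for v in total}
```

A version whose ratio sits close to 1.0 while the others stay near 0.0 is a strong hint that the server is rejecting that client build outright.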

Summary: The non-intrusive Request Rate metric provided by eBPF quickly confirmed the cluster QPS surge, and the call details then pinpointed the abnormal URI. In just a few simple steps, the performance risk in the new version was detected early and a serious online failure was avoided.

04 Case Summary

Through the launch and operation of "A Game", this article has shown how Tencent Interactive Entertainment uses eBPF to cope with the challenges of a complex technology stack. By introducing DeepFlow's eBPF-based zero-intrusion observability, we not only sped up troubleshooting and problem resolution but also significantly improved the game's stability and user experience.

By adopting a progressive release strategy, we can closely monitor performance metrics at each release stage to ensure that every update gives players a smoother and more stable gaming experience. The practical case shows that eBPF lets us respond quickly to performance changes and effectively identify and resolve potential problems, thereby safeguarding the game's continuous operation and service quality.

In the end, through the application of these technologies, we not only successfully improved the performance and stability of "A Game", but also set a new benchmark for future game operation and maintenance. These results prove that the observability capabilities based on eBPF technology have great potential in dealing with complex technical environments and improving user experience.

We look forward to continuing to explore and share more technological innovations and successful practices in the future, bringing a better gaming experience to users, and providing valuable insights and solutions to the industry.

 

Original link: https://deepflow.io/blog/zh/059-tencent-games-achieves-zero-code-observability%20-based-on-deepflow/

GitHub address: https://github.com/deepflowio/deepflow

Visit DeepFlow Demo to experience zero instrumentation, full coverage, and fully relevant observability. 


Origin my.oschina.net/u/3681970/blog/11049317