The Road to Quality Assurance in the APM Era: Interview with the performance leader of the Quality Management Department of Tencent Interactive Entertainment


Guide: The GIAC Global Internet Architecture Conference will be held in Shanghai on November 23-24. GIAC is a technical architecture conference launched by the High Availability Architecture technology community for architects, technical leaders, and senior technical practitioners. This year's speakers include experts from Microsoft, Tencent, Alibaba, Ant Financial, Huawei, iFlytek, Sina Weibo, JD, Qiniu, Meituan Dianping, Ele.me, Caiyun, DeepGlint (Geling Shentong), Databricks, and other companies. Tickets bought this week are 12% off, with discounts of up to 40% for members of the High Availability Architecture community.

On the eve of the conference, High Availability Architecture interviewed He Chun, producer of the 2017 GIAC quality assurance sub-forum, about the quality assurance issues that practitioners care about most.

He Chun is the performance leader of the Quality Management Department of Tencent Interactive Entertainment and a Tencent TDR expert. He participated in formulating Tencent's mobile game release standards and focuses on locating and tuning performance problems in mobile games. He led the development of the performance analysis tool UPA and the APM mobile game client performance management tool. He has taken part in the performance optimization of many strategic products, including "Honor of Kings", "Cross Fire: Gunfight King", "Contra: Return", "Naruto Mobile", and tactical competitive mobile games, and has accumulated extensive experience in client performance.

High-availability architecture: Performance is a hot topic right now because it directly affects user experience, and many practitioners stress at conferences that they are solving real user experience problems. From the perspective of the game industry you work in, how do you define real user experience? In your experience, where do mobile game performance problems currently lie?

He Chun: In the game industry, where I work, performance problems show up mainly as screen jitter and freezes, and in the response time of user operations. This is what we usually call performance in the broad sense. At the technical level we care about the underlying data, but what the user perceives with their hands and eyes during play is not numbers. For example, we say that memory usage is very high, CPU usage is very high, or network delay is very high. The user, however, does not know how high your CPU usage is. What the user sees is a jittering screen, a response timeout caused by network delay, or the whole app crashing outright. These are effects the user can see directly, and I think this is the real user experience.

In terms of performance monitoring, games mainly look at the frame rate. The screen the player operates is not a standard operating system interface, neither Android's nor Apple's, but is drawn by the game engine, so jitter and changes in frame rate have a great impact on the player's experience. Game monitoring also has a special requirement: we have to let users choose their core scenes. In many games, jitter and network delay during login or in the game lobby do affect the experience, but not by much; loading simply takes a little longer. Once the player has entered the map, however, there should be no performance problems during play.
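As a rough illustration of measuring frame rate only in the scenes a team marks as core, the sketch below computes the average FPS and a simple jank count. The `Frame` type, the scene labels, and the jank rule (a frame slower than twice the average frame time) are assumptions made for this example, not WeTest APM's actual definitions.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    scene: str          # hypothetical scene label, e.g. "login", "lobby", "battle"
    duration_ms: float  # time taken to render this frame


def scene_frame_stats(frames, core_scenes, jank_factor=2.0):
    """Return average FPS and jank count for frames in the chosen core scenes.

    A frame counts as a jank here if it takes more than `jank_factor` times
    the average frame time of the selected frames (an assumed rule of thumb).
    """
    selected = [f.duration_ms for f in frames if f.scene in core_scenes]
    if not selected:
        return {"avg_fps": 0.0, "jank_count": 0}
    avg_ms = sum(selected) / len(selected)
    janks = sum(1 for d in selected if d > jank_factor * avg_ms)
    return {"avg_fps": 1000.0 / avg_ms, "jank_count": janks}
```

Filtering by scene before averaging keeps slow but tolerable phases such as login or the lobby from hiding, or exaggerating, problems in the scenes that matter most.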

High-availability architecture: You were working on performance testing tools before, and now you are working on APM tools. What are your thoughts on the development of these two different tools?

He Chun: Performance testing tools are aimed at testers and developers. They can obtain more, deeper, and more comprehensive performance data, and it is acceptable for them to have some performance impact on the app itself because they are only used during development. With APM, performance monitoring, collection, and analysis are oriented toward end users: the SDK is integrated directly into the product and released to the public. Its impact on the app's own performance must therefore be minimized. Different countries may also restrict the collection of sensitive user data. When some of our businesses go overseas, for example, we run into the restrictions of the EU's GDPR, and we make trade-offs in what data we collect. Broadly speaking, the purpose of today's monitoring is to improve the user experience.

High-availability architecture: On data collection, the data sources of APM and of the data back ends commonly used by product managers overlap a great deal. Could the two data sources be combined into one? If so, what needs attention? If not, why not?

He Chun: Yes, but you need to align the data types and data definition standards. The two may of course differ in their data privacy requirements. For all the "monitoring" and "collection" of performance data mentioned in this article, Tencent Interactive Entertainment has a strict supervision process. Provided that user privacy is not involved, this performance data can help improve R&D quality and deliver better product quality and user experience.

High-availability architecture: In your speech, you mentioned the closed loop of performance testing. It has four links: verification and comparison, real-time monitoring, performance early warning, and problem location. As a large company, Tencent attaches great importance to systematic construction. Can you explain why each of these four dimensions was chosen?

He Chun: The first link is verification and comparison. The ultimate goal of performance collection, analysis, and monitoring is to improve user experience, which means discovering and fixing problems as quickly as possible during version iteration. Once a problem is fixed, a before-and-after comparison must be done after the new version is launched.

The second link is real-time monitoring. "Real-time" here means that when we run big data calculations in the background, we try to keep the calculation interval within a certain range, such as 1 minute or 5 minutes. At present our calculation delay is at most 5 minutes. Whenever a user opens our APM product page, all the data shown is the latest that has been calculated.
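One simple way to picture this kind of near-real-time calculation is to group incoming samples into fixed time windows and summarize each window. The sketch below assumes samples arrive as `(unix_timestamp, fps)` pairs and uses a 5-minute window; both the input shape and the function name are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute window, matching the maximum delay mentioned above


def bucket_average_fps(samples, window=WINDOW_SECONDS):
    """Group (unix_timestamp, fps) samples into fixed windows and average each."""
    buckets = defaultdict(list)
    for ts, fps in samples:
        # Align every timestamp to the start of its window.
        buckets[ts - ts % window].append(fps)
    return {start: sum(v) / len(v) for start, v in sorted(buckets.items())}
```

A dashboard reading the most recently completed window would then lag real time by at most one window length.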

The third link is performance early warning. What we are building is a tool-based platform product, and tools follow the principle of "use it and go": we hope our tools do not take up too much of developers' time. The essence of a platform-type R&D tool like ours is to improve work efficiency. If people have to spend a lot of time looking at our page every day, it does not actually improve their efficiency. So once a project team recognizes the value of our product, its members will not visit our page every day. They want us to use warnings to filter and find problems for them automatically, which saves them time.

The last link is problem location. We need to narrow a problem down as far as possible under limited conditions. For example, we usually talk about a game's frame rate, which may be 25 or 23 frames per second. Suppose a problem suddenly occurs at some point in time and the frame rate drops very low. We then need to narrow the scope of the problem to find what can be optimized, targeting a specific small scene, a phone model, or even the Android or iOS version number. For problem location we therefore do very detailed data analysis, split it into many dimensions, and cross-compare those dimensions. In this way our analysis can pinpoint the game version, the game scene, the player's phone model, and even the time period in which performance problems occur. On time periods, I mentioned in my talk that many third-party causes of game performance problems may not be visible to the game project team. For example, a third-party component may suddenly receive an online update, or a phone manufacturer may update the operating system version, and the development team may not know in advance. This is why we monitor by time period. In practice, project teams have found many problems through these problem location methods.
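The cross-comparison of dimensions can be sketched as grouping sessions by every single dimension and every pair of dimensions, then ranking the groups by average frame rate. The session fields (`version`, `scene`, `model`, `fps`) and the function name are assumptions for illustration; a production system would do this in its data warehouse rather than in memory.

```python
from collections import defaultdict
from itertools import combinations


def worst_segments(sessions, dimensions, top=3):
    """Average FPS for every single dimension and every pair of dimensions,
    returning the `top` lowest-FPS segments."""
    groups = defaultdict(list)
    for size in (1, 2):
        for dims in combinations(dimensions, size):
            for s in sessions:
                # The key records which dimension values define the segment.
                key = tuple((d, s[d]) for d in dims)
                groups[key].append(s["fps"])
    averages = {k: sum(v) / len(v) for k, v in groups.items()}
    return sorted(averages.items(), key=lambda kv: kv[1])[:top]
```

Looking at pairs as well as single dimensions is what surfaces problems that only appear in a combination, for example one scene on one game version.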

High-availability architecture: Many companies have now adopted a microservice architecture. The scale and dynamic nature of microservices increase the cost of data collection, and splitting services affects end-to-end monitoring to some extent. What do you think about these issues?

He Chun: As I understand it, the essence of microservices is to split large back-end services into many small ones, reducing the coupling between them and increasing their independence. Each small function then has a smaller team and can be developed faster, more efficiently, more independently, and more autonomously. The resulting problem, from the user experience side, is that a single service request passes through many nodes along the whole link. At present the focus of our WeTest APM service is client-side monitoring, and server-side monitoring is the focus of our next phase. The industry already has mature solutions for server-side monitoring. Google's engineering team described Dapper, their tracing system for large-scale distributed systems, in a paper. It analyzes the time consumed along the entire link, and this approach is relatively mature and is used in some web products. For example, a user submits a request to purchase a product on a web page. From the click on the button to the back-end response, the total time consumed by the request is broken down across many nodes, so you can quickly find which service takes the longest. In my opinion, then, a microservice architecture makes performance analysis and problem location more convenient. Before adopting microservices, whether the problem was a back-end response timeout or a server fault, the back-end team might have had to spend much more effort troubleshooting. With a tool that breaks down time consumption by node, we quickly know which node has the problem. I then only need to notify that small team, and nobody unrelated has to stop their work to do the analysis.
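The per-node breakdown of a single request can be illustrated with a very small sketch. It assumes each node reports a flat `(node_name, start_ms, end_ms)` span and that spans do not nest; real tracing systems such as Dapper record parent and child spans, so this is a simplification, and the node names in the usage note are invented.

```python
def slowest_node(spans):
    """Given (node_name, start_ms, end_ms) spans of one request,
    return the node with the largest total duration and that duration."""
    durations = {}
    for node, start_ms, end_ms in spans:
        # A node may appear more than once, so durations are summed per node.
        durations[node] = durations.get(node, 0) + (end_ms - start_ms)
    node = max(durations, key=durations.get)
    return node, durations[node]
```

For a purchase request that passes through hypothetical `gateway`, `order`, `payment`, and `inventory` nodes, the function points directly at the node whose team should be notified.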

High-availability architecture: What are the stages of WeTest application performance monitoring research and development, and what plans are there in the future?

He Chun: The focus of the first stage was building an SDK whose impact on client performance is very low. Here games differ greatly from ordinary apps. Ordinary apps also have APM SDKs, but their requirements on performance impact are not necessarily strict because their operations are low-frequency. In IM software, for example, we mostly type and chat and do nothing complicated, so even if the SDK consumes some performance, users will not notice. In a game, however, the player's eyes and mind are highly concentrated and the hands operate at high frequency. The SDK's impact on game performance has to be reduced to almost zero, and that was the difficulty of the first stage.

The difficulty of the second stage is high concurrency in the back end. In China especially, popular games have high concurrency and very large user bases, and the number of active users can easily reach tens of millions. Overseas it is comparatively simpler: the global daily active users of most overseas games are fewer than the daily active users of a single domestic product.

The difficulty of the third stage is back-end data analysis, which requires both professional data analysis knowledge and an understanding of the game itself. In my view our data analysis is not yet deep enough. Although we have helped project teams find many problems, much of the data could be crossed and analyzed at a deeper level.

As for future plans, we will return to full-link monitoring and apply full-link analysis to games based on the Dapper idea mentioned above. This involves adjusting the architecture of the game back-end servers. We are currently experimenting with one of Tencent's in-house development teams, which is moving a game back-end system to microservices. Once the full link is built, we can measure the time consumed at each node. If there are one million requests in a day and we know that 10,000 of them took longer than our standard value, we can look at those 10,000 requests and see at which node most of the problems occur, so that we can quickly fix and optimize back-end response time.
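The aggregate analysis described here, picking out the requests that exceed a standard value and counting which node dominates each of them, might look like the following sketch. The input shape (one dict of node name to milliseconds per request), the threshold parameter, and the function name are assumptions for illustration.

```python
from collections import Counter


def slow_request_hotspots(requests, threshold_ms):
    """requests: list of dicts mapping node name -> time spent in ms.

    Returns (number of slow requests, Counter of the dominant node in each).
    """
    hotspots = Counter()
    slow = 0
    for node_times in requests:
        if sum(node_times.values()) > threshold_ms:
            slow += 1
            # Attribute the slow request to the node where it spent the most time.
            hotspots[max(node_times, key=node_times.get)] += 1
    return slow, hotspots
```

Calling `hotspots.most_common(1)` on the result then names the node responsible for the largest share of slow requests.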

High-availability architecture: You mentioned high concurrency in your answer just now. What optimizations has APM made for high concurrency? For the 120 million DAU product line, how is data cleaning achieved?

He Chun: High concurrency in APM is essentially the same as in a traditional app back end: a very large amount of data arrives at the same time, and we cannot allow data to be lost. Given the characteristics of the game business, we collect everything. What is distinctive on the APM side is that after a client finishes one match, we collect its data as a single file and upload it, so what the back end must handle concurrently is receiving files. In a traditional app, high concurrency means many requests, each carrying very little data, perhaps only a few bytes. The average file our back end receives is about 100 KB, and large ones may reach 300 KB. We receive a file and save it, parse the data in it, store the data in different tables, and then delete the file. That is the characteristic of our high concurrency.
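The receive, parse, store, delete flow can be sketched as below. The interview does not describe the file format or schema, so the JSON layout, the `sessions` and `frames` tables, and the use of SQLite are all stand-ins chosen to keep the example self-contained.

```python
import json
import os
import sqlite3


def create_tables(conn):
    """Create the two hypothetical tables used by this sketch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions "
        "(session_id TEXT PRIMARY KEY, game TEXT, device TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS frames "
        "(session_id TEXT, idx INTEGER, render_ms REAL)"
    )


def ingest_session_file(path, conn):
    """Parse one uploaded per-match file, store its rows, then delete the file."""
    with open(path, encoding="utf-8") as fh:
        session = json.load(fh)
    with conn:  # commit both inserts together, or roll back on error
        conn.execute(
            "INSERT INTO sessions (session_id, game, device) VALUES (?, ?, ?)",
            (session["session_id"], session["game"], session["device"]),
        )
        conn.executemany(
            "INSERT INTO frames (session_id, idx, render_ms) VALUES (?, ?, ?)",
            [(session["session_id"], i, ms) for i, ms in enumerate(session["frame_ms"])],
        )
    os.remove(path)  # only reached if the data was stored successfully
```

Deleting the file only after the transaction commits means a failed parse or insert leaves the upload on disk for a retry, which fits the stated goal of not losing data.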

High-availability architecture: Does the full collection you mentioned collect all indicators? We know that the performance of data collection is usually closely related to the sampling rate. If the sampling rate is high, the performance will usually be affected. If the sampling rate is low, some extreme situations may be easily missed. How is the sampling rate of WeTest APM determined to achieve high performance and collect as much data as possible?

He Chun: Yes, all indicators are collected, and the selection is made in the back end, which is more flexible. Our APM collects the rendering time of every frame, writes the data into a fixed-size in-memory buffer, and transfers it to a file when the buffer is full. Write operations use disk buffering to minimize the performance impact.
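A minimal sketch of the fixed-buffer idea follows: samples go into a preallocated buffer, and the file is touched only when the buffer fills. The class name, the float32 encoding, and the default capacity are assumptions; the real SDK runs natively inside the game engine and is not described at this level in the interview.

```python
import struct


class FrameTimeRecorder:
    """Collect per-frame render times in a preallocated buffer and
    append them to a file only when the buffer is full."""

    def __init__(self, path, capacity=1024):
        self._path = path
        self._buf = bytearray(capacity * 4)  # 4 bytes per float32 sample
        self._capacity = capacity
        self._count = 0

    def record(self, render_ms):
        # Write into the existing buffer; no allocation per frame.
        struct.pack_into("<f", self._buf, self._count * 4, render_ms)
        self._count += 1
        if self._count == self._capacity:
            self.flush()

    def flush(self):
        """Append the buffered samples to the file and reset the buffer."""
        if self._count:
            with open(self._path, "ab") as fh:
                fh.write(memoryview(self._buf)[: self._count * 4])
            self._count = 0
```

Because the buffer is allocated once and reused, the per-frame cost is a single in-memory write, and file I/O happens once per `capacity` frames. A final `flush()` at the end of a match writes any remaining samples.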

High-availability architecture: Your speech also mentioned the application of machine learning and deep learning in your products. So how are machine learning and deep learning applied, and in which scenarios?

He Chun: On the APM side, machine learning is mainly used in two aspects.

The first is the warning system, where our deep learning work has made some progress. In my talk I mentioned that the open rate of alert emails is now 75%, whereas at the very beginning it was perhaps only 30%. At first, warnings might appear several times a day. When people see too many, they get fatigued and may stop opening them. Later we did convergence: for problems of the same type or the same level, or problems already found in the previous two or three weeks, we no longer send warnings. In addition, some supervised learning is done with manual input. For example, the alarm system has two buttons, confirm and negate. Based on the button clicks and the data values in the alarm system at that time, we make some deep learning attempts. The open rate of alert emails has now risen to 75%, and our goal is 90%. Alarm accuracy is currently 90%. While improving alarm accuracy through deep learning, our team also verifies it on individual projects. For one tactical competitive mobile game, for example, we manually confirm the alarms generated every day, because manual confirmation is what makes the deep learning route possible. It is hard for the alarm system to rely entirely on unsupervised learning, because most performance problems are subjective, unlike crashes. Some people think the picture stutters a little, while others think it does not. Each person's tolerance depends heavily on subjective factors, so we are still doing supervised learning here.
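The convergence rule and the confirm/negate buttons can be sketched as a small filter that suppresses recently reported alerts and stores button clicks as labeled examples. The class name, the two-week quiet period, and the feature dictionary are assumptions, and the sketch stops short of training any model.

```python
from datetime import datetime, timedelta


class AlertFilter:
    """Suppress alerts whose (category, level) was already reported recently,
    and keep confirm/negate clicks as labels for later supervised training."""

    def __init__(self, quiet_period=timedelta(weeks=2)):
        self._quiet_period = quiet_period
        self._last_sent = {}   # (category, level) -> time of last sent alert
        self.labels = []       # (features, is_real_problem) pairs

    def should_send(self, category, level, now):
        key = (category, level)
        last = self._last_sent.get(key)
        if last is not None and now - last < self._quiet_period:
            return False  # same kind of alert was sent within the quiet period
        self._last_sent[key] = now
        return True

    def record_feedback(self, features, confirmed):
        # "confirm" clicks become positive labels, "negate" clicks negative ones.
        self.labels.append((features, bool(confirmed)))
```

The accumulated `labels` list is the kind of manually confirmed data a supervised model would later be trained on.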

The other is a project we are working on that has not yet achieved very good results. Simply put, we want to combine the various monitoring data into a single score, which can be understood as the "performance health" of the game. Making the score accurate and effective requires deep learning. Behind the score is a calculation formula, a set of algorithms, and we need deep learning to strengthen this formula and improve its accuracy, effectiveness, and even its rationality. We have already started experimenting on this project and hope to share more when there is progress.
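To make the idea of a score backed by a formula concrete, here is one possible shape for such a formula: a weighted average of normalized metrics. The metric names, limits, and weights are invented for illustration, and the formula is hand-written rather than learned; in the approach described above, learning would be used to tune something like the weights.

```python
def health_score(metrics, weights, limits):
    """Combine several 'lower is better' metrics into a 0-100 score.

    Each metric is scaled against its limit (values at or above the limit
    score 0), then the per-metric scores are averaged using the weights.
    """
    total_weight = sum(weights.values())
    score = 0.0
    for name, weight in weights.items():
        # Clamp the ratio to [0, 1] so one extreme metric cannot go negative.
        ratio = min(max(metrics[name] / limits[name], 0.0), 1.0)
        score += weight * (1.0 - ratio)
    return 100.0 * score / total_weight
```

With hypothetical inputs such as a jank rate at half its limit, no crashes, and latency above its limit, the score lands between 0 and 100 according to the chosen weights.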

To sum up, deep learning on the APM side is mainly used to strengthen the warning system and strengthen the comprehensive score in the future.

High-availability architecture: You mentioned that APM's AI alarms have an accuracy rate of more than 90%, which means there are still some misjudgments. What happens when a misjudgment occurs? How do you keep optimizing through technical means to reduce misjudgments?

He Chun: We manually confirm the alarms of key projects one by one and let the AI perform supervised learning through this manual intervention. The "you draw, I guess" mini program that Google released recently, which spread widely on WeChat Moments, is in fact also a kind of supervised learning.

High-availability architecture: Supporting more than 80 popular games, both published and self-developed by Tencent, must put considerable pressure on APM. With full collection, what is the current scale of the storage and application clusters behind this APM? Has the back-end architecture come under pressure or become unstable, and how did you deal with it?

He Chun: Data with strict real-time requirements is still stored in MySQL, while data with looser real-time requirements is stored in Hadoop and calculated offline. As business volume and the product's own needs have grown, the back-end architecture has been adjusted and optimized three times. Because we collect data in units of a single match, projects with a large number of matches, such as QQ Speed, generate more pressure, while large-map projects generate less. The optimizations include dynamic scaling of the receiving end, separate storage for large projects, and offline calculation of data with low real-time requirements.

High-availability architecture: On the data analysis you mentioned, how is the collected data finally analyzed, compared, and displayed? In real time or offline? Which dimensions of the data do you generally focus on?

He Chun: The data is divided into two types, real-time and offline. Most of it is calculated offline in Hadoop, and a small part is calculated in real time. From the user experience perspective, the focus is still on frame rate, freezes, and delay.

High-availability architecture: For full engine compatibility, how do you cover all three engines, C2DX, U3D, and UNREAL? What interesting problems did you encounter during adaptation?

He Chun: There are in fact three versions of the SDK package: a C++ version, a Unity version, and an Unreal version. Adaptation testing is one of the most basic processes. Every time the version is updated, it is run on WeTest's top 300 device models to make sure it works on mainstream models.

High-availability architecture: As a tool product, it cannot be easy to promote internally. Different business lines or studios may have their own tools, their own "wheels". Do you have any experience here that people who build tools can learn from?

He Chun: First of all, "reinventing the wheel" does happen in large companies, and it is a somewhat tricky situation for those of us who build tools. The most important thing is to change our thinking and promote and operate the tools we build as a to-B product. Among Tencent's self-developed projects, depending on how much importance each project team attaches to monitoring and fixing problems, some teams do their own APM work and collect some data. But those teams only collected the data into the back end. They did not do cross-analysis or detailed, multi-dimensional combined analysis, and their tools had no friendly product page, so the development team would just pull some data from the database when needed. These two pain points of the development teams are exactly where a tool team can put its effort, and they are what gave us the opportunity to do APM. Interestingly, I only learned the term APM later. When I started on this system, I simply wanted to build a complete monitoring system that could be used in all Tencent projects. Only later did I learn that the international name for this is APM.

APM products are to-B applications by nature. They are really difficult at the beginning, but once you succeed, the barriers you build are higher. Take the performance testing tools mentioned earlier: a company may have several similar systems, and a project team can use mine, yours, or someone else's, because a testing tool has no impact on the product once it is in operation. With an APM SDK, however, once it is integrated and performing stably, the project team will not replace it lightly.

In addition, in the to-B field you have to learn how to interact and collaborate with different teams. If you know that another team also wants to build an APM SDK, ask what their purpose is. Their ultimate goal is surely not to build an SDK for its own sake but to get data with which to find and locate their problems. Knowing their real purpose, we can open data interfaces to them, which amounts to a SaaS service. This is a kind of ecosystem cooperation between teams, open and win-win, and it may dispel the idea of "building wheels" in many teams.

High-availability architecture: Let's talk a little about personal growth. You worked on performance testing last year and switched to APM this year. We think this may be a step up from testing in terms of technical path, because performance management requires a broader overall view, and performance will remain a very important topic in the mobile Internet era. For test engineers, or for engineers who work on performance or may move into performance management, what should they pay attention to in their personal growth? What experience from your own growth can you share?

He Chun: Let me start with the background. Tencent's mobile games started in 2013, when I began working on "Daily Cool Run". Later I worked on card games, then RPGs, and now MOBAs. The trend shows that heavy games are gradually moving to mobile phones. Many games now follow a long-term operating strategy, hoping that a single product can generate stable, long-term revenue. This places higher and higher demands on every aspect of the product, especially the experience, so against this background performance monitoring, optimization, and continuous iteration are a hard requirement.

In terms of personal development, the original testing tool only served some teams during the development period. If you build an online real-time APM system, you can expand your user group to the development team, the planning team, and even higher-level people such as project producers and directors, which extends your influence in the team. Besides improving your personal skills, it also greatly broadens your overall view as a technical person. With the original testing tool, you only saw the result of a single test of a single project. Now I can see the big data of a whole project and the data of all projects. With all the project data available, I can segment it by game type, such as MOBA or racing, and the performance data for each segment serves as a benchmark for future new projects.

Once this data is available, you can tell a game development team plainly that meeting these indicators does not guarantee the game will succeed, but it does mean the experience will not be bad. For new projects, this is a very valuable reference.

This year my feeling is that the APM platform is actually a bridge, with the R&D team on the left and the various manufacturers on the right. The APM platform can not only help project teams optimize but also push hardware manufacturers to optimize their products. Because games consume more phone software and hardware resources than any other kind of app, many manufacturers care about how their phones perform in heavy games. I remember that when one new model was released, "Honor of Kings" defaulted to the highest image quality on it, but monitoring showed that the performance data was not good, so the default option was adjusted to medium image quality. Through long-term data analysis we have found that the efficiency of different hardware combinations is not additive. For the CPU, memory, GPU, and even lower-level parts such as the motherboard and chipset, the relationship is not 1+1+1=3 but may be 1+1+1=2.5 or 2.8.

High-availability architecture: You said that when you set out to do this, you did not know it was called APM internationally. Was it a task assigned by leadership, or something the team wanted to do from the bottom up?

He Chun: It was launched from the bottom up. Much of Tencent's creativity and many of its tasks come from the bottom up. Ideas from leadership account for 30%-40%, and the rest are initiated by teams from the bottom up. Everyone needs technical growth and innovation. Tencent also encourages everyone to innovate internally, to have their own ideas, and to build the habit and ability of thinking actively. I used to work on testing tools, and some developers asked me: the testing tool only looks at data from the testing phase, so could there also be a tool to help them monitor performance after the product is actually released? That is why I decided to do this.

In June of this year I attended Microsoft's Build conference in Seattle, where Microsoft proposed that APM is a very important part of DevOps. It can serve testing, operations, and of course development. The conclusions and indicators we derive from APM's big data through analysis are what the company's decision-makers want to see. From them, the head of a studio knows which project has a better user experience or better feedback in the live environment. This is a good example of how an engineering team can use tools to drive business improvement from the bottom up within a company.

High-availability architecture: What's your message for GIAC?

He Chun: As the producer of the quality track last year, I felt GIAC's on-site appeal and the enthusiasm of the audience deeply. It is not just topic sharing but also a deep exchange of industry thinking. I wish GIAC ever greater success!

About WeTest:
Tencent WeTest is a one-stop quality open platform officially launched by Tencent, built on more than ten years of quality management experience and dedicated to building quality standards and improving product quality. Tencent WeTest provides mobile developers with R&D tools such as compatibility testing, cloud real devices, performance testing, security testing, and Penguin news, provides solutions for more than a hundred industries, covers the testing needs of products at every stage of R&D and operation, and has been used on thousands of products. Its gold-medal expert team safeguards the quality of your products from every angle through 5 dimensions and 41 indicators.



Origin blog.51cto.com/14977574/2546723