Message Quality Platform Series | Full-Link Troubleshooting

Background

More than 100 million messages flow through Xianyu every day, reaching half of its users. Because the goods are second-hand, Xianyu users need to chat to learn more about an item's condition and to negotiate its price. Messaging, as one of Xianyu's basic functions, therefore plays a major role in driving transactions. At the same time, buyers and sellers on Xianyu are usually individual users, and whether they are online at any given moment is highly uncertain. Once message delivery goes wrong, a transaction may be affected, and the conversation may even move to WeChat, where users can be defrauded. There is therefore an urgent need to give users a stable and reliable messaging service through effective means.

Problem definition

In the messaging domain, the main problems from the user's point of view are message loss and delayed delivery. Technically, the root cause of message loss lies in the end-to-end architecture: messages are pulled through a server interface and pushed through the accs long-connection channel. When a large number of messages arrive at the client at the same time, some of them are discarded while messages are being merged or written to the database, and the user loses messages. Delay for online users is caused mostly by latency and congestion in the accs channel.

In addition, because the send and receive links are long, the processing logic on both server and client is complex, and third-party channels are involved, Xianyu messaging faces three main questions:

  • How can problems be found as early as possible, before release?

  • How can online problems be discovered quickly?

  • How can user-reported (public opinion) issues be located effectively?

All of these questions are worth solving. However, ever since it went online, Xianyu's messaging feature has implemented its core logic in-house rather than using the Group's ready-made messaging SDK. After the rapid growth of the early business, developers who take on messaging requirements now feel as if they are walking on thin ice, and online problems keep piling up. To start bringing message problems under control, the first priority is a comprehensive troubleshooting method that can quickly locate online problems.

Building full-link message troubleshooting

When locating a user-reported issue, it is hard to get to the root of the problem from a brief text description and a few screenshots alone; often it is not even possible to confirm whether the problem is on the client or on the server. For example, one evening just as everyone was about to leave work, the boss forwarded a piece of user feedback: a user reported that messages had been lost. Everyone panicked. Server, client, and quality engineers gathered to locate the problem, and the only method available was for each person to check their own logs and then guess at the cause, which was time-consuming, labor-intensive, and inaccurate.

Seen this way, improving the quality of Xianyu's messaging service takes more than development-side optimization. Infrastructure is also needed to improve the ability to locate message problems, so the quality team and the development team jointly set out to build full-link troubleshooting for Xianyu messages. The key to full-link troubleshooting is comprehensive message logging that yields the complete trajectory of a message. We aggregate server message node logs, interface logs, client message status logs, and behavior logs to reconstruct, as far as possible, the user's behavior at the time and the path the message took.

Log reporting

The first step is to map out the core scenarios of the message link. Key nodes that are prone to problems or that reveal them, such as message merging, database persistence, on-screen rendering, domain ring synchronization, and domain ring update, report the logs needed for troubleshooting.

The second step is the log format convention. The core idea is that the client generates a messageId for every message it creates; for marketing messages pushed by the server, the server generates the messageId. Each time a message passes a core node, only the node status plus the messageId is reported. The id is also passed through to the server, which uses it in its own log reports. In this way the messageId ties together the whole trace of a message, from the sending client to the server and on to the receiving client.
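The convention above can be sketched roughly as follows. This is only an illustration under assumptions: the field names, the node names (`client_send`, `server_receive`, `client_receive`), and the use of a UUID as the messageId are hypothetical and not taken from Xianyu's implementation. The point is that every record carries the same messageId plus a node status, and nothing else about the message.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass
from typing import Optional


def new_message_id() -> str:
    """Generate a unique id when a message is created (hypothetical scheme)."""
    return uuid.uuid4().hex


@dataclass(frozen=True)
class TraceEvent:
    message_id: str  # the same id is reported by client and server
    node: str        # hypothetical core-node name, e.g. "client_send"
    status: str      # hypothetical status value, e.g. "ok" or "fail"
    ts_ms: int       # time at which the node was reached, in milliseconds


def make_event(message_id: str, node: str, status: str,
               ts_ms: Optional[int] = None) -> TraceEvent:
    """Build the record reported when a message passes a core node."""
    if ts_ms is None:
        ts_ms = int(time.time() * 1000)
    return TraceEvent(message_id, node, status, ts_ms)


def to_log_line(event: TraceEvent) -> str:
    """Serialize one event as a JSON log line; no message body is included."""
    return json.dumps(asdict(event), sort_keys=True)


# One message traced from the sending client, through the server,
# to the receiving client. All three records share the same id.
mid = new_message_id()
events = [
    make_event(mid, "client_send", "ok", 1000),
    make_event(mid, "server_receive", "ok", 1040),
    make_event(mid, "client_receive", "ok", 1100),
]
for e in events:
    print(to_log_line(e))
```

Because each record is keyed by the messageId, logs from different systems can later be joined on that single field to recover the message's path.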

The last step is the reporting method. The first idea was to integrate the SLS SDK into the client for real-time reporting, but the client-side changes would have been too costly and would have put stability at risk, so that idea was dropped. In the end we reused the client's existing tracking-point reporting path. The tracking logs only need to be cleaned in real time afterwards, which meets the need for real-time logs without changing the client's reporting mechanism. Server logs are reported through the existing SLS reporting link.

It should be emphasized that, for privacy and storage-cost reasons, the reported logs never carry the message content itself, only the messageId and the parameters needed for troubleshooting.

Real-time log cleaning

Real-time log cleaning starts by subscribing to Xianyu's minute-level TT tracking log and extracting the message-related tracking records. The data volume is huge, and a single message may produce dozens of tracking records. We therefore cluster these logs by messageId and utdid, which reduces the data volume by a factor of dozens. Finally, the troubleshooting data is written back to SLS for link troubleshooting, and minute-level statistics are written to TDDL to support monitoring.
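The clustering step can be illustrated with a minimal in-memory sketch. The record layout (`messageId`, `utdid`, `node`, `status`, `ts_ms`) and the function names are assumptions made for this example; the real pipeline runs as a streaming job over TT and writes to SLS and TDDL, which is not shown here.

```python
from collections import defaultdict


def cluster_events(raw_events):
    """Collapse tracking records into one record per (messageId, utdid).

    Each clustered record keeps the time-ordered path of nodes the message
    passed on that device, so dozens of raw rows become a single row.
    """
    clusters = {}
    for ev in raw_events:
        key = (ev["messageId"], ev["utdid"])
        record = clusters.setdefault(key, {"path": []})
        record["path"].append((ev["ts_ms"], ev["node"], ev["status"]))
    for record in clusters.values():
        record["path"].sort()  # order by timestamp
    return clusters


def minute_stats(raw_events):
    """Count records per (minute, node, status) as input for monitoring."""
    stats = defaultdict(int)
    for ev in raw_events:
        minute = ev["ts_ms"] // 60000
        stats[(minute, ev["node"], ev["status"])] += 1
    return dict(stats)


# Hypothetical sample: message m1 is sent from device u1 and received on u2.
raw_events = [
    {"messageId": "m1", "utdid": "u1", "node": "client_send", "status": "ok", "ts_ms": 60000},
    {"messageId": "m1", "utdid": "u2", "node": "client_persist", "status": "fail", "ts_ms": 61500},
    {"messageId": "m1", "utdid": "u2", "node": "client_receive", "status": "ok", "ts_ms": 61000},
    {"messageId": "m2", "utdid": "u1", "node": "client_send", "status": "ok", "ts_ms": 125000},
]

clusters = cluster_events(raw_events)
stats = minute_stats(raw_events)
print(clusters[("m1", "u2")]["path"])
print(stats)
```

In this sketch the clustered records correspond to what would be written back for link troubleshooting, and the per-minute counts to what would feed monitoring.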

User Behavior Log

Besides following the life cycle of a single message, another angle for troubleshooting is to look at what the user did on the client that triggered the exception.

By cleaning and arranging in real time the click and page-exposure tracking points reported by the client, we can see which buttons the user clicked and which pages the user visited before the exception occurred, and from that work out a path for reproducing it. At the same time, we integrate the server interface call logs to check whether the user's request to the server interface succeeded when the exception occurred, whether the request parameters were correct, and what error code was returned. Putting this information together helps us reproduce and locate the specific scenario and helps developers solve problems faster.
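A rough sketch of how such a behavior timeline might be assembled is given below. The record fields (`user_id`, `type`, `target`, `api`, `error_code`), the API name `message.send`, and the 60-second window are all hypothetical choices for this example, not details from the platform.

```python
def behaviour_timeline(tracking_events, api_calls, user_id,
                       anomaly_ts_ms, window_ms=60000):
    """Merge one user's clicks, page exposures, and server interface calls
    from the window just before an anomaly into one time-ordered list."""
    start = anomaly_ts_ms - window_ms
    timeline = []
    for ev in tracking_events:
        if ev["user_id"] == user_id and start <= ev["ts_ms"] <= anomaly_ts_ms:
            timeline.append((ev["ts_ms"], ev["type"], ev["target"]))
    for call in api_calls:
        if call["user_id"] == user_id and start <= call["ts_ms"] <= anomaly_ts_ms:
            if call["error_code"] is None:
                outcome = "ok"
            else:
                outcome = "error " + str(call["error_code"])
            timeline.append((call["ts_ms"], "api", call["api"] + " -> " + outcome))
    return sorted(timeline)


# Hypothetical sample data for user u1, with an anomaly observed at t=100000 ms.
tracking_events = [
    {"user_id": "u1", "ts_ms": 10000, "type": "click", "target": "home_tab"},
    {"user_id": "u1", "ts_ms": 50000, "type": "exposure", "target": "chat_page"},
    {"user_id": "u1", "ts_ms": 90000, "type": "click", "target": "send_button"},
    {"user_id": "u2", "ts_ms": 90050, "type": "click", "target": "send_button"},
    {"user_id": "u1", "ts_ms": 200000, "type": "exposure", "target": "home_page"},
]
api_calls = [
    {"user_id": "u1", "ts_ms": 90100, "api": "message.send", "error_code": "TIMEOUT"},
]

timeline = behaviour_timeline(tracking_events, api_calls, "u1", 100000)
for row in timeline:
    print(row)
```

Reading the resulting list top to bottom gives a candidate reproduction path: the page the user was on, the button clicked, and the interface call that failed.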

Front-end interaction

As the saying goes, "before the troops move, the supplies go first"; here, the data goes first. The data preparation described above lets us examine possible problems in the message link at a finer granularity. Making those problems stand out more clearly and making the tool more convenient for developers is our other goal. From observing developers, we found that they usually troubleshoot message problems along three dimensions: user, message ID, and session ID. The queries are therefore organized along these three dimensions.

In addition, so that users can see more intuitively what happened to a message at each node of the link, we divide the link into three stages: client uplink, server, and client downlink. Abnormal nodes are highlighted prominently, so users can quickly see where the problem is.
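The three-stage classification and the flagging of abnormal nodes can be sketched as follows. The mapping from node names to stages and the rule that any status other than "ok" counts as abnormal are assumptions for illustration only.

```python
# Hypothetical mapping from core-node names to the three link stages.
STAGE_OF_NODE = {
    "client_send": "client uplink",
    "server_receive": "server",
    "server_push": "server",
    "client_receive": "client downlink",
    "client_persist": "client downlink",
    "client_render": "client downlink",
}
STAGE_ORDER = ["client uplink", "server", "client downlink"]


def build_link_view(path):
    """Group a message path, given as (ts_ms, node, status) tuples,
    by stage and mark every node whose status is not "ok" as abnormal."""
    view = {stage: [] for stage in STAGE_ORDER}
    for ts_ms, node, status in sorted(path):
        stage = STAGE_OF_NODE.get(node)
        if stage is None:
            continue  # skip nodes this sketch does not know about
        view[stage].append(
            {"node": node, "status": status, "abnormal": status != "ok"}
        )
    return view


def first_abnormal(view):
    """Return (stage, node) of the earliest abnormal node, or None."""
    for stage in STAGE_ORDER:
        for entry in view[stage]:
            if entry["abnormal"]:
                return stage, entry["node"]
    return None


path = [
    (1000, "client_send", "ok"),
    (1040, "server_receive", "ok"),
    (1100, "client_receive", "ok"),
    (1120, "client_persist", "fail"),
]
view = build_link_view(path)
print(first_abnormal(view))
```

A front end built on such a structure could render the three stages side by side and highlight the entries marked abnormal.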

Summary and outlook

Now, with the link troubleshooting tool of the Xianyu message quality platform, you can clearly view the complete life cycle of a message, trace the path to a problem using the behavior log of the affected user, and help developers quickly locate where the problem may have occurred. There is no longer any need to search multiple databases and log streams and analyze them together, nor to wait until the next day to obtain data for troubleshooting. Troubleshooting efficiency has improved by more than 90%. The troubleshooting model is also reusable; for example, troubleshooting for the publishing link is being onboarded to the platform as well. Beyond this, the message quality platform has done a great deal for problem discovery and testing efficiency, such as on-client inspection and reconciliation, real-time monitoring of core indicators, user feedback management, and link-level regression testing tools, which will be introduced in detail in later articles. Combining automation with on-device intelligence is another direction we keep exploring. We hope that, with our support, Xianyu messaging will become more and more stable.

Can't be idle? Come Xianyu!

PICK ME

Xianyu's technical team pursues more value through innovation and continuously drives business changes.

From our original line of business in idle goods to the creation of "worry-free shopping", "playing community" and "new offline",

from publishing books and speaking at summits to open source, patents, and overseas outreach,

if you can't stay idle, come to Xianyu: the technical team's relentless exploration and deep commitment are the source of our confidence.

 Join now 

1. Recruit client/server/front-end/architecture/quality engineers

2. Send your resume to [email protected]

3. You can also find us on Toutiao, Zhihu, Juejin, Facebook, and Twitter

Origin blog.csdn.net/weixin_38912070/article/details/111570176