▐Error messages that are “understandable but useless” are the most deadly
Type | Real system error report | Business perspective
No further information | (screenshot) | Then what? What should I do?
Same error for different reasons | (screenshot) | What went wrong?
Internal jargon | (screenshot) | What is "double reporting"? What is the "flat sales period"? Do you really think I understand?
Garbled characters | (screenshot) | WHAT?! What?!
▐This is a chronic problem with no easy cure, and operations colleagues suffer the most.
▐How to solve it?
Based on the problems above, we need to meet one core requirement: "Existing error messages urgently need to be brought under management, and they must be understandable and helpful."
Digging further, this requirement breaks down into the following needs:
Operable: messages can be configured dynamically per scenario, and can still be changed during release freezes
Feedback-enabled: users have the final say on whether a prompt is helpful, and can give feedback
Problem discovery: problematic errors can be located quickly
Prepare in advance: configure ahead of time for foreseeable errors
▐Evaluating an existing solution: OneCode
OneCode is a mature tool within the group for unified error-code management. It supports dynamic Code --> Message mapping, integrates with Medusa, and supports multi-language copy.
However, it falls somewhat short of our needs:
The codes used by the international services are neither complete nor unique, and the cost of changing them may be unacceptable to back-end developers (so the requirement cannot be met)
Operational behaviors cannot be attached to a Code (because codes are not unique)
In addition, the system "seems" not to have been maintained for a long time, so there is a risk of it going offline.
So, let's build our own.
▐In-house solution analysis
At which layer should it be done?
The work can be divided into two parts:
Part 1: Controllable copy (static copy): the back end can modify it in routine releases, and needs to add "relatively" unique ErrorCodes.
Part 2: Uncontrollable copy (from underlying layers): this requires monitoring and adaptation, which suits the front end better. It is optimized based on the combination of interface + page + ErrorCode + copy.
Data Sources
The data used for analysis is mainly the error code and the error message. Both can come from the service interface, and a standard Response makes them easy to obtain.
{
  "code": 200,
  "success": true,
  "message": "xxx",
  "errorCode": "error code",
  "errorMessage": "error message",
  "data": {
    ...
  }
}
Because we integrated AEM early on, error data is collected automatically. There is one problem, however: AEM truncates errorMessage to its first 50 characters, so we need to report the complete data ourselves.
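As a minimal sketch of that self-reporting step, the function below turns a failed standard Response into a record that keeps the full errorMessage. The `ErrorRecord` shape and the `toErrorRecord` name are assumptions made for illustration; they are not part of AEM or of the system described here, and how the record is actually sent is left out.

```typescript
// Shape of the standard Response shown above.
interface StandardResponse<T = unknown> {
  code: number;
  success: boolean;
  message?: string;
  errorCode?: string;
  errorMessage?: string;
  data?: T;
}

// Hypothetical record sent to our own collector (not an AEM type).
interface ErrorRecord {
  errorCode: string | null; // null when the response carries no errorCode
  errorMessage: string;     // full text, never truncated
  apiUrl: string;           // interface address
  pageUrl: string;          // page address
}

// Returns a record for failed responses and null for successful ones.
function toErrorRecord(
  res: StandardResponse,
  apiUrl: string,
  pageUrl: string
): ErrorRecord | null {
  if (res.success) {
    return null;
  }
  return {
    errorCode: res.errorCode ?? null,
    errorMessage: res.errorMessage ?? res.message ?? "",
    apiUrl,
    pageUrl,
  };
}
```

The record also carries the interface and page addresses, so the same data can later be matched against all four conditions.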
Data consumption
Define operational behaviors on top of this data, for example:
Copy: all encoded characters can be decoded and restored.
ErrorCode + copy: for the same "Network Error", no errorCode means the user's network is unstable and they should check their own connection, while an errorCode of "SYSTEM_EXCEPTION" means the underlying call failed and they can retry later.
ErrorCode + copy + page address: for the same "insufficient permissions" message, URL1 should guide the user to apply for permission 1, and URL2 to apply for permission 2.
We distilled four conditions that can be combined: ErrorCode (exact match), copy (rule match), page address (exact match), and interface address (exact match). A combination can hit one or more errors, and operational behaviors can then be configured for whatever it hits.
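The four-condition combination can be sketched roughly as follows. This is an illustration under assumptions, not the actual implementation: the type and function names are hypothetical, the "rule match" on copy is assumed to be a regular expression, and a `null` errorCode is used to express "the error carries no errorCode".

```typescript
// Context of one reported error.
interface ErrorContext {
  errorCode?: string;
  message: string;
  pageUrl: string;
  apiUrl: string;
}

// One rule: every condition that is present must hold; omitted conditions match anything.
interface MatchRule {
  id: string;
  errorCode?: string | null; // exact match; null means "no errorCode present"
  messagePattern?: RegExp;   // rule match (assumed to be a regex; avoid the g flag, which makes test() stateful)
  pageUrl?: string;          // exact match
  apiUrl?: string;           // exact match
}

function matches(rule: MatchRule, ctx: ErrorContext): boolean {
  if (rule.errorCode !== undefined && rule.errorCode !== (ctx.errorCode ?? null)) return false;
  if (rule.messagePattern !== undefined && !rule.messagePattern.test(ctx.message)) return false;
  if (rule.pageUrl !== undefined && rule.pageUrl !== ctx.pageUrl) return false;
  if (rule.apiUrl !== undefined && rule.apiUrl !== ctx.apiUrl) return false;
  return true;
}

// Returns the first matching rule, so more specific rules should be listed first.
function findRule(rules: MatchRule[], ctx: ErrorContext): MatchRule | undefined {
  for (const rule of rules) {
    if (matches(rule, ctx)) return rule;
  }
  return undefined;
}

// The "Network Error" example from the text, expressed as two rules.
const exampleRules: MatchRule[] = [
  { id: "retry-later", errorCode: "SYSTEM_EXCEPTION", messagePattern: /Network Error/ },
  { id: "check-network", errorCode: null, messagePattern: /Network Error/ },
];
```

Treating an omitted condition as a wildcard lets a single rule cover many errors, while adding conditions narrows it down to one page or one interface.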
▐Interaction design
User side
After upgrade:
Embedded mode: for customization
Operations side
Error pool:
System copy, trigger pages, interfaces, and codes: used to filter out incomprehensible errors.
Number of errors and error rate: used to judge impact. Cases with both a high error count and a high error rate should be handled first.
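One possible way to pick out the priority cases is sketched below. The entry shape, the function names, and the threshold parameters are all assumptions for illustration; the source does not state how "high" is defined.

```typescript
// Hypothetical aggregate for one entry in the error pool.
interface ErrorPoolEntry {
  key: string;          // e.g. a combination of ErrorCode and copy
  errorCount: number;   // number of error reports
  requestCount: number; // total requests to the interface in the same period
}

// Error rate as errors divided by requests; 0 when there were no requests.
function errorRate(entry: ErrorPoolEntry): number {
  return entry.requestCount === 0 ? 0 : entry.errorCount / entry.requestCount;
}

// Keeps entries that exceed BOTH thresholds, ordered by error count, highest first.
// The thresholds are caller-chosen assumptions, not values from the article.
function prioritize(
  entries: ErrorPoolEntry[],
  minCount: number,
  minRate: number
): ErrorPoolEntry[] {
  return entries
    .filter((e) => e.errorCount >= minCount && errorRate(e) >= minRate)
    .sort((a, b) => b.errorCount - a.errorCount);
}
```

Requiring both conditions keeps a busy interface with a tiny error rate, or a rarely used one with a handful of failures, from crowding out the cases that affect the most users.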
In operation:
The following data can be seen:
Configuration content: optimized copy, matching rules
Performance data: exposure, interception rate, user feedback, help rate
List of errors selected by the rule
Configuration form:
Configurable content:
- Processing method: replace with operations-written Chinese and English copy, transcode directly, or output as is
- Matching rules: a combination of conditions, including ErrorCode (exact match), copy (rule match), page address (exact match), and interface address (exact match)
Operational capability:
- 4 display modes: Toast, pop-up window, side notification, custom (mostly used for embedded mode)
- 3 types of operational help behaviors: help (custom copy or a linked help document), guidance (jump to a page or link to a process guide), feedback (open a message board or redirect to Xiaomi)
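The three processing methods could be modeled along these lines. This is only a sketch: the type names are hypothetical, and "transcode" is assumed here to mean decoding percent-encoded text, which the source does not specify.

```typescript
type Locale = "zh" | "en";

// The three processing methods from the configuration form.
type Processing =
  | { kind: "replace"; copy: Record<Locale, string> } // operations-written copy per language
  | { kind: "transcode" }                             // decode encoded text (assumed: percent-encoding)
  | { kind: "passthrough" };                          // output as is

// Produces the text shown to the user for one raw error message.
function applyProcessing(p: Processing, rawMessage: string, locale: Locale): string {
  switch (p.kind) {
    case "replace":
      return p.copy[locale];
    case "transcode":
      try {
        return decodeURIComponent(rawMessage);
      } catch {
        // Malformed input: fall back to the raw text rather than failing.
        return rawMessage;
      }
    case "passthrough":
      return rawMessage;
  }
}
```

Modeling the methods as a discriminated union lets the compiler check that every method is handled when a new one is added.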
▐Technical solution
Only part of the design is shown here.
Sequence diagram
Client flow chart
Class diagram
▐Some issues discovered and resolved
The system has been online for more than two months. For now it only monitors some high-frequency operations and important pages. During this period it has achieved the following:
Type | Example
Error report management: make copy understandable and useful | (screenshot)
System stability guarantee: remind users to upgrade after an incompatible release, keeping front end and back end in sync | (screenshot)
Expose invisible problems: discover and track online issues | Both the error rate and the error volume of an online interface were high, which exposed the following problems: (screenshot)
Prepare in advance for major events | (screenshot)
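▐Future capability expansion
Upgrade the monitoring of errors & Codes to the monitoring of interface Codes, and use rules to select cross-sections (aspects) of business processes. Further exploration can then be attempted on these aspects.
Aspects can be combined according to SOPs, which makes it possible to calculate staff efficiency
Operational capabilities such as "guidance" and "questionnaires" can be registered on an aspect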
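Team introduction
The Tmall Global front-end team is deeply committed to technical innovation and stays closely connected to the pulse of the business. We strive to build bridges between consumers and the future, and to creatively build next-generation human-computer interaction products through a continuously optimized end-to-end experience. While pursuing technology that is simple, efficient, flexible, and cutting-edge, we provide solid support for the continued innovation and growth of Tmall Global.
This article was shared from the WeChat official account 大淘宝技术 (AlibabaMTT).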