How to solve the problem of system errors being exposed to users on the merchant workbench?




If you want to do something about system errors, I hope this summary will be helpful to you.


Issues & Challenges

▐"Understandable but useless" messages are the most deadly


For example:

Type: No further information
Real system error report:
  • "Sorry, insufficient permissions to operate"
  • "timeout of 60000ms exceeded"
Business perspective: Then what? What am I supposed to do?

Type: Same error for different reasons
Real system error report:
  • "Network Error"
  • "Remote call failed"
  • "System exception"
Business perspective: What exactly went wrong?

Type: Internal jargon
Real system error report:
  • "This product is not double-quoted and does not involve flat sales period verification"
  • "Report tmc not found"
Business perspective: What is double reporting? What is the flat sales period? Do you really think I understand?

Type: Garbled characters
Real system error report:
  • "Content type 'application/x-www-form-urlencoded;charset=UTF-8' not supported"
  • "nested exception is org.***.***.exceptions.***: ### Error querying database. Cause: ERR-CODE: [TDDL-4614]*** ATOM '***_i-8vbfhe0qt"
Business perspective: WHAT?! What is this?!


▐The problem cannot be cured at the root, and the operations staff suffer the most


Some situations cannot actually be solved through development and release. When customer complaints like the ones above occur, our PD may ask the developers to change the copy.
The replies may sound like this:
"Sorry, blah blah blah~~~"
"The errors reported by the underlying service are not under our control."
"The code is too old, and changing it will take time."
"There is a release freeze (network closure) today."
We originally wanted to reduce the per-hundred support request rate, but there are always many reasons why an error cannot be dealt with immediately. In the end, the operations staff have to handle it.
They have 2 main options: answer the questions one by one, or publish an announcement when the impact is large.

▐How to solve it?  


Given the problems above, we need to meet one core requirement: "The existing error copy urgently needs to be managed so that it is understandable and helpful."


Digging further, several derived requirements emerge:

  1. Operable: can be configured dynamically per scenario, and can still be operated during release freezes (network closures)

  2. Feedback-enabled: users have the final say on whether a prompt is helpful, and can give feedback

  3. Problem discovery: problematic errors can be located quickly

  4. Prepared in advance: foreseeable errors can be configured ahead of time

Solution


▐Researching an existing solution: OneCode

OneCode is a mature, unified error code management tool within the group. It provides dynamic Code --> Message mapping, and it integrates Medusa to support multilingual copy.


Relative to our needs, however, this tool falls slightly short:

  1. The codes used by the international services are neither complete nor unique, and the cost of changing them may not be acceptable to the back-end developers (the requirement cannot be met)

  2. Attaching operational behavior to a Code cannot be implemented (because the codes are not unique)

  3. In addition, the system "seems" not to have been maintained for a long time, so there is a risk of it being taken offline

So we decided to build our own.


▐Analysis of the self-developed solution


At which layer should it be done?

The copy can be divided into 2 parts:

Part 1: Controllable copy (static copy): the back end can modify it in routine releases, and some "relatively" unique ErrorCodes need to be added.

Part 2: Uncontrollable copy (from the underlying layer): it requires monitoring and adaptation. This is better suited to the front end, which optimizes it based on the combination of interface + page + ErrorCode + copy.


  • Data Sources


The data used for analysis consists mainly of error codes and error messages. It can come from the service interfaces, which is easier to achieve when they follow the standard Response.

{
  "code": 200,
  "success": true,
  "message": "xxx",
  "errorCode": "error code",
  "errorMessage": "error message",
  "data": {
    ...
  }
}


Because we integrated AEM early on, error data is collected automatically. There is one problem, however: AEM automatically truncates errorMessage to its first 50 characters, so we have to report the complete data ourselves.
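
As a rough illustration of reporting the complete data ourselves, the sketch below types the standard Response and builds a report record that keeps the full errorMessage. The `ErrorReport` shape, the `buildErrorReport` function, and the `platform` / `page` / `api` context fields are assumptions made for this example; they are not AEM APIs or the team's actual reporting code.

```typescript
// Shape of the standard Response; `data` is left generic.
interface StandardResponse<T = unknown> {
  code: number;
  success: boolean;
  message?: string;
  errorCode?: string;
  errorMessage?: string;
  data?: T;
}

// Hypothetical record that we report ourselves (not an AEM type).
interface ErrorReport {
  platform: string;
  page: string;
  api: string;
  errorCode: string;
  errorMessage: string; // full text, never truncated
}

// Build a report from a failed response; returns null for successful calls.
function buildErrorReport(
  res: StandardResponse,
  ctx: { platform: string; page: string; api: string },
): ErrorReport | null {
  if (res.success) return null;
  return {
    platform: ctx.platform,
    page: ctx.page,
    api: ctx.api,
    errorCode: res.errorCode ?? "",
    errorMessage: res.errorMessage ?? res.message ?? "",
  };
}
```

How the record is actually sent (which endpoint, batching, sampling) is left out here because the article does not describe it.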


  • Data consumption


Once we have the data, the next step is to "clean" it. We do the data cleaning, processing, and task orchestration on the DataWorks platform, publish the orchestrated tasks, and schedule them to run daily.

The final output is a data set broken down by platform, page, interface address, error code, and error message, which flows back into our system for subsequent data analysis and operational configuration.
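
The real cleaning job runs on DataWorks and its task definitions are not shown in this article. Purely to illustrate the breakdown by the five dimensions, here is an in-memory sketch; `ErrorRecord`, `ErrorGroup`, and `groupErrors` are hypothetical names introduced for this example.

```typescript
// One reported error occurrence (hypothetical shape).
interface ErrorRecord {
  platform: string;
  page: string;
  api: string;
  errorCode: string;
  errorMessage: string;
}

// One row of the output data set: the five dimensions plus a count.
interface ErrorGroup extends ErrorRecord {
  count: number;
}

// Group records by platform + page + api + errorCode + errorMessage,
// then sort so the most frequent errors come first.
function groupErrors(records: ErrorRecord[]): ErrorGroup[] {
  const groups = new Map<string, ErrorGroup>();
  for (const r of records) {
    const key = JSON.stringify([r.platform, r.page, r.api, r.errorCode, r.errorMessage]);
    const existing = groups.get(key);
    if (existing) {
      existing.count += 1;
    } else {
      groups.set(key, { ...r, count: 1 });
    }
  }
  return Array.from(groups.values()).sort((a, b) => b.count - a.count);
}
```

Computing an error rate would additionally need the total number of calls per interface, which this sketch does not model.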


  • Establish operational behavior


At first we expected operational behavior to be strongly bound to a unique Code, but since a Code cannot be "unique", defining a "scenario" requires combining conditions on several parameters.

For example:

  1. Copy: all encoded (escaped) symbols can be decoded and restored.

  2. ErrorCode + copy: for the same "Network Error", no errorCode means the network connection is poor and users should check their own network, while an errorCode of "SYSTEM_EXCEPTION" means the underlying call failed and users can try again later.

  3. ErrorCode + copy + page address: for the same "insufficient permissions", URL1 needs to direct users to apply for permission 1, and URL2 needs to direct them to apply for permission 2.


We distilled 4 conditions that can be combined: ErrorCode (exact match), copy (rule match), page address (exact match), and interface address (exact match). A combination can hit one or more error reports, and the operational behavior is then configured on that basis.
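
The combination of the 4 conditions can be sketched as follows. This is a minimal sketch, not the team's implementation: it assumes that the "rule match" on copy is a regular expression, that an omitted condition means "any value", and that `errorCode: null` means "no errorCode present". The names `ErrorContext`, `ScenarioRule`, `matchesRule`, and `findRule`, as well as the advice strings, are hypothetical.

```typescript
// What the client knows when an error occurs.
interface ErrorContext {
  errorCode?: string;
  message: string;
  page: string;
  api: string;
}

// One configured scenario. Omitted conditions are not checked.
interface ScenarioRule {
  name: string;
  errorCode?: string | null; // exact match; null = errorCode must be absent
  messagePattern?: RegExp;   // rule match on the copy (assumed to be a regex)
  page?: string;             // exact match
  api?: string;              // exact match
  advice: string;            // placeholder for the configured operational behavior
}

function matchesRule(rule: ScenarioRule, ctx: ErrorContext): boolean {
  if (rule.errorCode !== undefined) {
    const actual = ctx.errorCode ? ctx.errorCode : null; // treat "" as absent
    if (actual !== rule.errorCode) return false;
  }
  if (rule.messagePattern && !rule.messagePattern.test(ctx.message)) return false;
  if (rule.page !== undefined && rule.page !== ctx.page) return false;
  if (rule.api !== undefined && rule.api !== ctx.api) return false;
  return true;
}

// Return the first rule that matches, or undefined if none does.
function findRule(rules: ScenarioRule[], ctx: ErrorContext): ScenarioRule | undefined {
  return rules.find((rule) => matchesRule(rule, ctx));
}

// The "Network Error" example from the list above, expressed as two rules.
const networkRules: ScenarioRule[] = [
  {
    name: "network-unavailable",
    errorCode: null,
    messagePattern: /Network Error/,
    advice: "Check your network connection.",
  },
  {
    name: "underlying-call-failed",
    errorCode: "SYSTEM_EXCEPTION",
    messagePattern: /Network Error/,
    advice: "The underlying call failed, please try again later.",
  },
];
```

Taking the first matching rule is a simplification; a real configuration would also need a way to order or prioritize rules when several of them match the same error.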


▐Interaction plan  


  • User side


Before upgrading:

After upgrade:


Guidance: jump to the SOP document or open "in-page guidance"

Feedback:


Embedded mode: for customization


  • Operation side


Error pool:

The following data can be seen:
  1. System copy, triggering pages, interfaces, and codes: used to filter out incomprehensible errors.

  2. Number of errors and error rate: used to judge the impact. Cases with both a high error count and a high error rate should be handled first.


In operation:

The following data can be seen:

  1. Configuration content: optimized copy and matching rules

  2. Performance data: exposure, interception rate, user feedback, help rate

  3. The list of errors selected by the rule


Configuration form:

Configurable content:

  1. Processing method: replace with operational copy in Chinese and English, transcode directly, or output as is
  2. Matching rules: a combination of conditions, including ErrorCode (exact match), copy (rule match), page address (exact match), and interface address (exact match)
  3. Operational capabilities:
    1. 4 display modes: toast, pop-up window, side notification, and custom (mostly used for embedded mode)
    2. 3 types of operational help behavior: help (custom copy or a linked help document), guidance (jump to a page or link to a process guide), and feedback (open the message board or direct the user to Xiaomi, the customer service assistant)
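
One way to picture the configurable content is as a typed configuration object. The sketch below is an assumption-heavy illustration: all type and field names are hypothetical, and "transcode" is interpreted here as URL-decoding with `decodeURIComponent`, which the article does not specify.

```typescript
// The three processing methods from the form (names are hypothetical).
type ProcessingMethod = "replace" | "transcode" | "passthrough";

// The four display modes.
type DisplayMode = "toast" | "popup" | "sideNotification" | "custom";

// The three kinds of operational help behavior.
type HelpAction =
  | { kind: "help"; text?: string; docUrl?: string }
  | { kind: "guidance"; targetUrl: string }
  | { kind: "feedback"; channel: "messageBoard" | "xiaomi" };

interface OperationConfig {
  processing: ProcessingMethod;
  copy?: { zh: string; en: string }; // only used when processing = "replace"
  display: DisplayMode;
  actions: HelpAction[];
}

// Decide which text to show for an original error message.
function resolveMessage(original: string, cfg: OperationConfig, locale: "zh" | "en"): string {
  if (cfg.processing === "replace" && cfg.copy) {
    return cfg.copy[locale];
  }
  if (cfg.processing === "transcode") {
    try {
      // Assumption: "transcode" means decoding percent-encoded text.
      return decodeURIComponent(original);
    } catch {
      return original; // malformed input: fall back to the original text
    }
  }
  return original; // "passthrough", or "replace" without configured copy
}
```

The matching rules from the previous section would sit alongside this object in a full configuration record; they are omitted here to keep the sketch short.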


▐Technical solutions  


This section shows only part of the design.


  • Sequence diagram



  • Client flow chart


  • Class Diagram



Achievements & Outlook


▐Some issues discovered and resolved  


The system has been online for more than two months. For now, it monitors only some high-frequency operations and important pages. During this period, it has achieved the following:

Type: Error report management (make copy understandable and useful)
Examples:
  • "Remote call failed" --(restricted to page + interface)--> "Third-party service call failed, please reduce the number of product searches to less than 100 and try again"
  • "Network Error" --(global + condition)--> "The system encountered an error", plus a self-diagnosis method and guidance

Type: System stability guarantee (remind users to upgrade after an incompatible release, so that front end and back end stay matched)
Examples:
  • The error report contains "application/x-www-form-urlencoded" --(restricted to page + interface + code)--> "Your version is out of date, please refresh the page and try again."

Type: Expose invisible problems (discover and track online issues)
Examples: interfaces with both a high online error rate and a high error volume exposed the following problems:
  • 2 historical issues on interfaces
  • Several decommissioned interfaces that still receive traffic
  • Interaction defects that led to 2 abnormal interface-call issues

Type: Prepare in advance for major events
Examples:
  • Double Eleven write-ban announcement linked with error messages: during the write ban, the interface "denies service" in a way that looks like a cross-origin failure, so the front-end service layer returns "Network Error". We set up the rules in advance to tell users the write-ban period and how to handle it.


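▐Further capability expansion in the future

We plan to upgrade the monitoring of errors & Codes into monitoring of interface Codes, and to use rules to select aspects (cross-sections) of the business process. Further exploration can then be attempted on these aspects.

  1. Aspects can be combined according to SOPs, which makes it possible to calculate staff efficiency

  2. Operational capabilities such as "guidance" and "questionnaires" can be registered on an aspect


Team introduction


The Tmall Global front-end team is deeply engaged in technical innovation and closely tied to the pulse of the business. We are committed to building a bridge between consumers and the future, and to creatively building the next generation of human-computer interaction products through a continuously optimized end-to-end experience. While pursuing technology that is simple, efficient, flexible, and cutting-edge, we provide solid support for the continued innovation and prosperity of Tmall Global.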



This article is shared from the WeChat public account 大淘宝技术 (AlibabaMTT).

Origin my.oschina.net/u/4662964/blog/10315617