This year's 618 promotion has arrived on schedule. Below, I will walk through the underlying logic of preparing for it and how that works in practice, hoping to answer some of the questions you may have.
| Year | 618 sales (100 million yuan) | Annual sales (100 million yuan) | 618 share of annual sales |
| --- | --- | --- | --- |
| 2022 | 3793 | 33155 | 11.4% |
| 2021 | 3439 | 32970 | 10.4% |
| 2020 | 2694 | 26125 | 10.3% |
| 2019 | 2017 | 20854 | 9.7% |
| 2018 | 1592 | 16769 | 9.5% |
- What factors affect system stability?
- How do stability requirements differ from the system's day-to-day high availability requirements?
- Faced with these sources of instability, how should we deal with them?
- Heavy traffic: during a big promotion, traffic is often several times or even dozens of times the usual level, which places extremely high demands on system stability. A small problem often turns into a big one once traffic amplifies it;
- Large data volume: taking orders in 2022 as an example, the order amount reached 3.4 trillion. With order data at this scale, even a simple query becomes very challenging;
- Complex scenarios: the stacking of promotional offers and other marketing methods from platforms, merchants, and operations keeps the order-production chain in a constantly high-load computing state;
- Long delivery chain: traffic distribution on each client, promotion calculation, add-to-cart, checkout, order submission, payment, logistics and delivery, customer service, after-sales, and the other process nodes all need to be stable. If each service has 99.9% availability, a chain of 100 dependent service nodes reaches only about 90.5% availability (0.999^100 ≈ 0.905), and the roughly 9.5% of unavailability corresponds to a large number of lost orders;
- Low tolerance: consumers expect a good user experience, merchants need promotions to take effect quickly, and the platform needs to reduce errors and asset losses to protect the interests of consumers and merchants. Higher expectations and concerns lead to lower patience and tolerance;
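The availability arithmetic for a long chain can be checked with a few lines of Python. This is a minimal sketch that assumes the nodes fail independently and that every node must succeed for an order to go through; the order count of 1,000,000 is a made-up figure for illustration only.

```python
def serial_availability(per_node: float, nodes: int) -> float:
    """Availability of a chain in which every node must succeed.

    Assumes node failures are independent of each other.
    """
    return per_node ** nodes


def lost_orders(total_orders: int, availability: float) -> int:
    """Orders expected to fail for a given end-to-end availability."""
    return round(total_orders * (1 - availability))


chain = serial_availability(0.999, 100)
print(f"{chain:.4f}")  # prints 0.9048

# Hypothetical volume: one million orders passing through the chain.
print(lost_orders(1_000_000, chain))
```

The same function shows why per-node targets matter: raising each node to 99.99% lifts the 100-node chain to roughly 99%.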
- Tight timing: during the promotion, service stability has to be guaranteed within a short window, and there is usually no time to dig into technical details;
- Different perspectives: stability focuses on the overall business outcome, while high availability focuses on the result of each service response;
- Different dimensions: business stability is usually built on the high availability of the system, combined with the relevant service operation strategies, to achieve stability at a higher, business-level dimension.
3.1 Application Perspective
3.1.1 Unitization
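- It reduces the risk that a failure in one unit interrupts service for the entire application;
- It makes troubleshooting easier, because the faulty unit can be located and repaired quickly;
- Each unit can be maintained and upgraded independently, which reduces the risk that upgrading or maintaining one unit interrupts service for the entire application;
- Each unit can be scaled out and in independently, so the size of the application can be adjusted dynamically to actual demand.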
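The fault-isolation benefit can be illustrated with a small routing sketch. Everything here is hypothetical: the unit names, the CRC32-based mapping, and the "next healthy unit" failover rule are assumptions made for illustration, not the scheme the article's system uses. Real failover between units also presupposes that the target unit can read the user's data.

```python
import zlib

UNITS = ["unit-a", "unit-b", "unit-c"]  # hypothetical unit names


def home_unit(user_id: str, units=UNITS) -> str:
    """Map a user to a fixed unit using a stable hash (CRC32)."""
    return units[zlib.crc32(user_id.encode("utf-8")) % len(units)]


def route(user_id: str, healthy: set, units=UNITS) -> str:
    """Return the user's home unit, or the next healthy unit if it is down."""
    start = units.index(home_unit(user_id, units))
    for offset in range(len(units)):
        candidate = units[(start + offset) % len(units)]
        if candidate in healthy:
            return candidate
    raise RuntimeError("no healthy unit available")
```

With this layout, a failure in one unit only touches the users whose home is that unit; everyone else keeps being routed exactly as before.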
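3.1.2 Monitoring and Alerting
- Monitoring granularity: by layer, monitoring is divided into underlying middleware monitoring, dependent RPC monitoring, method monitoring, machine monitoring, system monitoring, business monitoring, process monitoring, and the overall dashboard;
- Monitoring sensitivity: if sensitivity is too low, some problems are exposed late or even hidden; if it is too high, it causes information overload and makes it hard to tell what matters most. Do the groundwork before rolling out monitoring and settle on an appropriate sensitivity;
- Monitoring coverage: pay attention to the monitored service units, the inventory of monitoring metrics, and the way alerts reach people. For example, monitoring needs to cover container count, resource metrics, runtime environment (JVM, thread pools), traffic volume, rate-limit thresholds, upstream and downstream dependencies, timeout durations, exception logs, data volume, model size, feature count, and so on, and should allow comparison over time;
- Monitoring accuracy: availability should be read from the upstream caller's side; a 200 ms response time may already fall in the unavailable range for the caller. For CPU busyness, do not look only at utilization; analyze it together with the container's core count and the CPU load;
- Clearing alerts: on receiving an alert, investigate and handle the risk promptly, and never let a small problem grow into a big one. First confirm whether it is a single-machine hardware or network problem or a cluster-wide problem. If it is cluster-wide, check whether call-chain tracing can quickly locate the problem point. A sound response plan is only possible once the cause is confirmed;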
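The two accuracy points above (availability as the caller sees it, and CPU busyness judged by more than utilization) can be sketched as follows. The function names, the 150 ms caller timeout, and the 0.8 and 1.0 thresholds are assumptions chosen for illustration, not values from the article.

```python
def caller_availability(latencies_ms, succeeded, caller_timeout_ms):
    """Share of calls that succeeded AND returned within the caller's timeout."""
    if not latencies_ms:
        return 1.0
    ok = sum(
        1
        for latency, success in zip(latencies_ms, succeeded)
        if success and latency <= caller_timeout_ms
    )
    return ok / len(latencies_ms)


def cpu_is_saturated(utilization, load_avg, cores,
                     util_limit=0.8, load_per_core_limit=1.0):
    """Flag a container as busy if either utilization or load per core is high.

    The default limits are illustrative assumptions, not recommended values.
    """
    return utilization >= util_limit or load_avg / cores >= load_per_core_limit


# The provider reports five successful calls, i.e. 100% success on its side.
latencies = [20, 50, 200, 80, 300]
print(caller_availability(latencies, [True] * 5, caller_timeout_ms=150))  # prints 0.6

# 45% utilization looks healthy, but a load of 12 on 4 cores does not.
print(cpu_is_saturated(utilization=0.45, load_avg=12.0, cores=4))  # prints True
```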
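3.1.3 Logging
3.1.4 Fail Fast
- Set timeouts for thread pools; critical systems need the ability to adjust thread pool runtime parameters dynamically;
- Make good use of what existing tools already offer; for example, middleware such as JSF, JimDB, and JMQ also supports dynamic adjustment of timeout-and-fail settings;
- Service rate limiting is also one way to implement fail-fast; common microservice frameworks and physical gateways generally support similar functionality;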
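A minimal sketch of fail-fast with a runtime-adjustable timeout, using only the Python standard library. `FailFastCaller` and `slow_dependency` are hypothetical names introduced here; this does not show how JSF, JimDB, or JMQ expose their timeout settings.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class FailFastCaller:
    """Runs calls on a thread pool and gives up after `timeout_s` seconds."""

    def __init__(self, timeout_s: float, workers: int = 4):
        # Plain attribute so it can be changed at runtime, e.g. by a config push.
        self.timeout_s = timeout_s
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def call(self, fn, *args, fallback=None):
        future = self._pool.submit(fn, *args)
        try:
            return future.result(timeout=self.timeout_s)
        except FutureTimeout:
            # cancel() only stops tasks that have not started running yet;
            # a task already running keeps its worker until it finishes.
            future.cancel()
            return fallback

    def close(self):
        self._pool.shutdown(wait=False)


def slow_dependency() -> str:
    """Stand-in for a downstream call that takes 0.3 seconds."""
    time.sleep(0.3)
    return "done"
```

With `timeout_s=0.05`, `call(slow_dependency, fallback="degraded")` returns the fallback instead of waiting; after setting `timeout_s = 1.0` on the same instance, the call returns the real result. Note that a timed-out task still occupies a worker thread, so the pool size needs headroom for slow dependencies.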
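3.1.5 Service Rate Limiting
- The rate-limiting method and thresholds must be verified through multiple rounds of load testing to make sure the metrics are accurate.
- Business aggregation systems mainly depend on third-party services and usually have no storage layer, so the bottleneck tends to be in the application service itself. In this case single-machine rate limiting is a good approach, because it is very friendly to scaling the service out or in: you only need to make sure the added containers have the same hardware configuration as the containers already online.
- For underlying foundational services, the bottleneck is often in the data storage layer, and scaling the storage layer is relatively costly and hard to carry out. In this case global, centralized rate limiting is a good choice, since its goal is to protect the stability of the storage layer first.
- Run rate limiting in a fine-grained way according to how important each caller is, so that in extreme cases the system is able to keep core business available first;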
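The single-machine approach combined with caller-priority limiting can be sketched as a token bucket in which lower-priority callers must leave a reserve of tokens for core callers. The tier names, the reserve sizes, and the rate and capacity values are assumptions for illustration; real thresholds would come from the load tests mentioned above.

```python
import time


class TokenBucket:
    """In-process (single-machine) token bucket."""

    def __init__(self, rate_per_s: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def _refill(self):
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self, reserve: float = 0.0) -> bool:
        """Take one token only if at least `reserve` tokens remain afterwards."""
        self._refill()
        if self.tokens - 1 >= reserve:
            self.tokens -= 1
            return True
        return False


# Hypothetical caller tiers: non-core callers must leave 3 tokens for core callers.
RESERVE_BY_TIER = {"core": 0.0, "normal": 3.0}


def allow(bucket: TokenBucket, tier: str) -> bool:
    """Admit or reject one request from a caller of the given tier."""
    return bucket.try_acquire(RESERVE_BY_TIER[tier])
```

Because each instance keeps its own bucket, adding identical containers raises the total admitted rate proportionally, which is the scaling property the aggregation-system bullet relies on. A global, centralized limiter would instead keep the counter in a shared store.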
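3.1.6 Business Degradation
3.2 Storage Perspective
3.2.1 Database
3.2.2 Cache
3.2.3 Elasticsearch
3.3 Operations Perspective
3.3.1 Readiness Team
3.3.2 Drill-Style Load Testing
3.3.3 Code Freeze
3.3.4 Daily Inspection / Holiday On-Call
3.3.5 Contingency Plans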
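This article analyzed, from a technical angle, the background and importance of preparing for a big promotion, and focused on the measures that safeguard stability during the preparation period, including concrete guiding directions and implementation details. Its aim is to review and organize the key steps of the preparation period so that we can face system stability challenges more calmly. Although promotion readiness is an urgent operation, its results depend on the collaborative consensus and technical groundwork built up in ordinary times; past experience and lessons are fully put to the test at this moment.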
This article is shared from the WeChat official account 京东云开发者 (JDT_Developers).