[Stability] Day 6: The evolution of Dianping's high-availability architecture - lower the failure frequency, shorten the recovery time

This article is based on a talk shared by Chen Yifang of Dianping.

 

1. Understanding availability

Understanding the target

In the industry, availability targets are expressed as a number of nines, and the requirement differs from system to system. When designing or developing a system, engineers need to know the user scale, the usage scenarios, and the availability target.
For example, a five-nines target breaks down to roughly five minutes of failure per year.

Figure 1: Breakdown of the availability target
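As a quick sanity check (simple arithmetic, not part of the original talk), the downtime budget behind each number of nines can be computed directly; a minimal Java sketch:

```java
// Downtime budget per year for 3, 4 and 5 nines, assuming a 365-day year.
public class AvailabilityBudget {
    public static void main(String[] args) {
        double minutesPerYear = 365 * 24 * 60;        // 525,600 minutes
        double[] targets = {0.999, 0.9999, 0.99999};  // 3, 4 and 5 nines
        for (double target : targets) {
            System.out.printf("%.5f -> %.1f minutes of downtime per year%n",
                    target, minutesPerYear * (1 - target));
        }
    }
}
```

Five nines leaves roughly 5.3 minutes of downtime per year, which is where the five-minute figure above comes from.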

 

Breaking down the target

A target of several nines is rather abstract and needs to be decomposed sensibly. It can be broken down into the following two sub-goals:

Lower the frequency: reduce the number of failures

If a system never has problems, it is certainly highly available, but that is impossible. The larger and more complex the system, the more we can only try to avoid problems, reducing their probability through system design and process mechanisms. And if problems occur frequently, recovering quickly afterwards is of little use.

Shorten the time: recover from failures faster

When a failure occurs and cannot be resolved immediately or traced to a specific cause, rapid recovery is the top priority, to prevent secondary damage and keep the problem from spreading. This requires thinking from a business perspective, not only a technical one.

 

2. Lower the frequency: reduce the number of failures

High-availability design: keep iterating as the business changes

Take the evolution of Dianping's trading system as an example:

Infancy: 2012

Mission: meet business requirements and go live quickly.

Because the group-buying product had to be pushed to market quickly in 2011, team members were pulled temporarily from other teams, and most of them were more familiar with .NET, so the first generation of the group-buying system was designed in .NET. Meeting business requirements came first; there was no opportunity yet to worry about quality attributes such as availability. Things were kept fairly simple: if something failed, everything failed, but the volume was still small, and problems were solved by restarting, scaling out, or rolling back.

The system grew into what is shown in Figure 2.

Figure 2: Architecture of Dianping's trading system in its infancy

 

Adolescence: vertical split (2012-2013)

Mission: R&D efficiency and fault isolation.

In 2012, as daily deal volume grew from the thousands to the tens of thousands and users' daily order volume also reached the tens of thousands, iteration speed and R&D efficiency became the priorities, so we built small but excellent teams. At the same time, the business lines needed to be isolated from each other: for example, the product listing page, the product detail page, and the order and payment flows have different stability requirements. The front pages can be cached or rendered statically to guarantee availability and provide a degraded but flexible experience; the payment system at the back was given remote disaster recovery. For instance, besides the payment system in the Nanhui data center we also deployed one in the Baoshan data center, but the system evolved so fast that there were no tools or mechanisms to keep both rooms in sync, so that second deployment ended up being of little use.

The system evolved into what is shown in Figure 3: the services were split vertically, but the data was not yet fully isolated.

Figure 3: Architecture of Dianping's trading system in its adolescence

 

Youth: small services, no shared data (2014-2015)

Mission: support the rapid growth of the business by providing efficient, highly available technical capabilities.

Starting in 2013, deal-service (the product system) would occasionally go down under a burst of high traffic (a big promotion or a routine campaign), roughly once every few months. Availability basically hovered around three nines. The order and payment systems were very stable during this period, because there is a conversion rate from the product detail page to orders: when traffic surged, the detail page went down first, so no traffic reached the order flow. Later, static rendering of the detail page improved, which shortened recovery time and allowed degradation, but every system depended too deeply on deal-service, so end-to-end availability still could not be guaranteed.

So in 2014 deal-service underwent a major refactoring along the lines of "make big systems small": the product system was split into numerous small services, such as an inventory service, a price service, and a basic data service, which solved the product detail page problem. The pressure then moved down to the order system. Starting in October 2014, the order and payment systems also began a comprehensive move to microservices. After about a year of practice, the order, marketing, and payment domains together comprised close to a hundred services, backed by more than 20 databases, enough to support a million orders per day.

Business growth could be absorbed by scaling out the application service layer, but the biggest single point, the database, was still centralized. At this stage we mainly separated reads from writes in the application's data access layer, with the database providing more read replicas to solve the read problem, but writes remained the biggest bottleneck (MySQL reads can be scaled out, while write QPS tops out at just under 20,000).
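The read/write split in the data access layer can be pictured roughly as below. This is only a minimal sketch of the idea, not Dianping's actual middleware; the class and method names are made up:

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import javax.sql.DataSource;

// Routes writes to the master and spreads reads across replicas.
public class ReadWriteRouter {
    private final DataSource master;
    private final List<DataSource> replicas;

    public ReadWriteRouter(DataSource master, List<DataSource> replicas) {
        this.master = master;
        this.replicas = replicas;
    }

    public DataSource route(String sql) {
        if (sql.trim().toLowerCase().startsWith("select")) {
            // Pick a replica at random; a real router would also consider replication lag.
            return replicas.get(ThreadLocalRandom.current().nextInt(replicas.size()));
        }
        return master; // writes (and anything non-SELECT) go to the master
    }
}
```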

The system evolved into what is shown in Figure 4. This architecture could support an order volume of roughly 3,000 QPS.

Figure 4: Architecture of Dianping's trading system in its youth

Adulthood: horizontal split (2015 to now)

Mission: the system must support large-scale promotions, with the order system sustaining tens of thousands of QPS and more than ten million orders per day.

The 917 Foodie Festival in 2015 brought our highest traffic peak. Had we still been on the previous architecture, the system would certainly have gone down, so in the months before the 917 promotion we upgraded the order system's architecture with a horizontal split. The core was to eliminate the data single point: the order table was split into 1,024 tables distributed over 32 databases, with 32 tables per database, which is enough to carry us as far as we can currently foresee.
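The sharding rule described above can be sketched as follows. The routing key, modulo scheme, and naming convention here are illustrative assumptions; the talk does not spell out the actual rule:

```java
// 1,024 order tables spread over 32 databases, 32 tables per database.
public class OrderShardRouter {
    private static final int DB_COUNT = 32;
    private static final int TABLES_PER_DB = 32;
    private static final int TOTAL_TABLES = DB_COUNT * TABLES_PER_DB; // 1,024

    /** Maps an order id to a physical "database.table" location. */
    public static String route(long orderId) {
        int slot = (int) (orderId % TOTAL_TABLES);  // 0..1023
        int dbIndex = slot / TABLES_PER_DB;         // which database, 0..31
        return String.format("order_db_%02d.order_%04d", dbIndex, slot);
    }

    public static void main(String[] args) {
        System.out.println(route(917000001L)); // prints the shard for this order id
    }
}
```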

Although the data layer problem was solved, we still had some single points: the MQ we relied on, the network, the data centers, and so on. Here are a few availability problems we actually ran into, even though they are not easy to encounter:

One of a server's network cards failed without being detected; later the other card failed too, and the service went down.

While using the cache we found its availability was very low during peak hours. It turned out the cache server sat in the same rack as the servers of Cat, the company's monitoring system; at peak time Cat consumed well over half the bandwidth, leaving very little network capacity for the business and thereby affecting it.

During the 917 promotion, our estimate of the capacity of the MQ channel we depended on was off, and there was no backup plan, so a small portion of traffic was delayed. The system at this stage evolved into what is shown in Figure 5.

Figure 5: Architecture of Dianping's trading system in its adulthood

 

The future: the approach is still to make big systems small, make the basic channels big, and partition the traffic

Making big systems small means splitting complex systems into single-responsibility systems and extending them along architectural dimensions such as single machine, active-standby, cluster, and multi-region.

Making the basic channels big means widening the highways: the basic communication framework, bandwidth, and so on.

Partitioning the traffic means splitting user traffic according to some model so that each slice is aggregated and handled within a single service cluster, closing the loop there (a rough sketch follows).
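A rough sketch of what traffic partitioning could look like, assuming a hash of the user id decides which self-contained service cell handles a user; the cell names and the routing model are hypothetical:

```java
import java.util.Arrays;
import java.util.List;

// Maps every user onto one of a few self-contained service cells so that a
// user's requests are handled (and fail) inside that cell only.
public class TrafficPartitioner {
    private final List<String> cells;

    public TrafficPartitioner(List<String> cells) {
        this.cells = cells;
    }

    public String cellFor(long userId) {
        int index = (Long.hashCode(userId) & Integer.MAX_VALUE) % cells.size();
        return cells.get(index);
    }

    public static void main(String[] args) {
        TrafficPartitioner partitioner =
                new TrafficPartitioner(Arrays.asList("cell-a", "cell-b", "cell-c"));
        System.out.println(partitioner.cellFor(1234567L));
    }
}
```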

The system may evolve into what is shown in Figure 6.

Figure 6: Future architecture evolution of Dianping's trading system

Figure 6 shows the stages of development of Dianping's trading system, using only the business systems as an example. Beyond that, the CDN, DNS, network, data centers, and so on each hit different availability problems at different times. Problems we actually encountered include the China Unicom network going down and having to switch to China Telecom, and a database's power cable being kicked out by accident.

 

Easy to operate

A highly available system must be operable. When people hear "operations" they mostly think of product operations, but technical operations is about whether online quality and processes can actually be run: once the whole system is live, is it easy to switch traffic, easy to toggle features, easy to scale out? There are a few basic requirements:

 

Rate limiting

Online traffic always brings situations you did not anticipate, so the system's ability to sustain a stable throughput becomes extremely important. High-concurrency systems generally adopt a fast-fail strategy: if the system can support 5,000 QPS but 10,000 QPS arrives, we guarantee a steady 5,000 and fail the other 5,000 fast, so the 10,000 QPS is digested quickly. Likewise, the payment system during 917 used rate limiting: once a certain traffic peak was exceeded, we automatically returned "please try again later".
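A minimal sketch of the fast-fail idea, using a simple fixed one-second window; the 5,000 QPS threshold mirrors the example above, while the implementation details (and the tolerance of small races for brevity) are my own simplification:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Admits up to maxQps requests per second and rejects the rest immediately
// instead of queueing them.
public class FastFailLimiter {
    private final int maxQps;
    private final AtomicLong windowStart = new AtomicLong(System.currentTimeMillis());
    private final AtomicInteger counter = new AtomicInteger();

    public FastFailLimiter(int maxQps) {
        this.maxQps = maxQps;
    }

    /** Returns true if the request may proceed, false if it should fail fast. */
    public boolean tryAcquire() {
        long now = System.currentTimeMillis();
        long start = windowStart.get();
        if (now - start >= 1000 && windowStart.compareAndSet(start, now)) {
            counter.set(0); // a new one-second window begins
        }
        return counter.incrementAndGet() <= maxQps;
    }
}

// Usage at the entry point of a request, e.g. with new FastFailLimiter(5000):
//   if (!limiter.tryAcquire()) { return "Busy, please try again later"; }
```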

 

Statelessness

Application systems must be completely stateless, so that operations can scale them out and distribute traffic at will.

 

Degradation capability

Degradation has to be designed together with the product team, looking at how the degraded state affects the user experience, down to details such as the prompt text. Take payment channels: if the Alipay channel is partially down, say 50%, a note automatically appears next to the Alipay option explaining that the channel may be unstable but can still be clicked; when the Alipay channel is 100% down, the button is grayed out, cannot be clicked, and shows a prompt such as "please use another payment channel". Another case: during the 917 promotion, certain dependencies such as the credit check, which are resource-intensive to evaluate but whose risk is controllable, could be disabled or enabled directly through a switch.
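A minimal sketch of such a degradation switch: the expensive, non-critical dependency (the credit check mentioned above) sits behind a flag that can be flipped at runtime. The in-memory map is a stand-in; a real system would back the switches with a configuration center:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// A runtime on/off switch for non-critical dependencies.
public class DegradeSwitch {
    private static final ConcurrentMap<String, Boolean> SWITCHES = new ConcurrentHashMap<>();

    public static void set(String name, boolean enabled) {
        SWITCHES.put(name, enabled);
    }

    public static boolean isEnabled(String name) {
        return SWITCHES.getOrDefault(name, true); // features default to "on"
    }
}

// At the call site, the costly check is simply skipped when degraded:
//   if (DegradeSwitch.isEnabled("credit-check")) { creditService.verify(order); }
```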

 

Testability

No matter how perfect the architecture, the verification step is indispensable, so the testability of the system is very important.

Testing starts with estimating the size of the traffic. For a big promotion, for example, discuss with product and operations the sources of traffic, the intensity of the campaign, and the position of every button on every page, so as to arrive at a reasonably accurate estimate.

 

Test the capacity of the cluster. When running tests, many engineers like to test a single machine and then extrapolate horizontally to a conclusion, but that is not very accurate. You need to analyze the proportions in which traffic flows between systems, pay particular attention to testing the traffic model (note that the peak traffic model may differ from the everyday one), and run capacity tests of the system architecture. Take the test method we used for one big promotion as an example.

 

Figure 7: Test architecture

 

Evaluate traffic from the top down and capacity from the bottom up: we found that one order submission involves 20 database accesses, with a read/write ratio of 1:1 at peak. From the database's capacity we then worked backwards to the amount of traffic the system should admit, and made order placement asynchronous at the front end so that traffic flows down to the database smoothly.
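In the same spirit, a back-of-the-envelope version of "evaluate traffic top-down, capacity bottom-up". Only the 20-accesses-per-order figure and the 1:1 read/write ratio come from the text; the database capacity numbers are made-up placeholders:

```java
public class CapacityEstimate {
    public static void main(String[] args) {
        int dbAccessesPerOrder = 20;   // from the text: 20 DB accesses per order submission
        int writesPerOrder = 10;       // 1:1 read/write ratio at peak -> 10 writes, 10 reads
        int readsPerOrder = dbAccessesPerOrder - writesPerOrder;

        int dbWriteQps = 20_000;       // assumed total write capacity across shards
        int dbReadQps = 80_000;        // assumed total read capacity across replicas

        // The admissible order QPS is bounded by whichever side of the database saturates first.
        int maxOrderQps = Math.min(dbWriteQps / writesPerOrder, dbReadQps / readsPerOrder);
        System.out.println("Max sustainable order QPS ~ " + maxOrderQps);
    }
}
```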

 

Reducing release risk

A strict release process

At Dianping, releases are currently the responsibility of the developers themselves and are carried out through a platform; the go-live process follows a standard release process template (as shown in Figure 8).

Figure 8: Standard release process template

 

Gray release mechanism

  • Servers are released in batches, at 10%, 30%, 50%, and then 100%; between batches the developer confirms the business is healthy by watching the monitoring curves and the system logs (see the sketch after this list);

  • A gray release mechanism for live traffic: important features can be rolled out gradually according to some slice of the traffic;

  • Rollback is standard, and ideally there is a contingency plan for the worst case.
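A minimal sketch of batching hosts for the 10%/30%/50%/100% steps listed above; the host list and the rounding rule are illustrative only:

```java
import java.util.Arrays;
import java.util.List;

public class GrayRelease {
    /** Returns the hosts covered once the release reaches the given cumulative fraction. */
    public static List<String> batch(List<String> hosts, double cumulativeFraction) {
        int upTo = (int) Math.ceil(hosts.size() * cumulativeFraction);
        return hosts.subList(0, Math.min(upTo, hosts.size()));
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList("web-01", "web-02", "web-03", "web-04",
                "web-05", "web-06", "web-07", "web-08", "web-09", "web-10");
        // Release order: 10%, then 30%, 50%, 100%, checking monitoring curves
        // and logs between batches before continuing.
        for (double fraction : new double[] {0.10, 0.30, 0.50, 1.00}) {
            System.out.println(fraction + " -> " + batch(hosts, fraction));
        }
    }
}
```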

 

3. Shorten the time: recover from failures quickly

If the goal is to have no failures all year, or to resolve any failure within 5 minutes, those 5 minutes must be used to the fullest. Breaking down the 5 minutes: 1 minute to detect the failure, 3 minutes to locate the service where it occurred, plus the remaining time for recovery. At present our systems can roughly achieve the first two steps, but we are still some distance from the overall five-nines goal, because recovery speed depends on the architecture design, on the speed of communication and the tooling between development, operations, and the DBAs, and on the abilities of the people handling the problem.

Health:

 

Continuously watch how things run online

  • Be familiar with and able to sense changes in the system: to be fast you have to be familiar, and practice makes perfect, so keep watching how the system runs online.

  • Understand the system metrics of the network, server performance, storage, databases, and so on that the application runs on.

  • Be able to monitor the application's execution state and its own QPS, response time, and availability metrics, and be just as familiar with the traffic of the upstream and downstream dependencies.

  • Guarantee stable throughput: if the system handles flow control and fault tolerance well, maintains a stable throughput, and keeps most scenarios available, it can also digest peak traffic quickly, avoiding failures and the repeated traffic spikes they cause.

 

When a failure occurs

A fast detection mechanism

  • Mobile alerting: availability alerts should all go through channels that are guaranteed to reach a person, such as WeChat or SMS;

  • Real-time alerting: at present we can only alert within about 1 minute;

  • Visualized monitoring: our current requirement is to detect a failure within 1 minute and locate it within 3 minutes. That requires good monitoring visualization: instrument the methods in every key service and turn the data into monitoring curves, otherwise locating the exact source of a problem within 3 minutes is difficult. Dianping's monitoring system, Cat, provides these metric changes very well, and on top of it our systems added some more real-time capabilities, such as the second-level QPS curves for the order system that we built ourselves (as shown in Figure 9; a minimal counter sketch follows the figure).

Figure 9: Second-level monitoring curves developed at Dianping
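A minimal sketch of the kind of per-second counter that could feed a curve like Figure 9: count requests per wall-clock second and report each completed second to the monitoring system. The reporting call is a stand-in, and small races are ignored for brevity:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Counts requests per second; each finished second is pushed to monitoring.
public class SecondLevelQpsCounter {
    private final AtomicLong currentSecond = new AtomicLong(System.currentTimeMillis() / 1000);
    private final AtomicInteger count = new AtomicInteger();

    /** Call once for every handled request. */
    public void record() {
        long second = System.currentTimeMillis() / 1000;
        long previous = currentSecond.get();
        if (second != previous && currentSecond.compareAndSet(previous, second)) {
            report(previous, count.getAndSet(0)); // flush the finished second
        }
        count.incrementAndGet();
    }

    private void report(long second, int qps) {
        // Stand-in for pushing the data point to a monitoring system such as Cat.
        System.out.println("second=" + second + " qps=" + qps);
    }
}
```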

     

     

     

An effective recovery mechanism

For example, the four standard moves of operations: roll back, restart, scale out, and take servers out of rotation. When the system is not very complex and traffic is not very high, these solve the problem. Under heavy traffic, however, they are much less effective, so more effort should go into flow control and degraded experiences.

     

4. Lessons learned

  • Treasure every real traffic peak and use it to build a peak-hour traffic model;

  • Treasure every post-mortem of an online failure: go one floor down to solve the problem, and one floor up to look at the problem;

  • Availability is not only a technical problem:
    in the early stage of a system, it is driven mainly by development;
    in the middle stage, mainly by development + DBA + operations;
    in the later stage, by technology + product + operations + DBA;

  • Single points of failure and releases are the biggest enemies of availability.

     

 


Source: blog.csdn.net/Ture010Love/article/details/104374128