[Stability] Day 8: The Fuqianla payment system's road to high availability — two approaches to preventing and eliminating failures

This article is based on a talk shared by an engineer from Fuqianla (付钱拉), Dangdang's payment platform.

Most Internet applications and large enterprise applications are expected to run 7x24 as far as possible, yet completely uninterrupted operation is extremely hard to achieve.

For this reason, application availability is usually measured in terms of three to five nines.

For an application whose features and data volume keep growing, maintaining high availability is not easy. To achieve it, Fuqianla has done a great deal of exploration and practice around eliminating single points of failure, ensuring the availability of the application itself, and coping with transaction volume growth.

Excluding sudden failures of external dependencies, such as network problems or large-scale outages on the third-party payment and banking side, Fuqianla's service availability reaches 99.999%.

This article focuses on how to improve the availability of the application itself.

To improve the availability of an application, the first thing to do is to avoid failures as much as possible, although eliminating them entirely is impossible. The Internet is a place where the "butterfly effect" can strike: a seemingly minor accident with near-zero probability can still happen and be magnified enormously.

Everyone knows RabbitMQ itself is very stable and reliable. In the beginning, Fuqianla ran a single RabbitMQ node, and it never failed in operation, so psychologically we believed it was unlikely to ever go wrong.

Until one day, the physical host running that node broke down due to aging hardware, RabbitMQ could no longer provide service, and the system became unavailable in an instant.

A failure itself is not terrible; what matters most is discovering and resolving it quickly. Fuqianla's requirement for its own systems is to detect failures within seconds, then diagnose and fix them quickly, thereby reducing the negative impact a failure brings.

First, let's briefly review some of the problems we have encountered:

 

Learning from history

  1. A new colleague, while integrating a newly connected third-party channel, overlooked the importance of setting a timeout due to lack of experience. This one small detail caused the queue carrying that third party's transactions to become completely blocked, which in turn affected transactions on other channels.

  2. The system is deployed in a distributed fashion and supports grayscale releases, so the combination of environments and modules is large and complex. Once, when a new module was added, because there were multiple environments and each environment had two nodes, the database connections were no longer sufficient after the new module went live, which affected the functions of other modules.

  3. Another timeout issue: a third party timed out, all of the configured worker threads were exhausted, and no threads were left to handle other transactions.

  4. Third party A provides both authentication and payment interfaces. A sudden surge in our transaction volume on one of those interfaces triggered the DDoS protection of the network operator on A's side. Since a data center's egress IP is usually fixed, the operator mistook the traffic from our egress IP for an attack, and in the end both A's authentication and payment interfaces became unavailable at the same time.

  5. Another database problem, also caused by a sudden increase in transaction volume. A colleague created a sequence with an upper limit of 999,999,999, while the database field storing the value was 32 characters long. When transaction volume was small, the values the system generated fit within the 32-character field; but as volume grew, the number of digits in the sequence quietly increased until 32 characters were no longer enough to store it.

 

Problems like these are very common in Internet systems and tend to stay hidden, so knowing how to avoid them is very important.

To that end, we made changes in the following three areas.

 

Avoid failures as much as possible

Design fault-tolerant systems

Take payment rerouting as an example. Users do not care which channel their money actually goes through; they only care that the payment succeeds. Fuqianla is connected to more than 30 payment channels, so when a payment fails on channel A, it can be dynamically rerouted to channel B or C. This spares the user a failed payment and gives the payment system fault tolerance through rerouting.

Another example is fault tolerance for OOM, similar to what Tomcat does. System memory can always be exhausted; if the application reserves some memory at startup, then when an OOM occurs it can release that reserve and catch the error, avoiding a crash caused by the OOM.
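Below is a minimal sketch of this memory-reservation idea, assuming a simple request handler; the class name, the 8 MB reserve size and the logging are illustrative, not Fuqianla's actual implementation:

```java
// Minimal sketch of the "reserve memory at startup" idea described above.
// Names and sizes are illustrative assumptions.
public class OomGuard {

    // Reserve ~8 MB at startup so there is something to give back under memory pressure.
    private static volatile byte[] reserve = new byte[8 * 1024 * 1024];

    public static void handleRequest(Runnable task) {
        try {
            task.run();
        } catch (OutOfMemoryError e) {
            // Free the reserved block so the cleanup below has room to run,
            // then degrade gracefully instead of letting the error propagate.
            reserve = null;
            System.err.println("OOM caught, degrading request: " + e.getMessage());
            // ... fail the current request fast, trigger an alert, etc.
        }
    }
}
```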

 

Fail fast in certain scenarios: the "fail fast" principle

The fail-fast principle says that when any step of the main flow runs into a problem, the whole flow should be ended quickly and reasonably, rather than waiting until a negative impact has already occurred before dealing with it.

To give a few examples:

  • When Fuqianla starts up, it needs to load configuration and queue information into its cache. If a queue fails to load or is misconfigured, requests will fail later on. The best approach here is to exit the JVM as soon as the data fails to load, so the application never comes up in a state where it cannot serve requests;

  • The longest response time for our real-time transaction processing is 40s. If a transaction exceeds 40s, the system does not keep waiting: it releases the thread in advance, tells the merchant the transaction is still being processed, and delivers the final result later through a notification or an active query from the merchant;

  • We use Redis as a database cache and also for features such as real-time alerting instrumentation and deduplication. If a Redis call takes more than 50ms, the operation is simply abandoned; in the worst case the impact of that operation on a payment is 50ms, which is within the range the system allows (see the sketch after this list).
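As a concrete illustration of the last bullet, here is a minimal sketch of a Redis helper with a 50ms budget, assuming the Jedis client; the host, port and key naming are assumptions:

```java
import redis.clients.jedis.Jedis;

// Minimal sketch of the 50ms Redis budget described above.
// Host, port and key names are illustrative assumptions.
public class BoundedRedis {

    // Connection and socket timeout of 50ms: any call slower than that fails fast.
    private final Jedis jedis = new Jedis("127.0.0.1", 6379, 50);

    /** Best-effort write: give up silently if Redis is slow or down. */
    public void markSeen(String orderId) {
        try {
            jedis.setex("dedup:" + orderId, 3600, "1");
        } catch (Exception e) {
            // Abandon the operation; at worst the payment loses ~50ms here.
        }
    }
}
```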

 

Design the system to protect itself

A system usually depends on third parties, such as databases and third-party interfaces. During development, keep a healthy suspicion of every third party, so that when one of them has a problem it does not trigger a chain reaction that brings the whole system down.

(1) Split the message queue

We offer merchants a variety of payment interfaces: the common quick pay, personal online banking, corporate online banking, refund, cancellation, batch payout, batch withholding, single payout, single withholding, voice payment, balance inquiry, identity authentication, bank card authentication, card-binding authorization and so on. On the channel side there are more than 30 payment channels such as WeChat Pay, ApplePay and Alipay, and hundreds of merchants are connected. Across these dimensions, how do we make sure that different business types, third parties, merchants and payment types do not affect one another? What we did was split the message queues. The figure below shows part of the split of the business message queues:
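As a rough illustration of the split (not Fuqianla's actual topology), the following sketch declares separate RabbitMQ queues per business type and channel, so that a backlog on one channel stays isolated; the queue names and broker host are assumptions:

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

// Minimal sketch of queue splitting: one queue per (business type, channel)
// combination, so a blocked channel cannot back up traffic for the others.
public class QueueSplitDemo {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("mq.internal");          // assumed broker host
        try (Connection conn = factory.newConnection();
             Channel channel = conn.createChannel()) {
            String[] queues = {
                "pay.quickpay.channelA",         // quick pay via channel A
                "pay.quickpay.channelB",         // quick pay via channel B
                "pay.refund.channelA",           // refunds via channel A
                "auth.idcheck.channelC"          // identity authentication via channel C
            };
            for (String q : queues) {
                // durable=true, exclusive=false, autoDelete=false
                channel.queueDeclare(q, true, false, false, null);
            }
        }
    }
}
```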

 

(2) Limit resource usage

Limiting resource usage is the most important point in designing a highly available system, and also the one most easily overlooked. Resources are finite, and using too much of them will naturally bring the application down. To this end we did the following homework:

  • Limit the number of connections

As the distributed deployment scales out, the number of database connections must be planned rather than maximized without limit. Database connections are finite, so they have to be budgeted globally across all modules, especially for the extra demand that scaling out brings.
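A minimal sketch of budgeting connections per node, assuming HikariCP as the pool; the URL, credentials and pool size are illustrative:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

// Minimal sketch of capping connections per module so that
// (modules x nodes x pool size) stays under the database's global limit.
public class DataSourceFactory {
    public static HikariDataSource create() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:mysql://db.internal:3306/pay");
        config.setUsername("pay_app");
        config.setPassword("secret");
        config.setMaximumPoolSize(10);     // hard cap per node, budgeted globally
        config.setMinimumIdle(2);
        config.setConnectionTimeout(3000); // fail fast instead of queueing forever
        return new HikariDataSource(config);
    }
}
```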

 

  • Limit memory usage

Excessive memory usage causes frequent GC and OOM. Excess memory usage mainly comes from the following two sources:

  1. Oversized collections;

  2. Objects that are no longer needed but never released; for example, an object put into a ThreadLocal will not be reclaimed until the thread exits (see the sketch below).
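A minimal sketch of the ThreadLocal case from point 2, showing the explicit cleanup that prevents pooled threads from holding on to per-request objects:

```java
// In a thread pool the worker threads never exit, so values left in a
// ThreadLocal are effectively leaked unless they are removed explicitly.
public class ContextHolder {

    private static final ThreadLocal<byte[]> CONTEXT = new ThreadLocal<>();

    public static void handle(Runnable task) {
        CONTEXT.set(new byte[1024 * 1024]); // per-request context (illustrative)
        try {
            task.run();
        } finally {
            CONTEXT.remove(); // always clean up, or the pooled thread keeps the reference
        }
    }
}
```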

 

  • Limit thread creation

Unrestricted thread creation eventually becomes uncontrollable, especially when threads are created in hidden corners of the code.

When the system's SY (system CPU) value is too high, it means Linux is spending too much time on thread switching. In Java, the main cause is that too many threads have been created and they keep cycling between blocked states (lock waits, IO waits) and running states, producing a large number of context switches.

Besides that, when a Java application creates a thread it also consumes physical memory outside the JVM heap, so too many threads also use too much physical memory. Threads are best created through a thread pool, which avoids the context switching caused by excessive threads.
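A minimal sketch of replacing ad-hoc thread creation with a bounded pool; the pool and queue sizes are illustrative assumptions:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Minimal sketch of a bounded pool instead of unrestricted `new Thread(...)`.
public class WorkerPool {
    public static ThreadPoolExecutor create() {
        return new ThreadPoolExecutor(
                20,                                  // core threads
                50,                                  // hard upper bound on threads
                60, TimeUnit.SECONDS,                // idle threads above core die off
                new ArrayBlockingQueue<>(1000),      // bounded queue, no unbounded backlog
                new ThreadPoolExecutor.CallerRunsPolicy()); // push back when saturated
    }
}
```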

 

  • Limit concurrency

Anyone who has built a payment system knows that some third-party payment companies impose concurrency limits on each merchant. How much concurrency a third party grants is assessed from actual transaction volume, so if we do not control concurrency and simply send every transaction to the third party, all we get back is "please lower your submission rate".

So at both the system design stage and the code review stage, special attention must be paid to keeping concurrency within the range the third party allows.
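A minimal sketch of one way to enforce such a limit with a semaphore sized to the third party's quota; the quota of 10, the timeout and the fallback behavior are assumptions:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Minimal sketch of keeping concurrency toward one third party under its quota.
public class ChannelConcurrencyLimiter {

    private final Semaphore permits = new Semaphore(10); // quota granted by the third party

    public <T> T call(Supplier<T> request, Supplier<T> fallback) throws InterruptedException {
        // Fail fast if no permit frees up quickly, instead of piling up requests.
        if (!permits.tryAcquire(100, TimeUnit.MILLISECONDS)) {
            return fallback.get(); // e.g. queue the payment for retry or reroute it
        }
        try {
            return request.get();  // the actual call to the third-party interface
        } finally {
            permits.release();
        }
    }
}
```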

 

Detect failures promptly

Failures are like enemy raids: they arrive when you least expect them. When the first line of defense, prevention, is breached, how do you raise the second line in time, detect the failure and keep the system available? This is where the alerting and monitoring system comes into play. A car without a dashboard gives you no idea of its speed, fuel level or whether the turn signals are on; no matter how skilled the "old driver" is, it is still quite dangerous. Likewise, a system needs monitoring, ideally raising an alert before danger materializes, so the problem can be solved before the failure actually turns into risk.

 

Real-time alerting system

Without real-time alerts, the uncertainty about the system's running state would be an unquantifiable disaster. The targets for our monitoring system are as follows:

  • Real-time: monitoring at second-level granularity;

  • Comprehensive: covers all system business, with no blind spots;

  • Practical: alerts are divided into multiple severity levels, so the people on duty can conveniently make accurate decisions based on how serious an alert is;

  • Diverse: alerts are delivered in both push and pull modes, including SMS, email and a visual dashboard, so the monitoring staff can spot problems in time.

 

Alerting is generally split into single-machine alerting and cluster alerting, and Fuqianla is deployed as a cluster. Real-time alerting relies mainly on statistical analysis of the real-time instrumentation data from each business system, so the difficulty lies mostly in the data instrumentation and in the analysis system.

 

Instrumentation data

To analyze in real time without affecting the response time of the trading system, we instrument each module in real time through Redis, then aggregate the instrumented data into the analysis system, which analyzes it against rules and raises alerts.
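A minimal sketch of what such Redis instrumentation could look like, reusing the 50ms budget from earlier; the key layout and metric names are assumptions, not Fuqianla's actual scheme:

```java
import redis.clients.jedis.Jedis;

// Minimal sketch of Redis-based instrumentation: each module increments a
// per-minute counter keyed by metric and channel, and the analysis system
// reads those counters to evaluate its alert rules.
public class Metrics {

    private final Jedis jedis = new Jedis("127.0.0.1", 6379, 50); // 50ms budget as above

    public void mark(String metric, String channel) {
        long minute = System.currentTimeMillis() / 60_000;
        String key = "metric:" + metric + ":" + channel + ":" + minute;
        try {
            jedis.incr(key);
            jedis.expire(key, 24 * 3600); // keep one day of counters
        } catch (Exception e) {
            // Instrumentation must never break a payment; drop the data point.
        }
    }
}
```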

 

Analysis system

The hardest part of the analysis system is defining the business alert points: which alerts require immediate action as soon as they fire, and which only need to be watched. Below is a detailed introduction to the analysis system:

1. System runtime architecture

 

2. System runtime flow

 

3. Business monitoring points

Our business monitoring points were accumulated bit by bit through day-to-day operation, and fall into two categories: dispatch alerts and watch alerts.

Dispatch alerts:

  • Network exception alert;

  • Single order timed out without completing;

  • Real-time transaction success rate alert;

  • Abnormal status alert;

  • No return file received alert;

  • Failure notification alert;

  • Abnormal failure alert;

  • Frequently recurring response code alert;

  • Reconciliation mismatch alert;

  • Special status alert;

 

Watch alerts:

  • Abnormal transaction volume alert;

  • Transaction amount exceeding 5,000,000 alert;

  • SMS verification-code fill-in timeout alert;

  • Illegal IP alert;

 

4. Non-business monitoring points

Non-business monitoring points are those monitored from an operations perspective, covering the network, hosts, storage, logs and so on. Specifically:

  • Service availability monitoring:

Collect JVM information such as YoungGC/Full GC counts and durations, heap memory usage and the stacks of the top 10 most time-consuming threads, as well as the length of cache buffers.

  • Traffic monitoring:

Monitoring agents deployed on each server collect traffic data in real time.

  • External system monitoring:

Periodic probing is used to observe whether third parties and the network are stable.

  • Middleware monitoring:

For the MQ consumer queues, RabbitMQ scripts probe and analyze queue depth in real time;

For the database, the xdb plugin is installed to monitor database performance in real time.

  • Real-time log monitoring:

Distributed logs are aggregated through rsyslog and then processed by the analysis system to provide real-time log monitoring and analysis. Finally, a visual page presents the results to users.

  • System resource monitoring:

Zabbix monitors host CPU load, memory usage, inbound and outbound traffic per NIC, read/write throughput per disk, read/write operations per disk (IOPS), disk space usage per disk, and so on.

That is what our real-time monitoring system does, split into business-point monitoring and operations monitoring. Although the system is deployed in a distributed fashion, every alert point responds within seconds. Beyond that, business alert points have one more difficulty: some alerts are not necessarily a problem when they fire occasionally, but a flood of them is, the classic case of quantitative change leading to qualitative change.

Take network exceptions as an example: a single occurrence may just be network jitter, but multiple occurrences mean the network may really have a problem and deserves attention. For network exceptions, our alert samples are as follows (a sketch of the rule evaluation follows the list):

  • Single-channel network exception alert: within 1 minute, 12 consecutive network exceptions occurred on channel A, crossing the alert threshold;

  • Multi-channel network exception alert 1: within 10 minutes, network exceptions occurred 3 times in every consecutive minute, across 3 channels, crossing the alert threshold;

  • Multi-channel network exception alert 2: within 10 minutes, a total of 25 network exceptions occurred across 3 channels, crossing the alert threshold.
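A minimal sketch of how the first rule might be evaluated against the per-minute counters from the instrumentation sketch earlier; only the threshold of 12 comes from the alert sample above, everything else is an assumption:

```java
import redis.clients.jedis.Jedis;

// Minimal sketch of evaluating "12 network exceptions on one channel within
// 1 minute" against per-minute Redis counters. Key layout is an assumption.
public class NetworkExceptionRule {

    private static final int SINGLE_CHANNEL_THRESHOLD = 12; // from the alert sample above

    private final Jedis jedis = new Jedis("127.0.0.1", 6379);

    public boolean shouldAlert(String channel) {
        long minute = System.currentTimeMillis() / 60_000;
        String key = "metric:network_exception:" + channel + ":" + minute;
        String value = jedis.get(key);
        long count = (value == null) ? 0 : Long.parseLong(value);
        return count >= SINGLE_CHANNEL_THRESHOLD;
    }
}
```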

 

Logging and log analysis system

For a large system, recording and analyzing a huge volume of logs every day is genuinely difficult. Fuqianla averages 2 million orders a day, and one transaction flows through more than a dozen modules. Assuming one order produces 30 log entries, you can imagine how enormous the daily log volume is.

Our log analysis serves two purposes: real-time log exception alerting, and providing order trajectories for the operations staff.

 

Real-time log alerting

Real-time log alerting works on all real-time transaction logs: lines containing the keywords Exception or Error are captured in real time and alerted on. The benefit is that any runtime exception in the code is discovered immediately. Our processing flow is to first aggregate the logs with rsyslog, then have the analysis system capture the matching lines in real time and raise alerts (a sketch follows below).
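A minimal sketch of the keyword matching step, assuming the analysis system receives the rsyslog-aggregated lines one at a time; the alert hook is a placeholder:

```java
import java.util.regex.Pattern;

// Minimal sketch of the keyword scan described above: the analysis system
// reads the rsyslog-aggregated stream and flags lines containing Exception/Error.
public class LogKeywordScanner {

    private static final Pattern ALERT = Pattern.compile("Exception|Error");

    public void onLogLine(String line) {
        if (ALERT.matcher(line).find()) {
            raiseAlert(line);
        }
    }

    private void raiseAlert(String line) {
        // In the real system this would go to SMS / email / the dashboard.
        System.err.println("[LOG ALERT] " + line);
    }
}
```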

 

Order trajectory

For a trading system it is essential to know in real time how an order's status flows. Our original approach was to record order trajectories in the database, but after running for a while we found that the surge in order volume made the tables too large to maintain.

Our current approach is for each module to print trajectory logs in a format that mirrors the database table structure. Once all logs are printed, rsyslog aggregates them, and the analysis system captures these standardized logs in real time, parses them, stores them in the database partitioned by day, and presents a visual interface to the operations staff.

The log format specification is as follows:

A brief visualized log trajectory looks like this:

Besides these two features, the logging and analysis system also lets users download and view the request and response messages of transactions.

 

7*24-hour monitoring room

The alert items above reach operators in both push and pull form: push via SMS and email, and pull via report dashboards. In addition, because a payment system matters more than most other Internet systems, we staff a 7*24-hour monitoring room to keep the system safe and stable.

 

Handle failures promptly

After a failure occurs, especially in production, the first thing to do is not to find the root cause but to handle the failure as fast as possible and keep the system available. Our common failures and countermeasures are as follows:

Automatic recovery

For automatic recovery, the failures we see most often are caused by unstable third parties; in that situation the system automatically reroutes, as described above.

 

Service degradation

Service degradation means that when a failure occurs and cannot be fixed quickly, certain functions are switched off to keep the core functions usable. During merchant promotions, if one merchant's transaction volume becomes too large, we adjust that merchant's traffic in real time and degrade its service, so other merchants are not affected. There are many similar scenarios; the specific service degradation features will be covered later in this series.

