High Availability System Essentials

This summary is based on Fang Tengfei's article "How to Build a High Availability System". I organized the article into a corresponding mind map and made some changes to make it easier to understand.


"High availability" usually describes a system that has been specifically designed to minimize downtime and keep its services continuously available. The following are design recommendations for highly available systems:

 

Design Recommendations

  • Reduce single points of failure – To eliminate single points of failure, first identify every single point on all main links of the system: data centers (dual data centers, both in the same city and in a remote location), application servers, DNS servers, SFTP servers, LBS, cache servers, databases, message servers, proxy servers, and so on. If the system calls a partner's service over a dedicated line, consider leasing lines from both China Unicom and China Telecom. Either line can still fail, but the probability that both fail at the same time is much smaller. Prefer software load balancers, with hardware load balancers as the fallback.
  • Reduce dependencies – Reduce dependence on DNS and on remote services. For DNS, you can maintain a local hosts file and use a tool to push the latest domain-name mappings to all servers. For remote services, reduce RPC calls by using a local cache or a near-end service.
  • Limit loops – Avoid infinite loops, which drive CPU utilization to 100%. You can cap the number of iterations of a loop, for example at 1,000.
  • Control traffic – To keep abnormal traffic from affecting the application servers, set traffic limits for specific services, such as QPS, TPS, QPH (total requests per hour), and QPD (total requests per day).
  • Accurate monitoring – Monitor CPU utilization, load, memory, bandwidth, system calls, application errors, PV, UV, and traffic so that memory leaks and faulty code do not take the system down. Alarm thresholds must be set precisely: if memory utilization is normally 50%, configure the alarm at 60% so that a memory leak is detected early, before the application becomes unresponsive.
  • Stateless – Servers must not hold user state. In a cluster environment, for example, do not keep user data in static variables, and do not store user files on a server's local disk for long periods. A stateful server is hard to scale out and becomes a single point of failure.
  • Capacity planning – Evaluate capacity regularly. For example, run stress tests and estimate capacity before a big promotion, then expand capacity as needed.
  • Feature switches – Make certain features switchable. For example, if the message volume is too large for the system to handle, flip a switch so that messages are discarded without processing. When a new feature goes live and causes problems, switch it off.
  • Set timeouts – Set both a connection timeout and a read timeout, and do not make them too large. For internal calls, the connection timeout can be 1 second and the read timeout 3 seconds. For calls to external systems, the connection timeout can be 3 seconds and the read timeout 20 seconds.
  • Retry strategy – When a call to an external service fails, you can retry with an interval that grows on each attempt. Set a maximum number of retries and a retry switch so that retries do not overload the downstream system.
  • Isolation – Isolate applications, modules, data centers, and thread pools. Applications and modules can be separated by priority and by how often they change. For example, put abstract, stable code in one module that is rarely modified and is therefore highly available, and put frequently changing business logic in another module, so that a problem there affects only one line of business. Give different services different thread pools so that low-priority tasks cannot block high-priority tasks, and so that a flood of high-priority tasks cannot starve low-priority tasks forever.
  • Asynchronous calls – Change synchronous calls to asynchronous calls to limit the impact of remote call failures and timeouts on the system.
  • Hotspot cache – Cache hot data to reduce RPC calls. For example, if system B provides a list service, B can ship a client SDK with a near-end cache that periodically fetches data from the server.
  • Cache disaster recovery – When the database is unavailable, serve cached data. Use a hierarchical cache, for example reading the local cache first and then the distributed cache.
  • Hierarchical cache – Read the local cache first, then the distributed cache. Update the local cache in push mode.
  • System classification – Classify systems into levels, such as A, B, and C. A higher-level system must not depend on a lower-level system, and a higher-level system should be more highly available than a lower-level one.
  • Service degradation – If the system responds slowly, turn off some functions to free resources and keep core services running. This requires identifying in advance which services can be degraded. For example, if a sudden flood of messages makes the service unavailable, discard the messages directly, or use flow control to deny service to lower-level systems.
  • Traffic flood storage – When traffic surges, requests can be held back, for example by storing them in a database, and then released at a specified QPS. This protects downstream systems and keeps the service available. The same buffering can be applied automatically when a called system responds slowly or not at all.
  • Service weights – In a cluster environment, high-performing servers can be identified automatically and calls to low-performing servers avoided. For example, lower the weight of a server whose calls time out, and call the servers with higher weights first.
  • Dependency simplification – Reduce dependencies between systems. For example, use a message-driven design in which systems A and B exchange data through a message server, or separate reads and writes through a database: system A writes data to the database and system B reads it. Because the data is stored in the database, B can keep serving for a short time even when A is unavailable.
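  • Elastic scaling – Scale out automatically or manually according to resource utilization. For example, add bandwidth quickly when bandwidth runs short.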
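  • Canary release and rollback – Roll out a new feature to only some of the servers, observe it for a few days, and shift traffic over gradually, so that a problem affects only some customers. When a problem occurs, roll back quickly or take the canary machines offline directly.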
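  • Reduce remote calls – Prefer services inside the local JVM, then services in the same data center, then services in the same city, and finally cross-city services. For example, if A calls B and B calls an Internet system C to fetch data, B can cache the data and set a freshness period for it, reducing B's dependence on C. The configuration center pushes the addresses of registered services to the local machine of each calling system. The parameter center pushes parameter configuration into each system's local memory rather than having the system fetch it from a remote server.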
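  • Circuit breaking – Add a circuit-breaking mechanism: when monitoring detects a sharp rise or fall in online data, interrupt processing promptly to avoid greater impact on the business. For example, when we compute metrics, the computation may be slow but must not be wrong. If a user's metric doubles or drops to zero period-over-period or year-over-year, we consider saving all messages and suspending metric computation for that user.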
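  • Runtime module loading – We turn frequently changing business code into separate business modules and use Java's ClassLoader to load and unload them dynamically at runtime, so that a faulty module can be fixed quickly.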
  • Code scanning – Use tools such as IDEA's code analysis to scan the code and identify bugs, such as null pointer exceptions and circular dependencies.
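  • Automatic backup – Back up programs, system configuration, and data regularly. Linux commands and shell scripts can run the backup policy on a schedule, automatically backing up locally or off-site, so that the system can be redeployed quickly when something goes wrong.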
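  • Online stress testing – Stress-test the system's external services to learn the QPS and TPS each service can sustain, so that reasonably accurate rate limits can be set.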
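
The "Control traffic" item in the list above can be illustrated with a small per-second limiter. This is only a minimal sketch of a fixed-window counter, not the approach of any particular framework. The class name `QpsLimiter` and the injectable clock are assumptions made for the example.

```java
import java.util.function.LongSupplier;

/** Hypothetical fixed-window limiter: allows at most maxPerSecond calls per one-second window. */
final class QpsLimiter {
    private final int maxPerSecond;
    private final LongSupplier clockMillis; // injectable clock, e.g. System::currentTimeMillis
    private long windowStart;               // start of the current one-second window
    private int count;                      // requests admitted in the current window

    QpsLimiter(int maxPerSecond, LongSupplier clockMillis) {
        this.maxPerSecond = maxPerSecond;
        this.clockMillis = clockMillis;
        this.windowStart = clockMillis.getAsLong();
    }

    /** Returns true if the request is admitted, false if it should be rejected. */
    synchronized boolean tryAcquire() {
        long now = clockMillis.getAsLong();
        if (now - windowStart >= 1000) { // a new window begins: reset the counter
            windowStart = now;
            count = 0;
        }
        if (count < maxPerSecond) {
            count++;
            return true;
        }
        return false;
    }
}
```

In production code the limiter would typically be created as `new QpsLimiter(limit, System::currentTimeMillis)`, with the limit taken from stress-test results. A fixed window can admit short bursts at window boundaries, so a token bucket may be preferable where that matters.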
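
The timeout values suggested under "Set timeouts" in the list above could be applied as follows. This is a sketch that uses the JDK's `HttpURLConnection`. The helper class `Timeouts` and the `internal` flag are assumptions made for the example, and real clients (RPC frameworks, connection pools) expose their own timeout settings.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical helper that applies the timeouts suggested in the text. */
final class Timeouts {
    /**
     * Opens (but does not yet connect) an HTTP connection with
     * 1 s connect / 3 s read for internal calls, and
     * 3 s connect / 20 s read for external calls.
     */
    static HttpURLConnection open(String url, boolean internal) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(internal ? 1_000 : 3_000);  // milliseconds
        conn.setReadTimeout(internal ? 3_000 : 20_000);    // milliseconds
        return conn;
    }
}
```

When a timeout expires, `HttpURLConnection` raises a `java.net.SocketTimeoutException`, which the caller can treat as a failed call and hand to its retry or degradation logic.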
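
The "Retry strategy" item in the list above names three ingredients: a growing retry interval, a maximum number of retries, and a retry switch. One possible way to combine them is sketched below. The class `RetryPolicy` and the doubling of the delay are assumptions made for the example, since the text only says that the interval increases.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper: retries a failing call with a delay that doubles each time. */
final class RetryPolicy {
    private final int maxRetries;          // retries allowed after the first attempt
    private final long initialDelayMillis; // wait before the first retry
    private final AtomicBoolean enabled = new AtomicBoolean(true); // the retry switch

    RetryPolicy(int maxRetries, long initialDelayMillis) {
        this.maxRetries = maxRetries;
        this.initialDelayMillis = initialDelayMillis;
    }

    /** Turns retrying on or off at runtime, e.g. to protect a struggling downstream system. */
    void setEnabled(boolean on) {
        enabled.set(on);
    }

    /** Runs the task; on failure waits, doubles the delay, and tries again up to maxRetries times. */
    <T> T call(Callable<T> task) throws Exception {
        long delay = initialDelayMillis;
        for (int attempt = 0; ; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                // Give up if retries are switched off or the budget is exhausted.
                if (!enabled.get() || attempt >= maxRetries) {
                    throw e;
                }
                Thread.sleep(delay);
                delay *= 2;
            }
        }
    }
}
```

With `new RetryPolicy(3, 200)`, a persistently failing call is attempted four times in total, with waits of 200 ms, 400 ms, and 800 ms in between. Calling `setEnabled(false)` makes the next failure propagate immediately.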
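
The read path described under "Cache disaster recovery" and "Hierarchical cache" in the list above (local cache first, then the distributed cache, with push-mode updates of the local cache) can be sketched as follows. The class `TieredCache` is hypothetical. A plain `Map` stands in for a distributed cache client and a `Function` stands in for the database query, both as assumptions made for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical two-level cache: local memory first, then a distributed cache, then the database. */
final class TieredCache<K, V> {
    private final Map<K, V> local = new ConcurrentHashMap<>();
    private final Map<K, V> distributed;  // stand-in for a remote cache client
    private final Function<K, V> database; // stand-in for the database query

    TieredCache(Map<K, V> distributed, Function<K, V> database) {
        this.distributed = distributed;
        this.database = database;
    }

    /** Reads local, then distributed, then the database, filling the faster levels on the way back. */
    V get(K key) {
        V value = local.get(key);
        if (value != null) {
            return value;
        }
        value = distributed.get(key);
        if (value == null) {
            value = database.apply(key); // only reached when both cache levels miss
            if (value != null) {
                distributed.put(key, value);
            }
        }
        if (value != null) {
            local.put(key, value);
        }
        return value;
    }

    /** Push-mode update: the data owner pushes a fresh value into the local cache. */
    void push(K key, V value) {
        local.put(key, value);
    }
}
```

Because a cached key never reaches the `database` function, reads of already cached data keep working while the database is unavailable. Expiry and eviction are left out of this sketch.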
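
The metric example under "Circuit breaking" in the list above (suspend a user's metric computation when the value doubles or drops to zero) might look like the sketch below. The class `MetricBreaker`, its method names, and the exact thresholds are assumptions made for the example. Saving the pending messages, which the text also mentions, is omitted.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical breaker that suspends metric computation for users with suspicious jumps. */
final class MetricBreaker {
    private final Set<String> suspended = ConcurrentHashMap.newKeySet();

    /**
     * Returns true if the new value looks plausible and may be published.
     * Returns false, and suspends the user, if the value at least doubled
     * or dropped to zero compared with the previous period.
     */
    boolean accept(String userId, double previous, double current) {
        if (suspended.contains(userId)) {
            return false; // breaker already open for this user
        }
        boolean droppedToZero = previous > 0 && current == 0;
        boolean doubled = previous > 0 && current >= 2 * previous;
        if (droppedToZero || doubled) {
            suspended.add(userId);
            return false;
        }
        return true;
    }

    /** Closes the breaker again once someone has checked the data. */
    void resume(String userId) {
        suspended.remove(userId);
    }
}
```

The breaker stays open until `resume` is called, which reflects the idea that a slow result is acceptable but a wrong one is not.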

