Internet Architecture -- High Availability


 

 

What is high availability

 

       High availability (HA) is one of the factors that must be considered when designing a distributed system architecture. It usually means reducing, through design, the time during which the system cannot provide service.

 

 

High availability standard

 

  • If the system can always provide service, we say its availability is 100%.
  • If the system cannot provide service for 1 out of every 100 time units it runs, we say its availability is 99%.
  • Many companies set a high availability target of four nines, or 99.99%, which allows about 0.876 hours (roughly 52.6 minutes) of downtime per year. An annual downtime of 8.76 hours corresponds to three nines, or 99.9%.
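
The arithmetic behind these targets can be checked with a short sketch. It assumes a 365-day year, and the function name `annual_downtime_hours` is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the downtime per year, in hours, allowed by an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return HOURS_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        hours = annual_downtime_hours(target)
        print(f"{target}% -> {hours:.3f} hours ({hours * 60:.1f} minutes) per year")
```

Each extra nine divides the allowed downtime by ten: 99.9% leaves 8.76 hours per year, and 99.99% leaves about 0.876 hours.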

 

High availability example

 

       Baidu's search homepage is recognized in the industry as a system with excellent high availability guarantees. People even judge "network connectivity" by whether www.baidu.com can be accessed. Baidu's highly available service has left people with the impression that "if the network is up, Baidu can be accessed" and "if Baidu won't open, the network must be down". This is actually the highest praise for Baidu's HA.

 

 

 

 

How to ensure the high availability of the system

 

  • Clustering (redundancy to remove single points)
  • Automatic failover

 

Clustering

 

       A single point is often the biggest risk to high system availability, and we should try to avoid single points during system design. Methodologically, the principle behind guaranteeing high availability is "clustering", or "redundancy": if there is only a single point, the service is affected as soon as it fails; if there is a redundant backup, the backup can take over.

 

 

Automatic failover

 

       Redundancy alone is not enough. If manual intervention is required to restore the system every time a failure occurs, the time during which the system cannot provide service will inevitably increase. Therefore, high availability is usually achieved through "automatic failover".

 

 

 

 

Common Internet Layered Architecture

 



 

A common Internet distributed architecture is divided into the following layers:

    1. Client layer: the typical caller is a browser or a mobile app

    2. Reverse proxy layer: the system entry point, a reverse proxy

    3. Site application layer: implements the core application logic and returns HTML or JSON

    4. Service layer: present if the system has been split into services; otherwise the site application layer is enough

    5. Data-cache layer: the cache accelerates access to storage

    6. Data-database layer: the database persists the data

 

The high availability of the entire system is achieved by combining clustering and automatic failover at every layer.

 

 

 

 

High availability of [client layer -> reverse proxy layer]

 

Clustering

 

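       Take nginx as an example: there are two or more nginx instances. One serves online traffic and the other is a redundant standby that guarantees high availability. A common practice is to use keepalived for liveness detection, with the instances serving behind the same virtual IP.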

 

Automatic failover

 

       When nginx goes down, keepalived detects it and fails over automatically, moving traffic to shadow-nginx. Because the same virtual IP is used, the switchover is transparent to the caller.
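
The idea of a virtual IP that follows the healthy instance can be modeled in a few lines. This is only a toy sketch: keepalived itself implements this with VRRP and does much more, and the `Node` class, the priority values, and `elect_vip_owner` are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    priority: int          # higher priority wins the virtual IP
    healthy: bool = True   # result of the latest liveness probe


def elect_vip_owner(nodes: List[Node]) -> Optional[Node]:
    """Return the healthy node with the highest priority, or None if all are down."""
    candidates = [n for n in nodes if n.healthy]
    return max(candidates, key=lambda n: n.priority) if candidates else None


if __name__ == "__main__":
    nginx = Node("nginx", priority=100)
    shadow = Node("shadow-nginx", priority=90)
    cluster = [nginx, shadow]
    print(elect_vip_owner(cluster).name)   # nginx holds the virtual IP
    nginx.healthy = False                  # the probe detects that nginx is down
    print(elect_vip_owner(cluster).name)   # shadow-nginx takes over the virtual IP
```

Callers only ever address the virtual IP, so they do not need to know which instance currently owns it.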



 

 

 

High availability of [reverse proxy layer -> site layer]

 

Clustering

 

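       Suppose the reverse proxy layer is nginx. Multiple web backends can be configured in nginx.conf, and nginx can detect whether each backend is alive. A cloud platform (for example, AWS) can also be used in place of nginx for load balancing, and it can monitor traffic as well.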



 
 

Automatic failover

 

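       When a web-server goes down, nginx detects it and fails over automatically, moving traffic to the other web-servers. The whole process is handled by nginx and is transparent to the caller.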
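
The behavior described here can be sketched as a round-robin dispatcher that stops using a backend once a request to it fails. This is a simplified model, not nginx's actual implementation: the `RoundRobinProxy` class and the `send` callback are hypothetical, and unlike a real proxy this sketch never re-probes a failed backend to bring it back.

```python
from typing import Callable, List, Set


class RoundRobinProxy:
    """Toy reverse proxy: round-robin over backends, skipping ones seen failing."""

    def __init__(self, backends: List[str], send: Callable[[str, str], str]):
        self.backends = backends
        self.send = send              # hypothetical transport: send(backend, request)
        self.down: Set[str] = set()   # backends currently marked as failed
        self._next = 0

    def handle(self, request: str) -> str:
        # Try each backend at most once, starting from the round-robin cursor.
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if backend in self.down:
                continue
            try:
                return self.send(backend, request)
            except ConnectionError:
                self.down.add(backend)   # passive detection: mark it and move on
        raise RuntimeError("no healthy web-server available")


if __name__ == "__main__":
    dead = {"web-2"}

    def fake_send(backend: str, request: str) -> str:
        if backend in dead:
            raise ConnectionError(backend)
        return f"{backend} served {request}"

    proxy = RoundRobinProxy(["web-1", "web-2", "web-3"], fake_send)
    for i in range(4):
        print(proxy.handle(f"req-{i}"))
```

The request that first hits the dead backend is retried on the next one, so the caller still gets a response.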



 
 

 

 

High availability of [site layer -> service layer]

 

Clustering

 

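       Service-layer clustering is achieved through redundancy of the service layer. The "service connection pool" establishes multiple connections to the downstream services, and each request "randomly" picks a connection to access a downstream service.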



 

 

Automatic failover

 

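       When a service goes down, the service-connection-pool detects it and fails over automatically, moving traffic to the other services. The whole process is handled by the connection pool and is transparent to the caller (which is why the service connection pool in the RPC-client is such an important basic component).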
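
A minimal sketch of such a pool follows, assuming that a failed call surfaces as a `ConnectionError`. The `ServiceConnectionPool` class and the `call` callback are hypothetical stand-ins for a real RPC client, which would also reconnect to recovered services.

```python
import random
from typing import Callable, List


class ServiceConnectionPool:
    """Toy pool: pick a random live connection, drop it on failure and retry."""

    def __init__(self, addresses: List[str], call: Callable[[str, str], str]):
        self.live = list(addresses)   # addresses believed to be healthy
        self.call = call              # hypothetical RPC: call(address, payload)

    def request(self, payload: str) -> str:
        while self.live:
            address = random.choice(self.live)
            try:
                return self.call(address, payload)
            except ConnectionError:
                self.live.remove(address)   # failover: stop using the dead service
        raise RuntimeError("all downstream services are unavailable")
```

Because the retry happens inside `request`, the site layer sees a normal response even when one of the service instances has just died.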


 


 

 

 

High availability of [service layer -> cache layer]

 

Clustering: method 1

 

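       With the logic encapsulated in the client, the service performs dual reads or dual writes against the cache. Automatic failover then does not need to be considered, because several cache instances already form a cluster. The cost is that reads and writes take longer.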
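
One way to picture the client-side wrapper is the sketch below, where plain dictionaries stand in for the cache instances and `None` marks a crashed one. The `DualCacheClient` class is hypothetical; a real client would talk to the cache servers over the network and handle timeouts.

```python
from typing import Dict, List, Optional


class DualCacheClient:
    """Toy client wrapper: write to every cache, read from the first that answers."""

    def __init__(self, caches: List[Dict[str, str]]):
        # Each dict stands in for one cache instance; None marks a crashed one.
        self.caches: List[Optional[Dict[str, str]]] = list(caches)

    def set(self, key: str, value: str) -> None:
        for cache in self.caches:
            if cache is not None:
                cache[key] = value   # dual write: every live instance gets a copy

    def get(self, key: str) -> Optional[str]:
        for cache in self.caches:
            if cache is not None and key in cache:
                return cache[key]    # first live instance holding the key wins
        return None                  # cache miss on every instance
```

The extra latency mentioned above comes from the loops: every write touches all instances, and a read may have to try more than one.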


 

Clustering: method 2

 

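       The cache layer's high availability can also be addressed with a cache cluster that supports master-slave synchronization. redis natively supports master-slave synchronization, and redis also officially provides the sentinel mechanism for redis liveness detection.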


Automatic failover

 

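       When the redis master goes down, sentinel detects it and notifies callers to access the new redis master. The whole process is completed by sentinel and the redis cluster working together, and it is transparent to the caller.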
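
The division of labor can be illustrated with a toy model in which callers ask a monitor for the current master instead of hard-coding its address. This is not how redis sentinel is implemented (the real mechanism involves several sentinels agreeing on a failure and electing a new master); the `ToySentinel` class and its methods are hypothetical.

```python
from typing import Dict, List, Optional


class ToySentinel:
    """Toy monitor: tracks which redis instance is master and promotes a slave."""

    def __init__(self, master: str, slaves: List[str]):
        self.master: Optional[str] = master
        self.slaves = list(slaves)
        self.alive: Dict[str, bool] = {name: True for name in [master, *slaves]}

    def report_down(self, name: str) -> None:
        """Record a failed liveness probe and fail over if the master is affected."""
        self.alive[name] = False
        if name == self.master:
            live_slaves = [s for s in self.slaves if self.alive[s]]
            # Promote the first live slave; a real election is more involved.
            self.master = live_slaves[0] if live_slaves else None
            if self.master is not None:
                self.slaves.remove(self.master)

    def current_master(self) -> Optional[str]:
        """Callers ask here for the master address instead of hard-coding it."""
        return self.master
```

Since the caller resolves the master through the monitor on each lookup, a promotion changes where requests go without any change on the caller's side.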


Allowing "cache miss"

 

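       Having covered cache high availability, one more point is worth making: the business does not necessarily require the cache to be "highly available". More often the cache is used to "accelerate data access": part of the data is placed in the cache, and if the cache goes down or misses, the data can still be fetched from the backend database.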
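
This read path can be sketched as a function that treats a cache failure the same way as a miss. The function name `get_with_fallback` and its callback parameters are hypothetical, and the sketch assumes that an unreachable cache raises `ConnectionError`.

```python
from typing import Callable, Optional


def get_with_fallback(key: str,
                      cache_get: Callable[[str], Optional[str]],
                      cache_set: Callable[[str, str], None],
                      db_get: Callable[[str], Optional[str]]) -> Optional[str]:
    """Read from the cache first; on a miss or a cache failure, go to the database."""
    try:
        value = cache_get(key)
        if value is not None:
            return value                 # cache hit
    except ConnectionError:
        return db_get(key)               # cache is down: serve from the database only
    value = db_get(key)                  # cache miss: load from the database
    if value is not None:
        cache_set(key, value)            # repopulate the cache for later reads
    return value


if __name__ == "__main__":
    db = {"user:1": "alice"}
    cache = {}
    print(get_with_fallback("user:1", cache.get, cache.__setitem__, db.get))
    print(cache)   # the value has been copied into the cache
```

The database must be able to absorb the extra load while the cache is unavailable, which is the trade-off of not making the cache itself highly available.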

 

 

Cache architecture recommendations for handling "cache miss"

 

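    1. Wrap the kv cache as a service cluster with a proxy in front of it (the proxy itself can be made highly available through cluster redundancy). Behind the proxy, the cache is split horizontally by access key into several instances, and no individual instance is made highly available.

    2. Mask failed cache instances: when a horizontally split instance goes down, the proxy layer simply returns a cache miss, so the cache failure is again transparent to the caller. When the number of key-split instances shrinks, re-hashing is not recommended, because it easily leads to inconsistent cached data.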
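
The two recommendations can be combined in a small sketch: keys are routed by a hash over a fixed number of shards, and a dead shard answers with a miss instead of an error. The `CacheProxy` class is hypothetical, dictionaries stand in for the cache instances, and CRC32 is an arbitrary choice of hash for the example.

```python
import zlib
from typing import Dict, List, Optional


class CacheProxy:
    """Toy proxy: route keys to a fixed set of shards; a dead shard means a miss."""

    def __init__(self, shard_count: int):
        # Each dict stands in for one cache instance; None marks a crashed one.
        self.shards: List[Optional[Dict[str, str]]] = [{} for _ in range(shard_count)]

    def _index(self, key: str) -> int:
        # The modulus is the original shard count, so keys never move (no re-hash).
        return zlib.crc32(key.encode("utf-8")) % len(self.shards)

    def get(self, key: str) -> Optional[str]:
        shard = self.shards[self._index(key)]
        if shard is None:
            return None          # shard is down: report a cache miss, not an error
        return shard.get(key)

    def set(self, key: str, value: str) -> None:
        shard = self.shards[self._index(key)]
        if shard is not None:
            shard[key] = value   # writes to a dead shard are silently skipped

    def mark_down(self, index: int) -> None:
        self.shards[index] = None
```

Keeping the slot of a dead shard in the list is what avoids the re-hash: keys that belonged to other shards still map to the same place, so only the keys of the failed shard turn into misses.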


 

 

 

High availability of [service layer -> database layer]

 

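       Most Internet technology stacks use a "master-slave synchronization, read-write separation" architecture at the database layer, so high availability of the database layer is further divided into: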

 

  • High availability of the read database
  • High availability of the write database

 

High availability of the read database

 

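       This is achieved through redundancy of the read databases. Since the read databases are redundant, there are generally at least 2 slaves. The "database connection pool" establishes multiple connections to the read databases, and each request is routed to one of them.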


 

   
Automatic failover

 

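       When a read database goes down, the db-connection-pool detects it and fails over automatically, moving traffic to the other read databases. The whole process is handled by the connection pool and is transparent to the caller (which is why the database connection pool in the DAO is such an important basic component).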
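
Read-write separation with failover on the read side can be sketched as a router that sends writes to the master and reads to any slave that still responds. The `ReadWriteRouter` class and the `execute` callback are hypothetical; a real DAO would use a database driver and would also re-admit recovered slaves.

```python
import random
from typing import Callable, List


class ReadWriteRouter:
    """Toy DAO-side router: writes go to the master, reads to a live slave."""

    def __init__(self, master: str, slaves: List[str],
                 execute: Callable[[str, str], str]):
        self.master = master
        self.live_slaves = list(slaves)
        self.execute = execute   # hypothetical driver call: execute(host, sql)

    def write(self, sql: str) -> str:
        return self.execute(self.master, sql)

    def read(self, sql: str) -> str:
        while self.live_slaves:
            host = random.choice(self.live_slaves)
            try:
                return self.execute(host, sql)
            except ConnectionError:
                self.live_slaves.remove(host)   # failover to another read database
        raise RuntimeError("no read database available")
```

Note that `write` has no fallback in this sketch, which is why the write database needs its own high availability mechanism.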


 

 

High availability of the write database

 

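       This is achieved through redundancy of the write database. Two mysql instances can be set up with dual-master synchronization: one serves online traffic and the other is a redundant standby that guarantees high availability. A common practice is keepalived liveness detection, with both instances serving behind the same virtual IP.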


 

 

Automatic failover

 

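       When the write database goes down, keepalived detects it and fails over automatically, moving traffic to shadow-db-master. Because the same virtual IP is used, the switchover is transparent to the caller.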


 

 

 

Summary

 

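       High availability (HA) is one of the factors that must be considered when designing a distributed system architecture. It usually means reducing, through design, the time during which the system cannot provide service.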

 

       Methodologically, high availability is achieved through redundancy plus automatic failover.

 

The steps are as follows:

 

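    1. High availability from the [client layer] to the [reverse proxy layer] is achieved through redundancy of the reverse proxy layer. A common practice is keepalived + virtual IP for automatic failover.

    2. High availability from the [reverse proxy layer] to the [site layer] is achieved through redundancy of the site layer. A common practice is liveness detection and automatic failover between nginx and the web-servers.

    3. High availability from the [site layer] to the [service layer] is achieved through redundancy of the service layer. A common practice is to rely on the service-connection-pool for automatic failover.

    4. High availability from the [service layer] to the [cache layer] is achieved through redundancy of the cached data. A common practice is dual reads and dual writes in the cache client, or master-slave data synchronization in the cache cluster with sentinel for liveness detection and automatic failover. In many business scenarios the cache has no high availability requirement, and wrapping the cache as a service can hide the underlying complexity from the caller.

    5. High availability from the [service layer] to the [database "read"] is achieved through redundancy of the read databases. A common practice is to rely on the db-connection-pool for automatic failover.

    6. High availability from the [service layer] to the [database "write"] is achieved through redundancy of the write database. A common practice is keepalived + virtual IP for automatic failover.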

 

 

 

 

 

