Three places and five centers

The hottest news in tech circles right now is probably "An optical fiber cable serving AWS China was dug up, making the services of many companies such as Samsung and Xiaomi unavailable." A cable dug up, again?! Why "again"? Let us look back together:

  • 2019.6.02: Amazon's optical cable was cut, causing network abnormalities in some areas of China
  • 2019.3.23: A construction crew dug up Tencent's optical fiber, affecting more than 100 Tencent games and causing heavy losses
  • 2015.5.27: An optical fiber was cut in Xiaoshan District, Hangzhou, leaving a small number of users unable to use Alipay

Here I list only a few of the cable-digging accidents that involved large companies. Others, such as the radio and television cables and the social security bureau's cables that were dug up, are not listed. If you are interested, search Baidu.

So we find that "no matter how big the company is, it is afraid of the construction crew". Can the construction crew be blamed for this kind of accident? Personally, I feel the responsibility cannot be shifted onto the construction crew, but we will not discuss that here. How can a large company prevent this kind of outage in the future? We can look at Alipay's solution. After all, this veteran went through the same miserable situation back in 2015.

On September 20, 2018, a special technology show was staged at the ATEC main forum of the Hangzhou Yunqi Conference. Hu Xi, deputy CTO of Ant Financial, simulated cutting the optical cables of nearly half of Alipay's servers. After only 26 seconds, Alipay in the simulated environment was fully back to normal.

The solution is "three places and five centers", a computer room architecture in which five computer rooms are deployed across three cities. If one or two of the computer rooms fail, all the traffic of the failed city can be switched to a working computer room. Several other architectures came before "three places and five centers". Let's look at their characteristics one by one.

Initially, we put the application (a very simple read-only application, such as a web page that displays Hello World, with no data storage involved) on a single machine. When that server goes down, the application is unavailable. So we put the application on multiple machines and set aside a computer room inside the company to hold them, so that a single machine going down does not affect the application. But what if the company loses power one day? Then we consider setting up a second computer room elsewhere in the same city, so that the application is deployed in two computer rooms in one city (this is called dual-active in the same city). However, if the city is hit by a natural disaster such as a tsunami, typhoon, or earthquake, neither computer room can be used. So we consider building another computer room in a different city to deploy the application, which makes its availability higher still (this is called multi-active in different places). At this point, whatever happens, the application is basically available (unless the earth is destroyed...).
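To make the gain from each extra copy concrete, here is a rough back-of-the-envelope sketch. It assumes every copy fails independently with the same probability, which is exactly the assumption that breaks when all machines share one power supply or one city, and that is why the copies are spread across computer rooms and cities. The function name and the 99% figure are illustrative, not from any real system.

```python
def combined_availability(single: float, copies: int) -> float:
    """Probability that at least one of `copies` independent deployments is up.

    `single` is the availability of one deployment (0.0 to 1.0).
    Assumes failures are independent, which is a simplification.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The service is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


# Illustrative figure: each deployment is assumed to be up 99% of the time.
for copies in (1, 2, 3):
    print(copies, round(combined_availability(0.99, copies), 6))
```

Under these assumptions, each added location cuts the chance of a total outage by a factor of 100.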

The application considered above is a very simple read-only application, so the deployments in every location can serve users at the same time. If the application involves data storage, however, the deployments in different locations cannot all accept writes at the same time, because data conflicts are likely to occur. So for now we stipulate that only the servers in the company's internal computer room (hereafter called the main computer room) provide data writing services, while the other computer room in the same city and the computer room in the remote city only synchronize data from the main computer room. The role of these two computer rooms is called disaster recovery: because the data is synchronized, the other two computer rooms can still provide services temporarily even if the main computer room loses power. The architecture now looks as follows:

[Figure: The architecture that programmers must understand: three places and five centers (1)]

When the main computer room loses power, user requests go to the Beijing backup computer room. When the Beijing backup computer room also loses power, user requests go to the Shanghai backup computer room. In this architecture, as we just said, only the main computer room serves users and the other two computer rooms act purely as disaster recovery backups. That means the utilization of the backup computer rooms is low, because after all the main computer room does not lose power all the time. Can the utilization of the backup computer rooms be increased? Of course: we can let the Beijing backup computer room take some business requests too, as long as they are less important ones, such as some read requests. The Shanghai backup computer room still receives no requests and serves purely as a disaster recovery backup, because no one can guarantee that unpredictable problems will not appear once a backup computer room starts receiving business requests. The roles of the three computer rooms are now somewhat different:

[Figure: The architecture that programmers must understand: three places and five centers (1)]

This is called two places and three centers. The two places and three centers architecture is currently used by many banks and large enterprises, because the state sets requirements for the disaster recovery capabilities of banks: banking systems whose assets exceed a certain amount must use this architecture to ensure their stability.
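The division of roles among the three computer rooms can be sketched as a small routing function. This is only an illustration under stated assumptions: the room names are hypothetical labels for the main, Beijing backup, and Shanghai backup computer rooms, and sending every read to the Beijing backup room is a simplification of "some less important requests".

```python
from typing import Mapping

# Hypothetical labels for the three computer rooms described above.
MAIN = "main"
BEIJING_BACKUP = "beijing-backup"
SHANGHAI_BACKUP = "shanghai-backup"

# Failover priority: main first, then Beijing backup, then Shanghai backup.
FAILOVER_ORDER = [MAIN, BEIJING_BACKUP, SHANGHAI_BACKUP]


def route(request_kind: str, healthy: Mapping[str, bool]) -> str:
    """Return the room that should serve a 'read' or 'write' request."""
    if request_kind not in ("read", "write"):
        raise ValueError("request_kind must be 'read' or 'write'")

    # Normal operation: reads are offloaded to the Beijing backup room
    # to raise its utilization; writes stay in the main room.
    if request_kind == "read" and healthy[MAIN] and healthy[BEIJING_BACKUP]:
        return BEIJING_BACKUP

    # Otherwise walk the failover order and take the first healthy room.
    # Note: a backup room may hold stale data if replication lagged.
    for room in FAILOVER_ORDER:
        if healthy[room]:
            return room
    raise RuntimeError("no computer room is available")


all_up = {MAIN: True, BEIJING_BACKUP: True, SHANGHAI_BACKUP: True}
print(route("write", all_up), route("read", all_up))
```

The Shanghai room is chosen only when both other rooms are down, which matches its role as a pure disaster recovery backup.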

Does this architecture have any disadvantages? Let us consider its high availability, which here means whether the architecture can process user requests quickly enough. This architecture requires data backup between the centers, and there are only two ways to back up data: asynchronously or synchronously.

  • Maximum performance mode: With asynchronous backup, when a user sends a write request, the result is returned to the user as soon as the data is stored in the production data center, and the data is backed up asynchronously afterwards. But suppose the production data center loses power just as the data is about to be backed up. Can the disaster recovery servers be exposed to serve users at that point? No, because the data in the disaster recovery center is very likely out of date.
  • Maximum protection mode: With synchronous backup, a write request must wait not only for the production data center to store the data but also for the disaster recovery centers to back it up before it returns. As soon as a disaster recovery center has a problem, the data cannot be backed up, so the entire architecture cannot provide service. Availability in this mode is very low.
  • Maximum availability mode: This is the commonly used solution. Under normal circumstances, the maximum protection mode is used, and the production data center monitors the disaster recovery centers. Once a problem is found in a disaster recovery center, the system switches to the maximum performance mode. This ensures that the production data center is not affected by the disaster recovery centers.
  • Three writes and two synchronizations: This is Ali's previous architecture model, in which there are three centers in the same city. Data backup happens at the application layer rather than at the database level. When the application writes data, it writes to the three centers at the same time, and the write counts as successful as long as two centers return success. So even if one of the three centers loses power, the high availability of the whole architecture is not affected. This idea differs from the previous three, and its performance should be much higher.
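The four write strategies above can be compared in a minimal sketch. Everything here is an assumption made for illustration: a center is modelled as a function that returns True when it stored the value, the `pending` list stands in for an asynchronous backup queue, and the maximum availability function degrades per write instead of through real monitoring. It is not how any particular database or Ali's system is implemented.

```python
from typing import Callable, List

# A center is modelled as a function that stores a value and reports success.
Center = Callable[[str], bool]


def make_center(store: List[str], up: bool = True) -> Center:
    """Build a toy center backed by a list; `up=False` simulates a power loss."""
    def write(value: str) -> bool:
        if up:
            store.append(value)
        return up
    return write


def write_max_performance(primary: Center, value: str, pending: List[str]) -> bool:
    """Asynchronous: acknowledge after the primary write, back up later."""
    if not primary(value):
        return False
    pending.append(value)  # backups receive this later, so they may lag
    return True


def write_max_protection(primary: Center, backups: List[Center], value: str) -> bool:
    """Synchronous: succeed only if the primary and every backup stored it."""
    return primary(value) and all(backup(value) for backup in backups)


def write_max_availability(primary: Center, backups: List[Center],
                           value: str, pending: List[str]) -> bool:
    """Synchronous while backups respond; falls back to asynchronous if not."""
    if not primary(value):
        return False
    if not all(backup(value) for backup in backups):
        pending.append(value)  # degrade: the failed backup catches up later
    return True


def three_write_two_sync(centers: List[Center], value: str, required: int = 2) -> bool:
    """Application-level write to all centers; succeed on `required` acks."""
    # Written sequentially for clarity; a real system would write in parallel.
    acks = sum(1 for center in centers if center(value))
    return acks >= required
```

With two healthy centers and one that has lost power, `write_max_protection` fails while `three_write_two_sync` still succeeds, which is the availability difference described in the list.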

We have now introduced two places and three centers. To summarize its shortcomings:

  1. The utilization rate of the disaster recovery centers is not high
  2. After the production data center stops operating, the disaster recovery centers may not hold 100% identical data
  3. The cost is high, but it cannot really achieve the desired high availability

To solve these problems, three places and five centers appeared. Although the name resembles two places and three centers, the functions it provides are completely different. Three places and five centers means five centers in three cities, and its core concept is unitization. Explaining that takes a lot of space, so let's continue in the next article.


Origin blog.csdn.net/luzhensmart/article/details/112526856