Thoughts on architecture design after the AWS China optical cable was cut

Yesterday, the hottest news in tech circles was surely "the AWS China optical cable was dug up, leaving the services of Samsung, Xiaomi and many other companies unavailable."

An optical cable was dug up again? Why "again"? Let us look back together:

2019.6.02: Amazon's optical cable was dug up, and the network was abnormal in some areas of the country.
2019.3.23: A construction crew cut Tencent's optical fiber, affecting more than 100 Tencent games and causing heavy losses.
2015.5.27: An underground optical fiber in Xiaoshan District, Hangzhou, was dug up and cut, leaving a small number of users unable to use Alipay.

I have listed only a few cable-cut incidents involving major companies. Others, such as cut radio and television cables or the social security bureau's cables, are not listed. If you are interested, search Baidu.

So we find that "no matter how big the company is, it fears the construction crew". Can the construction crew be blamed for this kind of accident? Personally, I feel the responsibility cannot be pushed onto the construction crew, but we will not discuss that here. The real question is: as a large company, how can we guard against this in the future?

We can take a look at Alipay's solution. After all, this veteran went through exactly this misery in 2015.

On September 20, 2018, a special technology show was staged at the ATEC main forum of the Yunqi Conference in Hangzhou. Hu Xi, deputy CTO of Ant Financial, simulated cutting the optical cables of nearly half of Alipay's servers. Only 26 seconds later, Alipay in the simulated environment had completely returned to normal.

This solution is "three places and five centers". It is a data center architecture in which five data centers are deployed across three cities. Once one or two of the data centers fail, all the traffic of the failed city can be switched to a healthy data center.
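The traffic switch can be pictured with a small sketch. It is only an illustration: the center names, the 2-2-1 split across cities, and the even redistribution of traffic are all assumptions made here, not details of Alipay's actual system.

```python
# Hypothetical layout: five data centers spread over three cities.
CENTERS = {
    "city_a_1": "city_a",
    "city_a_2": "city_a",
    "city_b_1": "city_b",
    "city_b_2": "city_b",
    "city_c_1": "city_c",
}


def traffic_shares(failed):
    """Return the share of traffic each healthy center receives.

    Assumption: traffic is split evenly across the healthy centers,
    and failed centers receive nothing.
    """
    healthy = [name for name in CENTERS if name not in failed]
    if not healthy:
        raise RuntimeError("no healthy data center left")
    share = 1.0 / len(healthy)
    return {name: share for name in healthy}


# Both centers in city_a fail; the remaining three take over all traffic.
shares = traffic_shares({"city_a_1", "city_a_2"})
print(sorted(shares))
```

A real system would also have to move the data that the failed city was responsible for, which this sketch ignores.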

There were many other architectures before "three places and five centers". Let's look at their characteristics one by one.


The evolution of disaster recovery

At the beginning, we put the application (a very simple read-only application, such as a web page displaying Hello World, with no data storage to consider) on one machine. When that server goes down, our application is unavailable.

Therefore, we consider running the application on multiple machines and setting up a separate server room inside the company to house them, so that a single machine going down does not affect the application.

But what if your company loses power one day? In that case we consider setting up another data center elsewhere in the same city. The application is now deployed in two data centers in the same city (this is called same-city active-active).

However, if your city is hit by a natural disaster such as a tsunami, typhoon or earthquake and both data centers become unavailable, we then consider building another data center in a different city to deploy the application. Availability is now even higher (this is called remote multi-active).

So far, whatever happens, our application is basically available (unless the earth is destroyed...)

The application considered above is a very simple read-only one, so the deployments in every location can serve requests at the same time. If the application involves data storage, however, the deployments in different locations cannot all accept writes at the same time, because data conflicts are likely to occur. So for now we stipulate that only the servers in the company's internal data center (hereinafter the primary data center) accept writes, while the other data center in the same city and the data center in the remote city only synchronize data from the primary data center. The role of these two data centers is called disaster recovery: because the data is synchronized, the other two data centers can still provide service temporarily even if the primary data center loses power. The current architecture looks as follows:

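[Figure: the primary data center with the Beijing backup data center and the Shanghai backup data center synchronizing data from it]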

When the primary data center loses power, user requests go to the Beijing backup data center. When the Beijing backup data center also loses power, user requests go to the Shanghai backup data center.
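The failover order just described can be sketched in a few lines. The data center names and the health map are hypothetical; in practice the health information would come from monitoring.

```python
# Failover order from the text: primary first, then Beijing, then Shanghai.
FAILOVER_ORDER = ["primary", "beijing_backup", "shanghai_backup"]


def pick_data_center(is_up):
    """Return the first data center in failover order that is still up.

    `is_up` maps a data center name to True (healthy) or False (down);
    a missing name is treated as down.
    """
    for name in FAILOVER_ORDER:
        if is_up.get(name, False):
            return name
    raise RuntimeError("all data centers are down")


# The primary has lost power, so requests go to the Beijing backup.
print(pick_data_center(
    {"primary": False, "beijing_backup": True, "shanghai_backup": True}
))  # prints "beijing_backup"
```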

In this architecture, as we just said, only the primary data center serves external requests, and the other two data centers act purely as disaster recovery backups. This means the backup data centers are poorly utilized, because the primary data center does not lose power all the time. Can the utilization of the backup data centers be increased? Of course. We can let the Beijing backup data center handle some business requests as well, though perhaps less critical ones such as read requests. The Shanghai backup data center still receives no requests and serves purely as a disaster recovery backup, because no one can guarantee that serving requests from a backup data center will not cause other unpredictable problems. The three data centers now have somewhat different roles:

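[Figure: the three data centers and their roles: the primary data center, the Beijing backup data center handling some read requests, and the Shanghai backup data center used only for disaster recovery]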

This is called two places and three centers.
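The division of roles can be written down as a small routing table. This is a sketch only: the center names and the exact set of request kinds each center accepts are assumptions chosen to mirror the description above.

```python
# Hypothetical role table for two places and three centers.
ROLES = {
    "production": {"read", "write"},  # primary data center: all requests
    "beijing_dr": {"read"},           # same-city backup: some read requests
    "shanghai_dr": set(),             # remote backup: standby only
}


def candidates(kind):
    """List the data centers allowed to serve a request of the given kind."""
    return [name for name, allowed in ROLES.items() if kind in allowed]


print(candidates("write"))  # prints ['production']
print(candidates("read"))   # prints ['production', 'beijing_dr']
```

Keeping the remote center out of every list is what makes it a pure standby: it only becomes a candidate after an explicit failover.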

The two-places-three-centers architecture is currently used by many banks and large enterprises, because the state has set requirements for the disaster recovery capabilities of banks: banks whose assets exceed a certain amount must implement the two-places-three-centers architecture to ensure the stability of the banking system.

So does this architecture have any disadvantages? Let us consider its high availability, that is, whether the architecture can always process user requests quickly enough.

This architecture requires data backup between the centers, and there are only two ways to back up data: asynchronously or synchronously. They lead to the following modes:

Maximum performance mode: Backup is asynchronous. When a user sends a write request, the result is returned to the user as soon as the data is stored in the production data center, and the data is backed up asynchronously afterwards. However, if the production data center loses power just before that asynchronous backup happens, can the disaster recovery center be exposed to serve users? No, because the data in the disaster recovery center is very likely out of date.
Maximum protection mode: Backup is synchronous. When a user sends a write request, the result is returned only after the production data center has stored the data and the disaster recovery centers have backed it up as well. As soon as a disaster recovery center has a problem, the backup cannot complete, so the entire architecture cannot serve external requests. Availability is very low.
Maximum availability mode: This is the commonly used solution. Under normal circumstances it works like maximum protection mode, while the production data center monitors the disaster recovery centers. Once a problem is found in a disaster recovery center, it switches to maximum performance mode. This ensures that the production data center is not affected by the disaster recovery centers.
Three writes and two synchronizations: This is Alibaba's earlier architecture model, with three centers in the same city. Data backup happens at the application layer rather than at the database layer. When the application writes data, it writes to all three centers at the same time, and the write counts as successful as soon as two centers return success. So even if one of the three centers loses power, the high availability of the whole architecture is not affected. This idea differs from the previous three modes, and performance should be considerably higher, since a write does not have to wait for the slowest center.
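The four write strategies above can be compared side by side in a minimal in-memory sketch. Everything here is an assumption made for illustration: the Replica class, the function names, and the backlog list that stands in for an asynchronous replication queue. The application-level write also loops over the centers one by one, whereas the text describes writing to them at the same time.

```python
class Replica:
    """Hypothetical in-memory stand-in for one data center's storage."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.data = {}

    def write(self, key, value):
        if not self.up:
            return False
        self.data[key] = value
        return True


def write_max_protection(primary, standbys, key, value):
    """Synchronous: succeed only if the primary and every standby store the data."""
    if not primary.up or not all(s.up for s in standbys):
        return False  # any center down means the write is refused
    return primary.write(key, value) and all(s.write(key, value) for s in standbys)


def write_max_performance(primary, key, value, backlog):
    """Asynchronous: acknowledge once the primary stores the data."""
    if not primary.write(key, value):
        return False
    backlog.append((key, value))  # shipped to the standbys later
    return True


def write_max_availability(primary, standbys, key, value, backlog):
    """Synchronous while all standbys are healthy, asynchronous otherwise."""
    if all(s.up for s in standbys):
        return write_max_protection(primary, standbys, key, value)
    return write_max_performance(primary, key, value, backlog)


def write_three_ack_two(replicas, key, value, required=2):
    """Application-level write to every center; succeed once `required` acknowledge."""
    acks = sum(1 for r in replicas if r.write(key, value))
    return acks >= required


production = Replica("production")
dr1 = Replica("dr1")
dr2 = Replica("dr2", up=False)  # one disaster recovery center is down
backlog = []

print(write_max_protection(production, [dr1, dr2], "k", 1))             # False
print(write_max_availability(production, [dr1, dr2], "k", 1, backlog))  # True
print(write_three_ack_two([production, dr1, dr2], "k", 2))              # True
```

With one center down, the synchronous mode refuses the write, while the other two strategies still accept it. The price of the asynchronous fallback is the backlog: until it is drained, the standbys hold stale data.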

We have now introduced two places and three centers. To summarize its shortcomings:

The utilization rate of the disaster recovery centers is not high.
After the production data center stops operating, the disaster recovery centers may not hold 100% identical data.
The cost is high, yet the expected high availability cannot truly be achieved.

To solve these problems, three places and five centers emerged. Although the name resembles two places and three centers, it provides completely different capabilities.

Three places and five centers means five data centers in three cities. The core concept behind it is unitization, which takes a lot of space to explain, so we will continue in the next article. Friends are also welcome to join my Java exchange group 901439810, where a wealth of architect materials is available and experienced members answer questions and share their knowledge about Java.

Origin blog.csdn.net/Lubanjava/article/details/90755548