"Learning Architecture from Scratch", Part 8: High Availability with Multi-Site Active-Active and Interface-Level Failures

Disclaimer: This is an original article by the blogger, published under the CC 4.0 BY-SA license. When reposting, please include the original source link and this statement.
Original link: https://blog.csdn.net/qq_41594698/article/details/102698035

1 Ensuring business high availability: multi-site active-active architecture

High-availability computing architectures and high-availability storage architectures are, by nature, designed for scenarios in which some of the servers fail, and they ensure that the system can keep providing service.
In some extreme scenarios, however, all servers may fail at once. Typical examples are a data center power outage, a data center fire, an earthquake, or a flood. Such events can take down every server, or paralyze the business as a whole. Even if a backup exists elsewhere, restoring every business system from the backup until it serves traffic normally takes a long time: perhaps half an hour, perhaps 12 hours. Because a backup system does not normally serve traffic, it may also hide many problems that nobody has discovered yet.
If the business must remain unaffected by such a catastrophic failure, or must recover within a few minutes, a multi-site active-active architecture is needed.

1.1 Concept

"Multi-site" means the system is deployed in geographically different places, in the spirit of "do not put all your eggs in one basket";
"active-active" means the systems in those different locations can all provide business services, where "active" means live and serving traffic.

To judge whether a system is multi-site active-active, check two criteria:

  • Under normal conditions, users get correct business service no matter which site's system they access.
  • When the system at one site fails, users who access the healthy system at another site still get correct business service.
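
The two criteria above can be illustrated with a small routing sketch. It is only an illustration under assumed names: the `Site` class, its `serve` method, and the `route` function are hypothetical and do not come from the original article. Every site can serve any user, and a request falls through to another site when the preferred one is down.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """A hypothetical deployment site that can serve any user while healthy."""
    name: str
    healthy: bool = True

    def serve(self, user_id: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {user_id}"


def route(user_id: str, sites: list, preferred: str) -> str:
    """Try the preferred site first, then the remaining sites in order."""
    # sorted() is stable, so the preferred site moves to the front
    # and the other sites keep their relative order.
    ordered = sorted(sites, key=lambda s: s.name != preferred)
    for site in ordered:
        try:
            return site.serve(user_id)
        except ConnectionError:
            continue  # criterion 2: fall over to another site
    raise RuntimeError("no healthy site available")


sites = [Site("site-a"), Site("site-b")]
print(route("u1", sites, preferred="site-b"))  # served by the preferred site
sites[1].healthy = False
print(route("u1", sites, preferred="site-b"))  # served by the other site
```

A backup-only design would fail the first criterion, because the standby site does not serve users under normal conditions.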

1.2 Architecture patterns

By geographic distance, multi-site active-active architectures fall into three patterns: same city across districts, across cities, and across countries.

1.2.1 Same city, different districts

The business is deployed in several data centers located in different districts of the same city, and the data centers are connected by a dedicated high-speed network.

Two data centers in the same city are usually only tens of kilometers apart. With a high-speed network between them, transmission between the two is almost as fast as inside a single data center. Although the data centers sit in different physical locations, logically they can be treated as one data center, which greatly reduces the complexity and cost of designing and implementing multi-site active-active.

1.2.2 Cross-city

Cross-city means the business is deployed in data centers in different cities, preferably cities that are some distance apart.

Because the sites are far apart, this pattern can cope with extreme events. The trade-off is that the distance adds network latency, so data at the sites can be briefly inconsistent.

The cross-city pattern therefore suits businesses whose data consistency requirements are not strict, whose data changes rarely, or for which losing some data has little business impact.

1.2.3 Cross-country

The business is deployed in data centers in several different countries.

This pattern is mainly used to serve users in different regions (for example, Taobao China and Taobao US), or to make read-only businesses multi-active (for example, a search engine).

1.2.4 A question

Suppose we already have the highly available storage architecture with data partitioning and backups described earlier, and automated operations can bring every system up properly within 1 minute. Does that mean multi-site active-active is unnecessary?

No, for the following reasons:
1 A backup system normally carries no traffic, so putting it straight into production may trigger bugs that routine testing never exposed.
2 Data in the backup lags behind the live system. For systems such as financial ones, switching over directly is still not acceptable.
3 A running system holds a lot of intermediate data and cached data. A system that has carried no traffic has not been warmed up, and sending heavy traffic to it directly can bring it down.

2 Four design tips for multi-site active-active

2.1 Tip 1: Guarantee multi-site active-active for the core business

Making every business multi-site active-active is very hard, and for some businesses there is no solution at all, so give priority to a multi-site active-active design for the core business.

2.2 Tip 2: Guarantee eventual consistency of core data

In theory, synchronization between distant sites cannot be fast, because the limit is set by the laws of physics (signal propagation delay over distance). This creates a contradiction: the business requires data to be synchronized quickly, but physically it cannot be. Real-time synchronization of all data is therefore an unattainable goal.

Ways to minimize the impact:

  • Minimize the distance between the sites and build a high-speed network.
  • Minimize the amount of synchronized data; synchronize only the data related to the core business.
  • Guarantee eventual consistency rather than real-time consistency.
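
The physical limit can be estimated with simple arithmetic. The sketch below assumes that a signal travels through optical fiber at roughly 200,000 km/s (about two thirds of the speed of light in vacuum) and uses example distances chosen only for illustration. It gives a lower bound; real round trips are longer because of routing, switching, and retransmission.

```python
# Assumption: approximate signal speed in optical fiber, in km per second.
SPEED_IN_FIBER_KM_PER_S = 200_000


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on the round-trip time between two sites, in milliseconds."""
    one_way_s = distance_km / SPEED_IN_FIBER_KM_PER_S
    return 2 * one_way_s * 1000


# Illustrative distances: same city, cross-city, cross-country.
for label, km in [("same city", 50), ("cross-city", 2_000), ("cross-country", 10_000)]:
    print(f"{label:>13}: at least {min_round_trip_ms(km):.1f} ms per round trip")
```

Under these assumptions a same-city round trip costs well under a millisecond, while a cross-country one costs on the order of 100 ms before any processing, which is why the list above starts with reducing distance and ends with accepting eventual consistency.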

2.3 Tip 3: Use several means to synchronize data

Data synchronization is the core of a multi-site active-active architecture. The synchronization built into the storage system cannot meet business needs in some scenarios, so do not rely on it alone. Combine it with other means, or drop it entirely in favor of your own synchronization scheme.

Available methods:

  • Message queue: publish changes to a queue that the other sites consume.
  • Secondary read: when the local read misses, read once more from another site.
  • Storage system synchronization: use the replication built into the storage system.
  • Back-to-source read: do not synchronize the data; fetch it from the site that owns it.
  • Regeneration: if the data cannot be obtained, generate it again locally.
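
A minimal sketch of how the read-side methods can be combined is shown below. It assumes a hypothetical in-memory `SiteStore` class and a `regenerate` callback, neither of which comes from the original article: the read tries the local site, then another site (secondary or back-to-source read), and finally regenerates the data if the other site is unreachable.

```python
class SiteStore:
    """Hypothetical per-site key-value store used only for this illustration."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True

    def get(self, key):
        if not self.up:
            raise ConnectionError(f"{self.name} is unreachable")
        return self.data.get(key)


def read_with_fallback(key, local, remote, regenerate):
    """Return (value, source), where source tells which method produced the value."""
    value = local.get(key)
    if value is not None:
        return value, "local"

    # Secondary / back-to-source read: ask the other site once.
    try:
        value = remote.get(key)
    except ConnectionError:
        value = None
    if value is not None:
        local.data[key] = value  # keep a local copy for later reads
        return value, "remote"

    # Regeneration: for example, ask the user to log in again and
    # create a new session at the local site.
    value = regenerate(key)
    local.data[key] = value
    return value, "regenerated"


site_a, site_b = SiteStore("A"), SiteStore("B")
site_a.data["session:42"] = "alice"
print(read_with_fallback("session:42", site_b, site_a, lambda k: "new-session"))
```

Whether regeneration is acceptable depends on the data: it can work for something like a login session, but not for data that must be unique or must not be lost.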

Taking a user subsystem as an example:
[Figure: the synchronization methods above applied to a user subsystem]

2.4 Tip 4: Guarantee multi-site active-active only for the vast majority of users

Multi-site active-active cannot guarantee 100% availability; this too is determined by the laws of physics.

Since 100% availability is out of reach, the only option for the users who are affected is to reassure or compensate them.

2.5 The core idea

Use a variety of means to guarantee multi-site active-active for the core business of the vast majority of users. Based on the budget and on the recovery time the business requires, choose which business services to make multi-site active-active and which pattern to use.

3 Four steps for designing multi-site active-active

3.1 Step 1: Business classification

Classify the businesses by some standard and pick out the core ones. Designing multi-site active-active only for the core business lowers the overall complexity and implementation cost of the solution.

Common classification criteria:

  • Businesses with high traffic
  • Core businesses
  • Businesses that generate a large share of revenue

3.2 Step 2: Data classification

Once the core businesses are chosen, analyze the data they depend on: identify every data object and its characteristics, because these characteristics shape the later design.

Common dimensions for analyzing data characteristics:

  • Data volume
  • Uniqueness
  • Real-time requirement
  • Tolerance for loss
  • Recoverability

3.3 Step 3: Data synchronization

Once the data characteristics are known, design a synchronization scheme for each kind of data.

Common data synchronization schemes:

  • Storage system synchronization
  • Message queue synchronization: suitable for data with no transactional or ordering requirements.
  • Regeneration

3.4 Step 4: Exception handling

Exception handling has these main purposes:

  • While a problem is occurring, keep a small amount of abnormal data from making the whole business unavailable.
  • After the problem is resolved, correct the abnormal data.
  • Reassure users and make up for their losses.

Common exception handling measures:

  • Multi-channel synchronization, typically over two channels
  • Combining synchronization with access to another site
  • Logging
  • User compensation
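
Dual-channel synchronization only works if the receiving site can tolerate duplicated and out-of-order updates. The sketch below shows one possible way to do that, under a strong assumption that is not stated in the original article: every update carries a per-key version number that only increases, so an update is applied only when it is newer than what the replica already holds.

```python
def apply_update(replica: dict, update: tuple) -> bool:
    """Apply (key, version, value) if it is newer; return True when applied."""
    key, version, value = update
    current = replica.get(key)
    if current is None or version > current[0]:
        replica[key] = (version, value)
        return True
    return False  # duplicate or stale update, safe to ignore


# Hypothetical example: channel A loses the second update,
# channel B delivers both updates but in the wrong order.
channel_a = [("user:1", 1, "nick=old")]
channel_b = [("user:1", 2, "nick=new"), ("user:1", 1, "nick=old")]

replica = {}
for update in channel_a + channel_b:
    apply_update(replica, update)

print(replica)  # the replica ends with version 2 despite the loss on channel A
```

Version numbers are only one option; data that cannot be versioned this way needs a different reconciliation rule, together with the logging and compensation measures listed above.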

4 Interface-level failures

Multi-site active-active mainly addresses system-level failures such as machine crashes, data center failures, and network failures. These failures have a large impact but a low probability of occurring.

Interface-level failures may have less impact than system-level ones, but in real business operations they occur far more often.

Typical symptom: the system has not crashed and the network is not down, yet the business is malfunctioning.

Causes:

  • Internal: program bugs, slow database queries
  • External: hacker attacks, promotions such as Double Eleven, failures in third-party systems

The core idea for handling them is essentially the same as for multi-site active-active: give priority to the core business and to the vast majority of users. Common measures:

  • Degradation: reduce or switch off non-core functions to protect the core ones.
  • Circuit breaking: stop calling a dependency that keeps failing.
  • Rate limiting: cap the number of requests that are accepted.
  • Queuing: make requests wait in line instead of handling them all at once.
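
Two of these measures can be sketched in a few lines. The code below is a simplified illustration, not a production design: `TokenBucket` is a common way to implement rate limiting, and `CircuitBreaker` fails fast with a fallback (a form of degradation) after repeated errors. The class names, parameters, and thresholds are assumptions made for this example.

```python
import time


class TokenBucket:
    """Rate limiting: allow `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill tokens for the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class CircuitBreaker:
    """Circuit breaking: stop calling a dependency after repeated failures."""

    def __init__(self, max_failures, reset_after, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to stay open
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: fail fast without calling func
            # Half-open: allow one trial call; a single failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()  # degraded response
        self.failures = 0
        return result


def pay_via_third_party():
    raise TimeoutError("payment gateway too slow")  # simulated failure


breaker = CircuitBreaker(max_failures=2, reset_after=30)
for _ in range(3):
    print(breaker.call(pay_via_third_party, lambda: "please pay later"))
```

In the usage example the third call never reaches the failing function, because the breaker is already open. The `clock` parameter exists so that the time source can be replaced in tests.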

4.1 Case study

Suppose you are designing a flash-sale system for a limited-quantity item released on the hour, with login, purchase, payment (depending on Alipay), and other functions. How would you design the handling of interface-level failures?

1 User services: prepare degradation strategies for the rush period so that login stays available. Registration and modification of user information can be degraded when the load is too high.

2 Purchasing involves placing orders, inventory, and product queries. Apply rate limiting by queuing requests; requests beyond the available inventory are rejected immediately or kept waiting in the queue.
To cope with possible failures of the inventory and product services, cache product and inventory data ahead of time, so that the data can still be served locally if the backend services fail.

3 Payment depends on a third-party system. Set a reasonable circuit-breaking strategy together with compensation or fault-tolerance measures. When the average payment time exceeds the limit, prompt the user to pay later.
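
The queuing idea in point 2 can be sketched as follows. This is an illustrative single-process model with hypothetical names (`FlashSale`, `submit`, `process_one`); a real system would need a shared queue and atomic inventory updates, which are outside the scope of this sketch.

```python
from collections import deque


class FlashSale:
    """Admit a bounded number of purchase requests and serve them in order."""

    def __init__(self, stock, max_queue):
        self.stock = stock
        self.max_queue = max_queue  # assumed cap on waiting requests
        self.queue = deque()

    def submit(self, user_id):
        # Rate limiting by queuing: reject immediately when nothing is left
        # to sell or the queue is already full.
        if self.stock == 0 or len(self.queue) >= self.max_queue:
            return "rejected"
        self.queue.append(user_id)
        return "queued"

    def process_one(self):
        """Handle the oldest waiting request; return None if the queue is empty."""
        if not self.queue:
            return None
        user_id = self.queue.popleft()
        if self.stock > 0:
            self.stock -= 1
            return (user_id, "ordered")
        return (user_id, "sold out")


sale = FlashSale(stock=2, max_queue=3)
print([sale.submit(f"user-{i}") for i in range(5)])
while (result := sale.process_one()) is not None:
    print(result)
```

Allowing the queue to be somewhat longer than the stock, as in this example, leaves room for orders that are later abandoned; how much longer is a business decision.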
