Common high-availability strategies

High concurrency is about making the system "more efficient"; high availability is about making it "more reliable."

High availability is something every architecture design has to consider, and it also comes up often in technical interviews.

Below are some common high-availability strategies:

(1) multiple replicas

(2) isolation

(3) rate limiting

(4) circuit breaking

(5) degradation

(6) gray release and rollback

(7) monitoring system

(8) log alerting

1. Multiple replicas

Avoid single points of failure: do not put all your eggs in one basket.

Gateways, application servers, cache servers, databases and so on are usually deployed as multiple replicas.

For stateless components such as gateways and application servers, running multiple replicas is straightforward. Stateful components such as databases and caches, however, have to deal with data synchronization between replicas.

The publish/subscribe mechanism of a message queue is one common way to synchronize data.

Redis Cluster and MySQL, for example, also provide master-slave replication.

Note that data synchronization raises data consistency issues, and consistency and availability are in tension with each other.

For example, a MySQL cluster that uses asynchronous replication has replication lag. When the master goes down, a small amount of data may not have been synchronized yet; if the slave is then promoted to master, that data is lost.

At this point, should you preserve the current state and repair the master as quickly as possible, or fail over immediately?

Usually some consistency is sacrificed to ensure availability, because most data can be repaired afterwards, for example manually or through a compensation mechanism, so the damage is manageable. If a large system is unavailable for 10 minutes, however, the impact is severe.
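
The data-loss window described above can be illustrated with a toy model. This is only a sketch of asynchronous replication in general, not of MySQL's actual protocol; the Master and Slave classes and the order keys are made up for illustration.

```python
class Master:
    def __init__(self):
        self.log = []                      # every acknowledged write, in order

    def write(self, key, value):
        # Acknowledged immediately, before any replica has the entry.
        self.log.append((key, value))


class Slave:
    def __init__(self):
        self.data = {}
        self.applied = 0                   # number of log entries applied so far

    def pull(self, master, max_entries):
        """Apply at most max_entries pending entries (models replication lag)."""
        batch = master.log[self.applied:self.applied + max_entries]
        for key, value in batch:
            self.data[key] = value
        self.applied += len(batch)


master, slave = Master(), Slave()
for i in range(5):
    master.write(f"order:{i}", "paid")
slave.pull(master, max_entries=3)          # the slave is two writes behind

# The master crashes here and the slave is promoted as-is.
lost = [key for key, _ in master.log[slave.applied:]]
print(lost)                                # ['order:3', 'order:4']
```

The two writes that never reached the slave are exactly the data that would need manual repair or compensation after an immediate failover.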

2. Isolation

Isolation means partitioning system resources so that the scope of a failure is contained and does not snowball.

Main forms:

(1) Data isolation

Store data separately; for example, keep core and non-core data completely physically separated.

(2) Machine isolation

This is similar to a VIP service. When a service has many callers and a few of them account for a very large share of the calls, a group of machines can be dedicated to serving those heavy callers.

(3) Thread pool isolation

For example, a Tomcat instance configured with 500 threads can process at most 500 requests at the same time.

Suppose it hosts multiple services. If one of them receives very heavy traffic for a period of time and also responds slowly, all 500 threads are exhausted and the whole server hangs.

With thread pool isolation, each service gets its own thread pool instead of sharing one, so the services do not affect each other.
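
A minimal sketch of the idea, using Python's standard thread pools rather than Tomcat. The service names, pool sizes and helper functions are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One bounded pool per service (names and sizes are assumptions).
pools = {
    "orders": ThreadPoolExecutor(max_workers=4, thread_name_prefix="orders"),
    "recommend": ThreadPoolExecutor(max_workers=2, thread_name_prefix="recommend"),
}

def submit(service, fn, *args):
    """Run fn on the pool dedicated to the given service."""
    return pools[service].submit(fn, *args)

release = threading.Event()

def slow_recommend():
    release.wait(timeout=5)                # simulates a slow dependency
    return "recommendations"

def fast_order(order_id):
    return f"order {order_id} ok"

# Saturate the "recommend" pool with more slow calls than it has threads.
stuck = [submit("recommend", slow_recommend) for _ in range(4)]

# The "orders" pool has its own threads and still answers right away.
result = submit("orders", fast_order, 42).result(timeout=1)
print(result)                              # order 42 ok

release.set()                              # let the slow calls finish
slow_results = [f.result(timeout=5) for f in stuck]
for pool in pools.values():
    pool.shutdown()
```

With a single shared pool, the four slow calls would have occupied the threads that the order request needed.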

(4) Semaphore isolation

For example, when 10 concurrent requests call a service, each one must acquire a semaphore permit before it can actually make the call.

The number of permits is limited. If there are 5, the other 5 requests have to wait in a queue.

The queue also has an upper limit. Once it is full, further requests go to the fallback path, which limits traffic and prevents an avalanche.
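
The permit, queue and fallback steps can be sketched as follows. The SemaphoreIsolation class is hypothetical and is not the API of any particular library; the demo uses 1 permit and a queue of size 0 instead of the 5 permits in the text so that the outcome is deterministic.

```python
import threading

class SemaphoreIsolation:
    """Caps concurrent calls; extra callers wait in a bounded queue or fall back."""

    def __init__(self, permits, max_waiting):
        self._permits = threading.Semaphore(permits)
        self._max_waiting = max_waiting
        self._waiting = 0
        self._lock = threading.Lock()

    def call(self, fn, fallback):
        if not self._permits.acquire(blocking=False):
            # No free permit: join the queue unless it is already full.
            with self._lock:
                queue_full = self._waiting >= self._max_waiting
                if not queue_full:
                    self._waiting += 1
            if queue_full:
                return fallback()
            try:
                self._permits.acquire()    # block until a permit is released
            finally:
                with self._lock:
                    self._waiting -= 1
        try:
            return fn()
        finally:
            self._permits.release()


guard = SemaphoreIsolation(permits=1, max_waiting=0)

def outer():
    # The only permit is held while this runs and the queue has no room,
    # so the nested call is sent to the fallback.
    return guard.call(lambda: "real call", lambda: "fallback")

nested = guard.call(outer, lambda: "fallback")
after = guard.call(lambda: "real call", lambda: "fallback")
print(nested, "/", after)                  # fallback / real call
```

Unlike thread pool isolation, the call still runs on the caller's own thread; the semaphore only bounds how many callers may be inside at once.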

3. Rate limiting

Limiting is also common in everyday life, such as visitor limits at tourist attractions and passenger limits in subway stations.

(1) Limiting at the technical level

Limit concurrency, that is, cap the maximum use of a system resource, for example with a database connection pool, a thread pool, or the Nginx limit_conn module.

Limit the rate, for example with Guava's RateLimiter or the Nginx limit_req module. If load testing shows that an interface can sustain 2000 QPS, you can cap traffic at that level and reject anything beyond it outright, so that the service is not overwhelmed.
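
As a rough illustration of rate limiting, here is a small token bucket. It is one common algorithm, not necessarily what Guava's RateLimiter or Nginx limit_req do internally; the TokenBucket class and the injected clock are assumptions made so the example is deterministic, and the class is not thread-safe.

```python
import time

class TokenBucket:
    """Allows roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                       # over the limit: reject the request


now = [0.0]                                # fake clock for the demo
limiter = TokenBucket(rate=2000, capacity=2000, clock=lambda: now[0])

first_burst = sum(limiter.allow() for _ in range(2500))
now[0] = 0.5                               # half a second later
second_burst = sum(limiter.allow() for _ in range(2500))
print(first_burst, second_burst)           # 2000 1000
```

Of the first 2500 requests, 2000 pass and 500 are rejected; half a second later only 1000 tokens have been refilled.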

(2) Limiting at the business level

For example, in a flash sale with only 100 items but far more participants, you can admit only the first 500 people to compete and tell everyone after them that the sale has ended.
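
A minimal sketch of such an admission gate, assuming a single process. AdmissionGate is a hypothetical name, and the figure of 3000 arriving requests is made up for the demo.

```python
import threading

class AdmissionGate:
    """Admits only the first max_admitted callers; everyone else is turned away."""

    def __init__(self, max_admitted):
        self._remaining = max_admitted
        self._lock = threading.Lock()

    def try_enter(self):
        with self._lock:
            if self._remaining > 0:
                self._remaining -= 1
                return True
            return False                   # tell the user the sale has ended


gate = AdmissionGate(max_admitted=500)
admitted = sum(gate.try_enter() for _ in range(3000))
print(admitted)                            # 500
```

Only the admitted requests ever reach the inventory logic, so the 100 items are contended by at most 500 requests.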

4. Circuit breaking

When an electrical circuit has a problem (such as a short circuit or overheating) that could burn out the whole circuit, the fuse blows automatically to protect it.

System design borrows the same idea.

(1) Circuit breaking based on request failure rate

If a service times out or returns errors frequently within a short period, open the circuit breaker, that is, stop calling it.

After a certain time, try calling it again; if it still has problems, keep the breaker open.
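
The open, wait and retry cycle can be sketched as below. As a simplification, this version counts consecutive failures instead of computing a failure rate over a time window; the CircuitBreaker class, its parameters and the fake clock are all assumptions for illustration.

```python
import time

class CircuitOpenError(Exception):
    """Raised when a call is skipped because the breaker is open."""

class CircuitBreaker:
    def __init__(self, max_failures, reset_after, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after     # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None              # None means the breaker is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, call skipped")
            # Cool-down elapsed: let this one call through as a trial.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()   # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None              # success closes the breaker
        return result


def failing():
    raise TimeoutError("downstream timed out")

def attempt(breaker, fn):
    try:
        return breaker.call(fn)
    except CircuitOpenError:
        return "skipped"
    except TimeoutError:
        return "failed"

now = [0.0]                                # fake clock for the demo
breaker = CircuitBreaker(max_failures=3, reset_after=10.0, clock=lambda: now[0])

outcomes = [attempt(breaker, failing) for _ in range(5)]
now[0] = 11.0
outcomes.append(attempt(breaker, failing))       # trial call fails, breaker re-opens
outcomes.append(attempt(breaker, failing))       # skipped again
now[0] = 22.0
outcomes.append(attempt(breaker, lambda: "ok"))  # trial call succeeds
print(outcomes)
# ['failed', 'failed', 'failed', 'skipped', 'skipped', 'failed', 'skipped', 'ok']
```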

(2) Circuit breaking based on request response time

Track the service's average response time and open the breaker once it exceeds a threshold.
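
One possible way to track the average is a sliding window over the most recent calls. The LatencyBreaker class, the 200 ms threshold and the window of 3 samples are assumptions for illustration; recovery (probing the service again after a cool-down) is left out.

```python
from collections import deque

class LatencyBreaker:
    """Opens when the average latency of the last `window` calls exceeds a threshold."""

    def __init__(self, threshold_ms, window):
        self.threshold_ms = threshold_ms
        self.samples = deque(maxlen=window)   # old samples fall off automatically
        self.open = False

    def record(self, latency_ms):
        self.samples.append(latency_ms)
        window_full = len(self.samples) == self.samples.maxlen
        average = sum(self.samples) / len(self.samples)
        # Only judge once the window is full, to avoid tripping on one slow call.
        self.open = window_full and average > self.threshold_ms
        return self.open


latency_breaker = LatencyBreaker(threshold_ms=200, window=3)
states = [latency_breaker.record(ms) for ms in (100, 150, 120, 400, 500)]
print(states)                              # [False, False, False, True, True]
```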

The difference between rate limiting and circuit breaking: rate limiting is the server protecting itself, while circuit breaking is the client protecting itself.

5. Degradation

For example, in an e-commerce system the core flow is purchasing. When the system is under heavy load, services such as personalized recommendations can be stopped. Degradation means suspending non-core services in order to protect the core business.
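
A minimal sketch of degradation, assuming a runtime switch that operators can flip under load. The DEGRADED dictionary and both functions are hypothetical.

```python
DEGRADED = {"recommendations": False}      # hypothetical runtime switch

def fetch_recommendations(user_id):
    return [f"item-for-{user_id}"]         # stands in for an expensive call

def render_product_page(user_id, product):
    page = {"product": product, "buy_button": True}   # core path, always served
    if DEGRADED["recommendations"]:
        page["recommendations"] = []       # non-core: skipped while degraded
    else:
        page["recommendations"] = fetch_recommendations(user_id)
    return page


normal = render_product_page(7, "phone")
DEGRADED["recommendations"] = True         # system under pressure: degrade
degraded = render_product_page(7, "phone")
print(normal["recommendations"], degraded["recommendations"])
# ['item-for-7'] []
```

The purchase path is identical in both cases; only the non-core part of the page is dropped.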

6. Gray release and rollback

(1) Gray release of new features

When a new feature goes live, expose it to a small number of users first; if there are no problems, open it up gradually.

Traffic can be split by user_id, for example, or by user tags.
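
Splitting by user_id is often done by hashing the ID into a fixed number of buckets. The sketch below assumes 100 buckets and SHA-256; in_gray_group is a hypothetical helper, not part of any framework.

```python
import hashlib

def in_gray_group(user_id, percent):
    """Return True if this user falls into the first `percent` of 100 buckets."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100   # stable bucket, 0..99
    return bucket < percent


share = sum(in_gray_group(uid, 10) for uid in range(10000))
print(share)                               # close to 1000, i.e. about 10% of users
```

Because the bucket depends only on the user_id, a user keeps seeing the same version across requests, and raising the percentage only adds users to the gray group without removing anyone.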

(2) Gray release of a rebuilt legacy system

After a legacy system is rebuilt, traffic is generally not switched to the new system all at once; the old and new systems coexist for some time.

For example, at the start 10% of users go to the new system and 90% stay on the old one, so if the new system has problems they can be fixed promptly with little impact.

As the new system matures and stabilizes, the share of users on it is increased step by step until the cutover is complete.

(3) Rollback

If a new feature or the new system turns out to have a serious problem, you can roll back to the old one.

One approach is a full rollback: the entire system is rolled back to the previous version.

The other is a feature-level rollback: build a switch into the new feature during development, so that the old and new behavior can be toggled.
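
A feature-level rollback might look like the following sketch. The FEATURES dictionary stands in for whatever configuration mechanism is actually used, and both checkout functions are hypothetical.

```python
FEATURES = {"new_checkout": True}          # hypothetical switch, e.g. from a config service

def checkout_old(cart):
    return sum(cart)

def checkout_new(cart):
    return sum(cart) - 5                   # hypothetical new behavior: a coupon of 5

def checkout(cart):
    # The switch is read on every call, so flipping it takes effect immediately.
    impl = checkout_new if FEATURES["new_checkout"] else checkout_old
    return impl(cart)


before = checkout([100, 50])
FEATURES["new_checkout"] = False           # roll back the feature without redeploying
after_rollback = checkout([100, 50])
print(before, after_rollback)              # 145 150
```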

7. Monitoring system

Is the system available right now? We have to keep watching and checking.

A monitoring system helps us observe the state of the system from every angle.

(1) Resource monitoring

For example CPU, memory, disk, network ...

(2) System monitoring

For example, whether URL requests are failing, whether interface calls behave normally, the average response time of interfaces, JVM garbage collection ...

(3) Business monitoring

For example, an order system has a key business metric: the payment success rate of orders.

Is this metric abnormal? Compare it with historical data: since we know its historical distribution curve, a sudden fluctuation today may indicate a problem.
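
One simple way to compare today's value with history is a standard-deviation check. This is a sketch under the assumption that the metric is roughly stable from day to day; the sample history, the 3-sigma threshold and the is_abnormal helper are all made up for illustration.

```python
from statistics import mean, stdev

def is_abnormal(history, today, max_sigma=3.0):
    """Flag today's value if it is more than max_sigma deviations from the mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) > max_sigma * sigma


# Hypothetical payment success rates for the past seven days.
history = [0.97, 0.98, 0.96, 0.97, 0.98, 0.97, 0.96]
print(is_abnormal(history, 0.97))          # False
print(is_abnormal(history, 0.80))          # True
```

Metrics with strong daily or weekly patterns would need a comparison against the same time slot in previous periods instead of a single overall mean.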

8. Log alerting

Logs help us locate problems quickly, but on their own they are passive. Alerts driven by logs let us discover and resolve problems proactively.

When writing code, log foreseeable problems in advance. For example, use assertions to detect abnormal conditions, which may be caused by a bug, dirty data passed in from upstream, dirty data returned from downstream ...

Write error logs in advance at the places where problems may occur, then monitor the logs and alert proactively.
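
The combination of pre-written error logs and proactive alerts can be sketched with Python's standard logging module. AlertHandler, the threshold of 3 and the apply_discount function are hypothetical; a real setup would more likely count errors per time window and send the alert to a paging or chat system.

```python
import logging

class AlertHandler(logging.Handler):
    """Counts ERROR records and calls notify once the count reaches a threshold."""

    def __init__(self, threshold, notify):
        super().__init__(level=logging.ERROR)
        self.threshold = threshold
        self.notify = notify
        self.count = 0

    def emit(self, record):
        self.count += 1
        if self.count == self.threshold:
            self.notify(f"{self.count} errors logged, latest: {record.getMessage()}")


alerts = []                                # stands in for a real alert channel
logger = logging.getLogger("orders")
logger.propagate = False                   # keep the demo output quiet
logger.addHandler(AlertHandler(threshold=3, notify=alerts.append))

def apply_discount(price, discount):
    # Foreseeable problem, logged in advance: dirty data from upstream.
    if not 0 <= discount <= 1:
        logger.error("dirty upstream data: discount=%r", discount)
        return price
    return price * (1 - discount)


for bad_discount in (1.5, -0.2, 7):
    apply_discount(100, bad_discount)
print(alerts)
# ['3 errors logged, latest: dirty upstream data: discount=7']
```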

Origin blog.csdn.net/suifeng629/article/details/94546629