Some thoughts on and understanding of high availability

Article source: public account "ingenuity zero"

In the current Internet era, a service must stay available even under the impact of high concurrency. If a service is not highly available, this is what it means:

  • The system cannot serve users 24 hours a day, 7 days a week. The user experience is particularly bad, users may not come back next time, and the product fails to retain them.
  • When the system is unavailable, the company's image suffers to some degree. Companies like BAT (Baidu, Alibaba, Tencent) are seen as symbols of technical strength for a similar reason.
  • The most important point: when the system is unavailable, the direct loss is money! Losses are basically counted by the second. I vaguely remember the Ctrip outage of May 28, 2015: according to the first-quarter earnings data Ctrip had announced, the average loss per hour of downtime was $1,064,800.

High availability is a very complex topic and my own level is limited, so I cannot cover all of it. I can only share some of my own thinking on and understanding of high availability.

So how do we make a system highly available?

We cannot guarantee that a server never goes down or that a service never crashes. So how do we keep those failures from becoming a problem? In other words: machines may go down and services may break, so how can the system still provide service?

First, if there are many machines and many service instances, it does not matter when some of them break, and the failure case is handled. Let us analyze this step by step. If a machine stores specific values locally, it cannot be scaled out: requests that need those values must go to that one machine, and when it goes down there is no substitute. The machine side is easy to solve, because a replacement with the same configuration is easy to bring up. Application services are similar: a service must not keep state in its own memory and must not store data specific to itself on any particular machine, otherwise it cannot be scaled out easily. Only when every instance is identical, with no differences at all, can we replace instances and scale out easily. Such a service is called a stateless service.

Once the services are stateless, how does the system dynamically detect that an instance has gone down? Otherwise requests would still be sent to the dead machine. And how are requests redirected to new machines? This is where service registration and discovery are needed.
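
To make the idea concrete, here is a minimal in-memory sketch of registration and discovery. It assumes that each instance renews its registration periodically and that an instance which has not renewed within a time limit is treated as gone. The class name `ServiceRegistry`, the `ttl_seconds` parameter, and the addresses are my own illustration, not the API of any real registry product.

```python
import time


class ServiceRegistry:
    """In-memory registry: instances register and renew; stale ones are dropped."""

    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the sketch can be tested
        self._last_seen = {}     # (service, address) -> time of last renewal

    def register(self, service, address):
        """Register an instance, or renew it if it is already known."""
        self._last_seen[(service, address)] = self._clock()

    def deregister(self, service, address):
        """Remove an instance that shuts down cleanly."""
        self._last_seen.pop((service, address), None)

    def discover(self, service):
        """Return the addresses of instances that renewed within the TTL."""
        now = self._clock()
        return sorted(
            address
            for (name, address), seen in self._last_seen.items()
            if name == service and now - seen <= self._ttl
        )
```

A caller asks `discover("pay")` before each request, so an instance that stopped renewing simply disappears from the list and a newly registered machine appears in it. A real registry would also need to be highly available itself, which this sketch ignores.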

If you have achieved everything above, it is basically enough for ordinary situations. But the Internet is complex. So far we have only talked about broken machines and failed services, so what do we do when the network is briefly unreachable?

So services should send heartbeats to one another and check at regular intervals whether they can still reach each other. Whether the machine broke down, the service crashed, or the network is down, the result is the same: the peer cannot be reached. This case can be solved through service registration and discovery. But what about the special case where the network only flickers? Suppose service a has sent a request to service b and b has received it, and at that moment the network suddenly drops. Service b still finishes its processing, but a never receives the response, times out, and triggers the request again. If b runs its logic a second time, is that a problem? Take payment: 200 yuan has already been paid, so should another 200 yuan be paid? This is where the design concept of idempotency comes in. Idempotent means that doing the operation many times gives the same result as doing it once. With an idempotent design there is nothing to fear in this situation: when no response comes back, simply retry, and no problem will occur.
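
The 200 yuan example can be sketched as follows. This is only an illustration under my own assumptions: the caller attaches a unique `request_id` to every payment and reuses it when retrying, and the service remembers the result of the first execution. `PaymentService` and its fields are hypothetical names. The record is kept in a dict here to stay short; since services should be stateless, a real system would keep it in shared storage.

```python
class PaymentService:
    """Idempotent payment: a retried request_id is never charged twice."""

    def __init__(self):
        self.total_paid = 0
        self._results = {}   # request_id -> result of the first execution

    def pay(self, request_id, amount):
        # A retry of a request that already succeeded returns the stored
        # result instead of charging again.
        if request_id in self._results:
            return self._results[request_id]
        self.total_paid += amount
        result = {"request_id": request_id, "amount": amount, "status": "paid"}
        self._results[request_id] = result
        return result


service = PaymentService()
first = service.pay("order-1001", 200)
retry = service.pay("order-1001", 200)   # same request sent again after a timeout
```

After both calls `service.total_paid` is 200, not 400, and `retry` is the same result as `first`, so service a can retry safely when the response was lost.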

With the above in place, broken machines, crashed services, and network outages or flickers are basically no longer a big problem. But the Internet today means high concurrency, so how do we improve the system's capacity under high concurrency?

It is like moving house: one person is slow, so you ask more people to help. Because the architecture above lets us add machines and services, the obvious idea is to add more machines and more service instances, which is certainly faster than a single machine. But with, say, five machines and many requests coming in, what strategy spreads the requests across the machines? This is load balancing, and it can be done with hardware devices or at the software level. Either way there must be service registration and discovery, otherwise there is no way to learn about dynamic changes in the nodes. The same layer can also carry controls such as blacklists, whitelists, and access frequency limits. Adding machines may look unsophisticated, yet it is often quite effective. Still, you cannot add machines blindly, because in some cases adding machines does not solve the problem.
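
As one possible strategy, here is a minimal round-robin sketch at the software level. It assumes a `get_nodes` callable that returns the current list of healthy nodes, for example the result of a registry lookup. `RoundRobinBalancer` and the node names are hypothetical, and round-robin is just the simplest choice; weighted or least-connections strategies are equally valid.

```python
import itertools


class RoundRobinBalancer:
    """Spread requests evenly over whatever nodes are currently available."""

    def __init__(self, get_nodes):
        self._get_nodes = get_nodes        # returns the current list of nodes
        self._counter = itertools.count()  # 0, 1, 2, ...

    def pick(self):
        # Ask for the node list on every call so that added or removed
        # nodes are noticed without restarting the balancer.
        nodes = self._get_nodes()
        if not nodes:
            raise RuntimeError("no nodes available")
        return nodes[next(self._counter) % len(nodes)]


nodes = ["node-1", "node-2", "node-3"]
balancer = RoundRobinBalancer(lambda: nodes)
picked = [balancer.pick() for _ in range(4)]
```

Here `picked` cycles through the three nodes and then starts over with `node-1`. Because the node list is read on every call, a node removed from `nodes` is never picked again.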

More machines are indeed faster. But if a service contains a blocking call, adding more instances is useless. So pay attention to service timeouts. Because the services are idempotent, executing a request again does no harm, and with a timeout a request cannot stay stuck for a long time behind a slow dependency (the downstream service is down, a thread is deadlocked, the downstream service is busy, and so on).
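
A small sketch of the timeout idea, using only the Python standard library. `call_with_timeout` is a hypothetical helper of my own. Note the limitation stated in the comments: the caller stops waiting, but the worker thread itself is not killed, so in practice the timeout should also be set on the underlying network client.

```python
import concurrent.futures
import time

# Shared worker pool for downstream calls (the size is an arbitrary choice).
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_seconds, *args):
    """Run func(*args) but stop waiting after timeout_seconds."""
    future = _pool.submit(func, *args)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # cancel() only helps if the task has not started yet; a task that
        # is already running keeps going in its worker thread.
        future.cancel()
        raise TimeoutError(f"no response within {timeout_seconds}s") from None


def slow_downstream():
    time.sleep(0.3)   # stands in for a busy or stuck downstream service
    return "late"


def fast_downstream():
    return "ok"
```

With this helper, `call_with_timeout(fast_downstream, 1.0)` returns normally, while `call_with_timeout(slow_downstream, 0.05)` raises `TimeoutError`, and the caller is free to retry or fail fast instead of hanging.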

On synchronous versus asynchronous design: business scenarios that must execute in order need synchronous calls. In scenarios without that requirement, asynchronous processing should support a larger concurrent volume than synchronous processing. (Because an asynchronous request passes through middleware and many more steps, the total time of a single request is not necessarily shorter than with a synchronous call, but from a macro point of view the number of concurrent requests that can be handled is much larger.) A few words on asynchrony. Inside a single service, asynchrony means multithreading. In many cases multithreading raises CPU utilization and improves system performance, but its implementation cost is much higher. Between different services, asynchrony is done with message middleware. (Message middleware is hard to get right: first it must be truly asynchronous, second it must not lose messages, and both points are really hard, especially with large data volumes.) Network I/O in particular calls for an asynchronous model, and Netty packages this very well.
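
For the "inside a single service" case, a minimal sketch with a thread pool from the Python standard library. `fetch` is a made-up stand-in for an I/O-bound call, and the sleep duration and pool size are arbitrary assumptions. Message middleware between services is not shown here.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(item):
    """Stand-in for an I/O-bound call such as a remote request."""
    time.sleep(0.05)
    return item * 2


def fetch_all_sync(items):
    # One call after another: total time grows with the number of items.
    return [fetch(item) for item in items]


def fetch_all_async(items, workers=8):
    # The calls wait in parallel on worker threads; map() keeps input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, items))
```

Both functions return the same results in the same order. A single `fetch` is no faster in the threaded version, but several of them wait at the same time, which matches the point above about single-request time versus overall concurrency.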

Every machine or service has an upper limit. If a flood of traffic arrives that exceeds what it can process, how do we handle that?

This problem can be seen everywhere in life. Around the National Day holiday, when people go home or go out to travel, you can see it at a security check: when too many people arrive, a guard stops the people behind and makes them wait until those in front have been processed. But people with higher priority, or whose train or bus is about to depart, are generally let through first. In software architecture this is called rate limiting and service degradation. In general there are two control strategies: (1) reject requests, (2) shut down some services. In the past, shutting down some services may have been the usual choice, but it is no longer recommended (after all, it reflects on the company's technical strength), and the current focus is on rejecting part of the requests. Where should this control be added? Wherever control is needed, and every layer may need its own control.
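
Strategy (1), rejecting part of the requests, is often implemented with a token bucket. The sketch below is one common approach and not necessarily the one used in any specific system; `TokenBucket`, `rate`, and `capacity` are names I chose for illustration.

```python
import time


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._clock = clock              # injectable so the sketch can be tested
        self._tokens = float(capacity)   # start with a full bucket
        self._last = clock()

    def allow(self):
        """Return True if the request may pass, False if it should be rejected."""
        now = self._clock()
        # Refill according to the time elapsed, never above capacity.
        refill = (now - self._last) * self._rate
        self._tokens = min(self._capacity, self._tokens + refill)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

A request handler calls `allow()` first and returns a "busy, try later" response when it gets `False`. Since every layer may need its own control, each layer could hold its own bucket with its own rate.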

I vaguely remember a saying in the industry that high concurrency and high availability have three magic weapons: rate limiting, degradation, and caching. On caching: most Internet services that people come into contact with are characterized by many reads and few writes, which makes them well suited to caching.
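
For the many-reads, few-writes case, a minimal read-through cache sketch. It assumes a `loader` function that fetches a value from the slow source (a database, for example) and a fixed time to live per entry. `ReadThroughCache` and its parameters are hypothetical names, and real caches also need size limits and protection against many simultaneous misses, which are left out here.

```python
import time


class ReadThroughCache:
    """Serve repeated reads from memory; call the loader only on a miss."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader     # fetches the value from the slow source
        self._ttl = ttl_seconds
        self._clock = clock       # injectable so the sketch can be tested
        self._entries = {}        # key -> (value, expiry time)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]                        # hit: no load needed
        value = self._loader(key)                  # miss or expired: reload
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        """Call after a write so the next read sees fresh data."""
        self._entries.pop(key, None)
```

With few writes, almost every `get` is answered from memory and the slow source sees only a small share of the traffic.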

Whether or not a service is scaled out, it is scaled as one unit, yet inside the same service some interfaces are called extremely often and others rarely. By continuing to split the service into smaller pieces, concurrency can be improved yet again.

This leads to microservices. There are many microservice concepts. The first one to mention is the vertical split, which is easy to understand. After the vertical split, many businesses may also need a further horizontal split. (What to split on depends entirely on the company's own business, and the deeper your understanding of the business, the better.)

With everything above, crashed services, broken machines, and network outages or flickers have all been handled, concurrency has been improved, and we have done our best to keep the service available. But doing so brings many new problems, so the problems caused by these changes must be solved as well:

  • Within a single service, transactions were easy to control. After the move to microservices, transaction control becomes especially important. Often we cannot achieve strong consistency, but eventual consistency is achievable.
  • Call-chain monitoring is especially important, and so is the early warning that goes with it.
  • Distributed logging is also especially important.
  • Advanced use of tools such as jstack and Btrace in a real environment is especially important.

Conclusion

My level is limited, so it is inevitable that my understanding is off in some places. If you find any, you are welcome to point them out. Thanks!

Origin www.cnblogs.com/alterem/p/11606548.html