Reading Notes on [Microservice Design] (7): Scaling Microservices

1. Failures are everywhere

Statistically speaking, at scale failure becomes inevitable.

Therefore, when designing and implementing a microservice system, we should account for as many failure modes as possible in order to keep the system as available as possible.

2. Functional downgrade

A microservice system is composed of multiple services working together. When one service goes down, we have to decide how the system behaves externally. For example, if the shopping cart service of a mall system goes down, do we let users keep browsing products, or do we put the mall homepage into "system maintenance" mode? This decision has to be made in light of the business context. Usually, when one service fails, we do not mark the entire system unavailable; instead we handle it appropriately so that some functions remain usable. Of course, this behavior has to be planned for in the design.

3. Antifragile approach

  • Timeouts. Timeouts are easily overlooked; when calling another service, it is important to set a sensible timeout.
  • Circuit breaker. When service A's calls to service B keep timing out or failing (reaching a configured threshold), the breaker opens and subsequent requests fail fast. After a period of time, the client sends a few trial requests to check whether the downstream service has recovered; if they are answered correctly, the breaker resets.
  • Bulkhead pattern, a way of isolating yourself from failure. In shipping, a bulkhead is a sealable section of the hull that, when closed, protects the rest of the ship. In software architecture there are many different "bulkheads" we can use, such as a separate connection pool per downstream service, so that exhausting one pool does not affect calls to other services. A circuit breaker is itself a kind of bulkhead.
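The circuit-breaker behavior described above can be sketched in a few lines. This is a minimal illustration under my own assumptions (the class name, thresholds, and half-open handling are illustrative), not a production implementation; libraries such as Hystrix or resilience4j do this properly.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: open after N consecutive failures,
    fail fast while open, and allow a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # cool-down elapsed: let one trial request through (half-open)
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        # a successful call resets the breaker
        self.failures = 0
        self.opened_at = None
        return result
```

In practice the timeout on the underlying call and the breaker's threshold should be tuned together, since the breaker only sees failures the timeout surfaces.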

4. Idempotent

For an idempotent operation, executing it multiple times has the same effect as executing it once. If an operation is idempotent, we can call it repeatedly without fear of adverse effects. Idempotency is very useful when we are not sure whether an operation has already run and want to retry it.

For example, HTTP GET and PUT are defined as idempotent in the HTTP specification; of course, this only holds if you strictly follow the specification when defining these APIs.
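As a sketch of how a naturally non-idempotent operation (charging money) is commonly made safe to retry, here is a hypothetical service that deduplicates requests with a client-supplied idempotency key. The class and field names are my own illustration, not from the book.

```python
class PaymentService:
    """Sketch: replay the stored result for a repeated idempotency key,
    so retrying a charge never debits the account twice."""

    def __init__(self, opening_balance=100):
        self.processed = {}  # idempotency_key -> stored result
        self.balance = opening_balance

    def charge(self, idempotency_key, amount):
        if idempotency_key in self.processed:
            # duplicate request: return the original result, no side effect
            return self.processed[idempotency_key]
        self.balance -= amount
        result = {"status": "ok", "balance": self.balance}
        self.processed[idempotency_key] = result
        return result
```

A real system would persist the key-to-result mapping transactionally with the balance change, otherwise a crash between the two steps reintroduces the problem.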

5. Scaling

The purpose of scaling is to improve the overall performance and reliability of the system.

  • Use a more powerful host (vertical scaling).
  • Split the load, i.e. run each service on its own host; "host" here usually means a virtual host.
  • Don't put all your eggs in one basket: even if services are split across multiple virtual hosts, they may all sit on the same physical host, which is still a big risk. Try to keep your services running in different data centers.
  • Load balancing, for example with Nginx, can improve both throughput and reliability.
  • Redesign: the initial system will not be perfect, and the concurrency it can handle is limited. When significantly more load is foreseeable, consider a redesign.

6. Scaling the database

  • Scaling reads. Many business scenarios are read-heavy, and caches or read-only replicas can be used to scale read performance. MySQL and PostgreSQL both support replicas, but be aware that replica data may lag behind the primary: you get eventual consistency rather than real-time reads.
  • Scaling writes is harder than scaling reads. One approach is sharding, as in MongoDB. The problem with sharding is the potential database downtime when adding shards; you need to do a lot of thorough testing in advance.
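The read-replica idea can be illustrated with a toy router that sends writes to the primary and spreads reads across replicas round-robin. The connections here are placeholders and the SQL classification is deliberately crude; real drivers and proxies handle this far more carefully.

```python
import itertools

class ReplicaRouter:
    """Sketch: route writes to the primary, round-robin reads across
    read-only replicas. Connection objects are stand-in strings."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def connection_for(self, sql):
        # crude rule for illustration: only SELECTs go to replicas
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self._replicas)
        return self.primary
```

Note the eventual-consistency caveat above applies here: a read routed to a replica immediately after a write may not see that write yet.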

7. Caching

Caching is a commonly used method of performance optimization: by storing the results of previous operations, subsequent requests can reuse the stored value without spending time and resources recomputing it.

  • Client-side caching: the client decides when and whether to request fresh data, and the server returns hints to help the client decide.
  • Proxy caching: a proxy server sits between the client and the server. Reverse proxies and CDNs are good examples.
  • Server-side caching, such as Redis or Memcached, or an in-process cache inside the service itself.

Which caching approach to choose depends on what you want to optimize. Client-side caching can greatly reduce the number of network calls and is one of the fastest ways to cut load on downstream services, but once caching logic lives in a large number of clients, changing how you cache means changing all of them, which is very hard. With a proxy cache, the caching is opaque to both client and server, and it is usually a very simple way to add caching to an existing system. With server-side caching, everything is opaque to the client, and it is easy to track and optimize the cache hit rate.
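A minimal in-process, read-through cache with a TTL can stand in for what Redis or Memcached provide out of process. All names here are illustrative:

```python
import time

class TTLCache:
    """Toy in-process cache with per-entry expiry."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, stored_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # expired: drop and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic())

def get_product(cache, db_lookup, product_id):
    """Read-through: serve from cache, fall back to the origin on a miss."""
    cached = cache.get(product_id)
    if cached is not None:
        return cached
    value = db_lookup(product_id)
    cache.set(product_id, value)
    return value
```

The TTL is the knob that trades freshness against hit rate, which is exactly the staleness trade-off the "keep it simple" advice below is about.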

  • HTTP caching. HTTP provides very useful control mechanisms to help us cache on the client or server side.
    • First, we can add a Cache-Control directive to the response headers to tell the client whether the resource should be cached, and for how long. Standard static website content such as CSS and images suits this method well.
    • The response headers can also carry an Expires field, which specifies a date and time after which the resource expires and the client should fetch it again. This is appropriate when you know how soon a resource will be updated.
    • ETag is also a response header, but it works together with a request header; it indicates whether the resource has changed. For example, suppose a response for a resource returns the ETag 05td21d. Later, Cache-Control tells us the cached copy has expired, so we request the server again, this time adding the request header If-None-Match: 05td21d. The server compares that value against the resource's current ETag: if they do not match, it returns the new resource with 200 OK and a new ETag; otherwise it returns a 304 status code (Not Modified).
    • Note that the fields above overlap in function; make sure you fully understand how they interact before deciding to use them together.
  • Write caching
    • That is, write to a cache or queue first and persist to the database or final destination later, which also improves write performance.
  • Avoid cache misses that bring down services
    • The author's recommended approach is not to hit the origin service on a cache miss, but to fail fast, while telling the origin service to repopulate the cache immediately (asynchronously).
  • Keep it simple: if you cache in too many places, clients are much more likely to see stale data, which can cause a lot of problems.
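The ETag / If-None-Match exchange described above can be sketched as a tiny hypothetical handler (the tuple-returning shape and names are my own, not a real framework's API):

```python
def serve(request_headers, resource_body, current_etag):
    """Sketch of server-side ETag handling: return 304 with no body when
    the client's If-None-Match matches the resource's current ETag,
    otherwise return 200 with the body and the current ETag."""
    if request_headers.get("If-None-Match") == current_etag:
        return 304, {}, None  # client's copy is still valid
    return 200, {"ETag": current_etag}, resource_body
```

The client's side of the protocol is the mirror image: store the body and ETag from a 200, and on revalidation send the stored ETag back in If-None-Match.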

8. Automatic scaling

Automatic scaling means growing and shrinking your service cluster automatically to match the peaks and troughs of your traffic. This capability is generally provided by the hosting platform. There are two flavors: reactive scaling and predictive scaling; the latter requires observing the business's peak and off-peak patterns over a large amount of historical data.

9. CAP Theorem

One of the famous theorems of distributed systems. C (consistency), A (availability), P (partition tolerance): the theorem says a system can have at most two of these three properties. The author illustrates it with the classic database master-slave replication scenario; here I will state the conclusion directly.

There is no CA system among distributed systems, because sacrificing partition tolerance means running a single-process/single-node system, which is not a distributed system at all. Therefore, the distributed system we ultimately build must be CP or AP, and which one to choose must be weighed against the business scenario. For example, for a catalog service, is a record that is five minutes stale acceptable? Yes. Can a bank balance be stale data? Of course not.

Our system as a whole does not have to be uniformly AP or CP. The catalog service can be AP because stale records are tolerable, but the account or points service can only be CP, because you don't want a user to spend the same remaining 10 yuan on 10 yuan worth of goods twice. So is the microservice system AP or CP? In practice, what we do is push the CAP trade-off down into each function individually.

The author finally notes that no matter how consistent a system is internally, it cannot know everything that happens, especially when it keeps records about the real world; this is why, in many cases, an AP system turns out to be the right choice. Besides the complexity of building a CP system, even a CP system cannot solve every problem by itself.

10. Service Discovery

Service discovery provides a convenience: after a new service instance registers (or an instance is destroyed), clients can quickly find it (or fail over to other available instances).

  • DNS, the simplest approach. DNS can associate one name with multiple IPs, which is why visiting the same website several times does not always resolve to the same IP. The drawback of DNS is also obvious: a domain's DNS entry has a TTL, during which clients treat the entry as valid. When we want to repoint the name at a new host, the TTL delays clients from reaching it, and DNS entries can be cached in many places; the more places they are cached, the larger the delay. One way around this is to point the name at the IP of a load balancer and let the load balancer handle instances coming online and offline, much like the common Nginx-fronted website architecture today.
  • Zookeeper, originally developed as part of Hadoop, is used in many scenarios, including configuration management, data synchronization between services, leader election, message queues, and naming services; it needs at least three nodes deployed.
  • Consul also supports configuration management and service discovery, and goes a step further than Zookeeper by providing richer support for these scenarios. One of its killer features is a ready-made DNS server: for a given name, it can serve an SRV record containing an IP and port, which means that if your system already uses DNS and supports SRV records, you can start using Consul directly. It also provides node health checks. Consul exposes RESTful HTTP interfaces for service registration, key-value reads and writes, and health checks.
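To make the register/lookup/heartbeat pattern behind tools like Zookeeper and Consul concrete, here is a toy in-memory registry. It mimics the idea only; it is not either tool's actual API, and all names are my own.

```python
import time

class ServiceRegistry:
    """Toy registry: instances register and heartbeat; lookups return
    only instances whose last heartbeat is within the TTL."""

    def __init__(self, ttl_seconds=30):
        self.ttl = ttl_seconds
        self._instances = {}  # service name -> {address: last_heartbeat}

    def register(self, name, addr):
        self._instances.setdefault(name, {})[addr] = time.monotonic()

    def heartbeat(self, name, addr):
        # a heartbeat is just a re-registration that refreshes the timestamp
        self.register(name, addr)

    def lookup(self, name):
        now = time.monotonic()
        live = {a: t for a, t in self._instances.get(name, {}).items()
                if now - t <= self.ttl}
        self._instances[name] = live  # evict instances that stopped heartbeating
        return sorted(live)
```

The TTL-based eviction is the same idea as Consul's health checks: an instance that stops checking in is removed from lookups rather than explicitly deregistered.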

11. Documentation

  • Swagger, which lets you describe your API and provides a very friendly web interface through which the API can be invoked directly. To make this work, Swagger needs the service to provide description files in its format, and there are libraries in many languages to help you generate them.


Source: blog.csdn.net/sc_lilei/article/details/107040949