22- "The Essence of Distributed Systems Architecture" series, 02: lessons from Amazon's practice and the difficulties of distributed systems

First, Amazon's architecture requirements

　　The earliest company to practice a distributed services architecture was probably Amazon. Back in 2002 it issued the following architecture requirements, which likely laid the foundation for the emergence of AWS (Amazon Web Services):

　　1. All teams must expose their data and functionality through service interfaces.

　　2. Teams' program modules must exchange information and communicate with each other through these interfaces.

　　3. No other form of communication is allowed: no direct linking against another team's program (for example, linking it as a dynamic link library), no direct reads of another team's database, no shared-memory model, no back doors into another team's modules, and so on. The only communication permitted is a call to a service interface.

　　4. Any technology may be used, for example HTTP, CORBA, Pub/Sub, or custom network protocols.

　　5. All service interfaces, without exception, must be designed from the ground up so that they can be opened to the outside world. In other words, teams must plan and design their interfaces so that they can later be exposed to developers everywhere. No exceptions.

　　6. Anyone who does not do this will be fired.

　　As mentioned earlier, a distributed system architecture brings many problems, such as:

　　1. The ticket for a single production failure gets passed back and forth between different services and different teams;

　　2. Every team can become a potential DDoS attacker unless each service enforces quotas and rate limiting;

　　3. Monitoring and troubleshooting become more complex unless very powerful monitoring tools are in place;

　　4. Service discovery and service governance become very complicated.
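
　　The quota and rate-limiting check in point 2 can be sketched as a token bucket kept per calling team. This is only a minimal illustration, not Amazon's actual mechanism: the class names, the `rate` and `capacity` parameters, and the idea of keying buckets by team name are all assumptions made for the example.

```python
import time


class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def allow(self):
        # Refill according to the time elapsed since the last call, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class QuotaGuard:
    """Keeps one bucket per caller so that one noisy team cannot starve the others."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, caller):
        if caller not in self.buckets:
            self.buckets[caller] = TokenBucket(self.rate, self.capacity, self.clock)
        return self.buckets[caller].allow()
```

　　A service would call `guard.allow(caller)` before doing any work and reject the request when it returns `False`. The `clock` parameter exists only so the behavior can be checked without waiting for real time to pass.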

　　Facing these problems, Amazon has, through years of practice, learned to operate and manage an extremely complex distributed services architecture. The main points are as follows:

　　1. A distributed services architecture requires distributed teams

　　At Amazon, each service is owned by one small team (a "two-pizza team", small enough to be fed with two pizzas). The team covers everything from the front end to the data, and from requirements analysis to production operations. Work is divided by responsibility rather than by skill.

　　2. Troubleshooting distributed services is not easy

　　When a serious failure occurs, the whole system has to be investigated. When an S2 failure appears, you can see someone from every team come online. The ticketing system shows that, at the start of the failure, everyone reports in and checks their own system. Even those whose systems have no problem stay online on standby until the problem is solved.

　　3. No dedicated testers and no dedicated operations staff: developers do everything

　　The benefit of having developers do everything is that they eat their own dog food (Eat Your Own Dog Food). They maintain and support the code they write, which teaches them how easy or hard that code is to maintain. As a result, developers take the long-term maintenance of the software into account when they receive requirements, do the design, write code, and build tools.

　　4. Operations first: simplify and automate

　　To operate such a complex system, Amazon has invested a great deal of effort in operations internally. What people now call DevOps, Amazon was already doing 10 years ago. Operations is where Amazon is strongest: it relentlessly simplifies and automates its systems, which is what lets it operate the AWS cloud platform, with its tens of millions of virtual machines, with relative ease.

　　5. Internal services and external services are treated consistently

　　Whether in security, interface design, operations processes, or troubleshooting, Amazon treats its internal systems the same as its external ones. The advantage is that an internal service can be opened up at any time. Moreover, from day one, every service provider is capable of serving external users. Imagine what a team held to such a standard is capable of.

Second, problems that distributed systems need to pay attention to

1. Lack of standards in heterogeneous systems

　　The lack of standards in heterogeneous systems shows up mainly in:

- Non-standard software and applications
- Non-standard communication protocols
- Non-standard data formats
- Non-standard development and operations processes and methods

　　Different software and different languages naturally come with different compatibility constraints and different standards for development, testing, and operations. This forces us to develop and operate each of them in a different way, which raises the complexity of the architecture. For example, some software is reconfigured by editing its .config file, while other software requires a call to a management API.

　　In terms of communication, different software may use different protocols, and even with the same protocol the data formats may differ. Different teams use different technologies and develop and operate in different ways. These differences make the whole distributed architecture very complex, so a distributed architecture needs appropriate standards. Take network communication as an example: when an error occurs, many service APIs do not return an HTTP error status code. They return the normal status code 200 and put the error message in a JSON string in the HTTP body. This makes monitoring very difficult. Today, a specification such as Swagger should be used to standardize this.
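
　　The monitoring difficulty can be illustrated with a small sketch that contrasts the two response styles. The function names and the JSON field names (`success`, `error`) are invented for the example; the point is only that a generic monitor can inspect the status code without understanding each service's body format.

```python
import json
from http import HTTPStatus


def anti_pattern_response(error):
    """Always answers 200 and hides the failure inside the JSON body."""
    return HTTPStatus.OK, json.dumps({"success": False, "error": error})


def standard_response(status, error):
    """Reports the failure in the HTTP status code itself."""
    return status, json.dumps({"error": error})


def is_error_by_status(status):
    """The only check a generic monitor can make without parsing the body."""
    return status >= 400
```

　　With the first style, `is_error_by_status` reports a healthy request even though the call failed, so the monitor would need service-specific body parsing to notice the error.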

　　Take software configuration management as another example: in many companies, configuration is managed as key-value pairs. This is very flexible, and that flexibility is easily abused: keys are named inconsistently, values follow no standard, and some teams even put front-end display content directly into the configuration.

　　Good configuration management should be divided into three layers: the bottom layer relates to the operating system, the middle layer relates to middleware, and the top layer relates to business applications. Users should not be allowed to modify the bottom and middle layers freely. Instead, templates should be provided for them to choose from, rather than arbitrary configuration.
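
　　One possible shape for such layered configuration is sketched below. The template names, keys, and values are hypothetical and serve only to show the idea: the two lower layers are chosen from fixed templates, and only the top layer accepts free-form application settings.

```python
# Hypothetical templates; the names and values are illustrative only.
OS_TEMPLATES = {
    "small": {"max_open_files": 4096},
    "large": {"max_open_files": 65536},
}
MIDDLEWARE_TEMPLATES = {
    "cache-default": {"redis_maxmemory_mb": 512},
    "cache-heavy": {"redis_maxmemory_mb": 4096},
}


def build_config(os_template, middleware_template, app_settings):
    """Combines the three layers; the lower two can only be selected, not edited."""
    if os_template not in OS_TEMPLATES:
        raise ValueError(f"unknown OS template: {os_template}")
    if middleware_template not in MIDDLEWARE_TEMPLATES:
        raise ValueError(f"unknown middleware template: {middleware_template}")
    return {
        "os": dict(OS_TEMPLATES[os_template]),
        "middleware": dict(MIDDLEWARE_TEMPLATES[middleware_template]),
        "app": dict(app_settings),
    }
```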

　　Another example is the data communication protocol, which generally has a protocol header and a protocol body. The header defines the basic protocol data, and the body carries the actual business data. We need every team to define its protocol header according to the same specification, so that requests can be easily monitored, controlled, and managed.
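
　　A shared header specification might look like the following sketch. The header fields (`service`, `method`, `trace_id`, `version`) and the JSON envelope are assumptions chosen for illustration, not a format the article prescribes. What matters is that infrastructure can read the header without knowing anything about the business body.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Header:
    """Fields every team fills in the same way (hypothetical field set)."""
    service: str
    method: str
    trace_id: str
    version: str = "1"


def encode(header, body):
    """Wraps business data in the common envelope."""
    return json.dumps({"header": asdict(header), "body": body}).encode("utf-8")


def read_header(raw):
    """What monitoring or routing code needs: the header only, never the body."""
    return Header(**json.loads(raw.decode("utf-8"))["header"])
```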

2. Service dependency problems in the system architecture

　　In a traditional monolithic application, when one machine goes down, the whole application goes down with it. Does that mean this cannot happen in a distributed architecture? In fact, services in a distributed architecture also depend on each other: if one service in a dependency chain goes down, it can cause a domino effect.

　　As described above, service dependencies in a distributed system bring problems of their own:

- If a critical service depends on a non-critical service, the non-critical service becomes critical as well.
- A service dependency chain is subject to the "short board" (weakest link) effect: the SLA of the whole chain is determined by its worst service.
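
　　The short-board effect can be made concrete with a little arithmetic. The sketch below adds two assumptions that are not in the text above: every request needs every service in the chain, and the services fail independently. Under those assumptions the availability of the chain is the product of the individual availabilities, which can never be better than the worst service. The availability figures are made up.

```python
from math import prod


def chain_availability(availabilities):
    """Availability of a serial chain, assuming independent failures."""
    return prod(availabilities)


# Made-up figures: two services at 99.9% and one weak link at 99%.
chain = [0.999, 0.999, 0.99]
result = chain_availability(chain)
# The chain ends up below its weakest member.
print(result <= min(chain))
```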

　　This is the domain of service governance. Service governance requires us not only to define how critical each service is, but also to define or describe the main call paths of critical businesses or services. Without service governance, operations cannot manage the system as a whole.

　　Many distributed architectures isolate businesses at the application layer but not at the database layer. If a non-critical business drags the database to a crawl, the whole site becomes unavailable. The database therefore needs corresponding isolation as well, ideally with each line of business having its own database. This is the Amazon practice: systems may not read each other's databases and are coupled only through service interfaces. It is also a requirement of microservices: we must split not only the services but also the database behind each service.

3. Higher probability of failure

　　Because a distributed system uses many machines and services, failures occur more often than in a traditional monolithic application. On the one hand, a failure in a monolith has a very large impact, whereas in a distributed system the impact of a failure can be isolated. With so many machines and services, however, failures are also more frequent. On the other hand, because the system is complex to manage and no one knows the whole architecture, it is very easy to make mistakes. Operating a distributed architecture can be a nightmare.

　　The following two points can be taken as golden rules:

- A failure is not frightening; what is frightening is a recovery time that is too long.
- A failure is not frightening; what is frightening is an impact that is too large.

　　Operations teams for distributed systems are very busy and spend almost all of their time handling failures. Many companies respond by piling more and more monitoring metrics onto their systems, which is a thankless effort: too much information equals no information. Instead, the SLA requires us to define "Key Metrics", that is, the key indicators. Chasing quantity instead of quality is being diligent in tactics while being lazy in strategy.

　　Above all, the goal is "fire prevention" rather than "firefighting." When we design or operate a system, we must consider how to reduce failures and design for them (Design for Failure). Where a failure cannot be avoided, it should be recovered from automatically, so that its impact is as small as possible.

　　As the number of machines and services grows, the inherent limitations of people become the bottleneck: people cannot manage that much complexity in every detail, and only automation can help us.

4. A multi-tier architecture makes operations more complex

　　We usually divide the system into four layers:

- Base layer: our machines, networks, and storage devices;
- Platform layer: our middleware layer, with software such as Tomcat, MySQL, Redis, and Kafka;
- Application layer: our business software, for example the services that implement each function;
- Access layer: the gateways through which user requests enter, such as load balancers, CDN, and DNS.

　　A problem in any one layer can cause problems for the system as a whole. Without a unified view and unified management, operations become fragmented, which creates even more complexity.

　　Many companies divide labor by skill, splitting the technical organization into sub-teams such as product development, middleware development, business operations, and system operations. The result is that each team manages only its own patch, and many things are never fully connected end to end. The entire system becomes like a row of dominoes: when one part has a problem, a large area falls with it. Because there is no unified operations view, no one knows which services and resources a service call passes through, so when a failure occurs a great deal of time is spent on communication and on locating the problem.

　　Division of labor is not the problem. The problem is whether collaboration after the division is unified and standardized. This point deserves close attention.