Cloud Native Architecture Reading Notes: Summary

Reading notes on Cloud Native Architecture: Advanced Practice

Cloud native concept

The concept of Cloud Native was first proposed by Matt Stine of Pivotal in 2013. The concept has since been continuously refined by the community and its scope keeps growing; it currently covers major topics such as DevOps (a combination of Development and Operations), Continuous Delivery (CD), MicroServices, Agile Infrastructure, and the Twelve-Factor App.

The Cloud Native Computing Foundation (CNCF) was established in 2015 and revised the definition of cloud native, holding that cloud native needs to include application containerization, a microservice-oriented architecture, and support for container orchestration and scheduling.

[Figure: Cloud three-tier model vs. cloud native architecture]


Cloud native architecture content


Cloud native mainly includes two parts: cloud native infrastructure and cloud native applications.

Why you need cloud native


The core idea of the cloud is elasticity.

From the perspective of technological development:

Open source has made cloud computing more and more standardized, and containers have become the standard for enterprise application distribution and delivery, which can decouple applications from the underlying operating environment;

Kubernetes has become the standard for resource scheduling and orchestration, shielding the differences in underlying architecture and helping applications run smoothly on different infrastructures;

On this basis, upper-layer application abstractions (such as microservices and service meshes) are established, gradually forming a standard for the modernization and evolution of application architecture.

Developers only need to focus on their own business logic and do not need to pay attention to the underlying implementation.

Cloud native is reshaping the entire software technology stack and life cycle through methodologies, toolsets, and concepts, helping enterprises and developers build and run systems on the cloud that are elastically scalable, fault-tolerant, easy to manage, and easy to observe.

Cloud native design principles

1. Decentralization principle

SOA generally has a centralized Enterprise Service Bus (ESB) responsible for the registration, discovery, and call routing of all services.

Although the microservice architecture also has a service registry, the registry is only responsible for pushing service information when an application starts or its status changes. At runtime, calls between microservices are point-to-point direct calls; that is, the runtime is decentralized.

2. Loose coupling principle

1. Loose coupling of implementation

The service consumer does not depend on any particular implementation of the service contract, so it can freely switch to another provider of the same contract in the future.

2. Loose coupling of time

A typical example is an asynchronous message queue system. Because a broker sits in between, the producer and the consumer do not have to be available at the same time or match each other's throughput, and the producer does not need to wait for an immediate reply.

3. Loose coupling of location

A typical example is the service registry: the consumer does not need to know the exact location of the server, but looks services up through the registry in order to access them.

4. Loose coupling of versions

The consumer does not need to depend on a specific version of the service contract to work, which requires the service contract to remain backward compatible as far as possible when it is upgraded.

3. Design-for-failure principle

When an exception occurs, the system should fail fast and then recover quickly, so that the business stays online and is never left hanging.

Design for failure means that all external calls are fault-tolerant.

When designing the system architecture, it must be assumed that every layer of the application system, both hardware and software, can fail, and single points of failure must be eliminated accordingly in order to achieve a highly available (HA) system architecture.

4. Stateless principle

Cloud native application services should be designed to be as stateless as possible, which makes the business inherently scalable: during traffic peaks and troughs, the application relies on the elasticity of the cloud to scale out and in automatically to meet demand.
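As an illustration of that elastic scaling, here is a minimal sketch assuming Kubernetes and its HorizontalPodAutoscaler; the Deployment name and thresholds are illustrative assumptions:

# Minimal sketch: scale a stateless Deployment between 2 and 20 replicas
# based on average CPU utilization. Names and numbers are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web              # assumed stateless Deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above roughly 70% CPU

Because the service keeps no local state, any replica can serve any request, which is what makes this kind of scale-out safe.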

Stateless means that when processing a request, the service relies on nothing beyond the request itself and performs no additional operations other than responding to it.

The idea behind transforming a "stateful" business process into a "stateless" one is fairly simple, and there are two main methods.

(1) State separation: all state on the server side is stored in external, independent distributed storage (such as a cache, message queue, or database).

(2) The request carries all state: the state is moved to the caller side; the request's input parameters are enriched so that the upstream client passes in all the data the service needs to process.

Stateful web services

In a stateful web service, information about the interaction between the Client and the Server (such as the user's login state) is saved in the Server's Session. Under this premise, a user's requests can only be accepted and understood by the server that holds that user's state, which means the Server side of a stateful Web system cannot freely load-balance user requests (a Client's request can only be processed by a designated Server). This also creates a fault-tolerance problem: if the designated server goes down while the user is making requests, all of that user's recent interactions are lost and cannot be transferred to another server, so the request becomes invalid.

Stateless web services

In a stateless web service, every request must be independent and requests are completely decoupled from one another. The Server does not save the Client's state, so every request the Client sends must contain all the information the server needs to understand it, including the Client's own state. This allows a Client's request to be answered by any available Server, which lets the Web system scale out to serve a large number of Clients.

Stateful and stateless describe applications in a computer system:

  • Stateful: means that the application will save some data, which reflects the running status of the application. That is, the running state of the application may change between different requests. This data is typically stored in memory or in a database and can be shared and accessed by multiple users or clients.
  • Stateless: means that the application does not save any data, and each request is independent and does not depend on previous requests. In other words, the application regenerates the results for each request and does not optimize or cache based on previous results. In this case, data can be passed as parameters of the request.
    For example, a calculator program could be a stateless application. Each calculation relies only on the input data (such as the operands) and is not optimized based on previous calculation results.
    An online shopping application is usually a stateful application. Because it needs to record the user's shopping cart, order and other information, which changes with the user's login and operation. If the application is a stateless application, then each time you purchase an item, you need to re-enter the order and delivery information, and the user experience will be very bad.
    Java web applications are mostly stateful, and they need to deal with concepts such as session and state. However, if a web application uses a RESTful architectural style, then it is a stateless application and does not save session information or other related data.

5. Immutability principle

The goal is for all services (including their environments) to be configured in a uniform, undifferentiated way so that they can be standardized and migrated, and for no manual operations to be required when deploying any service.

The premise for realizing the principle of immutability is that every service and component in the infrastructure can be automatically installed and deployed without manual intervention. All resources can be pulled up and released at any time, and elastic, on-demand computing and storage capabilities are provided through APIs.

To improve availability, the time spent repairing faults should be minimized, bearing in mind that replacing an instance is much faster than repairing it.

6. Automation-driven principle

Automation is organized into stages such as continuous integration, continuous deployment, and continuous delivery, which carry requirements through design, development, and testing and then get the code deployed quickly and safely into the production environment.

Continuous integration means that whenever a developer submits a change, it is immediately built and tested automatically to ensure that business applications and services meet expectations, so that it can be determined whether the new code integrates correctly with the existing code.

Continuous deployment refers to a fully automated process in which every change is automatically pushed to the test environment and triggers the automated test cases; once verification passes, the application is safely deployed to the production environment, connecting development, testing, production, and the other stages.

Continuous delivery is the ability to release software: after continuous integration completes, the build can be promoted to an environment such as pre-release (staging) so that it satisfies the conditions of the production environment.
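As a hedged illustration of the continuous integration step, the sketch below assumes GitHub Actions as the CI system; the make test target and image name are illustrative assumptions:

# Minimal CI sketch: build and test on every push so each change is verified
# before it moves further down the delivery pipeline.
name: ci
on:
  push:
    branches: [ main ]
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run unit tests
        run: make test                     # assumed test entry point
      - name: Build container image
        run: docker build -t example/app:${{ github.sha }} .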

Cloud native infrastructure

Cloud native infrastructure is an infrastructure hidden behind abstractions that is managed by software and controlled by APIs. Its goal is to run application systems.

Cloud native infrastructure is not just running infrastructure on the public cloud, nor is it running applications in containers, so public cloud and containerization are not synonymous with cloud native infrastructure.

However, cloud-native infrastructure can be achieved with the help of containerization technology and public cloud technology.

Public cloud is only an implementation of the IaaS layer; generally, using a public cloud still requires people to apply for and allocate resources.

Cloud native infrastructure, by contrast, is requested and provisioned automatically by program code.

Containers are just a way of packaging applications, which does not mean that these applications are autonomous.

Even if an application is automatically built and deployed through DevOps pipelines such as continuous integration and continuous delivery, it is not necessarily cloud native infrastructure.

Kubernetes cannot simply be called cloud-native infrastructure.

Kubernetes' container orchestration technology provides necessary platform support functions for cloud native infrastructure.

The key to whether it is a cloud-native infrastructure is whether it uses automated processing.

For example, manually provisioning persistent volumes in Kubernetes cannot be called cloud native infrastructure, because the volumes are not allocated automatically.

If you instead use dynamically provisioned volumes, rely on volume claims to allocate capacity, and then schedule containers that consume those volumes, you meet the requirements of cloud native infrastructure.
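A minimal sketch of that dynamic provisioning flow, assuming Kubernetes; the StorageClass name, size, and image are illustrative assumptions:

# A PersistentVolumeClaim triggers dynamic provisioning; no one pre-creates
# the volume by hand.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard      # assumed StorageClass with a dynamic provisioner
  resources:
    requests:
      storage: 10Gi
---
# A Pod then consumes the automatically provisioned volume.
apiVersion: v1
kind: Pod
metadata:
  name: data-consumer
spec:
  containers:
    - name: app
      image: nginx:1.25           # illustrative image
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-claim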

The basic requirements for infrastructure for cloud native applications are:

● Uptime and isolation.
● Resource allocation and scheduling.
● Environment isolation.
● Service discovery.
● State management.
● Monitoring and logging.
● Metric aggregation.
● Debugging and tracing.

Cloud native applications and infrastructure collaborate to discover their related dependent services.

Prometheus implements a service discovery mechanism that actively detects changes in the monitored targets and automatically adds, removes, and updates them. The following snippet lets Prometheus automatically discover a service on Kubernetes: the annotation prometheus.io/scrape: "true" makes Prometheus aware of the service, which exposes metrics over HTTP on port 8080 at the /metrics endpoint.

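A minimal sketch matching that description, assuming the conventional prometheus.io/* annotations read by a typical kubernetes_sd_configs relabeling setup (a widespread convention rather than a built-in Prometheus feature); the service and label names are illustrative:

apiVersion: v1
kind: Service
metadata:
  name: example-service            # illustrative name
  annotations:
    prometheus.io/scrape: "true"   # make Prometheus aware of the service
    prometheus.io/port: "8080"     # metrics are served over HTTP on port 8080
    prometheus.io/path: "/metrics" # at the /metrics endpoint
spec:
  selector:
    app: example                   # illustrative pod label
  ports:
    - name: http-metrics
      port: 8080
      targetPort: 8080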

Typically, cloud native applications are built as a set of microservices that run in Docker containers, are orchestrated by Kubernetes, and are deployed and managed using DevOps and GitOps workflows.

Cloud native architecture is a software development methodology that includes technical implementation and management (organizational processes). The technical implementation part mainly includes agile infrastructure, cloud public basic services and microservices; the organizational process part mainly includes continuous delivery and DevOps.

Cloud native application architecture includes three characteristics: containerization, microservices and DevOps.

Cloud native applications

The key to cloud-native applications is providing elasticity, agility, operability, and observability.

The concept of resilience implies allowing applications to fail rather than trying to prevent them from failing.

Agility allows rapid deployment and rapid iteration of applications, which requires the introduction of DevOps culture.

Operability means controlling the life cycle of an application from within the application itself, rather than relying on external processes and monitors.

Observability refers to the need for an application to provide information that reflects the state of the application.

Common methods currently used to implement the features required for cloud native applications include:

● Microservices.

● Health status report.

● Automatic measurement data.

● Resiliency.

● Declarative mode rather than reactive mode.

Microservices

Traditional applications are managed and deployed targeting a single entity, called a monolithic application.

The benefits of a monolithic application are obvious, but it cannot handle the concurrency of serving a huge number of Internet users, it makes the development process bloated and slow, and maintenance becomes increasingly difficult. One of the best ways to solve these problems is to decompose the monolithic application into many small service modules.

These service modules are independent of each other, allowing developers to maintain these small systems independently, and the development and maintenance process becomes agile.

After decomposition into microservices, each service can also choose its own implementation language, as long as it complies with the overall API-first and communication requirements.

[Figure: Microservice architecture]

Microservices are more like the practice and transformation of UNIX philosophy. The UNIX philosophy is "A program should focus on one goal and do it as well as possible. Let programs work together with each other."

The same is true for microservices. Services are more focused on their purpose, that is, they should only do one thing and do it well.

However, microservices cannot be equated with cloud native architecture. Microservices are just an implementation of cloud native culture.

Health status report

In order for software to control everything, the application must provide metrics that management software can monitor. The metrics of an application are best known to the author who created the application, so building metrics into the application is the best way to design it.

This requires each application to provide the necessary endpoints that management software can access to determine application status. For example, Kubernetes and ETCD provide a large number of metrics through HTTP.

In addition, the application should provide richer and necessary status reporting.

Under the cloud native architecture, everything is code and everything can be controlled by software.

In order to be controllable, each application must provide a measurement interface so that the management software can learn its running status and respond as needed. For example, when an application crashes, the manager can stop the current instance and start a new one. Application health is only one part of automating an application's life cycle; the manager also needs to know whether the application is actually doing its work.
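A minimal sketch of such health endpoints wired into the platform, assuming Kubernetes liveness and readiness probes; the /healthz and /readyz paths, port, and image are assumptions about the application:

apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: web
      image: example/web:1.0       # illustrative image
      ports:
        - containerPort: 8080
      livenessProbe:               # failing this gets the container restarted
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 10
      readinessProbe:              # failing this removes the pod from load balancing
        httpGet:
          path: /readyz
          port: 8080
        periodSeconds: 5

The liveness probe answers "is it alive", while the readiness probe is closer to "is it doing its work", which is the distinction drawn above.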

Automatic measurement data

Automated measurement data is the information necessary to make decisions. This data overlaps with health reporting data, but their purpose is different.

The health report informs the management program of the life cycle status of the application under its jurisdiction.

Automated measurement data consists of the metrics that describe the application's business behavior.

The measured indicators are generally called Service Level Indicators (SLIs) or Key Performance Indicators (KPIs). These metrics are application-specific data that allow the management software to monitor application performance against Service Level Objectives (SLOs). Automated measurement data can answer questions such as:

● How many requests the application receives per minute.
● Whether any errors occurred.
● How high the application's latency is.
● How long business processing takes.

Monitoring data is often captured or pushed to a time series database (such as Prometheus or InfluxDB), and then processed and analyzed by the metric model for subsequent reminders or large-screen display.

In a dynamic self-healing environment, the management program cares less about the life cycle of individual applications and more about the application's SLO, because if a program crashes, the management program can dynamically restart an application instance and restore normal operation.

For example, running the following command in Kubernetes shows that the coredns pods have been restarted two and three times respectively, but the management program does not care about these restarts, only whether the pods are running normally, because its SLO is simply that the service keeps running.


SLO: Service Level Objective, the target quality of service. It usually refers to the top-level goal for keeping the system running, set as an expected availability expressed as a percentage over a period of time.
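As a worked example: an SLO of 99.9% availability over a 30-day month allows at most 30 × 24 × 60 × 0.1% ≈ 43.2 minutes of accumulated downtime; this error budget is what the management software and the team operate within.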

Resilient handling of failures

Cloud native applications should embrace failure instead of trying to avoid it. The only systems that must never fail are those that sustain life; any other system should have a reasonable SLO, and ignoring the SLO in order to avoid any failure at all comes at a huge cost. One must therefore assume that the application may fail and take the necessary steps to deal with failure, which is a basic pattern of cloud native applications.

No matter what kind of failure occurs, cloud native applications must adapt and take reasonable adjustment measures to deal with it.

In addition, cloud-native applications also need to design a method to deal with overload. A common way to deal with overload is to moderately degrade.

Cloud native applications require the ability to gracefully degrade services. The most realistic way to deal with this is to downgrade the service, return a partial response or respond with old information in the local cache.

Declarative communication

Because cloud-native applications run in a cloud environment, they interact with infrastructure and supporting applications differently than traditional applications. In a cloud-native application, the way to communicate with anything is through the network.

Many times, network communication is done through RESTful HTTP calls, but it can also be achieved through other interfaces, such as remote procedure calls (RPC).

In traditional applications, the communication medium may be files or message queues, but these approaches try to build in ways of avoiding failure, and they cause problems under a cloud native architecture. For example, suppose an application writes its result to a file and then crashes. The result has already been written, but under the cloud native approach the application is simply restarted, the computation runs again, and the result is written to the file a second time. Developers should therefore stop using reactive communication and start using declarative communication, which makes applications more robust and reduces their dependencies.

Imperative programming (Imperative): you tell the machine in detail how to do something (How) in order to achieve the result you want (What).

Declarative programming (Declarative): you only state the desired result (What), and the machine works out the process (How) by itself.

for example:

// Imperative approach: spell out how to walk the array and collect the matches
const dataArr = [1, 2, 3, 4, 3];   // sample data
let res = [];
for (let i = 0; i < dataArr.length; i++) {
    if (dataArr[i] === 3) {
        res.push(dataArr[i]);
    }
}
console.log(res); // [3, 3]

// Declarative approach: state what is wanted and let filter() work out how
const res2 = dataArr.filter(item => item === 3);
console.log(res2); // [3, 3]

An example can also be used to illustrate:

Question: I'm right next to Wanda, how do I get to your house?

Imperative reply: Go straight along Zhongshan Road, turn right at the second traffic light ahead, and then go straight for about 100 meters. You will see the house number 666, which is my home.

Declarative reply: My home address is No. 666 Zhongshan Road.

Twelve-factor application

(1) One codebase (Codebase), many deploys (Deploy)

(2) Explicitly declare dependencies (Dependency)

(3) Store configuration in the environment


This allows applications to be easily modified between deploys without changing a single line of code (a configuration sketch follows after this list).

(4) Treat backing services as additional resources

(5) Strictly separate build, release and operation

(6) Run the application as one or more stateless processes

(7) Provide services through port binding (Port binding)

(8) Scale out via the process model

(9) Fast startup and graceful termination maximize robustness

(10) Keep development, pre-release and online environments the same as possible

(11) Treat logs as event streams

(12) Background management tasks run as one-time processes
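A minimal sketch of factor (3), storing configuration in the environment, assuming Kubernetes; the ConfigMap name, keys, and image are illustrative assumptions. The same image is reused across deploys and only the injected environment differs:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  DATABASE_URL: "postgres://db.staging.internal:5432/app"   # illustrative value
  LOG_LEVEL: "debug"
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: example/app:1.0       # identical image in every environment
      envFrom:
        - configMapRef:
            name: app-config       # only the environment changes per deploy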

To achieve a high-quality microservice environment, you do not need to follow these factors strictly. However, by keeping them in mind, users can build and maintain portable applications or services in a continuous delivery environment, which is very important.

Implementing a cloud-native model

Cloud native infrastructure is maintained by applications, and cloud native applications are maintained by the infrastructure; the two are inseparable. This requires that both infrastructure and application design be simple. If an application is relatively complex, it should adopt the microservice model: split the complex functionality into small services and then assemble these small services into an application system. A complex system composed of such microservices cannot be managed manually and must be managed automatically, which is also a basic characteristic of cloud native applications.

Under the cloud native architecture, the life cycle of applications is also controlled by software, and ordinary users do not need to care about the life cycle of applications. Application integration, testing, and deployment should be automated, self-service, and conducted in accordance with a DevOps culture.

The life cycle of an application describes the application's journey in its host environment from creation through running to termination. What a user perceives directly is that an application has started, has exited, or is running in the background.

Microservices

[Figure: Monolithic architecture vs. microservice architecture]

A microservice is essentially an application service that can be released independently, so it can be upgraded, rolled out gradually (grayscale release), or reused as an independent component, with limited impact on the larger application. Each service can be built independently by a dedicated team, and a dependent party can start development as soon as the input and output interfaces are agreed; even the organizational structure of the whole team is streamlined, so communication costs are low and efficiency is high.

Microservices split a large, complex software application into multiple simple applications, each describing a small piece of business. Each simple application can be deployed independently, the applications are loosely coupled, and each focuses on doing one task and doing it well. Compared with the traditional monolithic architecture, the microservice architecture reduces system complexity and supports independent deployment, independent scaling, and cross-language development.

The microservice architecture is not a technological innovation; it is a requirement that emerges when the development process reaches a certain stage, derived from continuous exploration in practice. The core idea of microservices is to split the application into simple parts.

The microservice architecture does have many attractions, but its introduction also comes at a cost. It is not a silver bullet. Using it will introduce more technical challenges, such as performance delays, data consistency, integration testing, fault diagnosis, etc. Enterprises need to make reasonable introductions according to different stages of business.

At present, in the practice of microservice technology architecture, there are two main implementation forms: intrusive architecture and non-intrusive architecture.

Intrusive Microservice Architecture: Spring Cloud

The traditional intrusive architecture represented by Alibaba HSF, open source Dubbo and Spring Cloud occupies the mainstream position in the microservice market.

The intrusive architecture deploys process components and business systems in one application to realize workflow automation within the business system.

Since the services and communication components of the intrusive architecture are interdependent, when the number of service applications increases, the intrusive architecture will face new challenges at the service governance levels such as inter-service invocation, service discovery, service fault tolerance, service deployment, and data invocation.

The service mesh is driving the microservice architecture into a new era. A service mesh is a non-intrusive architecture responsible for network calls, rate limiting, circuit breaking, and monitoring between applications; it ensures that application call requests can travel reliably through a complex microservice application topology.

Non-intrusive microservice architecture: service mesh

A service mesh is a dedicated infrastructure layer that handles service-to-service communication. It is responsible for reliably delivering requests through the complex service topologies that make up modern cloud native applications. In practice, service meshes are typically implemented as a set of lightweight network proxies (often referred to as the sidecar pattern) that are deployed alongside the application code, without the application needing to be aware of them.
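A minimal sketch of that sidecar shape, assuming Kubernetes; the image names and ports are illustrative, and in a real mesh such as Istio the proxy container is usually injected automatically rather than written by hand:

apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar
spec:
  containers:
    - name: app                          # the business container
      image: example/app:1.0             # illustrative image
      ports:
        - containerPort: 8080
    - name: proxy                        # sidecar proxy handling service-to-service traffic
      image: envoyproxy/envoy:v1.29.0    # illustrative proxy image and tag
      ports:
        - containerPort: 15001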

The concept of a service mesh as an independent layer is related to the rise of cloud-native applications. In a cloud-native architecture, with the help of an orchestrator like Kubernetes, a single application may contain hundreds of services, each service may have thousands of instances, and each instance may be in a constantly changing state. This makes communication between these service instances not only very complex, but also critical to ensure end-to-end performance and reliability.

Today, service mesh has become a key component of the cloud native stack.

In May 2017, Google, IBM, and Lyft jointly open-sourced Istio.

Kubernetes is mainly a complete system for managing containerized workloads and services through API or declarative configuration.

The service mesh does not bring new features to software development and deployment; it solves problems that other tools have solved before, but this time for Kubernetes environments under a cloud native architecture. The main characteristics of a service mesh are:

● Middle layer for inter-application communication.

● Lightweight network proxy.

● Application-agnostic.

● Decouple application retries/timeouts, monitoring, tracing and service discovery.

Currently, the two popular open source service meshes, Istio and Linkerd, both integrate with Kubernetes, with Istio seeing broader acceptance and adoption.

A service mesh helps manage traffic through service discovery, routing, load balancing, health checks, and observability.

At the beginning of 2018, the release of Istio, a project jointly developed by Google, IBM, and Lyft, marked the service mesh leading the microservice architecture into a new era.

The future of cloud native

(1) Use cloud native infrastructure architecture in a hybrid cloud environment.

(2) Introduce cloud native technology into edge computing to simplify management.

(3) The service mesh continues to develop, with Istio leading the way.

(4) Develop the Kubernetes-based integrated function-as-a-service technology fPaaS (also known as FaaS).

(5) Based on cost considerations, cloud native technology is more often deployed on bare metal or micro-virtual machines customized for containerization.

(6) More and more third-party software providers adopt containerization technology for lightweight deployment.

(7) Support for stateful applications is becoming more and more abundant.

(8) There will be more and more mature projects across Kubernetes.

Agile infrastructure

Traditional infrastructure is managed manually by operations staff according to the needs of the software system, whereas under the cloud native architecture the load of cloud native applications is allocated dynamically, without any manual involvement. The actual demand cannot be known in advance; for example, during Taobao's Double 11 event it is impossible to know precisely beforehand how many users will come to shop.

In this case, enterprises must implement agile infrastructure.

The purpose of agile infrastructure is to use code to automate server deployment and updates and to allocate storage and network resources dynamically, so that operations can iterate as quickly as software development and rapidly satisfy the immediate needs of various workloads.
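A minimal sketch of driving such server configuration from code, assuming Ansible as the tool; the host group and package are illustrative assumptions:

# Declaring the desired state of a group of servers in code; running the
# playbook repeatedly converges the hosts to this state without manual steps.
- hosts: web
  become: true
  tasks:
    - name: Install nginx
      ansible.builtin.apt:
        name: nginx
        state: present
    - name: Ensure nginx is running and enabled
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true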

Development under cloud native architecture requires automated testing, construction, and deployment.

In a complete DevOps environment, local mirrors must be set up for commonly used repositories such as Yum, Maven, NuGet, npm, and Docker.

There are many ways to deploy a local repository; for example, you can choose Sonatype Nexus Repository Manager.

Not only is the infrastructure agile; the development and deployment process of cloud native applications is also agile. DevOps is a combination of development and operations: a culture, process, platform, and set of tools that values communication and collaboration between software developers and IT operations staff. By automating the software delivery and architecture change processes, software can be built, tested, and released more quickly, frequently, and reliably.

[Figure: DevOps process]

Under the cloud native architecture, because the responsibilities of Dev and Ops are very clear, the two teams become independent of, yet cooperative with, each other. The application development team is fully responsible for product development, while the operations team serves the agility of the cloud native infrastructure, and the standardization of the infrastructure makes it easier to keep multiple environments consistent.

In other words, under the cloud native architecture DevOps becomes the communication between application developers and infrastructure operators. Each side maintains the life cycle of its own services, improves efficiency through specialization, and communicates through a shared technical language (such as Kubernetes, containerization, and the microservice architecture).

Monitoring under cloud native can use Prometheus and so on. Prometheus is an open source monitoring and alerting tool.

Service mesh application

Spring Cloud, based on the Netflix OSS projects, and Apache Dubbo, open-sourced by Alibaba Group, are both excellent microservice frameworks.

Current implementations of the microservice architecture are often built into the application in the form of libraries that provide functions such as service discovery, circuit breaking, and rate limiting. The library approach not only introduces potential version conflicts, but also means that once the library changes, the entire application must be rebuilt and redeployed even if the application logic has not changed.

In addition, microservices in enterprise organizations and complex systems are often implemented in different programming languages and frameworks; the service governance libraries for such heterogeneous environments often differ in implementation, and common problems lack a unified solution.

In order to solve the above problems, the community began to promote service mesh (Service Mesh) technology.

The term service mesh is often used to describe the network of microservices that make up these applications and the interactions between them.

Istio, Linkerd, etc. are all representative works of service mesh technology.

The goal of the service mesh is to provide an implementation-independent basic protocol stack for service communication.

[Figure: Istio architecture]

Enterprise cloud

A technological revolution centered on IT is sweeping the world. New technologies such as cloud computing, big data, artificial intelligence, the Internet of Things, and blockchain are being applied at an accelerating pace. Among these new technologies, cloud computing, as infrastructure, is the platform that carries this revolution and fully supports the various new technologies and applications.

[Figure: Issues and challenges faced by traditional enterprise architecture]

[Figure: Cloud applications bring new development impetus to enterprises]

[Figure: Enterprise application cloud migration process]

When enterprise applications move to the cloud, they fall into two modes according to how deeply they use cloud products: the cloud hosting mode (moving onto IaaS) and the cloud native mode (moving onto PaaS).

IaaS on the cloud: cloud hosting model

IaaS solves the problem of resource management and resource supply of physical machine resources, and realizes the unified supply of basic resources through the unification and abstraction of computing, storage and network. IaaS does not bridge the gap between development and operations. Developers still need to pay attention to the various basic middleware running on the operating system.

PaaS on the cloud: cloud native model

The goal of PaaS is to provide various basic software required for application operation and provide a development and operation environment for actual business, thereby making the development of business systems easier and more efficient.

Enterprise cloud migration means that enterprises deploy their infrastructure, management, and business on the cloud over the Internet. They can conveniently obtain computing, storage, software, data, and other services from cloud providers through the network, thereby improving the efficiency of resource allocation, reducing the cost of IT construction, promoting the sharing economy, and accelerating the replacement of old growth drivers with new ones.

server:

Servers are specialized computers that store, process, and transmit data and communicate with other devices over network connections. Servers are typically physical devices that can be hosted locally or managed externally through a hosting provider. Servers are responsible for managing network resources, providing access to files and applications, and performing data storage and backup functions.

cloud platform:

The cloud platform is a technical architecture based on the cloud computing model, which provides various computing resources and services through the network. Cloud platforms can include various types of servers, storage devices, network equipment and software, and achieve efficient resource utilization through virtualization technology and automated management. The cloud platform is flexible and scalable. Users can obtain the resources they need on demand and dynamically adjust resources according to business needs. The cloud platform also provides many advanced features, such as backup and recovery, security and monitoring, to help users better manage and protect their data and applications.

The difference between server and cloud platform:

First of all, the cloud platform is built on the cloud computing model, and the server is part of the cloud platform. Cloud platforms can include multiple servers and other devices to provide flexible and scalable resources. The server is the infrastructure of the cloud platform and is responsible for storing and processing data, as well as providing applications and services. The cloud platform can also combine multiple servers through virtualization technology to achieve resource sharing and unified management.

Secondly, the cloud platform provides more advanced and comprehensive functions to help users better manage and utilize resources. Cloud platforms usually provide backup and recovery functions to ensure data security and availability. It also provides monitoring capabilities for real-time monitoring of servers and other devices through alerts and reports. The cloud platform also has advanced security features that enable data encryption and access control to protect user privacy and confidential information. In addition, the cloud platform also provides automated management and self-service capabilities, allowing users to manage and configure resources more easily.

Cloud platforms are typically managed and maintained by third-party service providers. Users can use the cloud platform through subscription services and pay based on actual usage. This eliminates the need for users to purchase and manage their own servers and equipment, thereby reducing costs and risks. In addition, cloud platforms usually have high reliability and stability, and service providers usually provide monitoring and maintenance services to ensure the normal operation and fault recovery of the platform.

Migrating only the infrastructure to the cloud is generally called the "cloud hosting model": enterprises simply move applications originally deployed on servers in their own IDC machine rooms into virtual machines (or containers) on the cloud. The application architecture basically does not change, so the migration is low-cost and low-risk.

Adopting a cloud native architecture means using technologies suited to cloud deployment, such as virtualization, containerization, and microservices. At the same time, the infrastructure must establish resource pools for computing, network, and storage, and adopt technologies such as compute virtualization, Software Defined Storage (SDS), Software Defined Networking (SDN), and Docker containers, providing IaaS, PaaS, SaaS, CaaS, and other cloud services.

With a PaaS cloud there is no need to create a virtual machine or configure the environment; all the user needs to do is deploy the application to the cloud platform and use it. To use a database, there is no need to set up a database server, install an operating system, install the database, or work through complex configuration: just create a database service on the cloud platform, bind it to the application, and use it. This is the change PaaS brings.

IaaS, PaaS, SaaS

We can compare enterprise information services to building a house. IaaS provides everyone with the various bricks needed for construction.

Building a house brick by brick turned out to be too inefficient, so prefabricated parts were invented: walls, floors, columns, and so on are prefabricated in a factory and assembled directly on site. The service that provides such prefabricated parts is PaaS; it is the intermediate layer between the resource provider and the end user.

SaaS does it more simply and directly provides us with a complete house.


SaaS cloud service providers now have 3 options:

  1. Rent someone else's IaaS cloud service, and then build and manage the platform software layer and application software layer yourself.

  2. Rent someone else's PaaS cloud service and deploy and manage the application software layer yourself.

  3. Build and manage the infrastructure layer, platform software layer and application software layer yourself.


Advantages of cloud native architecture

Enterprise application architecture evolution

Enterprise application architecture has gone through stages such as client/server (C/S), browser/server (B/S), service-oriented architecture (SOA), front-end/back-end separation, the microservice architecture, and Serverless.

[Figure: Enterprise application architecture evolution]

Monolithic architecture

A monolith (literally a "giant rock" application) is not necessarily a single-node application: monolithic applications in production are usually deployed on multiple nodes in a cluster. A monolithic application architecture means that all business functions are implemented within one process; receiving user requests, invoking the relevant business logic, and fetching data from the database are all completed inside that one process. An application archive (such as a war or jar package) contains all of the application's functionality and runs in a web container such as Tomcat. We usually call this a monolithic application, and the methodology for building one is called the monolithic application architecture, a relatively traditional architectural style.

Distributed architecture

In order to solve various problems faced by single applications, technicians use different methods such as vertical splitting and horizontal splitting to split a large single application system into several independent small application systems.

Each small application system is developed and maintained by its own team, which can independently choose the system architecture and technology stack, and releases and deployments become freer and more flexible. Applications interact through distributed services, and distributed architectures represented by Enterprise JavaBeans (EJB), WebService, and message queues (MQ) gradually became the technology choice for large, complex application systems.

In technological changes, architecture is further subdivided:

(1) The first subdivision: the front and back ends are separated.

(2) The second subdivision: the backend is split into services.

Distributed architecture is a system architecture composed of a group of computer nodes that communicate through a network and work together to complete common tasks.

SOA

With both horizontal and vertical splitting of distributed application systems, the number of application systems and services grew sharply and the call relationships between services became intricate. Siloed systems, data islands, and application collaboration became problems for enterprise informatization, system integration became an increasingly important issue, and the service bus emerged in response; it can be regarded as the embryonic form of SOA.

SOA is a component model that connects different functional units of an application (called services) through well-defined interfaces and contracts between these services.

SOA addresses communication between enterprise systems from a system perspective, reorganizing the originally scattered, unplanned mesh of connections between systems into a regular, manageable star structure. This step often requires introducing products such as an Enterprise Service Bus (ESB), technical specifications, and service governance specifications. The core problem solved in this step is "ordering", so that services built in different systems can interact in a unified and universal way.

Through service extraction and service bus, the interconnection of application systems is realized and data islands are eliminated.

Service bus is not SOA, but only solves the problem of interconnection through service extraction and exposure. SOA has a more important task - the overall layout of enterprise applications.

SOA can coordinate the overall situation and achieve global collaboration of enterprise application systems.

The overall architecture of SOA is divided into 3 layers:

The lowest layer consists of the individual applications, the systems of record, which provide local business capabilities;

The middle layer is the SOA layer, which realizes collaboration between individual applications and meets the core business processes within the enterprise;

The top layer is the portal layer, which realizes the exposure of collaborative services or new services.

[Figure: SOA]

SOA is a coarse-grained, loosely coupled service architecture in which services communicate through simple and precisely defined interfaces, without involving underlying programming interfaces and communication models.

From a functional perspective, SOA abstracts business logic into "reusable, composable" services and achieves rapid recomposition of the business through service orchestration. Its purpose is to turn fixed, built-in business functions into general-purpose business services so that business logic can be rapidly reused. The core problem solved in this step is "reuse".

SOA splits along business functional units, for example into interaction services, information services, and so on. Each of these is itself a monolithic service, and the monolithic services interact with one another through the enterprise service bus.

SOA only splits vertically, and each service is not further split horizontally, so the split of SOA is not complete.

Large international enterprises represented by IBM have played an important role in promoting the development of SOA. The three concepts of "business servitization, service processization, and process standardization" proposed by it have provided methodological guidance for enterprises to implement SOA transformation.

Microservice architecture: Spring Cloud

The monolithic application architecture is designed for the application itself, not for the enterprise.

SOA is a truly enterprise-oriented design, but SOA has not produced the effects expected by SOA designers and has become a tool for application integration.

The microservice architecture is an architectural pattern that advocates dividing a single application into a set of small services that collaborate with one another to deliver ultimate value to users. Each service runs in its own independent process, and services communicate using a lightweight mechanism (usually a REST API over HTTP). Each service is built around a specific business capability and can be deployed independently to production and production-like environments. Unified, centralized service management mechanisms should be avoided as much as possible; for any specific service, the appropriate language and tools should be chosen according to the business context.

Microservices and SOA seem to be both distributed service-oriented architectures, but there are essential differences between them in terms of service design concepts and system operation architecture:

1. Service granularity

SOA actually emerged to solve a historical problem: in the process of informatization, enterprises end up with all kinds of systems that are isolated from one another and need a mechanism to integrate them, hence the ESB.

SOA provides coarse-grained service capabilities, that is, the encapsulation of large blocks of business logic.

Microservices are more lightweight and provide service encapsulation of individual tasks or small pieces of business logic;

According to the original intention of microservices, services should be split according to business functions until each service has a single function and responsibility and cannot even be split anymore.

In this way, each service can be deployed independently, which facilitates expansion and contraction, and can effectively improve utilization. The finer the split, the smaller the coupling degree of the service, the better the cohesion, and the more suitable it is for agile release and rollout.

2. Decentralization

Under SOA, service registration, discovery, and call routing generally go through the ESB.

ESB seems to be perfectly compatible with existing isolated heterogeneous systems, and can use existing systems to build a new loosely coupled heterogeneous distributed system. But in actual use, it still has many shortcomings:

First, the ESB itself is very complex, which greatly increases the complexity and maintenance cost of the system.

Second, because the ESB requires all services to communicate through a single channel, which is itself a centralized idea, the ESB easily becomes a bottleneck at runtime.

Although microservices also have a "service registry", it is generally used only to register or pull service addresses when the application starts. At call time, requests do not go through the registry: the target address is read from the local client's service address cache and the call is made point-to-point, so runtime efficiency is much higher than routing and forwarding through a central node. This is a "decentralized" runtime architecture that eliminates single-point bottlenecks and scales better.


Service mesh architecture: Istio

Microservice architecture also has some obvious disadvantages:

First, for microservice architecture using RPC protocol, there is a problem of protocol binding when calling between different microservices;

Second, in addition to implementing the business logic, developers also have to deal with a series of concerns inside each microservice module, such as service registration, service discovery, inter-service communication, load balancing, circuit breaking, and request timeouts and retries, which is a heavy burden.

Can business developers only focus on business development and no longer care about inter-service communication and request management functions? The service mesh architecture solves this problem. It is an extension of the microservice concept and aims to free business developers from trivial technologies and focus more on the business itself.

In the service mesh pattern, each service is equipped with a proxy "sidecar" for communication between services.


These proxies are typically deployed alongside the application code, and the application does not need to be aware of them. Together the proxies form a matrix of lightweight network proxies, which is the service mesh.

These proxies are no longer isolated components but form a valuable network in their own right. The service mesh acts as a complete supporting layer on which all services are "built".

[Figure: Service mesh architecture]

A service mesh is a "dedicated infrastructure layer" for handling microservices and microservice communication, typically implemented as a matrix of lightweight network proxies deployed alongside application code

It manages complex service topologies through these proxies and reliably delivers requests between services. To some extent, these proxies take over the network communication layer of the application and are not visible to the application.

Service mesh is a very critical component in the cloud native technology stack.

The service mesh architecture uses the sidecar to decouple service governance from the business and push it down into the infrastructure layer, making applications more lightweight and letting business development focus on the business itself. It addresses the various service governance challenges of the microservice architecture in a systematic, standardized way, improving observability, diagnosability, governance capability, and the speed of iteration.
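As a hedged illustration of governance moving into the infrastructure layer, the sketch below assumes Istio and shows a canary traffic split configured entirely outside the application code; the service name and subsets are illustrative, and the subsets would be defined in a companion DestinationRule:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
    - reviews                  # the service whose traffic is being governed
  http:
    - route:
        - destination:
            host: reviews
            subset: v1
          weight: 90           # 90% of requests stay on v1
        - destination:
            host: reviews
            subset: v2
          weight: 10           # 10% canary traffic goes to v2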

Serverless architecture

Serverless is a complete approach to building and managing microservice-based architectures: it lets users manage application deployment at the level of services, or even of individual functions or ports, rather than at the level of servers, which allows developers to iterate quickly and deliver software faster.

Serverless is an architectural concept. Its core idea is to abstract the infrastructure that provides service resources into various services, and provide them to users in the form of APIs for on-demand calls, truly achieving on-demand scaling and charging based on usage.

Serverless architecture is an extension of the traditional cloud computing platform: it is the evolution of PaaS toward finer-grained BaaS and FaaS, realizing the original vision of cloud computing. The serverless architecture currently recognized in the industry has two parts: the function service platform FaaS, which provides computing resources, and the managed back-end cloud services BaaS.

(1) Function as a Service (FaaS) is an event-driven, function-hosting compute service. With function services, developers only write the business function code and set the run conditions, without configuring or managing servers or other infrastructure. Function code runs in stateless containers, is triggered by events, is transient and ephemeral, and is fully managed by a third party; the infrastructure is completely transparent to application developers. Functions run elastically and with high reliability, and billing is based on the resources actually consumed during execution; if a function is not executed, no fees are incurred (a sketch follows after item (2)).

(2) Backend as a Service (BaaS) covers all the third-party services an application may depend on, such as cloud databases, authentication, object storage, and message queues. Developers can use the APIs and SDKs provided by BaaS providers to integrate all the back-end functionality they need, without building back-end applications or managing infrastructure such as virtual machines or containers to keep their applications running.
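As a sketch of the FaaS side described in item (1), assuming Knative Serving (a common Kubernetes-based building block for fPaaS/FaaS platforms); the service name and image are illustrative assumptions:

# The platform scales this workload with traffic, including down to zero when
# idle, which is what makes "no execution, no fee" billing possible.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-fn                     # illustrative name
spec:
  template:
    spec:
      containers:
        - image: example/hello:1.0   # illustrative image containing the function code
          env:
            - name: TARGET
              value: "world"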

In terms of scalability, Serverless makes full use of the characteristics of cloud computing, so scaling is smooth. At the same time, because Serverless builds on microservices, and micro-functions and microservices on the cloud are paid for on demand, it helps reduce overall operating expenses.

Serverless architecture builds on container technology and the service mesh, with the goal of letting users focus solely on their own business logic.

Starting from the earliest physical servers, we have been continuously abstracting or virtualizing servers to provide lighter-weight, more flexible control, and lower-cost infrastructure. The evolution process of servers becoming more and more lightweight is like the evolution process of human beings.

Insert image description here

The evolution of the server

The development of server technology is meant to provide more agile infrastructure services to the application systems above it. Lighter weight means greater flexibility and shorter startup times. Elasticity is the core capability of the cloud, and under a cloud-native serverless architecture this elasticity becomes extreme: there is no need to assess the business and pre-provision resources; all resources can be pulled up on demand in seconds and released as soon as they are no longer needed, with pay-as-you-go billing.

Insert image description here

Serverless’s extreme elasticity improves resource utilization

The performance of physical machines has been growing at the rate of Moore's Law in the past few decades. This has resulted in a single application being unable to fully utilize the resources of the entire physical machine. Therefore, a technology is needed to solve the problem of resource utilization.

The simple idea is: if one application cannot use up the resources of an entire physical machine, just deploy a few more on it. However, when multiple applications are co-deployed on the same physical machine, various conflicts and resource contention problems arise.

Therefore, developers have come up with various virtualization technologies to share machine resources and improve resource utilization, while ensuring a certain degree of isolation between applications to resolve conflicts and resource competition.

Serverless lets users treat containers as the application deployment carrier with zero barrier to entry. Serverless does not mean there are no servers; rather, the application does not need to care about servers or operating systems and only needs to focus on code, while the platform provider handles the rest, such as operation and maintenance. This removes the traditional fleet of always-on server components, reduces the complexity of development and operations, lowers operating costs, and shortens the delivery cycle of business systems, letting users concentrate on developing business logic with higher value density. Serverless is truly used on demand: it starts running when a request arrives and is billed by running time and resources consumed. Given the uncertainty of business and traffic, this can significantly reduce IT costs for early-stage Internet companies.

Serverless is an architectural idea characterized by second-level startup, extreme elasticity, on-demand billing, effectively unlimited capacity, no concern for deployment location, and fully managed, operations-free back-end services. Built on industry standards such as containers and Kubernetes, it hosts traditional applications through innovative use of extreme-elasticity technology, so users no longer need to worry about capacity assessment, can calmly handle traffic spikes, and greatly improve operation and maintenance efficiency. It also accelerates the move to lightweight middleware and development frameworks, letting users explore and innovate more freely on FaaS and light application frameworks.

Cloud native technology

Docker, Kubernetes, Prometheus, microservices

One of the core technologies of cloud native is containers, which add more advantages to cloud native applications. Using containers, we can move microservices and all required configurations, dependencies, and environment variables to a new server node without having to reconfigure the environment, thus achieving strong portability.

Docker is a popular container technology.

Container technology is also a resource isolation virtualization technology.

Container = cgroups (resource control) + namespaces (access isolation) + rootfs (file system) + engine (container life cycle management)

Virtualization technology can create a specific purpose virtual environment for specific application systems.

Insert image description here

Containers VS Virtual Machines

The container runs natively on Linux and shares the host's kernel with other containers. There is no need to simulate operating system instructions. It is an independent process running on the host.

A virtual machine runs a complete guest operating system, and each virtual machine has its own independent kernel. The hypervisor emulates the host's instructions in software, virtualizes multiple OS instances, and builds an independent program running environment on each of them, so its isolation is stronger than that of containers.

Due to the standardization, portability, scalability and other advantages of containers, more and more enterprise production systems are beginning to use containers as a carrier for application deployment. With the popularity of distributed microservice system architecture, the number of application nodes in enterprise production systems has increased explosively, ranging from hundreds to thousands to tens of thousands. Faced with so many application nodes (containers), their orchestration and daily operation and maintenance management have become important issues.

There are two camps in the container orchestration market: Swarm clusters and Kubernetes clusters.

The bottom layer of Kubernetes is based on container technologies such as Docker and rkt, and provides powerful application management and resource scheduling capabilities.

Insert image description here

Kubernetes conceptual model

Insert image description here

K8s cluster

Kubernetes (from the Greek for helmsman or pilot, abbreviated K8s) is a new distributed architecture solution based on container technology, and a one-stop, complete platform for developing and supporting distributed systems.

Prometheus (named after the Greek Titan) is an open source system monitoring and alerting framework: an open source service monitoring system and time series database written in Go.

As a new generation of cloud-native monitoring system, Prometheus has the following advantages compared with traditional monitoring systems (such as Nagios or Zabbix):

(1) Powerful multi-dimensional data model.

(2) A flexible and powerful query language (PromQL).

(3) Efficient and flexible storage solution: high-performance local time series database.

(4) Easy to manage: Prometheus Server is a separate binary file that can start work directly locally without relying on distributed storage; it also provides containerized deployment images, which can easily launch monitoring services in containers.

(5) Support multiple discovery mechanisms: Support the discovery of monitored target objects through static file configuration and dynamic discovery mechanisms, and automatically complete data collection.

(6) Good visualization: There are a variety of visual graphical interfaces. Based on the API provided by Prometheus, users can also implement their own monitoring visual UI.

(7) Easy to scale: Prometheus can be expanded through functional partitioning (sharding) and federation to form a logical cluster; it also provides client SDKs in multiple languages, so applications can quickly be brought under Prometheus monitoring.

(8) HTTP-based: data is collected with a simple, easy-to-use pull model over HTTP.
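As a small illustration of points (7) and (8), the sketch below instruments a counter with the Prometheus Java simpleclient and exposes it over HTTP for the Prometheus server to pull. It is a minimal sketch, assuming the simpleclient and simpleclient_httpserver artifacts are on the classpath; the metric name and port are illustrative.

```java
import io.prometheus.client.Counter;
import io.prometheus.client.exporter.HTTPServer;

public class MetricsDemo {
    // A counter metric registered with the default collector registry.
    static final Counter REQUESTS = Counter.build()
            .name("demo_requests_total")
            .help("Total number of handled requests.")
            .register();

    public static void main(String[] args) throws Exception {
        // Expose /metrics on port 9100 so the Prometheus server can pull the data.
        HTTPServer server = new HTTPServer(9100);
        while (true) {
            REQUESTS.inc();      // increment on each piece of simulated work
            Thread.sleep(1000);
        }
    }
}
```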

Microservices framework

High-Speed Service Framework (HSF): HSF is the RPC-based microservice framework most widely used inside Alibaba.

The HSF microservice framework is a "decentralized" service framework.

Insert image description here

Dubbo: Dubbo is a distributed service framework open sourced by Alibaba in 2012 (currently a top-level Apache project). It is a high-performance, lightweight open source Java RPC framework.

The biggest feature of Dubbo is that it is structured in a layered manner, which allows each layer to be decoupled (or loosely coupled to the maximum extent).

Insert image description here

Dubbo microservice architecture

Spring Cloud: Spring Cloud is built on Spring Boot. Spring Cloud provides a simple and easy-to-use programming model for the most common distributed system patterns, helping developers build elastic, reliable and coordinated applications.

Spring Cloud is an ordered collection of frameworks. It uses the development convenience of Spring Boot to cleverly simplify building distributed system infrastructure, such as service discovery and registration, configuration center, message bus, load balancing, circuit breaking, data monitoring, and distributed tracing. Spring Cloud does not reinvent the wheel; it integrates and wraps mature modules already available on the market, reducing the development cost of each module.
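A minimal sketch of what this looks like in code, assuming a Spring Boot application with spring-cloud-commons and some discovery implementation (such as Eureka or Nacos) on the classpath; the class name and endpoint are illustrative.

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.client.discovery.EnableDiscoveryClient;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@SpringBootApplication
@EnableDiscoveryClient   // register this service with the configured registry and discover other services
@RestController
public class OrderServiceApplication {

    @GetMapping("/orders/ping")
    public String ping() {
        return "order-service is alive";
    }

    public static void main(String[] args) {
        SpringApplication.run(OrderServiceApplication.class, args);
    }
}
```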

gRPC: gRPC is a technology framework open sourced by Google in 2015, an RPC implementation based on the HTTP/2 communication protocol and Protobuf serialization.

From the perspective of implementation and features, gRPC considers more client-server communication in mobile scenarios.

gRPC uses Protobuf for serialization and deserialization by default. Protobuf is Google's mature, open source mechanism for serializing structured data and is a proven, highly efficient serialization format. gRPC can also work with other data formats, such as JSON.

Insert image description here

gRPC communication model

Service mesh:

In 2016, Buoyant first proposed the concept of service mesh.

The service mesh drives microservice architecture into a new era. It is a non-intrusive, dedicated infrastructure layer that handles service-to-service communication and is responsible for reliably delivering requests through a complex service topology.

In practice, a service mesh is typically implemented as an array of lightweight network proxies deployed alongside the application code, without the application being aware of them.

Service mesh helps applications establish stable communication mechanisms in massive services, complex architectures and networks.

Open source projects such as Linkerd, Envoy, Istio, and SOFAMesh are all service mesh frameworks.

SOFAMesh is a service mesh product open sourced by Ant Financial in 2018. SOFAMesh follows the mainstream of the community in its product route and chooses Istio, which is currently the most influential and promising service mesh.

The goal of SOFAMesh is to create a more pragmatic implementation version of Istio. On the basis of inheriting the powerful functions and rich features of Istio, it makes necessary improvements to meet the performance requirements under large-scale deployment and to cope with the actual situation in implementation practice.

IaaS: Infrastructure as a Service

Insert image description here

Infrastructure layer (IaaS) technical architecture

The cloud native infrastructure layer IaaS includes computing resources, storage resources and network resources, distributed system basic services, and basic cloud services. It is the foundation of the entire cloud computing environment.

Computing resources, storage resources, and network resources are physical hardware devices.

Distributed operating systems and basic cloud services are mainly used for the integrated management of hardware devices. They break through the traditional single-machine architecture and use virtualization, distributed storage, SDN, and other techniques to build dynamically scalable pools of computing, storage, and network resources that are jointly built and shared. On top of these, automated cloud services such as resource orchestration and automated deployment are built to support the stable operation of upper-layer applications and platforms.

DaaS: Data as a Service

DaaS data services mainly include big data platforms, data resource pools and data integration platforms.

Insert image description here

Insert image description here

Big data platform

Insert image description here

Database services support diverse business scenarios

The cloud native architecture treats the database as a resource service and decouples it from the application system.

PaaS: Platform as a Service

Enterprise-level distributed application service is a PaaS platform centered on applications and microservices, providing enterprises with a highly available, distributed application support platform. It offers not only resource and service management but also a distributed service framework, service governance, unified configuration management, distributed tracing, high availability, and data-driven operations. With distributed application services, you can easily build a microservice architecture and a large-scale distributed system to publish and manage applications, helping IT systems evolve to meet growing business needs.

The distributed application service platform is a PaaS platform that combines application hosting and microservice management, providing full-stack solutions such as application development, deployment, monitoring, and operation and maintenance.

Insert image description here

Distributed application service function module

High availability

Application high availability (HA) is a comprehensive concern. The goal of an IT system is to ensure continuous availability of the business; any problem that may prevent the business from providing normal service as users expect falls within the scope of high availability.

There are several metrics for system availability, the most central being the system's mean time to failure (MTTF), which indicates how long the system can run normally, on average, before a failure occurs. The higher a system's availability, the longer its mean time to failure. Maintainability is measured by the mean time to repair (MTTR), the average time it takes to repair the system and restore normal operation after a failure; the more maintainable the system, the shorter the MTTR. Availability is generally defined as MTTF / (MTTF + MTTR) × 100%. For example, with MTTF = 999 hours and MTTR = 1 hour, availability is 999 / 1000 = 99.9%.

Highly available design

1. Application design
1. Eliminate single points

A single-point service is a relatively simple service model: all service functions are implemented in one service program, and every client requesting the service connects to it and communicates with it directly.

Distribution is the foundation of cloud-native architecture design. For an application system to be truly distributed, every layer must be distributed, from traffic access to service invocation, data storage, caching, message queues, and object storage, with no single point of failure in any link. Only such a system has good elasticity.

Insert image description here

Best practice reference for distributed architecture capabilities

Distribution also means decentralization.

2. Stateless

A corollary of this general principle is that services should not hold state. The key obstacle that prevents a monolithic architecture from evolving into a distributed one is how state is handled: if state is stored locally, whether in local memory or on local disk, it becomes a bottleneck for horizontal scaling.

Statelessness does not mean that the application system is completely stateless, but that scalable partial services are stateless by externalizing the state.

Save the state in stateful middleware, such as cache, database, object storage, big data platform, message queue, etc. This is what we often call state externalization.

3. Idempotent design

Idempotence generally refers to back-end services. "Service idempotence" means that if a service is called multiple times with the same request parameters, it must return the same result, and the repeated calls must have no additional side effects on the back-end system.

for example:

Alipay debit service.

After the user purchases goods and pays (the order serial number is included in the parameters), the deduction on the Alipay side succeeds, but the network fails while the result is being returned (the money has actually been deducted at this point). If the front-end application's request times out, it retries the Alipay deduction service. The deduction service internally checks whether the current order serial number has already been deducted: if so, it simply returns "deduction successful" to the front-end application; if not, it performs the deduction normally and then returns "deduction successful".

Therefore, even if the front-end application calls the Alipay deduction service multiple times with the same order serial number, Alipay should still return "deduction successful" rather than an exception such as "duplicate deduction". This is the idempotent design of services.

Other services, such as deletion, should also follow idempotent design.
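A minimal sketch of the deduction example above, keyed by the order serial number. The in-memory map is a stand-in for what would normally be a database table or Redis key with a uniqueness constraint on the serial number.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical deduction service illustrating idempotency keyed by order serial number. */
public class DeductionService {
    // In production this would be a database table or cache entry with a unique constraint on the serial number.
    private final Map<String, String> processedOrders = new ConcurrentHashMap<>();

    public String deduct(String orderSerialNo, long amountInCents) {
        // putIfAbsent acts as a "has this order already been deducted?" check-and-record in one step.
        String previous = processedOrders.putIfAbsent(orderSerialNo, "DEDUCT_SUCCESS");
        if (previous != null) {
            // Repeated call with the same serial number: return the same result, perform no second deduction.
            return previous;
        }
        doRealDeduction(orderSerialNo, amountInCents);
        return "DEDUCT_SUCCESS";
    }

    private void doRealDeduction(String orderSerialNo, long amountInCents) {
        // The actual account debit would happen here, ideally in the same transaction as the idempotency record.
    }
}
```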

4. Elastic expansion and contraction

Elastic scaling includes elastic expansion and elastic contraction. During business peaks, when system load is heavy, new application nodes or containers are added horizontally to share the load on the existing nodes, allowing the system to ride out the impact of high concurrent front-end traffic. During off-peak periods, some application nodes are released to improve effective resource utilization.

Automatic elastic scaling rules generally include basic server performance indicators such as CPU and load, and can also include application-related indicators such as response time and number of threads. During automatic elastic shrinkage, the system will ensure the minimum number of service nodes (generally no less than two nodes).

Scaling can also be triggered manually.

5. Fault-tolerant design

Fault tolerance refers to the ability of software to detect and recover from errors in the software or hardware running the application. It can usually be measured from the reliability, availability, testability, etc. of the system.

6. Synchronous to asynchronous

In terms of implementation, synchronous requests are usually converted into asynchronous subscription processing through message queues to reduce coupling between systems and prevent core applications from being overwhelmed by non-core applications.

7. Cache design

The main function of caching is to reduce the load on applications and databases, and improve system performance and client access speed. In terms of architecture and business design, you can consider caching the results of queries that have large access volumes, are not modified frequently (such as dictionary tables and system parameters), or have a greater impact on database performance, to improve the overall performance of the system.
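A minimal cache-aside sketch for the dictionary-table case described above. The in-process map stands in for a real cache such as Redis, and the database calls are placeholders.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Cache-aside sketch for rarely changing dictionary data; the cache store here is illustrative only. */
public class DictionaryCache {
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();

    public String getDictValue(String key) {
        // 1. Try the cache first to keep read load off the database.
        String cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        // 2. On a miss, load from the database and populate the cache for later reads.
        String fromDb = loadFromDatabase(key);
        if (fromDb != null) {
            cache.put(key, fromDb);
        }
        return fromDb;
    }

    public void updateDictValue(String key, String value) {
        saveToDatabase(key, value);
        cache.remove(key); // invalidate so the next read reloads fresh data
    }

    private String loadFromDatabase(String key) { return "value-of-" + key; } // stand-in for a real query
    private void saveToDatabase(String key, String value) { /* stand-in for a real update */ }
}
```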

8. Static and dynamic separation

Static and dynamic separation is an architectural design in which static pages and dynamic pages are served by different systems, and static resources (html, js, css, images, etc.) are deployed separately from back-end services. Static resources are placed on a CDN, Nginx, or similar facilities, with short access paths and fast access (a few milliseconds); dynamic pages have longer access paths and are relatively slow (database access, network transfer, business logic computation), taking tens or even hundreds of milliseconds, and therefore demand more of the architecture's scalability.

9. Flow control downgrade

Flow control and degradation act as an insurance policy for back-end application systems, giving them a certain capacity to withstand pressure. They are widely used in scenarios such as flash sales, message peak shaving, cluster flow control, and real-time circuit breaking, safeguarding business stability from multiple dimensions.

Flow control, i.e. traffic control, shapes randomly arriving traffic into an appropriate form (traffic shaping) to prevent applications from being overwhelmed by instantaneous traffic peaks.

Circuit breaking and degradation restrict calls to a resource in the call chain when it is in an unstable state (for example, call timeouts or a rising error ratio), letting requests fail fast to avoid affecting other resources and causing cascading failures.
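A minimal token-bucket sketch of the flow-control idea: traffic is admitted at a configured rate and excess requests fail fast instead of overwhelming the backend. This is a simplified illustration, not a production limiter such as Sentinel or Guava's RateLimiter.

```java
/** Minimal token-bucket flow control sketch: shapes bursty traffic to a configured rate. */
public class TokenBucketLimiter {
    private final long capacity;          // maximum burst size
    private final double refillPerMilli;  // tokens added per millisecond
    private double tokens;
    private long lastRefillTime = System.currentTimeMillis();

    public TokenBucketLimiter(long capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.refillPerMilli = tokensPerSecond / 1000.0;
        this.tokens = capacity;
    }

    /** Returns true if the request is admitted, false if it should fail fast (or be queued/degraded). */
    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        tokens = Math.min(capacity, tokens + (now - lastRefillTime) * refillPerMilli);
        lastRefillTime = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```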

10. Application health check

An application health check ensures that the application on the current node can serve external requests normally. Once a health check fails, the controlling component (such as the load balancer, the service registry, or the Kubernetes control plane) must promptly remove the failed node so that further requests are not routed to it and do not cause business failures.
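A minimal sketch of such a health endpoint using only the JDK's built-in HTTP server; a load balancer or probe would poll it and remove the node on repeated failures. The port, path, and dependency check are illustrative.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

/** Minimal /health endpoint sketch; load balancers or probes poll it and remove failing nodes. */
public class HealthCheckServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/health", exchange -> {
            boolean healthy = checkDownstreamDependencies(); // e.g. database, cache, message queue connectivity
            byte[] body = (healthy ? "UP" : "DOWN").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(healthy ? 200 : 503, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }

    private static boolean checkDownstreamDependencies() {
        return true; // placeholder: a real check would ping critical dependencies
    }
}
```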

11. Elegant online and offline

Graceful online and offline operations are an important guarantee for running the business 24/7 without interruption. To make application version releases invisible to online business, a batch release strategy is generally used when releasing applications.

Suppose there are 10 nodes in an application cluster. We divide it into 5 batches and only release 2 of the nodes in each batch to ensure that the online business is basically not affected by the release in batches.

12. Design to fail fast

The principle of designing for failure is that every external call has fault-tolerant handling and that failure outcomes are expected and designed for. When an anomaly occurs, the system should fail fast and then recover quickly, ensuring the business stays online rather than hanging in an indeterminate state.

2. Data design
1. Data distribution

High availability of databases is generally divided into active and standby databases and distributed databases.

Through distributed database middleware, techniques such as sharding across databases and tables, smooth expansion, and read-write separation enable horizontal scaling of the underlying physical databases, resolving the throughput and capacity bottlenecks that appear as the application grows.

Distributed databases allow database performance to grow quasi-linearly by adding physical resources.
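A minimal sketch of the kind of routing rule such middleware applies, splitting records across physical databases and tables by user ID. The naming scheme and shard counts are illustrative assumptions.

```java
/** Sketch of a sharding rule: route a record to a physical database and table by user ID. */
public class ShardingRouter {
    private final int dbCount;
    private final int tablesPerDb;

    public ShardingRouter(int dbCount, int tablesPerDb) {
        this.dbCount = dbCount;
        this.tablesPerDb = tablesPerDb;
    }

    /** Returns the target "database.table" for the given user, using a simple modulo split. */
    public String route(long userId) {
        int slot = (int) (userId % (dbCount * tablesPerDb));
        int dbIndex = slot / tablesPerDb;
        int tableIndex = slot % tablesPerDb;
        return "order_db_" + dbIndex + ".order_table_" + tableIndex;
    }
}
```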

2. Heterogeneous data

Technical reserves and implementation costs should be comprehensively considered based on different scenarios to select the most appropriate database technology.

Data integration and synchronization between heterogeneous data sources are essential elements for designing high-availability application systems. Appropriate synchronization technologies and solutions should be selected based on data synchronization scenarios, such as offline synchronization/online synchronization, full synchronization/incremental synchronization, real-time synchronization/scheduled synchronization, etc.

Insert image description here

Heterogeneous data scenarios
3. Data disaster recovery

Data disaster recovery ensures that when the primary data center is destroyed by disasters (earthquake, strong typhoon, fire, flood, etc.) or by major human error, key data can be recovered within a defined period, reducing business losses to an acceptable level.

According to different modes of data disaster recovery and data multi-activity, data deployment forms are divided into the following five types:

(1) A-S mode: Active-Standby mode. Only the primary center provides normal data services, and the backup center only keeps a cold backup of the data. When a disaster occurs, service is switched from the primary center to the backup center.

(2) A-Q mode: Read-write separation mode (Active-Query). For read scenarios that do not require high real-time data, the data distributed to the backup center provides read-only services to share the pressure on the main center database.

(3) A-A’ mode: Asymmetric operation mode. The active and standby centers provide business reading and writing services, but the services they undertake are different. In order to avoid data cross-coverage conflicts, the business data on the active and standby centers must be completely separated. The backup center only undertakes some non-core services, and data is synchronized in both directions between the primary and secondary servers.

(4) A-A mode: symmetric operation mode. The two centers provide exactly the same read and write services. The front-end traffic entry point decides which data center a specific request is routed to; data belonging to different users does not overlap, so there are no cross-overwrite conflicts, and data is synchronized bidirectionally between the two centers.

(5) C-U mode: center-unit deployment mode (Centre-Unit). This is generally used by very large multi-center applications. For example, Alibaba's internal e-commerce systems use unitized deployment, divided into central nodes and unit nodes. Traffic is routed to the nearest unit node according to the user's registered location; most of the business loop is closed within the unit node, while a few centralized services (such as inventory-related services) must call the central node.

The corresponding multi-center computer room deployments from simple to complex are: single center, dual centers in the same city, three centers in two places, five centers in three places, seven centers in five places, and multiple centers in multiple places.

3. Compatibility design
1. Data change constraints

As the business changes, the database table design also needs to change. We may need to add some field content. If some of the previous data table fields are no longer used, can they be deleted? Any design is based on currently known business scenarios and business models, and it is difficult for us to predict future changes. Because of this, we hope to leave a certain degree of flexibility in the architectural design so that when the business changes in the future, the system can calmly cope with these changes without having to reinvent the wheel.

In terms of data changes, we also hope to achieve backward compatibility through good database design. You can refer to the following principles:

(1) 1+N master-slave table design: Put relatively stable basic data in the master table, and put easily changing data related to specific business in a separate slave table (sub-table).

(2) Redundant field design: A certain number of redundant fields are reserved in the database table for future business expansion. However, this model leads to unclear design and high later operation and maintenance costs.

(3) Column-to-row design: A separate field extension table (setting table) is used to store business extension information. The extension table has three core columns, namely the primary key of the main table, the extension field name, and the extension field content. Each extended field of the main table is saved with a row of records in the extended table (that is, the columns of the main table are converted into rows of the extended table). Through the extended table design of column-to-row conversion, the columns of the main table can theoretically be scalable arbitrarily.

(4) Dictionary table design: A certain column in the data table may have different value meanings and value ranges in different business scenarios. In this case, business rules can be defined through a separate dictionary table.

2. Database compatible

The same application code may need to work with different databases. With the databaseIdProvider mechanism officially provided by MyBatis, a single set of MyBatis mapper files can run compatibly against multiple relational databases. The idea is that SQL is written slightly differently for each database, and the databaseId attribute is added to the corresponding statements in the MyBatis mapper files.

3. Interface change constraints

In a distributed application architecture, the mutual calls between services are complicated. By default, when a server-side application is iteratively upgraded, the caller interface should be backward compatible, that is, the caller does not need to be aware of the server-side application upgrade. If the business changes are indeed large and the original interface cannot meet the requirements, it is recommended to solve the problem by adding a new interface (or if the service supports multiple versions, it can also be distinguished by different service version numbers). The new and old interfaces run in parallel for a period of time, giving the caller a certain amount of time to smoothly upgrade and transition. When all services are eventually migrated to the new interface, the old interface will be offline.

4. Capacity design
1. Capacity estimation

Capacity estimation refers to the maximum business load a system can bear before it is overwhelmed, an important indicator for technical staff to understand system performance. It generally combines full-link stress testing, linear analysis, analogy with similar systems, and experience-based judgment to assess the business capacity the current application system can carry. Business capacity refers to performance indicators such as the peak transactions per second (TPS), throughput, and response time (RT) that the system can sustain for specific business scenarios.

2. Capacity planning

(1) When should service nodes be added? When should service nodes be reduced (for example, what level of traffic does the server receive)? For example, "Double 11", big sales, flash sales.

(2) In order to meet business needs such as "Double 11", promotions, flash sales, channel expansion and traffic, what order of magnitude of services need to be expanded to ensure the availability and stability of the system and save costs?

Capacity planning builds on capacity estimation and, according to the business development plan, states the business capacity the system must bear in the future to keep pace with the business, and what investment and transformation the system needs in order to reach that capacity.

High availability solution

1. Full-link stress testing solution

The most effective way to improve a system's availability is to verify it through testing.

The best verification method is to allow events to occur in advance, that is, to allow real traffic to access the production environment, to achieve a full range of real business scenario simulations, and to ensure that the performance, capacity, and stability of each link are foolproof. This is the background for full-link stress testing, and it is also an all-round upgrade of performance testing to give it “predictive capabilities.”

Insert image description here

Inputs and outputs of testing different solutions

Full-link stress testing is a system stress testing method that covers every service and every link. By technically simulating the real business scenarios of massive numbers of users, it leaves performance problems nowhere to hide.

Implementing a full-link stress test includes stress scenario analysis (performance goals, sequence diagrams, state diagrams, system architecture, data architecture, technical architecture), the execution steps of the full-link stress test, key technical points, environment transformation (shadow tables, sequence isolation, stub interception, mocking, log filtering, propagation of the stress-test marker), base data import, preparation of (parameterized) stress-test data, tooling support, and so on.

Insert image description here

Full-link stress testing environment transformation
2. Flow control downgrade plan

Application flow control downgrade is widely used in flash sales, big promotions, cluster flow control, real-time circuit breakers and other scenarios to ensure that the system operates normally within a predictable range and ensure the stability and reliability of the business from multiple dimensions.

Insert image description here

Schematic diagram of flow control downgrade scheme
3. Failure drill plan

Insert image description here

Fault layered portrait

The purpose of fault drills is to make drills routine, classify faults, and make drills intelligent: use regular drills to drive stability improvements instead of cramming before a big promotion; enrich fault scenarios and define minimal fault scenarios and handling methods; run intelligent drills based on architecture and business analysis, recommending drill scenarios from the application architecture; and accumulate industry fault-drill solutions.

4. Fault isolation plan

Fault isolation means that when certain modules or components in the system fail abnormally, the faults are isolated in some way so that they do not propagate to other systems; even if the isolated application or system has problems, other applications are unaffected.

The basic principle of fault isolation is to cut off the fault source in time when a fault occurs. The isolation range from high to low is: data center isolation, deployment isolation, network isolation, service isolation, and data isolation.

5. Elastic scaling solution

The goal of elastic scaling is to achieve linear expansion of service capacity, that is, service capacity expands linearly with the number of instances deployed by the service provider. The premise of elastic scaling is agile infrastructure and resource pool sharing. Agile infrastructure ensures that the required virtual computing resources can be quickly created when applications need to scale. Resource pool sharing ensures that resources released by one application can be used by other applications.

(1) Flexible expansion

(2) Elastic shrinkage

(3) Elastic self-healing: Auto scaling provides a health check function (it periodically requests a specified service address and verifies the expected response), automatically monitors the health of the virtual machine instances in the scaling group, and prevents the number of healthy instances from falling below the preset minimum. When an instance is detected as unhealthy, auto scaling automatically releases the unhealthy node, creates a new service node, and attaches the new node to the load balancing instance.

6. Apply emergency plans

Application contingency plans are a pre-designed means of protection to prevent system anomalies from causing uncontrollable losses to the business when various anticipated and unanticipated abnormalities occur in the system. They are the last line of defense for system operation and maintenance. The emergency plan is an all-round and comprehensive fault solution, including both technical emergency plans and management process emergency plans.

Each emergency plan includes a clear effect description, triggering conditions, execution steps, involved systems, scope of impact, verification methods, etc., as well as relevant people of the plan, including decision-makers, executors, etc.

Tools can be used to improve efficiency, accelerate fault recovery, improve monitoring and alarming, and locate faults accurately and quickly.

Data consistency solutions

1. Strong consistency solution

In a distributed environment with strong consistency, a transaction request performs integrity and consistency operations on data from multiple data sources to meet the characteristics of the transaction. Either all succeed or all fail, ensuring atomicity and visibility. Strong consistency ensures that dirty reads and writes will not occur in distributed concurrency data by locking resources, but at the expense of performance.

Generally, strongly consistent distributed transactions perform roughly an order of magnitude worse than local transactions on a single machine. So before using them in a real scenario, carefully evaluate whether the business truly requires strong consistency, whether trade-offs and compromises can be made in the business, or whether a better-performing eventual consistency solution can be used instead.

The XA protocol is the interface between the global transaction manager and the resource manager.

The reason why XA is needed is that theoretically speaking, two machines cannot achieve a consistent state in a distributed system, so a single point for coordination is introduced. Transactions managed and coordinated by the global transaction manager can span multiple resources and processes, responsible for the commit and rollback of individual local resources. The global transaction manager generally uses the XA Two-Phase Commit (2PC) protocol to interact with the database.

It is actually difficult for the current mainstream phased submission protocol to achieve 100% data consistency. In the end, "asynchronous verification + manual intervention" must be used to ensure it.

2. Weak consistency solution

There are only two ways to classify data consistency in a strict sense, strong consistency and weak consistency. Strong consistency is also called linear consistency. In addition, all other consistency are special cases of weak consistency.

The so-called strong consistency means that replication is synchronous; weak consistency means that replication is asynchronous.

Eventual consistency is a special case of weak consistency, which ensures that users can eventually read updates to system-specific data caused by an operation.

Weak consistency mainly concerns data reads: to improve the system's data throughput, a certain degree of "dirty reads" is allowed. A process updates a replica's data, but the system cannot guarantee that subsequent processes will read the latest value. A typical scenario is read-write separation: for a relational database with asynchronous primary-standby replication, reading from the standby (or read-only) instance may not return data already updated on the primary, so it is weakly consistent.

3. Eventually consistency solution

Since the implementation cost of strong consistency technology is high and the running performance is low, it is difficult to meet the high concurrency requirements in real business scenarios. Therefore, in actual production environments, final consistency solutions are usually used.

Eventual consistency does not pursue the requirement that the system can meet the requirements of complete and consistent data at any time. The system itself has certain "self-healing" capabilities. After a period of time that is commercially acceptable, the system can achieve the goal of complete and consistent data.

There are many solutions to eventual consistency, such as distributed subscription processing through message queues, data replication, data subscriptions, transaction messages, try-confirm-cancel (TCC) transaction compensation and other different solutions.

1. Message queue solution

The core point of the solution is to establish two message topics, one to handle normal business submissions, and the other to handle exception correction messages.

The request service sends normal business execution messages to the business submission topic, and different business modules subscribe to consume the topic and execute normal business logic.

If all business executions are normal, the data will naturally be complete and consistent;

If there are any exceptions during business execution, an exception correction message will be sent to the business correction topic and other business execution modules will be notified to perform correction and rollback to achieve the ultimate consistency of the data.

This solution cannot solve problems such as dirty reads and dirty writes, and certain business trade-offs need to be made when using it.
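A minimal sketch of the two-topic flow described above; the MessageQueue interface, topic names, and business methods are hypothetical stand-ins for a real message queue client and real business modules.

```java
import java.util.function.Consumer;

/** Hypothetical message queue client interface used only for illustration. */
interface MessageQueue {
    void send(String topic, String payload);
    void subscribe(String topic, Consumer<String> handler);
}

/** One business module participating in the two-topic eventual-consistency flow. */
class OrderBusinessModule {
    private static final String SUBMIT_TOPIC = "business-submit";
    private static final String CORRECTION_TOPIC = "business-correction";
    private final MessageQueue mq;

    OrderBusinessModule(MessageQueue mq) {
        this.mq = mq;
        // Each business module consumes the submit topic and executes its own local logic.
        mq.subscribe(SUBMIT_TOPIC, this::handleSubmit);
        // All modules also listen for corrections so they can roll back their local changes.
        mq.subscribe(CORRECTION_TOPIC, this::handleCorrection);
    }

    private void handleSubmit(String orderId) {
        try {
            executeLocalBusiness(orderId);
        } catch (Exception e) {
            // On failure, publish a correction message so the other modules can compensate.
            mq.send(CORRECTION_TOPIC, orderId);
        }
    }

    private void handleCorrection(String orderId) { rollbackLocalBusiness(orderId); }
    private void executeLocalBusiness(String orderId) { /* local transaction */ }
    private void rollbackLocalBusiness(String orderId) { /* compensating action */ }
}
```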

Insert image description here

Message queue-based eventual consistency solution
2. Transaction message plan

The core of this solution is that the message queue product must support semi-transactional (half) messages, as RocketMQ does. Semi-transactional messaging provides a distributed transaction capability similar to X/Open XA. This solution does not cause dirty reads or dirty writes and is relatively simple to implement, but it requires the message queue product to support semi-transactional messages, and once a message has been delivered, the design assumes business execution will eventually succeed (via the retry mechanism). If business execution ultimately fails due to some anomaly and the system cannot heal itself, the only option is to alert and wait for manual intervention.

Insert image description here

Eventual consistency scheme based on transaction messages
3. Data subscription plan

The core of this solution is that a data subscription product (such as Alibaba's Data Transmission Service (DTS) or Data Replication Center (DRC)) receives the database's change log (such as MySQL's binlog or Oracle's archive log) and converts it into a message stream for consumers to subscribe to. The business execution module consumes the data change messages and runs its synchronization logic to reach eventual data consistency. Because data subscription is asynchronous, there is some message delay, which depends on the volume of data changes and the performance of the subscription processing.

Insert image description here

Eventually consistent solution based on data subscription
4. TCC transaction compensation

TCC is a two-phase programming model for services; its three phases, try, confirm, and cancel, are all implemented in business code.
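A minimal sketch of what a TCC participant looks like from the business side; the interface and method names are hypothetical, independent of any particular TCC framework.

```java
/** Hypothetical TCC participant interface: each phase is written as business code. */
public interface TccDeductionService {

    /** Try: reserve resources and check constraints, e.g. freeze the deduction amount on the account. */
    boolean tryDeduct(String transactionId, String accountId, long amountInCents);

    /** Confirm: make the reserved change effective; must be idempotent because it may be retried. */
    boolean confirmDeduct(String transactionId);

    /** Cancel: release the reservation when any participant's try phase fails; also idempotent. */
    boolean cancelDeduct(String transactionId);
}
```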

Insert image description here

An eventual consistency scheme based on TCC transaction compensation

In addition, there are Saga transaction mode, Seata global transaction middleware and other modes to implement eventual consistency solutions.

Insert image description here

Final consistency solution based on Saga transaction model

Insert image description here

Seata-based eventual consistency solution

Disaster recovery and multi-active solution

A disaster recovery system establishes two or more systems with identical functions in geographically distant locations; they monitor each other's health and can switch roles. When one of them stops working because of an incident (fire, flood, earthquake, human sabotage, etc.), the whole application system can be switched to the other location so that it continues to operate normally. A disaster recovery system needs fairly complete data protection and failover capabilities so that, when the production center cannot work normally, data integrity and business continuity are preserved, the disaster recovery center takes over in the shortest possible time, and losses are minimized.

There are data disaster recovery solutions, active-active solutions in the same city, three-center solutions in two places, active-active solutions in remote locations, unitized solutions, etc.

Data disaster recovery refers to the establishment of an off-site data system, which is a real-time copy (or offline copy, depending on the acceptable level of data loss for the business) of local key application data. When a disaster occurs to the local data and the entire application system, the system will at least have a copy of the key business data available off-site.

The same-city active-active solution deploys services simultaneously in two data centers in the same city; the underlying layer shares one set of storage, with primary and standby storage placed in different data centers. It is generally an architecture of active-active applications with active-standby data; truly active-active data is rare because it is too difficult to implement and 100% data consistency across two centers is hard to guarantee.

"Three centers in two locations" refers to the commercial disaster recovery solution of "dual centers in the same city" plus "remote disaster recovery". It is a compromise that balances business availability and data security with a relatively good return on investment, so the core application systems of many financial institutions and large enterprises are built on this solution.

The two-location, three-center solution adds, on top of same-city active-active, a remote disaster recovery center for data backup, ensuring that important business data is not lost (or only a small amount is lost) when a major natural disaster (earthquake, tsunami, fire, etc.) hits both same-city centers.

Remote active-active (or multi-active) is a necessary condition to truly achieve high business availability. When a major disaster occurs in any city, the remote computer room can still ensure the continuity of core business.

Different from the local active-active solution, the distance between the two computer rooms in remote active-active is much farther, the network delay has exceeded the allowable range of the business, and the single-center operation mode of data active-standby architecture is no longer feasible in terms of performance.

In order to avoid conflicts caused by bidirectional replication of the same data between two computer rooms, the remote active-active (or multi-active) architecture requires that the data in different computer rooms be based on specific vertical splitting rules (such as taking modulo based on user ID). Completely split, no overlapping data exists.

The unitized solution is a further extension of the remote active-active solution. When the number of remote data centers exceeds two, pairwise data replication between them becomes very difficult, and the complexity rises dramatically as the number of data centers grows.

The unitized architecture consists of several unit nodes and a central node.

The unitized solution has a central node, which stores central applications and global data that cannot be split by unit, such as seller data and inventory data in an e-commerce system; all units must see the same global data. To allow the central node to take over the business of any unit, it also keeps a copy of all unit data, i.e. every unit replicates its data to the central node. When a unit fails, its traffic is taken over by the central unit, which accesses the unit data stored at the central node.


Origin blog.csdn.net/yinweimumu/article/details/135005851