Summary of software architecture business and technical complexity analysis

Table of contents

1. Overview and analysis

2. Business complexity analysis

(1) Domain modeling

(2) Domain layering

(3) Service granularity

(4) Process orchestration

3. Technical complexity analysis

(1) High availability

Underlying logic

CAP principle

BASE theory

Availability and reliability

Resilience Engineering

Chaos engineering

Observability

(2) Performance

(3) Event-driven architecture

Event

Service definition pattern

EventBridge

(4) Cloud native

What counts as cloud native

Immutable infrastructure

Mecha runtime

Service mesh

Distributed application runtime

Apache Camel

Scalability and elasticity

Serverless architecture


1. Overview and analysis

Complexity analysis of software architecture usually covers two aspects: business complexity and technical complexity. These two aspects interact with each other and jointly determine the complexity of the final software system.

Business complexity:

  1. Business process complexity: A software system's business process may involve multiple steps, actors, and conditions. Complex business processes may require more logic and interactions, increasing system complexity.

  2. Domain knowledge complexity: If the software system involves complex domain knowledge, business rules, and workflows, the development team needs to have an in-depth understanding of these domain concepts, which may increase the difficulty of development and communication.

  3. Data model complexity: If the business needs to handle large amounts of data, different types of data relationships, and complex query requirements, the design and management of the data model may become more complex.

  4. Frequency of business changes: If business requirements change frequently, software systems need to be adaptable and flexible, which may lead to complexity in system design.

Technical complexity:

  1. Architectural design complexity: Choosing the appropriate architectural pattern, component distribution, and communication methods to meet business needs can be a complex task.

  2. Distributed system complexity: When decomposing a software system into multiple services, microservices, or modules, distributed system issues such as communication, data consistency, and load balancing need to be considered.

  3. Technology stack selection: Different technology stacks have different advantages and limitations. Choosing the right technology stack may require consideration of multiple factors such as performance, scalability, security, and more.

  4. Integration and interaction complexity: If the system needs to be integrated with other systems, unification of protocols, data formats, etc. may introduce additional complexity.

  5. Performance and scalability: High performance and scalability requirements may require consideration of issues such as caching, load balancing, database optimization, etc., which increases the complexity of the system.

  6. Security and Privacy: The security requirements of the system may lead to the introduction of encryption, authentication, authorization, etc., which will increase the technical complexity of the system.

  7. Error handling and fault tolerance: Handling errors, faults, and fault tolerance mechanisms require additional code and design, increasing the complexity of the system.

Taken together, the interaction between business complexity and technical complexity will determine the final software system complexity. During the design and development process, various needs, challenges, and limitations need to be weighed to find the most appropriate balance point to ensure that the system can meet business needs and have good maintainability and scalability.

2. Business complexity analysis

(1) Domain modeling

The rise of microservice architecture has prompted reflection on the shortcomings of monolithic architecture. In the process of domain modeling and splitting services, domain-driven design (DDD), introduced by Eric Evans in 2003, is therefore widely adopted as the guiding principle.

DDD highlights a series of important concepts to assist with domain modeling and service design in microservices architecture:

  1. Ubiquitous Language: Emphasis on creating a consistent business language across the entire team to ensure accurate communication between developers and business experts.

  2. Model-Driven Design: advocates converting business logic into actual code models, which helps directly map business requirements and code implementation.

  3. Context Map: Defines the relationship between different bounded contexts to coordinate the interaction between different microservices.

  4. Bounded Context: Dividing the system into a series of clear business contexts, each with its own model and language, helps isolate complexity.

  5. Identification of Duplicate Concepts and False Cognate: Focus on identifying and eliminating concepts that are similar but have different meanings in different contexts to ensure consistency.

  6. Knowledge Crunching: the whole team jointly digests the business and distills that knowledge into consistent models and code.

By using programming concepts such as aggregate roots, entities, value objects, domain services, application services, and resource libraries, domain models can be designed and implemented at the code level to reflect business needs. This approach helps build a clear, consistent, and maintainable domain model that adapts to the microservices architecture. The entire process is similar to building a Tower of Babel, reducing cognitive load by breaking knowledge into smaller parts while maintaining consistency.
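To make these tactical building blocks concrete, here is a minimal Java sketch, purely illustrative and not taken from any real system, of an aggregate root (Order) that owns an internal entity (OrderLine), uses a value object (Money), and is persisted through a repository port. All names and fields are hypothetical.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.UUID;

// Value object: immutable, compared by value, no identity of its own.
record Money(BigDecimal amount, String currency) {
    Money add(Money other) {
        if (!currency.equals(other.currency)) throw new IllegalArgumentException("currency mismatch");
        return new Money(amount.add(other.amount), currency);
    }
}

// Entity inside the aggregate: has identity, but is reachable only through the root.
class OrderLine {
    private final UUID productId;
    private final int quantity;
    private final Money unitPrice;

    OrderLine(UUID productId, int quantity, Money unitPrice) {
        this.productId = productId;
        this.quantity = quantity;
        this.unitPrice = unitPrice;
    }

    Money subtotal() {
        return new Money(unitPrice.amount().multiply(BigDecimal.valueOf(quantity)), unitPrice.currency());
    }
}

// Aggregate root: the only entry point for changing the aggregate; enforces invariants.
class Order {
    private final UUID id = UUID.randomUUID();
    private final List<OrderLine> lines = new ArrayList<>();

    void addLine(UUID productId, int quantity, Money unitPrice) {
        if (quantity <= 0) throw new IllegalArgumentException("quantity must be positive"); // invariant
        lines.add(new OrderLine(productId, quantity, unitPrice));
    }

    Money total() {
        return lines.stream().map(OrderLine::subtotal)
                .reduce(Money::add)
                .orElse(new Money(BigDecimal.ZERO, "USD"));
    }

    UUID id() { return id; }
}

// Repository: a port that persists and reconstitutes whole aggregates.
interface OrderRepository {
    Optional<Order> findById(UUID id);
    void save(Order order);
}
```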

In domain-driven design (DDD), there are also some important patterns and principles for identifying, communicating, and selecting model boundaries and relationships between different subsystems to ensure the integrity of the domain model and the efficiency of collaboration between organizations.

Here is an explanation of these patterns and principles, expressed in a clearer way:

  1. Shared Kernel: When two teams decide to share a core area or common subarea, they can reduce duplication by sharing the kernel, making integration between the two subsystems easier. The shared kernel can be used to solve situations such as event sharing between producers and consumers.

  2. Customer-Supplier: In the customer-supplier pattern, the customer's services depend one-way on the supplier's services. The customer's needs constrain the supplier's freedom of development, but both parties keep their own domain model boundaries and contexts and can evolve independently.

  3. Follower (Conformist): In the conformist pattern, the client's services depend one-way on the provider's services, and the provider openly shares its model with the client. As a follower, the client eliminates the cost of translation between bounded contexts by strictly adhering to the provider's model.

  4. Anticorruption Layer: In the anticorruption layer pattern, the client's services depend one-way on the provider's services. Because the provider evolves its domain model independently, the client builds an anti-corruption layer, typically using the Facade or Adapter pattern, to isolate the translation logic and reduce the conversion cost caused by changes in the provider's model (a minimal sketch appears after this list).

  5. Separate Ways: In the Separate Ways model, both parties do not consider integration and develop independently, which is suitable for situations where close collaboration is not required.

  6. Open Host Service: When a subsystem is cohesive and meets the common needs of other subsystems, it can be encapsulated as a service and used by all subsystems that need to be integrated with it through an open protocol.

These patterns and principles help teams make informed decisions when doing domain modeling and service design in microservice architecture to ensure collaboration efficiency between different services and domain model consistency.
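As a concrete illustration of the anticorruption layer (pattern 4 above), the following Java sketch shows the idea under the assumption that the upstream CRM exposes a DTO we do not control: the domain code depends only on its own port, and an adapter translates the provider's model into the local model. All type names are hypothetical.

```java
// Local (downstream) model and port owned by our bounded context.
record LocalCustomer(String id, String displayName) {}

interface CustomerLookup {                  // the port our domain code depends on
    LocalCustomer byId(String id);
}

// Upstream provider's model and client, which we do not control and do not want to leak inward.
record UpstreamCustomerDto(String customerNumber, String firstName, String lastName) {}

interface UpstreamCrmClient {
    UpstreamCustomerDto fetchCustomer(String customerNumber);
}

// Anti-corruption layer: an adapter that keeps all translation logic in one place.
class CrmAntiCorruptionLayer implements CustomerLookup {
    private final UpstreamCrmClient crm;

    CrmAntiCorruptionLayer(UpstreamCrmClient crm) {
        this.crm = crm;
    }

    @Override
    public LocalCustomer byId(String id) {
        UpstreamCustomerDto dto = crm.fetchCustomer(id);
        // Translation: upstream naming and structure never reach the domain model.
        return new LocalCustomer(dto.customerNumber(), dto.firstName() + " " + dto.lastName());
    }
}
```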

(2) Domain layering

The goal of a layered architecture is to isolate the domain model, separate concerns in a hierarchical manner, and ensure that changes in the business complexity of this layer do not have a negative impact on the lower layers.

In domain-driven design, Eric Evans proposed a four-layer model, dividing different components into the interface layer, application layer, domain layer and infrastructure layer. This division helps achieve clear architecture and separation of responsibilities.

In addition, the Ports and Adapters architecture, also known as the Hexagonal Architecture, originally proposed by Alistair Cockburn and popularized by Vaughn Vernon in "Implementing Domain-Driven Design", has become the de facto standard for microservice layered architecture.

The design idea of ​​this architecture is to divide the system into an internal core (domain) and external adapters (interfaces), and interact through interfaces (ports). The core contains the domain model and business logic, while the adapter is responsible for interacting and communicating with external systems. This layered and separated structure helps achieve better maintainability, testability, and flexibility.

In short, layered architecture and hexagonal architecture are to achieve system decoupling, modularization and easy management, and are especially suitable for domain-driven design and microservice architecture. These principles and architectures help manage complexity and ensure that systems can make appropriate changes at different levels without affecting other levels.

Later, Jeffrey Palermo proposed the onion architecture. Building on the ports and adapters architecture, it puts the domain at the center of the application and places the communication mechanisms (UI) and the infrastructure the system uses (ORM, search engine, third-party APIs, ...) at the periphery: the outer layers represent communication mechanisms and infrastructure, and the inner layers represent business logic. The ports and adapters architecture and the onion architecture share the same idea: free the application core from infrastructure concerns by writing adapter code, and keep infrastructure code from leaking into the application core. In this way, the tools and communication mechanisms the application uses can easily be replaced, which avoids technology, tool, or vendor lock-in to a certain extent.

Later, Uncle Bob (Robert C. Martin) consolidated the various layered architectures into the Clean Architecture, and Herberto Graça further summarized them as the Explicit Architecture.
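A minimal Java sketch of the ports-and-adapters idea (all names are illustrative): the application core defines inbound and outbound ports as interfaces, the business logic depends only on those ports, and infrastructure adapters on the outside implement them, so frameworks and drivers can be swapped without touching the core.

```java
// Application core: a use case exposed through an inbound port, depending only on an outbound port.
interface PlaceOrderUseCase {                       // inbound port (driven by UI, REST, CLI...)
    String placeOrder(String productCode, int quantity);
}

interface OrderStore {                              // outbound port (implemented by infrastructure)
    void persist(String orderId, String productCode, int quantity);
}

class PlaceOrderService implements PlaceOrderUseCase {
    private final OrderStore store;

    PlaceOrderService(OrderStore store) { this.store = store; }

    @Override
    public String placeOrder(String productCode, int quantity) {
        String orderId = java.util.UUID.randomUUID().toString();
        store.persist(orderId, productCode, quantity);   // business logic only; no SQL, no HTTP here
        return orderId;
    }
}

// Outer layer: an adapter. Swapping it for a JDBC or REST adapter does not touch the core.
class InMemoryOrderStore implements OrderStore {
    private final java.util.Map<String, String> rows = new java.util.HashMap<>();

    @Override
    public void persist(String orderId, String productCode, int quantity) {
        rows.put(orderId, productCode + " x " + quantity);
    }
}

public class HexagonalDemo {
    public static void main(String[] args) {
        PlaceOrderUseCase useCase = new PlaceOrderService(new InMemoryOrderStore());
        System.out.println("created order " + useCase.placeOrder("SKU-1", 2));
    }
}
```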

(3) Service granularity

Compared with monolithic architecture, microservice architecture can decompose business complexity into individual microservices more effectively. Therefore, the granularity of splitting services becomes a key topic. In his book Software Architecture: The Hard Parts, Neal Ford proposes two metrics to objectively measure the size of a service:

  1. Number of statements: This is a metric used to measure the size of a service and provides an objective view of the logic contained within the service. A smaller number of statements usually means clearer, more focused service.

  2. Number of public interfaces: This is the number of public interfaces or operations exposed by the measurement and tracking service. Fewer public interfaces usually mean that the service is more focused on a specific functional area.

While these metrics still involve some subjectivity and variability, they are currently the closest we have to an objective measure of service granularity.

In addition, the following are 6 criteria for measuring service split:

  1. Service scope and functionality: Evaluate whether the service does too many unrelated things, is cohesive enough, and follows the single responsibility principle.

  2. Volatility of the code: Consider whether changes can be localized to specific parts of the service, and whether there are areas that change frequently.

  3. Scalability and throughput: Analyze whether different parts of the service can scale in different ways to meet different throughput needs.

  4. Fault tolerance: Evaluate whether there are errors within the service that could cause critical functions to fail, and whether these errors occur frequently.

  5. Security: Consider whether some parts of the service require a higher level of security to ensure secure access to sensitive data and functionality.

  6. Extensibility: analyze whether the service keeps growing to take on new contexts, so that future changes can be accommodated.

In addition, the following are 4 criteria for measuring service consolidation:

  1. Database transactions: Consider whether ACID transactions are required between services to ensure data integrity and consistency.

  2. Workflow and choreography: evaluate whether workflows between services require Saga transactions or choreography patterns, and whether the resulting inter-service communication impacts performance.

  3. Shared code: Consider whether there is a need to share code, and whether the shared code base changes frequently, is specific to shared domain functionality, and whether there are version control issues.

  4. Data relationships: analyze whether splitting the service allows the data it uses to be split as well, so that cross-service data relationship issues are avoided.

When deciding how to split and merge services, these criteria need to be weighed to ensure that the system is architected for good quality and performance across different aspects.

(4) Process orchestration

Business processes represent the execution sequence of business logic. Arranging them and displaying them in a visual way can effectively connect different business links, reduce cognitive load, and thereby reduce business complexity.

In the microservice architecture, process orchestration within and between services is an important issue. Microservices architecture presents some challenges in handling business processes, monitoring status, and handling failures in processes. Therefore, we need to enrich the design principles and architectural decisions on the microservice architecture to explore and continuously improve the process orchestration mechanism based on microservices.

In the past SOA era, many single services used BPEL engines or enterprise service buses (ESB) for process orchestration. ESB implements communication interactions between various subsystems through message pipelines, allowing services to communicate with each other under ESB scheduling without direct dependence. However, this "smart pipe and dumb endpoint" model moves a lot of logic into the network, making the system expensive, complex, and nearly impossible to troubleshoot.

In today's era, we focus more on building small, single-purpose microservices that connect and communicate through "Smart Endpoints and Dumb Pipes". Components within microservices have more powerful business process processing capabilities, and the process orchestration between services follows the saga model to achieve transaction operations across multiple services.

In short, the orchestration of business processes is of great significance in the microservice architecture. It is necessary to consider how to implement process monitoring, coordination and fault handling under the microservice architecture to meet the needs of modern systems for business processes. At the same time, the emphasis of the microservice architecture is to strengthen the independence and processing capabilities of each service, while adopting a flexible process orchestration method between services.
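To make the saga idea concrete, here is a hand-rolled Java sketch of an orchestrated saga, with no workflow engine and with invented step names: each local transaction is paired with a compensating action, and when a later step fails, the steps already completed are compensated in reverse order.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// One step of the saga: a local transaction plus its compensation.
record SagaStep(String name, Runnable action, Runnable compensation) {}

class SagaOrchestrator {
    void run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.action().run();
                completed.push(step);
            } catch (RuntimeException e) {
                System.out.println(step.name() + " failed: " + e.getMessage() + ", compensating...");
                while (!completed.isEmpty()) {
                    completed.pop().compensation().run();   // undo already-completed steps in reverse order
                }
                return;
            }
        }
        System.out.println("saga completed");
    }
}

public class SagaDemo {
    public static void main(String[] args) {
        new SagaOrchestrator().run(List.of(
            new SagaStep("reserve-inventory",
                    () -> System.out.println("inventory reserved"),
                    () -> System.out.println("inventory released")),
            new SagaStep("charge-payment",
                    () -> { throw new RuntimeException("card declined"); },
                    () -> System.out.println("payment refunded"))
        ));
    }
}
```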

3. Technical complexity analysis

(1) High availability

Underlying logic

CAP principle

Also known as Brewer's theorem, the CAP principle was put forward as a conjecture by Eric Brewer, a computer scientist at UC Berkeley, around 2000, and proved by Seth Gilbert and Nancy Lynch in 2002. It focuses on three core properties of distributed system design: consistency, availability, and partition tolerance. The CAP principle states that a distributed system can satisfy at most two of these three properties at the same time, but never all three.

Specifically:

  • Consistency : On all nodes in the distributed system, for a read operation of the same data, the most recent write result should be returned. That is, the system guarantees data consistency, even in a distributed environment.

  • Availability : Each request can get a non-error response, that is, the system ensures that the service is always available and will not be unable to provide services due to partial node failure.

  • Partition tolerance : The system can keep running when network partitions occur between nodes, that is, when some nodes cannot communicate with each other even though each node may still be reachable by clients.

The importance of the CAP principle is that due to the characteristics of distributed systems, data consistency, availability, and partition tolerance cannot be guaranteed simultaneously in all situations. Therefore, when designing a distributed system, you must weigh and choose a satisfactory combination of features based on specific needs and scenarios.

In short, the CAP principle reminds us that we need to make trade-offs in distributed system design, and choose to satisfy two of consistency, availability, and partition tolerance based on actual conditions.

Here we analyze consistency. In distributed systems, consistency has different levels and types, including the following:

  1. Strong consistency (Atomic Consistency/Strong Consistency): Strong consistency requires that read operations on all nodes return the latest write result, i.e. a read completes only after all replicas have been updated to a consistent state. This guarantees strict data consistency but may carry a performance penalty.

  2. Eventual Consistency: Eventual consistency relaxes the requirements for real-time consistency and allows data to be inconsistent for a period of time, but will eventually tend to a consistent state. The system will ensure that after a period of time, all replicas will eventually converge to a consistent state.

  3. Sequential Consistency: Sequential consistency requires that in a distributed system, the execution of operations must satisfy a certain partial order relationship, that is, execute in a certain order. This does not necessarily require a global total order, it only needs to satisfy some specific partial order relationships. For example, Zookeeper implements sequential consistency.

  4. Linearizability (Linearizable Consistency): Linearizability requires that operations appear to execute in a single global total order. It generally has a higher performance cost but provides the strongest consistency guarantee. For example, etcd provides linearizable reads and writes.

These different levels of consistency provide options for trading off performance and data consistency according to your needs in a distributed system. Strong consistency provides the highest data consistency, but may sacrifice performance; while eventual consistency allows higher performance, but may cause data inconsistency over a period of time. Sequential consistency and linear consistency also provide different trade-offs in different situations. Depending on the needs of your application, you can choose the appropriate consistency level to balance performance and data consistency.
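The gap between strong and eventual consistency can be illustrated with a toy Java sketch (entirely hypothetical, not a real storage engine): a write is acknowledged by the primary replica immediately and copied to a follower only after a delay, so a read from the follower may briefly return stale data before the replicas converge.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// A toy replicated key-value store with asynchronous (eventually consistent) replication.
class ToyReplicatedStore {
    private final Map<String, String> primary = new ConcurrentHashMap<>();
    private final Map<String, String> follower = new ConcurrentHashMap<>();
    private final ScheduledExecutorService replicator = Executors.newSingleThreadScheduledExecutor();

    void write(String key, String value) {
        primary.put(key, value);                                                             // acknowledged immediately
        replicator.schedule(() -> follower.put(key, value), 100, TimeUnit.MILLISECONDS);     // replicated later
    }

    String readFromFollower(String key) { return follower.get(key); }                        // may be stale

    void shutdown() { replicator.shutdown(); }
}

public class ConsistencyDemo {
    public static void main(String[] args) throws InterruptedException {
        ToyReplicatedStore store = new ToyReplicatedStore();
        store.write("user:1", "alice");
        System.out.println("immediately after write: " + store.readFromFollower("user:1")); // likely null (stale)
        Thread.sleep(200);
        System.out.println("after convergence:       " + store.readFromFollower("user:1")); // "alice"
        store.shutdown();
    }
}
```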

BASE theory

BASE theory is a design philosophy for distributed systems. BASE is an abbreviation of Basically Available, Soft State, and Eventual Consistency, and it was proposed by eBay architect Dan Pritchett.

In BASE theory, the meanings of each concept are as follows:

  • Basically Available: When a system failure or abnormal situation occurs, partial availability is allowed to be reduced, but the availability of core functions is still guaranteed. This means the system can continue to provide services with limited losses without being completely paralyzed.

  • Soft State: Allows the system to exist in intermediate states for a period of time without affecting the overall availability of the system. This usually involves delays in replica synchronization, where data may be inconsistent between different nodes for a period of time.

  • Eventual Consistency: This is the level of consistency pursued in distributed systems. It means that although data may be inconsistent across nodes for a period of time, it will eventually reach a consistent state. The system can handle data synchronization asynchronously to gain flexibility in performance.

The BASE theory is actually an evolution of the CAP principle. It emphasizes that in a distributed system, strong consistency does not necessarily have to be pursued, but consistency and availability can be weighed according to the characteristics and needs of the business. Each application can adopt appropriate methods to achieve eventual consistency based on its own circumstances, that is, after a period of time, the data will eventually reach a consistent state. This method has been practiced and verified in large-scale Internet systems.

Availability and reliability

When talking about high availability, we often refer to SRE (Site Reliability Engineering). High availability means that the system can remain available for a large part of the time, while SRE focuses on the reliability engineering of the system, that is, ensuring that the system can work stably and reliably during operation.

Availability is a description of the time a system is available for use, depending on how long the system is running during a specific period of time. Higher system availability means that the system is available for a longer period of time with shorter interruption times.

Reliability is a description of the system failure interval, which takes into account the number of failures that occur in the system during operation. The goal of SRE is to reduce the number of system failures through a series of engineering practices, thereby improving system reliability.

| Availability level | Description | Annual downtime | Weekly downtime | Daily downtime |
| --- | --- | --- | --- | --- |
| 99% | Basic availability | 87.6 hours | 1.68 hours | 14 minutes |
| 99.9% | Higher availability | 8.76 hours | 10.1 minutes | 86 seconds |
| 99.99% | High availability | 52.6 minutes | 1.01 minutes | 8.6 seconds |
| 99.999% | High availability (telecommunications) | 5.26 minutes | 6.05 seconds | 0.86 seconds |
| 99.9999% | Extremely high availability (aerospace) | 31.5 seconds | 0.6 seconds | 86 milliseconds |

To measure reliability, the following three indicators are usually used: (MTBF = MTTF + MTTR)

  1. Mean Time Between Failure (MTBF): It indicates the average time interval between failures after the system has been running for a period of time. A longer MTBF means the system is less likely to fail.

  2. Mean Time To Failure (MTTF): It represents the average running time of the system under normal working conditions. A longer MTTF indicates that the system can operate stably under normal conditions.

  3. Mean Time To Repair (MTTR): It represents the average time from system failure to successful recovery. A shorter MTTR means the system can quickly return to normal after a failure.

Availability = Uptime / (Uptime + Downtime) = MTBF / (MTBF + MTTR)

To sum up, high availability emphasizes the availability of the system most of the time, while SRE focuses on reducing the frequency and duration of system failures by improving the level of reliability engineering, thereby achieving stable and reliable operation of the system.

Example: at 99.9% availability, annual downtime = 8,760 hours × 0.1% = 8.76 hours (1 year = 365 days = 8,760 hours).

| | Annual downtime | Failures per year |
| --- | --- | --- |
| System A | 5 minutes | 50 |
| System B | 1 hour | 1 |

Availability of A > availability of B (A accumulates less downtime).

Reliability of A < reliability of B (A fails far more often).
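Plugging the formula above into the System A / System B example, a short Java sketch shows why A has the higher availability but the lower reliability; MTBF is approximated here as hours of uptime divided by the number of failures.

```java
public class AvailabilityDemo {
    // availability = uptime / (uptime + downtime); MTBF approximated as uptime / number of failures
    static double availability(double uptimeHours, double downtimeHours) {
        return uptimeHours / (uptimeHours + downtimeHours);
    }

    static double mtbfHours(double uptimeHours, int failures) {
        return failures == 0 ? Double.POSITIVE_INFINITY : uptimeHours / failures;
    }

    public static void main(String[] args) {
        double yearHours = 365 * 24;        // 8,760 hours

        double downA = 5.0 / 60;            // System A: 5 minutes of downtime, 50 failures
        double downB = 1.0;                 // System B: 1 hour of downtime, 1 failure

        System.out.printf("A availability: %.6f, MTBF: %.1f h%n",
                availability(yearHours - downA, downA), mtbfHours(yearHours - downA, 50));
        System.out.printf("B availability: %.6f, MTBF: %.1f h%n",
                availability(yearHours - downB, downB), mtbfHours(yearHours - downB, 1));
        // A's availability is higher (less total downtime), but B's MTBF is much longer (fewer failures).
    }
}
```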

Resilience Engineering

Resilience engineering refers to an architectural feature that allows a system to remain tolerant of expected failures. This includes various aspects, such as fault tolerance, disaster tolerance, self-healing and disaster recovery.

Key indicators on resilience:

  • Recovery Point Objective (RPO): This is the maximum amount of data loss that the system can tolerate. It measures the data redundancy backup capability of the disaster recovery system. The smaller the RPO value, the lower the system's tolerance for data loss.

  • Recovery Time Objective (RTO): This is the maximum time that the system can tolerate service outage. It measures the business recovery capability of the disaster recovery system. The smaller the RTO value, the faster the business recovery time required by the system.

In "Resilience Engineering" by David D. Woods, resilience engineering focuses on the following four concepts:

  • Robustness: The ability of a system to absorb expected disturbances.

  • Rebound: The ability of a system to recover quickly from a traumatic event.

  • Graceful Extensibility: The system's ability to adapt to and handle unexpected situations.

  • Sustained Adaptability: The system's ability to continuously adapt to changing environments, stakeholders, and needs.

Taking the current Service Mesh as an example, the microservice resilience technologies involved include:

  • Service timeout (Timeout)
  • Service retry (Retry)
  • Service rate limiting (Rate Limiting)
  • Circuit Breaker
  • Fault Injection
  • Bulkhead isolation technology (Bulkhead)

In addition, resilience also covers fast startup and graceful shutdown during service scale-out and scale-in, load balancing and outlier eviction in service registration and discovery, as well as traffic-control techniques such as grayscale (canary) releases, A/B testing, and blue-green deployment. These approaches all contribute to building highly resilient systems that remain stable and reliable in the face of failures and uncertainty.
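As a hedged illustration of two of these primitives, here is a hand-rolled Java sketch of a remote call guarded by a per-attempt timeout, a bounded retry, and a very simple circuit breaker. Production systems would normally get this from the mesh sidecar or a library such as Resilience4j; all thresholds and names below are arbitrary.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// A minimal circuit breaker: opens after N consecutive failures, rejecting calls for a cool-down period.
class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized boolean allowCall() {
        if (consecutiveFailures < failureThreshold) return true;
        if (System.currentTimeMillis() - openedAt > openMillis) { consecutiveFailures = 0; return true; } // half-open
        return false;
    }

    synchronized void recordSuccess() { consecutiveFailures = 0; }

    synchronized void recordFailure() {
        if (++consecutiveFailures == failureThreshold) openedAt = System.currentTimeMillis();
    }
}

public class ResilientCall {
    private static final ExecutorService pool = Executors.newCachedThreadPool();

    // Retry up to maxAttempts, each attempt bounded by timeoutMillis, guarded by the circuit breaker.
    static <T> T call(Supplier<T> remote, int maxAttempts, long timeoutMillis, SimpleCircuitBreaker breaker) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (!breaker.allowCall()) throw new IllegalStateException("circuit open, failing fast");
            Callable<T> task = remote::get;
            Future<T> future = pool.submit(task);
            try {
                T result = future.get(timeoutMillis, TimeUnit.MILLISECONDS);   // service timeout
                breaker.recordSuccess();
                return result;
            } catch (Exception e) {
                future.cancel(true);
                breaker.recordFailure();
                System.out.println("attempt " + attempt + " failed: " + e.getClass().getSimpleName());
            }
        }
        throw new RuntimeException("all retries exhausted");
    }
}
```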

Chaos engineering

Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's ability to withstand turbulent conditions in production. It adopts the scientific method of falsifiability: it tests whether a defined steady-state hypothesis about the system can be falsified under the turbulent conditions of the production environment, thereby strengthening confidence in the system. Through chaos engineering, a culture of resilience in the face of unexpected system conditions can be built.

In "Antifragility: Profiting from Disorder" published by Nassim Nicholas Taleb in 2012, fragility refers to suffering losses due to volatility and uncertainty, while antifragility refers to the loss through adaptation. To benefit from, and even profit from, chaos and uncertainty.

Chaos engineering takes a position that is at odds with resilience engineering, human factors, and safety systems research: when improving system robustness, it looks for and eliminates systemic weaknesses rather than merely cataloguing what can go wrong. Chaos engineering emphasizes choosing the right experiments to understand the steady state of the system under test, which requires domain experts with a deep understanding of the system's internal logic, especially how the system is designed around fault-tolerance principles and expected failures. It also means understanding not only which faults the system can tolerate, but also how the system responds to expected conditions in actual operation.

Chaos engineering for SRE and development teams is not only about diagnosing errors, but more importantly, studying which operations are correct. This requires selecting appropriate experiments to gain a deep understanding of the state of the system under test. This approach emphasizes an in-depth understanding of the system, including its fault-tolerance logic and actual operating conditions, thereby improving system resilience and stability.

Observability

Observability plays an important role in resilience engineering and chaos engineering, and is one of the three major characteristics of service mesh, which is designed for cloud native service governance.

Observability refers to the real-time collection of telemetry data in a distributed system, observing these data on the management and control side (control plane), deciding whether to intervene, and issuing intervention rules to the operation side (data plane). This is a key approach to achieving high availability, with the ultimate goal of governance for microservices or cloud-native services.

Observability involves three key aspects:

  1. Metrics: used to record continuous, aggregated data, for example the depth of a queue or the number of HTTP requests. Metrics help detect anomalies and drive alerts.

  2. Logging: used to record discrete events, such as application debugging or error information. Logs are the basis for problem diagnosis.

  3. Tracing (call-chain tracing): used to record information within the scope of a request, such as the execution path and latency of remote method calls. Traces help troubleshoot system performance issues.

These three are the basis of observability and complement each other: Metrics is used to find anomalies, Tracing is used to locate problems, and Logging is used to find the source of errors. This process is iterative, and Metrics are adjusted based on previous problem analysis to detect or prevent similar problems earlier.

In terms of distributed tracing infrastructure, most systems are based on the design of Google's Dapper. For example, Twitter's Zipkin and Uber's Jaeger are both distributed tracing tools. OpenTracing is an open, vendor-neutral standard API for distributed call-chain tracing; it is widely adopted and compatible with tools such as Zipkin and Jaeger.

OpenCensus was launched by Google. Unlike OpenTracing, it also covers Metrics and provides an Agent and a Collector for data collection. OpenTelemetry merges OpenTracing and OpenCensus and entered the CNCF Sandbox. It aims to be the unified solution for Metrics, Tracing, and Logging, providing unified context storage and propagation to achieve comprehensive observability.
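As one concrete example of the tracing pillar, the following sketch uses the OpenTelemetry Java API to wrap a unit of work in a span. It assumes the opentelemetry-api dependency and an already configured SDK and exporter, which are not shown, and the service and attribute names are invented.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class TracedCheckout {
    private static final Tracer tracer = GlobalOpenTelemetry.getTracer("checkout-service");

    public void chargePayment(String orderId) {
        Span span = tracer.spanBuilder("charge-payment").startSpan();   // one node in the call chain
        try (Scope ignored = span.makeCurrent()) {                      // propagates trace context to child calls
            span.setAttribute("order.id", orderId);
            // ... call the payment provider here ...
        } catch (RuntimeException e) {
            span.recordException(e);                                    // the error ends up on the trace
            throw e;
        } finally {
            span.end();                                                 // records the span's duration
        }
    }
}
```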

Further reading: Apache SkyWalking, an application performance monitoring (APM) tool.

(2) Performance

In the field of technical architecture, the term "high performance" is to some extent a rhetorical flourish. In the context of software architecture it really means "performance", which emphasizes the efficiency of the system in all its aspects.

The following are some connotations of high performance in technical architecture:

Different dimensions of performance:

In different contexts, performance may include one or more of the following aspects:

  1. Short response time/low latency: This emphasizes the rapid response of the system when processing requests, ensuring that users or customers are not dissatisfied due to long waiting times.

  2. High throughput: This focuses on the workload that the system can handle, that is, how many tasks can be completed in a given time to ensure efficient processing.

  3. Low resource utilization: This means that the system efficiently utilizes computing resources while completing tasks to avoid wastage of resources.

  4. Capacity: This refers to the carrying capacity of the system, that is, how many requests or tasks the system can handle at the same time. It is usually related to the architecture and scale of the system.

Response time and latency:

Although response time and latency are often used interchangeably, they are not the same. Response time is the overall wait perceived by a customer or user, including processing time, network delays, and queueing delays. Latency refers specifically to the time a request spends waiting to be handled, during which it is latent, awaiting service.
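Because response times form a skewed distribution, averages hide the slow tail, so tail latency is usually summarized with percentiles such as p95 or p99. A small Java sketch of the nearest-rank calculation, with invented sample values:

```java
import java.util.Arrays;

public class LatencyPercentiles {
    // Nearest-rank percentile over a sorted sample of response times in milliseconds.
    static long percentile(long[] sortedMillis, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sortedMillis.length);
        return sortedMillis[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        long[] samples = {12, 15, 14, 13, 18, 16, 410, 14, 15, 17};   // one slow outlier
        Arrays.sort(samples);
        System.out.println("mean ~ " + Arrays.stream(samples).average().orElse(0) + " ms");
        System.out.println("p50  = " + percentile(samples, 50) + " ms");
        System.out.println("p99  = " + percentile(samples, 99) + " ms"); // dominated by the outlier
    }
}
```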

Dimensions of performance efficiency:

Performance efficiency involves many aspects and can be understood from three dimensions: time behavior, resource utilization and capacity:

  1. Temporal behavior: This includes the response time and processing speed of the system, i.e. how long it takes to process a request and the number of requests the system is able to handle.

  2. Resource utilization: This focuses on the resource utilization of the system during operation, such as CPU, memory, network, etc., to avoid resource waste and overuse.

  3. Capacity: Capacity refers to the load that the system can withstand, including peak loads and long-term loads, ensuring that the system can operate normally under various conditions.

Performance considerations in different scenarios:

Performance concerns will vary for different types of systems. For example, a batch system may be more concerned with throughput, while an online system is more concerned with response time. Depending on the specific scenario, different strategies can be adopted to improve performance, such as improving throughput through asynchronous mechanisms and reducing retrieval latency through caching and indexing.

When building a technical architecture, taking multiple aspects of performance into consideration and optimizing according to specific needs can help the system demonstrate high efficiency and performance under various circumstances.

(3) Event-driven architecture

Microservices and event-driven architecture (EDA) are regarded as two core concepts of modern architectural styles: the former is request-driven and data-centric, while the latter is event-centric.

Martin Fowler proposed three different types of event patterns that play an important role in building distributed systems:

  1. Event Notification: This pattern involves the system notifying other interested parties when an event occurs in the system. This pattern can be used for real-time notifications, publish-subscribe patterns, etc. to perform appropriate actions when an event occurs.

  2. Event-carried state transfer: In this pattern, an event is not only a notification but also carries state information, so that the receiver can process it based on the state carried by the event. This pattern is often used in scenarios such as state machines and workflows to help the system coordinate when state changes.

  3. Event sourcing: Event sourcing is a way of recording system state changes. By recording the occurrence of each event, the historical state of the system can be reconstructed. This is useful for auditing, troubleshooting, and implementing features such as time travel.

These event patterns all help to implement event-driven architecture in distributed systems. Through the publishing, subscribing and processing of events, the system can respond to changes and interactions more flexibly.

Event

An event is a fact that has happened and is immutable. In contrast, a message is raw data produced by one service for consumption or storage by another service, and the message can be modified.

An event producer simply records and delivers the fact; it does not care who will handle the event, why, or how. A message producer, by contrast, knows who will consume the message and what needs to be packed into it for the consumer to process.

An event broker is designed to provide a log of facts; events are deleted only after a retention period defined by the organization or business. A message broker is designed around specific integration concerns; once the consumer has acknowledged the message, it can be deleted.

| | Event | Message |
| --- | --- | --- |
| Data | A fact that has already happened; immutable | Raw data produced for consumption or storage; can be modified |
| Producer/Consumer | The producer does not know who consumes the event or how it will be handled | The producer knows who the consumer is and what the consumer needs |
| Broker | Provides a fact log; events are deleted only after a retention timeout | Handles specific integration concerns; messages are deleted once acknowledged by the consumer |

  • Discrete events: describe changes in state and are executable

  • Continuous events: describe a condition and are analyzable

Usually, events are discrete, used to describe changes in the state of a thing, and can be executed. Consumers perform corresponding actions based on the status described by discrete events.

Events can also be part of a continuous data stream and are used to describe the current state of a thing. These continuous events are analyzable, and consumers can analyze certain trends and the reasons behind them based on changes in these states.

Events should be designed to be of minimum size, simplest type, and have a single purpose. Here we will focus on CloudEvents. CloudEvents entered the CNCF Foundation's sandbox project in May 2018, and then became an incubation project of CNCF in just over a year. Its development speed is very fast. CloudEvents will become the standard protocol for event communication between cloud services. At the same time, it should be emphasized that CloudEvents has released multiple binding specifications for message middleware.
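As an illustration of a small, single-purpose event, here is a Java sketch of an event envelope modeled on the CloudEvents context attributes (specversion, id, source, type, time, datacontenttype, plus the data payload). It is a hand-written stand-in rather than the official CloudEvents SDK, and the order-related values are invented.

```java
import java.net.URI;
import java.time.OffsetDateTime;
import java.util.UUID;

// A minimal event envelope modeled on the CloudEvents context attributes.
record CloudStyleEvent(
        String specversion,          // e.g. "1.0"
        String id,                   // unique per event
        URI source,                  // who produced the fact
        String type,                 // what happened, named in the past tense
        OffsetDateTime time,
        String datacontenttype,      // e.g. "application/json"
        byte[] data) {               // opaque payload; its schema is resolved separately

    static CloudStyleEvent orderCreated(String orderJson) {
        return new CloudStyleEvent(
                "1.0",
                UUID.randomUUID().toString(),
                URI.create("/services/order"),          // illustrative source
                "com.example.order.created",            // illustrative type
                OffsetDateTime.now(),
                "application/json",
                orderJson.getBytes());
    }
}
```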

Service definition pattern

We already know that the producer of an event does not know who its consumers are, so the two sides cannot agree on a payload format in advance the way they can with messages. In an event-driven architecture, a Schema Registry is therefore needed to provide the serialization basis for producers and the deserialization basis for consumers.

A schema plays a role similar to the proto definition in gRPC. In the request-driven model, the gRPC server and client each generate stub code from the proto definition and expose it to their own upper-level code to perform serialization and deserialization.

Similarly, in the event-driven model, after a consumer receives an event it can parse out the Schema reference and the Content (usually binary) according to the CloudEvents protocol, and then call the Schema Registry service to deserialize the Content into the event body.

It can be seen that the event service definition pattern can fully decouple the producers and consumers of events.
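A minimal Java sketch of this pattern, with hypothetical interfaces standing in for a real schema registry client: the producer registers a schema and serializes against it, while the consumer reads the schema id carried by the event and asks the registry for the schema before deserializing the content.

```java
// Hypothetical registry client: maps schema ids to schema definitions shared by producers and consumers.
interface SchemaRegistry {
    int register(String subject, String schemaDefinition);    // producer side
    String schemaById(int schemaId);                           // consumer side
}

interface PayloadCodec {
    byte[] serialize(Object body, String schemaDefinition);
    Object deserialize(byte[] content, String schemaDefinition);
}

class EventConsumer {
    private final SchemaRegistry registry;
    private final PayloadCodec codec;

    EventConsumer(SchemaRegistry registry, PayloadCodec codec) {
        this.registry = registry;
        this.codec = codec;
    }

    // The event carries only a schema id and the binary content;
    // producer and consumer never need to agree on the format at compile time.
    Object handle(int schemaId, byte[] content) {
        String schema = registry.schemaById(schemaId);
        return codec.deserialize(content, schema);
    }
}
```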

EventBridge

EventBridge is a serverless event bus service that provides users with a loosely coupled, distributed event-driven architecture. EventBridge's event transmission and storage follow the CloudEvents protocol.

In EventBridge, the producer of an event is called the event source, the medium that transmits and stores events is called the event bus, and the consumer of an event is called the event target. Events are transformed, matched, aggregated, and routed to event targets by event rules.

EventBridge connects the two ends of event production and consumption, providing users with low-code, loosely coupled, and highly available event-processing capabilities. EventBridge is based on a standard event protocol, which helps unify event standards across event sources and gradually integrate event islands into a complete event ecosystem. (Extended reading: AWS EventBridge)
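The source, bus, rule, and target flow can be sketched in a few lines of Java. This is a toy in-process model of the idea rather than the EventBridge API, and all names are invented: a rule matches on event attributes and routes matching events to its targets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

record BusEvent(String source, String type, Map<String, Object> detail) {}

// A rule: a match condition plus the targets that receive matching events.
record EventRule(String name, Predicate<BusEvent> pattern, List<Consumer<BusEvent>> targets) {}

class ToyEventBus {
    private final List<EventRule> rules = new ArrayList<>();

    void addRule(EventRule rule) { rules.add(rule); }

    void publish(BusEvent event) {
        for (EventRule rule : rules) {
            if (rule.pattern().test(event)) {
                rule.targets().forEach(target -> target.accept(event));   // route to every target of the rule
            }
        }
    }
}

public class EventBusDemo {
    public static void main(String[] args) {
        ToyEventBus bus = new ToyEventBus();
        bus.addRule(new EventRule("order-created-to-billing",
                e -> e.source().equals("order-service") && e.type().equals("order.created"),
                List.of(e -> System.out.println("billing received " + e.detail()))));

        bus.publish(new BusEvent("order-service", "order.created", Map.of("orderId", "42")));
        bus.publish(new BusEvent("user-service", "user.signed-up", Map.of()));   // matches no rule
    }
}
```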

(4) Cloud native

What counts as cloud native

Cloud native is a software development and deployment methodology that aims to maximize the benefits of cloud computing environments to improve application flexibility, scalability, and maintainability. For an application to be considered cloud native, it needs to meet a set of characteristics and principles to ensure it takes full advantage in a cloud environment.

Among them, cloud native application services have the following characteristics:

  1. Automation: Cloud-native applications are deployed, managed, and scaled through automation. Automation can reduce manual intervention and improve system resilience and reliability.

  2. Elasticity: Cloud-native applications can automatically scale according to changes in load to meet different needs. This allows applications to efficiently handle peaks and valleys in traffic.

  3. Observability: Cloud-native applications can monitor and analyze application performance and status in real time by integrating various monitoring, logging, and tracing tools to discover and solve problems in a timely manner.

  4. Loose coupling: Cloud-native applications use a microservices architecture to split the application into small, independent services. These services can be developed, deployed, and scaled independently, making teams more efficient.

  5. Containerization: Cloud-native applications are packaged and deployed using container technology. Containers can run in different environments to ensure consistent application behavior in various scenarios.

  6. Continuous delivery: Cloud native applications achieve rapid code submission, construction and deployment through continuous integration and continuous delivery processes, thereby accelerating the release of new features.

  7. Infrastructure as code: Cloud native applications incorporate the management and configuration of infrastructure into code, defining and managing infrastructure programmatically to ensure consistency and repeatability.

  8. Service mesh: Cloud-native applications use a service mesh for communication and coordination between services, decoupling communication logic from application code and providing better observability and control.

In short, cloud-native applications build and run applications in a more modern and agile way, taking full advantage of cloud computing and container technology to provide better performance, reliability, and scalability. It emphasizes automation, elasticity, observability and loose coupling to meet rapidly changing business needs.

Immutable infrastructure

The "Immutable Infrastructure" principle introduced by Chad Fowler means that once instances of the infrastructure are created, they enter a read-only state, and any modifications or upgrades need to be achieved by replacing new instances. This approach emphasizes simplicity, reliability, and consistency of deployment, effectively reducing many common problems and failures.

The advantage of immutable infrastructure is that it increases the consistency and reliability of the infrastructure while simplifying the deployment process and providing a more predictable environment. It avoids common problems in mutable infrastructure such as configuration drift and snowflake servers. However, effectively utilizing immutable infrastructure often requires the ability to automate deployments, the ability to quickly provision servers in a cloud computing environment, and solutions to handle stateful or ephemeral data such as logs.

By comparison, the old way of working treated servers as "pets", for example naming a mail server "Bob"; if something goes wrong with "Bob", it can bring down the entire system. The modern approach treats servers as numbered "cattle", such as "www001" through "www100"; when one fails, it is simply taken out of service and replaced with a new one.

"Snowflake servers" are similar to "pets" in that they require manual management and may cause environments to become unique due to frequent updates and adjustments. In contrast, a "Phoenix server" is similar to a "Cow", always built from scratch through an automated process, and can be easily recreated or "reborn".

"Infrastructure as Code" (IaC) is a method of describing the infrastructure layer as code and versioning it through the code base. This approach uses tools, such as Terraform, to automate the creation and management of infrastructure.

Mecha runtime

In the book "Multiple Runtime Microservice Architecture", Bilgin Ibryam divides the requirements of modern distributed applications into four main types:

  1. Lifecycle : Covers the packaging, deployment, and running processes of components, as well as recovering from errors and expanding services. This type of requirement focuses on the life cycle management of the entire application, including how to efficiently deploy, runtime failure handling, and automatic expansion.

  2. Networking : Involves service discovery, error recovery, tracking and telemetry, etc. In addition, it includes message exchange modes such as point-to-point communication and publish/subscribe modes, as well as intelligent routing, etc. The requirements in this area focus on how to build a powerful network architecture to support application communication and interaction.

  3. State : In this type of requirement, state refers to both the state of the service itself and the state of the service management platform. State management is important in performing reliable service orchestration, distributed task scheduling, temporary task scheduling (such as scheduled jobs), idempotent operations, stateful error recovery, caching, etc. You need to pay attention to the underlying state management mechanism.

  4. Binding : The components of a distributed system need to communicate with each other and integrate with external systems. This means that the connector needs to be able to support various protocol conversions, message exchange modes, format conversions, custom error recovery procedures and security mechanisms, etc. This type of requirement focuses on how to achieve powerful binding and integration capabilities.

Bilgin Ibryam proposed Mecha as a future architecture trend, which is a concept as an external extension mechanism (Sidecar) for business services. Mecha has the following characteristics:

  • Mecha is a general-purpose, highly configurable and reusable component that provides distributed primitives that can be directly used to build applications.
  • Each Mecha instance can be configured for use with a single Micrologic (business component) or shared with multiple components.
  • Mecha makes no assumptions about the Micrologic runtime and can be used with multi-language microservices or even monolithic systems, using open protocols and formats.
  • Mecha is declaratively configured via simple text formats (e.g. YAML, JSON), defining the functionality to be enabled and how to bind it to Micrologic's endpoints.
  • For specific API interactions, specifications can be attached, such as OpenAPI, AsyncAPI, ANSI-SQL, etc.
  • The design goal of Mecha is to merge agents with different functions into one, such as network agents, cache agents, binding agents, etc., to provide integration of multiple capabilities.
  • Some issues related to lifecycle management of distributed systems can be provided by management platforms (such as Kubernetes or other cloud services), while Mecha runtimes follow common open specifications (such as Open App Model).

The goal of this architectural thinking is to provide more flexible, scalable, reliable, and easy-to-manage microservice applications while taking full advantage of modern cloud-native technologies and open protocols.

Service mesh

Service mesh is an infrastructure layer used for communication between cloud-native services. It uses the Sidecar model to provide capabilities such as resilience, traffic transfer, communication security, and observability. Its core idea is to use container technology to upgrade and evolve the microservice architecture in a non-invasive way.

In a service mesh, the main advantage of the Sidecar model is that infrastructure capabilities are pushed down and deployed independently. This means upgrades to these basic capabilities do not affect the service itself, achieving decoupling. A major challenge of this model, however, is that the business container and the sidecar container must coexist in the same Pod, which increases technical complexity.

Currently, the main service mesh products include Istio, led by Google (where the sidecar core of the data plane is Envoy) and Linkerd maintained by CNCF. These products are designed to provide a flexible and powerful solution to support communication needs between cloud-native applications and bring better maintainability and scalability to microservices architectures.

Distributed application runtime

Distributed Application Runtime (Dapr for short) also uses the Sidecar model and is implemented with the help of container technology. However, the core goal of Dapr is different. It provides a framework during the development phase and provides Mecha-like capabilities at runtime, somewhat similar to the concept of Java EE. Specifically, Dapr is a portable, serverless, event-driven runtime environment. It is designed to enable developers to easily build elastic, stateless and stateful microservices that can run in the cloud and edge environments, lowering the barriers to building modern cloud-native applications.

Through Dapr, developers can get rid of complex underlying implementation details and focus on writing business logic. Dapr provides a series of building blocks and components for handling common distributed system problems, such as inter-service communication, state management, event handling, etc. These functions are deployed in Sidecar mode and coexist with the business container in the same application instance. This design is designed to make it easier for developers to create reliable, observable, and scalable distributed applications.

In short, Dapr's goal is to simplify the development and deployment of distributed applications and provide a higher level of abstraction so that developers can focus more on the implementation of business logic without having to think too much about the underlying complexity.
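A minimal Java sketch of talking to a Dapr sidecar over its HTTP API, assuming the sidecar listens on the default port 3500 and that components named statestore and pubsub have been configured; in practice the Dapr SDKs wrap these calls.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DaprSidecarDemo {
    private static final String DAPR = "http://localhost:3500/v1.0";   // the sidecar, not the remote service
    private static final HttpClient http = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        // Save state through the sidecar; Dapr maps "statestore" to the configured state component.
        send(DAPR + "/state/statestore",
             "[{\"key\":\"order-42\",\"value\":{\"status\":\"CREATED\"}}]");

        // Publish an event; Dapr maps "pubsub" to the configured message broker.
        send(DAPR + "/publish/pubsub/orders",
             "{\"orderId\":\"42\",\"status\":\"CREATED\"}");
    }

    private static void send(String url, String json) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(url + " -> HTTP " + response.statusCode());
    }
}
```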

Apache Camel

Apache Camel is a framework based on Enterprise Integration Patterns (EIP). It provides a rich set of connectors and components for integrating various systems and applications. The goal of Apache Camel is to simplify integration between different applications and make it easier for data to flow between different systems. It can be seen as one direction for the Binding capability in the Mecha model.

Here are some Apache Camel projects:

  1. Camel Core : This is the core project of Apache Camel and provides a framework for integrating various systems to produce and consume data.

  2. Camel Karaf : Supports running Camel on OSGi container Karaf.

  3. Camel Kafka Connector : Use all Camel components as Kafka Connect connectors, enabling Camel to integrate with Apache Kafka.

  4. Camel Spring Boot : Provides automatic configuration for the Camel context by automatically detecting the Camel routes available in the Spring context, and registers key Camel utilities as beans to facilitate integration and use in Spring Boot applications.

  5. Camel Quarkus : Over 280 Camel components ported and packaged as extensions for Quarkus to make it easier to use Camel in Quarkus applications.

  6. Camel K : is a lightweight integration framework built on Apache Camel, designed for serverless and microservice architectures, and runs on Kubernetes. It is designed to provide integration capabilities for applications running in Kubernetes environments.

In short, Apache Camel provides developers with a rich set of tools and components, making it easier and more efficient to build data flows and achieve integration between different systems.
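A small Camel route in the Java DSL illustrates the EIP style. It assumes the camel-core, camel-timer and camel-log components are on the classpath and that the route is started by a Camel runtime such as camel-main or Camel Spring Boot.

```java
import org.apache.camel.builder.RouteBuilder;

// A route is an integration pipeline: consume from one endpoint, transform, send to another.
public class GreetingRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("timer:tick?period=1000")            // consumer endpoint: fires once per second
            .setBody(constant("hello from camel"))
            .to("log:demo");                      // producer endpoint: writes the body to the log
        // Swapping "timer:" for "kafka:", "file:" or an HTTP endpoint changes the integration,
        // not the programming model.
    }
}
```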

Scalability and elasticity

Cloud-native architecture gives the services deployed on it out-of-the-box scaling capabilities, meaning services can automatically scale up or down as needed. This helps ensure high availability, since resources are adjusted dynamically based on load so the service remains available. The same elasticity also absorbs peaks and valleys in load: a service automatically expands when it has to handle high load and releases resources when load drops. This not only improves the performance of the service but also reduces its cost, because resource utilization is optimized.

Neal Ford defines these two concepts clearly in "Fundamentals of Software Architecture":

  • Scalability: refers to the system's ability to handle a large number of concurrent users without severe performance degradation. In other words, when the number of users increases, the system is able to maintain a relatively stable performance level without obvious performance problems due to increased load.

  • Elasticity: refers to the system's ability to handle bursts of requests. When the system faces sudden high load or abnormal conditions, an elastic system can adapt and continue to provide a reasonable quality of service without crashing or exhibiting unacceptable delays.

Overall, scalability and elasticity are two important characteristics of cloud-native architecture, which together ensure high availability, performance optimization and cost-effectiveness of services.

Serverless architecture

The core idea of serverless architecture is to let developers focus on business logic without paying too much attention to the underlying technical details. Under a serverless architecture, developers do not need to worry about deployment, resource management, and similar concerns. The architecture is extremely flexible and, combined with event-driven architecture (EDA), has broad application prospects in scenarios such as offline batch processing and stream computing.

Currently, serverless architecture is mainly divided into the following two parts:

  1. Backend as a Service (BaaS): covers infrastructure such as databases and message queues, that is, services that support the business but contain no business logic of their own. BaaS abstracts away the underlying technical details, allowing developers to use these services directly without caring about their implementation.

  2. Function as a Service (FaaS): This section covers business logic functions. In FaaS, developers only need to write functions that handle specific tasks and then deploy these functions to the cloud platform. The cloud platform will trigger these functions based on events to implement business logic. This architecture allows developers to focus more on business logic without having to consider underlying resource management and deployment.
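As one concrete example of the FaaS half, here is a minimal AWS Lambda handler in Java, chosen only as an illustration since the article does not name a provider; it assumes the aws-lambda-java-core dependency, and the event shape is invented. The platform provisions, scales, and triggers it, and the developer supplies only the function.

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import java.util.Map;

// The whole deployable unit is a single function; scaling and wiring are the platform's job.
public class OrderCreatedHandler implements RequestHandler<Map<String, Object>, String> {
    @Override
    public String handleRequest(Map<String, Object> event, Context context) {
        Object orderId = event.get("orderId");                 // invented event field
        context.getLogger().log("processing order " + orderId);
        // ... business logic only: no server, container, or queue management here ...
        return "processed " + orderId;
    }
}
```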

The FaaS landscape shows the function-as-a-service ecosystem within serverless architecture, including more than 50 products and tools that help developers build and manage serverless applications. The main goal of serverless architecture is to reduce the burden on developers so that they can focus more on business innovation and value creation.


Origin blog.csdn.net/xiaofeng10330111/article/details/132515850