Architecture Paradigm 1 - Event Driven Architecture (EDA)

1. What is EDA architecture?

EDA is an asynchronous, message-based communication architecture built on the publish/subscribe model; you can think of it as the observer pattern applied at the architectural level. It is mainly divided into the following core objects, whose collaboration is shown in the schematic diagram.

  • Event: the object to be processed, which can be discrete or ordered; its format can be JSON, XML, or a bank-specific 8583 message;

  • Event bus: receives events pushed from outside and serves as the carrier through which an event flows between different managers; common choices are MQ products such as Kafka or Redis (note that Redis pub/sub does not persist messages);

  • Worker Manager: subscribes to topics and assigns the events it obtains to Workers;

  • Worker: As an executor, it processes and responds to events;

  • MonitorManager: monitors the processing of events; it can be regarded as the events' daemon thread;

  • Message Broker: acts as the message intermediary, assigning events to workers and coordinating their processing sequence (or what we might call a process); the broker maintains the routing rules that drive this sequence.

The general process is:

  • Event is published to Event bus as an event;

  • The Message Broker works together with the Rule component. Events themselves know nothing about "where did I come from" or "where am I going"; the Message Broker dispatches each event to the corresponding WorkManager according to the routing rules defined by Rule;

  • When a WorkManager finds a matching event in its subscribed topic, it assigns the work according to the current load of its workers; the chosen worker then executes the corresponding business process;

  • Throughout the process, the MonitorManager monitors the status of each event. A concrete implementation (taking Kafka as an example): agree that all WorkManagers throw events whose processing fails into a dead-letter topic; the MonitorManager subscribes to that topic and raises the corresponding alerts.
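The flow above can be sketched as a minimal in-memory simulation. All class names (EventBus, MessageBroker, WorkerManager) and the dict-based routing rules are illustrative stand-ins for real middleware such as Kafka topics and consumer groups, not an actual implementation:

```python
from collections import defaultdict, deque

class EventBus:
    """Carries events between components; topics are simple queues."""
    def __init__(self):
        self.topics = defaultdict(deque)
    def publish(self, topic, event):
        self.topics[topic].append(event)
    def poll(self, topic):
        return self.topics[topic].popleft() if self.topics[topic] else None

class WorkerManager:
    """Assigns events from its subscribed topic to workers (round-robin)."""
    def __init__(self, workers):
        self.workers, self._i = workers, 0
    def dispatch(self, event):
        worker = self.workers[self._i % len(self.workers)]
        self._i += 1
        return worker(event)

class MessageBroker:
    """Routes events according to Rule-defined routing; unknown event
    types fall through to the dead-letter topic for the monitor."""
    def __init__(self, bus, rules):
        self.bus, self.rules = bus, rules  # rules: event type -> topic
    def route(self, event):
        topic = self.rules.get(event["type"], "dead-letter")
        self.bus.publish(topic, event)

bus = EventBus()
billing = WorkerManager([lambda e: {**e, "status": "billed"}])
broker = MessageBroker(bus, rules={"bill": "billing-topic"})

broker.route({"type": "bill", "id": 1})
processed = billing.dispatch(bus.poll("billing-topic"))
```

In a real deployment the queues would be durable middleware topics and the dispatch loop would run continuously; the sketch only shows how the objects hand an event to one another.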

2. Applicable scenarios

First, because it is asynchronous, it is especially suitable for the following:

1) Quasi-real-time or non-real-time scenarios with long transaction-processing chains, such as bill management;

2) Fan-out broadcast scenarios, such as the follow-up actions after an order is placed in a mobile shopping app (SMS notification, shipping request, order-status update, etc.); such scenarios are generally non-real-time and can tolerate some latency;

3) Peak-shaving ("cutting peaks and filling valleys") scenarios. For example, upstream systems push large volumes of logs to ELK; ELK is used mainly for storage and statistical analysis and does not need to respond to the upstream in real time.

Second, if the main business flow does not require strong consistency and the process changes rapidly, this architecture is worth considering.

Third, because communication happens asynchronously through pipelines, this architecture is not recommended if your system has strict real-time transaction requirements or is tightly coupled to consumer-facing (2C) page interaction.

3. Advantages

First, in this pattern the system is decomposed into multiple independent but related services or modules. It genuinely embodies high cohesion and low coupling, and is a good example of Y-axis scaling. A bill-processing system the author once ran used exactly this EDA style: each worker was responsible for a single process step (high cohesion), and adding a new step only required inheriting the worker base class and adding a matching routing rule. Doesn't that look like the chain-of-responsibility pattern lifted to the architecture level? (Aside: for the scalability theory, see the AKF scale cube in "The Art of Scalability");

Second, thanks to high cohesion, a new feature usually only requires changing the worker of one node, so the impact of a change is confined to a limited scope (that is, within a single worker);

Third, workers can in theory be scaled horizontally without limit to support large business volumes; when a manager becomes the bottleneck, it can likewise be expanded from a single instance to a cluster;

Fourth, because events are actually persisted in the Event bus, error handling is straightforward and the overall operability of the system improves. For example, if manager2 fails to process event1, the event does not continue down the chain; operations staff can inspect and repair the event and then reroute it back to the same manager for reprocessing.
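The "inherit the base class and add a rule" point from the first advantage can be sketched as follows. The worker names and event fields are invented for illustration; the key idea is that adding InvoiceWorker touches only the new subclass and the rule table, never the existing workers:

```python
class BaseWorker:
    """Base class every processing step inherits from."""
    def handle(self, event):
        raise NotImplementedError

class SmsWorker(BaseWorker):          # an existing step
    def handle(self, event):
        return f"sms sent for order {event['order_id']}"

class InvoiceWorker(BaseWorker):      # a NEW step: subclass + one new rule
    def handle(self, event):
        return f"invoice created for order {event['order_id']}"

# Routing rules map a step name to the worker that owns it; extending
# the system means adding one entry here (the "match the new rule" part).
RULES = {
    "sms": SmsWorker(),
    "invoice": InvoiceWorker(),
}

def process(step, event):
    return RULES[step].handle(event)
```

This is also why the text compares the pattern to chain of responsibility: the flow is defined by the rule table, not hard-coded into the workers.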

4. Idempotency

Because EDA is an event-triggered architectural pattern, it inevitably faces common scenarios such as the following:

  • The same event is repeatedly routed to the same manager for processing due to routing rule errors;

  • The event is consumed repeatedly (for example, it may come from Kafka's rebalancing);

  • An event is manually fished out of the dead-letter topic and reprocessed.

Idempotent design is therefore essential under this architecture. From the database's point of view, all business operations fall into two categories: state-changing operations and queries. For queries, idempotence is irrelevant. For state-changing operations, idempotent design must be considered. Generally speaking, idempotence can be achieved with tokens, status codes, optimistic locks, and so on.

Idempotence is in fact critical at the interface level; several systems the author has been responsible for suffered production incidents caused by missing idempotent design. Below is a summary of solutions drawn from the author's own experience and from colleagues' code; they largely match what you will find elsewhere online.

Idempotency implementation schemes

Deduplication table

Principle:

Design a dedicated table with a unique index (single-column or composite). When a request comes in, attempt an insert; thanks to the database's unique constraint, a successful insert lets execution continue, while a failed insert returns failure.

Concrete approach:

Define a server-side interface that requires the client to send its own UUID_1; when processing, the server generates its own UUID_2 and inserts both into the deduplication table under the unique index. Even if the user submits repeatedly from the front end, the duplicate is rejected because the insert fails (DuplicateKeyException).

The drawback of this scheme is that it involves a database operation, so it suits scenarios where concurrency is modest. Its overall cost (complexity and expense) is low, which fits the KISS principle.
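A minimal sketch of the deduplication-table idea, using SQLite in place of a production database; the table and column names are made up for illustration, and `sqlite3.IntegrityError` plays the role of Spring's DuplicateKeyException:

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# The UNIQUE constraint on the client-supplied UUID is the idempotency guard.
conn.execute(
    "CREATE TABLE dedup (client_uuid TEXT NOT NULL UNIQUE, server_uuid TEXT)"
)

def handle_request(client_uuid):
    try:
        # A repeated submission carries the same client UUID, so the
        # second insert violates the unique index and raises.
        conn.execute("INSERT INTO dedup VALUES (?, ?)",
                     (client_uuid, str(uuid.uuid4())))
        return "processed"
    except sqlite3.IntegrityError:
        return "rejected: duplicate request"

req = str(uuid.uuid4())
first = handle_request(req)    # normal insert, business continues
second = handle_request(req)   # duplicate, rejected by the constraint
```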

Token + Redis mechanism

With the token mechanism, the server provides two interfaces: 1) an interface that issues a token; 2) the real business interface.

First, the client applies to the server for a token; the server returns the token and caches it in Redis for later verification;

Then, the client calls the server's business interface with the token, and the server first tries to delete the token from Redis:

           If the deletion succeeds, the business logic proceeds;

           If the deletion fails, the same interface has already been called, i.e. the request is a duplicate, and the server rejects it outright.

There is some debate online about whether to 1) delete the token first and then process the business, or 2) delete the token after the business completes; each has its pros and cons, but for a safety-first industry like banking, the first option must take priority.

A note on the cost of this scheme: it requires the client to make two interface calls for every request, yet duplicates are definitely not the norm. In other words, 99% of requests pay the overhead to solve 1% of the problems, and even then the solution is not perfect. Personally, I consider it uneconomical.
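A sketch of the token scheme with the delete-first strategy discussed above. A plain dict stands in for Redis here, and `dict.pop` plays the role of an atomic DEL that reports whether the key existed; in production the store and the delete would of course be Redis itself:

```python
import uuid

token_store = {}   # stand-in for Redis

def issue_token():
    """Interface 1: issue a token and cache it for later verification."""
    token = str(uuid.uuid4())
    token_store[token] = True
    return token

def business_call(token):
    """Interface 2: delete the token first, then do the work."""
    if token_store.pop(token, None) is None:
        # Token already consumed (or never issued): duplicate request.
        return "rejected: duplicate or unknown token"
    return "business logic executed"

t = issue_token()
first = business_call(t)    # consumes the token, work proceeds
second = business_call(t)   # same token again, rejected
```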

Status-code mechanism

The status-code approach adds a status field for the transaction in the transaction table or main table (provided the table has a globally unique transaction serial number), and uses that status as the condition that blocks resubmission of an already-submitted transaction with the same serial number.

This approach is generally used in non-2C systems, where in most cases traffic and concurrency are not a concern.

Optimistic locking mechanism

The concrete approach is to add a version field (e.g. #version or #timestamp) to the table being updated; each update carries the version it originally read and succeeds only if the row's current version still matches.
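A sketch of the version-field idea, again using SQLite for illustration; the table, column names, and the withdraw example are invented. The guard is the `WHERE ... AND version = ?` clause: a replayed request carries a stale version, matches zero rows, and is rejected:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY,"
    " balance INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO account VALUES (1, 100, 0)")

def withdraw(amount, read_version):
    # The update only applies if the row still has the version we read;
    # rowcount == 0 means another (or the same, replayed) request
    # already bumped the version.
    cur = conn.execute(
        "UPDATE account SET balance = balance - ?, version = version + 1 "
        "WHERE id = 1 AND version = ?",
        (amount, read_version))
    return cur.rowcount == 1

ok = withdraw(30, read_version=0)    # succeeds, version becomes 1
dup = withdraw(30, read_version=0)   # replay with stale version, rejected
```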
Distributed lock mechanism

Principle:

Take Redis as an example: the distributed lock is built mainly from SETNX + EXPIRE (the lock timeout depends on the business scenario, and is generally greater than the caller's timeout). Within the lock's lifetime, a repeated submission to the same interface (with the same parameters, of course) is rejected because its SETNX call fails.

Concrete approach:

Define a custom idempotency annotation and intercept the annotated method with AOP. Build a key from the intercepted request (method name + parameter names + parameter values) according to a fixed rule, then call Redis SETNX. If it returns OK, the method is invoked normally; otherwise the call is a duplicate. This guarantees that repeated requests to the same interface are processed successfully only once within a given window.

Implementing distributed locks with SETNX still involves subtleties around lock release and key expiry; for details, see my other article, "Distributed Locks Based on Redis".
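The annotation-plus-interceptor idea translates naturally into a Python decorator. This is a hedged sketch: an in-memory dict of key-to-expiry timestamps simulates SETNX + EXPIRE, and the key rule (function name + parameters) follows the text; a real implementation would call Redis `SET key value NX EX ttl` instead:

```python
import time
import functools

_locks = {}   # key -> expiry timestamp; stand-in for Redis

def setnx_ex(key, ttl):
    """Simulated SET key NX EX ttl: succeed only if not already held."""
    now = time.monotonic()
    if key in _locks and _locks[key] > now:
        return False               # lock still held: duplicate call
    _locks[key] = now + ttl
    return True

def idempotent(ttl=10):
    """Decorator standing in for the custom idempotency annotation + AOP."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Key built from method name + parameter values, per the text.
            key = f"{fn.__name__}:{args}:{sorted(kwargs.items())}"
            if not setnx_ex(key, ttl):
                return "rejected: duplicate request"
            return fn(*args, **kwargs)
        return wrapper
    return deco

@idempotent(ttl=10)
def pay(order_id):
    return f"paid order {order_id}"

first = pay(42)    # acquires the lock, processed
second = pay(42)   # same key within the window, rejected
```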

The methods above share a common idea: generate or assign a unique token of some kind (a literal token, or the #version of an optimistic lock) to a given business request; when the server handles the business interface, it validates that token and rejects the request if the rules are not met. Each method also has its limitations, so a production design usually combines two or more of them. Most importantly, how and to what degree they are combined must follow the actual business scenario; do not make the dogmatic mistake of combining them purely in pursuit of technical elegance.

5. Final Consistency

The EDA architecture achieves eventual consistency at the business layer by implementing the reliable-event pattern. What is the reliable-event pattern? It ensures that events can be successfully delivered, received, and processed; think of it as an enhanced version of a TCP connection. Reliability is guaranteed along the following three dimensions.

1. Delivery reliability

First, message buses in EDA architectures generally use message middleware as the delivery bridge, and mainstream open-source middleware (such as RabbitMQ/RocketMQ/Kafka) provides at-least-once delivery (that is, every message is delivered at least once). Simply put, the message sender (here, the "event bus") sends a message to the receiver (here, the "downstream") and listens for a response; if none arrives within the specified time, the sender retransmits at a certain frequency until it receives one.

Of course, for "upstream" delivery into the "event bus", reliable fault-tolerant handling is also needed at the upstream application level; interested readers can look at Kafka's ACK mechanism.
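The retransmit-until-acknowledged loop can be sketched in a few lines. The flaky receiver and its failure count are invented for illustration; note how the receiver observes the message three times, which is exactly why the idempotency section above matters for at-least-once delivery:

```python
def make_receiver(fail_times):
    """A receiver that misses the ack window `fail_times` times."""
    state = {"left": fail_times, "seen": []}
    def receive(msg):
        state["seen"].append(msg)      # duplicates are possible by design
        if state["left"] > 0:
            state["left"] -= 1
            return False               # no ack within the timeout
        return True                    # ack received
    return receive, state

def deliver(msg, receive, max_attempts=5):
    """At-least-once: resend until acknowledged (or give up)."""
    for attempt in range(1, max_attempts + 1):
        if receive(msg):
            return attempt             # how many sends it took
    raise RuntimeError("gave up: exceeded max attempts")

receive, state = make_receiver(fail_times=2)
attempts = deliver("event-1", receive)
```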

2. Transmission reliability

Because the architecture uses message middleware, most middleware today provides a message-persistence mechanism that keeps data safe until it is successfully confirmed by the downstream, even if the middleware itself crashes and restarts. Of course, this capability varies by middleware, and some products expose it for the client to control.

Of course, the middleware itself also has data fault-tolerance strategies. For example, Kafka guarantees that data is not lost through partition replication. Roughly, the producer first finds the leader (say, Partition0 on Broker1) and writes the message to the leader partition; the leader partition then replicates the message through internal channels to partitions on other brokers. This is what partition replication means; a schematic is attached to make it easier to understand.

3. Processing reliability

In my understanding, processing reliability here refers mainly to the message-routing logic at the application layer. That is, once an event has been processed by a node (worker), the routing table strictly specifies which node (worker) handles that event next. In my view, this reliability is relative to the usual style of direct interface calls or procedural code, and that is precisely where it comes from.

  • Routing rules: the routing rules are abstracted out and managed centrally as a core asset, because they define the entire business flow; they should be explicit, simple, and strict;

  • Exception-handling rules: if a node fails while processing, the process is terminated, and processing may resume only after manual intervention;

  • Monitoring rules: an event or business process must successfully run through every node required by the routing rules; otherwise monitoring raises an alert and triggers manual intervention.

6. Monitoring

Another distinguishing feature of the EDA architecture: because every node is decoupled, no node knows the current status of any incoming event, whether it has finished processing or has been thrown into the dead-letter topic. It is like workers on an assembly line: each one only completes their own step and puts the item back on the line.

Of course, we could define per-node exception-handling logic in each worker (set an error code when an exception occurs and raise the alert on the spot), but this approach has two drawbacks:

  • Business processing becomes coupled with alerting. It is even worse if alerts go out through an API call: if the alerting system has any problem while a large number of events are failing, it can drag the worker down within minutes, exhaust its thread resources, and leave the system hung;

  • There is no system-wide dashboard of error conditions.

Therefore, a dedicated monitor should be defined to watch for and alert on such exceptions. As in the diagram above, MonitorManager and the workers can generally cooperate in the following ways:

  • After a worker sets the error code, it also generates an alert event and pushes it to the alert topic;

  • MonitorManager listens on the alert topic and performs the corresponding alert handling when it finds an event. Because MonitorManager is generally responsible only for monitoring and alerting, and after the problem is fixed the MessageBroker still has to reroute the event back to the original worker for retry, the fanout mode is typically used.
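The cooperation described above can be sketched as follows. The topic names, the toy validation rule, and the alert format are all invented for illustration; the point is the separation of concerns: the worker only pushes a failure record to the alert topic, and the monitor alone turns it into alerts:

```python
from collections import defaultdict, deque

topics = defaultdict(deque)   # stand-in for middleware topics
alerts = []                   # stand-in for the alerting system

def worker(event):
    """Processes an event; on failure, publishes to the alert topic
    instead of calling the alerting system directly."""
    try:
        if event.get("amount", 0) < 0:
            raise ValueError("negative amount")
        return "ok"
    except ValueError as exc:
        # Error code + alert event pushed to the alert topic.
        topics["alert-topic"].append({"event": event, "error": str(exc)})
        return "failed"

def monitor_manager():
    """Drains the alert topic and raises the corresponding alerts."""
    while topics["alert-topic"]:
        bad = topics["alert-topic"].popleft()
        alerts.append(f"ALERT: {bad['error']} in event {bad['event']}")

ok_result = worker({"id": 1, "amount": 10})
bad_result = worker({"id": 2, "amount": -5})
monitor_manager()
```

Because the worker never calls the alerting system directly, a broken alerting backend cannot exhaust the worker's threads, which addresses the first drawback listed above.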

7. Closing remarks

That is about it for today. In a later post I may cover event sourcing, an architecture related to (or often associated with) event-driven architecture.

Origin blog.csdn.net/justyman/article/details/125576535