Distributed Queue Programming: Models and Practice

Introduction

As a basic abstract data structure, the queue is used widely across programming. The big-data era places higher demands on cross-process and cross-machine communication, and compared with the past, distributed queue programming is now almost ubiquitous. Yet this common, fundamental tool is often overlooked, and users tend to miss two things:

  • When using a distributed queue, they do not realize it is a queue.
  • When facing a specific need, they forget that distributed queues exist.

Starting from the most basic requirements, the first part of this article analyzes in detail where the distributed queue programming model comes from, how it is defined, how it is structured, and the forms it can take. Through this part, the author hopes to help readers in two ways: on the one hand, to provide a systematic way of thinking, so that readers can connect concrete requirements to the distributed queue programming model and become able to design a distributed queue architecture; on the other hand, through a comprehensive explanation, to help readers quickly recognize the various distributed queue programming models they encounter in their work.

The second part of the article covers practice. Drawing on the author's work experience at Xinmeida (Meituan-Dianping), it presents several concrete applications of queue programming in a distributed environment. The basic models behind these examples are not new to Internet literature, but every example is explained in three steps: challenge, conception, and architecture. Explaining them this way walks readers through the journey of building distributed queue programming from requirements.

Distributed Queue Programming Model

The model part starts from basic requirements and considers when and how to use the distributed queue programming model. Modeling is a crucial step: most mid-level and senior engineers face concrete requirements, and the first step after receiving a requirement is modeling. Through this part, I hope readers can build a bridge from requirements to the distributed queue programming model.

When to choose distributed queues

Communication is the most basic human need, and it is also the most basic need of computers. For engineers doing programming and technology selection, the concepts that come to mind first are RPC, RESTful, Ajax, and Kafka. Behind these concrete concepts, the essence is "communication". Therefore, most modeling and architecture work should start from the basic concept of communication. Once a communication requirement between systems has been identified, engineers must make many decisions and trade-offs, and these directly determine whether they choose the distributed queue programming model as the architecture. From this perspective, four factors affect the modeling: When, Who, Where, and How.

When: Synchronous vs Asynchronous

A fundamental question about communication is: when does an outgoing message need to be received? This question leads to two fundamental concepts: "synchronous communication" and "asynchronous communication". In the theoretical abstract model, the essential difference between the two lies in the presence or absence of a clock mechanism: both parties in synchronous communication need a calibrated clock, while the parties in asynchronous communication do not. In reality, there is no perfectly calibrated clock, so there is no absolutely synchronous communication. Likewise, absolutely asynchronous communication would mean having no control over when an outgoing message is received, and waiting for a message indefinitely is obviously meaningless. Therefore, no communication in actual programming is purely "synchronous" or purely "asynchronous"; put differently, it is always both. Especially at the application layer, the underlying architecture may contain both synchronous and asynchronous mechanisms. The criteria for judging whether a message is "synchronous" or "asynchronous" run too deep to expand on here, but here are some heuristic suggestions from the author:

  • Does the sent message need to be acknowledged? If not, it is more like asynchronous communication. This kind of communication is sometimes called one-way communication.
  • If acknowledgement is required, judge by how long the acknowledgement takes: a long wait is more like asynchronous communication, a short wait more like synchronous communication. Of course, "long" and "short" are purely subjective notions, not objective criteria.
  • Does sending the message block execution of the next instruction? If it blocks, it is more like synchronous; otherwise, more like asynchronous.
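The blocking heuristic can be illustrated with a small sketch; `concurrent.futures` here is only a stand-in for whatever communication machinery is actually in play:

```python
import concurrent.futures
import time

def handle(message):
    """Pretend to process a message on the remote side."""
    time.sleep(0.01)
    return f"ack:{message}"

with concurrent.futures.ThreadPoolExecutor() as pool:
    # Asynchronous style: submit() returns immediately with a Future;
    # the next instruction runs without waiting for the reply.
    future = pool.submit(handle, "msg-1")
    print("sent, not waiting yet")

    # Synchronous style: result() blocks the caller until the
    # acknowledgement arrives.
    print(future.result())
```

The same call site thus exhibits both flavors: `submit` alone is the asynchronous end of the spectrum, and `result` pulls it back toward the synchronous end.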

In any case, engineers cannot live in ambiguity, and making no decision is often the worst decision. When analyzing a communication requirement or implementing a communication architecture, engineers are forced to decide between "synchronous" and "asynchronous". When the decision is "asynchronous communication", the distributed queue programming model becomes a candidate.

Who: sender/receiver decoupling

Another basic question to answer when analyzing communication requirements is whether the sender of a message cares who receives it, and conversely, whether the receiver cares who sent it. If the engineer concludes that sender and receiver do not care who or where the other party is, the distributed queue programming model becomes a candidate, because in this scenario the decoupling brought by a distributed queue architecture offers the system these benefits:

  • Both the sender and the receiver only need to talk to the message middleware, through a unified interface. Unification means lower development costs.
  • Provided performance is not affected, the same message-middleware deployment can be shared by different businesses. Sharing means lower operational costs.
  • A unilateral change to the deployment topology of the sender or the receiver does not affect the other party. Decoupling means flexibility and extensibility.

Where: message staging mechanism

When designing the sending side of a communication, a question that troubles engineers is: what if messages cannot be processed quickly and pile up? Can they simply be discarded? If requirements analysis confirms that messages may back up and must not be discarded, the distributed queue programming model should be considered, because a queue can stage messages temporarily.
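As a minimal in-process sketch of message staging (Python's `queue.Queue` standing in for a distributed queue), a burst that outruns the receiver is buffered rather than discarded:

```python
import queue

buffer = queue.Queue()   # temporary storage while the receiver falls behind

# Sender side: a burst of messages arrives faster than they are processed.
for i in range(5):
    buffer.put(f"event-{i}")   # staged instead of discarded

# Receiver side: drain at its own pace; nothing was lost.
processed = []
while not buffer.empty():
    processed.append(buffer.get())
print(len(processed))   # 5
```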

How: how to deliver

Architecting for a communication requirement presents a series of fundamental challenges, including:

  • Availability: how to ensure the communication is highly available.
  • Reliability: how to ensure messages are delivered reliably.
  • Persistence: how to ensure messages are not lost.
  • Throughput and response time.
  • Cross-platform compatibility.

Unless engineers have enough interest in reinventing the wheel, and enough time, adopting a distributed queue programming model that already meets these metrics is the easy choice.

Distributed queue programming definition

It is difficult to give a precise definition of the distributed queue programming model. Since this article focuses on application, the author does not intend to follow any standard model strictly. In general, the distributed queue programming model contains three types of roles: the sender (Sender), the distributed queue (Queue), and the receiver (Receiver). Sender and receiver refer to the applications or services that produce and receive messages, respectively.
The concept worth emphasizing is the distributed queue: an application or service that (1) receives the message entities produced by the sender; (2) transmits and temporarily stores them; and (3) provides the receiver with a way to read them. In certain scenarios it can of course be message middleware such as Kafka or RabbitMQ, but its form is not limited to that. For example:

  • A queue can be a database table: senders write messages into the table and receivers read messages from it.
  • If one program writes data into an in-memory cache such as Redis and another program reads it out, the cache acts as a distributed queue.
  • The data stream transmitted in stream programming is also a queue.
  • In the classic MVC (Model-View-Controller) design pattern, if a change to the Model must cause a change to the View, the change can also be transmitted through a queue. The distributed queue here can be a database, or a piece of memory on some server.
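As an illustration of the first bullet, a database table can act as a rudimentary queue. This sketch uses SQLite, and the table and column names are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE msg_queue ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT, consumed INTEGER DEFAULT 0)"
)

def send(body):
    # Sender: append a row to the table.
    conn.execute("INSERT INTO msg_queue (body) VALUES (?)", (body,))
    conn.commit()

def receive():
    # Receiver: take the oldest unconsumed row and mark it consumed.
    row = conn.execute(
        "SELECT id, body FROM msg_queue WHERE consumed = 0 ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE msg_queue SET consumed = 1 WHERE id = ?", (row[0],))
    conn.commit()
    return row[1]

send("hello")
send("world")
print(receive())  # hello
print(receive())  # world
```

A real deployment would add polling or notification, visibility timeouts, and cleanup of consumed rows, but the sender/queue/receiver roles are already visible.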

Abstract models

The most basic abstract model of distributed queue programming is the point-to-point model; the other abstract architecture models are different topologies that arise from varying the number of roles and their interactions on top of the basic model. Specifically, different numbers of senders, distributed queues, and receivers combine into different distributed queue programming models. Remembering and understanding the typical abstract model structures is critical both for requirements analysis and modeling, and for learning open-source frameworks and reading other people's code.

Point-to-point model

In the basic model there is only one sender, one receiver, and one distributed queue, as shown below:

[Figure: Point-to-point model]
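The point-to-point topology can be sketched in-process, with Python's standard queue module standing in for the distributed queue (a minimal illustration, not a real distributed deployment):

```python
import queue
import threading

q = queue.Queue()          # the single queue between the two parties
received = []

def sender():
    for i in range(3):
        q.put(f"msg-{i}")
    q.put(None)            # sentinel: no more messages

def receiver():
    while True:
        m = q.get()
        if m is None:
            break
        received.append(m)

t1 = threading.Thread(target=sender)
t2 = threading.Thread(target=receiver)
t1.start(); t2.start()
t1.join(); t2.join()
print(received)            # ['msg-0', 'msg-1', 'msg-2']
```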

Producer-consumer model (Producer-consumer)

If both the sender and the receiver can have multiple deployed instances, possibly even of different types, but all share the same queue, this becomes the standard producer-consumer model. In this model the three roles are generally called the producer (Producer), the distributed queue (Distributed Queue), and the consumer (Consumer).

[Figure: Producer-consumer model]
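A minimal in-process sketch of the producer-consumer model, with several producers and consumers sharing one queue (`queue.Queue` again standing in for the distributed queue):

```python
import queue
import threading

q = queue.Queue()
results = []
lock = threading.Lock()

def producer(name, n):
    for i in range(n):
        q.put((name, i))

def consumer():
    while True:
        item = q.get()
        if item is None:       # sentinel: shut this consumer down
            break
        with lock:
            results.append(item)
        q.task_done()

producers = [threading.Thread(target=producer, args=(f"p{i}", 2)) for i in range(2)]
consumers = [threading.Thread(target=consumer) for _ in range(3)]
for t in producers + consumers:
    t.start()
for t in producers:
    t.join()
q.join()                  # wait until every produced item has been consumed
for _ in consumers:
    q.put(None)           # one sentinel per consumer
for t in consumers:
    t.join()
print(len(results))       # 4: two producers x two messages each
```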

Publish-Subscribe Model (PubSub)

If there is only one type of sender, and the sender distributes the message entities it produces into different logical queues according to topic, with each topic queue corresponding to one class of receivers, this becomes the typical publish-subscribe model. In this model the three roles are generally called the publisher (Publisher), the distributed queue (Distributed Queue), and the subscriber (Subscriber).

[Figure: Publish-subscribe model]
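A sketch of the publish-subscribe idea: one logical queue per topic, and each subscriber reads only its own topic. The topic names are illustrative:

```python
import queue
from collections import defaultdict

# One logical queue per topic; subscribers read only their topic's queue.
topics = defaultdict(queue.Queue)

def publish(topic, message):
    topics[topic].put(message)

def drain(topic):
    # A subscriber reading everything currently queued for its topic.
    out = []
    q = topics[topic]
    while not q.empty():
        out.append(q.get())
    return out

publish("clicks", {"ad": 1})
publish("views", {"ad": 2})
publish("clicks", {"ad": 3})

print(drain("clicks"))  # [{'ad': 1}, {'ad': 3}]
print(drain("views"))   # [{'ad': 2}]
```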

MVC model

If the sender and the receiver live inside the same entity but share a distributed queue, the structure looks much like the classic MVC model.

[Figure: MVC model]

Programming model comparisons

To give readers a better grasp of the distributed queue programming model, this section compares it with some easily confused concepts.

Distributed queue programming vs. asynchronous programming

The distributed queue programming model generally uses an asynchronous communication mechanism, but it is not equivalent to asynchronous programming.
First, not all asynchronous programming needs a queue; for example, most operating-system asynchronous I/O is implemented through hardware interrupts.
Second, asynchronous programming does not necessarily cross process boundaries, so its scenario is not necessarily a distributed environment.
Finally, the distributed queue programming model emphasizes an architecture of three roles: sender, receiver, and distributed queue. These roles have little to do with asynchronous programming as such.

Distributed queue programming vs. stream programming

With the wide adoption of streaming frameworks such as Spark Streaming and Apache Storm, stream programming has become a very popular programming mode, but the distributed queue programming model described in this article and stream programming are not the same concept.
First, the queue programming pattern in this article does not depend on any framework, whereas stream programming means programming inside a specific streaming framework.
Second, the distributed queue programming model is a solution to a requirement; it focuses on how to model distributed queue programming from actual needs. The data streams inside a streaming framework are generally transmitted through queues, but the focus of stream programming is narrower: how to obtain a message stream from the framework, apply transformations such as map, reduce, and join, generate new data streams, and finally aggregate and count them.


Distributed queue programming in practice

All the projects here are real cases from the author's work at Xinmeida (Meituan-Dianping). The focus of this practice part is to train modeling thinking, so each example is explained in three steps: challenge, conception, and architecture. Due to confidentiality requirements, some details are omitted, but these details do not affect the completeness of the explanation. Besides, sufficiently concrete requirements are easy to understand on their own, and to keep the explanation flowing, the author also uses some more accessible examples. Through this part, I hope to practice "how to build a distributed queue programming model from requirements" together with the readers.

It should be stated that the solutions here are not the optimal solutions for their scenarios. In fact, for any moderately complex problem there is no optimal solution, let alone a unique one. What engineers look for every day are feasible solutions that satisfy certain constraints. Different constraints lead to different solutions, and how loose the constraints are determines how wide the engineer's options are.

Information collection and processing

Information collection and processing is widely used, for example in advertising billing and user-behavior collection. The specific project the author encountered was to design a highly available collection and billing system for an advertising system.
The typical charging principle for CPC and CPM advertising is: collect users' click and view behavior on the client or on web pages, then charge according to those clicks and views. The charging service has the following typical characteristics:

  • Collection and processing are decoupled: collection happens on the client side, billing on the server side.
  • Billing is all about money.
  • Double billing means disaster.
  • Billing is a dynamic, real-time activity subject to budget constraints: if spending exceeds the budget, the advertising must stop.
  • User view and click volumes are very high.

Challenges

The typical characteristics of billing services bring us the following challenges:

  • High throughput - The number of views and clicks on advertisements is very large, and we need to design a high-throughput collection architecture.
  • High Availability - Loss of billing information means immediate monetary loss. A crash of any processing server should not render the system unusable.
  • High consistency requirements - Billing is a real-time, dynamic process subject to budget constraints. If the collected view and click behavior is not processed quickly, budgets may be overspent and CTR estimates may become inaccurate. Therefore, the collected information should reach the billing center in the shortest possible time.
  • Integrity constraints - These include anti-fraud rules, and the rule that a single user behavior must not be billed twice. This requires billing to be centralized rather than distributed.
  • Persistence requirements - The billing information needs to be persisted to avoid the loss of collected data due to machine crashes.

Conception

High availability of collection means we need multiple servers collecting at the same time, and to survive the failure of a single IDC, the collection servers must be deployed across multiple IDCs.
Building a highly available, high-throughput, highly consistent information delivery system is clearly a challenge, so to control development costs, using open-source message middleware for message transmission is the natural choice.
The integrity constraints require centralized billing, so billing takes place in the core IDC.
The billing service does not care where the collection points are, and the collection service does not care who does the billing.
Based on this reasoning, we conclude that collection and billing fit the typical "producer-consumer model".

Architecture

The architecture diagram of the collection and billing system is as follows:

  • The User Click/View Collector is deployed as the producer in multiple data centers to improve the availability of the collection service.
  • The data collected in each data center is sent through message-queue middleware to the core data center, IDC_Master.
  • The billing service is deployed as the consumer in the core data center for centralized billing.

[Figure: Billing collection architecture]

Using this architecture, we can further optimize the following aspects:

  • Improve scalability. If one Billing deployment instance cannot meet the performance requirements, the collected data can be billed by topic partition (Topic Partition), that is, adopt the publish-subscribe model to improve scalability.
  • Global deduplication and anti-fraud. The centralized billing structure solves the problem of double counting of clicks and views, and also provides the global information needed for anti-fraud.
  • Improve the availability of the billing system. By adopting the singleton-service optimization strategies described below, the availability of the billing system can be improved while keeping billing centralized.
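To make the deduplication and partitioning ideas concrete, here is a simplified sketch; the event shape, the hash-based partitioning rule, and the in-memory seen-set are illustrative assumptions, not the production design:

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4
seen = set()                    # global event-id set for deduplication
spend = defaultdict(float)      # per-advertiser spend

def partition_for(ad_id):
    # Same advertiser always lands in the same partition, so a single
    # Billing instance sees all of that advertiser's events.
    digest = hashlib.md5(str(ad_id).encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def bill(event):
    # "Double billing means disaster": drop events already seen.
    if event["id"] in seen:
        return False
    seen.add(event["id"])
    spend[event["ad_id"]] += event["cost"]
    return True

events = [
    {"id": "e1", "ad_id": 7, "cost": 0.5},
    {"id": "e1", "ad_id": 7, "cost": 0.5},   # duplicate delivery
    {"id": "e2", "ad_id": 7, "cost": 0.3},
]
# Route each event to its topic partition, then bill partition by partition.
partitions = defaultdict(list)
for e in events:
    partitions[partition_for(e["ad_id"])].append(e)
for part in partitions.values():
    for e in part:
        bill(e)
print(round(spend[7], 2))   # 0.8 -- the duplicate was ignored
```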

Distributed Cache Replacement

Caching is a very broad concept that exists at almost every level of the system. A typical cache access process is as follows:

  • After receiving a request, read the cache first; if it hits, return the result.
  • If the cache misses, read the DB or another persistence-layer service, update the cache, and return the result.

[Figure: Cache update]

For data already stored in the cache, when and how often to update it is a classic problem, namely the cache update mechanism (cache replacement algorithms). Typical mechanisms include Least Recently Used (LRU) and Least Frequently Used (LFU). A typical implementation of both starts a background process that periodically evicts data that has not been used recently, or that has been used least over a period of time. Because of this eviction mechanism, when a request misses the cache, the business layer needs to fetch the information from the persistence layer and update the cache, which improves consistency.
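The read-through flow with TTL-based eviction can be sketched as follows; the TTL value, key names, and dict-based "DB" are illustrative:

```python
import time

DB = {"user:1": "alice"}        # stand-in for the persistence layer
TTL = 0.2                       # illustrative time-to-live, in seconds
cache = {}                      # key -> (value, expiry_timestamp)

def get(key):
    entry = cache.get(key)
    now = time.time()
    if entry is not None and entry[1] > now:
        return entry[0]         # cache hit
    value = DB.get(key)         # miss: read the persistence layer
    cache[key] = (value, now + TTL)
    return value

print(get("user:1"))   # alice  (miss, loaded from the DB)
DB["user:1"] = "bob"
print(get("user:1"))   # alice  (stale until the TTL expires)
time.sleep(TTL * 2)
print(get("user:1"))   # bob
```

The second call shows exactly the consistency gap discussed below: until the TTL expires, the cache keeps serving the old value even though the persistence layer has changed.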

Challenges

Distributed caching brings new problems to the cache update mechanism:

  • Low data consistency. The huge number of keys in a distributed cache makes LRU or LFU update cycles very long. Take LRU in a distributed cache as an example: the typical practice is to set a time-to-live (TTL) for each key and evict the key from the cache when its TTL expires. Given the huge number of keys, the TTL is usually set fairly long, which means the cache and the persistence layer stay inconsistent for a long time. If the TTL is set too short, a large number of requests miss the cache and are forced to read the persistence layer, and system response time deteriorates sharply.
  • New data is unavailable. In many scenarios, because the access performance of the distributed cache and the persistence layer differ so much, some application-layer services do not even try to read the persistence layer on a cache miss and simply return an empty result. A long cache update cycle means the availability of new data is sacrificed; statistically, a new key has to wait half an update cycle before it becomes available.

Conception

Based on the analysis above, the problem a distributed cache needs to solve is: on the premise of guaranteed read performance, improve the consistency of old data and the availability of new data as much as possible. If we still assume that recently accessed keys are the most likely to be accessed again (the premise on which LRU and LFU rest), then triggering an asynchronous update after each key access is the earliest opportunity to improve availability and consistency. Both the high-performance requirement and business decoupling call for separating cache reads from cache updates, so we should build a separate, centralized cache update service. Another benefit of centralized cache updating is frequency control. Since, over a period of time, the number of accesses to keys of many types follows a Gaussian distribution, repeatedly updating the cache for the same key within a short time brings no obvious benefit and can even degrade cache performance. Controlling the update frequency of the same key greatly alleviates this problem and helps improve overall data consistency; see "deduplication optimization".

In summary, the business caller needs to transmit the requested keys quickly to the cache updater, and the two sides do not care about each other's business. To transmit a large volume of requested-key messages quickly and with high performance, high-performance distributed message middleware is a candidate. These three parties together form a typical distributed queue programming model.

Architecture

As shown in the figure below, all business callers act as producers and write the requested keys into a high-performance queue before returning from the business code. The Cache Updater, as the consumer, reads the requested keys from the queue and refreshes the cache with data from the persistence layer.

[Figure: Distributed cache update]

With this architecture, we can further optimize in the following areas:

  • Improve scalability. If one Cache Updater cannot meet the performance requirements, keys can be topic-partitioned (Topic Partition) for parallel cache updating, that is, adopt the publish-subscribe model to improve scalability.
  • Update frequency control. Cache updates are handled centrally; under the publish-subscribe model, keys of the same topic are handled centrally. The Cache Updater can control the update frequency of the same key over a short period (see deduplication optimization below).
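The update-frequency control on the consumer side can be sketched like this; MIN_INTERVAL and the in-memory bookkeeping are illustrative stand-ins for the real Cache Updater:

```python
import time

MIN_INTERVAL = 0.05     # illustrative: at most one refresh per key per window
last_update = {}        # key -> timestamp of the last cache refresh
updates_applied = []

def maybe_update(key):
    """Consumer side: refresh the cache only if this key has not been
    refreshed within MIN_INTERVAL (the deduplication optimization)."""
    now = time.time()
    if now - last_update.get(key, 0.0) < MIN_INTERVAL:
        return False    # skip: this key was refreshed too recently
    last_update[key] = now
    updates_applied.append(key)   # here we would read the DB and write the cache
    return True

# Three rapid requests for the same key collapse into one refresh.
for _ in range(3):
    maybe_update("user:1")
print(len(updates_applied))   # 1
```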

Background task processing

Typical background task processing applications include work order processing, train ticket booking systems, and airline seat selection. The problem we faced was creating work orders for operations staff: one request can create multiple work orders for multiple operators. This scenario is very similar to buying train tickets. Work orders are relatively more abstract, so the discussion below uses both scenarios, train ticket purchase and work order assignment for operators, side by side. A typical work order creation goes through two phases: a data screening phase and a work order creation phase. For example, in the train ticket booking scenario, in the screening phase the user selects trains of a particular time and type, and in the creation phase the user places the order and buys the tickets.

Challenges

Work order creation often faces the following challenges:

  • Data consistency. Taking train ticket booking as an example, there is usually some delay between the user's screening of tickets and the final purchase, which means the data is inconsistent between the two operations. In the screening phase, engineers must decide whether to lock the tickets: without locking, successful ticket issuance cannot be guaranteed; conversely, locking tickets at screening time greatly reduces system efficiency and ticket-issuing throughput.
  • Constraints. Work order creation must satisfy many constraints, mainly of two types. Dynamic constraints depend on the operator's behavior; for example, the decision of how many train tickets to buy is often made at the final stage of screening. Implicit constraints are hard to present through the interface; for example, if a user buys five train tickets, the seats should be adjacent and in the same carriage.
  • Optimization. Work order creation is usually optimization under constraints, a typical operations-research problem, and such global optimization often takes a fairly long time.
  • Response time. For multi-task work orders, one request means multiple tasks are created. Creating these tasks usually has to follow the transactional all-or-nothing principle. At the data level, this means the work orders must satisfy serializability. Serializing large volumes of data often means lock conflicts, delays, or even failures. Whether it is long latency caused by delay mechanisms or a high creation failure rate, the user experience is badly hurt.

Conception

If the user's final screening rules are stored as a message and sent to the work order creation system, the creation system will then hold all the global information needed to create the work orders, and can plan and optimize globally while satisfying the various constraints. If the work order creation phase is deployed as a single instance, data-locking problems are avoided: with no lock conflicts, there are no deadlocks and no task delays.
Based on this idea, in the model of the multi-work-order processing system, the rule creation system of the screening phase plays the producer, the work order creation system plays the consumer, and the screening rules are passed between them as messages. This is a typical distributed queue programming architecture. Depending on the volume of work order creation, either a database or open-source distributed message middleware can serve as the distributed queue.

Architecture

The architecture flow is as follows:

  • The user first creates the screening rules; this stage is mainly search and filter operations.
  • When the user clicks to create work orders, the TicketRule Generator assembles all the filter conditions into a rule message and sends it to the queue.
  • The Ticket Generator, as the consumer, reads work-order creation requests from the queue in real time and starts actually creating the work orders.

[Figure: Work order creation]

With this architecture, we get better results on the data locking, operations-research optimization, and atomicity problems:
  • Data locking is postponed to the work-order creation stage, which narrows the scope of locking and minimizes the impact of work-order creation on other online operations.
  • If global optimization is required, the Ticket Generator can be deployed in singleton mode (see singleton service optimization). The Ticket Generator can then read the work-order requests accumulated over a period of time and optimize globally. For example, in our project, operators had to be treated with hierarchical fairness under certain conditions: operators at the same level should receive a similar number of work orders, while operators at different levels should receive differentiated numbers. Without centralized optimization, such rules would be difficult to implement.
  • Constraint integrity is guaranteed. For example, in our scenario there is a limit on the number of work orders each operator can handle per day. With parallel processing, this integrity constraint would be hard to enforce.
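As a toy sketch of the centralized optimization, the batch assignment below spreads work orders evenly within each operator level while respecting a per-operator daily cap; the operator names, levels, quotas, and round-robin rule are simplified assumptions, not the project's actual algorithm:

```python
from collections import defaultdict
from itertools import cycle

# Operators grouped by level; names, levels, and quotas are illustrative.
operators = {"senior": ["A", "B"], "junior": ["C", "D"]}
QUOTA = {"senior": 3, "junior": 2}   # differentiated per-level daily caps

def assign(tickets):
    """Singleton consumer: read a batch of ticket requests and spread
    them evenly within each level (a simplified fairness rule)."""
    assigned = defaultdict(list)
    wheels = {lvl: cycle(ops) for lvl, ops in operators.items()}
    for ticket in tickets:
        lvl = ticket["level"]
        for _ in range(len(operators[lvl])):   # try each operator at most once
            op = next(wheels[lvl])
            if len(assigned[op]) < QUOTA[lvl]: # respect the daily cap
                assigned[op].append(ticket["id"])
                break
    return assigned

batch = [{"id": i, "level": "senior" if i % 2 else "junior"} for i in range(6)]
result = assign(batch)
print(sorted((op, len(ids)) for op, ids in result.items()))
```

Because the singleton consumer sees the whole batch at once, fairness is enforced over the batch as a whole, which a fleet of parallel consumers could not do without extra coordination.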
