Design of Distributed Rule Engine Framework

MirAIe Rules Engine is an extensible and scalable rules engine framework that allows users to group and automate multiple actions.

        While developing the MirAIe IoT platform over the past few years, we realized the need for an extensible and scalable rules engine framework. The rules engine enables you to group, manage and automate various actions, and can be used in a variety of applications such as home automation, fraud detection, risk management and workflow automation. At Panasonic, we are involved in several initiatives in the fields of mobility, Industry 4.0, building management and home automation, so the framework must be able to adapt to a wide range of applications. In this article, I describe the high-level design of our rule engine framework.

Rule Analysis

        Rules essentially allow users to group and automate a large number of tasks. Examples of user-created rules include: turning on the air conditioner in the bedroom whenever the room temperature rises above 27 degrees; turning on a set of lights in the office lobby at 6 PM every evening; or sending a push notification to an electric-car owner when the battery level drops below 30% and finding nearby charging stations with vacancies.


        Rules have trigger conditions that specify when the rule is activated. Triggers may depend on the location of the user, the condition of the sensors, a specific moment of time, external weather conditions, or any other factor. Multiple triggers can be coupled using one or more logical operators that together specify the triggering conditions for the rule.

        Rules also have one or more actions that activate when these trigger conditions are met. Actions can be as simple as turning on a light, or as complex as creating a report and sending it as an attachment to multiple users.
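
As a concrete illustration, the electric-car example above could be captured in a definition like the following. This is a minimal sketch; the schema, field names and operator spelling are assumptions for illustration, not the actual MirAIe format:

```python
# Hypothetical rule definition: trigger conditions combined with a logical
# operator, plus the actions to run when the combined condition is true.
# All field names here are illustrative.
ev_battery_rule = {
    "id": "rule-ev-battery",
    "triggers": {
        "operator": "AND",  # logical operator combining the conditions below
        "conditions": [
            {"type": "device_state", "device": "ev-battery",
             "property": "level_percent", "op": "<", "value": 30},
        ],
    },
    "actions": [
        {"type": "push_notification", "to": "owner", "body": "Battery below 30%"},
        {"type": "http_api", "url": "https://example.com/chargers/nearby"},
    ],
}
```

Note that a single rule can carry several actions: here, notifying the owner and calling an API to locate charging stations.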

Design Considerations

        Rule trigger conditions may depend on multiple input sources. Data sources can be internal, such as sensors installed in a laboratory that report their status every few seconds, or external, such as weather data, which needs to be retrieved and stored daily because it does not change often. The challenge is to combine multiple input sources into a single value that determines whether a rule should fire.

        Since our framework must support a wide range of applications, we decided to focus on two fundamental design principles. The first principle is that the architecture must be extensible to support various input triggers and output actions. For home automation use cases, the most common triggers may depend on time of day, device state, or external weather conditions, whereas for our smart factory use case, the trigger might be user activity or data aggregated over time, such as machine efficiency. The idea is to ensure that new trigger types can be added without making too many changes to the schema. Likewise, rule actions are extensible: common actions include calling an API, sending a notification, or creating a task and pushing it to a task queue.

        Our second guiding principle was to prioritize flexibility, allowing each component to scale independently. A flexible system can adapt to changing needs without scaling the entire system, increasing resiliency and cost efficiency. For example, if a large number of rules need to be triggered at 6 PM, the system only needs to scale out the timer trigger service and the rule execution service, while other services continue to operate at their original scale. This flexibility lets the system meet different needs efficiently and cost-effectively.

Rule Engine Components

        To achieve scalability, we decoupled the rule trigger services, the rule processing engine and the rule execution service. The rule trigger services are a collection of microservices, each handling a specific type of rule trigger logic. The rule processing engine combines the various trigger states based on the rule definition to determine whether the rule should fire. Finally, the rule execution service contains the application-specific logic that executes the actions specified in the rules. Each component is developed independently; they implement well-defined interfaces and can be scaled independently.


Rule Trigger Service

        The rule trigger service implements the logic that determines when a rule should be triggered. It is a collection of microservices, each capable of handling a very specific type of trigger. For example, the logic for point-in-time and duration-based triggers is handled by a timer trigger microservice, while separate services handle device state triggers and weather-based triggers.

        Depending on the trigger conditions in its definition, a rule registers itself with one or more rule trigger services when it is first created. Each trigger service provides three main APIs to register, update and unregister rules. The actual payload of a registered trigger may vary from service to service; however, the rule creation/update API is designed so that the rule management service can quickly identify the trigger type and delegate parsing and interpretation of the trigger conditions to the appropriate trigger service. Endpoints for individual trigger types can be shared as part of configuration or environment variables, or they can be discovered at runtime using standard service discovery patterns.
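
The three registration APIs can be sketched as a common contract that every trigger microservice implements. This is a minimal Python sketch under the assumption that each service exposes these operations over HTTP in practice; the class and method names are illustrative:

```python
from abc import ABC, abstractmethod

class TriggerService(ABC):
    """Common contract implemented by every trigger microservice (illustrative)."""

    @abstractmethod
    def register(self, rule_id: str, trigger_payload: dict) -> None:
        """Parse the trigger conditions and start tracking them for rule_id."""

    @abstractmethod
    def update(self, rule_id: str, trigger_payload: dict) -> None:
        """Replace the previously registered conditions for rule_id."""

    @abstractmethod
    def unregister(self, rule_id: str) -> None:
        """Stop tracking rule_id."""

class TimerTriggerService(TriggerService):
    """Toy implementation that simply remembers registered payloads."""

    def __init__(self):
        self.registered = {}

    def register(self, rule_id, trigger_payload):
        self.registered[rule_id] = trigger_payload

    def update(self, rule_id, trigger_payload):
        self.registered[rule_id] = trigger_payload

    def unregister(self, rule_id):
        self.registered.pop(rule_id, None)

svc = TimerTriggerService()
svc.register("rule-1", {"at": "18:00"})
```

The payload passed to `register` stays opaque to the rule management service; only the trigger service that owns the type interprets it.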

        The activation logic differs for each trigger type. A trigger service can capture data from one or more input sources, process it, cache it if necessary, and emit events when deciding whether a rule should fire. The rule trigger service emits a boolean true when a specific rule meets its trigger conditions, and false otherwise.

         A rule can include one or more triggers, each of which establishes a specific condition that must be met for the rule to execute. For example, consider a rule to turn on the living room lights at 6 PM every day or whenever the ambient light level falls below 100 lux. The rule uses OR logic to combine two conditions: the first is a time-based trigger and the second is a device (ALS sensor) state trigger. More complex rules can be created by combining multiple triggers and logical operators.

[Figure: Trigger state cache with per-rule trigger statuses and the pending rule queue]

        To manage the state of each trigger, a persistent cache is used, updated by the corresponding trigger service. This ensures that the latest trigger state is always available to the rule processing engine, allowing it to evaluate conditions and invoke the appropriate actions. In the figure above, a red trigger status indicates that the trigger condition is currently not met, and a green status indicates that it has been met. Once the trigger status of a rule changes, the corresponding trigger service adds the rule ID to a queue, from which it is consumed by the rule processing engine.
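
The interaction between a trigger service, the state cache and the pending-rule queue can be modeled in a few lines. This is a sketch with in-memory stand-ins for what would be a persistent cache (e.g. Redis) and a message queue in production; names are illustrative:

```python
from collections import deque

trigger_state_cache = {}   # (rule_id, trigger_id) -> bool; persistent cache in production
pending_rules = deque()    # queue consumed by the rule processing engine

def set_trigger_state(rule_id: str, trigger_id: str, met: bool) -> None:
    """Called by a trigger service whenever it re-evaluates a trigger condition.
    Only an actual state change enqueues the rule for processing."""
    key = (rule_id, trigger_id)
    if trigger_state_cache.get(key) != met:
        trigger_state_cache[key] = met
        pending_rules.append(rule_id)  # wake the processing engine for this rule

set_trigger_state("rule-42", "temp-above-27", True)
```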

        Each rule trigger service is designed to scale horizontally, based on the number of registered rules, independently of other system components. This decoupling also allows the activation logic for each trigger to evolve independently as the application evolves. Furthermore, new trigger types can be added to the system with minimal changes.

Rule Processing Engine

        The rule processing engine consumes rules from the pending rule queue and evaluates them according to their trigger states. If the firing logic is a combination of multiple rule triggers, the processing engine combines the states of each input trigger according to the firing logic specified in the rule definition to compute the final boolean value. Once it determines that the rule must be triggered, it calls the rule execution service to execute the rule.
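
The combination step can be sketched as a small recursive evaluator over the firing logic in the rule definition. The tree structure below is an assumed representation, not the actual MirAIe schema:

```python
def evaluate(firing_logic, states):
    """Reduce a tree of trigger references and logical operators to a boolean.

    firing_logic is either a trigger id (leaf) or a dict such as
    {"op": "OR", "args": [...]}; states maps trigger ids to cached booleans.
    """
    if isinstance(firing_logic, str):          # leaf: look up the cached state
        return states.get(firing_logic, False)
    results = [evaluate(arg, states) for arg in firing_logic["args"]]
    if firing_logic["op"] == "AND":
        return all(results)
    if firing_logic["op"] == "OR":
        return any(results)
    if firing_logic["op"] == "NOT":
        return not results[0]
    raise ValueError(f"unknown operator {firing_logic['op']}")

# "fire at 6 PM, or when it is dark and motion is detected"
logic = {"op": "OR", "args": ["six_pm", {"op": "AND", "args": ["dark", "motion"]}]}
should_fire = evaluate(logic, {"six_pm": False, "dark": True, "motion": True})
```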

        There are roughly two types of trigger states. Point-in-time triggers are only valid at the moment the trigger state changes, such as a rule activating at 6 PM or a device state change (such as turning off the fan when the air conditioner is turned on). Such a rule should activate immediately after the event, provided all other trigger conditions are also met. The rule processing engine resets the value of such triggers immediately after processing the rule.

        The second type of trigger represents the persistent state of an entity over a longer period of time. For example, consider a scenario where the porch light should turn on if motion is detected between 6 PM and 6 AM. The timer trigger service sets the trigger value to true at 6 PM and back to false at 6 AM. These states are not reset by the rule processing engine and remain unchanged until explicitly modified by the trigger service. This enables the system to maintain the persistent state of entities and make decisions based on it.
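
The distinction between the two trigger kinds shows up when the processing engine finishes with a rule: momentary (point-in-time) triggers are reset, persistent ones are left for their trigger service to change. A minimal sketch, with illustrative field names:

```python
# Trigger states for the porch-light example: a momentary motion event
# combined with a persistent "night window" state (true from 6 PM to 6 AM).
triggers = {
    "motion_detected": {"kind": "momentary", "state": True},
    "night_window":    {"kind": "persistent", "state": True},
}

def after_processing(triggers: dict) -> None:
    """Reset momentary triggers once the rule has been evaluated; persistent
    triggers stay untouched until their trigger service modifies them."""
    for trig in triggers.values():
        if trig["kind"] == "momentary":
            trig["state"] = False

after_processing(triggers)
```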


Rule Execution Service

        The Rule Execution Service can invoke HTTP APIs, send MQTT messages, or trigger push notifications to execute rules. The list of actions a rule can perform is application-specific and extensible. Like the rule trigger services, the rule execution services are decoupled from the core rule engine and can be extended independently.


        One way to decouple the rule processing engine from the individual rule execution services is to use a message queue, such as Kafka. Depending on the action type, rule actions can be published to individual Kafka topics, which are consumed by groups of consumers that perform the related actions. Rule action payloads can be specific to the action type; they are captured as part of the rule definition and passed as-is to the task queue.
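
The per-action-type routing can be sketched as follows. A plain dict stands in for Kafka, and the topic naming convention is an assumption made for the example:

```python
from collections import defaultdict

topics = defaultdict(list)  # stand-in for Kafka: topic name -> published messages

def publish_actions(rule: dict) -> None:
    """Route each action payload, as-is, to a topic derived from its type.
    The "rule-actions.<type>" naming scheme is illustrative."""
    for action in rule["actions"]:
        topic = f"rule-actions.{action['type']}"
        topics[topic].append(action)

publish_actions({
    "id": "rule-007",
    "actions": [
        {"type": "push_notification", "to": "user-123", "body": "Battery low"},
        {"type": "http_api", "url": "https://example.com/chargers/nearby"},
    ],
})
```

Each consumer group then subscribes only to the topic for the action type it knows how to execute.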

Scaling the Trigger Services

        Rule trigger services can be stateful, so scaling them can be a challenge. There is no general way to scale all trigger services, as their underlying implementations vary depending on the trigger type and the external services they may depend on. In this section, I explain the scaling methods used for two important trigger types.

Device State Trigger

        To register a rule with the device state trigger service, the rule management service provides the device identifier, the device properties and their corresponding thresholds. The device state trigger service stores these in a shared cache (such as Redis) and ensures they are accessible using only the device ID.

         In the given example, notifications about device state changes are sent over the MQTT protocol and then added to a Kafka message queue. A Kafka consumer responsible for device state receives and processes each incoming event. It checks the rule trigger cache to see whether any rules are associated with the device and, based on this information, updates the corresponding trigger state cache to reflect the current state of the device's trigger. This mechanism keeps the trigger state cache in sync with the latest device state changes, allowing the system to evaluate rules accurately against the latest information.
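
The registration lookup and the event-handling path can be sketched end to end with plain dicts standing in for Redis and for the Kafka consumer callback; all names and the comparison scheme are illustrative:

```python
# device_id -> list of (rule_id, property, op, threshold); Redis in production.
rule_trigger_cache = {
    "bedroom-temp-sensor": [("rule-42", "temperature", ">", 27)],
}
trigger_state_cache = {}  # (rule_id, device_id) -> bool

OPS = {">": lambda a, b: a > b, "<": lambda a, b: a < b, "==": lambda a, b: a == b}

def on_device_event(device_id: str, properties: dict) -> None:
    """Kafka-consumer callback: match the incoming device state change against
    the rules registered for this device and refresh the trigger state cache."""
    for rule_id, prop, op, threshold in rule_trigger_cache.get(device_id, []):
        if prop in properties:
            trigger_state_cache[(rule_id, device_id)] = OPS[op](properties[prop], threshold)

on_device_event("bedroom-temp-sensor", {"temperature": 29})
```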

        All of our services are containerized and run in Kubernetes clusters. The device state trigger service is a standard API service that is scaled through application load balancing and auto-scaling groups. The device state consumer group scales based on the rate of incoming device state change events; Kubernetes Event-Driven Autoscaling (KEDA) can drive scaling of the consumers based on the number of events waiting to be processed. Additionally, there are tools that predict Kafka workloads, which can be used to scale consumers earlier, improving performance.

Timer Trigger

        The timer trigger service handles point-in-time triggers and duration triggers. A trigger request payload can be as simple as a specific time of day, or as detailed as the specification of a Unix cron job. The service doesn't need to keep all registered rule requests in memory, because a rule might not fire for days or months. Instead, once it receives a rule registration request, it calculates when the rule should next run and stores the rule details in the database along with the next run time.
        At regular intervals, the service fetches into memory all the rules that need to be activated in the next time window. This can be done by filtering on the rule's next-run-time field. Once it has identified all the rules that need to run, the service sorts them by firing time and spawns one or more Kubernetes Pods to process them, with each Pod assigned only a subset of the rules. The assignment of rules to Pods can be shared through ZooKeeper or Kubernetes Custom Resource Definitions (CRDs).
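
The next-run computation and the windowed fetch can be sketched for the simplest payload, a daily time of day. This is an illustrative sketch; the window size and field names are assumptions:

```python
from datetime import datetime, timedelta

def next_run(now: datetime, hour: int, minute: int) -> datetime:
    """Next occurrence of a daily time-of-day trigger strictly after `now`."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate

def due_in_window(rules, now, window=timedelta(minutes=15)):
    """Rules whose stored next_run falls inside the upcoming window, sorted by
    firing time (mirrors the periodic filter on the next-run-time field)."""
    due = [r for r in rules if now <= r["next_run"] < now + window]
    return sorted(due, key=lambda r: r["next_run"])

now = datetime(2023, 8, 1, 17, 50)
rules = [
    {"id": "r1", "next_run": next_run(now, 18, 0)},  # 18:00 today
    {"id": "r2", "next_run": next_run(now, 6, 0)},   # 06:00 tomorrow
]
batch = due_in_window(rules, now)
```

Only `r1` lands in the 15-minute window; `r2` stays in the database until its window approaches.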


        Kubernetes CRDs can be used to share data and distribute work among multiple Pods by defining custom resources that represent specific tasks. The timer trigger service uses this capability by dividing all the rules that need to be processed in the next time window into separate tasks and storing them in the CRD. Multiple worker Pods are then created and assigned specific tasks. Each Pod processes its rules and updates the trigger state cache accordingly.

Maintainability and Scalability

        The rule management service and the rule execution service are stateless services, and their logic is fairly simple. Rule management provides a standard API for rule creation, update and deletion. The rule execution service works independently to execute rule actions, primarily invoking application-specific actions.

        Communication between the rule management service and the trigger services can happen asynchronously, eliminating the need for service discovery. For example, each trigger service could have its own dedicated topic in the Kafka message broker. The rule management service adds the trigger request to the corresponding topic according to the trigger type, where it is consumed and processed by that trigger type's Kafka consumer, i.e. the trigger service itself.

        Adding support for new trigger types and new action types is simple because all key components are decoupled from each other. The rule management service has a plugin-based design: for each supported trigger type and action type, a plugin is added to validate the corresponding trigger and action payloads in the rule definition. Rule trigger and action type names can be mapped to corresponding Kafka topic names for communication between the rule management service and the trigger services, and between the rule processing engine and the execution services.
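
The plugin registry and the type-name-to-topic mapping can be sketched as follows; the decorator pattern, validator logic and topic naming convention are illustrative assumptions:

```python
# Registry of payload validators, one plugin per supported trigger type.
trigger_plugins = {}

def trigger_plugin(type_name):
    """Decorator that registers a payload validator for a trigger type."""
    def wrap(fn):
        trigger_plugins[type_name] = fn
        return fn
    return wrap

@trigger_plugin("timer")
def validate_timer(payload: dict) -> bool:
    # A timer payload must carry either a time of day or a cron spec.
    return "at" in payload or "cron" in payload

def topic_for(type_name: str) -> str:
    """Derive the Kafka topic name from the trigger/action type name."""
    return f"triggers.{type_name}"

ok = trigger_plugins["timer"]({"at": "18:00"})
```

A new trigger type then only needs a validator plugin and a consumer on its derived topic; the core engine stays unchanged.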

The system can be tested at multiple levels. It is relatively easy to cover individual services with unit tests focused on specific functionality. To facilitate debugging and troubleshooting in a distributed environment, best practices for distributed logging and tracing must be followed. Properly implemented logging and tracing allow us to trace the flow of requests between services, identify issues and diagnose problems effectively. Following these practices ensures a better understanding of system behavior and simplifies debugging.

Reliability

        Let's first identify potential challenges that may arise. It is important to acknowledge that any service within the system may experience unplanned downtime, the system load may grow faster than its ability to scale effectively, and some services or infrastructure components may experience temporary unavailability. 

        Using Kafka for communication helps achieve multiple levels of reliability. Kafka provides features that facilitate message delivery and consumer reliability, including message persistence, strong durability guarantees, fault-tolerant replication, load distribution among consumer groups, and at-least-once delivery semantics.

        The most immediate reliability concern involves the trigger services. For device state triggers, Kafka is configured to guarantee at-least-once delivery so that state change events are not lost. Achieving reliability for the timer trigger service, however, requires additional steps. Here, it is important not to overwhelm any single worker with the large number of events that need to be handled simultaneously.

        Our approach is to order the list of rules to be processed in the next time window chronologically and distribute them among the worker Pods in a round-robin fashion. Additionally, the number of worker Pods is proportional to the number of timer tasks within the time window and the maximum number of tasks a worker should execute at any given time. This ensures there are enough workers to handle potentially large numbers of concurrent timer tasks. It is also beneficial to configure worker Pods to restart automatically when they crash, allowing them to recover and complete their assigned tasks without manual intervention.
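
The sizing and round-robin assignment can be sketched in a few lines; the per-worker task cap is an assumed tuning parameter:

```python
import math

def distribute(rule_ids, max_tasks_per_worker=50):
    """Size the worker pool from the task count, then assign the
    chronologically ordered rules round-robin across workers."""
    workers = max(1, math.ceil(len(rule_ids) / max_tasks_per_worker))
    assignments = [[] for _ in range(workers)]
    for i, rule_id in enumerate(rule_ids):  # rule_ids already sorted by fire time
        assignments[i % workers].append(rule_id)
    return assignments

# 120 due rules with a cap of 50 tasks per worker -> 3 worker Pods.
assignments = distribute([f"r{i}" for i in range(120)], max_tasks_per_worker=50)
```

Round-robin over a chronologically sorted list spreads the rules that fire at the same instant across different workers, so no single Pod absorbs a 6 PM spike alone.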

        Additionally, Kubernetes lets us define resource limits and minimum requirements for each service, including the maximum amount of CPU or RAM a service can utilize and the minimum resources required to start successfully. With Kubernetes, "noisy neighbor" issues (where the resource-intensive behavior of one Pod affects other Pods on the same cluster node) can be alleviated. Kubernetes provides isolation and resource management capabilities that help maintain overall system stability and reliability.

Summary

        MirAIe Rules Engine is an extensible and scalable rules engine framework that allows users to group and automate multiple actions. The framework supports various internal and external triggers and focuses on two design principles: extensibility and flexibility. The architecture had to be extensible to support a wide variety of input triggers and output actions, allowing new types to be added without too many changes. The system also prioritizes flexibility, enabling independent scaling of each component and increasing resiliency and cost efficiency.


Origin blog.csdn.net/qq_28245905/article/details/132140742