RocketMQ Streams: Integrating a lightweight real-time computing engine into a messaging system

Author | Yuan Xiaodong, Cheng Junjie

With the spread of mobile internet and cloud computing across all walks of life, big data computing has become ubiquitous, the best-known engines being Flink, Spark, and the like. These big data frameworks adopt a centralized Master-Slave architecture with heavy dependencies and deployments, and each task carries a large overhead and a high usage cost. RocketMQ Streams focuses on building a lightweight computing engine: apart from the message queue, it has no additional dependencies. It is heavily optimized for filtering scenarios, improving performance by 3-5x and saving 50%-80% of resources.

RocketMQ Streams is suitable for scenarios of large data volume -> high filtering -> light window computing. Its core strengths are light resource usage and high performance, which give it a great advantage in resource-sensitive scenarios; it can be deployed with as little as 1 core and 1 GB of memory. Recommended application scenarios: security, risk control, edge computing, and message queue stream computing.

RocketMQ Streams is compatible with Blink (Alibaba's internal version of Flink) SQL and UDF/UDTF/UDAF, so most Blink tasks can be migrated directly to RocketMQ Streams tasks. In the future, a version integrated with Flink will be released: RocketMQ Streams jobs will be publishable directly as Flink tasks, combining the high performance and light resource usage of RocketMQ Streams with unified operation and management alongside existing Flink tasks.

What is RocketMQ Streams?

This chapter provides an overall introduction to RocketMQ Streams from three aspects: basic introduction, design ideas and characteristics.

1. Introduction to RocketMQ Streams

1) It is a lib package that can be started and run directly and integrated straight into business code;

2) It has SQL engine capabilities, is compatible with Blink SQL syntax, and is compatible with Blink UDF/UDTF/UDAF;

3) It contains an ETL engine that can implement data ETL without coding;

4) It is an SDK for data development, whose many practical components can be used directly, such as Source, Sink, Script, Filter, Lease, Scheduler, and Configurable, and it is not limited to stream scenarios.

2. Features of RocketMQ Streams

Based on the implementation ideas above, RocketMQ Streams has the following features:

  • Lightweight

It can be deployed with 1 core and 1 GB of memory, and its dependencies are light. In test scenarios, you can simply write a main method in the jar and run it. In production, it depends at most on a message queue and storage (storage is optional, and is mainly used for fault tolerance during shard switching).

  • High performance

Implements a high-filtering optimizer, including pre-fingerprint filtering, automatic merging of homologous rules, Hyperscan acceleration, and expression fingerprinting, which improves performance by 3-5x and saves more than 50% of resources compared with the unoptimized version.

  • Dimension table JOIN (supports dimension tables with tens of millions of rows)

Stores data in highly compressed memory with no Java object-header or alignment overhead, so the footprint stays close to the raw data size; operations are purely in memory to maximize performance; and MySQL dimension tables can be loaded with multiple concurrent threads to speed up loading. A conceptual sketch of the header-free storage idea follows.
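
To make the idea concrete, here is a minimal, hypothetical Java sketch (not the actual RocketMQ Streams implementation): rows are serialized into a single byte array and addressed by offset, so there is no per-object header or alignment padding.

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of header-free row storage: all rows live in one
// byte buffer and only an int offset per row is kept, so there is no
// per-object Java header or alignment overhead.
public class CompactRowStore {
    private byte[] data = new byte[1024];
    private int size = 0;                                      // bytes used
    private final List<Integer> offsets = new ArrayList<>();  // row start offsets

    // Append a row; returns its index.
    public int add(String row) {
        byte[] bytes = row.getBytes(StandardCharsets.UTF_8);
        ensureCapacity(size + 4 + bytes.length);
        offsets.add(size);
        // 4-byte length prefix followed by the raw UTF-8 payload
        data[size++] = (byte) (bytes.length >>> 24);
        data[size++] = (byte) (bytes.length >>> 16);
        data[size++] = (byte) (bytes.length >>> 8);
        data[size++] = (byte) bytes.length;
        System.arraycopy(bytes, 0, data, size, bytes.length);
        size += bytes.length;
        return offsets.size() - 1;
    }

    // Read a row back by index.
    public String get(int index) {
        int off = offsets.get(index);
        int len = ((data[off] & 0xff) << 24) | ((data[off + 1] & 0xff) << 16)
                | ((data[off + 2] & 0xff) << 8) | (data[off + 3] & 0xff);
        return new String(data, off + 4, len, StandardCharsets.UTF_8);
    }

    private void ensureCapacity(int needed) {
        if (needed > data.length) {
            byte[] bigger = new byte[Math.max(needed, data.length * 2)];
            System.arraycopy(data, 0, bigger, 0, size);
            data = bigger;
        }
    }
}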

  • High extensibility

1) Sources can be extended on demand; RocketMQ, File, and Kafka are already implemented;

2) Sinks can be extended on demand; RocketMQ, File, Kafka, MySQL, and ES are already implemented;

3) UDF/UDTF/UDAF can be extended according to the Blink specification;

4) A lighter UDF/UDTF extension capability is provided, enabling function extension without any dependencies.

  • Provides rich big data capabilities

Including exactly-once computation over flexible windows, dual-stream join, statistics, windowing, and various transformations and filters, meeting the needs of all kinds of big data development scenarios, with support for elastic fault tolerance.

Using RocketMQ Streams

RocketMQ Streams provides two SDKs: a DSL SDK and a SQL SDK, which users can choose between as needed. The DSL SDK supports a DSL with real-time semantics; the SQL SDK is compatible with Blink (Alibaba's internal version of Flink) SQL syntax, so most Blink SQL can run via RocketMQ Streams. The two SDKs are introduced in detail below.

1. Environmental requirements

1) JDK 1.8 or above; 2) Maven 3.2 or above.

2. DSL SDK

When using the DSL SDK to develop real-time tasks, you need to do the following preparations:

Dependency preparation

<dependency>
    <groupId>org.apache.rocketmq</groupId>
    <artifactId>rocketmq-streams-clients</artifactId>
    <version>1.0.0-SNAPSHOT</version>
</dependency>

After the preparation work is completed, you can directly develop your own real-time program.

Code development

DataStreamSource source = StreamBuilder.dataStream("namespace", "pipeline");

source.fromFile("~/admin/data/text.txt", false)
    .map(message -> message + "--")
    .toPrint(1)
    .start();

Where:

1) namespace is for business isolation; tasks of the same business can use the same namespace. Tasks in the same namespace can be scheduled to run in the same process and can share some configuration;

2) pipelineName can be understood as the job name; it only distinguishes jobs;

3) DataStreamSource is mainly used to create the source. When the program runs, "--" is appended to each original message and the result is printed.

  • Rich operators

RocketMQ Streams provides rich operators, including:

1) Source operators: fromFile, fromRocketMQ, fromKafka, and the from operator for custom sources;

2) Sink operators: toFile, toRocketMQ, toKafka, toDB, toPrint, toES, and the to operator for custom sinks;

3) Action operators: Filter, Expression, Script, selectFields, Union, forEach, Split, Select, Join, Window, and other operators.
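
As a sketch of how these operators compose, the following pipeline chains a source, a filter, a map, and a sink. It is based on the operator names above and the earlier quickstart; the import paths and exact operator signatures are assumptions and may differ between versions.

import org.apache.rocketmq.streams.client.StreamBuilder;
import org.apache.rocketmq.streams.client.source.DataStreamSource;

// A sketch composing several of the operators listed above.
public class OperatorDemo {
    public static void main(String[] args) {
        DataStreamSource source = StreamBuilder.dataStream("demo_namespace", "demo_pipeline");

        source.fromFile("~/admin/data/text.txt", false)  // source operator
            .filter(message -> message != null)          // action operator: drop empty records
            .map(message -> message + "--")              // action operator: transform each record
            .toPrint(1)                                  // sink operator
            .start();                                    // launch the job
    }
}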

  • Deployment execution

After development with the DSL SDK is complete, build a jar with the following commands, then execute the jar or run the task's main method directly.

mvn -Prelease-all -DskipTests clean install -U
java -jar jarName mainClass &

3. SQL SDK

Dependency preparation

<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>rsqldb-clients</artifactId>
    <version>1.0.0-SNAPSHOT</version>
</dependency>

Code development

First develop the business logic code, which can be saved as a file or used directly as text:

CREATE FUNCTION json_concat as 'xxx.xxx.JsonConcat';

CREATE TABLE `table_name` (
    `scan_time` VARCHAR,
    `file_name` VARCHAR,
    `cmdline` VARCHAR
) WITH (
     type='file',
     filePath='/tmp/file.txt',
     isJsonData='true',
     msgIsJsonArray='false'
);
-- data normalization
create view data_filter as
select
     *
from (
    select
        scan_time as logtime
        , lower(cmdline) as lower_cmdline
        , file_name as proc_name
    from
        table_name
)x
where
    (
        lower(proc_name) like '%.xxxxxx'
        or lower_cmdline  like 'xxxxx%'
        or lower_cmdline like 'xxxxxxx%'
        or lower_cmdline like 'xxxx'
        or lower_cmdline like 'xxxxxx'
    )
;

CREATE TABLE `output` (
     `logtime` VARCHAR
    , `lower_cmdline` VARCHAR
    , `proc_name` VARCHAR
) WITH (
    type = 'print'
);

insert into output
select
    *
from
    data_filter
;

Where:

1) CREATE FUNCTION: introduces external functions to support the business logic, including Flink and system functions;

2) CREATE TABLE: creates the source/sink;

3) CREATE VIEW: performs field transformation, splitting, and filtering;

4) INSERT INTO: writes data to the sink;

5) Functions: built-in functions and UDFs.

  • SQL extension

RocketMQ Streams supports three ways to extend SQL, as follows:

1) Extend SQL capabilities through Blink UDF/UDTF/UDAF;

2) Extend SQL capabilities through RocketMQ Streams: just implement a Java bean whose function method is named eval;

3) Extend SQL capabilities through existing Java code: the function name in CREATE FUNCTION is the method name of the Java class.
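
As an illustration of option 2), here is a minimal sketch of such a lightweight UDF: a plain Java bean with an eval method and no framework dependencies. The package and class names are made up for illustration.

// A made-up lightweight UDF: a plain Java bean whose function method is
// named eval, with no framework dependencies.
package com.example.udf;

public class JsonConcat {

    // The method named eval is what the SQL engine invokes; registering it
    // via CREATE FUNCTION json_concat as 'com.example.udf.JsonConcat' would
    // then make json_concat(key, value) callable from SQL.
    public String eval(String key, String value) {
        if (key == null || value == null) {
            return null;
        }
        return "\"" + key + "\":\"" + value + "\"";
    }
}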

  • SQL execution

You can download the latest RocketMQ Streams code and build it:

cd rsqldb/
mvn -Prelease-all -DskipTests clean install -U
cp rsqldb-runner/target/rocketmq-streams-sql-{version}-distribution.tar.gz <deploy-directory>

Unpack the tar.gz package and enter the directory:

tar -xvf rocketmq-streams-{version}-distribution.tar.gz
cd rocketmq-streams-{version}

Its directory structure is as follows:

1) bin: command directory, including the start and stop commands;

2) conf: configuration directory, including log configuration and application-related configuration files;

3) jobs: stores SQL, which can be organized in up to two levels of directories;

4) ext: stores extended UDF/UDTF/UDAF/Source/Sink;

5) lib: dependency package directory;

6) log: log directory.

  • Execute SQL

# specify the path of the sql file and start the real-time task
bin/start-sql.sh sql_file_path
  • Execute multiple SQL

To execute a batch of SQL files, place them under the jobs directory, which may be up to two levels deep; put each SQL file in the appropriate directory, then pass a subdirectory or a SQL file to the start command to run the tasks.

  • Task stop

# stopping without any arguments stops all currently running tasks
bin/stop.sh


# stopping with a task name stops all running tasks with that name
bin/stop.sh sqlname
  • Log view

All runtime logs are currently stored in the log/catalina.out file.

Architecture design and principle analysis

1. RocketMQ Streams design ideas

Having covered the basic introduction to RocketMQ Streams, let's look at its design ideas, presented from two aspects: design goals and strategies.

Design goals

1) Few dependencies and simple deployment: a single instance can run with 1 core and 1 GB of memory, and scale can be expanded at will;

2) Create scenario advantages by focusing on scenarios of large data volume -> high filtering -> light window computing, with full functional coverage of the required big data features: exactly-once and flexible windows (tumbling, sliding, and session windows);

3) While keeping resource usage low, achieve performance breakthroughs in high-filtering scenarios to build a performance advantage;

4) Be compatible with Blink SQL and UDF/UDTF/UDAF, making it easier for non-technical personnel to get started.

Strategies (target scenario: large data volume -> high filtering/ETL -> light window computing)

1) Adopt a shared-nothing distributed architecture, relying on the message queue for load balancing and fault tolerance; a single instance can start, capacity is expanded by adding instances, and concurrency depends on the number of shards;

2) Use message queue shards for shuffle, and use message queue load balancing for fault tolerance;

3) Use storage for state backup to achieve exactly-once semantics; use structured remote storage for fast startup, with no need to wait for local storage recovery;

4) Focus on building the filter optimizer, improving filtering performance through pre-fingerprint filtering, automatic merging of homologous rules, Hyperscan acceleration, and expression fingerprinting.
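
The following is a conceptual Java sketch of expression fingerprinting (not the engine's actual code): the result of an expensive predicate, such as a regex, is cached per distinct field value, so repeated values skip re-evaluation entirely. A production version would bound the cache size; the sketch only shows why high-filtering workloads with repetitive values benefit.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Conceptual sketch: cache the boolean result of an expensive predicate,
// keyed by the distinct field value ("fingerprint").
public class FingerprintFilter {
    private final Predicate<String> expensivePredicate;
    private final Map<String, Boolean> fingerprintCache = new ConcurrentHashMap<>();

    public FingerprintFilter(Pattern pattern) {
        this.expensivePredicate = s -> pattern.matcher(s).find();
    }

    public boolean matches(String fieldValue) {
        // computeIfAbsent evaluates the predicate only once per distinct value
        return fingerprintCache.computeIfAbsent(fieldValue, expensivePredicate::test);
    }
}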

2. Implementation of RocketMQ Streams Source

1) Source provides at-least-once consumption semantics, implemented via checkpoint system messages: before the offset is committed, a checkpoint message is sent to notify all operators to flush their in-memory state;

2) Source supports automatic shard load balancing and fault tolerance:

When a shard is removed, the data source sends a shard-removal system message so that operators can complete shard cleanup;

When a new shard appears, it sends a new-shard message so that operators can complete shard initialization.

3) The data source starts a consumer through the start method to fetch messages;

4) The raw message is encoded, wrapped together with additional header information into a Message, and delivered to subsequent operators.

3. Implementation of RocketMQ Streams Sink

1) Sink is designed to balance real-time delivery and throughput;

2) To implement a sink, just inherit the AbstractSink class and implement the batchInsert method. batchInsert means writing a batch of data to storage; subclasses implement it by calling the storage interface, using the storage's batch interface where possible to improve throughput;

3) The conventional usage is to write message -> cache -> flush -> storage. The system strictly ensures that the amount written to storage in each batch does not exceed batchSize; if it does, the write is split into multiple batches;

4) Sink has a cache; data is written to the cache by default and flushed to storage in batches to improve throughput (one cache per shard);

5) Automatic flushing can be enabled: each shard then has a thread that periodically flushes cached data to storage to improve real-time delivery. Implementation class: DataSourceAutoFlushTask;

6) The cache can also be flushed to storage by calling the flush method;

7) Sink's cache has memory protection: when the number of cached messages exceeds batchSize, a flush is forced to release memory.
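
Based on the contract described in 2), a custom sink might look like the sketch below. The names AbstractSink and batchInsert come from the description above; the import paths and the exact signature of batchInsert are assumptions and may differ between versions, and the JDBC target is illustrative only.

import java.util.List;

import org.apache.rocketmq.streams.common.channel.sink.AbstractSink;
import org.apache.rocketmq.streams.common.context.IMessage;

// Sketch of a custom sink: subclass AbstractSink and implement batchInsert,
// which receives one batch of messages and writes it to storage.
public class DemoJdbcSink extends AbstractSink {

    @Override
    protected boolean batchInsert(List<IMessage> messages) {
        // Prefer the storage's batch interface to improve throughput, e.g.
        // accumulate one INSERT per message and execute them as a single
        // JDBC batch (omitted here).
        for (IMessage message : messages) {
            // build one INSERT statement from message (omitted)
        }
        return true; // return false to signal a failed write
    }
}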

4. Implementation of RocketMQ Streams Exactly-Once

1) Source ensures that a checkpoint system message is sent when the offset is committed; components that receive it complete their save operations, and messages are consumed at least once;

2) Each message has a header that encapsulates the queueId and offset;

3) When a component stores data, it also stores the queueId and the maximum offset processed; when duplicate messages arrive, they are deduplicated against the max offset;

4) Memory protection: a checkpoint cycle may include multiple flushes (triggered by the number of entries), ensuring that memory usage stays under control.
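
The deduplication rule in 3) can be illustrated with a conceptual sketch (not the engine's actual code): each component remembers the largest offset already processed per queue and drops replayed messages at or below it.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Conceptual sketch of offset-based deduplication: at-least-once delivery
// may replay messages; anything at or below the recorded max offset for its
// queue has already been processed and is dropped.
public class OffsetDeduplicator {
    private final Map<String, Long> maxOffsetPerQueue = new ConcurrentHashMap<>();

    // Returns true if the message is new and should be processed.
    public boolean accept(String queueId, long offset) {
        Long max = maxOffsetPerQueue.get(queueId);
        if (max != null && offset <= max) {
            return false; // duplicate caused by at-least-once replay
        }
        maxOffsetPerQueue.put(queueId, offset);
        return true;
    }
}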

5. Implementation of RocketMQ Streams Window

1) Supports tumbling, sliding, and session windows; supports event time and natural time (the time when the message enters the operator);

2) Supports Emit syntax, which can update results every n periods before or after the window fires. For example, with a 1-hour window, you may want to see the latest results every minute before the window fires, and after it fires you may want to keep data that arrives up to a day late, updating results every 10 minutes;

3) Supports a high-performance mode and a high-reliability mode; the high-performance mode does not rely on remote storage, but risks losing window data during shard switching;

4) Fast startup: there is no need to wait for local storage recovery. On failure or shard switching, data is recovered asynchronously from remote storage while remote storage is accessed directly for computation;

5) Uses message queue load balancing for scale-out and scale-in; each queue is a group, and a group is consumed by only one machine at any given moment;

6) Normal computation relies on local storage, giving computing performance similar to Flink's.
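
To illustrate the tumbling-window semantics independently of the RocketMQ Streams API, the sketch below assigns each event to the window whose start is its event time rounded down to the window size and keeps a count per window; the Emit behavior described above (early and late firing) would be layered on top of such state.

import java.util.HashMap;
import java.util.Map;

// Conceptual sketch of event-time tumbling windows: window start is the
// event time rounded down to the window size; one count per window.
public class TumblingWindowCount {
    private final long windowSizeMs;
    private final Map<Long, Long> countsByWindowStart = new HashMap<>();

    public TumblingWindowCount(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    public void onEvent(long eventTimeMs) {
        long windowStart = eventTimeMs - (eventTimeMs % windowSizeMs);
        countsByWindowStart.merge(windowStart, 1L, Long::sum);
    }

    public long countFor(long windowStart) {
        return countsByWindowStart.getOrDefault(windowStart, 0L);
    }
}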

Best Practices of RocketMQ Streams in Security Scenarios

1. Background

New problems arose when moving from the public cloud to the proprietary cloud. Big data SaaS services are not mandatory outputs of the proprietary cloud, and their minimum output scale is large, so user costs rise considerably and implementation is difficult; as a result, security capabilities could not be synchronized quickly to the proprietary cloud.

2. Solutions

  • Application of RocketMQ Streams in cloud security stream computing

1) Build a lightweight computing engine based on security scenarios: given the high-filtering characteristics of security workloads, optimize for high filtering first, then perform the heavier statistics, window, and join operations; because the filtering rate is high, statistics and join operations can be implemented with a lighter scheme;

2) Both SQL and engine can be hot-upgraded.

  • Business results

1) Rule coverage: the self-built engine covers 100% of the rules (regex, join, statistics);

2) Light resources: memory usage is 1/24 and CPU usage 1/6 of the public cloud engine. Thanks to the filter optimizer, resource usage does not grow linearly with the number of rules, and new rules add no resource pressure; with high-compression tables, tens of millions of intelligence entries are supported;

3) SQL hot release: with the C/S deployment mode, the SQL engine supports hot release, which makes it possible to launch rules quickly, especially in network protection scenarios;

4) Performance optimization: core components receive dedicated performance optimization, sustaining more than 5000 QPS per instance (2 GB, 4 cores, 41 rules).

Future plans for RocketMQ Streams

1. Build RocketMQ integrated computing capabilities

1) Integrate with RocketMQ, remove the DB dependency, and integrate RocketMQ KV;

2) Co-locate with RocketMQ to support local computing, using locality to achieve high performance;

3) Create best practices for edge computing.

2. Connector enhancement

1) Support the pull consumption mode, with checkpoints flushed asynchronously;

2) Be compatible with Blink/Flink connectors.

3. ETL capacity building

1) Add data access capabilities for files and syslog;

2) Be compatible with Grok parsing and add parsing capabilities for common logs;

3) Build best practices for log ETL.

4. Build stability and ease of use

1) Test windows in multiple scenarios to improve stability, with further performance optimization;

2) Add test cases, documentation, and application scenarios.

Open source address

The above is the overall introduction to RocketMQ Streams. I hope it is helpful and inspiring to everyone.
