How to improve the efficiency of large-scale regular expression matching

Background

Regular expressions are widely used in daily work: rules are defined as regular expressions and then matched against data. Below are two security scenarios that rely on regular expression matching.

Scenario 1: data is stolen after an FTP account is successfully brute-forced

• Data source: FTP server log

• Association logic: brute-force attempts against a specific account, followed by a successful login with that account, followed by a large number of file downloads by that account

• Alarm content: FTP account ${user_name} was successfully brute-forced and data was stolen

• Alarm Severity: High Risk

In Scenario 1, regular expressions are used to match the behavior of multiple account logins in the log.

Scenario 2: deep packet inspection (DPI), such as filtering network threats and traffic that violates security policies

• Data source: network packets

• Detection rule condition: the data hits a rule set

In Scenario 2, regular expressions are used for security detection across multiple data packets in a time series.

In fact, Scenario 1 lists only one way in which FTP can be attacked; there are many other FTP attack techniques, so another characteristic of this matching scenario is that the whole rule set can be large. In Scenario 2, a rule set describing known intrusion behaviors is constructed, and network data packets are inspected to discover behavior that violates security policy or shows signs of an attack. This requires inspecting the payload of the packets, and if that inspection is slow it will affect the user experience.

On the other hand, the way regular expressions are used here differs from the traditional usage. Traditionally, given a text, one or a few regular expressions are matched against it to find the matching data. The problem we face now is that the number of rules is large: tens of thousands, or even more than one hundred thousand. If the traditional practice is kept, joining the rules with | or looping over them one by one in an outer layer (see the sketch after the list below), the processing time becomes very long and consumes a lot of resources, which is basically unacceptable. Second, the data to be matched is not a complete whole: network packets, for example, arrive one by one, in streaming form. Traditional regular expression engines cannot handle streaming data well; they need to buffer a batch of data before matching, so matching is not timely enough. Finally, regular expression processing has a well-known pitfall: a poorly written regular expression can make matching very slow. Therefore, a solution is needed that addresses these challenges:

• A large number of rules

• Fast matching

• Support for streaming data

• Moderate resource consumption
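
To make the scale problem concrete, here is the traditional approach sketched with java.util.regex; the rule strings are made up for illustration. Both variants cost at least one scan per rule per record, which is why they break down at tens of thousands of rules:

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveMultiRegex {
    public static void main(String[] args) {
        // Hypothetical rule strings, for illustration only.
        List<String> rules = List.of("USER \\w+ 530", "RETR .*\\.sql", "PASS .{0,64}");
        String log = "RETR backup.sql";

        // Variant 1: join all rules with '|' into one giant pattern.
        // With 100,000+ rules the combined pattern becomes enormous and slow.
        Pattern combined = Pattern.compile(String.join("|", rules));
        System.out.println("combined hit: " + combined.matcher(log).find());

        // Variant 2: loop over the rules one by one.
        // Every record then costs one full scan per rule.
        for (String rule : rules) {
            Matcher m = Pattern.compile(rule).matcher(log);
            if (m.find()) {
                System.out.println("hit rule: " + rule);
            }
        }
    }
}
```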

Introduction to Hyperscan Operator

In response to these regular matching challenges, after researching and comparatively testing the mainstream regular matching engines on the market, we finally chose Hyperscan.

Hyperscan is Intel's open-source high-performance regular expression matching library. It provides a C API and has been used in many commercial and open-source projects.

Hyperscan has these features:

• Supports most PCRE syntax (with the Chimera library, all PCRE syntax is supported)

• Supports streaming matching

• Supports multi-mode matching

• Accelerates matching using specific instruction sets

• Easy to extend

• Combines multiple matching engines internally

Hyperscan was originally designed to better handle stream matching and multi-pattern matching. Stream mode greatly helps users, who no longer need to maintain and buffer received data themselves; multi-pattern matching allows multiple regular expressions to be passed in and matched at the same time.

Because specific instruction sets are required, Hyperscan has CPU requirements, as shown below:

The CPU must support at least the SSSE3 instruction set; the newer instruction sets listed there (such as AVX2) can further speed up matching.

Like most regular expression engines, Hyperscan has a compilation stage and a matching stage. Compilation parses the regular expressions and builds them into the database that Hyperscan uses internally; this database can then be reused across many matches. Each regular expression needs a unique id, which is used at match time. The compilation process is shown in the following figure:
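
As a concrete illustration of the compilation stage, here is a minimal sketch using a third-party Java binding for Hyperscan (com.gliwka.hyperscan); the exact API shown is our assumption, and the patterns are made up:

```java
import com.gliwka.hyperscan.wrapper.CompileErrorException;
import com.gliwka.hyperscan.wrapper.Database;
import com.gliwka.hyperscan.wrapper.Expression;

import java.util.List;

public class CompileStage {
    public static Database compileRules() throws CompileErrorException {
        // All patterns are compiled into one shared database (multi-pattern
        // matching). Internally, Hyperscan assigns every expression a unique
        // id, which is reported back on each hit at match time.
        List<Expression> expressions = List.of(
                new Expression("User-Agent:\\s*sqlmap"),
                new Expression("RETR\\s+.*\\.sql"),
                new Expression("530 Login incorrect"));

        // A compile error (syntax error or unsupported construct) surfaces here.
        return Database.compile(expressions);
    }
}
```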

When matching, Hyperscan returns all hits by default, unlike some regular engines, which return the greedy result when greedy quantifiers are specified and the lazy result when lazy quantifiers are specified. When a hit occurs, the user is notified through a callback function of which regular expression id was hit and at which position. The matching process is shown in the following figure:
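
Continuing the sketch above (with the same caveats about the binding's API): Hyperscan's C API reports each hit through a callback with the expression id and match offset, while the Java binding collects the hits into a list:

```java
import com.gliwka.hyperscan.wrapper.Database;
import com.gliwka.hyperscan.wrapper.Match;
import com.gliwka.hyperscan.wrapper.Scanner;

import java.util.List;

public class MatchStage {
    public static void scanOnce(Database db, String payload) {
        Scanner scanner = new Scanner();
        scanner.allocScratch(db);  // scratch space, reusable across scans

        // All hits are reported, each carrying the expression that matched
        // and the position where the match ends.
        List<Match> matches = scanner.scan(db, payload);
        for (Match m : matches) {
            System.out.println("hit " + m.getMatchedExpression()
                    + " at end position " + m.getEndPosition());
        }
    }
}
```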

The disadvantage of Hyperscan is that it runs only on a single machine and has no distributed capability: it can solve the latency problem, but not the throughput problem. To solve throughput, we can rely on the mainstream real-time computing framework Flink. Flink is a framework and distributed processing engine for stateful computation over unbounded and bounded data streams. Unbounded data has a beginning but no end, and computing over unbounded streams is stream processing; bounded data has a beginning and an end, and computing over bounded streams is batch processing.

Flink fits many computing scenarios; here are three of them. Flink can power event-driven applications, and beyond simple events it provides a CEP library for complex event processing. Flink can also serve as a data pipeline, performing cleaning, filtering, and transformation while moving data from one storage system to another. Finally, Flink can do stream or batch data analysis and metric computation, for example for dashboard display. Flink has become the industry's recognized first choice for stream processing.

Integrating the regular matching engine into Flink, with the help of Flink's powerful distributed capabilities, makes a strong alliance and exerts greater power. We therefore provide the following solution, as shown in the figure below:

This solution implements a custom UDF operator. The operator supports matching only specified fields of the input data. The operator's output is the matched field text plus the final state of the match, one of four states: hit, miss, error, or timeout. For a hit, the ids of the regular expressions that matched are also returned. The output also includes the original input record, so any subsequent processing is unaffected. To make this easier to use, we extended a new data stream, called HyperscanStream, that encapsulates the operator; users only need to convert their DataStream to a HyperscanStream and then invoke the operator through a method call. The whole solution is delivered to users as a standalone jar package, so the usual way of writing Flink jobs is preserved and the solution stays decoupled from Flink's core framework.
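
Based on that description, the operator's per-field result type would look roughly like the sketch below; only the name HyperScanRecord and the four states come from the text, the field names are hypothetical:

```java
import java.io.Serializable;
import java.util.List;

// Hypothetical reconstruction of the operator's per-field match result,
// based only on the description above.
public class HyperScanRecord implements Serializable {
    public enum State { HIT, MISS, ERROR, TIMEOUT }  // the four final states

    public String field;       // the field text that was matched, e.g. the Host value
    public List<Long> hitIds;  // ids of the regular expressions that hit (only for HIT)
    public State state;        // final state of the match
}
```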

The data flows like this: the source reads a record and sends it downstream to the Hyperscan operator; the Hyperscan operator passes the data to a Hyperscan subprocess; after the subprocess finishes matching, it returns the result to the Hyperscan operator; the Hyperscan operator then passes the original record together with the match result on to subsequent operators.

Operator Instructions

Private deployment

For the private deployment scenario, usage is as follows. The user first edits the regular expression file, then uses a tool to compile the regular expressions into a database and serialize it to a local file. If HDFS is available in the deployment environment, the serialized file is uploaded to HDFS; if not, this upload is skipped. The user then develops a Flink job that references the serialized file to match the data.

Why is there a compile-and-serialize tool step? After editing the regular expressions, can't they be used directly in the Flink job? As mentioned earlier, Hyperscan execution has a compilation stage and a matching stage. If the job referenced only the raw regular expressions and the job's parallelism were set to 5, each task would compile them once, five times in total, wasting resources. Compilation is also a relatively slow operation in Hyperscan, so separating it out speeds up the Flink job's execution. Compiling ahead of time also reveals syntax errors or unsupported constructs in the regular expressions before the job starts, rather than after.
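
A sketch of such a compile-and-serialize tool, again assuming the hyperscan-java binding and assuming it exposes Hyperscan's database serialization (hs_serialize_database in the C API) as save/load; the file name and patterns are made up:

```java
import com.gliwka.hyperscan.wrapper.Database;
import com.gliwka.hyperscan.wrapper.Expression;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.List;

public class CompileTool {
    public static void main(String[] args) throws Exception {
        // Compile once, ahead of time: syntax errors and unsupported
        // constructs surface here, not after the Flink job has started.
        Database db = Database.compile(List.of(
                new Expression("RETR\\s+.*\\.sql"),
                new Expression("530 Login incorrect")));

        // Serialize the compiled database to a local file (then optionally
        // upload it to HDFS). Each Flink task only deserializes this file,
        // so the rules are compiled once instead of once per task.
        // save/load as the serialization API is an assumption here.
        try (FileOutputStream out = new FileOutputStream("rules.db")) {
            db.save(out);
        }
        try (FileInputStream in = new FileInputStream("rules.db")) {
            Database restored = Database.load(in);
        }
    }
}
```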

During private deployment, the Hyperscan-related dependency programs are provided to users. These dependencies are compiled fully statically, so no extra dependencies need to be installed, as long as the machine supports the required instruction sets.

Internal use within the company

Example of use

Suppose we now want to match the Host and Referer fields of an HTTP message, as shown in the following figure:

The code example is as follows:
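
The original code screenshot is not reproduced here; based on the four steps described below, the job looks roughly like this sketch. HyperscanStream, hyperscan, HyperscanFunction, HyperScanRecord, and the Tuple2 result come from the description; Event, the source, the method signatures, and the database path are hypothetical:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.List;

public class HyperscanJobExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Step 1: build the input stream from the data source (hypothetical source).
        DataStream<Event> input = env.addSource(new HttpMessageSource());

        // Step 2: convert the input stream into a HyperscanStream.
        HyperscanStream<Event> stream = HyperscanStream.of(input);

        // Step 3: call the hyperscan method; the HyperscanFunction passed as
        // the first parameter selects the Host and Referer fields to match.
        DataStream<Tuple2<Event, List<HyperScanRecord>>> result =
                stream.hyperscan(
                        (HyperscanFunction<Event>) event ->
                                new String[] { event.getHost(), event.getReferer() },
                        "hdfs:///path/to/rules.db");

        // Step 4: consume the result; f0 is the original record (the whole
        // HTTP message), f1 is the list of per-field HyperScanRecord results.
        result.map(t -> t.f0 + " -> " + t.f1).print();

        env.execute("hyperscan-demo");
    }
}
```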

The whole logic is divided into four steps. Step one builds the input stream from the data source. Step two converts the input stream to a HyperscanStream. Step three calls the hyperscan method to apply the Hyperscan operator; the first parameter, a HyperscanFunction, specifies that the Host and Referer fields are to be matched. Step four uses the result returned by the match. The result is a Tuple2 object: the first field, Event, is the original record, in this case the entire HTTP message; the second field is a List of HyperScanRecord. The HyperScanRecord class includes the matched field (Host or Referer in this example), the id of the regular expression that hit (if there was a hit), and the final state of the match.

In tests with a rule set of 10,000 rules and samples of different sizes, the solution achieved the expected performance. The test results are as follows:

Some suggestions for using the Hyperscan operator are listed below:

As mentioned earlier, when the Chimera library is not used, Hyperscan does not support some PCRE syntax; pay attention to this when using it. The following figure lists the unsupported syntax (note that using the Chimera library will affect matching performance).

Future Outlook

On the one hand, the Hyperscan operator has already been used in security and threat-awareness scenarios, but we hope to exercise it in more scenarios. In theory it can be used wherever regular matching is needed, such as text auditing and content extraction.

On the other hand, we are also improving the usability of the Hyperscan operator. For example, today the job must be restarted for rule changes to take effect; in the future we hope rules can be hot-loaded dynamically.
