Calculating lost merchandise orders with Flink CEP

Suppose there is a requirement to calculate lost orders for products in real time. The rules are as follows:

  • The user clicks on product A but purchases a similar product B; product A is then recorded as a lost order;

  • The effective time window from clicking product A to buying the similar product B must be less than 12 hours;

  • Clicking product A multiple times within the effective window counts as a single lost order.

The third rule can be understood as deduplication of the data stream, which was already covered in the previous section. To stay focused on calculating lost orders, this article does not cover deduplication again.

Seeing this requirement, my first thought was to use the ProcessFunction from the previous section for state management: key the stream by user, maintain a piece of state and an effective time window for each user, run the statistics when a purchase of a similar product is triggered, and give up once the window expires.

But, is there a more elegant way?

The answer is yes: we can use Flink's built-in CEP library.

The following gives a brief introduction to Flink CEP first and then moves on to the code practice.

1. Flink CEP

1.1 What is CEP

CEP stands for Complex Event Processing, and Flink CEP is a complex event processing library implemented on top of Flink. It lets you detect event patterns in an unbounded stream of events, giving you the chance to catch what matters in your data.

One or more streams of simple events are matched against certain rules, and the data that satisfies those rules, that is, the complex events the user is interested in, is then output.

Features:

Goal: discover higher-order features in an ordered stream of simple events

Input: one or more event streams composed of simple events

Processing: identify the internal connections between simple events; multiple simple events that satisfy certain rules make up a complex event

Output: complex events that meet the rules

CEP is used to analyze low-latency, frequently generated event streams from different sources. It can help find meaningful patterns and complex relationships in otherwise unrelated event streams, so that notifications are produced in near real time and certain behaviors can be prevented.

CEP supports pattern matching on streams. Depending on the conditions of a pattern, matching can require contiguous or non-contiguous events; a pattern may also carry a time constraint, and if its conditions are not satisfied within that time range, the match times out.

It seems simple, but it offers many different capabilities:

Take in streaming data and produce results as soon as possible

Perform time-based aggregation calculations across two event streams

Provide real-time/quasi-real-time warnings and notifications

Generate correlations and analyze patterns in diverse data sources

High throughput, low latency processing

There are many systems on the market that can be used for CEP, such as Spark, Samza and Beam, but none of them provides dedicated library support. Flink, however, ships a dedicated CEP library.

Here are a few classic examples:

  • Anomaly detection: a taxi order that is still not completed 12 hours after billing started; a user completing many orders within a short time;

  • Real-time marketing: users compare prices on different platforms;

  • Data monitoring: watching certain indicators, such as the number of lost orders.

1.2 Flink CEP principles

Internally, Flink CEP is implemented with an "NFA (Non-deterministic Finite Automaton)": a state graph made of vertices and edges, which starts from an initial state and passes through a series of intermediate states to reach a final state. Vertices are divided into three types: "initial state", "intermediate state" and "final state". Edges are also divided into three types: "take", "ignore" and "proceed".

  • "Take" : There must be a conditional judgment. When the incoming message meets the take edge conditional judgment, the message is put into the result set and the state is transferred to the next state.

  • " Ignore " : When a message arrives, you can ignore the message and spin the state unchanged at the current state, which is a transition from yourself to your own state.

  • "Proceed" : Also called the empty transition of the state, the current state can be directly transferred to the next state without depending on the arrival of the message. For example, when a user purchases a product, if there is a customer service consultation behavior before the purchase, the two messages of the customer service consultation behavior and the purchase behavior need to be put together in the result set and output downstream; if there is no customer service consultation behavior before the purchase, only It is enough to put the purchase behavior in the result set and output downstream. That is to say, if there is the behavior of consulting customer service, there will be message saving on the status of consulting customer service. If there is no behavior of consulting customer service, there will be no message saving on the status of consulting customer service. The status of consulting customer service is determined by a proceed side and The downstream purchase status is connected.

Of course, our scenario will not involve too many of these complicated concepts.

2. Getting started with Flink CEP

Flink provides a dedicated CEP library, which contains the following components:

  • Event stream

  • Pattern definition

  • Pattern detection

  • Alert generation

First, the developer defines the pattern conditions on a DataStream, and then the Flink CEP engine performs pattern detection and generates alerts when necessary.

In order to use Flink CEP, we need to import dependencies:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-cep-scala_2.11</artifactId>
    <version>1.9.1</version>
</dependency>

2.1 Single Pattern

Let's start with something simple and see how Flink CEP matches with a single Pattern.

2.1.1 Usage of each API

While learning Flink CEP it is easy to find similar blog posts, most of which list the function of each API in a table. But you will quickly notice that the whole thing looks a lot like regular expressions (in fact the underlying matching logic is presumably implemented in a similar way), so it is much faster to understand these APIs by analogy with regular expressions, and I have taken the liberty of pairing each API with a regular expression of similar function. For example, suppose we want to match the letter x with CEP:
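
Roughly, my own pairing (not an official mapping from the Flink documentation) looks like this:

  • where(condition) — the basic match, comparable to the regex x;

  • or(condition) — comparable to x|y;

  • times(3) — comparable to x{3};

  • times(2, 4) — comparable to x{2,4};

  • oneOrMore() — comparable to x+;

  • timesOrMore(3) — comparable to x{3,};

  • optional() — comparable to x?.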

2.1.2 Write a program using only where and or

For example, we now have a simple requirement: match all data in the input stream that begins with x or y:

public class CepDemo {
    public static void main(String[] args) throws Exception {
        var environment = StreamExecutionEnvironment.getExecutionEnvironment();
        var stream = environment.setParallelism(1).addSource(new ReadLineSource("Data.txt"));

        // Use where and or to define the two conditions;
        // they could of course also be combined in a single where.
        var pattern = Pattern.<String>begin("start").where(new IterativeCondition<>() {
            @Override
            public boolean filter(String s, Context<String> context) {
                return s.startsWith("x");
            }
        }).or(new IterativeCondition<>() {
            @Override
            public boolean filter(String s, Context<String> context) throws Exception {
                return s.startsWith("y");
            }
        });

        // The first argument to CEP.pattern is the data stream, the second is the pattern;
        // then use the select method to extract the matched data.
        // A lambda expression is used here.
        CEP.pattern(stream, pattern).select((map ->
                Arrays.toString(map.get("start").toArray()))
        ).addSink(new SinkFunction<>() {
            @Override
            public void invoke(String value, Context context) {
                System.out.println(value);
            }
        });
        environment.execute();
    }
}

For the input data stream:

x1
z2
c3
y4

We get the following output:

读取:x1   
[x1]   
读取:z2   
读取:c3   
读取:y4   
[y4]

As you can see, Flink CEP matches against each piece of data as it arrives. A single piece of data can be a string, as in this article, or a complex event object, or of course just a character. If every piece of data were a character, CEP would look very much like a regular expression.

2.1.3 Add quantifiers

Next, still within a single Pattern, we add the quantifier APIs to study how Flink CEP matches multiple pieces of data. From here on there are some differences from regular expressions, mainly in the number of results. Because this is stream processing, Flink cannot know what data will arrive later, so it outputs every matching result.

For example, use the timesOrMore() function to match three or more occurrences of strings beginning with a. First the code (the rest of the code is exactly the same as in the example above and is omitted to save space, likewise below):

var pattern = Pattern.<String>begin("start").where(new IterativeCondition<>() {
    @Override
    public boolean filter(String s, Context<String> context) {
        return s.startsWith("a");
    }
}).timesOrMore(3);

Then enter the following string sequence in Data.txt:

a1
a2
a3
b1
a4

Run the program and it outputs the following results:

读取:a1
读取:a2
读取:a3
[a1, a2, a3]
读取:b1
读取:a4
[a1, a2, a3, a4]
[a2, a3, a4]

Let's analyze the execution. After the program starts, it waits for data to flow in. When a1 and a2 arrive, the condition is not yet satisfied, so no result is produced, but the data is stored in state. When a3 arrives, the condition is satisfied for the first time, so the program outputs [a1, a2, a3]. Then b1 arrives and does not match; then a4 arrives. At this point a1, a2 and a3 are still stored in state, so they can still participate in matching. A match can produce multiple results, but two principles always hold:

  1. Results must strictly follow the order of the data stream;

  2. Every result produced must contain the current element.

Principle 1 is easy to understand: since the data flows in in the order a1 -> a2 -> a3 -> a4, any result must also be in this order; elements in the middle cannot be dropped and the order cannot be shuffled, so results such as [a1, a2, a4] or [a3, a2, a4, a1] are impossible. Principle 2 is also straightforward: the results are produced because a4 flowed in, and given that the quantifier we set is "three or more", the only possible results are [a2, a3, a4] and [a1, a2, a3, a4].

Similarly, if we add a line a5 at the end of Data.txt, the output of the program is as follows:

读取:a1
读取:a2
读取:a3
[a1, a2, a3]
读取:b1
读取:a4
[a1, a2, a3, a4]
[a2, a3, a4]
读取:a5
[a1, a2, a3, a4, a5]
[a2, a3, a4, a5]
[a3, a4, a5]

Following this line of thinking, if we keep appending a6, a7, a8, ..., a100, every new element produces more and more results, because Flink CEP keeps all qualifying data in state. That clearly will not do, or the memory could never keep up. Therefore functions such as oneOrMore() and timesOrMore() are usually followed by the until() function to specify a termination condition.
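
As a minimal sketch of that idea, bounding the pattern above with until() could look like this; the stop condition (a string starting with "end") is just an assumption for illustration.

var pattern = Pattern.<String>begin("start").where(new IterativeCondition<>() {
    @Override
    public boolean filter(String s, Context<String> context) {
        return s.startsWith("a");
    }
}).timesOrMore(3).until(new IterativeCondition<>() {
    @Override
    public boolean filter(String s, Context<String> context) {
        // once an element starting with "end" arrives, this looping pattern stops accepting elements
        return s.startsWith("end");
    }
});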

2.1.4 Replace the quantifier with times()

What happens if we use the same data as above but replace the quantifier with times(3)? First modify the code:

var pattern = Pattern.<String>begin("start").where(new IterativeCondition<>() {
    @Override
    public boolean filter(String s, Context<String> context) {
        return s.startsWith("a");
    }
}).times(3);

Since exactly three elements are matched, and given the constraints of the two principles above, the result is obvious:

读取:a1
读取:a2
读取:a3
[a1, a2, a3]
读取:b1
读取:a4
[a2, a3, a4]
读取:a5
[a3, a4, a5]

The logic from a1 to b1 is exactly the same as before. When a4 is read, because exactly three elements are matched and the result must contain a4, the only possible result is [a2, a3, a4]. Similarly, after a5 is read, since the result must contain a5 and only three elements are matched, the result can only be [a3, a4, a5]. In this case expired data gets cleaned up, and we no longer have to worry about running out of memory.

Besides a fixed count, the times() function also supports times(from, to) to specify a range. The matching results in that case are similar to the above and easy to derive, so they are not repeated here.

2.1.5 Use strict mode

You may have noticed that there has always been an annoying b1 in the Data.txt above. Since b1 does not satisfy our basic matching condition, it is simply ignored by the program. This is because Flink CEP uses a non-strict (relaxed) matching mode by default. In some cases this data must not be ignored, and then the consecutive() function can be used to specify strict matching. Modify the code as follows:

var pattern = Pattern.<String>begin("start").where(new IterativeCondition<>() {
    @Override
    public boolean filter(String s, Context<String> context) {
        return s.startsWith("a");
    }
}).times(3).consecutive();

Run the program and produce the following results:

读取:a1
读取:a2
读取:a3
[a1, a2, a3]
读取:b1
读取:a4
读取:a5

This time, because a1, a2 and a3 follow one another directly, they are matched successfully. However, a2, a3, a4 and a3, a4, a5 cannot be matched in strict mode, because there is an extra b1 between them. As you can see, the matching strategy in strict mode is much closer to a regular expression.

2.2 Multiple Patterns

Generally speaking, tasks that call for CEP need multiple Patterns to solve. In that case you can use functions such as followedBy() and next() to create a new Pattern and connect it to the previous Pattern with different contiguity semantics.

2.2.1 Use followedBy() to create a new Pattern

Let's look at how to handle multiple Patterns. For example, suppose we need to match input that contains 2-3 strings starting with a followed by 1-2 strings starting with b.

// times(2, 3) controls matching 2 to 3 times;
// followedBy connects two Patterns that have an ordering relationship.
var pattern = Pattern.<String>begin("start").where(new IterativeCondition<>() {
    @Override
    public boolean filter(String s, Context<String> context) {
        return s.startsWith("a");
    }
}).times(2, 3).followedBy("middle").where(new IterativeCondition<String>() {
    @Override
    public boolean filter(String s, Context<String> context) throws Exception {
        return s.startsWith("b");
    }
}).times(1, 2);

CEP.pattern(stream, pattern).select(map -> {
    // Put the matched results into a list.
    var list = map.get("start");
    list.addAll(map.get("middle"));
    return Arrays.toString(list.toArray());
}).addSink(new SinkFunction<>() {
    @Override
    public void invoke(String value, Context context) {
        System.out.println(value);
    }
});

Here we use the followedBy() function, which creates a new Pattern named "middle" chained to the previous one. The lambda expression in the select function also changes: besides taking the data from the Pattern named "start", we also take the data from the Pattern named "middle" and put them together. This is very similar to a sub-expression in a regular expression; in fact we can loosely regard each Pattern as a sub-expression, and when reading the result we use the Pattern's name to extract its data from the map.

The input data is:

a1
a2
a3
b1
a4
a5
b2

The output is:

读取:a1
读取:a2
读取:a3
读取:b1
[a1, a2, a3, b1]
[a1, a2, b1]
[a2, a3, b1]
读取:a4
读取:a5
读取:b2
[a1, a2, a3, b1, b2]
[a1, a2, b1, b2]
[a2, a3, a4, b2]
[a2, a3, b1, b2]
[a3, a4, a5, b2]
[a3, a4, b2]
[a4, a5, b2]

With so much data produced at once, I was confused at first too. Let's analyze it step by step:

  1. a1 and a2 are read in order; they do not satisfy the overall condition but do satisfy the "start" condition, producing the intermediate result [a1, a2], which is kept in state;

  2. a3 is read; it does not satisfy the overall condition, but the "start" condition is satisfied and the two intermediate results [a2, a3] and [a1, a2, a3] are produced;

  3. b1 is read and satisfies the "middle" condition, producing the intermediate result [b1]. Now the overall condition is satisfied, so the intermediate results above are combined and [a1, a2, a3, b1], [a1, a2, b1] and [a2, a3, b1] are output;

  4. a4 is read and continues to satisfy the "start" condition, producing the two results [a2, a3, a4] and [a3, a4]. But because these results are produced after b1 was read, they cannot be combined with [b1];

  5. a5 is read and continues to satisfy the "start" condition, producing the two intermediate results [a3, a4, a5] and [a4, a5], which likewise cannot be combined with [b1];

  6. b2 is read and continues to satisfy the "middle" condition, producing the two intermediate results [b1, b2] and [b2]. This is where it gets more complicated and has to be analyzed strictly in time order. Since b1 was read before a4, the sequence [b1, b2] containing b1 can only be associated with [a1, a2], [a2, a3] and [a1, a2, a3]. [b2], on the other hand, can be associated with the four sequences that contain a4 or a5: [a2, a3, a4], [a3, a4], [a3, a4, a5] and [a4, a5]. So the output at this point is as follows:

[a1, a2, a3, b1, b2]    // [a1, a2, a3] associated with [b1, b2]
[a1, a2, b1, b2]        // [a1, a2] associated with [b1, b2]
[a2, a3, a4, b2]        // [a2, a3, a4] associated with [b2]
[a2, a3, b1, b2]        // [a2, a3] associated with [b1, b2]
[a3, a4, a5, b2]        // [a3, a4, a5] associated with [b2]
[a3, a4, b2]            // [a3, a4] associated with [b2]
[a4, a5, b2]            // [a4, a5] associated with [b2]

This raises a question: why can't [b2] be associated with [a1, a2], [a2, a3] and [a1, a2, a3]? Again the explanation lies in time order. Because it is b1 that directly follows those three sequences, only the two sequences that contain b1 ([b1] and [b1, b2]) can be associated with them; that is what followedBy means. To verify this, add a b3 at the end of Data.txt. With everything else unchanged, after b3 is finally read in, the output is as follows:

[a2, a3, a4, b2, b3]
[a3, a4, a5, b2, b3]
[a3, a4, b2, b3]
[a4, a5, b2, b3]

The analysis: when b3 is read, the "middle" condition is satisfied and [b2, b3] and [b3] are produced. Of these, only [b2, b3] contains b2. Since b2 is the data closest to the four sequences [a2, a3, a4], [a3, a4], [a3, a4, a5] and [a4, a5], only [b2, b3] can be associated with those four sequences; [b3] does not contain b2 and therefore cannot be associated with them.

2.2.2 Replace followedBy() with next()

You can think of next() as a stricter version of followedBy(). followedBy allows the two Patterns not to be directly adjacent, as with [a1, a2] and [b1] above, which have an a3 between them; with next(), such data is discarded. Using the same data as above (without b3), replace followedBy with next in the code:

var pattern = Pattern.<String>begin("start").where(new IterativeCondition<>() {
    @Override
    public boolean filter(String s, Context<String> context) {
        return s.startsWith("a");
    }
}).times(2, 3).next("middle").where(new IterativeCondition<String>() {
    @Override
    public boolean filter(String s, Context<String> context) throws Exception {
        return s.startsWith("b");
    }
}).times(1, 2);

After running, you see the following results:

读取:a1
读取:a2
读取:a3
读取:b1
[a1, a2, a3, b1]
[a2, a3, b1]
读取:a4
读取:a5
读取:b2
[a1, a2, a3, b1, b2]
[a2, a3, b1, b2]
[a3, a4, a5, b2]
[a4, a5, b2]

Comparing with the previous results, we find that [a1, a2, b1], [a1, a2, b1, b2], [a2, a3, a4, b2] and [a3, a4, b2] are all excluded, because in the original stream an a3 or an a5 sits between the last a and the first b of those sequences.

2.2.3 What does greedy() do

The usage of greedy() is, frankly, rather confusing. I have read many articles, and the description of greedy() is almost always a single throwaway line, usually "match as many as possible"; yet in most cases adding greedy() or not makes almost no difference. That is because greedy() is classified as a quantifier API but actually takes effect across multiple Patterns. To figure this out, I looked up the implementation of greedy(). In the updateWithGreedyCondition method of the NFACompiler class, the code is as follows:

private void updateWithGreedyCondition(
        State<T> state,
        IterativeCondition<T> takeCondition) {
    for (StateTransition<T> stateTransition : state.getStateTransitions()) {
        stateTransition.setCondition(
                new RichAndCondition<>(stateTransition.getCondition(),
                        new RichNotCondition<>(takeCondition)));
    }
}

Reading the code, we find that the method adds one piece of logic: an element can take the transition to the next state only if it satisfies the condition required by that transition and does not satisfy the condition of the current state. This means that if you are currently in Pattern1 and a piece of data satisfies the conditions of both Pattern1 and Pattern2, then without greedy() it will jump to Pattern2, but with greedy() it will stay in Pattern1. Let's verify this with the following code:

var pattern = Pattern.<String>begin("start").where(new IterativeCondition<>() {
    @Override
    public boolean filter(String s, Context<String> context) {
        return s.startsWith("a");
    }
}).times(2, 3).next("middle").where(new IterativeCondition<String>() {
    @Override
    public boolean filter(String s, Context<String> context) throws Exception {
        return s.length() == 3;
    }
}).times(1, 2);

In this code, if a piece of data starts with a and has a length of 3, it satisfies both "start" and "middle". To make it easy to tell which Pattern each piece of data was assigned to, we add a separator to the output:

CEP.pattern(stream, pattern).select(map -> {
    var list = map.get("start");
    list.add("|");
    list.addAll(map.get("middle"));
    return Arrays.toString(list.toArray());
}).addSink(new SinkFunction<>() {
    @Override
    public void invoke(String value, Context context) {
        System.out.println(value);
    }
});

Prepare the following data:

a
a1
a22
b33

Without greedy(), the results are as follows:

读取:a
读取:a1
读取:a22
[a, a1, |, a22]
读取:b33
[a, a1, a22, |, b33]
[a, a1, |, a22, b33]
[a1, a22, |, b33]

Looking at the results, a22 jumps back and forth between the two Patterns, and all possible results are output. Next we add greedy():

var pattern = Pattern.<String>begin("start").where(new IterativeCondition<>() {
    @Override
    public boolean filter(String s, Context<String> context) {
        return s.startsWith("a");
    }
}).times(2, 3).greedy().next("middle").where(new IterativeCondition<String>() {
    @Override
    public boolean filter(String s, Context<String> context) throws Exception {
        return s.length() == 3;
    }
}).times(1, 2);

The results are as follows:

读取:a
读取:a1
读取:a22
读取:b33
[a, a1, a22, |, b33]
[a1, a22, |, b33]

This time a22 is assigned to the "start" Pattern. So greedy() affects how data that satisfies the conditions of two Patterns at the same time is divided between them, and after adding greedy() there are actually fewer results, not, as the intuitive "match as many as possible" description suggests, more.

3. Code practice

Now let's take a quick look at the code, which is explained mainly through comments.

The input data is:

952483,310884,4580532,pv,1511712000
952483,5119439,982926,pv,1511712000
952483,4484065,1320293,pv,1511712000
952483,5097906,149192,pv,1511712000
952483,2348702,3002561,pv,1511712000
952483,2157435,1013319,buy,1511712020
952483,1132597,4181361,pv,1511712020
952483,3505100,2465336,pv,1511712020
952483,3815446,2342116,pv,1511712030
952483,3815446,2442116,buy,1511712030

The data source code is:

package com.aze.producer;

import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.io.BufferedReader;
import java.io.FileReader;

/**
 * @Author: aze
 * @Date: 2020-09-16 14:41
 */
public class ReadLineSource implements SourceFunction<String> {

    private final String filePath;
    private volatile boolean canceled = false;

    public ReadLineSource(String filePath) {
        this.filePath = filePath;
    }

    @Override
    public void run(SourceContext<String> sourceContext) throws Exception {
        try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
            while (!canceled && reader.ready()) {
                String line = reader.readLine();
                // print each line as it is read; this produces the "读取:" lines shown in the outputs above
                System.out.println("读取:" + line);
                sourceContext.collect(line);
            }
        }
    }

    @Override
    public void cancel() {
        canceled = true;
    }
}

The main code is:

package com.aze.consumer;

import lombok.val;
import com.aze.producer.ReadLineSource;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.nfa.aftermatch.AfterMatchSkipStrategy;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.IterativeCondition;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

import java.time.Duration;

/**
 * 订单流失率
 *
 * @Author: aze
 * @Date: 2020-09-23 14:45
 */
public class OrderLostCEP {

    public static void main(String[] args) throws Exception {

        val env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.setParallelism(1);

        val dataStream = env.addSource(new ReadLineSource("src/main/resources/data.txt"));

        // First configure event time,
        // then key by uid plus the product category (the first character of the category stands for the top-level category)
        val keyStream = dataStream.assignTimestampsAndWatermarks(WatermarkStrategy
                .<String>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                .withTimestampAssigner((SerializableTimestampAssigner<String>)
                        (s, l) -> Long.parseLong(s.split(",")[4]) * 1000))
                .keyBy((KeySelector<String, String>) s ->
                        s.split(",")[0] + "-" + s.split(",")[2].substring(0, 1));

        // We use the noSkip strategy: the point is that clicking product A and then buying similar product B and similar product C counts as two lost orders
        val noSkip = AfterMatchSkipStrategy.noSkip();

        // Define a matching rule;
        // followedByAny specifies non-deterministic relaxed contiguity; readers can try out how it differs from followedBy.
        val pattern = Pattern
                .<String>begin("start", noSkip).where(new IterativeCondition<String>() {
                    @Override
                    public boolean filter(String s, Context<String> ctx) {
                        return "pv".equals(s.split(",")[3]);
                    }
                }).within(Time.minutes(10))
                .followedByAny("end").where(new IterativeCondition<String>() {
                    @Override
                    public boolean filter(String s, Context<String> ctx) {
                        return "buy".equals(s.split(",")[3]);
                    }
                });

        // After CEP matching, extract the click events,
        // key them by product id, and use process for stateful counting.
        val patStream = CEP.pattern(keyStream, pattern)
                .select(map -> map.get("start").get(0))
                .keyBy((KeySelector<String, String>) s -> s.split(",")[1])
                .process(new KeyedProcessFunction<String, String, Object>() {

                    private ValueState<Long> clickState;

                    @Override
                    public void open(Configuration parameters) {
                        clickState = getRuntimeContext().getState(
                                new ValueStateDescriptor<>("OrderLost", Long.class));
                    }

                    @Override
                    public void processElement(String in, Context ctx, Collector<Object> out)
                            throws Exception {

                        Long clickValue = clickState.value();
                        clickValue = clickValue == null ? 1L : ++clickValue;
                        clickState.update(clickValue);
                        out.collect("【" + in.split(",")[1] + "】OrderLost:" + clickValue);

                    }
                });

        patStream.print();

        env.execute("test");

    }

}

result:

【3505100】OrderLost:1
【3815446】OrderLost:1
【4484065】OrderLost:1
【5097906】OrderLost:1
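
Walking through the data: keyed by uid plus the first character of the category, the buy event in category 1013319 follows the clicks on item 4484065 (category 1320293) and item 5097906 (category 149192), and the buy event in category 2442116 follows the clicks on item 3505100 (category 2465336) and item 3815446 (category 2342116), so each of these four clicked items is counted as one lost order.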

4. Summary

This article mainly introduced how to use Flink CEP and provided many demos for learning.

As for the requirement from the beginning: I grouped by uid and product category and then let CEP match the rule. Of course, you could also group by uid only, use CEP to match the pattern [click on a product, buy a product], and then use select to filter on whether the bought product is a similar one.
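
A rough sketch of that alternative follows, under the assumption that "similar product" means "same first character of the category" (the same rule used for grouping above). The variable withWatermarks is assumed to be the stream after the same assignTimestampsAndWatermarks call as in the main code.

// Key only by uid and let CEP match the loose [click, buy] sequence.
val byUser = withWatermarks.keyBy((KeySelector<String, String>) s -> s.split(",")[0]);

val clickThenBuy = Pattern
        .<String>begin("click").where(new IterativeCondition<String>() {
            @Override
            public boolean filter(String s, Context<String> ctx) {
                return "pv".equals(s.split(",")[3]);
            }
        }).within(Time.minutes(10))
        .followedByAny("buy").where(new IterativeCondition<String>() {
            @Override
            public boolean filter(String s, Context<String> ctx) {
                return "buy".equals(s.split(",")[3]);
            }
        });

// Pair each click with the buy that follows it, then keep only pairs whose categories share the first character.
val lostClicks = CEP.pattern(byUser, clickThenBuy)
        .select(map -> map.get("click").get(0) + "|" + map.get("buy").get(0))
        .filter(pair -> pair.split("\\|")[0].split(",")[2].charAt(0)
                == pair.split("\\|")[1].split(",")[2].charAt(0));

// lostClicks can then be keyed by product id and counted with the same KeyedProcessFunction as above.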

Origin blog.csdn.net/Baron_ND/article/details/109381561