Using Flink for stream data statistics | JD Cloud technical team

1. Statistical process



Every stream-computing statistics job follows the same basic process:

1. Access data sources

2. Perform multiple data conversion operations (filtering, splitting, aggregation calculation, etc.)

3. Store the calculation results.

There can be multiple data sources, and after a transformation node processes the data, it can send the results to one or more downstream nodes for further processing.

The basic building blocks of a Flink program are streams and transformations (a DataSet is essentially also a stream). A stream is an intermediate result; a transformation is an operation that takes one or more streams as input, computes over them, and outputs one or more result streams, which finally flow to a sink that stores the data.



Starting from the data source, each emitted result is passed to the next stage through a DataStream for further processing.

Each Transformation has 2 steps:

1. Process data

2. Emit the processed data

2. Flink’s data source

To provide a data source, Flink only requires an implementation of the SourceFunction interface. SourceFunction has an abstract implementation class, RichParallelSourceFunction; extend this class and implement three methods to build a custom source:

public void open(Configuration parameters) // called during initialization; parameters can be set up here

public void run(SourceContext<T> ctx) // produces the data; call ctx.collect(...) inside this method to emit each record

public void cancel() // called when the job is cancelled; should stop the run loop



In this example, an Order entity is emitted every 20 seconds.
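The original code for this source is shown only as an image, so here is a minimal sketch of such a source, assuming a hypothetical Order POJO whose constructor takes a business key and a timestamp:

import java.util.Date;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

public class OrderSource extends RichParallelSourceFunction<Order> {

    private volatile boolean running = true;

    @Override
    public void open(Configuration parameters) throws Exception {
        // called once during initialization; set up connections or parameters here
    }

    @Override
    public void run(SourceContext<Order> ctx) throws Exception {
        while (running) {
            // Order("demo-biz", new Date()) is an assumed constructor, for illustration only
            ctx.collect(new Order("demo-biz", new Date())); // emit one record downstream
            Thread.sleep(20_000L); // one Order every 20 seconds
        }
    }

    @Override
    public void cancel() {
        running = false; // stop the run loop when the job is cancelled
    }
}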

3. Flink’s data conversion operation

Flink provides different solutions for different scenarios, reducing the amount of attention users must pay to efficiency during processing.

Common operations include the following: "map" applies a mapping to each record, for example merging two strings into one string, or splitting one string into a pair of fields.

"flatMap" is similar to splitting a record into two, three, or even four records, such as splitting a string into a character array.

"Filter" is similar to filtering.

"keyBy" is equivalent to group by in SQL.

"aggregate" is an aggregation operation, such as counting, summing, averaging, etc.

"reduce" is similar to reduce in MapReduce.

The "join" operation is somewhat similar to the join in our database.

"connect" implements connecting two streams into one stream.

"repartition" is a repartitioning operation (haven't studied it yet).

The "project" operation is similar to snacks in SQL (haven't studied it yet).

The most commonly used operations are filter, map, flatMap, keyBy (grouping), and aggregate (aggregation); their concrete usage appears in the sketch below and in the later examples.
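As an illustration (not from the original article), here is a short sketch of how several of these operators chain together, reusing the hypothetical OrderSource above and an assumed getBiz() accessor on Order:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<Tuple2<String, Integer>> counts = env
        .addSource(new OrderSource())
        .filter(o -> o.getBiz() != null)                // filter: drop records without a business key
        .map(o -> Tuple2.of(o.getBiz(), 1))             // map: one record in, one record out
        .returns(Types.TUPLE(Types.STRING, Types.INT))  // lambdas lose generic type info, so declare it
        .keyBy(t -> t.f0)                               // keyBy: like GROUP BY in SQL
        .sum(1);                                        // aggregate: running count per key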

4. Window

Stream computation can split continuous data into many fragments according to certain rules and then compute statistics within each fragment. For example, you could save one hour of data into a small database table and then run calculations and statistics over it; stream computing simply does the same thing in real time.

Common windows are:

1. Time Window, in units of time, for example: one window per second, per hour, etc.

2. Count Window, based on the number of elements, for example: one window per hundred elements

Flink provides us with some common time window models.

1. Tumbling Windows (non-overlapping)

Each element in the data stream belongs to exactly one window; every window has a fixed size, and the windows do not overlap. For example, a tumbling window with a size of 5 minutes starts a new window every 5 minutes.



2. Sliding Windows (overlapping)

Unlike a tumbling window, a sliding window is built by specifying not only the window size but also a window slide parameter, which determines how often a new window starts. When the slide parameter is smaller than the window size, windows overlap; for example, a window size of 10 minutes with a slide of 5 minutes.



3. Session Windows

A session window is closed when no data appears in the stream for a period of time. Session windows therefore have no fixed size, and the starting position of a session window cannot be computed in advance.
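For reference, these three window types correspond to window assigners applied to a keyed stream; a minimal sketch (the 30-second session gap is an arbitrary example value):

import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// tumbling: fixed 5-minute windows, no overlap
keyedStream.window(TumblingEventTimeWindows.of(Time.minutes(5)));

// sliding: 10-minute windows, a new one starting every 5 minutes, so adjacent windows overlap
keyedStream.window(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(5)));

// session: the window closes after 30 seconds without any data for the key
keyedStream.window(EventTimeSessionWindows.withGap(Time.seconds(30)));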



5. The concept of time in Flink

There are 3 different concepts of time in Flink:

1. Processing Time refers to the system time of the machine at the moment the Transformation operations above are executed.
2. Event Time refers to the time at which the business event occurred. Each business record carries a timestamp, and we must specify which attribute of the data to read it from. Counting data by business time raises a question: when the data we receive arrives out of order, when should the aggregation be triggered? We cannot wait indefinitely, so Flink introduces the concept of a Watermark. The Watermark is attached to the window and tells it how long to wait at most; data later than that is discarded and no longer processed.
3. Ingestion Time refers to the system time at which the data enters Flink.
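In the (older) Flink API this article appears to use, the time semantics for the whole job are selected once on the execution environment; a sketch:

import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// make all time windows below use Event Time instead of Processing Time
// (newer Flink versions default to event time and deprecate this call)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);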
 

6. Order statistics example

 Step 4: Set timestamps and watermarks

DataStream<Order> marksSource = validatedSource
        .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<Order>(Time.minutes(1)) {
                    @Override
                    public long extractTimestamp(Order o) {
                        return o.getTimestamp().getTime();
                    }
                });

We already configured the job to process data by EventTime, so before computing time windows we must tell Flink which field carries the timestamp. Here the timestamp field of Order is used as the event time, and we also set Watermarks with a 1-minute bound, meaning we wait at most 1 minute for out-of-order data: records arriving more than 1 minute late will not be counted.

Step 5: Data Grouping

KeyedStream<Order, Tuple> keyedStream =
        marksSource.keyBy("biz"); // group by biz first

Here the biz field of Order is used for grouping, which means that all data with the same biz value enters the same time window for calculation.

Step 6: Specify time window and aggregate calculation

DataStream<List<Tuple2<String, String>>> results = keyedStream
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                .aggregate(new OrderSumAggregator()).setParallelism(1);

Here a non-overlapping TumblingEventTimeWindow of 1 minute is set, and OrderSumAggregator performs the aggregation. Note that if the job was set up to process data by ProcessingTime, the window here must become a TumblingProcessingTimeWindow: the time characteristic and the window type must correspond one to one. We once produced incorrect statistics whose cause was hard to find precisely because the two did not correspond.

7. Aggregation calculation

The core of the example above is the aggregation calculation, that is, our OrderSumAggregator. For aggregation we only need to implement the AggregateFunction interface provided by Flink and override its methods:

ACC createAccumulator(); // creates the container that holds the running statistics

ACC add(IN in, ACC acc); // called for each element added to the window; the first parameter is the element, the second is the statistics container created above

OUT getResult(ACC acc); // called when the window's statistics event is triggered; returns the result

ACC merge(ACC acc1, ACC acc2); // called only when windows are merged; combines the two containers

Depending on the situation, this container can be kept in memory or in other storage.
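The article does not show the body of OrderSumAggregator (the real one returns a List<Tuple2<String, String>>), so here is a simplified illustrative aggregator that merely counts orders, to show how the four methods fit together:

import org.apache.flink.api.common.functions.AggregateFunction;

public class OrderCountAggregator implements AggregateFunction<Order, Long, Long> {

    @Override
    public Long createAccumulator() {
        return 0L; // the empty statistics container
    }

    @Override
    public Long add(Order order, Long acc) {
        return acc + 1; // called once per element entering the window
    }

    @Override
    public Long getResult(Long acc) {
        return acc; // called when the window fires
    }

    @Override
    public Long merge(Long a, Long b) {
        return a + b; // called only when windows are merged (e.g. session windows)
    }
}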

Through the example above we can count the number of orders per minute by business time, and orders may be reported up to 1 minute late. However, because we wait 1 minute for late reports, the statistics themselves are delayed by 1 minute; for example, at 8:02 we can only produce the count for the data reported from 8:00 to 8:01. To solve this, we can add a custom trigger to the window that fires the statistics event (that is, calls the getResult method above) exactly on the minute, so that the 8:00-8:01 window is counted once at 8:01 and counted again at 8:02 (including the data reported in the following minute).
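Flink does not ship a trigger with exactly this behavior; one way to sketch it (my assumption, not the article's code) is a custom Trigger that fires an early result from a processing-time timer at the window's end, assuming the wall clock is roughly aligned with event time, and fires again when the watermark finally passes:

import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

public class EarlyAndFinalTrigger extends Trigger<Object, TimeWindow> {

    @Override
    public TriggerResult onElement(Object element, long timestamp, TimeWindow window, TriggerContext ctx) {
        // early firing on the wall clock at the window's end (assumes the two clocks are aligned)
        ctx.registerProcessingTimeTimer(window.maxTimestamp());
        // final firing when the watermark passes the window's end
        ctx.registerEventTimeTimer(window.maxTimestamp());
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.FIRE; // emit the early, possibly incomplete result
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        return time == window.maxTimestamp() ? TriggerResult.FIRE : TriggerResult.CONTINUE;
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) {
        ctx.deleteProcessingTimeTimer(window.maxTimestamp());
        ctx.deleteEventTimeTimer(window.maxTimestamp());
    }
}

It would be attached before the aggregation, e.g. keyedStream.window(...).trigger(new EarlyAndFinalTrigger()).aggregate(new OrderSumAggregator()); because FIRE does not purge the window contents, the second firing naturally includes the late-reported data.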

Author: JD Technology Liang Fawen

Source: JD Cloud Developer Community. Please indicate the source when reprinting.


Origin: my.oschina.net/u/4090830/blog/10320419