Flink Series 2: Overview of Flink Stateful Stream Processing

This article is the second in the Flink series. The first covered environment preparation; readers who need it can find it at https://blog.csdn.net/lly576403061/article/details/130358449?spm=1001.2014.3001.5501. I hope that systematic study will consolidate my knowledge of this area and enrich my skill tree. Without further ado, let's get started.

1. Traditional Data Processing Architectures

Data and data processing are ubiquitous in our daily lives. As the amount of data collected and used keeps growing, various architectures have been designed and built to manage it. Traditional data processing architectures fall into two categories: transactional processing architectures and analytical processing architectures.

1.1. Transactional Processing Architecture

All kinds of applications that we develop day to day belong to the transactional processing category, for example: customer relationship management (CRM) systems, task systems (ZEUS), order systems (SHUTTLE-ORDER), and web-based applications in general.
(figure: a traditional transactional application backed by a remote relational database)
The figure above shows the design of a traditional transactional application that stores its data in a remote relational database. Traditional transactional applications have several characteristics:

  1. They serve actual users or external services.
  2. They continuously accept requests from the outside (users or systems) and process and return data in real time. While handling each request, they generally perform CRUD operations by executing transactions against the remote database.
  3. In many cases, several applications share the same database and even the same tables.
    One drawback of this design is that it causes problems as soon as an application needs to be updated or scaled. This is where microservices come in: a complex, large, tightly coupled service is split into many small, independent applications, and each service communicates with the others through standardized interfaces.

1.2. Analytical Processing Architecture

The data stored in these separate databases could feed our business analysis, but because transactional databases are isolated from one another and we do not want to run analytical queries directly against them, unified analysis requires converting the data from the different databases into some common form. This is where the analytical processing architecture (the data warehouse) comes in.
To fill the warehouse with the scattered data, we need to copy it out of the transactional databases. This process has three steps: extract, transform, load (ETL). The whole process is complex and performance-challenging, and to keep the data in sync it has to be re-run periodically.
(figure: analytical data warehouse architecture)
The figure above shows an analytical data warehouse architecture. An analytical data warehouse can serve two types of queries:

  1. Periodic report queries: run periodic analysis and calculations over business data, compute the important indicators, and provide a basis for evaluating the health of the enterprise (revenue, output, user growth, order volume, etc.).
  2. Ad-hoc queries: provide relatively real-time data to support key business decisions (advertising spend, customer acquisition, conversion, etc.).
    Today the Apache Hadoop ecosystem provides us with powerful and rich storage and query engines. Massive data such as log files, social media, and click streams are no longer stored in traditional relational databases; instead they are kept in large-capacity storage systems such as the Hadoop Distributed File System (HDFS), S3, and Apache HBase, and a wealth of Hadoop-based SQL engines (Apache Hive, Apache Drill) are available for querying and processing them.

2. Stateful Stream Processing Architecture

We all know that real-world data is generated continuously. When processing streams of events, we need to support transformations over multiple records and be able to store and access intermediate results, and when analyzing the data what the business often needs most is the up-to-date analysis result. For massive event streams, the traditional transactional architecture and the ETL architecture are both hard pressed to cope. The stateful stream processing architecture was designed for exactly this situation. A stateful stream processor such as Flink can receive a large number of requests, naturally supports parallel computation with high throughput and low latency, and stores the intermediate results of its computations locally or in remote storage. Flink also periodically writes checkpoints (CheckPoint) of that state to persistent storage, and during failure recovery it restores from the latest checkpoint.
(figure: stateful stream processing architecture)
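To make "stores the intermediate results locally" concrete, here is a minimal sketch (not from the original article) of Flink-managed keyed state: a RichFlatMapFunction that keeps a running count per key in a ValueState. The state backend stores this value and the checkpoint mechanism protects it; the Tuple2<String, Long> event type and the counting logic are assumptions made for this example.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class CountPerKey extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {
    // Keyed state: one Long per key, kept by the state backend and
    // included in every checkpoint.
    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = count.value();                        // null for the first event of a key
        long updated = (current == null ? 0L : current) + 1;
        count.update(updated);
        out.collect(Tuple2.of(in.f0, updated));              // emit (key, running count)
    }
}

It would be applied to a keyed stream, e.g. stream.keyBy(t -> t.f0).flatMap(new CountPerKey()).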

3. The Main Features of Flink

3.1. Event-Driven

Event-driven applications borrow from the traditional transactional architecture: they receive event requests (operations triggered in real time, or events read from a log or storage medium such as Kafka or Redis), keep intermediate state in local or remote storage, and finally return the computed result to the caller or write it to a storage medium (MySQL, Redis, Kafka, etc.) for downstream consumers.
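A minimal sketch of this pattern, assuming the flink-connector-kafka dependency and a local Kafka broker; the topic names, group id, and the upper-casing "business logic" are placeholders for this example.

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class EventDrivenSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.setProperty("group.id", "flink-demo");

        env
            // events arrive from a Kafka topic...
            .addSource(new FlinkKafkaConsumer<>("input-events", new SimpleStringSchema(), props))
            // ...are processed (placeholder logic)...
            .map(String::toUpperCase)
            // ...and the results are written back for downstream consumers.
            .addSink(new FlinkKafkaProducer<>("output-events", new SimpleStringSchema(), props));

        env.execute("event-driven sketch");
    }
}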

3.2. A Stream-Based World View

In the world of Flink, everything is a stream, and streams are divided into bounded and unbounded streams. An unbounded stream has a defined start but no defined end, so there is no point at which all of its events are available; unbounded streams must therefore be processed continuously as events arrive, and usually in a specific order (such as event-time order) if accurate results are required. A bounded stream has both a defined start and a defined end; because all of its events can be obtained before processing begins, no particular ingestion order is needed.
(figure: bounded and unbounded streams)
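A minimal sketch of the two kinds of sources (the sample elements, host, and port are placeholders for this example):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedVsUnbounded {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bounded: all elements are known up front, so processing can finish.
        DataStream<Integer> bounded = env.fromElements(1, 2, 3);

        // Unbounded: events keep arriving until the job is cancelled.
        DataStream<String> unbounded = env.socketTextStream("localhost", 9999);

        bounded.print();
        unbounded.print();
        env.execute("bounded vs unbounded");
    }
}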

3.3. Layered API

Flink provides three layers of API, each offering a different trade-off between conciseness and expressiveness. The higher the layer, the more abstract it is, the more concisely it expresses intent, and the more convenient it is to use; the lower the layer, the more concrete it is, the richer its expressive power, and the more flexible its use.
(figure: Flink's layered APIs)
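For contrast with the DataStream word-count example below, here is a minimal sketch of the lowest layer, ProcessFunction, which exposes each element together with a Context (timestamps, timers, side outputs); the host, port, and pass-through logic are placeholders for this example.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class ProcessFunctionSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        // The lowest layer: per-element control, at the cost of more code.
        lines.process(new ProcessFunction<String, String>() {
            @Override
            public void processElement(String value, Context ctx, Collector<String> out) {
                out.collect(value); // placeholder: pass every event through unchanged
            }
        }).print();

        env.execute("ProcessFunction sketch");
    }
}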

Here we will use the DataStream API for our systematic study. Below is a brief introduction to the skeleton of a Flink program:
1. Define Flink's execution environment.
2. Obtain data from a data source.
3. Perform transformation computations.
4. Output the results to the console.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class SocketTextStreamWordCount {
    public static void main(String[] args) throws Exception {
        // check the arguments
        if (args.length != 2) {
            System.err.println("USAGE:\nSocketTextStreamWordCount <hostname> <port>");
            return;
        }

        String hostname = args[0];
        Integer port = Integer.parseInt(args[1]);


        // set up the streaming execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // read data from the socket source
        DataStreamSource<String> stream = env.socketTextStream(hostname, port);

        // split into words and count per word
        SingleOutputStreamOperator<Tuple2<String, Integer>> sum = stream.flatMap(new LineSplitter())
                .keyBy(0)
                .sum(1);

        sum.print();

        env.execute("Java WordCount from SocketTextStream Example");
    }

    public static final class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) {
            String[] tokens = s.toLowerCase().split("\\W+");

            for (String token : tokens) {
                if (token.length() > 0) {
                    collector.collect(new Tuple2<String, Integer>(token, 1));
                }
            }
        }
    }
}

3.4. Time Semantics

Flink supports the following three time semantics and uses processing time by default.

@PublicEvolving
public enum TimeCharacteristic {
   ProcessingTime,
   
   IngestionTime,

   EventTime
}
  1. Event time: process stream data according to the timestamp carried by each event. Event time combined with watermarks can provide consistent and accurate results even for out-of-order events.
  2. Processing time: the time at which a specific operator receives an event. Applications that use processing time require relatively low latency and accept results that depend on arrival order.
  3. Ingestion time: the time at which an event enters Flink; it is generally not used for computation.
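As a minimal sketch in the same Flink 1.x style as the enum above, the snippet below switches a job to event time and assigns timestamps and watermarks; the MyEvent POJO and the 5-second out-of-orderness bound are assumptions made for this example.

import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeSketch {
    // Hypothetical event type carrying its own timestamp in epoch milliseconds.
    public static class MyEvent {
        public String key;
        public long timestamp;
    }

    public static DataStream<MyEvent> withEventTime(StreamExecutionEnvironment env, DataStream<MyEvent> events) {
        // Switch the job from the default processing time to event time.
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Extract each event's timestamp and emit watermarks that tolerate
        // events arriving up to 5 seconds out of order.
        return events.assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<MyEvent>(Time.seconds(5)) {
                    @Override
                    public long extractTimestamp(MyEvent e) {
                        return e.timestamp;
                    }
                });
    }
}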

3.5. Exactly-Once Processing

Exactly-once state guarantees: Flink's checkpointing and recovery algorithms ensure consistency of application state in the event of a failure.
Thus, failures are handled transparently and do not affect the correctness of the application.
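A minimal sketch of enabling checkpointing for exactly-once state; the 10-second interval and 500 ms pause are arbitrary example values.

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint of all operator state every 10 seconds.
        env.enableCheckpointing(10_000L);

        // EXACTLY_ONCE is the default mode; set explicitly here for clarity.
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        // Leave at least 500 ms between the end of one checkpoint and the start of the next.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
    }
}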

3.6. Connections to Many Storage Systems

Flink can connect to many storage systems. Common sources and sinks include Apache Kafka, MySQL, Redis, Elasticsearch, S3, HDFS, etc.

3.7. Other Features

1. Support for high-availability configurations and cluster deployment on Kubernetes, YARN, etc.
2. Low latency: able to process millions of events per second with millisecond latency.
3. It also supports batch processing, with a mature API for it (the DataSet API).
4. It supports window operations, providing a mature computation mechanism for unbounded data streams (a minimal sketch follows this list).
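As mentioned in item 4, here is a minimal window sketch in the same Flink 1.x style as the word-count example above: counts per word are summed over tumbling 10-second windows (the input stream and the window size are assumptions made for this example).

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowSketch {
    public static DataStream<Tuple2<String, Integer>> windowedCounts(
            DataStream<Tuple2<String, Integer>> pairs) {
        // Group by the word (field 0) and sum the counts (field 1)
        // over tumbling 10-second windows.
        return pairs
                .keyBy(0)
                .timeWindow(Time.seconds(10))
                .sum(1);
    }
}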

Summary

Apache Flink is a distributed stream processing engine that provides an intuitive and expressive API for implementing stateful stream processing applications, and it runs such applications efficiently, at scale, and with fault tolerance. In this article, by walking through the concepts behind Flink's stateful stream processing, we have built an overall picture of its ideas and features. In the next article we will get hands-on and observe Flink's runtime behavior in practice. Stay tuned!
