Flink notes

2. Flink study notes

2.1 Streaming processing comparison

After learning Spark Streaming, I have a basic understanding of real-time processing. Flink is also mainly used for real-time computing, but the two frameworks handle real-time computation in different ways.
![Insert image description here](https://img-blog.csdnimg.cn/bd546e6f132047d0b88247c97679f8a5.png)
Figure 2.1 Micro-batch processing of Spark Streaming
Spark Streaming is micro-batch processing: it divides the data stream into very small collections of data based on time and then processes each collection as a batch.

Figure 2.2 Flink’s stream processing
Flink is stream processing: it performs the computation event by event, which is the standard streaming execution mode.

2.2 Flink core concepts


Figure 2.3 Flink’s running process
JobManager: the JobManager is the core of task management and scheduling in a Flink cluster and is the main process that controls application execution.
It contains three components: the JobMaster is responsible for handling an individual job (Job), the ResourceManager is responsible for resource allocation and management, and the Dispatcher is used to submit applications and starts a new JobMaster component for each newly submitted job.
TaskManager: the TaskManager is the worker process in Flink; it performs the actual computation on the data streams.


Figure 2.4 Parallel view of data flow

2.2.1 Parallelism

During Flink execution, each operator can contain one or more subtasks, which are executed completely independently in different threads, different physical machines, or different containers. The number of subtasks of a specific operator is called its degree of parallelism. You can call the setParallelism() method after the operator to set the parallelism of the current operator.
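A minimal sketch of setting parallelism, assuming the usual Flink imports (the host, port, and parallelism values are only illustrative):

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2);  // default parallelism for every operator in this job

env.socketTextStream("localhost", 7777)                 // this source is non-parallel (parallelism 1)
   .map(line -> line.toUpperCase()).setParallelism(4)   // this map operator runs with 4 parallel subtasks
   .print().setParallelism(1);                          // the print sink runs with a single subtask

env.execute("parallelism-demo");
```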

2.2.2 Operator chain

The form in which a data flow transmits data between operators can be one of two modes:
1. One-to-one forwarding mode: the data flow keeps its partitioning and the order of its elements. This relationship is similar to a narrow dependency in Spark.
2. Redistributing (repartitioning) mode: the partitioning of the data stream changes; each operator subtask sends data to different downstream target tasks according to the data transmission strategy. This is similar to a shuffle in Spark.
Merging operator chains: in Flink, one-to-one operator operations with the same degree of parallelism can be directly linked together to form one "large" task, so that the original operators become parts of this task. Each task is executed by one thread. This technique is called "operator chaining".
Figure 2.5 Merging operator chains.
Linking operators into tasks is a very effective optimization: it reduces switching between threads and buffer-based data exchange, improving throughput while reducing latency. Flink performs this chain merging by default according to the principles above.
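Chaining can also be influenced per operator. A hedged sketch, assuming the usual Flink imports (the operators themselves are placeholders):

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// env.disableOperatorChaining();                      // turn chaining off for the whole job if needed

env.socketTextStream("localhost", 7777)
   .filter(line -> !line.isEmpty())
   .map(line -> line.trim()).startNewChain()           // start a new chain: not chained to the filter before it
   .map(line -> line.toUpperCase()).disableChaining()  // never chain this operator with its neighbours
   .print();

env.execute("chaining-demo");
```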

2.2.3 Task slot

In order to control the amount of concurrency, we need to clearly divide up the resources that each task running on a TaskManager occupies; each such share is a task slot. A task slot represents a fixed-size subset of the computing resources owned by the TaskManager, and these resources are used to execute one subtask independently.
Figure 2.6 Task slot sharing
When we put resource-intensive and non-intensive tasks into the same slot, they can each take their own share of the resources, which ensures that the heaviest tasks are spread evenly across all TaskManagers. Another benefit of slot sharing is that a complete job pipeline can be kept within a slot, so even if one TaskManager fails and goes down, the other nodes are not affected and the job can continue to run.
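The number of slots per TaskManager is set with the taskmanager.numberOfTaskSlots configuration option. As a small sketch (the group name and operators are illustrative), an operator can also be moved out of the default slot-sharing group so that its subtasks get separate slots:

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.socketTextStream("localhost", 7777)
   .map(line -> line.toUpperCase())
   .slotSharingGroup("heavy")   // subtasks of this map no longer share slots with the "default" group
   .print();

env.execute("slot-sharing-demo");
```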

2.3 DataStream

The DataStream API is the core-layer API of Flink; a Flink program is essentially a series of transformations on DataStreams. The execution of a DataStream program is mainly divided into four steps: obtaining the execution environment, reading the data source, applying transformations, and outputting the result.
Figure 2.7 The four major components of DataStream
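A minimal sketch of these four steps, assuming a text stream on localhost:7777 (the host, port, and job name are illustrative):

```java
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DataStreamSkeleton {
    public static void main(String[] args) throws Exception {
        // 1. Obtain the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Read the data source
        DataStreamSource<String> source = env.socketTextStream("localhost", 7777);

        // 3. Apply transformations
        source.map(line -> line.toUpperCase())
              // 4. Output the result
              .print();

        // Trigger job execution
        env.execute("datastream-skeleton");
    }
}
```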
2.3.1 Get the execution environment (Environment)
1. Create the execution environment StreamExecutionEnvironment:
1) getExecutionEnvironment: returns the right environment based on the current running context. If the program runs standalone, a local execution environment is returned; if it has been packaged into a jar and submitted to a cluster from the command line, the cluster execution environment is returned. This method decides which kind of environment to return according to how the program is being run.
2) createLocalEnvironment: local execution environment
3) createRemoteEnvironment: cluster execution environment
2. Execution mode:
1) Streaming execution mode (Streaming): This is the most classic mode of DataStream API, generally used for unbounded data streams that require continuous real-time processing.
2) Batch execution mode (Batch): An execution mode specially used for batch processing.
3) Automatic mode (AUTOMATIC): the program automatically selects the execution mode based on whether the input data source is bounded.
```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH);
env.execute();
```

2.3.2 Read the data source: the source operator (Source)

Flink can obtain data from various sources and then build a DataStream for conversion processing.
1. Read data from the collection:
```java
List<Integer> data = Arrays.asList(1, 22, 3);
DataStreamSource<Integer> ds = env.fromCollection(data);
```

2. Read data from the file:
```java
FileSource<String> fileSource = FileSource
        .forRecordStreamFormat(new TextLineInputFormat(), new Path("input/word.txt"))
        .build();
env.fromSource(fileSource, WatermarkStrategy.noWatermarks(), "file").print();
```

3. Read data from Socket:
```java
DataStream<String> stream = env.socketTextStream("localhost", 7777);
```

4. Read data from Kafka:
```java
KafkaSource<String> kafkaSource = KafkaSource.<String>builder()
        .setBootstrapServers("hadoop102:9092")
        .setTopics("topic_1")
        .setGroupId("atguigu")
        .setStartingOffsets(OffsetsInitializer.latest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();
DataStreamSource<String> stream = env.fromSource(kafkaSource, WatermarkStrategy.noWatermarks(), "kafka-source");
stream.print("Kafka");
```

2.3.3 Transformation operator (Transformation)

1) Basic transformation operators:
1. map: transforms each element in the data stream to produce a new data stream.
2. filter: evaluates a condition on each element; if it is true the element is passed through, and if it is false the element is filtered out.
3. flatMap: splits each input element (usually a collection type, such as a line of text) into individual elements and emits them one by one (sketched below).
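A short sketch of all three operators on made-up data, assuming the env and imports from the skeleton in 2.3 plus org.apache.flink.util.Collector and org.apache.flink.api.common.typeinfo.Types:

```java
DataStreamSource<String> lines = env.fromElements("hello flink", "", "hello spark");

// map: exactly one output element per input element
lines.map(line -> line.toUpperCase()).print("map");

// filter: keep only the elements for which the predicate returns true
lines.filter(line -> !line.isEmpty()).print("filter");

// flatMap: zero or more output elements per input element; the lambda needs an
// explicit returns(...) hint because the Collector's generic type is erased at compile time
lines.flatMap((String line, Collector<String> out) -> {
    for (String word : line.split(" ")) {
        out.collect(word);
    }
}).returns(Types.STRING).print("flatMap");
```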
2) Aggregation operator:
1. keyBy: a partitioning operation. keyBy logically divides a stream into different partitions by specifying a key; based on the key, the data in the stream is distributed to different partitions.
2. sum/min/max/minBy/maxBy: on a keyed stream, compute a running sum, minimum, or maximum of the specified field.
3. reduce: reduce aggregation. reduce combines each newly arriving element with the current reduced value of its key in an aggregation calculation (sketched below).
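A small sketch of keyBy, sum, and reduce on a made-up stream of (word, count) tuples, assuming the env from the skeleton in 2.3 and org.apache.flink.api.java.tuple.Tuple2:

```java
DataStreamSource<Tuple2<String, Integer>> counts =
        env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 2), Tuple2.of("a", 3));

// keyBy logically partitions the stream by the first field,
// then sum(1) keeps a running total of the second field per key
counts.keyBy(t -> t.f0)
      .sum(1)
      .print("sum");

// reduce combines each newly arriving element with the current reduced value of its key
counts.keyBy(t -> t.f0)
      .reduce((t1, t2) -> Tuple2.of(t1.f0, Math.max(t1.f1, t2.f1)))
      .print("max-by-reduce");
```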
3) User-defined function (UDF):
Users can reimplement the operator logic according to their own needs. User-defined functions are divided into: function classes, anonymous functions, and rich function classes.
1. Function classes: Flink exposes interfaces for all of its user-defined functions, implemented as interfaces or abstract classes such as MapFunction, FilterFunction, and ReduceFunction. Users can define their own function class that implements the corresponding interface.
2. Rich function classes: every Flink function class has a Rich version. Unlike plain function classes, rich function classes can access the context of the running environment and have lifecycle methods, so they can implement more complex functionality. The open() method is the initialization method of a rich function and starts the lifecycle of an operator instance: it is invoked before methods such as map() or filter() are called for the first time. The close() method is the last method called in the lifecycle, similar to a teardown method, and is used for cleanup work (a sketch follows).
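A sketch of a rich function with the lifecycle methods. The class name and logic are made up; the open(Configuration) signature is the classic one used up to Flink 1.17, and the RichMapFunction and Configuration imports are assumed:

```java
public class UpperCaseRichMap extends RichMapFunction<String, String> {

    @Override
    public void open(Configuration parameters) throws Exception {
        // called once per subtask before any record is processed,
        // e.g. to open connections or load resources
        System.out.println("open, subtask " + getRuntimeContext().getIndexOfThisSubtask());
    }

    @Override
    public String map(String value) {
        return value.toUpperCase();
    }

    @Override
    public void close() throws Exception {
        // called once per subtask at the end of the lifecycle, for cleanup work
        System.out.println("close");
    }
}
```

It is used like an ordinary function class, for example stream.map(new UpperCaseRichMap()).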
4) Physical partition operator:
The physical partition operator can improve the parallel processing performance of computing jobs, achieve load balancing, and effectively manage the status of tasks.
1. Random partitioning (shuffle): randomly distributes data to the parallel tasks of the downstream operator.
2. Round-robin partitioning (rebalance): distributes data to the downstream parallel tasks in turn.
3. Rescale partitioning (rescale): also uses a round-robin algorithm underneath, but each upstream task only distributes data, in turn, to a subset of the downstream parallel tasks, i.e. it only deals data round-robin to the members of its own group.
4. Broadcast (broadcast): the data is replicated and sent to every downstream partition, so each element may be processed multiple times (these are sketched below).
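A small sketch applying the four partitioning operators to the same source, again assuming the env from the skeleton in 2.3:

```java
DataStreamSource<String> source = env.socketTextStream("localhost", 7777);

source.shuffle().print("shuffle");       // random partitioning
source.rebalance().print("rebalance");   // round-robin over all downstream parallel subtasks
source.rescale().print("rescale");       // round-robin only within a local group of downstream subtasks
source.broadcast().print("broadcast");   // every downstream subtask receives a copy of every element
```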
5) Splitting operation:
Split a data stream into two or even more completely independent streams: based on one DataStream, define filtering conditions and route the data that meets each condition into the corresponding stream (sketched below).
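A minimal filter-based sketch on a made-up stream of integers, assuming the env from the skeleton in 2.3:

```java
DataStreamSource<Integer> numbers = env.fromElements(1, 2, 3, 4, 5, 6);

// two completely independent streams obtained by filtering the same source
SingleOutputStreamOperator<Integer> evens = numbers.filter(n -> n % 2 == 0);
SingleOutputStreamOperator<Integer> odds  = numbers.filter(n -> n % 2 != 0);

evens.print("even");
odds.print("odd");
```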
6) Convergence operation:
Jointly process data from multiple streams that come from different sources.
1. Union (Union): The data types must be the same. The new stream after merging will include elements in all streams, and the data type remains unchanged.
2. Connect: to make processing more flexible, the connect operation allows the two streams to have different data types. After connecting, each stream still keeps its own data form and remains independent. To obtain a single new DataStream, you further define a "co-processing" transformation that specifies how data of each source and type is processed and converted into a unified output type (both union and connect are sketched below).
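A small sketch of union and connect on made-up streams, assuming the env from the skeleton in 2.3 and the CoMapFunction class from the DataStream API:

```java
DataStreamSource<Integer> intsA = env.fromElements(1, 2, 3);
DataStreamSource<Integer> intsB = env.fromElements(4, 5, 6);
DataStreamSource<String>  words = env.fromElements("a", "b");

// union: all streams must have the same element type; the result contains every element
intsA.union(intsB).print("union");

// connect: the two streams may have different element types; a CoMapFunction
// maps both inputs to one common output type
intsA.connect(words)
     .map(new CoMapFunction<Integer, String, String>() {
         @Override
         public String map1(Integer value) {
             return "int: " + value;
         }

         @Override
         public String map2(String value) {
             return "str: " + value;
         }
     })
     .print("connect");
```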


Origin blog.csdn.net/qq_45972323/article/details/131840535