Flink DataStream system

Preface

This article belongs to the column "Big Data Technology System", which is original work by the author. Please indicate the source when quoting, and feel free to point out any shortcomings or errors in the comment section. Thank you!

For the directory structure and references of this column, please see "Big Data Technology System".


Mind Map

(Figure: mind map of the Flink DataStream system)


Main Text

For Flink, a stream-centric distributed computing engine, the data stream is the core data abstraction: it represents a stream of continuously generated data, similar to the concept of a PCollection in Apache Beam.

In Flink, DataStream is the class used to represent a data stream. DataStream is a logical concept, not a concept of the underlying execution layer.

DataStream defines the common data-processing operation APIs (which are converted into Transformations) and also supports user-defined data-processing functions: when the common operations provided by DataStream do not meet the requirements, the data-processing logic can be customized, as in the sketch below.
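As a minimal sketch (not taken from the original article), the snippet below chains a few built-in operations and drops into a custom ProcessFunction where they are not enough; every call in the chain is recorded internally as a Transformation. The sample data and the length-based labeling logic are illustrative assumptions.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class DataStreamOpsSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Each API call below is recorded as a Transformation in the stream graph.
        DataStream<String> words = env.fromElements("flink", "datastream", "api");

        DataStream<Integer> lengths = words
                .filter(w -> !w.isEmpty())   // built-in operation
                .map(w -> w.length());       // built-in operation

        // When the built-in operations are not enough, custom logic goes into a ProcessFunction.
        DataStream<String> labeled = lengths.process(new ProcessFunction<Integer, String>() {
            @Override
            public void processElement(Integer len, Context ctx, Collector<String> out) {
                out.collect(len > 5 ? "long word" : "short word");
            }
        });

        labeled.print();
        env.execute("DataStream operations sketch");
    }
}
```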

The DataStream system is shown in the figure below.

(Figure: the DataStream system and its related stream types)

DataStreamSource is itself a DataStream. DataStreamSink, AsyncDataStream, BroadcastStream, BroadcastConnectedStream, and QueryableStateStream are all encapsulations of the general DataStream object; they implement specific functionality on top of DataStream. These DataStream types are introduced one by one below.

  1. DataStream: the core abstraction of a Flink data stream. It defines a series of operations on the data stream, as well as the conversion relationships to the other kinds of DataStream. Each DataStream holds a Transformation object, which indicates that this DataStream is derived from its upstream DataStream by applying that Transformation.
  2. DataStreamSource: the starting point of a DataStream. A DataStreamSource is created in the StreamExecutionEnvironment via StreamExecutionEnvironment.addSource(SourceFunction); the SourceFunction contains the concrete logic for reading data from the data source (see the pipeline sketch after this list).
  3. DataStreamSink: data is read from the DataStreamSource and, after a series of processing operations, finally needs to be written to external storage. A DataStreamSink is created via DataStream.addSink(SinkFunction), where the SinkFunction defines the concrete logic for writing data to external storage.
  4. KeyedStream: represents a data stream grouped (partitioned) by the specified key. A KeyedStream is obtained by calling DataStream.keyBy(). Any Transformation applied to a KeyedStream converts it back into a DataStream. In the implementation, KeyedStream writes the key information into its Transformation. Each record can only access the state of its own key, so aggregations on a KeyedStream can conveniently operate on and store the state of the corresponding key.
  5. WindowedStream & AllWindowedStream: a WindowedStream represents a data stream that has been grouped by key and split into windows according to a WindowAssigner. WindowedStream is therefore derived from KeyedStream, and any Transformation applied to a WindowedStream is likewise converted back into a DataStream; AllWindowedStream is the non-keyed counterpart, obtained directly from a DataStream via DataStream.windowAll().
  6. JoinedStreams & CoGroupedStreams: Join is a special case of CoGroup, and JoinedStreams is implemented on top of CoGroupedStreams.
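The pipeline sketch referenced in the list above ties items 2 to 5 together: a DataStreamSource reading from a socket, a KeyedStream produced by keyBy, a WindowedStream produced by a WindowAssigner, and a DataStreamSink produced by print(). The localhost:9999 address, the Tuple2 element type, and the 10-second window size are illustrative assumptions.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class KeyedWindowPipelineSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // DataStreamSource: the starting point of the stream, backed by a SourceFunction.
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        DataStream<Tuple2<String, Integer>> counts = lines
                .map(line -> Tuple2.of(line, 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                // KeyedStream: records are partitioned by key; state is scoped per key.
                .keyBy(t -> t.f0)
                // WindowedStream: each key's elements are split into windows by the WindowAssigner.
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                // The window aggregation converts the WindowedStream back into a DataStream.
                .sum(1);

        // DataStreamSink: print() is a convenience wrapper that adds a print sink.
        counts.print();
        env.execute("Keyed window pipeline sketch");
    }
}
```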

The difference between Join and CoGroup is as follows. CoGroup focuses on the group: it groups the data and operates on the two sets of elements that share the same key, so flexible code can be written to implement specific business logic. Join focuses on pairs of data: it operates on each pair of elements with the same key. CoGroup is more general, but because Join is a common operation in databases, the Join feature is provided on top of CoGroup. Both Join and CoGroup operate on continuously generated data and cannot hold it in memory indefinitely, so performing a Cartesian-product style Join over all the data is not feasible (in theory the data could be spilled to disk when memory runs out, but the repeated disk reads and writes would lead to poor performance). Therefore, at the bottom layer, both are implemented on top of Window, as in the sketch below.
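A hedged sketch of the two APIs side by side (the Tuple2 element type, the key field, and the 5-second window are illustrative assumptions): both are keyed with where/equalTo and evaluated per window, but the Join function is called once per matching pair, while the CoGroup function receives the two complete groups for a key, even when one side is empty.

```java
import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class JoinVsCoGroupSketch {

    // Join: the user function sees each pair of elements that share a key within the window.
    static DataStream<String> joinStreams(DataStream<Tuple2<String, Integer>> left,
                                          DataStream<Tuple2<String, Integer>> right) {
        return left.join(right)
                .where(t -> t.f0)
                .equalTo(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .apply(new JoinFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String>() {
                    @Override
                    public String join(Tuple2<String, Integer> l, Tuple2<String, Integer> r) {
                        return l.f0 + " -> " + (l.f1 + r.f1);
                    }
                });
    }

    // CoGroup: the user function sees the two complete groups for a key within the window,
    // which allows more general logic (for example, outer-join style handling of empty sides).
    static DataStream<String> coGroupStreams(DataStream<Tuple2<String, Integer>> left,
                                             DataStream<Tuple2<String, Integer>> right) {
        return left.coGroup(right)
                .where(t -> t.f0)
                .equalTo(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .apply(new CoGroupFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String>() {
                    @Override
                    public void coGroup(Iterable<Tuple2<String, Integer>> ls,
                                        Iterable<Tuple2<String, Integer>> rs,
                                        Collector<String> out) {
                        int sum = 0;
                        String key = null;
                        for (Tuple2<String, Integer> l : ls) { key = l.f0; sum += l.f1; }
                        for (Tuple2<String, Integer> r : rs) { key = r.f0; sum += r.f1; }
                        if (key != null) {
                            out.collect(key + " -> " + sum);
                        }
                    }
                });
    }
}
```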

  7. ConnectedStreams: represents a combination of two data streams, which may be of the same type or of different types. ConnectedStreams is suitable for operating on two related data streams that share State. A typical scenario is dynamic rule processing: one of the two streams is the data stream, and the other is a stream of business rules that are updated over time. The rules from the rule stream are stored in State and keep updating it; when new data arrives on the data stream, it is processed using the rules stored in State (see the broadcast-state sketch after this list).
  8. BroadcastStream & BroadcastConnectedStream: a BroadcastStream is an encapsulation of an ordinary DataStream that provides broadcast behavior for it. A BroadcastConnectedStream is generally obtained by connecting a DataStream / KeyedStream with a BroadcastStream, similar to ConnectedStreams.
  9. IterativeStream: represents an iterative operation on a DataStream. Logically, a Dataflow that contains an IterativeStream is a directed cyclic graph; at the underlying execution level, Flink gives it special treatment (see the iteration sketch after this list).
  10. AsyncDataStream: a utility that provides the ability to apply asynchronous functions to a DataStream (see the async sketch after this list).
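For items 7 and 8, the sketch below shows the dynamic-rule scenario using the broadcast state pattern; the Rule class, the "rules" state descriptor, and the keyword-matching logic are illustrative assumptions. connect() between the data stream and the BroadcastStream yields a BroadcastConnectedStream: the rule side updates the broadcast state, and the data side reads it.

```java
import java.util.Map;

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class DynamicRulesSketch {

    // Illustrative rule type: flag events that contain the keyword.
    public static class Rule {
        public String name;
        public String keyword;
    }

    static DataStream<String> apply(DataStream<String> events, DataStream<Rule> rules) {
        // Descriptor of the broadcast state that holds the current rules, indexed by rule name.
        MapStateDescriptor<String, Rule> ruleStateDescriptor =
                new MapStateDescriptor<>("rules", String.class, Rule.class);

        // BroadcastStream: every parallel instance receives all rule updates.
        BroadcastStream<Rule> ruleBroadcast = rules.broadcast(ruleStateDescriptor);

        // connect() yields a BroadcastConnectedStream of the data stream and the rule stream.
        return events.connect(ruleBroadcast)
                .process(new BroadcastProcessFunction<String, Rule, String>() {
                    @Override
                    public void processElement(String event, ReadOnlyContext ctx,
                                               Collector<String> out) throws Exception {
                        // Data side: read-only access to the broadcast rules.
                        for (Map.Entry<String, Rule> e :
                                ctx.getBroadcastState(ruleStateDescriptor).immutableEntries()) {
                            if (event.contains(e.getValue().keyword)) {
                                out.collect(e.getKey() + " matched: " + event);
                            }
                        }
                    }

                    @Override
                    public void processBroadcastElement(Rule rule, Context ctx,
                                                        Collector<String> out) throws Exception {
                        // Rule side: update the broadcast state so future events see the new rule.
                        ctx.getBroadcastState(ruleStateDescriptor).put(rule.name, rule);
                    }
                });
    }
}
```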
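For item 9, a minimal IterativeStream sketch (the countdown logic, the sample values, and the 5-second feedback timeout are illustrative assumptions): iterate() opens the loop and closeWith() defines the feedback edge, which is what makes the dataflow a directed cyclic graph.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.IterativeStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IterationSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Long> input = env.fromElements(5L, 10L, 3L);

        // iterate() marks the head of the loop; the timeout lets the loop terminate
        // once no more feedback data arrives for 5 seconds.
        IterativeStream<Long> iteration = input.iterate(5000L);

        DataStream<Long> minusOne = iteration.map(v -> v - 1);

        // closeWith() defines the feedback edge: values still greater than zero loop back.
        DataStream<Long> stillPositive = minusOne.filter(v -> v > 0);
        iteration.closeWith(stillPositive);

        // Values that have reached zero leave the loop and continue downstream.
        DataStream<Long> done = minusOne.filter(v -> v <= 0);
        done.print();

        env.execute("Iteration sketch");
    }
}
```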
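For item 10, a hedged AsyncDataStream sketch; the LookupFunction and its CompletableFuture-based fake lookup stand in for a real asynchronous client, and the timeout and capacity values are illustrative assumptions.

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncLookupSketch {

    // Illustrative async function: pretends to enrich each key via an external lookup.
    public static class LookupFunction extends RichAsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String key, ResultFuture<String> resultFuture) {
            CompletableFuture
                    .supplyAsync(() -> key + " -> looked-up-value")  // stand-in for a real client call
                    .thenAccept(v -> resultFuture.complete(Collections.singleton(v)));
        }
    }

    static DataStream<String> enrich(DataStream<String> keys) {
        // unorderedWait emits results as they complete; orderedWait would preserve input order.
        return AsyncDataStream.unorderedWait(
                keys, new LookupFunction(), 1, TimeUnit.SECONDS, 100 /* max in-flight requests */);
    }
}
```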

Source: blog.csdn.net/Shockang/article/details/132810777