Flink basics

1. Flink knowledge framework

A Flink quick-start PDF is highly recommended for getting started. Link: https://pan.baidu.com/s/1W8Dgq40qmSmcXb110gAJ7Q   extraction code: 1234

Flink: a distributed, high-performance framework that supports both real-time (stream) processing and batch processing

1. Apache Flink is a high-throughput, low-latency distributed processing engine for both streaming data and batch data

 

Like Storm and Spark Streaming, it is positioned as a stream-processing system.

The differences:

– Storm: fast, with low latency, but low throughput; cannot guarantee exactly-once consistency; requires an independent cluster; gradually being abandoned

– Spark Streaming: not truly real-time; higher latency but high throughput; good utilization of YARN resources (micro-batch processing yields a "quasi real-time" effect)

Strictly speaking, it is not a real-time processing engine but a batch engine: each batch is very small and is processed very quickly, which creates the feel of real-time processing.

– Flink: integrates the advantages of the two frameworks above and offers rich time and window semantics for streams

It is stream processing in the true sense: when a piece of data arrives, that one piece of data is processed immediately.
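The contrast between per-event processing (Flink) and micro-batching (Spark Streaming) can be illustrated with a minimal Python sketch. This is a toy simulation of the two processing models, not actual Flink or Spark API code; the function names and the doubling transformation are illustrative assumptions.

```python
from typing import Iterable, List

def process_per_event(events: Iterable[int]) -> List[int]:
    """Per-event (true stream) processing: each record is handled
    as soon as it arrives."""
    results = []
    for e in events:
        results.append(e * 2)  # illustrative transformation, e.g. a map
    return results

def process_micro_batch(events: Iterable[int], batch_size: int) -> List[List[int]]:
    """Micro-batching (Spark Streaming style): records are buffered,
    then each small batch is processed as a unit."""
    results, batch = [], []
    for e in events:
        batch.append(e)
        if len(batch) == batch_size:
            results.append([x * 2 for x in batch])
            batch = []
    if batch:  # flush the final partial batch
        results.append([x * 2 for x in batch])
    return results
```

In the micro-batch version, a record buffered at the start of a batch waits until the batch fills before it is processed; in the per-event version there is no such buffering delay, which is why Flink's latency can be lower at comparable throughput.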

 

2. Dealing with unbounded and bounded data

Any type of data can form a stream of events. Credit card transactions, sensor measurements, machine logs, user interaction records on websites or mobile applications, all these data form a stream.

Data can be processed as unbounded or bounded streams.

  1. An unbounded stream has a defined start but no defined end. It generates data endlessly, so it must be processed continuously: each record needs to be handled as soon as it is ingested. We cannot wait for all the data to arrive before processing, because the input is infinite and will never be complete. Processing unbounded data usually requires ingesting events in a specific order, such as the order in which they occurred, so that the completeness of the results can be reasoned about.
  2. A bounded stream has both a defined start and a defined end. It can be processed after all of its data has been ingested; since all the data in a bounded stream can be sorted, ordered ingestion is not required. Bounded stream processing is usually called batch processing.
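The two models above can be sketched in a few lines of Python. This is a conceptual illustration (not Flink code): the unbounded source is an infinite generator that must be consumed as it produces, while the bounded dataset can be sorted up front before computing. The `limit` parameter exists only so the example terminates.

```python
import itertools
from typing import Iterator, List

def unbounded_source() -> Iterator[int]:
    """An unbounded stream: generates data endlessly, so we can
    never 'wait for all of it' before processing."""
    n = 0
    while True:
        yield n
        n += 1

def process_unbounded(stream: Iterator[int], limit: int) -> List[int]:
    """Unbounded processing: handle events continuously as they are
    ingested. We cut off after `limit` events only to terminate."""
    return [e + 1 for e in itertools.islice(stream, limit)]

def process_bounded(dataset: List[int]) -> List[int]:
    """Bounded (batch) processing: all data is available up front, so
    it can be sorted first; ingestion order does not matter."""
    return [e + 1 for e in sorted(dataset)]
```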

 

What are the component stacks of Flink?

        According to the description on Flink's official website, Flink is a layered architecture: the components in each layer provide specific abstractions that serve the components in the layer above.

From bottom to top, each layer represents:

        1. Deploy layer: this layer concerns Flink's deployment modes. Flink supports multiple deployment modes, including Local, Standalone, Cluster, and Cloud.

        2. Runtime layer: provides the core implementations that support Flink computation, such as distributed stream processing, the mapping from JobGraph to ExecutionGraph, and task scheduling, and offers basic services to the API layer above.

        3. API layer: implements the stream-processing and batch-processing APIs. Stream processing corresponds to the DataStream API, and batch processing corresponds to the DataSet API. In subsequent versions, Flink plans to unify the DataStream and DataSet APIs.

        4. Libraries layer: also called the Flink application-framework layer. Following the division of the API layer, the computing frameworks built on top of it for specific applications are likewise split into stream-oriented and batch-oriented. Stream-oriented support: CEP (complex event processing) and SQL-like operations (table-based relational operations); batch-oriented support: FlinkML (machine learning library) and Gelly (graph processing).
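One of the "rich time and window semantics" the API layer exposes is the tumbling window: fixed-size, non-overlapping time windows. Below is a toy stdlib-Python sketch of the idea behind a tumbling event-time window with a word count, not the Flink DataStream API itself; the event format `(timestamp, word)` and the function name are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def tumbling_window_count(
    events: List[Tuple[int, str]], window_size: int
) -> Dict[int, Dict[str, int]]:
    """Assign each (timestamp, word) event to a fixed, non-overlapping
    time window and count words per window."""
    windows: Dict[int, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for ts, word in events:
        # A tumbling window of size w covers [k*w, (k+1)*w); the event's
        # window is identified by its start time.
        window_start = (ts // window_size) * window_size
        windows[window_start][word] += 1
    return {start: dict(counts) for start, counts in windows.items()}
```

In real Flink the same idea is expressed declaratively on a DataStream (keying the stream, assigning a window, then aggregating), with the runtime handling out-of-order events via watermarks.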

 

What are the roles in a Flink cluster, and what does each do?

     At runtime, a Flink program mainly involves three roles: Client, JobManager, and TaskManager.

  •  JobManager plays the Master role in the cluster and is the coordinator of the whole cluster. It is responsible for receiving Flink jobs, coordinating checkpoints, and performing failure recovery (failover), and it manages the TaskManager slave nodes in the Flink cluster.
  •  TaskManager is the Worker that actually performs the computation; a Flink job is executed as a set of tasks on the TaskManagers. Each TaskManager manages the resources of the node it runs on, such as memory, disk, and network, and reports its resource status to the JobManager at startup.
  •  Client is the client that submits the Flink program. When a user submits a Flink program, a Client is created first. The Client preprocesses the submitted program and then submits it to the Flink cluster: it obtains the JobManager's address from the program's configuration, establishes a connection to the JobManager, and submits the Flink job to it.
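The submission flow described above (Client hands a job to the JobManager, which distributes its tasks across TaskManagers) can be sketched as a toy Python simulation. All class and method names here are illustrative; this models only the division of responsibilities, not Flink's actual RPC, slot, or scheduling machinery.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

class TaskManager:
    """Worker: actually executes the tasks it is assigned."""
    def run_task(self, task: Callable[[], int]) -> int:
        return task()

class JobManager:
    """Coordinator (Master): receives a job and distributes its
    tasks across the TaskManagers it manages."""
    def __init__(self, task_managers: List[TaskManager]):
        self.task_managers = task_managers

    def submit_job(self, tasks: List[Callable[[], int]]) -> List[int]:
        with ThreadPoolExecutor(max_workers=len(self.task_managers)) as pool:
            # Round-robin tasks over the available TaskManagers.
            futures = [
                pool.submit(self.task_managers[i % len(self.task_managers)].run_task, t)
                for i, t in enumerate(tasks)
            ]
            return [f.result() for f in futures]

class Client:
    """Client: connects to the JobManager and submits the job."""
    def __init__(self, job_manager: JobManager):
        self.job_manager = job_manager

    def submit(self, tasks: List[Callable[[], int]]) -> List[int]:
        return self.job_manager.submit_job(tasks)
```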

 

 


Origin blog.csdn.net/qq_36816848/article/details/114260688