Big Data Development: Introduction to Flink (4) - Programming Model

Flink is an open-source big data stream processing framework that supports both batch and stream processing, with fault tolerance, high throughput, and low latency. This article gives a brief overview of Flink's programming model.

Data set types:

  • Infinite data sets: collections of data that are appended continuously and never end

  • Bounded data sets: finite data sets that do not change

Common infinite data sets are:

  • Real-time interaction data between users and client applications

  • Logs generated by the application in real time

  • Real-time transaction records in financial markets

What are the data operation models?

  • Streaming: the computation runs continuously, for as long as data keeps being produced

  • Batch processing: the computation runs over a pre-defined, finite input and releases compute resources when it completes

Flink can process both bounded and unbounded data sets, in either streaming or batch mode.
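To make the distinction concrete, here is a minimal sketch in Java, assuming the Flink 1.x DataStream API (the socket host and port are placeholders), that consumes a bounded collection and an unbounded socket stream with the same API:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedVsUnbounded {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Bounded: a finite, fixed data set -- this part finishes once consumed
        env.fromElements("a", "b", "c").print();

        // Unbounded: an endless stream -- this keeps the job running until cancelled
        env.socketTextStream("localhost", 9999).print();

        env.execute("BoundedVsUnbounded");
    }
}
```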

What is Flink?

From bottom to top:

1. Deployment: Flink can run locally, in a standalone cluster, in a cluster managed by YARN or Mesos, or in the cloud.

2. Runtime: Flink's core is a distributed streaming dataflow engine, which processes data one event at a time.

3. APIs: the DataStream, DataSet, Table, and SQL APIs.

4. Libraries: Flink also ships dedicated libraries for complex event processing, machine learning, graph processing, and Apache Storm compatibility.

Flink data flow programming model

Levels of abstraction

Flink provides different levels of abstraction for developing streaming or batch applications.

The lowest level of abstraction offers stateful streaming. It is embedded in the DataStream API via process functions, which let users freely process events from one or more streams while using consistent, fault-tolerant state. In addition, users can register event-time and processing-time callbacks, allowing programs to implement complex computations.
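As an illustration of this lowest layer, here is a minimal sketch, assuming the Flink 1.x Java API, of a KeyedProcessFunction that keeps fault-tolerant per-key state and registers a processing-time timer callback (the class, state, and timeout names are illustrative, not from any particular codebase):

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Counts events per key and emits the count one minute after each event.
public class CountWithTimeout extends KeyedProcessFunction<String, String, String> {

    private transient ValueState<Long> countState; // consistent, fault-tolerant state

    @Override
    public void open(Configuration parameters) {
        countState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out)
            throws Exception {
        Long count = countState.value();
        countState.update(count == null ? 1L : count + 1);
        // register a callback that fires 60 seconds from now
        ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + 60_000);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out)
            throws Exception {
        // the registered callback fires here
        out.collect(ctx.getCurrentKey() + ": " + countState.value());
    }
}
```

It would be applied to a keyed stream with something like `stream.keyBy(e -> e).process(new CountWithTimeout())`.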

The DataStream / DataSet APIs are the core APIs Flink provides. The DataSet API handles bounded data sets, while the DataStream API handles bounded or unbounded data streams. Users can transform and aggregate the data with a rich set of operations (map / flatMap / window / keyBy / sum / max / min / join, etc.).
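For example, a classic streaming word count combines several of these core operations. The sketch below assumes the Flink 1.x Java DataStream API; the socket host and port are placeholders:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class WordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Integer>> counts = env
                .socketTextStream("localhost", 9999)           // unbounded source
                .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        out.collect(Tuple2.of(word, 1));
                    }
                })
                .returns(Types.TUPLE(Types.STRING, Types.INT)) // type hint for the lambda
                .keyBy(t -> t.f0)                              // partition by word
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .sum(1);                                       // aggregate the counts

        counts.print();
        env.execute("WordCount");
    }
}
```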

The Table API is a declarative DSL centered on tables, which may change dynamically (when they represent streaming data). It provides operations such as select, project, join, group-by, and aggregate, and is more concise to use (less code).

You can convert seamlessly between a Table and a DataStream / DataSet, and a program may freely mix the Table API with the DataStream and DataSet APIs.
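Here is a minimal sketch of such a conversion, assuming the Java Table API bridge of Flink 1.11+ (the exact setup varies between Flink versions):

```java
import static org.apache.flink.table.api.Expressions.$;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class TableDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        DataStream<Tuple2<String, Integer>> stream = env.fromElements(
                Tuple2.of("flink", 1), Tuple2.of("spark", 1), Tuple2.of("flink", 1));

        // DataStream -> Table, naming the fields
        Table words = tEnv.fromDataStream(stream, $("word"), $("cnt"));

        // Declarative group-by and aggregate
        Table counts = words
                .groupBy($("word"))
                .select($("word"), $("cnt").sum().as("total"));

        // Table -> DataStream (a retract stream, since the aggregate updates)
        tEnv.toRetractStream(counts, Row.class).print();
        env.execute();
    }
}
```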

The highest level of abstraction Flink provides is SQL. This layer is similar to the Table API in syntax and expressiveness, but formulates programs as SQL query expressions. The SQL abstraction interacts closely with the Table API, and SQL queries can run directly on tables defined with the Table API.
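For instance, the aggregation above can be expressed as a SQL query over a registered view. This sketch assumes Flink 1.11+ (for fromValues); the table and column names are illustrative:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class SqlDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Register a small in-memory table as a view
        Table words = tEnv.fromValues("flink", "spark", "flink").as("word");
        tEnv.createTemporaryView("words", words);

        // Run a standard SQL query directly on the view
        Table result = tEnv.sqlQuery(
                "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word");

        tEnv.toRetractStream(result, Row.class).print();
        env.execute();
    }
}
```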

Flink program and data flow structure

A Flink application is structured as follows:

  • Source: the data source. Flink's sources for stream and batch processing fall into roughly four types: collection-based sources, file-based sources, socket-based sources, and custom sources. Common custom sources include Apache Kafka, Amazon Kinesis Streams, RabbitMQ, Twitter Streaming API, Apache NiFi, etc. Of course, you can also define your own source.

  • Transformation: the operations that transform the data, including Map / FlatMap / Filter / KeyBy / Reduce / Fold / Aggregations / Window / WindowAll / Union / Window Join / Split / Select / Project, etc. Many operations are available to shape the data into exactly what you want.

  • Sink: the receiver, to which Flink sends the transformed and computed data, typically for storage. Flink's common sinks include writing to files, printing to standard output, writing to sockets, and custom sinks. Common custom sinks include Apache Kafka, RabbitMQ, MySQL, Elasticsearch, Apache Cassandra, Hadoop FileSystem, etc. Likewise, you can define your own sink. A minimal end-to-end sketch of this structure follows the list.
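Putting the three parts together, the sketch below wires a Kafka source through a simple transformation into a Kafka sink. It assumes the flink-connector-kafka dependency and a Flink 1.x release where FlinkKafkaConsumer / FlinkKafkaProducer are available; the topic names and bootstrap server are placeholders:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class KafkaPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "flink-demo");

        // Source: read raw events from a Kafka topic
        DataStream<String> source = env.addSource(
                new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props));

        // Transformation: normalize and filter the events
        DataStream<String> transformed = source
                .map(String::trim)
                .filter(s -> !s.isEmpty());

        // Sink: write the results back to another Kafka topic
        transformed.addSink(
                new FlinkKafkaProducer<>("output-topic", new SimpleStringSchema(), props));

        env.execute("KafkaPipeline");
    }
}
```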
