Introduction to Flink

One. Introduction to Flink

In recent years, with the rapid development of big data, many popular open source communities have emerged, including the well-known Hadoop and Storm and, later, Spark, each with its own dedicated application scenarios. Spark pioneered in-memory computing and bet on memory to win its rapid growth. Spark's popularity has more or less overshadowed other distributed computing systems. Flink, for example, was quietly developing during the same period.

In some communities abroad, many people divide big data computing engines into four generations, though of course many others disagree. Let's accept this classification for now and use it as a basis for discussion.

The first generation of computing engines is, without doubt, Hadoop's MapReduce. Most readers will be familiar with MapReduce: it splits a computation into two phases, Map and Reduce. Upper-layer applications therefore have to find ways to split their algorithms to fit this model, and may even have to chain several Jobs in the application layer to complete a full algorithm, such as an iterative computation.
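To make the two phases concrete, here is a minimal sketch of a word count written in the Map and Reduce style. It is plain Python, not Hadoop code; the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are made up for this illustration.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)

def run_job(lines):
    """One Job = one Map phase followed by one Reduce phase."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

counts = run_job(["flink storm spark", "spark flink", "flink"])
print(counts)
```

An algorithm that needs more than one such pass has to call `run_job` repeatedly from the application, which is the drawback described above.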

These drawbacks gave rise to frameworks that support DAGs (directed acyclic graphs). DAG-supporting frameworks are therefore classified as the second generation of computing engines, for example Tez and, at a higher layer, Oozie. We will not look closely at the differences between the various DAG implementations here, but Tez and Oozie at that time were still mostly used for batch tasks.
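As a rough illustration of what "DAG support" means, the sketch below describes a job as a graph of stages and asks for a valid execution order. It uses Python's standard `graphlib` module (Python 3.9+), not Tez or Oozie; the stage names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical job: each stage maps to the set of stages it depends on.
dag = {
    "parse": {"read"},
    "lookup": {"read"},
    "filter": {"parse"},
    "join": {"filter", "lookup"},
    "write": {"join"},
}

# A DAG engine may run stages in any order that respects the dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

The whole pipeline is described once, instead of being split by hand into a chain of separate Map and Reduce Jobs.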

Next comes the third generation of computing engines, represented by Spark. Its main characteristics are DAG support within a Job (without crossing Jobs) and an emphasis on real-time computing. Many people would also argue that third-generation engines can run batch Jobs well.

The arrival of third-generation computing engines drove the rapid development of upper-layer applications, such as better performance for iterative computation and support for stream computing and SQL. Flink was born later and is classified as the fourth generation. This is mainly because of Flink's support for stream computing and its further step forward in real-time processing. Of course, Flink also supports batch tasks and DAG operations.

Two. Flink overview

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink is designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale.

1. Unbounded and bounded streams

Any kind of data is produced as a stream of events. Credit card transactions, sensor measurements, machine logs, and user interactions on a website or mobile application are all generated as streams.

Data can be processed as unbounded or bounded streams.

  1. Unbounded streams have a defined start but no defined end. They do not terminate and provide data as it is generated. Unbounded streams must be processed continuously, that is, events must be handled promptly after they are ingested. It is not possible to wait for all input data to arrive, because the input is unbounded and will not be complete at any point in time. Processing unbounded data often requires that events are ingested in a specific order, such as the order in which they occurred, so that the completeness of the results can be reasoned about.

  2. Bounded streams have a defined start and end. They can be processed by ingesting all the data before performing any computation. Ordered ingestion is not required, because a bounded data set can always be sorted. Processing of bounded streams is also known as batch processing.

Apache Flink excels at processing both unbounded and bounded data sets. Precise control of time and state enables Flink's runtime to run any kind of application on unbounded streams. Bounded streams are processed internally by algorithms and data structures that are specifically designed for fixed-size data sets, which yields excellent performance.
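The difference between the two kinds of streams can be sketched in a few lines of plain Python. This is not Flink code; `sensor` is a hypothetical endless source, and events are assumed to be `(timestamp, value)` pairs.

```python
import itertools

def bounded_total(events):
    """Batch style: the input is finite, so it can be read fully and sorted first."""
    ordered = sorted(events, key=lambda e: e[0])  # sort by timestamp
    return sum(value for _, value in ordered)

def unbounded_running_total(events):
    """Streaming style: emit an updated result after every event, never wait for the end."""
    total = 0
    for _, value in events:
        total += value
        yield total

def sensor():
    """Hypothetical endless source: yields (timestamp, value) forever."""
    for t in itertools.count():
        yield (t, t % 3)

batch_result = bounded_total([(2, 10), (0, 5), (1, 7)])
# Only a prefix of the endless stream can ever be observed.
first_five = list(itertools.islice(unbounded_running_total(sensor()), 5))
print(batch_result, first_five)
```

Calling `sorted` on `sensor()` would never return, which is why unbounded input has to be handled event by event.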

2. Deploy applications anywhere

Apache Flink is a distributed system and requires compute resources to execute applications. Flink integrates with all common cluster resource managers, such as Hadoop YARN, Apache Mesos, and Kubernetes, but can also be set up to run as a standalone cluster.

Flink is designed to work well with each of the resource managers listed above. This is achieved through resource-manager-specific deployment modes that allow Flink to interact with each resource manager in its idiomatic way.

When a Flink application is deployed, Flink automatically identifies the required resources based on the application's configured parallelism and requests them from the resource manager. If a failure occurs, Flink replaces the failed container by requesting new resources. All communication to submit or control an application happens via REST calls. This eases the integration of Flink in many environments.

3. Run applications at any scale

Flink is designed to run stateful streaming applications at any scale. Applications are parallelized into possibly thousands of tasks that are distributed and executed concurrently in a cluster. An application can therefore use virtually unlimited amounts of CPU, main memory, disk, and network IO. Moreover, Flink easily maintains very large application state. Its asynchronous and incremental checkpointing algorithm ensures minimal impact on processing latency while guaranteeing exactly-once state consistency.

Users have reported impressive scalability numbers for Flink applications running in their production environments, such as

  • applications processing trillions of events per day,
  • applications maintaining multiple terabytes of state, and
  • applications running on thousands of cores.

4. Leverage in-memory performance

Stateful Flink applications are optimized for local state access. Task state is always kept in memory or, if the state size exceeds the available memory, in access-efficient on-disk data structures. Tasks therefore perform all computations by accessing local, usually in-memory, state, which yields very low processing latency. Flink guarantees state consistency in case of failures by periodically and asynchronously checkpointing the local state to durable storage.
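The idea of local state plus periodic checkpoints can be sketched as follows. This is a simplified model, not Flink's actual algorithm: the snapshot is taken synchronously, "durable storage" is just a Python attribute, and the class `CountingTask` is invented for this example.

```python
import copy

class CountingTask:
    """Keeps per-key counts in local memory and snapshots them periodically."""

    def __init__(self, checkpoint_interval):
        self.state = {}          # local, in-memory task state
        self.offset = 0          # number of input events reflected in the state
        self.checkpoint_interval = checkpoint_interval
        self.durable = ({}, 0)   # stand-in for durable storage: (state, offset)

    def process(self, key):
        self.state[key] = self.state.get(key, 0) + 1
        self.offset += 1
        if self.offset % self.checkpoint_interval == 0:
            # Persist state and input position together so they stay consistent.
            self.durable = (copy.deepcopy(self.state), self.offset)

    def recover(self):
        """After a failure, roll back to the last snapshot and return the replay position."""
        state, offset = self.durable
        self.state, self.offset = copy.deepcopy(state), offset
        return offset

events = ["a", "b", "a", "c", "a", "b", "a"]
task = CountingTask(checkpoint_interval=3)
for key in events[:5]:
    task.process(key)            # a snapshot is taken after the third event

task.state = {}                  # simulate a crash that wipes local memory
resume_from = task.recover()
for key in events[resume_from:]:
    task.process(key)            # replay everything after the snapshot
print(resume_from, task.state)
```

Because the snapshot stores the state together with the input position, replaying from that position counts every event exactly once in the final state, even though events 4 and 5 were processed twice.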

5. Flink architecture

Flink supports fast local iteration as well as certain loop-style iterative tasks. Flink also implements its own customized memory management. Comparing Flink with Spark on this point, Flink does not hand memory over entirely to the application layer. This is why Spark is more prone to OOM (out of memory) errors than Flink. In terms of the framework itself and its application scenarios, Flink is more similar to Storm. Readers who already know Storm or Flume may find many of Flink's concepts and its architecture easier to understand. Let us first look at Flink's architecture diagram.

 

From the diagram we can pick out Flink's most basic concepts: Client, JobManager, and TaskManager. The Client submits tasks to the JobManager, the JobManager distributes tasks to the TaskManagers for execution, and the TaskManagers then report task status through heartbeats. At this point some readers may feel as if they were back in first-generation Hadoop. Indeed, judging from the architecture diagram, the JobManager resembles the JobTracker of that time and the TaskManager resembles the TaskTracker. The most important difference, however, is that data between TaskManagers flows as a stream (Stream). Second, in first-generation Hadoop there is a Shuffle only between Map and Reduce, whereas in Flink there may be many stages, and data is transferred both within a TaskManager and between TaskManagers, unlike Hadoop's fixed Map-to-Reduce pattern.

Three. Flink technical characteristics

1. Stream processing characteristics

Supports high-throughput, low-latency, high-performance stream processing

Supports window (Window) operations with event time

Supports stateful computation with exactly-once semantics

Supports highly flexible window (Window) operations based on time, count, session, and data-driven triggers

Supports a continuous streaming model with backpressure

Supports fault tolerance based on lightweight distributed snapshots (Snapshot)

A single runtime supports both batch-on-streaming processing and streaming processing

Flink implements its own memory management inside the JVM

Supports iterative computation

Supports automatic program optimization: avoiding expensive operations such as shuffles and sorts in certain cases, and caching intermediate results when necessary
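To show what a window operation with event time does, here is a minimal sketch of a tumbling (fixed-size, non-overlapping) window count in plain Python. It is not the Flink API; the function name `tumbling_window_counts` and the sample events are made up, and watermarks and late-data handling are left out.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Count events per key in fixed, non-overlapping event-time windows.

    events: iterable of (event_time, key); they may arrive out of order.
    Returns {(window_start, key): count}.
    """
    counts = defaultdict(int)
    for event_time, key in events:
        # The window is chosen by the time carried in the event, not by arrival order.
        window_start = event_time - (event_time % window_size)
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(1, "click"), (12, "view"), (4, "click"), (9, "view"), (11, "click")]
result = tumbling_window_counts(events, window_size=10)
print(result)
```

The event with time 9 arrives after the event with time 12 but is still counted in the window starting at 0, which is the point of using event time rather than processing time.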

2. API support

For streaming applications, Flink provides the DataStream API

For batch applications, Flink provides the DataSet API (with Java / Scala support)

3. Libraries support

Support for machine learning (FlinkML)

Support for graph analysis (Gelly)

Support for relational data processing (Table)

Support for complex event processing (CEP)

4. Integration support

Support for running Flink on YARN

Support for HDFS

Support for reading input data from Kafka

Support for Apache HBase

Support for Hadoop programs

Support for Tachyon

Support for ElasticSearch

Support for RabbitMQ

Support for Apache Storm

Support for S3

Support for XtreemFS

5. The Flink ecosystem

Flink first supported Scala and Java APIs, and a Python API is also being tested. Flink supports graph operations through Gelly and machine learning through FlinkML. Table is an interface for SQL-style support; it is support at the API level, not parsing and execution of SQL text. For the complete stack, refer to the diagram.

 

To support the big data ecosystem more broadly, Flink's sub-projects implement many Connectors. The most familiar is, of course, the integration with Hadoop HDFS. Flink has also announced support for Tachyon, S3, and MapRFS. Support for Tachyon and S3, however, is implemented through the Hadoop HDFS wrapper layer, which means that using Tachyon or S3 requires Hadoop and also requires changing the Hadoop configuration (core-site.xml). Browsing Flink's code directory reveals more Connector projects, such as Flume and Kafka.

Four. Flink programming model

Flink provides different levels of abstraction for developing streaming and batch applications.

[Figure: programming abstraction levels]


Origin blog.csdn.net/truelove12358/article/details/103999983