[Introduction to Apache Flink]

1. Introduction to Apache Flink

Apache Flink® is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.

Flink is an open-source processing engine for both batch and streaming data, and it has grown into one of the top-level projects of the ASF. At its core is a streaming dataflow engine that provides data distribution and parallel computation. On top of this core, Flink already offers higher-level APIs, including SQL-style queries as well as graph processing and machine-learning algorithms.

Flink is a distributed processing engine for streaming and batch data. It is implemented mainly in Java and, at present, is developed primarily through contributions from its open-source community. Flink's main target scenario is streaming data; batch data is treated as just a special case of a stream. In other words, Flink processes every job as a stream, which is its most distinctive feature.

Flink supports fast local iteration as well as cyclic (iterative) jobs, and it provides its own memory management: compared with Spark, Flink does not hand memory entirely over to the application layer, which is why Spark is more prone to OutOfMemory (OOM) errors than Flink. In terms of the framework itself and its application scenarios, Flink is closer to Storm; if you already know Storm or Flume, Flink's architecture and many of its concepts may be easier to understand.
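
To make the "everything is a stream" model concrete, here is a minimal sketch of a streaming word count with Flink's DataStream API. It assumes a recent Flink release with flink-streaming-java on the classpath and a text source listening on localhost:9999 (for example nc -lk 9999); the class name and port are illustrative.

```java
// Minimal streaming word count sketch (assumptions: recent Flink DataStream API,
// a text source on localhost:9999).
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        // Single entry point for both local and cluster execution.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Unbounded source: each line arriving on the socket is one event in the stream.
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        lines
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            out.collect(Tuple2.of(word, 1));
                        }
                    }
                }
            })
            .keyBy(t -> t.f0)   // group by word
            .sum(1)             // running count per word, updated as the stream flows
            .print();

        env.execute("Streaming WordCount");
    }
}
```

The job keeps running and updating counts for as long as the socket stays open; a bounded (batch) input would simply be a stream that eventually ends.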



 

2. Features of Apache Flink

Apache Flink, as a new stream processing system, is characterized by:

1. Low-latency stream processing

2. Rich APIs that let programmers quickly develop streaming-data applications

3. Flexible operator state and streaming windows

4. Efficient fault tolerance for streams and state (see the sketch after this list)
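
As a sketch of points 3 and 4 above, the following example (assuming the DataStream API of a recent Flink release; the socket source and the state name are illustrative) keeps a per-key counter in Flink-managed ValueState and enables periodic checkpointing so that the counter survives failures:

```java
// Keyed state + checkpointing sketch (assumptions: recent Flink DataStream API,
// a stream of user ids arriving on localhost:9999).
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class StatefulCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a consistent snapshot of all operator state every 10 seconds;
        // on failure Flink restores the last snapshot and replays the source.
        env.enableCheckpointing(10_000);

        DataStream<String> userIds = env.socketTextStream("localhost", 9999);

        userIds
            .keyBy(id -> id)
            .process(new KeyedProcessFunction<String, String, String>() {
                private transient ValueState<Long> count;

                @Override
                public void open(Configuration parameters) {
                    count = getRuntimeContext().getState(
                        new ValueStateDescriptor<>("count", Long.class));
                }

                @Override
                public void processElement(String id, Context ctx, Collector<String> out)
                        throws Exception {
                    Long current = count.value();
                    long updated = (current == null) ? 1L : current + 1L;
                    count.update(updated);   // state is included in checkpoints automatically
                    out.collect(id + " seen " + updated + " times");
                }
            })
            .print();

        env.execute("Stateful count with checkpointing");
    }
}
```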

 

 

3. Apache Flink Ecosystem

For a computing framework to develop over the long term, it must build a complete stack; otherwise it remains theory on paper. Only when the upper layers carry concrete applications that can exploit the strengths of the underlying engine will the framework attract more resources and progress faster. Flink is therefore also working hard to build out its own stack.

 

4. A Brief Description of Scheduling in Flink

In a Flink cluster, computing resources are defined as Task Slots. Each TaskManager has one or more slots, and the JobManager schedules tasks in units of slots. The "task" here, however, is not the same as a task in Hadoop: Flink's JobManager schedules an entire pipeline as a single task rather than scheduling each stage independently. In Hadoop, for example, Map and Reduce are two separately scheduled tasks, each occupying its own computing resources; in Flink, an equivalent Map-Reduce pipeline is one task occupying a single slot. Likewise, a Map-Reduce-Reduce (MRR) pipeline is scheduled as one pipeline task in Flink. A TaskManager runs as many pipelines concurrently as it has slots.
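
A minimal sketch of how this maps to the API, assuming a recent DataStream release (the job itself is a trivial placeholder): the job parallelism determines how many parallel pipeline instances, and therefore how many slots, the job occupies, and operators with the same parallelism are typically chained into one pipelined task.

```java
// Parallelism vs. slots sketch (assumptions: recent Flink DataStream API).
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SlotExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Four parallel pipeline instances; the cluster must offer at least four slots,
        // e.g. two TaskManagers configured with taskmanager.numberOfTaskSlots: 2 each.
        env.setParallelism(4);

        env.socketTextStream("localhost", 9999)   // socket source itself runs with parallelism 1
           .map(String::toUpperCase)              // map and the print sink are typically chained
           .print();                              // into one pipelined task per slot

        env.execute("Slot / pipeline example");
    }
}
```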

In Flink's standalone deployment mode this is relatively easy to understand, because Flink itself must do at least simple management of its computing resources (slots). When Flink is deployed on YARN, however, it does not scale back this resource management; in other words, Flink is then doing work that YARN is supposed to do. From a design standpoint, I do not think this makes sense: if YARN's containers cannot fully isolate CPU resources, then configuring multiple slots on a Flink TaskManager can lead to unfair use of resources. If Flink wants to share computing resources well with other frameworks in a data center, it should try not to interfere with how those resources are allocated and defined.

 

 

5. Flink Deployment

Flink has three deployment modes: Local, Standalone Cluster, and YARN Cluster. In Local mode, the JobManager and TaskManager share a single JVM to run the workload, which makes it the most convenient way to verify a simple application. In practice, Standalone or YARN clusters are used most of the time.
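
For the Local mode described above, here is a short sketch (DataStream API assumed; the job is a trivial placeholder) of how the JVM-local environment is created:

```java
// Local mode sketch: JobManager and TaskManager run embedded in this JVM.
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LocalModeExample {
    public static void main(String[] args) throws Exception {
        // Embedded mini cluster in the current JVM, convenient for quick verification.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();

        DataStream<Long> numbers = env.fromElements(1L, 2L, 3L, 4L);
        numbers.filter(n -> n % 2 == 0).print();

        env.execute("Local mode example");
    }
}
```

Switching the same program to a Standalone or YARN cluster typically only means using getExecutionEnvironment() instead and submitting the packaged jar to the cluster.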

 

 

Flink is well-suited for:

  1. A variety of (sometimes unreliable) data sources: When data is generated by millions of different users or devices, it’s safe to assume that some events will arrive out of the order in which they actually occurred, and in the case of more significant upstream failures, some events might arrive hours later than they should. This late data needs to be handled so that results stay accurate (see the sketch after this list).

  2. Applications with state: When applications become more complex than simple filtering or enhancing of single data records, managing state within these applications (e.g., counters, windows of past data, state machines, embedded databases) becomes hard. Flink provides tools so that state is efficient, fault tolerant, and manageable from the outside so you don’t have to build these capabilities yourself.

  3. Data that is processed quickly: There is a focus in these use cases on real-time or near-real-time scenarios, where insights from data should be available at nearly the same moment that the data is generated. Flink is fully capable of meeting these latency requirements when necessary.

  4. Data in large volumes: These programs would need to be distributed across many nodes (in some cases, thousands) to support the required scale. Flink can run on large clusters just as seamlessly as it runs on small ones.
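
As a sketch of point 1 in the list above (out-of-order and late data), the following example assumes a recent DataStream API release; the Event type, its fields, and the placeholder source are hypothetical, and only the watermark, window, and lateness calls are the point:

```java
// Out-of-order and late data sketch (assumptions: recent Flink DataStream API,
// hypothetical Event type and placeholder source).
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class LateDataExample {

    // Hypothetical event type: who did something and when it actually happened.
    public static class Event {
        public String userId;
        public long timestampMillis;
        public Event() {}
        public Event(String userId, long timestampMillis) {
            this.userId = userId;
            this.timestampMillis = timestampMillis;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder source; in practice this would be Kafka, Kinesis, etc.
        DataStream<Event> events = env.fromElements(
                new Event("alice", 1_000L), new Event("bob", 2_000L));

        events
            // Tolerate events that arrive up to 5 minutes out of order.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                    .withTimestampAssigner((event, ts) -> event.timestampMillis))
            // Count events per user in 1-hour event-time windows.
            .map(new MapFunction<Event, Tuple2<String, Integer>>() {
                @Override
                public Tuple2<String, Integer> map(Event e) {
                    return Tuple2.of(e.userId, 1);
                }
            })
            .keyBy(t -> t.f0)
            .window(TumblingEventTimeWindows.of(Time.hours(1)))
            // Keep windows around for an extra hour so late arrivals still update results.
            .allowedLateness(Time.hours(1))
            .sum(1)
            .print();

        env.execute("Late data example");
    }
}
```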
