How to read, write, and modify the state of Flink applications

This article takes a practical look at Flink state management.

Whether you have used Apache Flink in production or in research, you have likely faced the question: how can you access and update the state stored in a Flink savepoint? Apache Flink 1.9 introduces the State Processor API, a powerful extension of the DataSet API that allows reading, writing, and modifying the state in Flink savepoints and checkpoints.

This article explains why this feature is a big step for Flink, what you can use it for, and how to use it. Finally, we will discuss the future plans for the State Processor API and how it fits into Flink's overall roadmap for unified batch and stream processing.

Stateful stream processing before Flink 1.9

Almost all non-trivial stream processing applications are stateful, and most of them are designed to run for months or even years. Over time, these jobs accumulate a lot of valuable state that can be very costly, or even impossible, to rebuild if it is lost due to a failure. To guarantee the consistency and durability of application state, Flink featured a sophisticated checkpointing and recovery mechanism from the very beginning. With every release, the Flink community has added more and more state-related features to improve checkpointing and recovery speed and to ease the maintenance and management of applications.

However, Flink users have often asked for ways to access an application's state "from the outside". The motivation might be to validate or debug the state of an application, to migrate the state to another application, or to bootstrap an application's initial state from an external system (for example, a relational database).

Although these requirements are reasonable, the ability to access application state from the outside has so far been very limited. Flink's queryable state feature only supports key lookups (point queries) and does not guarantee the consistency of the returned values (the value returned before and after an application recovers from a failure may differ); moreover, queryable state can only be used to read state, not to modify it. In addition, savepoints, which are consistent snapshots of an application's state, were inaccessible, because they are encoded in a custom binary format.

Reading and writing application state with the State Processor API

Flink 1.9 introduces the State Processor API, which truly changes this situation and makes it possible to operate on application state. This feature extends the DataSet API with input and output formats for reading and writing savepoint or checkpoint data. Because the DataSet and Table APIs are interoperable, users can even use relational Table API or SQL queries to analyze and process state data.

  • For example, users can take a savepoint of a running stream processing application and analyze it with a DataSet batch program to verify that the application behaves correctly (a code sketch of this follows the list).
  • Alternatively, users can read a batch of data from any store, preprocess it, and write the result to a savepoint that is used to bootstrap the initial state of a stream processing application.
  • It is also possible now to fix entries in a savepoint that are in an inconsistent state.
  • Finally, the State Processor API opens up many ways to evolve a stateful application that were previously blocked by the restrictions needed to guarantee a safe recovery: users can now arbitrarily modify the data types of state, adjust the maximum parallelism of operators, split or merge operator state, reassign operator UIDs, and so on.
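
As a minimal sketch of the first use case, the following Java snippet loads a savepoint and reads an operator list state into a DataSet that can then be analyzed with regular batch operators. The savepoint path, operator UID ("src"), state name ("os1"), and state backend are illustrative placeholders, not values taken from this article.

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.runtime.state.memory.MemoryStateBackend;
    import org.apache.flink.state.api.ExistingSavepoint;
    import org.apache.flink.state.api.Savepoint;

    public class ReadSavepoint {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Load an existing savepoint; the state backend must match the one
            // used by the streaming job that wrote the savepoint.
            ExistingSavepoint savepoint =
                Savepoint.load(env, "hdfs:///savepoints/myapp-1", new MemoryStateBackend());

            // Read the operator (list) state "os1" of the operator with UID "src".
            DataSet<Long> os1 = savepoint.readListState("src", "os1", Types.LONG);

            // Any DataSet transformation can be applied here; print() triggers execution.
            os1.print();
        }
    }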

Mapping application state to datasets

The State Processor API maps the state of a streaming application to one or more datasets that can be processed separately. To be able to use the API, you need to understand how this mapping works.

First, let's look at what the state of a stateful Flink job looks like. A Flink job is composed of operators, typically one or more source operators, a few operators for the actual processing, and one or more sink operators. Each operator runs in parallel in one or more tasks and can work with different types of state. An operator can have zero, one, or more operator states, which are organized as lists scoped to the operator's tasks. If the operator is applied to a keyed stream, it can also have zero, one, or more keyed states, which are scoped to a key that is extracted from each processed record. You can think of keyed state as a distributed key-value map.
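
To make the two kinds of state concrete, here is a small sketch (not taken from this article; all names are illustrative) of a function that uses both a keyed ValueState and an operator ListState. It must be applied to a keyed stream, e.g. stream.keyBy(t -> t.f0).flatMap(new StatefulCounter()):

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ListState;
    import org.apache.flink.api.common.state.ListStateDescriptor;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.runtime.state.FunctionInitializationContext;
    import org.apache.flink.runtime.state.FunctionSnapshotContext;
    import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
    import org.apache.flink.util.Collector;

    // Counts records per key (keyed state) and per parallel task (operator state).
    public class StatefulCounter
            extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>>
            implements CheckpointedFunction {

        private transient ValueState<Long> perKeyCount; // keyed state, scoped to a key
        private transient ListState<Long> taskTotal;    // operator state, scoped to a task
        private long localTotal;

        @Override
        public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out)
                throws Exception {
            Long current = perKeyCount.value();         // value for the key of `in`
            long updated = (current == null ? 0L : current) + 1L;
            perKeyCount.update(updated);
            localTotal++;                               // task-local counter
            out.collect(Tuple2.of(in.f0, updated));
        }

        @Override
        public void initializeState(FunctionInitializationContext ctx) throws Exception {
            perKeyCount = ctx.getKeyedStateStore().getState(
                    new ValueStateDescriptor<>("count", Types.LONG));
            taskTotal = ctx.getOperatorStateStore().getListState(
                    new ListStateDescriptor<>("total", Types.LONG));
            localTotal = 0L;
            for (Long t : taskTotal.get()) {            // restore after recovery/rescaling
                localTotal += t;
            }
        }

        @Override
        public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
            taskTotal.clear();                          // snapshot the task-local counter
            taskTotal.add(localTotal);
        }
    }

The per-key count is automatically scoped to the key of the current record, while the task-local total is snapshotted as a list entry of operator state and redistributed among tasks on recovery or rescaling.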

The following figure shows an application "MyApp" that consists of three operators called "Src", "Proc", and "Snk". Src has one operator state (os1), Proc has one operator state (os2) and two keyed states (ks1, ks2), and Snk is stateless.

[Figure: the streaming application MyApp and the states of its operators]

A savepoint or checkpoint of MyApp consists of the data of all states, organized in a way that the state of each task can be restored. When processing the data of a savepoint (or checkpoint) with a batch job, we need a mental model that maps the state of the individual tasks into datasets or tables. In fact, we can think of a savepoint as a database. Every operator (identified by its UID) represents a namespace. Each operator state of an operator is mapped to a dedicated table in its namespace with a single column that holds the state data of all tasks. All keyed states of an operator are mapped to a single table that consists of one column for the key and one column for each keyed state. The figure below shows how a savepoint of MyApp is mapped to a database.

[Figure: the savepoint of MyApp represented as a database]

The figure shows how the values of Src's operator state os1 are mapped to a table with one column and five rows, one row for each list entry across all parallel tasks of Src. Similarly, the operator state os2 of "Proc" is mapped to an individual table. The keyed states ks1 and ks2 are combined into a single table with three columns: one for the key, one for ks1, and one for ks2. That table holds one row for each distinct key of the two keyed states. Since "Snk" does not have any state, its namespace is empty.
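
In database terms, the mapping looks like this (a sketch derived from the description above, with operator and state names as in the MyApp example):

    Savepoint "MyApp"
    ├── namespace "Src"
    │     └── table os1: [ value ]               (5 rows, one per list entry)
    ├── namespace "Proc"
    │     ├── table os2: [ value ]
    │     └── keyed table: [ key | ks1 | ks2 ]   (one row per distinct key)
    └── namespace "Snk"                          (empty, Snk is stateless)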
The State Processor API provides methods to create, load, and write savepoints. Users can read datasets from a loaded savepoint or convert a dataset into state and add it to a savepoint. These datasets can be processed with the full feature set of the DataSet API. Using these building blocks, all of the aforementioned use cases (and more) can be addressed.
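
For the writing direction, a minimal sketch of bootstrapping a new savepoint from a DataSet could look like the following. The Account type, operator UID, paths, state backend, and maximum parallelism are hypothetical placeholders, not values from this article.

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.runtime.state.memory.MemoryStateBackend;
    import org.apache.flink.state.api.BootstrapTransformation;
    import org.apache.flink.state.api.OperatorTransformation;
    import org.apache.flink.state.api.Savepoint;
    import org.apache.flink.state.api.functions.KeyedStateBootstrapFunction;

    public class WriteSavepoint {

        // Hypothetical POJO describing one state entry.
        public static class Account {
            public int id;
            public double balance;
            public Account() {}
            public Account(int id, double balance) { this.id = id; this.balance = balance; }
        }

        // Writes one keyed ValueState entry per account id.
        public static class AccountBootstrapper
                extends KeyedStateBootstrapFunction<Integer, Account> {
            private transient ValueState<Double> balance;

            @Override
            public void open(Configuration parameters) {
                balance = getRuntimeContext().getState(
                        new ValueStateDescriptor<>("balance", Types.DOUBLE));
            }

            @Override
            public void processElement(Account account, Context ctx) throws Exception {
                balance.update(account.balance);
            }
        }

        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            DataSet<Account> accounts =
                    env.fromElements(new Account(1, 100.0), new Account(2, 7.5));

            // Turn the dataset into keyed state, keyed by account id.
            BootstrapTransformation<Account> transformation = OperatorTransformation
                    .bootstrapWith(accounts)
                    .keyBy(new KeySelector<Account, Integer>() {
                        @Override
                        public Integer getKey(Account acc) { return acc.id; }
                    })
                    .transform(new AccountBootstrapper());

            // Create a new savepoint with max parallelism 128 and write it out.
            Savepoint
                    .create(new MemoryStateBackend(), 128)
                    .withOperator("accounts", transformation)
                    .write("hdfs:///savepoints/bootstrap-1");

            env.execute("write savepoint");
        }
    }

A streaming job started from this savepoint, with an operator whose UID is "accounts" and a matching ValueState descriptor, would pick up the bootstrapped balances as its initial state.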


Why use the DataSet API?

If you are familiar with Flink's roadmap, you might be surprised that the State Processor API is based on the DataSet API, because the Flink community currently plans to extend the DataStream API with the concept of BoundedStreams and to deprecate the DataSet API. When designing this feature, however, we also evaluated the DataStream API and the Table API, and neither of them could provide the required functionality. Rather than let the development of this feature be blocked, we decided to build it on the DataSet API, but kept its dependencies on the DataSet API to a minimum. Because of that, migrating it to another API should be fairly easy.

Summary

Flink users have long asked for ways to access and modify the state of streaming applications from the outside. With the State Processor API, Flink opens up many new possibilities for users to maintain and manage streaming applications, including arbitrary evolution of streaming applications as well as exporting and bootstrapping application state. In short, with the State Processor API, savepoints are no longer a black box.

Origin: blog.csdn.net/mnbvxiaoxin/article/details/104806222