Can Flink State Replace the Database?

Stateful computation, which underpins both fault tolerance and data consistency, is one of the essential characteristics of real-time computing. The popular real-time engines, including Google Dataflow, Flink, Spark (Structured) Streaming and Kafka Streams, all provide built-in State support. With State, a real-time application no longer has to rely on an external database to store metadata and intermediate data, and in some cases it can even store result data directly in State. This naturally raises the question: what is the relationship between State and the database? Is it possible to replace the database with State?

The Flink community started exploring this question relatively early. Overall, its efforts fall into two lines: one is the ability to access the State of a running job through a query interface, namely QueryableState; the other is the ability to query and modify State offline through its dump file (the Savepoint), namely the upcoming Savepoint Processor API.

QueryableState

QueryableState was introduced in Flink 1.2, released in 2017. It lets users query the contents of a particular State of a job through a client [1], which means a Flink application can serve real-time access to its computation results without depending on any external storage other than State.

[Figure: providing real-time data access with Queryable State only]
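
The idea can be pictured with a minimal sketch. It is a conceptual model in plain Python, not the Flink QueryableState client API; the `RunningJob` class and its method names are hypothetical. It shows a job that owns its state and hands external readers a live, read-only view of it.

```python
# Conceptual model only: plain Python, not the Flink QueryableState client API.
from types import MappingProxyType


class RunningJob:
    """Stands in for a running job that owns its keyed state (hypothetical class)."""

    def __init__(self):
        self._state = {}  # key -> running sum, mutated only by the job itself

    def process(self, key, value):
        # The job updates its own state as events arrive.
        self._state[key] = self._state.get(key, 0) + value

    def queryable_view(self):
        # External readers get a live, read-only view of the state.
        return MappingProxyType(self._state)


job = RunningJob()
view = job.queryable_view()
job.process("A", 5000)
job.process("A", 3600)

print(view["A"])  # 8600: the reader sees the latest result without an external database
```

The view rejects writes, which mirrors the fact that only the job itself can change its State while it is running.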

However, although the vision behind QueryableState is attractive, it depends on substantial changes to the underlying architecture and its functionality is fairly limited, so it has remained in Beta and cannot be used in production. To address this, Yang, an engineer at Tencent, recently proposed an improvement plan for QueryableState [2]. On the mailing list, the community discussed whether QueryableState could be used in place of a database, and different points of view emerged. Combined with my own opinions, the main advantages and disadvantages of State as Database are summarized below.

Advantages:

  • Lower data latency. The results of a Flink application generally need to be synchronized to an external database, for example when a window fires on a timer and emits its result. This synchronization is usually periodic and introduces some delay, leading to the awkward situation where the computation is real-time but the query is not. Querying State directly avoids the problem.
  • Stronger data consistency guarantees. The consistency guarantee that a Flink Connector or custom SinkFunction can offer depends on the characteristics of the external store. For example, HBase does not support multi-row transactions, so Flink can only achieve Exactly-Once delivery there by making the business logic idempotent. State, in contrast, comes with a solid Exactly-Once guarantee.
  • Resource savings. Removing the need to synchronize data to external storage saves serialization and network transfer costs, and of course the cost of the database itself.
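
The consistency point above, that a sink without transactions has to rely on idempotent writes, can be illustrated with a small sketch. It is a conceptual model only: no Flink or HBase API is used, and `replay_with_failure` and both stores are hypothetical stand-ins. It assumes that after a restart the last result is emitted a second time.

```python
# Conceptual model only: no Flink or HBase API is used here.
def replay_with_failure(results, write):
    """Emit results, then re-emit the last one as a recovery replay would (assumed behaviour)."""
    for row_key, value in results:
        write(row_key, value)
    # After a restart from the last checkpoint, uncommitted output is sent again.
    row_key, value = results[-1]
    write(row_key, value)


window_results = [("A", 8600), ("B", 3900)]

append_store = []  # non-idempotent: every write adds a record
replay_with_failure(window_results, lambda k, v: append_store.append((k, v)))

upsert_store = {}  # idempotent: writing the same row key overwrites
replay_with_failure(window_results, lambda k, v: upsert_store.__setitem__(k, v))

print(len(append_store))  # 3: the replayed result is duplicated
print(upsert_store)       # {'A': 8600, 'B': 3900}: the replay has no visible effect
```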

Disadvantages:

  • Lack of SLA guarantees. Database technology is very mature, with a great deal of accumulated experience in availability, fault tolerance, and operations; in these respects State is still in a primitive era. Moreover, by design a Flink job restarts automatically when it hits an error or is taken down for maintenance and version upgrades, which causes downtime, so it cannot offer the highly available data access that a database does.
  • Possible job instability. A carelessly written ad-hoc query may scan and return an excessive amount of data, putting heavy load on the system and very likely affecting the normal execution of the job. Even reasonable queries can hurt the job's efficiency when many of them run concurrently.
  • Stored data cannot be too large. At runtime State is mainly stored in the local memory and disk of the TaskManager, so oversized State can cause a TaskManager OOM or exhaust disk space. Large State also means large checkpoints, so checkpoints may time out and job recovery takes noticeably longer.
  • Only the most basic queries are supported. State supports only the simplest lookups on its data structures; it cannot offer computation capabilities such as functions the way a relational database does, nor optimizations such as predicate pushdown.
  • Read-only, not modifiable. At runtime State can only be modified by the job itself; if you really need to modify State, you can only do so through the Savepoint Processor API described below.

Overall, the disadvantages of using State in place of a database currently far outweigh the advantages, but for jobs that do not demand high data availability, using State as the database is entirely reasonable. Given their different positioning, Flink State is unlikely to fully replace the database any time soon, but there is no doubt that State is moving toward the database in its data access features.

Savepoint Processor API

The Savepoint Processor API is a new feature recently proposed by the community (see FLIP-43 [3]). It is used to analyze a State's dump file, the Savepoint, offline, to modify it, or to build an initial Savepoint directly from data. The Savepoint Processor API belongs to the State Evolution part of Flink's State Management. If QueryableState is the DSL, then Flink State Evolution is the DML, and the Savepoint Processor API is the most important part of that DML.

The predecessor of the Savepoint Processor API is the third-party Bravo project [4]. Its main idea is to provide conversion between Savepoint and DataSet: a typical application reads a Savepoint into a DataSet, modifies the DataSet, and then writes it out as a new Savepoint. This suits the following scenarios:

  • Analyzing a job's State to study its patterns and regularities
  • Troubleshooting or auditing
  • Building the initial State for a new application
  • Modifying a Savepoint, for example:

    • Changing a job's maximum parallelism
    • Making major schema changes
    • Correcting problematic State
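
The read, modify, write cycle described above can be sketched as follows. This is a conceptual model in plain Python, not the Savepoint Processor API or Bravo: a savepoint is modelled as a nested dictionary, and the uid `group-agg`, the state name and the helper functions are all hypothetical.

```python
# Conceptual model only: a savepoint is modelled as {operator uid: {state name: {key: value}}}.
import copy


def read_state(savepoint, uid, state_name):
    """Read one keyed state as a list of (key, value) rows, like a dataset."""
    return sorted(savepoint[uid][state_name].items())


def write_new_savepoint(savepoint, uid, state_name, rows):
    """Return a new savepoint with one state replaced; the input is left untouched."""
    new_savepoint = copy.deepcopy(savepoint)
    new_savepoint[uid][state_name] = dict(rows)
    return new_savepoint


old = {"group-agg": {"group_score": {"A": 8600, "B": 3900, "C": 200}}}

rows = read_state(old, "group-agg", "group_score")
# Correct a bad value offline (hypothetical fix: C should be 2000).
fixed_rows = [(k, 2000 if k == "C" else v) for k, v in rows]
new = write_new_savepoint(old, "group-agg", "group_score", fixed_rows)

print(new["group-agg"]["group_score"]["C"])  # 2000
print(old["group-agg"]["group_score"]["C"])  # 200, the original savepoint is unchanged
```

Writing a new savepoint instead of editing the old one in place keeps the original available for rollback or auditing.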

As the dump file of State, a Savepoint can expose data query and modification functions through the Savepoint Processor API, much like an offline database. Still, the concepts of State differ considerably from those of a typical relational database, and FLIP-43 draws analogies for and summarizes these differences.

First, a Savepoint is the physical storage of the state of multiple operators. The state of different operators is independent, which is similar to tables that live in different namespaces of a database. A Savepoint therefore corresponds to a database, and a single operator corresponds to a namespace.

Database | Savepoint
-------- | ---------
Namespace | Uid
Table | State

For the table, however, the corresponding concept in a Savepoint varies with the type of State. There are three kinds of State: Operator State, Keyed State and Broadcast State. Operator State and Broadcast State are non-partitioned state, i.e. state that is not partitioned by key, whereas Keyed State is partitioned state. For non-partitioned state, each state is one table, and each element of the state is a row of that table. For partitioned state, all states of the same operator together correspond to one table. That table has a row key, much like in HBase, and each individual state corresponds to a column of the table.
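
The two mappings just described can be sketched as two small functions. This is a conceptual model only: it shows how states line up with tables, not how Flink stores them, and the function names and the sample state names are hypothetical.

```python
# Conceptual model only: shows how states map to tables, not how Flink stores them.
def operator_state_to_table(state_name, elements):
    """Non-partitioned state: one table per state, one row per element."""
    return [{state_name: element} for element in elements]


def keyed_states_to_table(key_column, states):
    """Partitioned state: one table per operator, one column per state, key as row key."""
    keys = sorted({key for values in states.values() for key in values})
    return [
        {key_column: key, **{name: values.get(key) for name, values in states.items()}}
        for key in keys
    ]


operator_table = operator_state_to_table("offsets", [10, 42])
keyed_table = keyed_states_to_table(
    "user_group", {"score": {"A": 1, "B": 2}, "time": {"A": 7}}
)

print(operator_table)
print(keyed_table)
```

A key that has no value in one of the states simply gets an empty cell (`None`) in that column, as a sparse column would in a wide-column store.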

For example, suppose we have two data streams carrying players' scores and online time. We use Keyed State to record the total score and playing time of each player group, and Operator State to record the total score and total time of all players.

The input streams over a period of time are as follows:

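user_id | user_name | user_group | score
------- | --------- | ---------- | -----
1001 | Paul | A | 5,000
1002 | Charlotte | A | 3,600
1003 | Kate | C | 2,000
1004 | Robert | B | 3,900

user_id | user_name | user_group | time
------- | --------- | ---------- | ----
1001 | Paul | A | 1,800
1002 | Charlotte | A | 1,200
1003 | Kate | C | 600
1004 | Robert | B | 2,000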

With Keyed State, we register two MapStates, group_score and group_time, which hold the group's total score and total time. We then keyBy the streams on user_group and accumulate the two metrics into State. The resulting table is as follows:

user_group | group_score | group_time
---------- | ----------- | ----------
A | 8,600 | 3,000
C | 2,000 | 600
B | 3,900 | 2,000
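
The keyed aggregation above can be reproduced with a short sketch. It is a plain-Python model, not Flink code: two dictionaries stand in for the group_score and group_time states, and grouping by user_group plays the role of keyBy.

```python
# Conceptual model only: dictionaries stand in for the two keyed states.
from collections import defaultdict

# (user_id, user_name, user_group, score)
score_events = [
    (1001, "Paul", "A", 5000),
    (1002, "Charlotte", "A", 3600),
    (1003, "Kate", "C", 2000),
    (1004, "Robert", "B", 3900),
]
# (user_id, user_name, user_group, time)
time_events = [
    (1001, "Paul", "A", 1800),
    (1002, "Charlotte", "A", 1200),
    (1003, "Kate", "C", 600),
    (1004, "Robert", "B", 2000),
]

group_score = defaultdict(int)  # one value per key, like a keyed state
group_time = defaultdict(int)

for _, _, group, score in score_events:
    group_score[group] += score  # accumulate under the event's key
for _, _, group, seconds in time_events:
    group_time[group] += seconds

# One row per key, one column per state.
for group in group_score:
    print(group, group_score[group], group_time[group])
```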

In contrast, if we use Operator State to record the total score and total time (with parallelism set to 1), we register two States, total_score and total_time, and obtain two tables:

total_score | 
------- | 
14,500 |

total_time | 
------- | 
5,600 |
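
A short sketch shows why the example fixes the parallelism at 1. Operator State is kept per parallel subtask, so a single subtask holds the full total, while several subtasks would each hold a partial one. The round-robin distribution and the function name `run_subtasks` are assumptions made for illustration, not Flink behaviour.

```python
# Conceptual model only: each list slot stands in for one subtask's operator state.
def run_subtasks(values, parallelism):
    """Distribute values round-robin (assumed); each subtask keeps its own running total."""
    totals = [0] * parallelism
    for i, value in enumerate(values):
        totals[i % parallelism] += value
    return totals


scores = [5000, 3600, 2000, 3900]
times = [1800, 1200, 600, 2000]

print(run_subtasks(scores, 1))  # [14500]
print(run_subtasks(times, 1))   # [5600]
print(run_subtasks(scores, 2))  # [7000, 7500]: two partial totals, one per subtask
```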

At this point the correspondence between Savepoint and Database should be clearer. For a Savepoint, the StateBackend determines how State is actually persisted, which clearly corresponds to the storage engine of a database. In MySQL, a single command, ALTER TABLE xxx ENGINE = InnoDB;, changes the storage engine, and MySQL automatically does the tedious format conversion behind the scenes. For Savepoints, because the storage formats of different StateBackends are incompatible, it is not yet possible to switch StateBackend easily. To this end, the community recently created FLIP-41 [5] to further improve the operability of Savepoints.
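
The format problem can be pictured with a toy sketch. It is purely illustrative: the two "backends" are hypothetical and use JSON and pickle as stand-ins for incompatible storage formats, and nothing here reflects how real StateBackends serialize state. Switching requires decoding with the source format and re-encoding with the target format, which is the conversion a database engine switch does for you.

```python
# Conceptual model only: JSON and pickle stand in for two incompatible storage formats.
import json
import pickle

# Hypothetical backends: name -> (encode, decode)
BACKENDS = {
    "json_backend": (
        lambda state: json.dumps(state).encode("utf-8"),
        lambda blob: json.loads(blob.decode("utf-8")),
    ),
    "pickle_backend": (pickle.dumps, pickle.loads),
}


def switch_backend(blob, source, target):
    """Decode with the source format, then re-encode with the target format."""
    decode = BACKENDS[source][1]
    encode = BACKENDS[target][0]
    return encode(decode(blob))


state = {"A": 8600, "B": 3900}
json_blob = BACKENDS["json_backend"][0](state)
pickle_blob = switch_backend(json_blob, "json_backend", "pickle_backend")

print(BACKENDS["pickle_backend"][1](pickle_blob) == state)  # True: same state, new format
```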

Summary

State as Database is a development trend in real-time computing. It is not meant to replace the database; rather, it borrows from the experience of the database field to extend State's interfaces so that working with State feels closer to working with a familiar database. For Flink, external use of State falls into online real-time access and offline access and modification, supported by the Queryable State and Savepoint Processor API features respectively.

Origin www.cnblogs.com/yunqishequ/p/11906650.html