How to look up and update the state of a record from a database in Apache Flink?

jbx :

I am working on a data streaming application and I am investigating the possibility of using Apache Flink for this project. The main reason for this is that it supports nice high-level streaming constructs, very similar to Java 8's Stream API.

I will be receiving events which correspond to a specific record in a database, and I want to be able to process these events (coming from a message broker such as RabbitMQ or Kafka) and eventually update the records in the database and push the processed/transformed events to another sink (probably another message broker).

Events related to a specific record will ideally need to be processed in FIFO order (although each event carries a timestamp, which also helps detect out-of-order events), but events related to different records can be processed in parallel. I was planning to use the keyBy() construct to partition the stream by record.
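Assuming each event carries a record id and a timestamp, the out-of-order detection mentioned above could be sketched like this (the class and method names are hypothetical, and a plain Map stands in for what would be per-key Flink state after keyBy()):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: detect out-of-order events per record id by
// remembering the last timestamp seen for each id. In a real Flink job
// the lastSeen map would be keyed state scoped by keyBy().
public class OutOfOrderDetector {
    private final Map<String, Long> lastSeen = new HashMap<>();

    // returns true if the event arrived in order for its record id
    public boolean inOrder(String recordId, long timestamp) {
        Long prev = lastSeen.get(recordId);
        if (prev != null && timestamp < prev) {
            return false; // older than the last event seen for this record
        }
        lastSeen.put(recordId, timestamp);
        return true;
    }

    public static void main(String[] args) {
        OutOfOrderDetector d = new OutOfOrderDetector();
        System.out.println(d.inOrder("a", 10)); // true
        System.out.println(d.inOrder("a", 12)); // true
        System.out.println(d.inOrder("a", 11)); // false (out of order)
        System.out.println(d.inOrder("b", 5));  // true (different record)
    }
}
```

Note that events for different record ids never interfere with each other, which matches the per-record FIFO requirement.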

The processing that needs to be done depends on the current information in the database about the record. However, I am unable to find an example or a recommended approach for querying the database for such a record in order to enrich the event being processed with the additional information I need.

The pipeline I have in mind is as follows:

receive event -> keyBy() on the record id -> retrieve the corresponding record from the database -> perform processing steps on the record -> push the processed event to an external queue and update the database record

The database record will need to be updated because another application will be querying the data.

There might be additional optimisations one could do after this pipeline is achieved. For example one could cache the (updated) record in a managed state so that the next event on the same record will not need another database query. However, if the application does not know about a specific record it will need to retrieve it from the database.
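The cache-on-miss optimisation described above could be sketched as follows. This is a plain-Java illustration only: in Flink the per-key cache would be a ValueState inside a keyed RichFlatMapFunction, but here a HashMap stands in so the logic is visible, and the class and loader names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the cache-on-miss pattern: query the database only
// when the record is not yet known, and keep the cache in sync on updates.
public class RecordCacheSketch {
    private final Map<String, String> cache = new HashMap<>();

    // loader simulates the database query performed on a cache miss
    public String lookup(String id, Function<String, String> loader) {
        return cache.computeIfAbsent(id, loader);
    }

    public void update(String id, String newValue) {
        cache.put(id, newValue); // keep the cached copy in sync after a DB update
    }

    public static void main(String[] args) {
        RecordCacheSketch s = new RecordCacheSketch();
        int[] dbQueries = {0};
        Function<String, String> db = id -> { dbQueries[0]++; return "row-" + id; };
        System.out.println(s.lookup("42", db)); // first event: queries the "database"
        System.out.println(s.lookup("42", db)); // second event: served from the cache
        System.out.println(dbQueries[0]);       // prints 1: the DB was queried only once
    }
}
```

Because keyBy() routes all events for one record to the same parallel instance, a per-key cache like this never sees stale writes from another instance for the same record.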

What is the best approach to use for this kind of scenario in Apache Flink?

Yassine Marzougui :

You can perform the database look-up by extending a rich function, e.g. a RichFlatMapFunction: initialize the database connection once in its open() method and then process each event in the flatMap() method:

public static class DatabaseMapper extends RichFlatMapFunction<Event, EnrichedEvent> {

    // Declare the DB connection and prepared query statements here

    @Override
    public void open(Configuration parameters) throws Exception {
        // Initialize the database connection
        // Prepare the query statements
    }

    @Override
    public void flatMap(Event currentEvent, Collector<EnrichedEvent> out) throws Exception {
        // Look up the record in the database, update it, and enrich the event
        out.collect(enrichedEvent);
    }
}

And then you can use DatabaseMapper as follows:

stream.keyBy(id)
      .flatMap(new DatabaseMapper())
      .addSink(..);

You can also find an example that uses cached data from Redis.
