Apache Kafka - Stream Processing




Overview

Kafka is widely recognized as a powerful message bus that can reliably deliver event streams, which makes it an ideal data source for stream processing systems. A stream processing system is a computing system that processes and analyzes real-time data streams as they arrive and responds or takes action as required. Unlike traditional batch processing systems, streaming systems can process data as soon as it arrives, which makes them especially suitable for applications that require a real-time response, such as real-time monitoring and alerting, real-time recommendations, and real-time ad serving.

Kafka's design makes it an ideal data source for stream processing systems because of its high throughput, low latency, and reliability, and its ability to easily scale to handle large amounts of data. Many Kafka-based stream processing systems, such as Apache Storm, Apache Spark Streaming, Apache Flink, and Apache Samza, have been successfully applied in various scenarios.

Kafka's stream processing library, Kafka Streams, provides a simple and powerful way to process real-time data streams, and it ships as part of the Kafka client libraries. This allows developers to read, process, and produce events directly within an application without relying on an external processing framework. The library provides many useful features, such as windowing, state stores, and stream processing topology construction, enabling developers to easily build powerful stream processing applications.
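
As a rough, minimal sketch (the topic names, serdes, and broker address below are illustrative assumptions, not taken from any particular application), a Kafka Streams program that reads events, transforms them, and writes them back out looks roughly like this:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class SimpleStreamsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "simple-streams-app"); // also the consumer group id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Build a topology: read events, transform each one, write the result out.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("input-events");
        events.mapValues(value -> value.toUpperCase())  // a stateless, per-event transformation
              .to("output-events");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```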

With the popularity of Kafka and the development of stream processing technology, stream processing systems have become an important field of data processing and are widely used in more and more application scenarios. Kafka's stream processing class library provides developers with a powerful tool to process real-time data streams and extract useful information from them. It is an ideal choice for building complex stream processing systems.


What is streaming

Streaming is a programming paradigm for processing one or more streams of events in real time. Event streams are abstract representations of unbounded datasets that are infinite and continuously growing, with new records being added over time.

Unlike batch processing, streaming enables real-time processing of event streams without waiting for all the data to be available. This makes streaming ideal for business scenarios that require real-time response, such as suspicious transaction alerts, network alerts, real-time price adjustments, and package tracking, among others.

Stream processing has the following characteristics:

  • Ordered : The data records in the event stream are ordered in the order in which they occurred. This means that streaming can process events in the order they occur, leading to correct results.

  • Immutable : Data records in an event stream are immutable, i.e. once a record is created, it cannot be modified. This makes streaming easier to implement because it does not need to worry about concurrent modification issues.

  • Replayable : Data records in an event stream can be processed repeatedly, making streaming fault-tolerant. If an error occurs during processing, the same data record can be reprocessed until the correct result is obtained.

  • Low latency : Streaming has low latency, that is, the time to process a stream of events is short, usually on the order of milliseconds or microseconds. This makes streaming very suitable for business scenarios that require real-time response.

  • High throughput : Stream processing has high throughput, that is, it can process a large number of data records. This makes streaming ideal for processing large-scale datasets.

  • Does not depend on specific frameworks or APIs: the definition of stream processing does not depend on any particular framework, API, or feature; as long as data is continuously read from an unbounded data set and processed, it is stream processing. This gives streaming greater flexibility and scalability.

Stream processing is a programming paradigm that can process unbounded data sets in real time. It has the characteristics of order, immutability, replayability, low latency, high throughput, and flexibility. It is very suitable for business scenarios that require real-time response.


Some concepts of stream processing

Time

Time is perhaps the most important concept in stream processing, and also the most confusing one, especially in the context of distributed systems. Time matters because most stream processing operations are based on time windows. Stream processing systems usually distinguish the following notions of time:

  • Event Time: The time when the event actually occurred. This is the most important concept of time. Most streaming applications perform window operations and aggregations based on event time.
  • Log append time: The time when the event was written to a Kafka broker. This time is mainly used internally by Kafka and matters less to streaming applications.
  • Processing time: The time when the application receives the event and starts processing it. This time is unreliable, since it can differ between runs and between application instances, so it is rarely used by streaming applications.

Readers are recommended to read Justin Sheehy's paper "There is No Now" to gain a deeper understanding of these time concepts, especially their complexities in the context of distributed systems.
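
To make the distinction concrete, here is a minimal sketch of a custom timestamp extractor in Kafka Streams that derives event time from the record payload instead of relying on log-append or processing time. The `OrderEvent` type and its `occurredAt()` accessor are hypothetical; such an extractor would typically be registered through the `default.timestamp.extractor` configuration.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class OrderEventTimeExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        Object value = record.value();
        if (value instanceof OrderEvent) {             // OrderEvent is a hypothetical payload type
            return ((OrderEvent) value).occurredAt();  // event time: when the order actually happened
        }
        // Fall back to the timestamp carried by the record itself (event time
        // or log-append time, depending on how the topic is configured).
        return record.timestamp();
    }
}
```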

In a streaming system, if a producer goes offline for a few hours because of a network problem and a large backlog of data then floods in, the system faces real difficulty: the event time of most of that data falls outside the window range we configured, so normal aggregation cannot be performed.

To solve this problem, the streaming system provides several mechanisms:

  1. Drop data outside the window: simple but leads to data loss
  2. Adjust the window: expand the window to contain more data, but the larger window range will affect the calculation accuracy
  3. Resend data: The producer will resend the data during the offline period, and the system will perform supplementary calculations to produce correct results
  4. Watermark: It is allowed to specify the maximum time for data to be late, and the system will wait for the data within the watermark time to arrive and then start calculation and output the result. The watermark mechanism can effectively solve the problem of late data arrival while ensuring the accuracy of the results.

Therefore, these notions of time need to be considered when designing streaming applications, especially late-arriving and backfilled data, and an appropriate mechanism should be chosen to handle them so that results remain accurate.


State

  1. It's simple to deal with a single event, but when multiple events are involved, more information needs to be tracked, which is called "state".
  2. State is typically stored in variables local to the application, such as hash tables. However, local state can be lost when an instance crashes or restarts, so the latest state needs to be persisted and restored to keep the results consistent.
  3. Local state or internal state: accessible only by a single application instance and maintained with an embedded database; fast, but limited by the available memory. Many designs therefore split the data into sub-streams that can each be processed using local state.
  4. External state: maintained in an external data store, such as the NoSQL system Cassandra. Its size is effectively unlimited and it can be accessed by multiple application instances, but it adds latency and complexity. Most stream processing applications try to avoid external stores, or cache information locally to reduce round-trips and latency, which in turn introduces the problem of keeping the internal cache and the external state consistent.
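
As a small sketch of local (internal) state in Kafka Streams (the topic and store names are made up for illustration): the running counts below live in an embedded store on each instance, and the framework backs that store with a changelog topic so the local state can be restored after a crash or restart.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

public class ViewsPerUserTopology {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Count page views per user. The counts are local state, held in an
        // embedded store named "views-per-user" on each application instance
        // and backed by a changelog topic for recovery.
        KTable<String, Long> viewsPerUser = builder
            .<String, String>stream("page-views")
            .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
            .count(Materialized.as("views-per-user"));

        return builder.build();
    }
}
```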

The duality of streams and tables

  1. A table is a collection of records, identified by a primary key and defined by a schema. Records are mutable, and the table can be queried to obtain the state at a point in time; for example, querying the CUSTOMERS_CONTACTS table returns the current contact information of all customers. A table, however, holds no history.
  2. A stream is a sequence of events, each of which represents a change. A table is the current state that results from applying many such changes. Tables and streams are two sides of the same coin: the world changes, and we can focus either on the changing events or on the current state. A system that supports both views is more powerful.
  3. Converting a table into a stream requires capturing the change events on the table (insert, update, delete), for example with a CDC solution that sends the changes to Kafka for stream processing.
  4. Converting a stream into a table requires applying all the changes in the stream to build up the state: create a table in memory, in an internal state store, or in an external database, then walk through the events of the stream and apply each change in turn, ending up with a table that represents the state at a given point in time.

Suppose there is a shoe store; its retail activity can be represented as a stream of events:

"Red, blue and green shoes arrived"
"Blue shoes sold"
"Red shoes sold"
"Blue shoes returned"
"Green shoes sold"
If you want to know what inventory is currently in the warehouse, or how much money has been made so far, the view needs to be materialized.

Converting a stream into a table requires applying all the change events in the stream to build up the table's state, while converting a table into a stream requires capturing the change events on the table and sending them to a stream for further processing. A table represents the state at a moment in time, a stream represents the changes, and the two can be converted into each other; a system that supports both views is more powerful.
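
A minimal sketch of both directions in Kafka Streams, using the shoe-store example (the topic names, the numeric encoding of change events, and the omitted serde configuration are assumptions for illustration):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

public class ShoeInventoryTopology {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Stream -> table: apply every inventory change event (+1 for an
        // arrival or a return, -1 for a sale), keyed by shoe colour, to build
        // the current stock count per colour -- the materialized view.
        KTable<String, Long> stockByColour = builder
            .<String, Long>stream("inventory-changes")
            .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
            .reduce(Long::sum, Materialized.as("stock-by-colour"));

        // Table -> stream: every update to the table becomes a change event
        // that downstream processors can consume.
        stockByColour.toStream().to("stock-level-updates");

        return builder.build();
    }
}
```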


Time windows

Windowed operations on streams are mainly characterized by the following attributes:

  1. Window size: 5 minutes, 15 minutes, 1 day, etc. The size affects the change detection speed and smoothness. The smaller the window, the faster the change detection but the more noise; the larger the window, the smoother the change but the greater the delay.
  2. How often the window advances (the "advance interval"): a 5-minute average can be updated every minute, every second, or for every new event. When the advance interval equals the window size, the window is called a "tumbling window"; a window that moves with every new record is called a "sliding window".
  3. How long the window remains updatable: suppose we have calculated the average for the 00:00-00:05 window, and an hour later an event with an event time of 00:02 arrives; should the result for 00:00-00:05 be updated? It is possible to define a period during which late events are still added to their corresponding time slice, for example events up to four hours late update the result and anything later is ignored.
  4. Whether the window is aligned to the clock: a 5-minute window that advances every minute could have its first slice at 00:00-00:05 and its second at 00:01-00:06, or the application could start at an arbitrary time, making the first slice 03:17-03:22. Sliding windows move with new records and are never aligned to the clock.

The window size affects the sensitivity and smoothness of the results, the advance interval determines how often the results are updated, and the updatable period determines whether late events still participate in the computation. Windows may or may not be aligned to the clock.

Sliding windows move with each new event, while tumbling and hopping windows advance at predetermined intervals, and neither advances by more than the window size. When the advance interval equals the window size (a tumbling window), adjacent windows do not overlap; when the advance interval is smaller than the window size (a hopping window), adjacent windows overlap.

[Figure: the difference between tumbling windows and hopping windows]
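
For reference, this is roughly how the three kinds of windows are declared in Kafka Streams; the method names below come from recent versions of the API and the durations are arbitrary examples.

```java
import java.time.Duration;
import org.apache.kafka.streams.kstream.SlidingWindows;
import org.apache.kafka.streams.kstream.TimeWindows;

public class WindowDefinitions {

    // Tumbling window: 5 minutes long, advances by its own size,
    // adjacent windows never overlap.
    static final TimeWindows TUMBLING =
        TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5));

    // Hopping window: 5 minutes long, advances every minute,
    // adjacent windows overlap.
    static final TimeWindows HOPPING =
        TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5))
                   .advanceBy(Duration.ofMinutes(1));

    // Sliding window: moves with every new record and is never aligned to the
    // clock; the duration is the maximum time difference between two records
    // that may fall into the same window.
    static final SlidingWindows SLIDING =
        SlidingWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5));
}
```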


Design Patterns for Streaming

Single event handling

Handling one event at a time is the most basic stream processing pattern. This pattern is also called the map or filter pattern, because it is often used to filter out useless events or to transform events.

(The term map comes from the Map-Reduce pattern, where the map phase transforms events and the reduce phase aggregates the transformed events.)

The application reads events from a stream, modifies each one, and writes it to another stream. For example, it might read a log stream and write ERROR-level messages to a high-priority stream and everything else to a low-priority stream, or it might convert events from JSON to Avro. Because no state needs to be maintained between events, it is easy to recover from failures or to rebalance the load.

[Figure: single-event processing topology]
This pattern can be implemented using one producer and one consumer.
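
A minimal sketch of the log-routing example in Kafka Streams (the topic names and the assumption that the log level is a prefix of the message value are purely illustrative):

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;

public class LogRoutingTopology {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> logs = builder.stream("application-logs");

        // Stateless routing: ERROR messages go to a high-priority stream,
        // everything else goes to a low-priority stream.
        logs.filter((key, message) -> message.startsWith("ERROR"))
            .to("logs-high-priority");
        logs.filterNot((key, message) -> message.startsWith("ERROR"))
            .to("logs-low-priority");

        return builder.build();
    }
}
```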


Using local state

Most stream processing applications aggregate information, such as daily high and low stock prices or moving averages. To do this, the state of the stream must be maintained, for example by storing the current minimum and maximum values and comparing each new value against them. These are all per-group aggregations and can be implemented with local state, as shown in the figure below. Kafka's partitioner ensures that all events for the same stock symbol land in the same partition, so each application instance processes the events of the partitions assigned to it and maintains the state for its own set of stock symbols.
[Figure: event aggregation topology using local state]
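
A minimal sketch of this kind of per-symbol aggregation; the topic name and the `MinMax` value type are assumptions, and the serde for `MinMax` that a real application would need is omitted.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

public class StockStatsTopology {

    // Hypothetical state object holding the running minimum and maximum price.
    public record MinMax(double min, double max) {}

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Trades are keyed by stock symbol, so all trades for a symbol land in
        // the same partition; the instance that owns that partition keeps the
        // symbol's min/max in its local state store.
        KTable<String, MinMax> stats = builder
            .<String, Double>stream("stock-trades")
            .groupByKey(Grouped.with(Serdes.String(), Serdes.Double()))
            .aggregate(
                () -> new MinMax(Double.MAX_VALUE, -Double.MAX_VALUE),
                (symbol, price, agg) ->
                    new MinMax(Math.min(agg.min(), price), Math.max(agg.max(), price)),
                Materialized.as("stock-min-max"));

        return builder.build();
    }
}
```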


Multi-stage processing and repartitioning

Local state works for aggregation within a group, but some results require complete information, such as the top 10 stocks of the day. Obtaining them takes two stages: in the first stage, each instance computes the gain or loss of the stocks in its own partitions and writes the results to a new topic with a single partition; in the second stage, a single application instance reads that new topic and finds the top 10. Because the new topic contains only per-stock summaries, its traffic is small and a single instance can handle it. Adding more steps is similar to a MapReduce job with multiple reduce phases, except that each step belongs to the same application, and the stream processing framework decides which application instance runs each step.

[Figure: topology including local state and repartition steps]
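
A rough sketch of the two stages under those assumptions (the topic names are invented, and the final ranking step is left as a simple pass-through to an output topic rather than a full top-10 implementation):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Repartitioned;

public class TopMoversTopology {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Stage 1: every instance computes the daily change for the symbols in
        // its own partitions, then all per-symbol summaries are funnelled into
        // a single-partition repartition topic.
        KStream<String, Double> dailyChange = builder.stream("daily-stock-changes");
        KStream<String, Double> allSummaries = dailyChange.repartition(
            Repartitioned.<String, Double>with(Serdes.String(), Serdes.Double())
                         .withName("stock-summaries")
                         .withNumberOfPartitions(1));

        // Stage 2: a single-partition topic is read by exactly one task, so one
        // instance sees every symbol and can rank them; the real top-10 logic
        // would go here instead of this pass-through.
        allSummaries.to("top-movers");

        return builder.build();
    }
}
```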


Using external lookups - joins of streams and tables

[Figure: stream processing using an external data source]

The obvious implementation is to look up the required information in an external database for every event, but such external lookups introduce significant latency.

In order to achieve better performance and stronger scalability, it is necessary to cache the information of the database in the streaming application. However, managing this cache is also a challenge.

For example, how to ensure that the data in the cache is up to date? If you refresh too frequently, it will still put pressure on the database and the cache will be useless. If the refresh is not timely, the data used in streaming will become outdated.

If the database's change events can be captured and turned into an event stream, the stream processing job can monitor that stream and update the cache in time. Capturing database changes as an event stream is called CDC (Change Data Capture). If you use Kafka Connect, you will find several connectors that can perform CDC and turn database tables into a stream of change events.

In this way, the streaming application keeps its own private copy of the database table. Whenever the database changes, the application is notified and updates its private copy according to the change event, as shown in the figure below.

[Figure: topology joining a stream with a table, no external data source required]
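
A minimal sketch of such a stream-table join in Kafka Streams; the topic names and the string-concatenation "enrichment" are placeholders, and the profiles topic is assumed to be populated by a CDC connector.

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class ClickEnrichmentTopology {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // The profiles topic is fed by a CDC connector that captures changes
        // from the user database; reading it as a KTable gives the application
        // a local, continuously updated copy of that table.
        KTable<String, String> profiles = builder.table("user-profiles");
        KStream<String, String> clicks = builder.stream("user-clicks");

        // Enrich each click with the user's profile; no per-event lookup
        // against the external database is needed.
        clicks.join(profiles, (click, profile) -> click + " | " + profile)
              .to("enriched-clicks");

        return builder.build();
    }
}
```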


Stream-to-stream joins

Suppose we want to join a stream of search queries with a stream of clicks on the search results. In Kafka Streams, both streams are partitioned by the same key, which is also the key used to join them. This way, click events for user_id:42 land on partition 5 of the clicks topic, and all search events for user_id:42 land on partition 5 of the searches topic. Streams ensures that events from partition 5 of both topics are assigned to the same task, so that task sees all the events related to user_id:42.

Streams maintains the join time window for both topics in its embedded RocksDB store, which is how it can perform the join.
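
A minimal sketch of the search/click join described above (the topic names, the one-minute join window, and the concatenated result are illustrative assumptions; default serdes are assumed to be configured):

```java
import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

public class SearchClickJoinTopology {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Both topics are keyed by user_id, so matching events for the same
        // user end up in the same partition and are handled by the same task.
        KStream<String, String> searches = builder.stream("searches");
        KStream<String, String> clicks = builder.stream("clicks");

        // Join each search with the clicks that happen within one minute of
        // it; the join window is maintained in the embedded RocksDB store.
        searches.join(
                clicks,
                (search, click) -> search + " -> " + click,
                JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(1)))
            .to("search-click-results");

        return builder.build();
    }
}
```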

Out-of-order events

Key points for handling out-of-order and late-arriving events:

  1. Identify out-of-sequence events: Check the event time and compare it with the current time. If it exceeds the time window, it will be regarded as out-of-sequence or late.
  2. Rearrange out-of-order events within a specified time window: For example, reorder events within 3 hours, and discard events outside 3 weeks.
  3. The ability to rearrange out-of-order events within a time window: Unlike batch processing, stream processing has no concept of "rerunning yesterday's job" and must handle out-of-order and new events at the same time.
  4. The ability to update the result: if the result is stored in a database, it can be updated with a put or update; if the result is sent by email, updating it requires more ingenuity.
  5. Frameworks that can handle events independently of processing time: frameworks such as Dataflow and Streams maintain multiple aggregation time windows that late events can still update, and the window size (and how long it stays updatable) is configurable. The larger the window, the more memory the local state requires.
  6. With the Streams API, aggregation results are written to a topic, usually a compacted topic in which only the latest value is kept for each key. If the result of an aggregation window needs to be updated, the new result is simply written for that window, overwriting the previous result.
In short, handling out-of-order and late-arriving events requires:

  1. Identifying events that fall outside the time window, and dropping them or handling them specially
  2. Defining a reorder window within which out-of-order events are reordered
  3. The ability to reorder out-of-order events within the defined time window and to update the results
  4. A streaming framework that supports time-independent events and local state management, such as Dataflow or Streams
  5. Overwriting updated aggregation results directly, and using a compacted topic so the result topic does not grow without bound

Out-of-order and late-arriving events are a common scenario in stream processing, and unlike batch processing they cannot be handled simply by re-running the whole job. Maintaining multiple time windows of historical state, reordering out-of-order events within those windows, and directly overwriting updated results effectively solve this class of problems.

The local state management, time window support, and compacted result topics provided by Streams allow it to handle out-of-order and late-arriving events efficiently. By configuring different time windows, developers can implement state management and event reordering at different granularities.

The challenge posed by out-of-order and late events lies in managing historical state and updating results. Streaming frameworks such as Streams let developers focus on the business logic of the application without paying too much attention to these underlying issues.
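
A minimal sketch of how a window can be declared to stay updatable for late events in Kafka Streams (the 5-minute window, the 4-hour grace period, the topic names, and the omission of explicit serde configuration are illustrative assumptions):

```java
import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.TimeWindows;

public class LateEventTopology {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        builder.<String, String>stream("page-views")
            .groupByKey()
            // 5-minute windows that remain updatable for events arriving up to
            // 4 hours late; anything later is dropped rather than reprocessed.
            .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofMinutes(5), Duration.ofHours(4)))
            .count()
            // Each late arrival overwrites the previous result for its window;
            // writing to a compacted topic keeps only the latest value per key.
            .toStream((windowedKey, count) ->
                windowedKey.key() + "@" + windowedKey.window().startTime())
            .to("views-per-window");

        return builder.build();
    }
}
```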

Reprocessing

There are two patterns for reprocessing events:

  1. We have an improved version of the stream processing application: the new version processes the same event stream and produces new results; we compare the results of the two versions and, at some point, switch clients over to the new result stream.
  2. The existing application is flawed: after fixing it, we reprocess the event stream and recalculate the results.

Implementing the first pattern:

  1. Run the new version of the application as a new consumer group
  2. Have it start reading from the first offset of the input topics, so it builds its own copy of the input stream events
  3. Check the result stream; once the new version has caught up, switch clients over to the new result stream

The second pattern is more challenging:

  1. The existing application must be reset to the start of the input stream for reprocessing, and its local state must be reset as well, so that the results of the two versions are not mixed up
  2. The previous output stream may also need to be cleaned up
  3. Although Streams provides a tool for resetting application state, it is safer, when capacity allows, to run the two applications side by side and generate two result streams; the results of the different versions can then be compared without the risk of data loss or of the cleanup introducing errors
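
For the first pattern, a minimal configuration sketch (the application id, broker address, and reliance on topic retention are illustrative assumptions): giving the new version its own `application.id` makes it a new consumer group, and setting the consumer's offset reset policy makes it read the input topics from the first offset.

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class ReprocessingConfig {
    public static Properties newVersionProps() {
        Properties props = new Properties();
        // A new application.id means a new consumer group, so the new version
        // keeps its own offsets and its own local state, separate from the old one.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-enrichment-v2");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Start from the first available offset of the input topics so the new
        // version rebuilds its results from the full event history.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), "earliest");
        return props;
    }
}
```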

In either pattern, reprocessing events requires:

  1. Event streams are retained long-term in a scalable data store such as Kafka
  2. Run different versions of the application as different consumer groups, each processing the event stream and generating results
  3. The new version of the application reads events from scratch, builds its own copy of the input stream and results, and avoids affecting the current version
  4. Compare results from different versions, determine when to switch, and carefully switch clients to new result streams
  5. Optionally clean up existing results and state, use the reset tool carefully, or use parallel mode to avoid cleanup

Long-term retention of event streams makes it possible to reprocess events and to A/B-test different versions of an application. Resetting a currently running application carries some risk, which can be minimized by running multiple versions of the application in parallel.

Regardless of the pattern employed, reprocessing events requires careful planning and execution. Comparing the result streams generated by different versions of the application shows clearly whether the new version achieves the expected improvement, which provides a solid basis for the reprocessing and the release decision.

Streams' consumer group management and tooling make it well suited to reprocessing and A/B-testing scenarios. Running different versions of an application as different consumer groups, each processing the event stream and producing its own results, and then carefully migrating clients, is a safer and more reliable way to reprocess events.

Long-term retention of event streams and reliable state management are the cornerstones of reprocessing. A/B testing of different application versions is built on the same mechanism, which makes continuous optimization and evolution of streaming applications possible.


Origin blog.csdn.net/yangshangwei/article/details/131008975