Real-time data system design: Kafka, Flink and Druid

Click "JavaEdge" below and select "Set as Star"

Pay attention to technical information as soon as possible!

Disclaimer~

Don’t overthink any article!

Everything cannot withstand scrutiny, because the world does not have the same growth environment, nor the same level of cognition, and“There is no solution that applies to everyone”< /span>;

Don’t rush to judge the views listed in the article, just put yourself into it and take a moderate look at yourself. You can “jump out and look at the current situation from an outsider’s perspective. What stage are you in before you are a commoner?.

What you think and do is all up to you"Find the path that suits you through constant practice"

0 Preface

For data teams using batch workflows, meeting today's real-time demands isn't easy. Why? Because batch workflows, from data delivery and processing to analysis, involve a lot of waiting.

There is waiting for data to be sent to the ETL tool, waiting for it to be processed in batches, waiting for it to be loaded into the data warehouse, and even waiting for queries to finish running.

However, there is a solution to this problem from the open source world. When used together, Apache Kafka, Flink, and Druid create a real-time data architecture that eliminates all these wait states. In this blog post, we’ll explore how a combination of these tools enables a variety of real-time data applications.

[Figure: Schematic data flow from source to application for Kafka-Flink-Druid]

1 Architecture for building real-time data applications

First, what are real-time data applications? Just consider any UI or API-driven application that uses fresh data to provide real-time insights or decisions. This includes alerts, monitoring, dashboards, analytics, personalized recommendations, and more.

Supporting these workflows requires specialized tools that can handle the entire pipeline, from events to applications. This is where the Kafka-Flink-Druid (KFD) architecture comes in.

[Figure: Open source real-time data architecture]

Large companies with real-time needs, such as Lyft, Pinterest, Reddit, and Paytm, use all three together because they are built on complementary, stream-native technologies that seamlessly deliver the data freshness, scale, and reliability required for real-time use cases.

This architecture makes it simple to build high-throughput, high-QPS real-time data applications such as observability, IoT/telemetry analytics, security detection/diagnostics, and customer-facing insights.

Let's look at each tool in more detail and how they work together.

2 Pipeline: Apache Kafka

Over the past few years, Apache Kafka has become the de facto standard for streaming data. Before it, RabbitMQ, ActiveMQ, and other message queuing systems provided various messaging patterns for distributing data from producers to consumers, but they ran into scale limitations.

Fast forward to today: Kafka has become ubiquitous, with over 80% of Fortune 100 companies using it¹. This is because Kafka's architecture goes far beyond simple messaging. Its versatility makes it ideally suited for stream processing at massive "internet" scale, with the fault tolerance and data consistency to support mission-critical applications, while its wide variety of connectors through Kafka Connect integrate with any data source.
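To make this concrete, below is a minimal sketch of a Java producer publishing the kind of sensor event used later in this post. The broker address and the sensor-readings topic name are illustrative assumptions, not anything prescribed by Kafka.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SensorEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all"); // wait for full replication before acking

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String event = "{\"sensor_id\":\"SensorA\",\"temperature\":22.5,"
                         + "\"timestamp\":\"2023-07-10T10:00:00\"}";
            // Key by sensor_id so each sensor's readings stay ordered within a partition.
            producer.send(new ProducerRecord<>("sensor-readings", "SensorA", event));
        }
    }
}
```

Keying records by sensor ID matters downstream: it keeps each sensor's readings ordered, which stateful stream processing relies on.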


3 Stream processing: Apache Flink

As Kafka delivers real-time data, the right consumers are needed to take advantage of its speed and scale. One popular choice is Apache Flink.

Why choose Flink? First, Flink is a unified batch and stream processing engine that is very powerful at processing large-scale, continuous data streams. As a stream processor for Kafka, Flink is a natural choice because it integrates seamlessly and supports exactly-once semantics, ensuring each event is processed exactly once even in the event of a system failure.
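Exactly-once processing in Flink is tied to its checkpointing mechanism. A minimal sketch of enabling it follows; the 10-second interval is an arbitrary choice, and end-to-end exactly-once delivery to Kafka additionally requires a transactional sink configuration.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Snapshot state every 10 seconds; EXACTLY_ONCE is Flink's default
        // checkpointing mode, made explicit here for clarity.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);
    }
}
```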

Using it is simple: connect to a Kafka topic, define the query logic, and emit results continuously, i.e. "set it and forget it". This makes Flink very flexible for use cases where streams must be processed immediately and with guaranteed reliability.

Here are some common use cases for Flink:

  • Enrichment and transformation

  • Continuous monitoring and alerting

Enrichment and transformation

If a stream requires any data manipulation before consumption (such as modifying, enriching, or restructuring the data), Flink is the ideal engine for making those changes, because continuous processing keeps the data fresh.

For example, let's say we have an IoT/telemetry use case that deals with temperature sensors in smart buildings. Each event passed into Kafka has the following JSON structure:

```json
{
  "sensor_id": "SensorA",
  "temperature": 22.5,
  "timestamp": "2023-07-10T10:00:00"
}
```

If each sensor ID needs to be mapped to a location, and the temperature needs to be expressed in degrees Fahrenheit, Flink can update the JSON structure to:

```json
{
  "sensor_id": "SensorA",
  "location": "Room 101",
  "temperature_fahrenheit": 72.5,
  "timestamp": "2023-07-10T10:00:00"
}
```

The enriched event can then be sent directly to the application or back to Kafka.

![img](https://miro.medium.com/v2/resize:fit:700/0*GZfCTvfyhhQOxZqb.png)

Example of event-based data enrichment using Flink (image courtesy of simply.io)
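Here is a minimal sketch of that enrichment job using Flink's DataStream API and the Kafka connector (Flink 1.14+ style). The topic names, broker address, and the lookupLocation helper are hypothetical placeholders; a production job would typically use a lookup join or broadcast state for the location mapping.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SensorEnrichmentJob {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read raw sensor JSON from the (hypothetical) input topic.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("sensor-readings")
                .setGroupId("sensor-enrichment")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Write enriched JSON back to a second (hypothetical) topic.
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("sensor-readings-enriched")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
           .map(json -> {
               ObjectNode event = (ObjectNode) MAPPER.readTree(json);
               // Enrich: map the sensor ID to a physical location.
               event.put("location", lookupLocation(event.get("sensor_id").asText()));
               // Transform: convert Celsius to Fahrenheit and replace the field.
               double celsius = event.get("temperature").asDouble();
               event.put("temperature_fahrenheit", celsius * 9.0 / 5.0 + 32.0);
               event.remove("temperature");
               return MAPPER.writeValueAsString(event);
           })
           .returns(Types.STRING)
           .sinkTo(sink);

        env.execute("sensor-enrichment");
    }

    // Stand-in for a real lookup table or lookup join.
    private static String lookupLocation(String sensorId) {
        return "SensorA".equals(sensorId) ? "Room 101" : "unknown";
    }
}
```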

Here, one of Flink's strengths is the scale at which it handles huge Kafka streams, reaching millions of events per second, in real time. Furthermore, enrichment/transformation is typically a stateless process: each record can be modified without maintaining persistent state, which keeps it inexpensive and performant.

Continuous monitoring and alerting

Flink's combination of real-time continuous processing and fault tolerance also makes it an ideal solution for real-time detection and response in a variety of critical applications.

When detection must be highly sensitive (think sub-second) and the sampling rate is also high, Flink's continuous processing is ideally suited to serve as a data service layer that monitors conditions and triggers the corresponding alerts and actions.

One advantage of Flink when it comes to alerting is that it supports both stateless and stateful alerts. Thresholds or event triggers, like "notify the fire department when the temperature reaches X", are straightforward but not always smart enough. In use cases that require tracking state across a continuous stream of data to identify deviations and anomalies, Flink's stateful processing can maintain and update that state as events arrive.
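As an illustration of the stateful case, here is a minimal sketch using Flink's KeyedProcessFunction with keyed state. The SensorReading type and the 10-degree jump threshold are hypothetical; the stream would be keyed by sensor ID, e.g. stream.keyBy(r -> r.sensorId).process(new TemperatureJumpAlert()).

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Stateful alert: flag a sensor whose temperature jumps more than a
// (hypothetical) 10-degree threshold between consecutive readings.
public class TemperatureJumpAlert
        extends KeyedProcessFunction<String, TemperatureJumpAlert.SensorReading, String> {

    public static class SensorReading {
        public String sensorId;
        public double temperature;
    }

    private transient ValueState<Double> lastTemperature;

    @Override
    public void open(Configuration parameters) {
        // Keyed state: one "last seen" temperature per sensor ID.
        lastTemperature = getRuntimeContext().getState(
                new ValueStateDescriptor<>("last-temperature", Double.class));
    }

    @Override
    public void processElement(SensorReading reading, Context ctx, Collector<String> out)
            throws Exception {
        Double previous = lastTemperature.value();
        if (previous != null && Math.abs(reading.temperature - previous) > 10.0) {
            out.collect("ALERT: " + reading.sensorId + " jumped from "
                    + previous + " to " + reading.temperature);
        }
        lastTemperature.update(reading.temperature);
    }
}
```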

One thing to consider is that monitoring and alerting with Flink consumes CPU continuously (and therefore continuous cost and resources) to evaluate conditions against thresholds and patterns, unlike databases that only use CPU during query execution. So it is worth knowing up front whether that continuity is actually required.

4 Real-time analytics: Apache Druid

Apache Druid is the final piece of the data architecture puzzle, joining Kafka and Flink as a stream consumer to power real-time analytics. Although it is a database for analytics, its design center and purpose differ from those of other databases and data warehouses.

First of all, Druid is like a sibling of Kafka and Flink: it too is stream-native. In fact, it connects directly to Kafka topics, with no connector required, and supports exactly-once semantics. Druid is also designed for ingesting streaming data quickly at scale and querying events in memory immediately as they arrive.

[Figure: Druid's ingestion process is natively designed for event-by-event ingestion]

On the query side, Druid is a high-performance, real-time analytics database that delivers sub-second queries at scale and under load. If a use case is performance-sensitive, needs to handle terabytes to petabytes of data (with aggregations, filters, GROUP BYs, complex joins, and so on), and faces high query volume, then Druid is an ideal database: it consistently provides lightning-fast queries and scales easily from a single laptop to clusters of thousands of nodes.

This is why Druid is called a real-time analytics database: it's ideal when real-time data feeds real-time queries.
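To give a flavor of the query side, here is a sketch of querying Druid SQL from Java over Druid's Avatica JDBC endpoint (the broker's default port is 8082, and the Avatica JDBC driver must be on the classpath). The sensor_readings datasource and its column names are assumptions carried over from the earlier enrichment example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class DruidQueryExample {
    public static void main(String[] args) throws Exception {
        // Druid exposes SQL through the Avatica JDBC driver on the broker.
        String url = "jdbc:avatica:remote:url=http://localhost:8082/druid/v2/sql/avatica/";
        try (Connection conn = DriverManager.getConnection(url, new Properties());
             Statement stmt = conn.createStatement();
             // Hourly average temperature per location over the last day;
             // the __time filter hits Druid's time partitioning, so it stays fast.
             ResultSet rs = stmt.executeQuery(
                     "SELECT TIME_FLOOR(__time, 'PT1H') AS hr, " +
                     "       location, AVG(temperature_fahrenheit) AS avg_temp " +
                     "FROM sensor_readings " +
                     "WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY " +
                     "GROUP BY 1, 2")) {
            while (rs.next()) {
                System.out.printf("%s %s %.1f%n",
                        rs.getString("hr"), rs.getString("location"),
                        rs.getDouble("avg_temp"));
            }
        }
    }
}
```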

Here’s how Druid complements Flink:

  • Highly interactive queries

  • Real-time and historical data

Highly interactive queries

Engineering teams use Druid to power analytics applications. These are data-intensive applications that include both internal (i.e., operational) and external (i.e., customer-facing) use cases, covering areas such as observability, security, product analytics, IoT/telemetry, manufacturing operations, and more. Applications powered by Druid typically have the following characteristics:

  • **Performance at scale:** Analytical queries on large data sets require no precomputation. Druid delivers extremely high performance even when the application's users arbitrarily group, filter, and slice/dice terabytes to petabytes of data.

  • **High query volume:** High QPS is required for analytical queries. An example is any externally facing application, i.e. a data product, that must meet a sub-second SLA for a workload of 100 to 1,000 (distinct) concurrent queries.

  • **Time series data:** Applications that need to provide insights on data with a time dimension (a strength of Druid, though not a limitation). Druid processes time series data very quickly thanks to its time partitioning and data format, which makes time-based WHERE filters extremely fast.

These applications either have a highly interactive UI for data visualization and slicing result sets, with the flexibility to change queries at runtime (because Druid is so fast), or, in many cases, they leverage Druid's API to deliver queries at sub-second speeds inside large-scale decision-making workflows.

Here is an example of an analytics application powered by Apache Druid:

[Figure: Confluent Health+ is powered by Apache Druid]

Confluent, the original creator of Apache Kafka, provides analytics services to its customers through Confluent Health+. The application above is highly interactive and offers rich insights into the customer's Confluent environment. Behind the scenes, events flow into Kafka and Druid at a rate of 5 million events per second, while the application serves 350 queries per second.

Real-time and historical data

While the example above shows Druid supporting a highly interactive analytics application, you may be wondering: "How does streaming data relate to this?" It is a good question, because Druid is not limited to streaming data; it is also great at ingesting large batches of files.

However, what makes Druid relevant in a real-time data architecture is that it can provide an interactive data experience on real-time data combined with historical data for richer context.

While Flink is good at answering "what is happening now" (i.e., emitting the current state of a Flink job), Druid can technically answer "what is happening now, how does it compare to before, and what factors or conditions affected that outcome". Together these questions are powerful: they can, for example, eliminate false positives, help detect new trends, and lead to deeper real-time decisions.

Answering "how does it compare to before" requires historical context (a day, a week, a year, or some other time frame) to make correlations, and "which factors or conditions affected the outcome" requires mining the entire data set. Because Druid is a real-time analytics database, it ingests streams to provide real-time insights, but it also persists the data, so historical data and all other dimensions remain available for ad-hoc exploration.

[Figure: Apache Druid extends real-time ingestion, mapping topics to ingestion tasks]

For example, let's say we're building an application that monitors secure logins for suspicious behavior. We might want to set a threshold within a 5-minute window, i.e. update and emit the state of login attempts as they arrive. That is easy with Flink. With Druid, however, current login attempts can also be correlated with historical data to identify similar login spikes in the past that had no security issues. The historical context here helps determine whether the current spike indicates a problem or is just normal behavior.
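The Flink half of that example, a tumbling 5-minute window counting failed logins per user, might look like the following sketch. The LoginEvent shape and the threshold of 5 failures are assumptions; in practice the stream would come from Kafka, and the Druid half would correlate these alerts with historical login data.

```java
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class FailedLoginAlertJob {

    public static class LoginEvent {
        public String username;
        public boolean success;
        public LoginEvent() {}
        public LoginEvent(String username, boolean success) {
            this.username = username;
            this.success = success;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in source; in practice this stream would be consumed from Kafka.
        DataStream<LoginEvent> logins = env.fromElements(
                new LoginEvent("alice", false),
                new LoginEvent("alice", false),
                new LoginEvent("bob", true));

        logins.filter(e -> !e.success)
              .keyBy(e -> e.username)
              .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
              .aggregate(
                  // Count failed logins per user within each window.
                  new AggregateFunction<LoginEvent, Long, Long>() {
                      @Override public Long createAccumulator() { return 0L; }
                      @Override public Long add(LoginEvent e, Long acc) { return acc + 1; }
                      @Override public Long getResult(Long acc) { return acc; }
                      @Override public Long merge(Long a, Long b) { return a + b; }
                  },
                  // Emit an alert when the count crosses a hypothetical threshold.
                  new ProcessWindowFunction<Long, String, String, TimeWindow>() {
                      @Override
                      public void process(String user, Context ctx,
                                          Iterable<Long> counts, Collector<String> out) {
                          long failures = counts.iterator().next();
                          if (failures >= 5) {
                              out.collect("ALERT: " + user + " had " + failures
                                      + " failed logins in one 5-minute window");
                          }
                      }
                  })
              .print();

        env.execute("failed-login-alerts");
    }
}
```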

So when an application needs to provide rich analytics on changing events (current state, various aggregations, groupings, time windows, complex joins, and so on) while also providing historical context and exploring that data set through a highly flexible API, that is where Druid is strongest.

5 Flink and Druid Checklist

Both Flink and Druid are built for streaming data. While they share some high-level similarities (both are in-memory, both can scale, both can parallelize), their architectures are built for completely different use cases, as we saw above.

Here is a simple decision list based on workload:

  1. Do you need to transform or join data in real time on streaming data? Look at Flink: real-time stream processing is exactly what it is designed for.

  2. Do you need to support many different queries simultaneously? Look at Druid: it supports high-QPS analytics without the need to manage queries/jobs.

  3. Do metrics need to be continuously updated or aggregated? Look at Flink: it supports stateful complex event processing.

  4. Is the analysis more complex, requiring historical data for comparison? Look at Druid: it makes querying real-time data alongside historical data easy and fast.

  5. Are you powering a user-facing application or data visualization? Use Flink for enrichment, then send the data to Druid as the data-serving layer.

In most cases, the answer is not Druid or Flink, but Druid and Flink. The technical features they each offer make them collectively well-suited to support a variety of real-time data applications.

6 Conclusion

Businesses increasingly need real-time data from their data teams. This means data workflows need to be rethought from start to finish. This is why many companies consider Kafka-Flink-Druid as the de facto open source data architecture for building real-time data applications. They are the perfect three musketeers.

To try out the Kafka-Flink-Druid architecture, you can download the open source projects (Kafka, Flink, Druid), or simply get free trials of Confluent Cloud and Imply Polaris, the cloud services for Kafka-Flink (Confluent) and Druid (Imply), respectively.
