In-depth explanation of Druid system

Introduction

Druid is an open-source, distributed, column-oriented storage system suited to real-time data analysis. It supports fast aggregation, flexible filtering, millisecond-level queries, and low-latency data ingestion.

  • Druid was designed with high availability in mind: the failure of individual nodes does not stop Druid from working (although cluster state stops being updated);
  • The components in Druid are loosely coupled; if real-time data is not needed, the real-time nodes can be ignored entirely;
  • Druid uses bitmap indexes to accelerate queries over column storage, and compresses those indexes with the CONCISE algorithm, making the generated segments much smaller than the original text files;
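
To make the bitmap-index point concrete, here is a minimal illustrative sketch (not Druid's actual code, and without CONCISE compression): each distinct column value maps to a bitmap, so a filter becomes a cheap bitwise operation over bitmaps instead of a row scan.

```python
# Illustrative sketch: a per-value bitmap index over a string column.
# A Python int serves as the bit set; bit r is set if row r holds the value.
from collections import defaultdict

def build_bitmap_index(column):
    """Map each distinct value to a bitmap over the rows containing it."""
    index = defaultdict(int)
    for row, value in enumerate(column):
        index[value] |= 1 << row
    return index

city = ["SF", "NY", "SF", "LA", "NY", "SF"]
idx = build_bitmap_index(city)

# Filter city == "SF" OR city == "LA": a single bitwise OR of two bitmaps.
mask = idx["SF"] | idx["LA"]
matching_rows = [r for r in range(len(city)) if mask >> r & 1]
print(matching_rows)  # [0, 2, 3, 5]
```

Druid applies the same idea per segment, with CONCISE-compressed bitmaps so sparse or dense bitmaps cost little memory.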

Architecture

Overall structure

A Druid cluster contains different types of nodes, and each node is designed to do a certain set of things. Such a design can isolate concerns and simplify the overall system complexity.

Different nodes operate almost independently and have minimal interaction with other nodes, so communication failures within the cluster have very little impact on data availability.

The composition and data flow of the Druid cluster are shown in Figure 1:

Druid itself contains five types of nodes: Realtime, Historical, Coordinator, Broker, Indexer

  • Historical nodes store and serve queries over "historical" (non-real-time) data. They load data segments from deep storage, respond to query requests from broker nodes, and return the results.
    A historical node keeps local copies of the segments it serves, so even if deep storage becomes inaccessible, it can still answer queries over the segments it has already downloaded.
  
  • Realtime nodes store and serve queries over real-time data. They also respond to query requests from broker nodes and return results.
    Real-time nodes periodically build the data they have collected into segments and hand them off to historical nodes.
  • Coordinator nodes can be considered the master of a Druid cluster. They manage historical and real-time nodes through Zookeeper, and manage segments through metadata stored in MySQL.
  • Broker nodes respond to external query requests. By consulting Zookeeper, a broker determines which historical and real-time nodes serve which data, forwards the request to those nodes, and finally merges the partial results and returns them to the caller.
  • Indexer nodes are responsible for data ingestion: they load batch and real-time data into the system, and can also modify data already stored in it.

Druid has three external dependencies: MySQL, deep storage, and Zookeeper.

  • MySQL:
    stores metadata about Druid rather than the actual data. It contains three tables:
    "druid_config" (usually empty), "druid_rules" (rule information used by the coordinator nodes, such as which node should load which segment), and "druid_segments" (metadata for each segment);
  • Deep storage: stores segments. Druid currently supports local disk, NFS-mounted disks, HDFS, S3, and others.
    Data in deep storage comes from two sources: batch ingestion and real-time nodes;
  • ZooKeeper: used by Druid to manage the current cluster state, for example recording which segments have been moved from real-time nodes to historical nodes;

Real-time node

Real-time nodes encapsulate the functions of ingesting and querying event data; event data imported through these nodes can be queried immediately. A real-time node only cares about event data within a short time window, and periodically ships the batch of data collected during that window to deep storage. Real-time nodes announce their online status and the data they serve via Zookeeper.

As shown in Figure 2, a real-time node buffers event data in an in-memory index and persists it to disk at regular intervals. Queries hit both the in-memory index and the persisted indexes. Every real-time node periodically runs a background task that merges the local persisted indexes into a single block of immutable data containing all events the node ingested over a span of time; such a block is called a "segment". During the hand-off phase, the real-time node uploads the segment to a permanent backup store, usually a distributed file system such as S3 or HDFS, which Druid calls "deep storage".
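
The ingest / persist / merge-and-hand-off lifecycle above can be sketched as follows. This is a hypothetical, much-simplified model (Python lists stand in for indexes, segments, and deep storage), not Druid's implementation:

```python
# Simplified model of a real-time node's lifecycle for one time window.
class RealtimeNode:
    def __init__(self):
        self.in_memory = []   # current in-memory index
        self.persisted = []   # indexes already flushed to local disk

    def ingest(self, event):
        self.in_memory.append(event)

    def persist(self):
        """Periodic task: flush the in-memory index to disk."""
        if self.in_memory:
            self.persisted.append(list(self.in_memory))
            self.in_memory.clear()

    def merge_and_hand_off(self, deep_storage):
        """Background task: merge persisted indexes into one immutable
        segment and upload it to deep storage (e.g. S3 or HDFS)."""
        self.persist()
        segment = [e for index in self.persisted for e in index]
        deep_storage.append(segment)
        self.persisted.clear()
        return segment

    def query(self):
        # Queries hit both the persisted indexes and the in-memory index.
        return [e for index in self.persisted for e in index] + self.in_memory

deep_storage = []
node = RealtimeNode()
node.ingest({"t": 1, "clicks": 3})
node.persist()
node.ingest({"t": 2, "clicks": 5})
print(len(node.query()))                # 2: one persisted + one in-memory event
segment = node.merge_and_hand_off(deep_storage)
print(len(segment), len(deep_storage))  # 2 1
```

After hand-off the segment becomes the historical nodes' responsibility, which is why a real-time node only needs to hold a short window of data.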

Historical node

Historical nodes follow a shared-nothing architecture, so there is no single point of failure among them. The nodes are independent of one another and the service they provide is simple: they only need to know how to load, drop, and serve segments. Like real-time nodes, historical nodes announce their online status and the data they serve in Zookeeper. Instructions to load and drop segments are also issued through Zookeeper; each instruction tells the node where the segment lives in deep storage and how to decompress and process it.

As shown in Figure 3, before a historical node downloads a segment from deep storage, it first checks its local cache to see whether the segment is already present. If the segment is not in the cache, the historical node downloads it from deep storage to local disk. Once this stage completes, the segment is announced in Zookeeper, at which point it can be queried. The segment must be loaded into memory before querying.
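
This load path can be sketched as a cache-then-download-then-announce routine. A hedged sketch, with dicts and a set standing in for local disk, deep storage, and the Zookeeper announcement:

```python
# Sketch of the historical node's segment load path.
def load_segment(segment_id, local_cache, deep_storage, announced):
    if segment_id not in local_cache:
        # Cache miss: pull the segment from deep storage to local disk.
        local_cache[segment_id] = deep_storage[segment_id]
    # Only after the segment is local is it announced (in Zookeeper, in
    # the real system) and thereby made queryable.
    announced.add(segment_id)
    return local_cache[segment_id]

deep_storage = {"seg-2023-01": b"...columnar data..."}
cache, announced = {}, set()
load_segment("seg-2023-01", cache, deep_storage, announced)
load_segment("seg-2023-01", cache, deep_storage, announced)  # served from cache
print("seg-2023-01" in announced, len(cache))  # True 1
```

The second call never touches deep storage, which is also why a historical node keeps answering queries over already-synced segments even when deep storage is down.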

Coordinator node

The coordinator node is mainly responsible for managing and distributing segments across historical nodes. It tells historical nodes to load new data, drop expired data, replicate data, and move data for load balancing. To maintain a stable view, Druid uses a multi-version concurrency control swap protocol to manage immutable segments: if the data in a segment has been completely obsoleted by newer segments, the outdated segment is unloaded from the cluster. The coordinator nodes run a leader election to pick the single node that performs the coordination function; the remaining coordinator nodes serve as redundant backups.
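
The multi-version "overshadowing" rule can be illustrated with a small sketch: for each time interval, only the segment with the newest version is served, and fully overshadowed segments are dropped. This is a toy model of the idea, not the coordinator's actual data structures:

```python
# For each interval, keep only the segment with the highest version.
def served_segments(segments):
    """segments: list of (interval, version, segment_id) tuples."""
    newest = {}
    for interval, version, seg_id in segments:
        if interval not in newest or version > newest[interval][0]:
            newest[interval] = (version, seg_id)
    return {interval: seg_id for interval, (_, seg_id) in newest.items()}

segments = [
    ("2023-01-01/2023-01-02", "v1", "seg-a"),
    ("2023-01-01/2023-01-02", "v2", "seg-b"),  # re-indexed: overshadows seg-a
    ("2023-01-02/2023-01-03", "v1", "seg-c"),
]
print(served_segments(segments))
# {'2023-01-01/2023-01-02': 'seg-b', '2023-01-02/2023-01-03': 'seg-c'}
```

Here seg-a would be unloaded from the cluster, since seg-b covers the same interval at a newer version.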

Broker node

Broker nodes act as query routers to the historical and real-time nodes. A broker knows which segments are published in Zookeeper and can route an incoming query to the correct historical or real-time nodes. Broker nodes also merge the partial results from historical and real-time nodes before returning the final merged result to the caller. Each broker node contains a cache with an LRU eviction policy.

As shown in Figure 4, every time a broker node receives a query, it first maps the query to a set of segments. Results for some of those segments may already exist in the cache and need not be recomputed. For results not in the cache, the broker forwards the query to the correct historical and real-time nodes; once the historical nodes return their results, the broker caches them for future use. This process is illustrated in Figure 6. Real-time data is never cached, so queries over real-time data are always forwarded to the real-time nodes: real-time data changes constantly, making caching it unreliable.
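
The broker's per-segment caching and merging can be sketched as below. A plain dict stands in for the LRU cache, and `run_on_data_node` is a placeholder for the fan-out to historical/real-time nodes:

```python
# Sketch of broker-side per-segment caching: historical results are
# cached, real-time results never are.
def broker_query(query, segments, cache, run_on_data_node):
    results = []
    for seg in segments:
        key = (query, seg["id"])
        if seg["realtime"]:
            # Real-time data changes constantly: always recompute.
            results.append(run_on_data_node(query, seg))
        elif key in cache:
            results.append(cache[key])   # cache hit, no recomputation
        else:
            partial = run_on_data_node(query, seg)
            cache[key] = partial         # cache the historical result
            results.append(partial)
    return sum(results)                  # merge the partial results

calls = []
def run(query, seg):
    calls.append(seg["id"])              # record which nodes were queried
    return seg["value"]

segs = [{"id": "h1", "realtime": False, "value": 10},
        {"id": "rt", "realtime": True,  "value": 5}]
cache = {}
print(broker_query("sum", segs, cache, run))  # 15
print(broker_query("sum", segs, cache, run))  # 15 (h1 from cache, rt re-run)
print(calls)  # ['h1', 'rt', 'rt']
```

On the second query, only the real-time segment is recomputed; the historical segment's partial result comes straight from the cache.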

Indexer node

The indexing service is a highly available, distributed service for running indexing tasks. It creates (and sometimes destroys) Druid segments, and it has a master/slave-like architecture.

The indexing service consists of three main components: peons, each of which runs a single task; middle managers, which manage peons; and the overlord, which distributes tasks to the middle managers. The overlord and the middle managers may run on the same node or across multiple nodes, while a middle manager and its peons always run on the same node.
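
The division of labor among the three components can be sketched as a simple dispatch chain. This is a hypothetical model (the class names mirror the component names; the round-robin choice is a stand-in for the overlord's real task distribution):

```python
# Toy model of the indexing service's task flow: overlord -> middle
# manager -> peon.
class Peon:
    def run(self, task):
        return f"segment built by task {task}"

class MiddleManager:
    def run_task(self, task):
        # Peons always run on the same node as their middle manager.
        return Peon().run(task)

class Overlord:
    def __init__(self, middle_managers):
        self.middle_managers = middle_managers

    def submit(self, task):
        # Trivial stand-in for the overlord's task distribution logic.
        mm = self.middle_managers[hash(task) % len(self.middle_managers)]
        return mm.run_task(task)

overlord = Overlord([MiddleManager(), MiddleManager()])
print(overlord.submit("index_kafka_wiki"))
```

Because peons are spawned per task, a crashed task takes down only its own process, not the middle manager or the overlord.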

ZooKeeper

Druid uses ZooKeeper (ZK) to manage the current cluster status. The operations that occur on ZK are:

1. Coordinator leader election

2. Segment publishing protocol for historical and real-time nodes

3. Segment load/drop protocol between coordinator and historical nodes

4. Overlord leader election

5. Indexing service task management
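
The two leader elections in the list follow the usual Zookeeper pattern: each candidate creates an ephemeral sequential znode, and the holder of the lowest sequence number leads. A toy simulation of that rule (a dict stands in for the znodes; real code would use a ZK client):

```python
# Simulated Zookeeper-style leader election among coordinator nodes.
def elect_leader(candidates):
    """candidates: node name -> sequence number of its ephemeral znode."""
    return min(candidates, key=candidates.get)

znodes = {"coordinator-a": 12, "coordinator-b": 7, "coordinator-c": 31}
leader = elect_leader(znodes)
print(leader)  # coordinator-b holds the lowest sequence number

# If the leader dies, its ephemeral znode disappears automatically and
# the next-lowest sequence number takes over.
del znodes[leader]
print(elect_leader(znodes))  # coordinator-a
```

The same mechanism gives the remaining coordinator (and overlord) nodes their role as hot backups: failover needs no manual intervention.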


Druid vs other systems

Druid vs Impala/Shark

The comparison between Druid, Impala, and Shark basically comes down to what kind of system each is designed to be.

Druid is designed for:

  • Always-on service
  • Get real-time data
  • Handle slice-n-dice style on-the-fly queries

Query speeds vary:

  • Druid uses column storage. Data is compressed and organized into index structures; compression reduces the footprint of data in RAM, allowing more data to be held in RAM for fast access.
    The index structures mean that when filters are added to a query, Druid has less work to do, so the query runs faster.
  • Impala/Shark can be thought of as a daemon caching layer on top of HDFS.
    But they do not go beyond caching to truly improve query speed.

Data acquisition is different:

  • Druid can get real-time data.
  • Impala/Shark is based on HDFS or other backing storage, which limits the speed of data acquisition.

The query takes different forms:

  • Druid supports time series and groupby style queries, but not joins.
  • Impala/Shark supports SQL-style queries.
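
As an example of the query forms Druid does support, here is a native timeseries query expressed as JSON. The datasource and field names ("wikipedia", "count") are placeholders; in a live cluster this object would be POSTed to the broker's `/druid/v2` endpoint:

```python
# Build a Druid native timeseries query as a JSON object.
import json

query = {
    "queryType": "timeseries",
    "dataSource": "wikipedia",               # placeholder datasource
    "granularity": "day",
    "intervals": ["2023-01-01/2023-01-08"],  # ISO-8601 interval
    "aggregations": [
        # Sum the "count" column into an output field named "edits".
        {"type": "longSum", "name": "edits", "fieldName": "count"}
    ],
}
print(json.dumps(query, indent=2))
```

Note there is no join clause anywhere in the query model: every query runs against a single datasource, which is the trade-off behind Druid's speed.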

Druid vs Elasticsearch

Elasticsearch (ES) is a search server based on Apache Lucene. It provides full-text search and access to raw event-level data, and it also supports analysis and aggregation. By some measurements, ES uses more resources than Druid for data ingestion and aggregation.

Druid focuses on OLAP workflows. Druid is optimized for high performance (fast aggregation and fetching) at low cost and supports a wide range of analysis operations. Druid provides some basic search support for structured event data.

Druid vs Spark

Spark is a cluster computing framework built around the concept of resilient distributed datasets (RDDs), and can be regarded as a back-end analysis platform. RDDs enable data reuse by keeping intermediate results in memory, giving Spark fast iterative computation. This is especially beneficial for certain workflows, such as machine learning, where the same operation may be applied repeatedly until the results converge. Spark gives analysts the ability to run queries and analyze large amounts of data with a variety of different algorithms.

Druid focuses on data ingestion and on serving queries over that data; with a web interface on top, users can explore the data freely.


Suitable scenarios for Druid to be used as a database:

  • Inserts are frequent, but updates are rare.
  • Most queries are aggregations and reports, with some search and scan queries.
  • Query latencies of roughly 100 ms to a few seconds are expected.
  • The data has a time component (Druid is specially optimized and designed around time).
  • Each query selects a few fields from one large table (joins are not supported).
  • There are high-cardinality columns (e.g. URLs, user IDs) that need fast counting and ranking.
  • Data is ingested from Kafka, HDFS, local files, or Amazon S3.

Scenarios not suitable for Druid

  • Low-latency updates to existing records by primary key. Druid supports streaming inserts, but not streaming updates (updates can be applied through background batch jobs).
  • An offline reporting system where data latency is not a concern.
  • Large joins are needed (joining one big fact table to another), and queries may take a long time to complete.


TopN query

Reference: TopN · Apache Druid Chinese technical documentation (www.apache-druid.cn/Querying/topn.html)

An Apache Druid TopN query returns a sorted result set over the values of a given dimension according to some criterion. Conceptually, a TopN can be viewed as an approximate GroupBy query over a single dimension with an ordering applied; for this scenario, TopN queries are more efficient than GroupBy queries. These queries take a topN query object and return an array of JSON objects, where each object represents a value requested by the topN query.

TopN is an approximate query because each data process sorts its own top K results and returns only those top K to the broker. In Druid, K defaults to max(1000, threshold). In practice, this means that if you ask for the top 1,000 results, the accuracy of roughly the first 900 is 100%, while the ordering of the remaining results is not guaranteed. TopNs can be made more accurate by increasing the threshold.
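
Why the per-node top-K merge is only approximate can be shown with a toy simulation (the data and node layout are invented for illustration): a value that narrowly misses each node's local top K is dropped before the broker ever sees it, even if its global count would have ranked it.

```python
# Toy simulation of Druid's approximate TopN merge across data processes.
from collections import Counter

def topn(node_counts, k):
    """Merge each node's local top-k, then take the global top-k."""
    merged = Counter()
    for counts in node_counts:
        for value, count in Counter(counts).most_common(k):
            merged[value] += count
    return [v for v, _ in merged.most_common(k)]

node_counts = [
    {"a": 100, "b": 60, "c": 59},  # node 1: "c" just misses the local top 2
    {"a": 100, "b": 60, "c": 59},  # node 2: same distribution
]
# With k=2, "c" (global count 118) is dropped in favor of "b" (120),
# even though the margin is tiny; a larger threshold keeps it.
print(topn(node_counts, k=2))  # ['a', 'b']
print(topn(node_counts, k=3))  # ['a', 'b', 'c']
```

This is exactly the trade-off the threshold parameter controls: a larger K means each data process returns more candidates, shrinking the window in which ordering errors can occur.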

Copyright notice: reproduced from "An in-depth explanation of the Druid system" on Zhihu.

Origin: blog.csdn.net/TangYuG/article/details/132756011