A Detailed Look at Druid

Druid's efficient architecture

We know that Druid can provide both real-time ingestion of large data sets and efficient, complex queries at the same time. The main reason is its unique architectural design and its data storage structure built around datasources and segments. Next, we will take an in-depth look at Druid's architecture from two angles: data storage and system node architecture.

Data storage

Druid organizes data into a read-optimized structure, which is key to its ability to support interactive queries. Data in Druid is stored in what is called a datasource, similar to a table in an RDBMS. Each datasource is partitioned by time and can optionally be partitioned further by other attributes. Each time range is called a chunk (for example, if you partition by day, one chunk holds one day of data). Within a chunk, the data is divided into one or more segments; the segment is the actual physical storage structure, while datasource and chunk are only logical concepts. Each segment is a single file, typically containing up to a few million rows. Because segments are organized into chunks by time, queries that filter on time are very efficient.
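As a rough illustration (the datasource name "wikipedia" is hypothetical, and the dates match the partitioning example discussed below), a datasource partitioned by day might be laid out like this:

wikipedia (datasource)
 ├── chunk 2000-01-01/2000-01-02
 │    └── segment (shard 0)
 ├── chunk 2000-01-02/2000-01-03
 │    ├── segment (shard 0)
 │    └── segment (shard 1)      // this day has more data, so it is split into two shards
 └── chunk 2000-01-03/2000-01-04
      └── segment (shard 0)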

Datasource

Data partitioning

Any distributed storage/compute system needs to partition data reasonably in order to balance storage and computation and to process data in parallel. Druid handles event data, and every record carries a timestamp, so it is natural to partition by time. For example, in the figure above the partition granularity is set to one day, so each day's data is stored and queried separately (see below for why a single time partition can contain multiple segments).
When partitioning by time, an obvious problem comes to mind: the amount of data in each time period is likely to be uneven (think of our own business scenarios). Druid provides "secondary partitioning" to solve this; each secondary partition is called a shard (this is the physical partition). Sharding is controlled by setting a target number of rows per shard and a shard strategy. Druid currently supports two shard strategies: hash (hashing on dimension values) and range (based on the value range of a particular dimension). In the figure above, 2000-01-01 and 2000-01-03 each have a single shard, while 2000-01-02 has two shards because it contains more data.
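A sketch of what the secondary partitioning looks like in an ingestion spec (field names follow the Hadoop-based ingestion documentation linked at the end of this article; the dimension name and target size are only examples):

# hash-based sharding: rows are hashed across shards
"partitionsSpec": {
    "type": "hashed",
    "targetPartitionSize": 5000000
}

# range-based (single-dimension) sharding: shards split on the value range of one dimension
"partitionsSpec": {
    "type": "single_dim",
    "partitionDimension": "city",
    "targetPartitionSize": 5000000
}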

Segment

Once a shard is persisted, it is called a segment. The segment is the basic unit of data storage, replication, balancing (load balancing across Historicals) and computation. Segments are immutable: once a segment has been created (published by the MiddleManager node), it cannot be modified; the only way to replace an old segment is to generate a new version of it.

Segment internal storage structure

Next, let's look at the internal storage structure of a segment file. Because Druid uses columnar storage, each column is stored in an independent structure (not an independent file; all columns of a segment end up in one file). The columns in a segment fall into three types: the timestamp column, dimension columns and metric columns.


Segment data column

For the timestamp column and metric columns, the actual storage is an array, and Druid compresses each column's integer or floating-point values with LZ4. When a query arrives, only the required columns are read (unneeded columns are never pulled in) and decompressed; the aggregation functions are then applied to the decompressed values.
Dimension columns are not as simple as the metric and timestamp columns, because they must support filtering and group-by, so Druid stores each dimension column using dictionary encoding (Dictionary Encoding) and bitmap indexes (Bitmap Index). Each dimension column needs three data structures:

  1. A dictionary that maps dimension values (dimension values are always treated as strings) to integer IDs.
  2. A list of the column's values, encoded with the dictionary above.
  3. For each distinct value in the column, a bitmap that indicates which rows contain that value.

Druid uses these three data structures for dimension columns because:

  1. The dictionary maps strings to integer IDs, which allows the values in structures 2 and 3 to be represented compactly.
  2. The bitmap indexes allow fast filtering (finding the row numbers that match a condition, which reduces the amount of data to read), because bitmaps support very fast AND and OR operations.
  3. Group-by and TopN operations need the value list in structure 2.

Let's take the "Page" dimension column above as an example and see in detail how Druid uses these three data structures to store a dimension column:

1. Use a dictionary to map column values to integer IDs
{
"Justin Bieber": 0,
"Ke$ha": 1
}
2. Using the encoding from step 1, put the column values into a list
[0,0,1,1]
3. Use a bitmap to mark the rows that contain each distinct value
value = 0: [1,1,0,0] // 1 means the row contains this value, 0 means it does not
value = 1: [0,0,1,1]
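To see how these structures are used at query time, consider a hypothetical topN query that filters on Page = "Ke$ha": the bitmap for value 1 ([0,0,1,1]) immediately tells Druid that only rows 3 and 4 need to be read, and the value list from structure 2 drives the grouping. A sketch of such a query (the datasource and metric names are made up for this example):

{
    "queryType": "topN",
    "dataSource": "wikipedia",
    "intervals": ["2000-01-01/2000-01-03"],
    "granularity": "all",
    "dimension": "page",
    "filter": { "type": "selector", "dimension": "page", "value": "Ke$ha" },
    "aggregations": [ { "type": "longSum", "name": "edits", "fieldName": "edits" } ],
    "metric": "edits",
    "threshold": 10
}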

The following figure takes an advertiser column as an example and shows its actual storage structure:


advertiser column value storage

In the worst case (every row has a different value), the first two structures grow linearly with the number of rows, while the third is stored as bitmaps (essentially a sparse matrix), which compress very well and achieve a considerable compression ratio. Druid also uses Roaring Bitmap ( http://roaringbitmap.org/ ), which can perform Boolean operations directly on the compressed bitmaps, greatly improving both query efficiency and storage efficiency (no decompression needed).

Segment naming

Efficient querying is not only a matter of the file's internal storage structure; the naming of the files is also very important. Imagine a datasource with millions of segment files: how do we quickly find the ones we need? The answer is that the segment name itself encodes the information needed to locate it.
A segment name contains four parts: the datasource, the time interval (start time and end time), the version number, and the partition number (present only when the segment has multiple shards).

test-datasource_2018-05-21T16:00:00.000Z_2018-05-21T17:00:00.000Z_2018-05-21T16:00:00.000Z_1
datasource-name_start-time_end-time_version_partition-number

The partition number starts from 0; if it is 0 it can be omitted: test-datasource_2018-05-21T16:00:00.000Z_2018-05-21T17:00:00.000Z_2018-05-21T16:00:00.000Z. Note also that if the segment for a time interval consists of multiple shards, queries on that interval must wait until all shards are loaded before they can be served (unless a linear shard spec is used, which allows querying before loading is complete).

  • datasource (required): the datasource the segment belongs to.
  • start time (required): the earliest data in the segment, in ISO 8601 format. The start and end times together form the interval determined by segmentGranularity.
  • end time (required): the latest data in the segment, in ISO 8601 format.
  • version (required): Druid supports batch overwriting; when data for the same datasource and time interval is re-ingested, the new data overwrites the old and the version number is updated. Once the rest of the Druid cluster notices the new version, it drops the old segments and switches to the new ones (the switch is very fast). The version is also an ISO 8601 timestamp, representing the time the ingestion task was first started.
  • partition number (optional): present only when the segment is partitioned into shards.

Segment physical storage example

Let's look at an example of how a segment is actually stored on disk. We import the following data into Druid using local batch ingestion.

{"time": "2018-11-01T00:47:29.913Z", "city": "beijing", "sex": "man", "gmv": 20000}
{"time": "2018-11-01T00:47:33.004Z", "city": "beijing", "sex": "woman", "gmv": 50000}
{"time": "2018-11-01T00:50:33.004Z", "city": "shanghai", "sex": "man", "gmv": 10000}

We run Druid in stand-alone mode, so the Segment files generated by Druid are in the ${DRUID_HOME}/var/druid/segments directory.

Segment directory

A segment is uniquely identified by datasource_beginTime_endTime_version_shard, and in actual storage this identifier appears as a directory hierarchy.

Segment directory

You can see that the segment consists of a description file (descriptor.json) and the compressed data file (index.zip). We are mainly interested in index.zip, so let's unzip it.
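descriptor.json is simply the segment's metadata record (the same object that ends up in the metadata store). For the sample data above it would look roughly like the following; the datasource name, version timestamp, path and size are illustrative:

{
    "dataSource": "test-datasource",
    "interval": "2018-11-01T00:00:00.000Z/2018-11-02T00:00:00.000Z",
    "version": "2018-11-01T10:00:00.000Z",
    "loadSpec": { "type": "local", "path": "/path/to/segment/index.zip" },
    "dimensions": "city,sex",
    "metrics": "gmv",
    "shardSpec": { "type": "none" },
    "binaryVersion": 9,
    "size": 1175,
    "identifier": "test-datasource_2018-11-01T00:00:00.000Z_2018-11-02T00:00:00.000Z_2018-11-01T10:00:00.000Z"
}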


Segment data file

First, take a look at factory.json. This file does not store segment data itself. Druid accesses segment files via MMap (a memory-mapped file mechanism), and judging by its contents, factory.json simply tells Druid to read the segment with the MMap-based factory (I'm not very familiar with MMap myself).

# contents of factory.json
{"type": "mMapSegmentFactory"}

The actual segment data is stored in three files: version.bin, meta.smoosh and xxxxx.smoosh. Let's look at each of them in turn.
version.bin is a 4-byte binary file holding the internal version number of the segment format (the format has evolved along with Druid; it is currently V9). Opening the file in Sublime (or any hex viewer) shows:

0000 0009 

meta.smoosh stores metadata about the other smoosh files (xxxxx.smoosh): for each column it records which file the column lives in and its offsets within that file. Besides the column information, the smoosh file also contains index.drd and metadata.drd, which hold additional segment-level metadata.

# version, maximum size a single smoosh file can hold (2 GB), number of smoosh files
v1,2147483647,1
# column name, smoosh file number, start offset, end offset
__time,0,0,154
city,0,306,577
gmv,0,154,306
index.drd,0,841,956
metadata.drd,0,956,1175
sex,0,577,841

Before looking at the 00000.smoosh file, let's first ask why it is named this way. To minimize the number of open file handles, Druid stores all column data of a segment in a single smoosh file, the xxxxx.smoosh file. However, because Druid reads segment files with MMap, and Java's MappedByteBuffer limits each mapped file to 2 GB, Druid starts writing to the next smoosh file once the current one exceeds 2 GB. That is where the numeric file names come from, and it is also why meta.smoosh has to record which file each column lives in.
The offsets in meta.smoosh also show that the data in 00000.smoosh is laid out column by column, with the time column, metric columns and dimension columns stored from top to bottom. Each column consists of two parts: a ColumnDescriptor and the binary data. The ColumnDescriptor is a Jackson-serialized object holding column metadata such as the data type and whether the column is multi-valued; the binary part is the column data, compressed according to its type.

^@^@^@d{
     
     "valueType":"LONG","hasMultipleValues":false,"parts":[{
     
     "type":"long","byteOrder":"LITTLE_ENDIAN"}]}^B^@^@^@^C^@^@ ^@^A^A^@^@^@^@"^@^@^@^A^@^@^@^Z^@^@^@^@¢yL½Ìf^A^@^@<8c>X^H^@<80>¬^WÀÌf^A^@^@^@^@^@d{"valueType":"LONG","hasMultipleValues":false,"parts":[{"type":"long","byteOrder":"LITTLE_ENDIAN"}]}^B^@^@^@^C^@^@ ^@^A^A^@^@^@^@ ^@^@^@^A^@^@^@^X^@^@^@^@1 N^@^A^@"PÃ^H^@<80>^P'^@^@^@^@^@^@^@^@^@<9a>{
     
     "valueType":"STRING","hasMultipleValues":false,"parts":[{
     
     "type":"stringDictionary","bitmapSerdeFactory":{
     
     "type":"concise"},"byteOrder":"LITTLE_ENDIAN"}]}^B^@^@^@^@^A^A^@^@^@#^@^@^@^B^@^@^@^K^@^@^@^W^@^@^@^@beijing^@^@^@^@shanghai^B^A^@^@^@^C^@^A^@^@^A^A^@^@^@^@^P^@^@^@^A^@^@^@^H^@^@^@^@0^@^@^A^A^@^@^@^@^\^@^@^@^B^@^@^@^H^@^@^@^P^@^@^@^@<80>^@^@^C^@^@^@^@<80>^@^@^D^@^@^@<9a>{
     
     "valueType":"STRING","hasMultipleValues":false,"parts":[{
     
     "type":"stringDictionary","bitmapSerdeFactory":{
     
     "type":"concise"},"byteOrder":"LITTLE_ENDIAN"}]}^B^@^@^@^@^A^A^@^@^@^\^@^@^@^B^@^@^@^G^@^@^@^P^@^@^@^@man^@^@^@^@woman^B^A^@^@^@^C^@^A^@^@^A^A^@^@^@^@^P^@^@^@^A^@^@^@^H^@^@^@^@0^@^A^@^A^@^@^@^@^\^@^@^@^B^@^@^@^H^@^@^@^P^@^@^@^@<80>^@^@^E^@^@^@^@<80>^@^@^B^A^@^@^@^@&^@^@^@^C^@^@^@^G^@^@^@^O^@^@^@^V^@^@^@^@gmv^@^@^@^@city^@^@^@^@sex^A^A^@^@^@^[^@^@^@^B^@^@^@^H^@^@^@^O^@^@^@^@city^@^@^@^@sex^@^@^AfÌ<91>Ð^@^@^@^AfѸ,^@^@^@^@^R{
     
     "type":"concise"}{
     
     "container":{},"aggregators":[{
     
     "type":"longSum","name":"gmv","fieldName":"gmv","expression":null}],"timestampSpec":{
     
     "column":"time","format":"auto","missingValue":null},"queryGranularity":{
     
     "type":"none"},"rollup":true}

The binary data in the smoosh file is LZ4-compressed or stored as compressed bitmaps, so the original values cannot be read directly.

The end of the smoosh file contains two more sections, index.drd and metadata.drd. index.drd records which metrics and dimensions the segment contains, its time range, and which bitmap implementation it uses; metadata.drd stores the metric aggregator functions, query granularity, timestamp spec and so on (the last part of the dump above).
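For readability, the metadata.drd content embedded at the end of the dump above (the aggregators, timestamp spec, query granularity and rollup flag) corresponds to the following JSON:

{
    "container": {},
    "aggregators": [
        { "type": "longSum", "name": "gmv", "fieldName": "gmv", "expression": null }
    ],
    "timestampSpec": { "column": "time", "format": "auto", "missingValue": null },
    "queryGranularity": { "type": "none" },
    "rollup": true
}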
The figure below shows the physical storage structure; the rightmost part shows the data before compression and encoding.

Segment physical storage

Segment creation

Segments are created on the MiddleManager node; while they are on the MiddleManager they are mutable and uncommitted (once a segment has been pushed to Deep Storage, its data can no longer change).
A segment goes through the following steps from being created on the MiddleManager to being served by a Historical:

  1. The MiddleManager creates the segment file and publishes it to Deep Storage.
  2. The segment's metadata is stored in the metadata store (MetaStore).
  3. The Coordinator learns about the segment from the metadata store and, according to the configured rules, assigns it to Historical nodes that meet the conditions.
  4. On receiving the Coordinator's instruction, the Historical node pulls the segment file from Deep Storage and announces via Zookeeper that it now serves queries for that segment.
  5. Once the MiddleManager learns that a Historical is serving the segment, it discards its local copy and announces to the cluster that it no longer serves queries for that segment.

How to configure partitions

The time interval covered by each segment is set with segmentGranularity in granularitySpec ( http://druid.io/docs/latest/ingestion/ingestion-spec.html#granularityspec ). To keep queries efficient, the recommended size for each segment file is roughly 300 MB to 700 MB. If segments fall outside this range, you can adjust the time interval or use secondary partitioning (configure targetPartitionSize in partitionsSpec; the official guideline is on the order of 5 million rows per partition; http://druid.io/docs/latest/ingestion/hadoop.html#partitioning-specification ).
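For example, the granularitySpec of an ingestion spec might look like this (the interval and granularities are only illustrative); if the resulting segments are still too large, add a partitionsSpec with a targetPartitionSize as shown in the sketch earlier in this article:

"granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "queryGranularity": "NONE",
    "intervals": ["2018-11-01/2018-11-02"]
}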

Detailed explanation of system architecture

We know that Druid has five types of nodes: Overlord, MiddleManager, Coordinator, Historical and Broker.


Druid architecture

The Overlord and MiddleManager are mainly responsible for data ingestion (for unpublished segments, the MiddleManager also serves queries); the Coordinator and Historical are mainly responsible for serving historical data; the Broker receives client queries, splits them into subqueries for the MiddleManager and Historical nodes, and merges the results before returning them to the client. The Overlord is the master of the MiddleManagers, and the Coordinator is the master of the Historicals.

Indexing service

Druid provides a set of components that make up the Indexing Service, namely the Overlord and MiddleManager nodes. The indexing service is a highly available, distributed service for running indexing-related tasks, and it is the main way segments are created and destroyed during data ingestion (there used to be a real-time node approach as well, but it is now deprecated). The indexing service can ingest external data in either pull or push mode.
The indexing service uses a master-slave architecture, with the Overlord as the master and the MiddleManagers as slaves. The architecture is shown below:

Indexing service

The indexing service consists of three components: Peons, which execute tasks; MiddleManagers, which manage Peons; and the Overlord, which assigns tasks to MiddleManagers. The MiddleManager and the Overlord can be deployed on the same node or on different nodes, but a Peon always runs on the same node as its MiddleManager.
The indexing service architecture is very similar to Yarn's:

  • The Overlord node is equivalent to Yarn's ResourceManager, responsible for cluster resource management and task assignment.
  • The MiddleManager node is equivalent to Yarn's NodeManager, responsible for accepting tasks and managing the resources of its own node.
  • Peons are equivalent to Yarn's Containers and execute the actual tasks on a node.

Overlord node

As the master of the indexing service, the Overlord accepts indexing tasks from outside, then decomposes them internally and hands them down to the MiddleManagers. The Overlord has two operating modes:

  • Local mode (default): in local mode the Overlord not only coordinates tasks but also starts Peons itself to run them.
  • Remote mode: the Overlord and MiddleManagers run on different nodes; the Overlord only coordinates tasks and does not run them itself.

The Overlord provides a web console for viewing, running and terminating tasks:

http://<OVERLORD_IP>:<port>/console.html

The Overlord also exposes a RESTful API, so clients can submit and kill tasks via HTTP POST:

http://<OVERLORD_IP>:<port>/druid/indexer/v1/task                      // submit a task
http://<OVERLORD_IP>:<port>/druid/indexer/v1/task/{task_id}/shutdown   // kill a task

MiddleManager node

The MiddleManager is the worker node that executes tasks. It hands each task to a Peon running in a separate JVM (because resources and logs must be isolated). Each Peon can run only one task at a time.

Peon node

Peons run a single task in a single JVM, and the MiddleManager is responsible for creating Peons for tasks.

Coordinator node

The Coordinator is the master of the Historicals and is mainly responsible for managing and distributing segments: it tells Historicals to load or drop segments, manages segment replicas, and balances segments across Historicals.
The Coordinator runs periodically (the interval is configurable). On each run it obtains the current cluster state via Zookeeper and takes appropriate actions based on that state (such as balancing segments). The Coordinator connects to the metadata database (MetaStore), which stores segment information and rules (Rule). The segment table lists all segments that should be loaded into the cluster; on each run the Coordinator pulls this list and compares it with the segments currently in the cluster, and any segment that is still in the cluster but no longer in the database is removed from the cluster. The rule table defines how segments should be handled: by configuring a set of rules we can tell the cluster which segments to load and which to drop. See http://druid.io/docs/latest/operations/rule-configuration.html for how to configure rules.
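As a sketch of what such rules look like (the period and replica counts are arbitrary), a common setup loads only recent data and drops everything older; rules are evaluated in order and the first matching rule wins:

[
    { "type": "loadByPeriod", "period": "P1M", "tieredReplicants": { "_default_tier": 2 } },
    { "type": "dropForever" }
]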

Before Historical nodes load segments, they are sorted by capacity: the Historical holding the fewest segments gets the highest priority for loading. The Coordinator does not talk to Historicals directly; instead it places the segment information into a queue, and the Historical picks the segment description up from the queue and loads the segment onto itself.
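The balancing behaviour itself can be tuned through the Coordinator's dynamic configuration (submitted as JSON to the Coordinator's /druid/coordinator/v1/config endpoint); a partial sketch with illustrative values:

{
    "maxSegmentsToMove": 5,
    "replicationThrottleLimit": 10
}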
The Coordinator provides a web UI that displays cluster information and rule configuration:

http://<COORDINATOR_IP>:<COORDINATOR_PORT>

Historical node

The Historical node is responsible for managing historical segments. It watches a designated Zookeeper path to discover whether there are new segments it needs to load (the Coordinator decides which Historical gets a segment via its assignment algorithm).
As described in the Coordinator section above, when a new segment needs to be loaded, the Coordinator puts it in a queue. When a Historical notices a new segment assigned to it, it first checks its local cache and disk for the segment; if it is not there, the Historical fetches the segment's metadata from Zookeeper and then downloads the segment.


Historical loading a Segment

Broker

The Broker node is responsible for routing client queries. Through Zookeeper the Broker knows which segments live on which nodes and forwards sub-queries to the corresponding nodes; once all nodes have returned their data, the Broker merges the results and returns them to the client.
The Broker keeps an LRU cache of per-segment results. The cache can be local or an external caching system such as memcached (an external cache allows segment results to be shared among all Brokers). When the Broker receives a query, it first checks the cache for each segment; for segments that are not cached, it forwards the request to the Historical nodes.

broker query

The Broker does not cache real-time data, because real-time results are still changing and therefore unreliable.
