Comparison Test of Time Series Databases DolphinDB and Druid

Both DolphinDB and Druid are distributed analytical time series databases. Although the former is written in C++ and the latter in Java, the two have much in common in architecture, functionality, and application scenarios. This report compares their performance in SQL queries, data import, and disk space usage.

The test data set uses approximately 300GB of US stock market transaction and quotation data. Through testing, we found:

  • The data writing speed of DolphinDB is approximately 30 times that of Druid.
  • The query speed of DolphinDB is about 10 times that of Druid.
  • The static disk usage of DolphinDB is about 80% higher than that of Druid, while its total disk usage at runtime is slightly lower than Druid's.

1. System Introduction

DolphinDB is an analytical distributed time series database written in C++, with a built-in streaming data processing engine, a parallel computing engine, and distributed computing capabilities. It has a built-in distributed file system and supports both horizontal and vertical cluster expansion. It provides a SQL- and Python-like scripting language, so users can not only query data with SQL but also carry out more complex in-memory computations. It also offers APIs in other commonly used programming languages to ease integration with existing applications. DolphinDB can quickly process trillions of rows, and performs excellently in historical data analysis and modeling and real-time stream processing in finance, as well as in massive sensor data processing and real-time analytics in the Internet of Things.

Druid is an OLAP data warehouse implemented in Java, suitable for low-latency queries and inserts over trillions of rows and for real-time streaming data analysis. Druid adopts key technologies such as a distributed shared-nothing (SN) architecture, columnar storage, inverted indexes, and bitmap indexes, and is characterized by high availability and high scalability. Druid also provides interfaces in multiple languages and partially supports SQL.

2. System configuration

2.1 Hardware configuration

The hardware configuration of this test is as follows:

Equipment: DELL OptiPlex 7060

CPU: Intel(R) Core™ i7-8700 CPU @ 3.20GHz, 6 cores and 12 threads

Memory: 32GB

Hard disk: 256GB SSD, 1.8TB Seagate ST2000DM008-2FR102 mechanical hard disk

Operating system: Ubuntu 16.04 x64

2.2 Environment configuration

The test environment is a multi-node cluster on a single server. DolphinDB is configured with 4 data nodes, each with a maximum of 4GB of available memory. Druid is configured with 5 nodes: overlord, broker, historical, coordinator, and middleManager. Druid caches query results by default, which would distort a test that averages over repeated queries, so the query cache was turned off. So as not to skew Druid's write performance test, Druid's roll-up feature was also disabled. All other settings use the defaults.

The original csv file is stored on the HDD. The database is stored on the SSD.

3. Test Data Set

This test uses the US stock market Level 1 TAQ dataset for August 2007. The TAQ dataset is split by day into 23 csv files, each between 7.8G and 19.1G in size, about 290G in total, with 6,561,693,704 rows.

The data types of each field in the test data set TAQ in DolphinDB and Druid are as follows:

[Table: data types of the TAQ fields in DolphinDB and Druid]

In Druid, the DATE field is designated as a timestamp column. Other fields are used as dimension fields.

4. Data partition scheme

In DolphinDB, a composite partition on stock code and date is adopted: 128 range partitions by stock code and 23 partitions by date.

Druid only supports range partitioning on time, so we designate the DATE column as the timestamp, with day granularity, yielding 23 partitions.
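The effect of the composite scheme can be sketched with a toy Python model (this is not DolphinDB code; 4 symbol buckets stand in for the real 128 range partitions, and `partition_of` is a hypothetical mapping). A query filtered on one symbol only needs to touch one bucket per day:

```python
import bisect

# Hypothetical symbol range boundaries; DolphinDB's cutPoints would derive
# 128 of these from the data, we use 4 buckets here for brevity.
boundaries = ["A", "G", "N", "T"]                    # bucket i covers [boundaries[i], next)
dates = [f"2007.08.{d:02d}" for d in range(1, 24)]   # 23 trading days

def partition_of(symbol, date):
    """Map a record to its (date, symbol-bucket) composite partition."""
    bucket = bisect.bisect_right(boundaries, symbol) - 1
    return (date, bucket)

# A query filtered on one symbol scans only one bucket per date:
touched = {partition_of("IBM", d) for d in dates}
print(len(touched))  # 23 partitions, instead of all 4 * 23 = 92
```

With time-only partitioning, every daily partition would have to be scanned for a symbol filter; the composite scheme prunes the other symbol buckets up front.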

5. Comparison test

We compared DolphinDB and Druid in terms of database query performance, I/O performance, and disk footprint.

5.1 Database query performance

The DolphinDB scripting language supports SQL syntax and extends it with functions for time series data. Druid provides a JSON-based query language and also offers dsql for SQL queries. This test uses Druid's bundled dsql.

We ran several common SQL queries on the TAQ dataset. To reduce the impact of chance factors, each query was executed 10 times and the total time averaged; times are reported in milliseconds. For DolphinDB, we used the timer statement to measure the execution time of each SQL statement on the server. Since Druid provides no tool or function that reports query time, the execution time printed by its command-line client dsql was used instead. Compared with DolphinDB's figure, the time returned by Druid therefore also includes transmitting and displaying the query results. Because each query returns very little data and dsql runs on the same node as the Druid server, this overhead is only about 1 ms, which does not affect our conclusions, so no correction was applied.
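The averaging procedure can be sketched as follows (`query_once` is a hypothetical stand-in for a DolphinDB `timer`-wrapped statement or a dsql invocation; this is a sketch of the methodology, not the actual harness):

```python
import time

def average_query_ms(query_once, runs=10):
    """Run a query `runs` times and return the mean wall-clock time in ms."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        query_once()                           # stand-in for the real query
        total += time.perf_counter() - start
    return total / runs * 1000.0

# Dummy workload standing in for a real SQL query:
print(average_query_ms(lambda: sum(range(10000))))
```

Averaging over repeated runs only gives a fair comparison if result caching is disabled, which is why the Druid query cache was turned off in Section 2.2.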

The SQL representation of the 7 queries is shown in the following table.

[Table: the 7 test queries expressed in SQL]

The test results are shown in the table below.

[Table: query times for DolphinDB and Druid (ms)]

The results show that DolphinDB outperforms Druid on almost all queries, at roughly 3 to 30 times Druid's speed.

While Druid only allows segmentation by timestamp, DolphinDB allows data to be partitioned along multiple dimensions. Because the TAQ table is partitioned on both date and stock code, DolphinDB's advantage is more pronounced in queries that filter or group by stock code (tests 1, 3, 6, and 7).

5.2 I/O performance test

We tested the performance of DolphinDB and Druid when importing a single file (7.8G) and multiple files (290.8G in total). For a fair comparison, Druid's roll-up feature was disabled. The test results are shown in the table below; times are in seconds.

[Table: import times for a single file and for all files (s)]

Under the same conditions, Druid's import time for a single file is more than 16 times that of DolphinDB. When importing multiple files, DolphinDB is faster still because it supports parallel import. See Appendix 2 for the data import scripts.
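DolphinDB's parallel import (one submitJob per csv file, as in Appendix 2) follows a pattern that can be sketched in Python with a worker pool; `load_file` is a hypothetical stand-in for loadTextEx on one file, not a real API:

```python
from concurrent.futures import ThreadPoolExecutor

def load_file(path):
    """Hypothetical stand-in for loading one csv file into the database."""
    return f"loaded {path}"

def parallel_import(paths, workers=4):
    # Each file becomes an independent job, mirroring one submitJob per csv.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_file, paths))

files = [f"TAQ200708{d:02d}.csv" for d in (1, 2, 3)]
print(parallel_import(files))
```

Because the 23 daily files are independent, their load jobs can proceed concurrently on the 4 data nodes, which is where the multi-file speedup over Druid's sequential ingestion comes from.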

5.3 Disk space test

After importing data into DolphinDB and Druid, we compared the data compression rate of the two. The test results are shown in the table below.

[Table: disk space occupied after import]

DolphinDB uses the LZ4 compression algorithm to quickly compress data stored in columns. Before compression, DolphinDB's SYMBOL type uses dictionary encoding to convert strings into integers. In the process of data storage, Druid uses LZ4 algorithm to directly compress timestamp and metrics, and uses dictionary encoding, bitmap index and roaring bitmap to compress the dimensions field. The use of dictionary encoding can reduce the storage space of strings, bitmap indexes can quickly perform bitwise logical operations, and bitmap index compression further saves storage space.
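Dictionary encoding, which both systems apply to string columns before block compression, can be sketched as follows (a minimal illustration of the idea, not either system's actual storage format):

```python
def dictionary_encode(values):
    """Replace each string with a small integer id; store each string once."""
    dictionary, codes = [], []
    index = {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)   # assign ids in first-seen order
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

symbols = ["IBM", "MSFT", "IBM", "IBM", "MSFT"]
dictionary, codes = dictionary_encode(symbols)
print(dictionary, codes)  # ['IBM', 'MSFT'] [0, 1, 0, 0, 1]
```

The integer codes are far more compressible than the raw strings, and in Druid the per-value code also serves as the key into the bitmap index for that dimension.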

In this test, the disk space occupied by the DolphinDB database is about 80% higher than that of Druid. The main factor behind this difference is that the two floating-point fields BID and OFR compress very differently on the two systems: on DolphinDB they compress to about 20% of their original size, while on Druid they compress to as little as 5%. The reason is that the test data is a historical dataset already sorted by date and stock. A stock's price changes very little over short periods, so the number of unique quotations is limited, and Druid's bitmap compression works very well on such data.

Although Druid achieves a higher compression rate and smaller static disk usage, it generates a segment-cache directory at runtime whose total disk usage can reach 65 GB. DolphinDB requires no additional space at runtime, so its total disk footprint is slightly smaller than Druid's.

6. Summary

DolphinDB's performance advantages over Druid come from many aspects, including (1) differences in the storage and partitioning mechanisms, (2) the implementation language (C++ vs. Java), (3) memory management, and (4) algorithmic implementations (such as sorting and hashing).

In terms of partitioning, Druid supports only range partitioning on time, which is less flexible than DolphinDB's support for value, range, hash, and list partitioning, where each table can also be partitioned on multiple fields. DolphinDB's finer partition granularity makes it less likely for data or queries to concentrate on a single node; queries scan fewer data blocks, respond faster, and perform better.

Beyond performance, DolphinDB is also more feature-rich than Druid. In terms of SQL support, DolphinDB offers a very powerful window function mechanism and more comprehensive join support, with good support for sliding-window functions, asof join, and window join, which are tailored to time series data. DolphinDB integrates a database, a programming language, and distributed computing: in addition to regular database queries, it supports more complex in-memory computing, distributed computing, and stream computing.

DolphinDB and Druid also differ slightly in how they operate. After Druid crashes, or restarts with its segment-cache cleared, it spends considerable time reloading data, decompressing each segment into the segment-cache before it can be queried. This is inefficient, and the cache occupies substantial space, so Druid needs a long wait on restart and requires more disk space.

Appendix

Appendix 1. Environment Configuration

(1) DolphinDB configuration

controller.cfg

localSite=localhost:9919:ctl9919
localExecutors=3
maxConnections=128
maxMemSize = 4
webWorkerNum=4
workerNum=4
dfsReplicationFactor=1
dfsReplicaReliabilityLevel=0
enableDFS=1
enableHTTPS=0

cluster.nodes

localSite,mode
localhost:9910:agent,agent
localhost:9921:DFS_NODE1,datanode
localhost:9922:DFS_NODE2,datanode
localhost:9923:DFS_NODE3,datanode
localhost:9924:DFS_NODE4,datanode

cluster.cfg

maxConnections=128
workerNum=8
localExecutors=7
webWorkerNum=2
maxMemSize = 4

agent.cfg

workerNum=3
localExecutors=2
maxMemSize = 4
localSite=localhost:9910:agent
controllerSite=localhost:9919:ctl9919

(2) Druid configuration

_common

# Zookeeper
druid.zk.service.host=zk.host.ip
druid.zk.paths.base=/druid
# Metadata storage
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://db.example.com:3306/druid
# Deep storage
druid.storage.type=local
druid.storage.storageDirectory=var/druid/segments
# Indexing service logs
druid.indexer.logs.type=file
druid.indexer.logs.directory=var/druid/indexing-logs

broker:

-Xms24g
-Xmx24g
-XX:MaxDirectMemorySize=4096m

# HTTP server threads
druid.broker.http.numConnections=5
druid.server.http.numThreads=25

# Processing threads and buffers
druid.processing.buffer.sizeBytes=2147483648
druid.processing.numThreads=7

# Query cache
druid.broker.cache.useCache=false
druid.broker.cache.populateCache=false

coordinator:
-Xms3g
-Xmx3g

historical:
-Xms8g
-Xmx8g

# HTTP server threads
druid.server.http.numThreads=25

# Processing threads and buffers
druid.processing.buffer.sizeBytes=2147483648
druid.processing.numThreads=7

# Segment storage
druid.segmentCache.locations=[{"path":"var/druid/segment-cache","maxSize":0}]
druid.server.maxSize=130000000000

druid.historical.cache.useCache=false
druid.historical.cache.populateCache=false

middleManager:
-Xms64m
-Xmx64m

# Number of tasks per middleManager
druid.worker.capacity=3

# HTTP server threads
druid.server.http.numThreads=25

# Processing threads and buffers on Peons
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=4147483648
druid.indexer.fork.property.druid.processing.numThreads=2

overlord:

-Xms3g
-Xmx3g

Appendix 2. Data Import Script

DolphinDB script:

if (existsDatabase("dfs://TAQ"))
dropDatabase("dfs://TAQ")

db = database("/Druid/table", SEQ, 4)
t=loadTextEx(db, 'table', ,"/data/data/TAQ/TAQ20070801.csv")
t=select count(*) as ct from t group by symbol
buckets = cutPoints(exec symbol from t, 128)
buckets[size(buckets)-1]=`ZZZZZ
t1=table(buckets as bucket)
t1.saveText("/data/data/TAQ/buckets.txt")

db1 = database("", VALUE, 2007.08.01..2007.09.01)
partition = loadText("/data/data/TAQ/buckets.txt")
partitions = exec * from partition
db2 = database("", RANGE, partitions)
db = database("dfs://TAQ", HIER, [db1, db2])
db.createPartitionedTable(table(100:0, `symbol`date`time`bid`ofr`bidsiz`ofrsiz`mode`ex`mmid, [SYMBOL, DATE, SECOND, DOUBLE, DOUBLE, INT, INT, INT, CHAR, SYMBOL]), `quotes, `date`symbol)

def loadJob() {
	filenames = exec filename from files('/data/data/TAQ')
	db = database("dfs://TAQ")
	filedir = '/data/data/TAQ'
	for(fname in filenames){
		jobId = fname.strReplace(".csv", "")
		jobName = jobId
		submitJob(jobId, jobName, loadTextEx{db, "quotes", `date`symbol, filedir+'/'+fname})
	}
}
loadJob()
select * from getRecentJobs()
TAQ = loadTable("dfs://TAQ","quotes");

Druid script:

{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "TAQ",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "csv",
          "dimensionsSpec" : {
            "dimensions" : [
              "TIME",
              "SYMBOL",
              {"name" : "BID", "type" : "double"},
              {"name" : "OFR", "type" : "double"},
              {"name" : "BIDSIZ", "type" : "int"},
              {"name" : "OFRSIZ", "type" : "int"},
              "MODE",
              "EX",
              "MMID"
            ]
          },
          "timestampSpec" : {
            "column" : "DATE",
            "format" : "yyyyMMdd"
          },
          "columns" : ["SYMBOL", "DATE", "TIME", "BID", "OFR", "BIDSIZ", "OFRSIZ", "MODE", "EX", "MMID"]
        }
      },
      "metricsSpec" : [],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2007-08-01/2007-09-01"],
        "rollup" : false
      }
    },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "local",
        "baseDir" : "/data/data/",
        "filter" : "TAQ.csv"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index",
      "targetPartitionSize" : 5000000,
      "maxRowsInMemory" : 25000,
      "forceExtendableShardSpecs" : true
    }
  }
}


Origin blog.51cto.com/15022783/2577739