Performance comparison of DolphinDB and Elasticsearch on financial data sets

Elasticsearch is a very popular log retrieval and analysis tool, particularly strong in real-time performance, scalability, ease of use, and full-text search. A Zhihu article, Golion: "Dimensionality Reduction! Use ElasticSearch as a time series database", reported very good results with this approach, and many Zhihu users could not help asking whether Elasticsearch can also be used to store and analyze massive amounts of financial data.

For this reason, we ran comprehensive comparative tests of DolphinDB and Elasticsearch on financial data sets of different sizes. The tests cover four areas: I/O, disk space usage, memory consumption, and database queries (filter queries and group statistics). The results hold no surprises: DolphinDB, a time-series database that stands out in financial data processing, beats Elasticsearch.

  • In group statistics (aggregate calculations), DolphinDB outperforms Elasticsearch by about 10 times, and the advantage becomes more pronounced as the data set grows. DolphinDB's performance is particularly prominent when the test case involves a time-type field.
  • In simple filter queries, DolphinDB performs 100 times faster than Elasticsearch.
  • For data import, Elasticsearch takes 25 to 75 times as long as DolphinDB, and the gap widens as the data set grows.
  • For disk space, DolphinDB compresses the raw data, while Elasticsearch occupies more disk space than the raw data in order to maintain document indexes and other information (excluding temporary data). The overall gap is about 10 times.

1. System Introduction

1.1 Introduction to DolphinDB

DolphinDB is an analytical distributed time-series database with columnar storage, a built-in streaming data processing engine, a parallel and distributed computing engine, and a distributed file system that supports cluster expansion. DolphinDB is written in C++ and responds extremely fast. It provides a SQL- and Python-like scripting language to manipulate data, as well as APIs in other commonly used programming languages to ease integration with existing applications. It performs well in scenarios such as historical data analysis, modeling and real-time streaming data processing in the financial field, and massive sensor data processing and real-time analysis in the IoT field.
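As a small illustration of the API access mentioned above, the sketch below uses DolphinDB's Python API to connect to a node and run a SQL-style script. The host, port, and credentials are placeholders for this test environment, not values taken from the article.

```python
# A minimal sketch, assuming the official DolphinDB Python API ("pip install dolphindb")
# and a data node listening on localhost:8848 with default credentials (placeholders).
import dolphindb as ddb

s = ddb.session()
s.connect("localhost", 8848, "admin", "123456")

# Run a short SQL-style DolphinDB script; the result comes back as a pandas DataFrame.
df = s.run("""
    t = table(2020.01.01 + 0..4 as tradeDate, rand(100.0, 5) as close)
    select avg(close) from t
""")
print(df)
```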

1.2 Introduction to Elasticsearch

Elasticsearch is a Lucene-based search server: a document-oriented distributed system that stores data on local disks. Its concepts map to those of a traditional relational database as follows:

Relational DB => Databases => Tables => Rows => Columns

Elasticsearch => Indices => Types => Documents => Fields

An Elasticsearch cluster can contain multiple indices (Indices), corresponding to databases in DolphinDB; each index can contain multiple types (Types), corresponding to tables in DolphinDB; each type contains multiple documents (Documents), corresponding to rows in DolphinDB; and each document contains multiple fields (Fields), corresponding to columns in DolphinDB.
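To make this mapping concrete, the sketch below creates an Elasticsearch index whose fields mirror a daily quote table, using the official Python client. The index name and field list are illustrative rather than the exact schema used in the test, and a 7.x or later cluster (where mapping types are no longer used) is assumed.

```python
# A minimal sketch, assuming the official elasticsearch Python client and an
# Elasticsearch 7.x+ cluster on localhost:9200; index and field names are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The index plays the role of a DolphinDB database/table; each document is a row,
# and each field corresponds to a column.
es.indices.create(
    index="cn_stock",
    body={
        "settings": {"number_of_shards": 4, "number_of_replicas": 1},
        "mappings": {
            "properties": {
                "symbol": {"type": "keyword"},
                "trade_date": {"type": "date"},
                "open": {"type": "double"},
                "close": {"type": "double"},
                "volume": {"type": "long"},
            }
        },
    },
)
```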

2. System configuration

2.1 Hardware configuration

The hardware configuration of this test is as follows:

Equipment: DELL OptiPlex 7060

CPU: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, 6 cores and 12 threads

Memory: 32GB

Hard disk: 2TB mechanical hard disk

Operating system: Ubuntu 16.04 x64

2.2 Environment configuration

The test environment is a multi-node cluster on a single server. To maximize the performance of both systems in a stand-alone environment, the node parameters of DolphinDB and Elasticsearch need to be tuned. DolphinDB is configured with 4 data nodes, each with a maximum of 7 GB of available memory. Elasticsearch is configured with 4 nodes. Since Elasticsearch is based on Lucene, a certain amount of memory must be reserved for Lucene segments to be loaded into memory, which also has a great impact on Elasticsearch's performance. In this test, 8 GB of memory is reserved for Lucene, the maximum available memory of a single Elasticsearch node is set to 6 GB, and swapping is disabled.

3. Test Data Set

To test the performance of DolphinDB and Elasticsearch more comprehensively, we used three stock data sets of different sizes. The table CN_Stock contains daily quote data for China's Shanghai and Shenzhen stocks from 2008.01.01 to 2017.12.31. The table US_Prices contains daily quote data for the US stock market from 1990.01.02 to 2016.12.30. The table TAQ contains four days of US stock market Level 1 high-frequency data from August 2007, totaling 60.6 GB. An overview of the test data sets is shown in the following table:

[Image: overview of the test data sets]

The data types of each field in the test data set in DolphinDB and Elasticsearch are as follows:

(1) CN_Stock table data type mapping

[Image: CN_Stock field type mapping]

(2) US_Prices table data type mapping

[Image: US_Prices field type mapping]

(3) TAQ table data type mapping

[Image: TAQ field type mapping]

4. Partition/sharding scheme

The DolphinDB database provides flexible partitioning mechanisms, including value partitioning, range partitioning, list partitioning, hash partitioning, and composite partitioning, while Elasticsearch only supports hash-based sharding.

In DolphinDB, the table CN_Stock is partitioned by time into 20 semi-annual partitions; the table US_Prices is partitioned by time into 27 annual partitions; and the table TAQ uses a composite partition on date and stock symbol, with 100 partitions in total. The number of replicas is set to 1.

Elasticsearch only allows the number of shards to be specified. For the tables CN_Stock and US_Prices, the number of shards is set to 4; for the table TAQ, it is set to 100. The number of replicas is set to 1.
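A sketch of how the two schemes might be declared is shown below. The DolphinDB side is sent as a script through the Python API and combines a VALUE partition on date with a HASH partition on symbol; the Elasticsearch side can only fix the shard and replica counts at index creation. Paths, names, column lists, and partition boundaries are illustrative, not the exact appendix scripts.

```python
# A minimal sketch, not the exact appendix scripts: a DolphinDB composite partition
# (date VALUE + symbol HASH) versus an Elasticsearch index with a fixed shard count.
import dolphindb as ddb
from elasticsearch import Elasticsearch

s = ddb.session()
s.connect("localhost", 8848, "admin", "123456")   # placeholder credentials
s.run("""
    dbDate = database("", VALUE, 2007.08.01..2007.08.04)    // one partition per trading day
    dbSym  = database("", HASH, [SYMBOL, 25])               // 4 days x 25 buckets = 100 partitions
    db     = database("dfs://TAQ", COMPO, [dbDate, dbSym])
    schema = table(1:0, `symbol`date`time`bid`ofr`bidsiz`ofrsiz,
                   [SYMBOL, DATE, SECOND, DOUBLE, DOUBLE, INT, INT])
    db.createPartitionedTable(schema, `quotes, `date`symbol)
""")

# Elasticsearch: only the shard/replica counts can be chosen; routing is hash-based.
es = Elasticsearch("http://localhost:9200")
es.indices.create(index="taq",
                  body={"settings": {"number_of_shards": 100, "number_of_replicas": 1}})
```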

5. Comparison test

We compared DolphinDB and Elasticsearch in terms of database query performance, I/O performance, disk footprint, and memory consumption.

5.1 Database query performance test

The DolphinDB scripting language supports SQL syntax and extends it with even more powerful features. In Elasticsearch, a plug-in must be installed to run SQL queries; it also provides a DSL (Domain Specific Language) based on the JSON data format for querying, and this test uses the DSL.

Elasticsearch's main application scenario is search engines, and it supports fuzzy queries. For ordinary queries, Elasticsearch returns only 10 hits by default; for aggregate queries, the default number of buckets returned is also 10. DolphinDB always returns the complete result set of each query and has no fuzzy query.

In Elasticsearch's aggregation results, the fields doc_count_error_upper_bound and sum_other_doc_count represent, respectively, aggregation results that may exist but were not returned, and the number of documents not counted in the aggregation. This illustrates that Elasticsearch's default query behavior is an approximate query over part of the data, rather than an exact query over all records in the database. To compare the two systems fairly, this approximate behavior must be turned off: we use Elasticsearch's scroll interface and set the bucket size so that Elasticsearch returns all query results.
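The sketch below shows the exhaustive retrieval described above for ordinary queries, assuming the official Elasticsearch Python client: a search opens a scroll context and pages through every hit instead of stopping at the default 10 (for aggregations, the bucket size parameter is raised analogously). The index name and query are placeholders.

```python
# A minimal sketch of exhaustive (non-truncated) retrieval with the scroll API,
# assuming the official elasticsearch Python client; index/query are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="cn_stock",
    body={"query": {"term": {"symbol": "600000.SH"}}},
    scroll="2m",          # keep the scroll context alive between pages
    size=10000,           # page size instead of the default 10 hits
)
hits = resp["hits"]["hits"]
scroll_id = resp["_scroll_id"]

while True:
    page = es.scroll(scroll_id=scroll_id, scroll="2m")
    if not page["hits"]["hits"]:
        break
    hits.extend(page["hits"]["hits"])
    scroll_id = page["_scroll_id"]

es.clear_scroll(scroll_id=scroll_id)  # release server-side resources
print(len(hits), "documents retrieved")
```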

In this test, DolphinDB scripts were used to complete the DolphinDB query performance test, and a Python script plus DSL was used to complete the Elasticsearch query performance test.

We ran several common SQL queries on the three data tables. To reduce the influence of chance factors on the results, each query was executed 10 times and the total time was averaged; all times are in milliseconds (a sketch of the timing loop on the Elasticsearch side follows). The test scripts and results for each data set are shown in the following tables.
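The timing sketch below assumes the official Python client; the index name and DSL body are placeholders.

```python
# A minimal sketch of the timing methodology: run each query 10 times and report the
# average in milliseconds. The client setup and the DSL body are placeholders.
import time
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def avg_query_ms(index, dsl_body, runs=10):
    elapsed = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        es.search(index=index, body=dsl_body)
        elapsed += time.perf_counter() - start
    return elapsed / runs * 1000  # milliseconds

print(avg_query_ms("cn_stock", {"size": 0, "query": {"match_all": {}}}))
```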

(1) CN_Stock table

The query script in DolphinDB:

[Image: DolphinDB query script for the CN_Stock table]

Query performance test results (data volume: 5,332,932):

[Image: CN_Stock query performance results]

(2) US_Prices table

The query script in DolphinDB:

[Image: DolphinDB query script for the US_Prices table]

Query performance test results (data volume: 50,591,907):

[Image: US_Prices query performance results]

(3) TAQ table

The query script in DolphinDB:

[Image: DolphinDB query script for the TAQ table]

Query performance test results (data volume: 1,366,036,384):

[Image: TAQ query performance results]

For this query performance test, we can draw the following conclusions:

(1) In all tests on the same table, DolphinDB is many times faster than Elasticsearch. In particular, for simple filter queries, DolphinDB's performance is 1 to 2 orders of magnitude better than Elasticsearch's (see queries 1-4 in the CN_Stock results and queries 1-4 in the US_Prices results).

(2) For aggregation queries, DolphinDB also outperforms Elasticsearch, by 8-9 times on average. In particular, for aggregate queries grouped by time, DolphinDB is 13-15 times faster than Elasticsearch (see queries 5-10 in the CN_Stock results and queries 5-10 in the US_Prices results).

(3) Comparing the same type of query across different data scales, the time taken by Elasticsearch's exact queries grows much faster than DolphinDB's as the data scale increases, and DolphinDB remains remarkably stable across data scales, which Elasticsearch does not.

5.2 I/O performance test

Elasticsearch provides the _bulk API to write data in batches. When creating a new document, you first need to describe the attributes and data types (such as keyword, text, integer, or date) of each field the document may contain, and whether each field needs to be indexed or stored by Lucene. Elasticsearch then builds the corresponding mapping for these attributes, creates the inverted index that forms the Lucene segments, and finally persists the inverted index to disk through the refresh and flush mechanisms. Flushing the in-memory inverted index to disk is the key factor that determines Elasticsearch's write performance. It is worth noting that, in addition to importing data in batches with the _bulk API, you can also set index.refresh_interval = -1 and index.number_of_replicas = 0 to optimize the import. However, for large-scale imports, refresh is still triggered when the in-memory buffer fills up, followed by a flush that writes the data to disk, so the effect of this optimization is limited. Slow data import is a very significant disadvantage of Elasticsearch.
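The sketch below illustrates this import path with the official Python client: replicas and refresh are disabled before loading, documents are streamed through the _bulk helper, and normal settings are restored afterwards. The file name, field handling, and chunk size are placeholders, and the target index is assumed to already exist.

```python
# A minimal sketch of a _bulk import with refresh/replicas disabled during loading,
# assuming the official elasticsearch Python client; file/field names are placeholders.
import csv
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# Import-time optimizations mentioned above
es.indices.put_settings(index="cn_stock",
                        body={"index": {"refresh_interval": "-1", "number_of_replicas": 0}})

def doc_stream(path):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield {"_index": "cn_stock", "_source": row}

helpers.bulk(es, doc_stream("CN_Stock.csv"), chunk_size=5000)

# Restore normal settings and make the data searchable
es.indices.put_settings(index="cn_stock",
                        body={"index": {"refresh_interval": "1s", "number_of_replicas": 1}})
es.indices.refresh(index="cn_stock")
```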

When a distributed table is created in DolphinDB and data is written to it, the table's partition scheme first determines which data node each partition's data is written to. Within a partition, data is organized in columnar form, and operations such as data import and query are carried out cooperatively among the nodes. Data import is fast and performance is extremely high.
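For comparison, a DolphinDB import can be expressed as a single loadTextEx call that parses a text file and writes it directly into the partitioned table. The sketch below sends such a script through the Python API; the file path and partition columns are placeholders, and the actual import scripts are in the appendix.

```python
# A minimal sketch of a DolphinDB import into a partitioned DFS table via loadTextEx,
# sent through the Python API; the path and partition columns are placeholders.
import dolphindb as ddb

s = ddb.session()
s.connect("localhost", 8848, "admin", "123456")   # placeholder credentials
s.run("""
    db = database("dfs://TAQ")                                    // existing partitioned database
    loadTextEx(db, `quotes, `date`symbol, "/data/TAQ_20070801.csv")
""")
```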

The following table shows the I/O performance results of the two systems' data imports. It can be clearly observed that the ES/DDB load-time ratio grows as the data volume increases; in particular, at 60.6 GB the Elasticsearch import took more than 12 hours. See the appendix for the data import scripts.

[Image: data import (I/O) performance results]

5.3 Disk space test

Elasticsearch is known for its search efficiency and timeliness. As a distributed search engine based on Lucene, it compresses the content of the source field, but internally it builds an inverted index for every document it creates and stores that index on disk. Because additional index information is stored for each document, more storage space is required. DolphinDB needs no such additional index information and truly stores the raw data in compressed form. The test results are shown in the table below.

[Image: disk space usage results]

5.4 Memory usage

To observe the memory usage of DolphinDB and Elasticsearch more comprehensively, we used the Linux command htop to monitor both (total memory: 32 GB). The results are as follows:

[Image: memory usage results]

5.5 Comparison in other aspects

(1) Elasticsearch supports SQL only through a plug-in, and its built-in DSL is JSON-based with a verbose syntax. DolphinDB has a complete built-in scripting language that supports not only SQL but also imperative, vectorized, functional, metaprogramming, RPC, and other programming paradigms, making it easy to implement richer functionality.

(2) Elasticsearch's main purpose is to provide a distributed, multi-user full-text search engine. It supports fuzzy queries, documents (rows) do not need a fixed structure, and different documents can have different field sets. DolphinDB only supports structured data.

(3) DolphinDB provides more than 600 built-in functions to meet the needs of scenarios such as historical data modeling and real-time streaming data processing in the financial field, and real-time monitoring and real-time data analysis in the IoT field. It provides the functions required for time-series processing, such as lead, lag, cumulative window, and sliding window indicators, all of which are highly optimized and perform excellently. DolphinDB therefore covers more application scenarios than Elasticsearch.

(4) When used as a time-series database, Elasticsearch does not support table joins. DolphinDB not only supports table joins, but also optimizes non-aligned join methods such as asof join and window join (see the sketch after this list).

(5) DolphinDB supports distributed transactions for data writing, while Elasticsearch does not support transactions.
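As an illustration of the asof join mentioned in point (4), the sketch below matches each trade with the most recent quote at or before its timestamp using DolphinDB's aj function, sent through the Python API; the data is made up for the example.

```python
# A minimal sketch of DolphinDB's asof join (aj): each trade row is matched with the
# most recent quote at or before its time. The in-memory tables are made up.
import dolphindb as ddb

s = ddb.session()
s.connect("localhost", 8848, "admin", "123456")   # placeholder credentials
print(s.run("""
    trades = table(`A`A`B as sym, 09:30:01 09:30:05 09:30:03 as time, 10.1 10.2 20.5 as price)
    quotes = table(`A`A`B as sym, 09:30:00 09:30:04 09:30:02 as time, 10.0 10.15 20.4 as bid)
    select sym, time, price, bid from aj(trades, quotes, `sym`time)
"""))
```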

6. Summary

Elasticsearch supports structured and unstructured data, fuzzy and exact queries, and aggregate calculations, and is suitable for many application scenarios. But compared with a professional time-series database such as DolphinDB, it shows a large gap in both functionality and performance. In particular, when the data volume grows rapidly and exceeds the limit of physical memory, its high memory consumption and large disk footprint are exposed, and the performance of computations on historical data drops significantly.

The detailed configuration of DolphinDB and Elasticsearch, the test code for both systems, and the data import scripts can be found in the appendix.

