Performance Comparison Test Report of DolphinDB and Spark

1. Overview

Spark is a general-purpose big data parallel computing framework based on in-memory computing, with a variety of built-in components for batch processing, stream processing, machine learning, and graph processing. Hive is a Hadoop-based data warehouse that supports SQL-like queries and improves the ease of use of Hadoop. Spark is usually used together with Hive and Hadoop, and Hive's data partitioning makes it easy to manage and filter data and improves query efficiency.

DolphinDB is a high-performance distributed time series database written in C++. It has a built-in columnar in-memory engine with high throughput and low latency, and integrates a powerful programming language with syntax similar to Python and SQL, so complex computations can be performed directly inside the database. DolphinDB internally uses the Data Source abstraction to represent partitioned data; computing tasks such as SQL, machine learning, batch processing, and stream processing are all executed on Data Sources. A Data Source can be either a partition of the built-in database or external data. If the Data Source is a partition of the built-in database, most of the computation can be done locally, which greatly improves the efficiency of computation and queries.

This report compares the performance of DolphinDB, Spark accessing HDFS directly (Spark+Hadoop, hereinafter referred to as Spark), and Spark accessing HDFS through the Hive component (Spark+Hive+Hadoop, hereinafter referred to as Spark+Hive). The test covers data import, disk space usage, data queries, and multi-user concurrent queries. Through these comparative tests we can better understand the main factors that affect performance and the best application scenarios for each tool.

2. Environment configuration

2.1 Hardware configuration

This test uses two servers (machine 1 and machine 2) with identical configurations. The configuration parameters are as follows:

Host: DELL PowerEdge R730xd

CPU: Intel Xeon(R) CPU E5-2650 v4 (24 cores, 48 threads, 2.20 GHz)

Memory: 512 GB (32 GB × 16, 2666 MHz)

Hard disk: 17 TB HDD (1.7 TB × 10; 222 MB/s read, 210 MB/s write)

Network: 10 Gigabit Ethernet

OS: CentOS Linux release 7.6.1810 (Core)

2.2 Cluster configuration

The DolphinDB version tested is Linux v0.95. The controller node of the test cluster is deployed on machine 1, and three data nodes are deployed on each machine, for a total of six data nodes. Each data node is configured with 8 workers, 7 executors, and 24 GB of memory.

The Spark version tested is 2.3.3, running on Apache Hadoop 2.9.0. Hadoop and Spark are deployed in fully distributed mode; machine 1 is the master, and both machines 1 and 2 run slave nodes. The Hive version is 1.2.2, installed on both machines, with its metadata stored in a MySQL database on machine 1. Spark and Spark+Hive submit applications in client mode under Spark's standalone mode.

During the test, DolphinDB, Spark, and Spark+Hive all use 6 hard disks, and the total CPU and memory used are the same under different concurrency levels: 48 threads and 144 GB of memory. The resources used by Spark and Spark+Hive belong to specific applications only, and each application has 6 executors. Under multi-user concurrency, the resources available to a single Spark or Spark+Hive user decrease as the number of users increases. The resources used by each user under different concurrency levels are shown in Table 1.

Table 1. Resources used by a single user under different concurrency levels in Spark and Spark+Hive

485cc1bb8aace70b9f01acae9f09a7bc.png

3. Data set and database design

3.1 Data set

The test data set is the TAQ data set provided by the New York Stock Exchange (NYSE). It contains one month of Level 1 quote data for more than 8,000 stocks, from 2007.08.01 to 2007.08.31, including quote information such as trading time, ticker symbol, bid price, ask price, bid size, and ask size. The data set contains about 6.5 billion (6,561,693,704) quote records in total. Each CSV file holds the records of one trading day; there are 23 trading days in the month, and the 23 uncompressed CSV files total 277 GB.

Data source: https://www.nyse.com/market-data/historical

3.2 Database design

Table 2. TAQ data types in various systems.

989cad57de7916b265b6b9cc048ae9b2.png

In DolphinDB, the data is partitioned by a combination of the date and symbol columns. The first level is a VALUE partition on the date column, giving 23 partitions in total; the second level is a RANGE partition on the stock symbol column with 100 partitions, so each partition holds roughly 120 MB of data.

For Spark, the data is stored on HDFS as 23 directories corresponding to the 23 CSV files. Spark+Hive uses two-level partitioning: the first level is a static partition on the date column, and the second level is a dynamic partition on the stock symbol column.
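As a hedged sketch of how such a two-level partitioned Hive table might be populated through Spark (the report only lists the Hive DDL in Appendix 2; the paths, session settings, and column handling below are assumptions, not the scripts actually used in the test):

# Hypothetical sketch: load one day's CSV into the two-level partitioned Hive table.
# Paths and settings are assumptions; see Appendix 2 for the actual Hive DDL.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("taq-hive-import")
         .enableHiveSupport()                 # needed to write into Hive-managed tables
         .getOrCreate())

# The date level is a static partition; the symbol level is a dynamic partition.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

day = "2007-08-01"
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(f"hdfs:///taq/csv/taq_{day}.csv"))  # hypothetical HDFS path of one day's CSV
df.createOrReplaceTempView("taq_raw")

# Static partition on date, dynamic partition on symbol; the dynamic partition
# column must be the last column of the SELECT.
spark.sql(f"""
    INSERT INTO TABLE TAQ PARTITION (date='{day}', symbol)
    SELECT time, bid, ofr, bidsiz, ofrsiz, mode, ex, mmid, symbol
    FROM taq_raw
""")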

See the appendix for specific scripts.

4. Data import and query test

4.1 Data import test

The original data is evenly distributed across the 6 hard disks of the two servers so that all cluster resources can be fully utilized. DolphinDB imports the data in parallel by asynchronously submitting jobs to multiple nodes. Spark and Spark+Hive start 6 applications in parallel to read the data and store it in HDFS. The time each system takes to import the data is shown in Table 3, and the disk space occupied in each system is shown in Table 4. See the appendix for the data import scripts.
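For the plain Spark path (no Hive), a minimal sketch of one of the six import applications might look like the following; the local and HDFS paths are assumptions, and the spark-submit options actually used are listed in Appendix 3:

# Hypothetical sketch of one Spark import application: read daily CSV files from
# local disk and write one snappy-compressed Parquet directory per trading day.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("taq-parquet-import").getOrCreate()

days = ["20070801", "20070802"]                      # the subset of days handled by this application
for day in days:
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv(f"file:///data/disk1/TAQ{day}.csv"))  # hypothetical local CSV path
    (df.write
       .option("compression", "snappy")              # snappy is also Spark's Parquet default
       .parquet(f"hdfs:///taq/parquet/{day}"))       # hypothetical HDFS output directory

spark.stop()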

Table 3. Data import times for DolphinDB, Spark, and Spark+Hive

049d04ae72a28c87e3190307b69e796e.png

Table 4. Disk space occupied by data in DolphinDB, Spark, Spark+Hive

cdbd88e2d29bc620c066c84b9dfa1af8.png

DolphinDB's import performance is significantly better than Spark and Spark+Hive: about 4 times that of Spark and about 6 times that of Spark+Hive. DolphinDB is written in C++ with many internal optimizations and makes full use of disk I/O.

DolphinDB occupies about twice as much disk space as Spark and Spark+Hive. This is because Spark and Spark+Hive store the data on Hadoop in the Parquet format, and Parquet files written through Spark use snappy compression by default.

4.2 Data query test

To ensure fairness, each query statement is executed multiple times, and before each run the operating system's page cache, dentry and inode caches, and hard disk cache are cleared with Linux commands. DolphinDB's built-in cache is also cleared.
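The report does not list the exact commands; a hedged sketch of how the caches might be cleared between runs is shown below (root privileges are required, and the DolphinDB node address, credentials, and API usage are assumptions):

# Hypothetical helpers for clearing caches between query runs; not the exact
# commands used in the test.
import subprocess
import dolphindb as ddb

def drop_os_caches():
    # Flush dirty pages, then drop the page cache and dentry/inode caches.
    subprocess.run(["sync"], check=True)
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")

def clear_dolphindb_cache(host="node1", port=8848):    # hypothetical data node address
    s = ddb.session()
    s.connect(host, port, "admin", "123456")            # assumed credentials
    s.run("clearAllCache()")                            # clear DolphinDB's built-in cache
    s.close()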

The query statements in Table 5 cover most common query scenarios, including grouping, sorting, filter conditions, aggregation, point queries, and full-table scans, and are used to evaluate the performance of DolphinDB, Spark, and Spark+Hive under different numbers of users.

Table 5. DolphinDB, Spark, Spark+Hive query statements

ed6446529cd6a13be4b564ec8ef9abe6.png
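Since Table 5 is reproduced only as an image, the snippets below are purely hypothetical illustrations of the query categories it covers (point query, filtered aggregation with grouping and sorting, full-table aggregation); they are not the actual Q1 to Q7 statements:

# Hypothetical illustrations of the Table 5 query categories, NOT the actual Q1-Q7.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("taq-query-demo").enableHiveSupport().getOrCreate()

point_query = "SELECT * FROM TAQ WHERE date = '2007-08-01' AND symbol = 'IBM'"

grouped_aggregation = """
    SELECT symbol, avg(ofr - bid) AS spread
    FROM TAQ
    WHERE date = '2007-08-01' AND bid > 0 AND ofr > bid
    GROUP BY symbol
    ORDER BY spread DESC
"""

full_table_scan = "SELECT count(*) FROM TAQ"

for sql in (point_query, grouped_aggregation, full_table_scan):
    spark.sql(sql).show()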

4.2.1 DolphinDB and Spark single-user query test

The following are the results of the DolphinDB and Spark single-user queries. The reported time is the average of 8 runs.

Table 6. DolphinDB, Spark single-user query results

67ddd680339bbd0a70d36d8c93cc5871.png

It can be seen from the results that DolphinDB's query performance is about 200 times that of Spark reading HDFS directly. Queries Q1 to Q6 all use DolphinDB's partitioning columns as filter conditions, so DolphinDB only needs to load the data of the specified partitions and avoids a full table scan, while Spark has to scan the full table for Q1 to Q6, which consumes a lot of time. For query Q7, both DolphinDB and Spark scan the full table, but DolphinDB loads only the relevant columns rather than all columns, whereas Spark loads all of the data. Since the query run time is dominated by data loading, the performance gap for Q7 is not as large as for the previous queries.
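As a hedged illustration of partition pruning on the DolphinDB side (assuming the DolphinDB Python API and a hypothetical database path dfs://TAQ), a query that filters on the partitioning columns only touches the matching partitions:

# Hedged sketch: because date and symbol are the partitioning columns, the where
# clause below only loads the matching partitions instead of scanning the table.
# The node address, credentials, and database path are assumptions.
import dolphindb as ddb

s = ddb.session()
s.connect("node1", 8848, "admin", "123456")
result = s.run("""
    taq = loadTable("dfs://TAQ", "taq")
    select avg(ofr - bid) from taq where date = 2007.08.01, symbol = `IBM
""")
print(result)
s.close()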

4.2.2 DolphinDB and Spark+Hive single-user query test

Because DolphinDB's data is partitioned and predicates are pushed down during queries, its efficiency is significantly higher than Spark's. Here we have Spark access HDFS through the Hive component and compare the query performance of DolphinDB and Spark+Hive. The following are the single-user query results for DolphinDB and Spark+Hive. The reported time is the average of 8 runs.

Table 7. DolphinDB, Spark+Hive single-user query results

126746ffe11a9886d0fdb69e0e7d1833.png

The results show that DolphinDB's query performance is significantly better than Spark+Hive, by dozens of times. Compared with the results in Table 6, Spark+Hive is much faster than Spark, and DolphinDB's advantage shrinks accordingly. This is because Hive partitions the data: when a query condition contains a partition column, only part of the data is loaded, which filters the data and improves efficiency. When Q7 scans the entire table, Spark+Hive runs out of memory.

Both DolphinDB and Spark+Hive partition the data and can push predicates down when loading it, achieving data filtering, yet DolphinDB still queries faster. This is because Spark+Hive reads the data on HDFS through calls between different systems, so the data goes through serialization, network transmission, and deserialization, which is very time-consuming and hurts performance. Most of DolphinDB's computation is done locally, which reduces data transmission and is therefore more efficient.

4.2.3 Comparison of computing power between DolphinDB and Spark

The comparisons above measured DolphinDB's query performance against Spark and Spark+Hive respectively. Because data partitioning, data filtering, and data transmission during queries all affect Spark's performance, here we first load the data into memory and then perform the computations, comparing DolphinDB and Spark. Spark+Hive is omitted because Hive is only used to filter data when reading from HDFS, and the test data here is already in memory.

Table 8 lists the statements used to test computing power. Each test contains two statements: the first loads the data into memory, and the second computes on the in-memory data. DolphinDB caches the data automatically, while Spark creates a temporary table TmpTbl and caches it through its default caching mechanism.

Table 8. Statements used to compare the computing power of DolphinDB and Spark

5099c7831c255f31d6465ad4ca8181a6.png
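As a hedged illustration of the pattern described above on the Spark side (not the actual Table 8 statements), the first statement caches a filtered temporary table and the second computes only against the cached data:

# Hedged illustration of the "load into memory, then compute" pattern; NOT the
# actual Table 8 statements.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("taq-cache-demo").enableHiveSupport().getOrCreate()

# Statement 1: load the working set into memory as a cached temporary table.
spark.sql("""
    CACHE TABLE TmpTbl AS
    SELECT * FROM TAQ WHERE date = '2007-08-01'
""")

# Statement 2: compute against the in-memory table only.
spark.sql("""
    SELECT symbol, max(ofr) - min(bid) AS price_range
    FROM TmpTbl
    GROUP BY symbol
""").show()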

The following are the computing power test results for DolphinDB and Spark. The reported time is the average of 5 runs.

Table 9. DolphinDB and Spark computing power test results

afdc101929128061f0771b43a8a87395.png

Since the data is already in memory, the time Spark takes is greatly reduced compared with Table 6, but DolphinDB's computing power is still superior. DolphinDB is written in C++ and manages memory itself, which is more efficient than Spark's JVM-based memory management. In addition, DolphinDB has built-in, highly optimized algorithms that further improve computing performance.

DolphinDB's distributed computing uses partitions as the unit of work and computes directly on the specified in-memory data. Spark loads entire blocks from HDFS, and a block contains data with different symbol values; even though the data is cached, it still has to be filtered, which is why the Q1-to-Q2 ratio is larger. In addition, the broadcast variables used in Spark's computation are compressed, transmitted to the other executors, and then decompressed, which also affects performance.

4.2.4 Multi-user concurrent query

We use the query statements in Table 5 to test multi-user concurrent queries against DolphinDB, Spark, and Spark+Hive. The following are the test results; the reported time is the average of 8 runs.
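The report does not show how the concurrent load was generated; one hedged way to drive N concurrent users against DolphinDB from a test client is sketched below (for Spark and Spark+Hive the equivalent would be submitting N applications). The node address, credentials, and query text are assumptions:

# Hypothetical driver for N concurrent users, each with its own DolphinDB session.
import time
from concurrent.futures import ThreadPoolExecutor
import dolphindb as ddb

QUERY = 'select avg(ofr - bid) from loadTable("dfs://TAQ", "taq") where date = 2007.08.01'

def one_user(_):
    s = ddb.session()
    s.connect("node1", 8848, "admin", "123456")
    start = time.time()
    s.run(QUERY)
    elapsed = time.time() - start
    s.close()
    return elapsed

n_users = 8
with ThreadPoolExecutor(max_workers=n_users) as pool:
    times = list(pool.map(one_user, range(n_users)))
print(sum(times) / n_users)      # average query time across the concurrent users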

Table 10. DolphinDB, Spark, Spark+Hive multi-user concurrent query results

d7546d50da374bfafa96227bba8d1448.png

Figure 1. Comparison of DolphinDB and Spark multi-user query results

c0993b3f4be83d815128f56f0f60158c.png

Figure 2. Comparison of DolphinDB and Spark+Hive multi-user query results

2958ddf62a44f16e10672c97a78ec710.png

As can be seen from the results above, the query time of all three systems gradually increases as the number of concurrent users grows. At 8 concurrent users, Spark's performance drops significantly compared with lower concurrency levels, and executing Q7 kills Spark workers. Spark+Hive, like DolphinDB, remains basically stable under multi-user access, but it always throws an out-of-memory exception when executing Q7.

Spark+Hive uses the same query configuration as Spark. Because its data is partitioned and can be filtered, the amount of data read per query is relatively small, so it is more efficient than Spark, which scans all of the data.

DolphinDB's performance in concurrent queries is significantly better than that of Spark and Spark+Hive. The figures above show that, as the number of users grows, DolphinDB's advantage over Spark grows almost linearly, while its advantage over Spark+Hive remains basically unchanged, which reflects how important data partitioning is for filtering data during queries.

Under multi-user concurrency, DolphinDB shares data among users, whereas in Spark the data is loaded separately for each application. Therefore, with 8 concurrent users, Spark allocates fewer resources to each user and its performance drops significantly. DolphinDB's data sharing reduces resource consumption; with limited resources, more resources are left for the users' computations, which improves concurrency efficiency and allows more concurrent users.

5. Summary

In terms of data import, DolphinDB loads data in parallel, while Spark and Spark+Hive import data by running multiple applications simultaneously; DolphinDB's import speed is 4 to 6 times that of Spark and Spark+Hive. In terms of disk space, DolphinDB occupies about twice the space that Spark and Spark+Hive occupy on Hadoop, because Spark and Spark+Hive use snappy compression.

In terms of SQL queries, DolphinDB's advantages are even more obvious. They come mainly from four aspects: (1) localized computing, (2) partition filtering, (3) optimized in-memory computing, and (4) cross-session data sharing.

For single-user queries, DolphinDB is several to hundreds of times faster than Spark and dozens of times faster than Spark+Hive. When Spark reads HDFS, the call crosses system boundaries and involves data serialization, network transmission, and deserialization, which is time-consuming and resource-intensive, whereas most of DolphinDB's SQL queries are computed locally, greatly reducing data transmission and loading time. Spark+Hive is faster than Spark mainly because it only scans the data in the relevant partitions, achieving data filtering. After removing the factors of localization and partition filtering (that is, with all the data already in memory), DolphinDB's computing power is still several times that of Spark: its partition-based distributed computing is highly efficient, and its own memory management outperforms Spark's JVM-based memory management.

Under multi-user concurrency, Spark's efficiency gradually decreases as the number of users increases, and when querying large amounts of data, too many users cause workers to die. Spark+Hive is relatively stable under concurrency, but queries over too much data cause out-of-memory errors. DolphinDB shares data among users, which reduces the resources spent on loading data; its query speed is hundreds of times that of Spark and dozens of times that of Spark+Hive, and its performance advantage over Spark becomes more pronounced as the number of users increases. For partitioned queries, both Spark+Hive and DolphinDB show significantly better query performance.

Spark is an excellent general-purpose distributed computing engine that performs well in SQL queries, batch processing, stream processing, and machine learning. However, because a SQL query usually scans the data only once, compared with the hundreds of iterations required by machine learning, the advantage of in-memory computing cannot be fully realized. Therefore, we recommend using Spark for computationally intensive workloads such as machine learning.

During the test, we also found that DolphinDB is a very lightweight system: its cluster is simple and fast to set up, whereas installing and configuring a Spark+Hive+Hadoop cluster is quite complicated.

Appendix

Appendix 1. Data Preview

875a21dad95a9fd3ed425dd92c3595ab.png

Appendix 2. Hive Create Table Statement

CREATE TABLE IF NOT EXISTS TAQ (
    time TIMESTAMP, bid DOUBLE, ofr DOUBLE, bidsiz INT, ofrsiz INT,
    mode INT, ex TINYINT, mmid STRING)
PARTITIONED BY (date DATE, symbol STRING)
STORED AS PARQUET;

Appendix 3. Data Import Scripts and Configuration

DolphinDB data import script:

// fps1 and fps2 are vectors of the CSV file paths on machines 1 and 2 respectively; fps is a tuple containing fps1 and fps2.
// allSites1 and allSites2 are vectors of the data node aliases on machines 1 and 2 respectively; allSites is a tuple containing allSites1 and allSites2.
// FP_DB is the path of the distributed database, schema is a template table defining the column types,
// and buckets is the vector of symbol boundaries that defines the 100 range partitions.
DATE_RANGE=2007.07.01..2007.09.01
date_schema=database('', VALUE, DATE_RANGE)
symbol_schema=database('', RANGE, buckets)
db=database(FP_DB, COMPO, [date_schema, symbol_schema])
taq=db.createPartitionedTable(schema, `taq, `date`symbol)
for(i in 0..1){
    for(j in 0..(size(fps[i])-1)){
        // submit an asynchronous load job, cycling through the data nodes of the corresponding machine
        rpc(allSites[i][j % size(allSites[i])], submitJob, "loadData", "loadData", loadTextEx{database(FP_DB), "taq", `date`symbol, fps[i][j]})
    }
}

Spark and Spark+Hive data import configuration (spark-submit options for each of the six import applications):

--master local[8]
--executor-memory 24G

