Time-Series Database Partitioning Tutorial (2)

1. Partitioning principles

The general principle of partitioning is to make data management more efficient and to improve query and computation performance, achieving low latency and high throughput.

1.1 Choose an appropriate partition field

The partition field in DolphinDB can be of integer, date, or SYMBOL type. Note that STRING, FLOAT, and DOUBLE types cannot be used as partition fields.

Although DolphinDB supports partitioning on TIME, SECOND, and DATETIME fields, they should be used with caution, and value partitioning on them should be avoided; otherwise the partition granularity becomes too fine, and a great deal of time is spent creating or querying hundreds of millions of partition directories that each contain only a few records.

The partition field should be important to the business. For example, in securities trading, many tasks are related to trading dates or stock codes, so it is reasonable to partition on these two fields.


1.2 The partition granularity should not be too large

The maximum number of records a single DolphinDB partition can hold is 2 billion, but a reasonable record count should be far smaller than this. The columns of a partition are stored on disk as separate files, and the data is usually compressed. When used, the system reads the required columns from disk, decompresses them, and loads them into memory. If the partition granularity is too large, memory may be insufficient when multiple worker threads run in parallel, or the system may frequently swap data between disk and working memory, which hurts performance. As a rule of thumb, if the available memory of a data node is S and the number of workers is W, the decompressed size of each partition in memory should not exceed S/8W. For example, with 32GB of working memory and 8 worker threads, the decompressed size of a single partition should not exceed 512MB.

DolphinDB's subtasks are based on partitions. If the partition granularity is too large, the system cannot effectively exploit multiple nodes and multiple partitions, and tasks that could be computed in parallel degrade into sequential computation.

DolphinDB is optimized for OLAP scenarios: it supports appending data but does not support deleting or updating individual rows. To modify data, you must overwrite all the data in a partition, so an oversized partition reduces efficiency. DolphinDB also replicates data between nodes in units of partitions; partitions that are too large hinder data replication between nodes.

Weighing these factors, it is recommended to keep the uncompressed size of a partition's raw data between 100MB and 1GB. This figure can of course be adjusted to the actual situation. For example, big data applications often use wide-table designs in which one table contains several hundred fields but a single application uses only a subset of them; in such cases the upper limit can be raised appropriately.

If the partition granularity turns out to be too large, several remedies are available: (1) use a composite (COMPO) partition; (2) increase the number of partitions; (3) change range partitions into value partitions.
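As a hedged illustration of remedies (1) and (3), the sketch below replaces a hypothetical database partitioned only by coarse yearly date ranges with a composite partition of daily value partitions plus stock-code range partitions. The database paths and boundary dates are assumptions made up for illustration, not part of the original example.

// Before (hypothetical): one coarse RANGE partition per year, which can easily exceed the recommended partition size
coarseDB = database("dfs://quotesCoarse", RANGE, 2015.01.01 2016.01.01 2017.01.01 2018.01.01)

// After (hypothetical): a COMPO partition combining daily VALUE partitions with stock-code RANGE partitions
dateDomain = database("", VALUE, 2015.01.01..2017.12.31)
symDomain = database("", RANGE, string('A'..'Z') join `ZZZZZ)
fineDB = database("dfs://quotesFine", COMPO, [dateDomain, symDomain])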


1.3 The partition granularity should not be too small

If the partition granularity is too small, a query or computation job will often generate a large number of subtasks, which increases the communication and scheduling cost between data nodes and the control node, as well as within the control node. Overly fine partitioning also causes many inefficient disk accesses (reading and writing small files), overloading the system. In addition, the metadata of all partitions resides in the memory of the control node; if the partitions are too fine and too numerous, the control node may run out of memory. We recommend that the uncompressed data volume of each partition be no less than 100MB.

For example, if high-frequency stock trading data is partitioned by trading date and by stock-code value, many extremely small partitions will result, because the trading volume of many inactive stocks is tiny. If instead the stock-code dimension uses range partitioning, so that multiple inactive stocks are grouped into one partition, the problem of overly fine partition granularity is effectively solved and system performance improves.

2. How to partition the data evenly

When the data volumes of partitions differ greatly, the system load becomes unbalanced: some nodes are overloaded while others sit idle waiting. When a task consists of multiple subtasks, the result is returned to the user only after the last subtask completes. Since each subtask corresponds to a partition, unevenly distributed data can increase job latency and hurt the user experience.

To make it easy to partition according to the distribution of the data, DolphinDB provides the very useful function cutPoints(X, N, [freq]). X is a vector of data; N is the number of groups to produce; freq is an optional vector of the same length as X in which each element gives the frequency of the corresponding element of X. The function returns a vector of N+1 elements such that the data in X is distributed evenly across the N groups.

In the following example, stock quote data needs to be partitioned along the two dimensions of date and stock code. Simply partitioning the stock codes into ranges by their first letter easily produces an uneven distribution, because very few stock codes start with letters such as U, V, X, Y, or Z. It is recommended to use the cutPoints function to derive the partition boundaries from sample data.

// Import the data of 2007.08.01
t = ploadText(WORK_DIR+"/TAQ20070801.csv")

// Compute the distribution of stock codes in the data of 2007.08.01 to derive the grouping rule
t=select count(*) as ct from t where date=2007.08.01 group by symbol

// Produce 128 intervals over the stock codes in alphabetical order, each containing a roughly equal number of rows on 2007.08.01
buckets = cutPoints(t.symbol, 128, t.ct)

// The upper boundary of the last interval is determined by the data of 2007.08.01. To accommodate stock codes listed after 2007.08.01, replace it with a stock code larger than any that will ever appear.
buckets[size(buckets)-1] = `ZZZZZ

// The resulting buckets are as follows:
// ["A","ABA","ACEC","ADP","AFN","AII","ALTU","AMK",...,"XEL","XLG","XLPRACL","XOMA","ZZZZZ"]

dateDomain = database("", VALUE, 2017.07.01..2018.06.30)
symDomain = database("", RANGE, buckets)
stockDB = database("dfs://stockDBTest", COMPO, [dateDomain, symDomain])

In addition to range partitioning, list partitioning is also an effective way to deal with uneven data distribution.
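As a minimal hedged sketch of this idea, a LIST partition lets you assign explicitly chosen groups of stock codes to partitions, for example bundling several inactive symbols into one partition. The database path and the stock groupings below are made up for illustration only.

// Each element of the list defines one partition; the groupings are hypothetical
symListDomain = database("", LIST, [`ACTV1`ACTV2, `ACTV3, `INAC1`INAC2`INAC3`INAC4])
dateDomain = database("", VALUE, 2018.05.01..2018.07.01)
listDB = database("dfs://stockListDB", COMPO, [dateDomain, symListDomain])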

3. Partitioning on time types

Time is the most common dimension in real-world data, and DolphinDB provides a rich set of temporal types to meet user needs. When using a time-type field as the partition field, reserve enough room in the partition scheme to accommodate future data. In the following example, we create a database with one value partition per day from 2000.01.01 to 2030.01.01. Note that a partition is actually created only when data falling into it is written to the database.

dateDB = database("dfs://testDate", VALUE, 2000.01.01 .. 2030.01.01)

DolphinDB has another advantage when a time type is used as the partition field: the partition field type defined for the database and the time type actually used in the table may differ, as long as the precision of the defined partition type is no finer than that of the actual data type. For example, if the database is partitioned by month, the table column can be of type month, date, datetime, timestamp, or nanotimestamp; the system converts the data type automatically.
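For instance, the following hedged sketch (the database path and table schema are invented for illustration) creates a database partitioned by month while the table stores a DATE column; each date is mapped to its month partition automatically.

// Monthly value partitions; the table's partitioning column is of type DATE
monthDB = database("dfs://testMonth", VALUE, 2000.01M..2030.12M)
tickSchema = table(10:0, `sym`date`price, [SYMBOL, DATE, DOUBLE])
monthDB.createPartitionedTable(tickSchema, "ticks", `date)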

4. Data in the same partition of different tables is stored on the same node

In a distributed database, joining tables that span multiple partitions is usually expensive, because the partitions involved may reside on different nodes and data must be copied between them. To solve this problem, DolphinDB introduces a co-location mechanism: within the same distributed database, the data of all tables belonging to the same partition is guaranteed to be stored on the same node. This arrangement makes joins between these tables very efficient. The current version of DolphinDB does not provide join functions for partitioned tables that use different partitioning schemes.

dateDomain = database("", VALUE, 2018.05.01..2018.07.01)
symDomain = database("", RANGE, string('A'..'Z') join `ZZZZZ)
stockDB = database("dfs://stockDB", COMPO, [dateDomain, symDomain])

quoteSchema = table(10:0, `sym`date`time`bid`bidSize`ask`askSize, [SYMBOL,DATE,TIME,DOUBLE,INT,DOUBLE,INT])
stockDB.createPartitionedTable(quoteSchema, "quotes", `date`sym)

tradeSchema = table(10:0, `sym`date`time`price`vol, [SYMBOL,DATE,TIME,DOUBLE,INT])
stockDB.createPartitionedTable(tradeSchema, "trades", `date`sym)

In the above example, the two partitioned tables quotes and trades use the same partitioning scheme.
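As a hedged usage sketch of why co-location matters, the two tables can be joined without cross-node data transfer, for example with an asof join of trades against quotes. The choice of join function and join columns below is illustrative, not prescribed by the original example.

// Load the two co-located distributed tables
quotes = loadTable("dfs://stockDB", "quotes")
trades = loadTable("dfs://stockDB", "trades")
// For each trade, pick the most recent quote at or before the trade time;
// matching partitions of the two tables reside on the same node, so no data is shipped across nodes
t = select * from aj(trades, quotes, `date`sym`time)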

DolphinDB is designed for OLAP. It focuses on fast storage and computation over massive structured data and achieves high-performance processing through its in-memory database and stream processing; it is not suitable for OLTP business systems with frequent data changes. Like Hadoop HDFS, DolphinDB writes data by quickly appending batches at the end of each partition or file. The inserted data is compressed and stored on disk, with a typical compression ratio of 20%~25%. Once data has been appended to a disk-based table, individual records matching a condition cannot be quickly updated or deleted; the table must be modified at partition granularity. This is another reason why a single partition should not be too large.
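The sketch below illustrates this write pattern under the stockDB/quotes setup from the example above. The in-memory table is fabricated for illustration, and the commented-out partition path passed to dropPartition is an assumption whose exact format depends on the partition scheme.

// Batch-append new rows to the distributed table (the fast path)
quotes = loadTable("dfs://stockDB", "quotes")
newData = table(take(`ABC, 100) as sym, take(2018.05.02, 100) as date, take(09:30:00.000, 100) as time, rand(10.0, 100) as bid, rand(100, 100) as bidSize, rand(10.0, 100) as ask, rand(100, 100) as askSize)
quotes.append!(newData)

// To modify existing data, rewrite whole partitions: drop the affected partition
// (the path below is hypothetical) and then append the corrected data.
// dropPartition(database("dfs://stockDB"), "/20180502", "quotes")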

5. Multi-copy mechanism

DolphinDB allows keeping multiple copies of each partition; the default number of copies is 2. The number of copies can be changed with the parameter dfsReplicationFactor of the control node.

There are two purposes for keeping redundant copies: (1) fault tolerance, so that the system can continue to provide services when a data node fails or disk data is corrupted; (2) load balancing under heavy concurrent access, which improves system throughput and reduces access latency.

DolphinDB uses a two-phase commit mechanism to guarantee strong consistency across the copies of the same partition when data is written on multiple nodes.

The parameter file controller.cfg of the control node contains a very important parameter, dfsReplicaReliabilityLevel, which determines whether multiple copies may reside on data nodes of the same physical server. In the development stage it is convenient to run multiple nodes on one machine and allow copies on the same physical server (dfsReplicaReliabilityLevel=0), but in production it must be set to 1; otherwise the copies cannot serve as fault-tolerant backups.

// The number of copies of each table partition or file block. The default value is 2.
dfsReplicationFactor=2

// Whether multiple copies may reside on the same physical server. Level 0: allowed; Level 1: not allowed. The default value is 0.
dfsReplicaReliabilityLevel=0

6. Transaction mechanism

For reads and writes of database tables based on disk (the distributed file system), DolphinDB supports transactions, guaranteeing atomicity, consistency, isolation, and durability. DolphinDB uses a multi-version mechanism to provide snapshot isolation. Under this isolation level, read and write operations do not block each other, which maximizes read performance for the data warehouse.

To optimize the performance of query, analysis, and computation in the data warehouse, DolphinDB imposes some restrictions on transactions:

First, a transaction may contain either writes or reads, but not both.

Second, a write transaction can span multiple partitions, but the same partition cannot be written concurrently by multiple writers. That is, if a partition is locked by transaction A and another transaction B tries to lock the same partition, the system immediately throws an exception, and transaction B fails and rolls back.

7. Parallel writing by multiple writers

DolphinDB provides a powerful partitioning mechanism: a single table can have millions of partitions, which creates the conditions for high-performance parallel data loading. Parallel loading is especially important when importing massive data from other systems into DolphinDB, or when real-time data must be written to the data warehouse in near real time.

The following example loads stock quote data (quotes) into the database stockDB in parallel. stockDB uses a composite partition on date and stock code. The data is stored in csv files, one file per day of quote data.

// Create the database and table
dateDomain = database("", VALUE, 2018.05.01..2018.07.01)
symDomain = database("", RANGE, string('A'..'Z') join `ZZZZZ)
stockDB = database("dfs://stockDB", COMPO, [dateDomain, symDomain])
quoteSchema = table(10:0, `sym`date`time`bid`bidSize`ask`askSize, [SYMBOL,DATE,TIME,DOUBLE,INT,DOUBLE,INT])
stockDB.createPartitionedTable(quoteSchema, "quotes", `date`sym)

def loadJob(){
	fileDir='/stockData'

    // Get the names of the data files under the path
	filenames = exec filename from files(fileDir)

	// Get a handle to the database
	db = database("dfs://stockDB")

	// For each file, derive a jobId from the file name, then use submitJob to submit a
	// background job that calls loadTextEx to load the data into the stockDB database.
	for(fname in filenames){
		jobId = fname.strReplace(".csv", "")
		submitJob(jobId,, loadTextEx{db, "quotes", `date`sym, fileDir+'/'+fname})
	}
}

// Send the loadJob task to every data node in the cluster via pnodeRun for parallel loading
pnodeRun(loadJob)

When multiple writers load data in parallel, make sure they do not write to the same partition at the same time; otherwise the transactions will fail. In the example above, each file stores one day's data and date is one of the partitioning columns of the quotes table, which guarantees that the loading jobs never produce overlapping transactions.


Welcome to visit the DolphinDB official website to download the trial version of DolphinDB.

